Search explained T3DD15


Transcript of Search explained T3DD15

Page 1: Search explained T3DD15
Page 2: Search explained T3DD15

Search explained

Page 3: Search explained T3DD15

My name is Hans Höchtl

Technical director @ Onedrop Solutions

PHP, Java, Ruby Developer

Participation in TYPO3 Solr

Page 4: Search explained T3DD15

SELECT * FROM mytable WHERE field LIKE '%searchword%'

SELECT * FROM mytable WHERE field SOUNDS LIKE 'searchword'

Page 5: Search explained T3DD15

Whether a word appears in a text can be determined easily.

But is it relevant?

Page 6: Search explained T3DD15

Relevance is subjective and depends on the judgement of users.

We use „scoring“ to predict relevance.

Page 7: Search explained T3DD15

The score is computed by a function that is applied to our indexed documents and takes the search term as its input parameter.

Page 8: Search explained T3DD15

TF-IDF: Term frequency-inverse document frequency

BM25: Okapi BM25 (Best Matching)

DFR: Divergence from randomness

and many more

Page 9: Search explained T3DD15

All those scoring calculations should fulfill these two requirements:

1. Precision: Are the results relevant to the user?

2. Recall: Have we found all relevant content in the index?
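
The two requirements above can be expressed as simple ratios. The following is a minimal sketch, assuming we already know which documents a user would judge relevant; the function name `precision_recall` and the document ids are hypothetical and only serve as illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list and the set of relevant ids."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    # Precision: share of the returned results that are relevant.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: share of all relevant documents that were returned.
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Hypothetical example: 4 results returned, 3 of them relevant,
# 6 relevant documents exist in the index.
precision, recall = precision_recall([1, 2, 3, 4], [2, 3, 4, 5, 6, 7])
print(precision, recall)
```

Returning more results usually raises recall and lowers precision, which is why a scoring function has to balance both.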

Page 10: Search explained T3DD15

How to store documents for efficient computing of scoring?

Page 11: Search explained T3DD15

Vector Space Model (default in Solr, Elasticsearch)

Document: A vector of terms

Term: A „word“ inside a document

Each unique term is a dimension
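
As a rough illustration of "one dimension per unique term", the sketch below turns texts into term-count vectors. The whitespace tokenizer, the helper names and the two sample texts are assumptions for this example, not what Solr or Elasticsearch do internally.

```python
from collections import Counter


def tokenize(text):
    # Naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_vocabulary(texts):
    # Every unique term across all documents becomes one dimension.
    return sorted({term for text in texts for term in tokenize(text)})


def term_vector(text, vocabulary):
    # The document vector holds the count of each vocabulary term.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


texts = ["search engines rank documents", "documents contain terms terms"]
vocabulary = build_vocabulary(texts)
vectors = [term_vector(text, vocabulary) for text in texts]
print(vocabulary)
print(vectors)
```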

Page 12: Search explained T3DD15

Vector Space Model

The best match is the narrowest angle between query and document

Page 13: Search explained T3DD15

Document 1

„unique unique bag“

Document 2

„unique bag bag“

Query

„unique bag“

[Figure: the vectors v(d1), v(q) and v(d2) plotted on the axes „unique“ and „bag“]

Page 14: Search explained T3DD15

Calculating the cosine of the angle between the vectors is much cheaper (in CPU cycles) than calculating the angle itself.

Page 15: Search explained T3DD15

cos(θ) = (d2 * q) / (||d2|| * ||q||)

Where d2 * q is the intersection (dot product) of the document and the query vectors.

||d2|| and ||q|| are the norms (lengths) of the vectors d2 and q.

Page 16: Search explained T3DD15

A cosine value of zero means that the query and document vectors are orthogonal: they share no terms and there is no match.
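
The cosine can be sketched in a few lines. This example uses the two documents and the query from the earlier slide, with the axes ordered as (unique, bag); the function name `cosine` is an assumption for illustration.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two term vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)


# Axes: (unique, bag)
d1 = [2, 1]  # "unique unique bag"
d2 = [1, 2]  # "unique bag bag"
q = [1, 1]   # "unique bag"

print(cosine(d1, q), cosine(d2, q))
print(cosine([1, 0], [0, 1]))  # orthogonal vectors
```

With raw term counts both documents have the same angle to the query, which is one reason the terms get weighted in the next step.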

Page 17: Search explained T3DD15

TF-IDF

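In the vector space model (VSM), the weight of a term t in the vector of a document d is now:

w(t,d) = tf(t,d) * log(|D| / |{d' in D : t in d'}|)

tf(t,d): Term frequency, the number of times t occurs in d

log(|D| / |{d' in D : t in d'}|): Inverse document frequency, where |D| is the number of documents in the index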

Page 18: Search explained T3DD15

TF-IDF

Now we have everything together to calculate the similarity between documents using TF-IDF:
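
Putting the pieces together, a minimal sketch could look as follows. It assumes the textbook weighting tf * log(N / df) and a tiny hypothetical corpus; the helper names are made up for this example, and Lucene's actual formula differs in detail.

```python
import math
from collections import Counter


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf_vectors(texts):
    """Return vocabulary, idf per term and one TF-IDF vector per text."""
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({term for doc in tokenized for term in doc})
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log(len(tokenized) / df[term]) for term in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, idf, vectors


def weigh_query(query, vocab, idf):
    # The query is treated like a (very short) document.
    tf = Counter(query.lower().split())
    return [tf[term] * idf[term] for term in vocab]


corpus = ["red car", "blue car", "blue house blue door"]  # hypothetical
vocab, idf, doc_vectors = tfidf_vectors(corpus)
q = weigh_query("blue house", vocab, idf)
scores = [cosine(q, vector) for vector in doc_vectors]
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

The rare term "house" carries more weight than the common term "blue", so the third document ranks first even though the second one also contains "blue".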

Page 19: Search explained T3DD15

TF-IDF

PROs

- Simple model based on linear algebra

- Term weights not binary

- Allows computing a continuous degree of similarity between queries and documents

- Allows ranking of documents according to their possible relevance

- Allows partial matching

CONs

- Long documents have poor similarity values (small scalar product and large dimensionality)

- Search keywords must precisely match terms

- Missing semantic sensitivity

- Order of terms in document not taken into account

- Terms are usually not statistically independent (as this model assumes)

Page 20: Search explained T3DD15

TF-IDF - The Lucene way

Coord: Boosts documents that match more of the search terms (multiple words) => 3/4 vs 4/4

Norm: Length normalization boosts fields that are shorter
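
The two factors can be sketched as below. This assumes the formulas of Lucene's DefaultSimilarity (the similarity named in the explain output later in this deck): coord is the fraction of query terms that matched, and the length norm is the field boost divided by the square root of the number of terms in the field. The Python function names are hypothetical.

```python
import math


def coord(overlap, max_overlap):
    # Fraction of the query terms found in the document.
    return overlap / max_overlap


def length_norm(num_terms, field_boost=1.0):
    # Shorter fields get a higher norm and therefore a higher score.
    return field_boost / math.sqrt(num_terms)


print(coord(3, 4), coord(4, 4))          # 3 of 4 terms vs. 4 of 4 terms
print(length_norm(4), length_norm(100))  # short field vs. long field
```

Lucene stores the norm in a lossy compressed form in the index, so the fieldNorm values shown by the explain output are rounded.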

Page 21: Search explained T3DD15

TF-IDF - Multiple fields

TF-IDF expects a document to be just one field containing text. But in reality we have semi-structured documents containing fields like author, subtitle, etc.

Page 23: Search explained T3DD15

TF-IDF - Multiple fields

Solr Solution: DisMax Query Parser (Disjunction Max)

Searchterm: „my funny house“

- Documents matching query in field title

- Documents matching query in field subtitle

- Documents matching query in field content

TF-IDF is calculated for every field independently. The score of a document is the highest of its per-field scores.
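
A minimal sketch of this "maximum of the field scores" idea follows. The optional `tie` factor mirrors the tie breaker of Lucene's DisjunctionMaxQuery, which adds a fraction of the other field scores to the maximum; the function name and the field scores are hypothetical.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: best field wins, others count by `tie`."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie * others


# Hypothetical per-field TF-IDF scores for one document
scores = {"title": 1.2, "subtitle": 0.4, "content": 0.7}
print(dismax_score(scores))           # pure maximum
print(dismax_score(scores, tie=0.1))  # maximum plus 10% of the rest
```

With `tie=0.0` only the best field counts; a small tie value lets documents that match in several fields rank slightly higher.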

Page 24: Search explained T3DD15

Natural languages

Adjectives, Adverbs, Nouns, Verbs, Conjunctions, Prepositions, Predicates, Compounds, Plurals, Past tense, Declension, Semantics, etc.

Page 25: Search explained T3DD15

Language families

Indo-European languages

Sino-Tibetan languages

Page 26: Search explained T3DD15

TF-IDF Problem

Only exact term matches are considered a hit.

„Car“ is not the same term as „Cars“

Page 27: Search explained T3DD15

Handling human languages (Analyzers)

Tokenizers: Split a stream of characters into a series of tokens.

Filters: The generated tokens are passed through a series of filters that add, change or remove tokens.
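
To make the tokenizer and filter chain concrete, here is a toy analyzer. It is only a sketch: the stopword list and the naive plural filter are assumptions for illustration and much cruder than the stemmers shipped with Solr or Elasticsearch.

```python
import re

STOPWORDS = {"the", "a", "an", "of"}  # hypothetical, tiny stopword list


def tokenize(text):
    # Tokenizer: split the character stream into word tokens.
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    # Removes tokens.
    return [token for token in tokens if token not in STOPWORDS]


def plural_filter(tokens):
    # Changes tokens: naive "stemmer" that strips a trailing "s".
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]


def analyze(text, filters=(lowercase_filter, stop_filter, plural_filter)):
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("The Cars of Munich"))
print(analyze("car"))
```

After analysis, „Cars“ and „car“ end up as the same term, which addresses the exact-match problem of TF-IDF.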

Page 28: Search explained T3DD15

Index Analyzers vs. Query Analyzers

Index Analyzers: Perform their analysis chain on the token stream during indexing. The generated tokens will be indexed.

Query Analyzers: Perform their analysis chain on the entered search query during query execution. Otherwise the query would only hit on an exact match.

Beware of Synonyms!
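
The following sketch shows why the query has to pass through a matching analysis chain. It builds a tiny inverted index with a toy analyzer; all names, the analyzer and the sample documents are hypothetical.

```python
from collections import defaultdict


def analyze(text):
    # Toy analyzer: lowercase, split on whitespace, strip a trailing "s".
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t
            for t in text.lower().split()]


def build_index(docs, analyzer):
    # Index analyzer: the generated tokens are what ends up in the index.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyzer(text):
            index[token].add(doc_id)
    return index


def search(index, query, analyzer):
    # Query analyzer: the query tokens must look like the indexed tokens.
    result = None
    for token in analyzer(query):
        postings = index.get(token, set())
        result = postings if result is None else result & postings
    return result or set()


docs = {1: "Cars for sale", 2: "House for sale"}
index = build_index(docs, analyze)
print(search(index, "Cars", analyze))    # analyzed query finds document 1
print(search(index, "Cars", str.split))  # unanalyzed query finds nothing
```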

Page 29: Search explained T3DD15

Available analyzers

Solr (https://goo.gl/TXEjZK) Language best practices (https://goo.gl/11O2Qz)

Elasticsearch (https://goo.gl/QR1IYb) Language best practices (https://goo.gl/6FQt7A)

Page 30: Search explained T3DD15

FieldTypes

Solr and Elasticsearch use fieldTypes, assigned to fields, to define the analyzer chain that should be applied.

Page 31: Search explained T3DD15

Let’s take a look at the configuration of TYPO3 Solr and Neos Elasticsearch

Page 32: Search explained T3DD15

Let’s test the analyzer chain

Solr and Elasticsearch

Page 33: Search explained T3DD15

Display score calculation

Solr: /solr/core_de/select?q=test&debugQuery=1

Elasticsearch: /_explain instead of /_search

Page 34: Search explained T3DD15

Let’s take a look at the score explanation

0.51602894 = (MATCH) sum of:
  0.51602894 = (MATCH) max of:
    0.51602894 = (MATCH) weight(content:sony^40.0 in 5) [DefaultSimilarity], result of:
      0.51602894 = fieldWeight in 5, product of:
        2.0 = tf(freq=4.0), with freq of:
          4.0 = termFreq=4.0
        3.3025851 = idf(docFreq=4, maxDocs=50)
        0.078125 = fieldNorm(doc=5)
    0.16512926 = (MATCH) weight(keywords:sony^2.0 in 5) [DefaultSimilarity], result of:
      0.16512926 = score(doc=5,freq=1.0 = termFreq=1.0 ), product of:
        0.05 = queryWeight, product of:
          2.0 = boost
          3.3025851 = idf(docFreq=4, maxDocs=50)
          0.0075698276 = queryNorm
        3.3025851 = fieldWeight in 5, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          3.3025851 = idf(docFreq=4, maxDocs=50)
          1.0 = fieldNorm(doc=5)
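
The numbers in this explanation can be recomputed by hand. The sketch below assumes the formulas of Lucene's DefaultSimilarity, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the fieldNorm and queryNorm values are copied from the output above rather than derived, and the variable names are made up for this example.

```python
import math


def tf(freq):
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# content:sony -> fieldWeight = tf * idf * fieldNorm
content_score = tf(4.0) * idf(4, 50) * 0.078125

# keywords:sony^2.0 -> queryWeight * fieldWeight
query_weight = 2.0 * idf(4, 50) * 0.0075698276  # boost * idf * queryNorm
field_weight = tf(1.0) * idf(4, 50) * 1.0       # tf * idf * fieldNorm
keywords_score = query_weight * field_weight

# "max of": the best field decides the document score.
document_score = max(content_score, keywords_score)
print(content_score, keywords_score, document_score)
```

The results agree with the explain output up to rounding, since Lucene computes with 32-bit floats.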

Page 35: Search explained T3DD15

Product-Codes

„AS1134-B“

„131555813“

„EOS 500D“

„13 S24 36-G“

Page 36: Search explained T3DD15

Product-Codes

Index the code in multiple fields to have different analyzers and boost them from strict to fuzzy.

Make use of N-Grams, Edge N-Grams, WordDelimiter, Trim, etc.
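
To show what N-Grams and Edge N-Grams produce for a product code, here is a small sketch. The function names and the chosen gram sizes are assumptions; in Solr or Elasticsearch this work is done by the corresponding filters in the analyzer chain.

```python
def ngrams(token, min_n, max_n):
    # All substrings of length min_n..max_n, anywhere in the token.
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]


def edge_ngrams(token, min_n, max_n):
    # Only prefixes of the token: useful for "starts with" matching.
    return [token[:n] for n in range(min_n, min(max_n, len(token)) + 1)]


print(edge_ngrams("as1134-b", 2, 5))
print(ngrams("500d", 3, 4))
```

A strict field could keep the whole code as one token and receive a high boost, while an N-Gram field with a low boost catches partial input such as "113".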

Page 37: Search explained T3DD15

Use the knowledge you gain from your customers to improve your search, … like Google does.

Page 38: Search explained T3DD15

- Use Google Analytics during index time (preAddModifyDocuments hook)

- Use recency of news (boost function)

- Analyze the search behavior of your customers (popularity of pages)

- Track search result clicks
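
As an illustration of a recency boost, the sketch below models a reciprocal function of the form a / (m * x + b), which is how Solr's `recip` function query is defined. The constants (m = 3.16e-11, roughly one divided by the milliseconds in a year, and a = b = 1) are an assumption taken from a commonly used example; the Python names are hypothetical.

```python
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000


def recip(x, m, a, b):
    # Reciprocal function: a / (m * x + b)
    return a / (m * x + b)


def recency_boost(age_ms):
    # Boost of 1.0 for brand-new documents, about 0.5 after one year.
    return recip(age_ms, 3.16e-11, 1.0, 1.0)


print(recency_boost(0))
print(recency_boost(MS_PER_YEAR))
print(recency_boost(5 * MS_PER_YEAR))
```

Multiplying the text score with such a factor lets recent news outrank older articles with a similar textual match.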

Page 39: Search explained T3DD15

Some more interesting things

- Facets

- Spellchecking

- Phonetics

- Spatial

Page 41: Search explained T3DD15

Thank you

Mail: [email protected] or [email protected]: @hhoechtl

Blog: http://blog.1drop.de