De Bitmanager, 2016
You Know, for Search
Peter van der Weerd
Who am I?
• Peter van der Weerd
• Search specialist
• Self-employed at De Bitmanager
• Enormous span of control
Search
• Common sense:
Easy
Solved
Yeah, true…
• Install ES
• Fill it with some data
• And \o/: we can search
But…
• Are the users satisfied?
• Many people struggle with sub-optimal search results.
Search as a toolbox
• It consists of 1 or more(!) tools to find what you need
Searchbox
Faceting (intersecting)
Sorting
More like this
Not more like this (this is not what I mean)
Etc…
Search at Booking
• Destination based (city, region, airport, etc)
• Autocomplete: results in max 5 destinations, one query per keystroke
• Disambiguation: show a partitioned result that enables people to choose a destination
Autocomplete in action
Disambiguation in action
Scoring
• In general, Lucene scores like: tf * idf
• Tf = term frequency: the more matched terms, the more important
• Idf = inverse document frequency: the more matched documents for the term, the less important
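As a rough illustration of the tf * idf idea, the sketch below uses the textbook formula. It is not Lucene's exact implementation, which adds further factors such as length normalization. The function names and the tiny corpus are made up for this example.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: how often the term occurs in the document."""
    return doc_tokens.count(term)

def idf(term, corpus):
    """Inverse document frequency: rarer terms get a higher value."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical three-document corpus, already tokenized.
corpus = [
    ["the", "house"],
    ["the", "little", "house", "on", "the", "prairie"],
    ["the", "garden"],
]
```

In this corpus 'the' occurs in every document, so its idf (and therefore its tf * idf) is zero, while 'garden' outranks 'house'.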
Term frequency
• Used to give more importance to relatively frequently occurring terms.
• Scoring examples for ‘house’ (score decreases from top to bottom):
House
The house
The little house on the prairie
The little house on the prairie blah blah blah
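One simple way to read "relatively frequent" is the term count divided by the field length. The sketch below makes that assumption; Lucene's own length normalization is computed differently, and `relative_tf` is a name invented for this example.

```python
def relative_tf(term, title):
    """Share of the title's tokens that equal the term (0.0 for an empty title)."""
    tokens = title.lower().split()
    return tokens.count(term) / len(tokens) if tokens else 0.0

titles = [
    "House",
    "The house",
    "The little house on the prairie",
    "The little house on the prairie blah blah blah",
]
scores = [relative_tf("house", t) for t in titles]
```

The longer the title, the smaller the share taken by ‘house’, so the scores drop from the first title to the last.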
Inverse document frequency
• Prefers less frequent tokens.
• Useless on single-token queries: it is only used to score multiple tokens relative to each other
• Examples (score decreases from top to bottom):
house
little
on
the
Drawback of idf
• Another example (idf score decreases from top to bottom):
Pekela
Haarlem
Amsterdam
Paris
• Booking switched off idf, but could have used df instead…
When does idf work?
• Idf typically works for large text-like queries.
• The documents *must* be evenly distributed over shards (or use dfs_query_then_fetch)
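To see why the distribution matters: each shard computes idf from its own statistics unless dfs_query_then_fetch is used, which gathers global term statistics before scoring. The sketch below uses invented shard statistics and the textbook idf formula to show how far local values can drift from the global one.

```python
import math

def idf(df, num_docs):
    return math.log(num_docs / df)

# Hypothetical per-shard statistics for one term: (document frequency, shard size).
shards = {"shard_0": (900, 1000), "shard_1": (10, 1000)}

# What each shard would use when scoring with local statistics only.
local_idf = {name: idf(df, n) for name, (df, n) in shards.items()}

# What all shards would use when the statistics are combined first.
total_df = sum(df for df, _ in shards.values())
total_docs = sum(n for _, n in shards.values())
global_idf = idf(total_df, total_docs)
```

With these numbers the same term is nearly worthless on shard_0 and very valuable on shard_1, so hits from the two shards are not comparable.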
Is tf * idf enough?
• Well, no…
• What to deliver on a query for ‘Paris’?
The city (ehm, there are several cities named Paris)
Airports?
Hotels? Which one? There are 1000’s of them.
• Even worse: what to deliver for query ‘p’ or ‘pa’?
Record boost
• Based on
Popularity
From where booked
Language
o Same (doc language == site language)
o Local translations
o English
o Mismatch
+ or x?
• Boosts are implemented by adding
• Intuitive justification:
Language could be seen as yet another (implicit!) search term
Same for popularity: people are typically not searching for unpopular things
• Example (from an English site): amsterdam -> amsterdam english popular
But wait…
• How big should the record-boost be?
0..1? 100?
• Lucene scores might vary heavily, sometimes by more than 10x
• So let's take 10 as max record-boost
But now the record-boost might outweigh smaller scores
• Argggggg….
Score ranges
• Difficult to tinker with:
For instance, use a stemmed token with boost 0.5: house^1.0 vs houses^0.5
What if the Lucene score is more than 2 times higher than the stem itself?
• We are doing entity search vs text search
Different scorers
Querying for ‘house’:

| Title | Score: default | Score: BM25 | Score: custom |
|---|---|---|---|
| House | 1.22 | 0.77 | 1.20 |
| The house | 0.76 | 0.61 | 1.10 |
| The little house on the prairie | 0.46 | 0.39 | 1.05 |
Normalizing scores
• Goal: each term is scored around 1.0
Base score 1.0
Tf is normalized between 0 .. 0.2 and added to the base score
Idf is normalized between 0 .. 0.2 and added to the base score
Giving a score varying between 1 and 1.4 per term (sometimes we don’t use idf)
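A minimal sketch of this normalization, assuming tf and idf have already been scaled to the range 0..1 by some earlier step (how that scaling is done is not specified here). The function and parameter names are invented for the example.

```python
def clamp01(x):
    return max(0.0, min(1.0, x))

def term_score(tf_norm, idf_norm=None, base=1.0, weight=0.2):
    """Base score plus up to `weight` for tf and, optionally, up to `weight` for idf."""
    score = base + weight * clamp01(tf_norm)
    if idf_norm is not None:  # idf is sometimes switched off
        score += weight * clamp01(idf_norm)
    return score
```

For example, `term_score(0.0, 0.0)` gives the base score of 1.0 and `term_score(1.0, 1.0)` gives the maximum of about 1.4, so every matching term stays in a narrow, predictable band.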
Language boosting
• Same language or English: +0.7
• Local language: +0.3 (Roma vs Rome on an English site)
• Mismatched language: -0.3
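Because boosts are added rather than multiplied, the language boost can be sketched as a simple lookup. The boost values are the ones listed above; the category keys and the function name are assumptions made for this example.

```python
LANGUAGE_BOOST = {
    "same": 0.7,      # document language == site language
    "english": 0.7,
    "local": 0.3,     # e.g. 'Roma' on an English site
    "mismatch": -0.3,
}

def boosted_score(term_score, language_match):
    """Add the language boost to an already normalized term score."""
    return term_score + LANGUAGE_BOOST[language_match]
```

With term scores normalized to roughly 1.0 .. 1.4, a boost of this size can reorder results without drowning out the match itself.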
About N-grams
• For auto-complete: left-edge N-Grams
• Rome: rome, rom, ro, r
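Generating left-edge N-grams is straightforward. The helper below is a plain-Python sketch of the idea, not the analyzer configuration used in the actual index.

```python
def edge_ngrams(token, min_len=1):
    """Return every prefix of the lowercased token, shortest first."""
    token = token.lower()
    return [token[:n] for n in range(min_len, len(token) + 1)]

rome_grams = edge_ngrams("Rome")
```

Indexing all of these prefixes means that a query for ‘ro’ becomes an ordinary exact-term lookup.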
![Page 25: You know, for search](https://reader034.fdocuments.net/reader034/viewer/2022042619/58a1d9a71a28abb6678b5cb3/html5/thumbnails/25.jpg)
De Bitmanager, 2016
About N-grams
• When a user types ‘ro’ (score decreases from top to bottom):
Rome
Ródos
Rotterdam
Etc
• Score depends on percentage of match (or Levenshtein distance)
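The "percentage of match" can be sketched as the typed prefix length divided by the name length. This assumes accents are folded before comparing (otherwise ‘ro’ would not match Ródos); both helper names are invented for this example.

```python
import unicodedata

def fold(text):
    """Lowercase and strip accents, so that 'Ródos' becomes 'rodos'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def match_fraction(prefix, name):
    """Fraction of the name covered by the typed prefix, or 0.0 if it is no prefix."""
    prefix, name = fold(prefix), fold(name)
    return len(prefix) / len(name) if name.startswith(prefix) else 0.0

names = ["Rotterdam", "Ródos", "Rome"]
ranked = sorted(names, key=lambda n: match_fraction("ro", n), reverse=True)
```

Shorter names are covered for a larger part by the same two characters, which puts Rome ahead of Ródos and Rotterdam.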
Original approach
• Multiple fields (name, city, region, etc)
• Combining them by a weighted dismax query
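In Elasticsearch query DSL, a weighted dismax over several fields could look roughly like the request body built below. The field names come from the list above; the boost values and the tie_breaker are placeholders, not the weights that were actually used.

```python
def weighted_dismax(text, field_boosts, tie_breaker=0.1):
    """Build a search body with one boosted match query per field, combined by dis_max."""
    return {
        "query": {
            "dis_max": {
                "tie_breaker": tie_breaker,
                "queries": [
                    {"match": {field: {"query": text, "boost": boost}}}
                    for field, boost in field_boosts.items()
                ],
            }
        }
    }

# Hypothetical weights: the name counts most, the region least.
body = weighted_dismax("amsterdam", {"name": 3.0, "city": 2.0, "region": 1.0})
```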
Dismax query
• More subtle way of combining scores.
• Score = max + (sum - max) * tieBreaker
In words: the max plus a percentage of the others
• Edge cases:
Tiebreaker=0: score is the max score
Tiebreaker=1: score is the sum of all the individual scores (same behavior as boolean or)
Dismax example
• Q = the house. Suppose S[the] = 0.8, S[house] = 1.2
• Scores for different tiebreakers:
Bool score (tiebreaker=1): 2.0
Max score (tiebreaker=0): 1.2
Score with tiebreaker=0.1: 1.28. This makes documents containing ‘the house’ a little bit more important than ‘house’ only.
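The dismax formula and the three numbers above can be checked with a few lines of Python. This is a direct transcription of score = max + (sum - max) * tieBreaker, not Lucene's code.

```python
def dismax(scores, tie_breaker=0.0):
    """The best score plus a fraction (tie_breaker) of all the other scores."""
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

scores = [0.8, 1.2]  # S[the], S[house]

bool_score = dismax(scores, tie_breaker=1.0)   # about 2.0
max_score = dismax(scores, tie_breaker=0.0)    # 1.2
mixed_score = dismax(scores, tie_breaker=0.1)  # about 1.28
```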
Difficulties
• Lack of context
• Hard to create a reliable scoring model
Different approach
• Canonical name: Hotel V Frederiksplein, Amsterdam, Noord-Holland, Netherlands
• Self name (indexed)
Hotel V Frederiksplein
• Rest (indexed)
Amsterdam, Noord-Holland, Netherlands
Weighting fields
• All fields are equal but some fields are more equal than others…
Self name is most important
Other names (like the city where a hotel resides) are less important
• Dismax over self name and other
Payload
• Small piece of information that is added to every occurrence
• Basically a byte[]
Nowadays: payloads
• We need more information per occurrence of a token:
Length of the original token
Self-name or other location info
Type of the name (hotel, city, landmark, etc)
• All the above info is encoded in a 32-bit integer, and indexed as a payload
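How the three pieces of information are laid out inside the integer is not stated, so the bit layout below is purely hypothetical: 8 bits for the token length, 1 bit for the self-name flag, and the remaining bits for the name type. It only illustrates that such a payload fits in 4 bytes.

```python
import struct

# Hypothetical type codes; the real encoding is not given.
TYPE_CODES = {"hotel": 0, "city": 1, "landmark": 2}

def encode_payload(token_length, is_self_name, name_type):
    """Pack the per-occurrence info into a 32-bit integer and return it as 4 bytes."""
    value = (token_length & 0xFF) | (int(is_self_name) << 8) | (TYPE_CODES[name_type] << 9)
    return struct.pack(">I", value)

def decode_payload(payload):
    """Inverse of encode_payload: returns (token_length, is_self_name, name_type)."""
    (value,) = struct.unpack(">I", payload)
    types = {code: name for name, code in TYPE_CODES.items()}
    return value & 0xFF, bool((value >> 8) & 1), types[value >> 9]

payload = encode_payload(5, True, "hotel")
```

A scorer that reads this payload has the token length, the field role and the entity type available in one place for every matching occurrence.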
Dismax vs payload
• With field info in the payload we can simulate dismax behavior
• We query only 1 index-field (instead of 5)
• Context: easier to do advanced scoring: all info is in 1 scorer.
• Payloads *are* possible in ElasticSearch, but more difficult to use
Search
• Difficult
• Sensitive equilibrium
• Impossible to serve them all
Suits
• Reasons for people to wear a suit might include:
Hiding the fact that you cannot trust them
Hiding their incompetence
etc
Combining fields
• To prevent double counting, a dismax is advised.
• The fact that a term occurs in both the title and the abstract doesn’t make it twice as important.
But it does make it somewhat more important
Combining fields
• Intuitive reaction: query terms in each other's neighborhood are more important…
• Example: search for a book: chamber secrets rowling
• Expected top result: Harry Potter and the Chamber of Secrets / J.K. Rowling
Combining fields
"_score": 2.0767038, "author": "De Bitmanager", "title": "Excerpt book", "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121, "author": "J.K. Rowling", "title": "Harry Potter and the Chamber of Secrets", "abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle who haunts the girls' bathroom."
• More important if in the same field?
Combining fields
• But: we get an excerpt book that contains the requested
(all terms were present in the abstract field)
• Phrases behave even worse
Combining fields
• Suppose:
we have 2 fields: F1 and F2
2 query terms: qt1 and qt2
• Now we have choices how to combine…
Combining fields
• (F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
this will prefer records where both terms are found in the same field
• (F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
this behaves more as if there were no fields
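The difference between the two combinations can be sketched with made-up per-field, per-term scores. Here ‘|’ is treated as a boolean or that sums the scores, and dismax follows the formula score = max + (sum - max) * tieBreaker; all names and numbers are illustrative only.

```python
def dismax(scores, tie_breaker=0.1):
    best = max(scores)
    return best + (sum(scores) - best) * tie_breaker

def field_centric(s, tie_breaker=0.1):
    """(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)"""
    per_field = [sum(term_scores.values()) for term_scores in s.values()]
    return dismax(per_field, tie_breaker)

def term_centric(s, tie_breaker=0.1):
    """(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)"""
    terms = {t for term_scores in s.values() for t in term_scores}
    return sum(
        dismax([term_scores.get(t, 0.0) for term_scores in s.values()], tie_breaker)
        for t in terms
    )

# Hypothetical documents: both terms in F1, versus one term in each field.
same_field = {"F1": {"qt1": 1.0, "qt2": 1.0}, "F2": {}}
split_fields = {"F1": {"qt1": 1.0}, "F2": {"qt2": 1.0}}
```

With these inputs the field-centric form scores the same-field document 2.0 against 1.1 for the split one, while the term-centric form scores both 2.0.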
Combining fields
(F1:qt1 | F1:qt2) dismax (F2:qt1 | F2:qt2)
"_score": 2.0767038, "author": "De Bitmanager", "title": "Excerpt book", "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
"_score": 1.2030121, "author": "J.K. Rowling", "title": "Harry Potter and the Chamber of Secrets", "abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle who haunts the girls' bathroom."
Combining fields
(F1:qt1 dismax F2:qt1) | (F1:qt2 dismax F2:qt2)
"_score": 2.1447253, "author": "J.K. Rowling", "title": "Harry Potter and the Chamber of Secrets", "abstract": "Fresh torments and horrors arise, including an outrageously stuck-up new professor, Gilderoy Lockheart, and a spirit named Moaning Myrtle who haunts the girls' bathroom."
"_score": 2.0767038, "author": "De Bitmanager", "title": "Excerpt book", "abstract": "Contains: Harry Potter and the Chamber of Secrets by J.K. Rowling"
Combining fields
• Of course: way more possibilities.
See the multi-match query for examples
Most but not all possibilities can be done by hand (blending)
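As a starting point, the two combinations discussed earlier map roughly onto two multi_match types: best_fields is field-centric and cross_fields is term-centric. The helper below only builds the request bodies; the field names are taken from the book example and the tie_breaker value is a placeholder.

```python
def multi_match(text, fields, match_type="best_fields", tie_breaker=0.1):
    """Build an Elasticsearch search body with a multi_match query."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                "type": match_type,
                "tie_breaker": tie_breaker,
            }
        }
    }

fields = ["title", "author", "abstract"]
field_centric_body = multi_match("chamber secrets rowling", fields)
term_centric_body = multi_match("chamber secrets rowling", fields, match_type="cross_fields")
```

Running both bodies against the same index and comparing the rankings is a quick way to see which behavior suits the data.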
Combining fields
• Different strategy:
Combine all fields as if they were one field
Do some re-scoring afterwards
Example:
o Search ‘rowling’ anywhere, score 1
o Search ‘potter’ anywhere, score 1
o Combine with additional queries to do a finishing touch
Explain
• Always use explain (in debug mode)
• Did I already tell you to always use explain?
• Create a new application by first making explain part of your infrastructure
• At least expose the scores in debug mode.
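One way to make explain part of the infrastructure is to route every search through a single helper that switches it on in debug mode. The sketch below only builds the request body; "explain": true is the standard Elasticsearch search option, while the helper name and the debug flag are assumptions for this example.

```python
def search_body(query, debug=False):
    """Wrap a query in a search body, asking for score explanations in debug mode."""
    body = {"query": query}
    if debug:
        body["explain"] = True  # each hit then carries an _explanation section
    return body

debug_body = search_body({"match": {"title": "house"}}, debug=True)
```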
Suits: beware the logic rules…
• Cannot be reversed:
• The fact that I am not wearing a suit does not imply that:
I am trustworthy
I am competent
You Know, for Bits…
Peter @ bitmanager.nl