Instant Search - A Hands-on Tutorial
ACM SIGIR 2016
Ganesh Venkataraman, Viet Ha-Thuc, Dhruv Arya and Abhimanyu Lad
LinkedIn Search
1
Where to find information
Code - https://github.com/linkedin/instantsearch-tutorial
Wiki - https://github.com/linkedin/instantsearch-tutorial/wiki
Slack - https://instantsearchtutorial.slack.com/
Slides - will be posted on SlideShare; we will update the wiki and tweet the link
Twitter - #instantsearchtutorial (twitter.com/search)
3
The Plot
● At the end of this tutorial, attendees should:
○ Understand the challenges/constraints faced while dealing with instant search (latency, tolerance to user errors, etc.)
○ Get a broad overview of the theoretical foundations behind:
■ Indexing
■ Query Processing
■ Ranking and Blending (including personalization)
○ Understand the open source options available to put together an ‘end-to-end’ instant search solution
○ Put together an end-to-end solution on their own (with some helper code)
4
What would graduation look like?
● Instant result solution built over stackoverflow data
● Built on open source tools (Elasticsearch, typeahead.js)
● Ability to experiment further to modify ranking/query construction
5
Agenda
● Terminology and Background
● Indexing & Retrieval
○ Instant Results
○ Query Autocomplete
● Ranking
● Hands-on tutorial with data from stackoverflow
○ Index and search posts from stackoverflow
○ Play around with ranking
7
Terminology - Navigational Query
● Queries where the information need can be satisfied by only one result/document
12
When to display instant results vs. query completion
● LinkedIn product decision
○ When the confidence level is high enough for a particular result, show the result
● What counts as ‘high enough’ can be application specific and not merely a function of score
14
Completing the query vs. instant results
● “lin” => a first-degree connection with lots of common connections, same company, etc. => show the instant result
● “link” => better off completing the query (even with possible suggestions for verticals)
15
Terminology - Blending
● Bringing results from different search verticals (news, web, answers, etc.)
16
Why Instant Search and why now?
● Natural evolution of search
● Users have gotten used to getting immediate feedback
● Mobile devices => need to type less
18
Agenda
● Terminology and Background
● Indexing & Retrieval
○ Instant Results
○ Query Autocomplete
● Ranking
● Hands-on tutorial with data from stackoverflow
○ Index and search xx posts from stackoverflow
○ Play around with ranking
19
Instant Search at Scale
● Constraints (example: LinkedIn people search)
○ Scale - ability to store and retrieve hundreds of millions/billions of documents via prefix
○ Fast - ability to return results quicker than typing speed
○ Resilience to user errors
○ Personalized
20
Instant Search via Inverted Index
● Scalable
● Ability to form complex boolean queries
● Open source availability (Lucene/Elasticsearch)
● Easy to add metadata (payloads, forward index)
21
The Search Index
Inverted Index: mapping from (search) terms to the list of documents they appear in
Forward Index: mapping from documents to metadata about them
22
Candidate selection
● Posting lists
○ “abraham” => {5, 7, 8, 23, 47, 101}
○ “lincoln” => {7, 23, 101, 151}
● Query = “abraham AND lincoln”
○ Retrieved set => {7, 23, 101}
24
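To make candidate selection concrete, here is a minimal Python sketch (not part of the tutorial code) of AND retrieval by intersecting two sorted posting lists:

```python
def intersect(p1, p2):
    """Two-pointer merge of two sorted posting lists (AND semantics)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Posting lists from the slide:
abraham = [5, 7, 8, 23, 47, 101]
lincoln = [7, 23, 101, 151]
print(intersect(abraham, lincoln))  # [7, 23, 101]
```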
Prefix indexing
● Instant search, query != ‘abraham’
● Queries = [‘a’, ‘ab’, … , ‘abraham’]
● Need to index each prefix
● Elasticsearch refers to this form of tokenization as ‘edge n-gram’
● Issues
○ Bigger index
○ Big posting lists for short prefixes => much higher number of documents retrieved
25
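A minimal sketch of what edge n-gram tokenization produces (Elasticsearch's edge_ngram filter does this at index time; this Python helper is only an illustration):

```python
def edge_ngrams(term, min_gram=1):
    """Emit every prefix of a term: 'abraham' -> 'a', 'ab', ..., 'abraham'."""
    return [term[:i] for i in range(min_gram, len(term) + 1)]

print(edge_ngrams("abraham"))
# ['a', 'ab', 'abr', 'abra', 'abrah', 'abraha', 'abraham']
```

Every prefix becomes its own indexed term, which is exactly why the index grows and short prefixes accumulate very long posting lists.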
Early Termination
● We cannot ‘afford’ to retrieve and score all documents that match the query
● We terminate posting list traversal once a certain number of documents have been retrieved
● We may miss out on recall
26
Static Rank
● Order the posting lists so that documents with a high (query-independent) prior probability of relevance appear first
● Use application-specific logic to rewrite the query
● Once the query has achieved a certain number of matches in the posting list, we stop. This number of matches is referred to as the “early termination limit”
27
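A minimal sketch combining static rank with early termination, assuming the posting list is already ordered by static rank (the function names and the limit are illustrative):

```python
def retrieve_with_early_termination(posting_list, matches, limit):
    """Traverse a static-rank-ordered posting list and stop once the
    early termination limit is reached (trading recall for latency)."""
    hits = []
    for doc_id in posting_list:  # documents with the best static rank first
        if matches(doc_id):
            hits.append(doc_id)
            if len(hits) >= limit:
                break
    return hits

docs_by_static_rank = [42, 7, 101, 23, 5]
print(retrieve_with_early_termination(docs_by_static_rank,
                                      matches=lambda d: d != 7,
                                      limit=3))  # [42, 101, 23]
```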
Static Rank Example - People Search at LinkedIn
● Some factors that go into static rank computation
○ Member popularity, measured by profile views both within and outside the network
○ Spam in a person’s name
○ Security and spam: downgrade profiles flagged by LinkedIn’s internal security team
○ Celebrities and Influencers
28
Resilience to Spelling Errors
● We focus on names as they can often be hard to get right (ex: “marissa mayer” or “marissa meyer”?)
● Names vs. traditional spelling errors:
○ “program manager” vs. “program manger” - only one of these is right
○ “Mayer” vs. “Meyer” - no clear source of truth
● Edit distance based approaches can be wrong both ways:
○ “Mohamad” and “Muhammed” are 3 edits apart and yet plausible variants
○ “Jeff” and “Joff” are 1 edit apart, but highly unlikely to be plausible variants of the same name
30
Name Clusters - Two-Step Clustering
● Coarse-level clustering
○ Uses double metaphone + some known heuristics
○ Focus on recall
● Fine-level clustering
○ Similarity function that takes Jaro-Winkler distance into account
○ User session data
32
Overall Approach for Name Clusters
● Indexing
○ Store the clusterID for each cluster in a separate field (say ‘NAMECLUSTERID’)
○ ‘Cris’ and ‘chris’ in the same name cluster CHRISID
○ NAME:cris NAMECLUSTERID:chris
● Query processing
○ User query = ‘chris’
○ Rewritten query = ?NAME:chris ?NAMECLUSTERID:chris
● Ranking
○ Different weights for ‘perfect match’ vs. ‘name cluster match’
33
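As a sketch, the rewrite above could be expressed as an Elasticsearch-style bool query; the field names come from the slide, while the boost value is an illustrative assumption:

```python
def rewrite_name_query(user_query, exact_boost=2.0):
    """Match either the raw name or its name cluster (the '?' clauses
    become optional 'should' clauses), weighting the perfect match higher."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"term": {"NAME": {"value": user_query,
                                       "boost": exact_boost}}},
                    {"term": {"NAMECLUSTERID": {"value": user_query}}},
                ]
            }
        }
    }

# 'chris' retrieves exact name matches and cluster-mates such as 'cris'.
print(rewrite_name_query("chris"))
```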
Instant Results via Inverted Index - Some Takeaways
● Used for documents at very high scale
● Use early termination
● Approach the problem as a combination of indexing, query processing and ranking
34
Agenda
● Terminology and Background
● Indexing & Retrieval
○ Instant Results
○ Query Autocomplete
● Ranking
● Hands-on tutorial with data from stackoverflow
○ Index and search xx posts from stackoverflow
○ Play around with ranking
35
Query Autocomplete - Problem Statement
● Let q = w1 w2 … wk* represent the query with k words, where the kth token is a prefix, as denoted by the asterisk
● Goal: Find one or more relevant completions for the query
36
Trie
● Used to store an associative array where keys are strings
● Only certain keys and leaves are of interest
● Structure allows sharing of prefixes only
● Representation is not memory efficient
[Figure: a trie of the words {space, spark, moth}]
37
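A minimal trie sketch over the words from the figure, supporting insertion and prefix completion:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_key = False  # marks the end of an inserted key

def insert(root, key):
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.is_key = True

def complete(root, prefix):
    """Return every inserted key that starts with `prefix`."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    out = []
    def walk(n, path):
        if n.is_key:
            out.append(prefix + path)
        for ch, child in n.children.items():
            walk(child, path + ch)
    walk(node, "")
    return out

root = TrieNode()
for word in ["space", "spark", "moth"]:
    insert(root, word)
print(complete(root, "sp"))  # ['space', 'spark']
```

Note how 'space' and 'spark' share only their 'sp' prefix; an FST (next slide) can additionally share suffixes, which is where the better compression comes from.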
Finite State Transducers (FST)
● Allow efficient retrieval of completions at runtime
● Can fit entirely into RAM
● Useful when keys have commonalities, allowing better compression
● Lucene has support for FSTs*
[Figure: FST for the words software, scala, scalding, spark]
*Lucene’s FST implementation is based on “Direct Construction of Minimal Acyclic Subsequential Transducers” (2001) by Stoyan Mihov and Denis Maurel
38
Query Autocomplete vs. Instant Results
● For query autocomplete the corpus of terms remains relatively constant; for instant results, documents can be continuously added/removed
● Query autocomplete focuses only on prefix-based retrieval, whereas instant results utilize complex query construction for retrieval
● Query autocomplete retrieval is based off a dictionary, hence the index can be refreshed periodically instead of in real time
39
Query Tagging
● Segment the query based on recognized entities
● Annotate the query with:
○ Named Entity Tags
○ Standardized Identifiers
○ Related Entities
○ Additional Entity-Specific Metadata
40
Data Processing
● Break queries into recognized entities and individual tokens
● Past query logs are parsed for recognized entities and tokens, which are fed into an FST for retrieval of candidate suggestions
41
Retrieval
● Candidate completions are looked up over increasingly longer suffixes of the query, to capture enough context
● Given a query like “linkedin sof*” we look up completions for:
○ sof*, linkedin sof*
● Candidates are then provided to the scoring phase
42
Retrieval
● From the above FST, for the query “linkedin sof*” we retrieve the candidates:
○ sof: [software developer, software engineer]
○ linkedin sof: []
43
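A minimal sketch of this suffix-based lookup, using a plain dict in place of the FST (the completion table is the illustrative one from the slide):

```python
# Stand-in for the FST: prefix -> known completions.
completions = {
    "sof": ["software developer", "software engineer"],
    "linkedin sof": [],
}

def candidate_completions(query):
    """Look up completions for increasingly longer suffixes of the query."""
    words = query.split()
    results = {}
    for i in range(len(words) - 1, -1, -1):
        suffix = " ".join(words[i:])
        results[suffix] = completions.get(suffix, [])
    return results

print(candidate_completions("linkedin sof"))
# {'sof': ['software developer', 'software engineer'], 'linkedin sof': []}
```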
Payloads
● Each query autocomplete result can have a payload associated with it
● A payload holds serialized data useful in scoring the autocomplete result
44
Fuzzy Matching
● Use a Levenshtein automaton constructed from a word and a maximum edit distance
● Based on the automaton and the letters input to it, we decide whether to continue or not
● Ex: a search for “dpark” (s/d being close on the keyboard) with edit distance 1 = [spark]
[Figure: an index of {space, spark, moth} represented as a trie]
46
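A Levenshtein automaton prunes trie paths incrementally as letters arrive; the brute-force sketch below only illustrates the end result, filtering a small index by edit distance:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

index = ["space", "spark", "moth"]
query = "dpark"
print([w for w in index if edit_distance(query, w) <= 1])  # ['spark']
```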
Agenda
● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking instant results
○ Ranking query suggestions
○ Blending
● Hands-on tutorial with data from stackoverflow
50
Ranking Challenge
● Short query prefixes
● Context beyond the query
○ Personalized context
○ Global context
■ Global popularity
■ Trending
51
Hand-Tuned vs. Machine-Learned Ranking
● Hard to manually tune with a very large number of features
● Challenging to personalize
● Learning to rank (LTR) allows leveraging large volumes of click data in an automated way
52
Agenda
● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking instant results
○ Ranking query suggestions
○ Blending
● Hands-on tutorial with data from stackoverflow
53
Features
● Social Affinity (personalized features)
○ Network distance between searcher and result
○ Connection Strength
■ Within the same company
■ Common connections
■ From the same school
56
Training Data
● Log-based
○ Personalized
○ Available in large quantity
● Position Bias
○ Top-K randomization
58
Learning to Rank
▪ Pointwise: Reduce ranking to binary classification
[Figure: documents from several queries labeled relevant (+) or non-relevant (-), pooled into a single binary training set]
59
Limitations
▪ Relevant documents associated with different queries are put into the same class
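A minimal pointwise sketch, assuming simple per-document feature vectors and click labels; the features and data are made up for illustration, and scikit-learn's LogisticRegression stands in for any binary classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data pooled across queries: one row per (query, document),
# features e.g. [text match score, static rank], label 1 = clicked.
X = np.array([[0.9, 0.7], [0.8, 0.2], [0.4, 0.9], [0.1, 0.3],
              [0.7, 0.6], [0.2, 0.1]])
y = np.array([1, 1, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# At serving time, rank a query's candidates by predicted click probability.
candidates = np.array([[0.5, 0.8], [0.3, 0.2], [0.9, 0.9]])
scores = model.predict_proba(candidates)[:, 1]
print(np.argsort(-scores))  # candidate indices, best first
```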
Learning to Rank
▪ Pairwise: Reduce ranking to classification of document pairs w.r.t. the same query
– {(Q1, A>B), (Q2, C>D), (Q3, E>F)}
62
Learning to Rank
▪ Pairwise
– Limitation: Does not differentiate inversions at top vs. bottom positions
64
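A minimal sketch of the pairwise reduction: each preference such as (Q1, A>B) becomes a classification example on the feature difference (the feature vectors are made up; any binary classifier could replace LogisticRegression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative per-query preferences as (winner_features, loser_features).
pairs = [
    (np.array([0.9, 0.7]), np.array([0.4, 0.3])),  # Q1: A > B
    (np.array([0.6, 0.8]), np.array([0.5, 0.1])),  # Q2: C > D
    (np.array([0.3, 0.9]), np.array([0.2, 0.4])),  # Q3: E > F
]

# Train on feature differences: label 1 for (winner - loser), 0 reversed.
X = np.array([w - l for w, l in pairs] + [l - w for w, l in pairs])
y = np.array([1] * len(pairs) + [0] * len(pairs))
model = LogisticRegression().fit(X, y)

# A document's score is the learned weight vector dotted with its features,
# so candidates within a query can be sorted by decision_function.
print(model.decision_function(np.array([[0.9, 0.7], [0.4, 0.3]])))
```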
Learning to Rank
▪ Listwise
– Directly operate on ranked lists
– Optimize a listwise objective function, e.g. IR metrics
▪ Mean Average Precision (MAP)
▪ Normalized Discounted Cumulative Gain (NDCG)
65
Agenda
● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking vertical results
○ Ranking query suggestions
○ Blending
● Hands-on tutorial with data from stackoverflow
66
Features
● Query Popularity
○ Candidate completion q = s1, s2 … sk
○ Likelihood that q is a query in the query corpus, estimated by an N-gram language model:
Pr(q) = Pr(s1, s2 … sk) = Pr(s1) · Pr(s2|s1) · … · Pr(sk|sk-1)
67
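Spelled out, the factorization on the slide is a bigram (first-order Markov) approximation of the query probability:

```latex
\Pr(q) = \Pr(s_1, s_2, \ldots, s_k)
       \approx \Pr(s_1) \prod_{i=2}^{k} \Pr(s_i \mid s_{i-1})
```

For an illustrative candidate completion q = “software engineer jobs”, this gives Pr(software) · Pr(engineer | software) · Pr(jobs | engineer), with each factor estimated from query-log counts.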
Features
● Time-sensitive popularity [Shokouhi et al., SIGIR ’12]
○ Trending queries
○ Periodic patterns
■ Weekend -> Disneyland
○ Time-series: forecasted frequencies
68
Agenda
● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking instant results
○ Ranking query suggestions
○ Blending
● Hands-on tutorial with data from stackoverflow
70
Blending Challenges
● Different verticals associate with different signals
○ People: network distance
○ Groups: time of the last edit
○ Query suggestion: edit distance
● Even common features may not be equally predictive across verticals
○ Popularity
○ Text similarity
● Scores might not be comparable across verticals
73
Approaches
● Separate binary classifiers
○ Pros
■ Handle vertical-specific features
■ Handle common features with different predictive powers
○ Cons
■ Need to calibrate the output scores of multiple classifiers
75
Approaches
● Learning-to-rank - Equal correlation assumption
○ Union the feature schemas and pad non-applicable features with zeros
○ Equal correlation assumption
[Figure: People documents carry features (f1, f2, f3) and Jobs documents carry (f1, f2, f4); under the union schema People become (f1, f2, f3, f4=0) and Jobs become (f1, f2, f3=0, f4), and both feed a single Model]
76
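A minimal sketch of the union-schema padding (the vertical and feature names follow the figure; everything else is illustrative):

```python
UNION_SCHEMA = ["f1", "f2", "f3", "f4"]

def to_union_schema(features):
    """Map a vertical-specific feature dict onto the union schema,
    padding non-applicable features with zeros."""
    return [features.get(name, 0.0) for name in UNION_SCHEMA]

people_doc = {"f1": 0.8, "f2": 0.5, "f3": 0.9}  # People vertical
jobs_doc = {"f1": 0.6, "f2": 0.4, "f4": 0.7}    # Jobs vertical

print(to_union_schema(people_doc))  # [0.8, 0.5, 0.9, 0.0]
print(to_union_schema(jobs_doc))    # [0.6, 0.4, 0.0, 0.7]
# Both vectors can now be scored by a single learning-to-rank model.
```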
Approaches
● Learning-to-rank - Equal correlation assumption
○ Pros
■ Handle vertical-specific features
■ Comparable output scores across verticals
○ Cons
■ Assumes common features are equally predictive of vertical relevance
77
Approaches
● Learning-to-rank - Without equal correlation assumption
[Figure: People documents carry features (f1, f2, f3) and Jobs documents carry (f4, f5, f6); each vertical gets its own slots in the combined schema, so People become (f1, f2, f3, 0, 0, 0) and Jobs become (0, 0, 0, f4, f5, f6), and both feed a single Model]
78
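A minimal sketch of the per-vertical slots shown above (feature names follow the figure; a common feature such as popularity would appear once in each vertical's block, which is how the model learns a separate weight per evidence-vertical pair):

```python
VERTICAL_SLOTS = {"People": ["f1", "f2", "f3"], "Jobs": ["f4", "f5", "f6"]}
COMBINED_SCHEMA = [f for names in VERTICAL_SLOTS.values() for f in names]

def to_combined_schema(vertical, features):
    """Place a document's features in its vertical's own slots and
    zero out every other vertical's slots."""
    own = set(VERTICAL_SLOTS[vertical])
    return [features.get(name, 0.0) if name in own else 0.0
            for name in COMBINED_SCHEMA]

print(to_combined_schema("People", {"f1": 0.8, "f2": 0.5, "f3": 0.9}))
# [0.8, 0.5, 0.9, 0.0, 0.0, 0.0]
print(to_combined_schema("Jobs", {"f4": 0.6, "f5": 0.4, "f6": 0.7}))
# [0.0, 0.0, 0.0, 0.6, 0.4, 0.7]
```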
Approaches
● Learning-to-rank - Without equal correlation assumption
○ Pros
■ Handle vertical-specific features
■ No equal correlation assumption -> auto-learn evidence-vertical associations
■ Comparable output scores across verticals
○ Cons
■ The number of features is huge
● Overfitting
● Requires a huge amount of training data
79
Evaluation
● “If you can’t measure it, you can’t improve it”
● Metrics
○ Successful search rate
○ Number of keystrokes per search: query length + clicked result rank
80
Take-Aways
● Speed
○ Instant results: early termination
○ Autocompletion: FST
● Tolerance to spelling errors
● Relevance: go beyond the query prefix
○ Personalized context
○ Global context
81
Agenda
● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking instant results
○ Ranking query suggestions
○ Blending
● Hands-on tutorial with data from stackoverflow
82
Dataset
● Posts and Tags from stackoverflow.com
● Posts are questions posted by users and contain the following attributes
○ Title
○ Score
● Tags help identify a suitable category for the post and contain the following attributes
○ Tag Name
○ Count
● Each post can have a maximum of five tags
83
Assignments
● Assignments are available on GitHub
● Each assignment builds a component of the end product
● Tests are provided at the end of each assignment for validation
● Finished files are available for reference (if needed)
● Raise your hand if you need help or have a question
89
Take-Aways
● The index should be used primarily for retrieval
● Data sources should be kept separate from the index
● Building an index is not instantaneous, hence have replicas in production
● Real-world indexes can seldom be stored in a single shard
92
Take-Aways
● Make incremental additions
● Allow relevance changes to be compared
● Document relevance changes
● Do side-by-side evaluations
94
Summary
● Theoretical understanding of indexing, retrieval and ranking for instant search results and query autocomplete
● Insights and learnings from linkedin.com case studies
● Working end-to-end implementation of query autocomplete and instant results with the stackoverflow.com dataset
97