Instant search - A hands-on tutorial


Instant Search - A Hands-on Tutorial
ACM SIGIR 2016

Ganesh Venkataraman, Viet Ha-Thuc, Dhruv Arya and Abhimanyu Lad
LinkedIn Search

1

The Actors

2

Where to find information

Code - https://github.com/linkedin/instantsearch-tutorial

Wiki - https://github.com/linkedin/instantsearch-tutorial/wiki

Slack - https://instantsearchtutorial.slack.com/

Slides - will be posted on SlideShare; we will update the wiki and tweet the link

Twitter - #instantsearchtutorial (twitter.com/search)

3

The Plot

● At the end of this tutorial, attendees should:
○ Understand the challenges/constraints faced while dealing with instant search (latency, tolerance to user errors, etc.)
○ Get a broad overview of the theoretical foundations behind:
■ Indexing
■ Query Processing
■ Ranking and Blending (including personalization)
○ Understand open source options available to put together an 'end-to-end' instant search solution
○ Put together an end-to-end solution on their own (with some helper code)

4

What would graduation look like?

● Instant result solution built over stackoverflow data

● Built with open source tools (Elasticsearch, typeahead.js)

● Ability to experiment further to modify ranking/query construction

5

Final output from the hands-on tutorial

6

Agenda

● Terminology and Background
● Indexing & Retrieval
○ Instant Results
○ Query Autocomplete
● Ranking
● Hands-on tutorial with data from stackoverflow
○ Index and search posts from stackoverflow
○ Play around with ranking

7


Terminology - Query Autocomplete

● Intention is to complete the user query

9

Terminology - Instant Results

● Get the result to the user as they type the query

10

Terminology - Instant Answers

● We will NOT be covering answers in this tutorial

11

Terminology - Navigational Query

● Queries where the information need can be satisfied by only one result/document

12

Terminology - Exploratory Queries

● Multiple results can potentially satisfy the user's need

13

When to display instant results vs. query completion

● LinkedIn product decision
○ When the confidence level is high enough for a particular result, show the result
● What counts as 'high enough' can be application specific and not merely a function of score

14

Completing query vs. instant results

● "lin" => first-degree connection with lots of common connections, same company, etc.
● "link" => better off completing the query (even with possible suggestions for verticals)

15

Terminology - Blending

● Bringing results from different search verticals (news, web, answers, etc.)

16

Blending on prefix

17

Why Instant Search and why now?

● Natural evolution of search
● Users have gotten used to getting immediate feedback
● Mobile devices => need to type less

18

Agenda

● Terminology and Background
● Indexing & Retrieval
○ Instant Results
○ Query Autocomplete
● Ranking
● Hands-on tutorial with data from stackoverflow
○ Index and search posts from stackoverflow
○ Play around with ranking

19

Instant Search at Scale

● Constraints (example: LinkedIn people search)
○ Scale - ability to store and retrieve hundreds of millions/billions of documents via prefix
○ Fast - ability to return results quicker than typing speed
○ Resilience to user errors
○ Personalized

20

Instant Search via Inverted Index

● Scalable
● Ability to form complex boolean queries
● Open source availability (Lucene/Elasticsearch)
● Easy to add metadata (payloads, forward index)

21

The Search Index

Inverted Index: mapping from (search) terms to the list of documents they are present in
Forward Index: mapping from documents to metadata about them
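To make the two structures concrete, here is a minimal sketch (not the tutorial code; the documents and field names are illustrative) that builds both indexes in plain Python:

```python
from collections import defaultdict

docs = {
    5: {"title": "abraham lincoln biography", "score": 42},
    7: {"title": "abraham lincoln quotes", "score": 17},
    8: {"title": "abraham maslow", "score": 9},
}

inverted_index = defaultdict(set)   # term -> set of document ids containing it
forward_index = {}                  # doc id -> metadata about the document

for doc_id, doc in docs.items():
    for term in doc["title"].split():
        inverted_index[term].add(doc_id)
    forward_index[doc_id] = {"score": doc["score"]}

print(sorted(inverted_index["abraham"]))  # [5, 7, 8]
print(forward_index[7])                   # {'score': 17}
```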

22

The Posting List

23

Candidate selection

● Posting lists
○ "abraham" => {5, 7, 8, 23, 47, 101}
○ "lincoln" => {7, 23, 101, 151}
● Query = "abraham AND lincoln"
○ Retrieved set => {7, 23, 101}
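The AND retrieval above is just an intersection of sorted posting lists; a minimal sketch (illustrative only, using the lists from the slide):

```python
def intersect(p1, p2):
    """Intersect two sorted posting lists with a two-pointer walk."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

abraham = [5, 7, 8, 23, 47, 101]
lincoln = [7, 23, 101, 151]
print(intersect(abraham, lincoln))  # [7, 23, 101]
```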

24

Prefix indexing

● Instant search: query != 'abraham'
● Queries = ['a', 'ab', …, 'abraham']
● Need to index each prefix
● Elasticsearch refers to this form of tokenization as 'edge n-gram' (sketch below)
● Issues
○ Bigger index
○ Big posting list for short prefixes => much higher number of documents retrieved
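A minimal sketch of what edge n-gram tokenization produces for a term (plain Python, purely illustrative; in practice the prefixes are generated at index time by the tokenizer):

```python
def edge_ngrams(term, min_gram=1):
    """Return all prefixes ('edge n-grams') of a term, shortest first."""
    return [term[:i] for i in range(min_gram, len(term) + 1)]

print(edge_ngrams("abraham"))
# ['a', 'ab', 'abr', 'abra', 'abrah', 'abraha', 'abraham']
# Every prefix is indexed, so the query 'abr' matches the same document as
# the full term -- at the cost of a bigger index and long posting lists
# for short prefixes.
```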

25

Early Termination

● We cannot 'afford' to retrieve and score all documents that match the query
● We terminate posting list traversal when a certain number of documents have been retrieved
● We may miss out on recall

26

Static Rank

● Order the posting lists so that documents with a high (query-independent) prior probability of relevance appear first
● Use application-specific logic to rewrite the query
● Once the query has achieved a certain number of matches in the posting list, we stop. This number of matches is referred to as the 'early termination limit' (see the sketch below)
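A minimal sketch of early termination over a statically ranked posting list (illustrative only; real engines do this inside the posting-list iterator):

```python
def retrieve_with_early_termination(posting_list, matches, limit):
    """posting_list is ordered by static rank (best documents first).

    Traversal stops once `limit` matching documents have been collected, so
    low-static-rank matches further down the list may be missed -- the
    recall trade-off described above."""
    results = []
    for doc_id in posting_list:
        if doc_id in matches:
            results.append(doc_id)
            if len(results) >= limit:
                break
    return results

# Posting list already ordered by static rank; `matches` are the documents
# that satisfy the rewritten query.
print(retrieve_with_early_termination([42, 7, 23, 101, 3], {7, 101, 3}, limit=2))
# [7, 101]
```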

27

Static Rank Example - People Search at LinkedIn

● Some factors that go into static rank computation
○ Member popularity, measured by profile views both within and outside the network
○ Spam in a person's name
○ Security and spam: downgrade profiles flagged by LinkedIn's internal security team
○ Celebrities and influencers

28

Static Rank Case study - People Search at LinkedIn

29

[Chart: recall as a function of the early termination limit]

Resilience to Spelling Errors

● We focus on names as they can often be hard to get right (ex: "marissa mayer" or "marissa meyer"?)
● Names vs. traditional spelling errors:
○ "program manager" vs. "program manger" - only one of these is right
○ "Mayer" vs. "Meyer" - no clear source of truth
● Edit-distance-based approaches can be wrong both ways:
○ "Mohamad" and "Muhammed" are 3 edits apart and yet plausible variants
○ "Jeff" and "Joff" are 1 edit apart, but highly unlikely to be plausible variants of the same name

30

LinkedIn Approach - Name Clusters

Solution touches indexing, query reformulation and ranking

31

Name Clusters - Two-step clustering

● Coarse-level clustering
○ Uses double metaphone + some known heuristics
○ Focus on recall
● Fine-level clustering
○ Similarity function that takes into account Jaro-Winkler distance
○ User session data

32

Overall approach for Name Clusters

● Indexing
○ Store the cluster ID for each cluster in a separate field (say 'NAMECLUSTERID')
○ 'Cris' and 'chris' are in the same name cluster CHRISID
○ NAME:cris NAMECLUSTERID:chris
● Query processing
○ User query = 'chris'
○ Rewritten query = ?NAME:chris ?NAMECLUSTERID:chris
● Ranking (see the sketch below)
○ Different weights for 'perfect match' vs. 'name cluster match'
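One way to express that rewritten query is as an Elasticsearch-style bool query with optional ('should') clauses carrying different boosts. This is a hedged sketch: the field names (name, name_cluster_id) and boost values are illustrative, not the tutorial's actual schema.

```python
def rewrite_name_query(user_query, name_cluster_id):
    """Build an optional-clause query: an exact name match is boosted higher
    than a match on the name cluster, mirroring '?NAME:chris ?NAMECLUSTERID:chris'."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"term": {"name": {"value": user_query, "boost": 2.0}}},
                    {"term": {"name_cluster_id": {"value": name_cluster_id, "boost": 1.0}}},
                ]
            }
        }
    }

print(rewrite_name_query("chris", "CHRISID"))
```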

33

Instant Results via Inverted Index - Some Takeaways

● Used for documents at very high scale
● Use early termination
● Approach the problem as a combination of indexing / query processing / ranking

34

Agenda

● Terminology and Background
● Indexing & Retrieval
○ Instant Results
○ Query Autocomplete
● Ranking
● Hands-on tutorial with data from stackoverflow
○ Index and search posts from stackoverflow
○ Play around with ranking

35

Query Autocomplete - Problem Statement

● Let q = w1, w2, …, wk* represent the query with k words, where the kth token is a prefix, as denoted by the asterisk
● Goal: find one or more relevant completions for the query

36

Trie

● Used to store an associative array where keys are strings
● Only certain keys and leaves are of interest
● Structure only allows sharing of prefixes
● Representation is not memory efficient

A trie of the words {space, spark, moth}

37
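A minimal trie sketch in Python (illustrative only, not the tutorial code) showing prefix-shared storage and completion lookup:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def completions(root, prefix):
    """Return all stored words that start with `prefix`."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    out = []
    def walk(n, path):
        if n.is_word:
            out.append(prefix + path)
        for ch, child in n.children.items():
            walk(child, path + ch)
    walk(node, "")
    return out

root = TrieNode()
for w in ["space", "spark", "moth"]:
    insert(root, w)
print(completions(root, "sp"))  # ['space', 'spark'] (order may vary)
```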

Finite State Transducers (FST)

● Allow efficient retrieval of completions at runtime
● Can fit entirely into RAM
● Useful when keys have commonalities, allowing better compression
● Lucene has support for FSTs*

FST for the words: software, scala, scalding, spark

*Lucene FST implementation based on "Direct Construction of Minimal Acyclic Subsequential Transducers" (2001) by Stoyan Mihov, Denis Maurel

38

Query Autocomplete vs. Instant Results

● For query autocomplete the corpus of terms remains relatively constant; for instant results documents can be continuously added/removed
● Query autocomplete focuses only on prefix-based retrieval, whereas instant results use complex query construction for retrieval
● Query autocomplete retrieval is based off a dictionary, hence the index can be refreshed periodically instead of in real time

39

Query Tagging

● Segment the query based on recognized entities
● Annotate the query with:
○ Named Entity Tags
○ Standardized Identifiers
○ Related Entities
○ Additional Entity-Specific Metadata

40

Data Processing

● Break queries into recognized entities and individual tokens
● Past query logs are parsed for recognized entities and tokens, which are fed into an FST for retrieval of candidate suggestions

41

Retrieval

● All candidate completions over increasingly longer suffixes of the query are used, to capture enough context
● Given a query like "linkedin sof*" we look up completions for:
○ sof*, linkedin sof*
● Candidates are then provided to the scoring phase

42

Retrieval

● From the above FST, for the query "linkedin sof*" we retrieve the candidates (see the sketch below):
○ sof: [software developer, software engineer]
○ linkedin sof: []
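A minimal sketch of suffix-based candidate lookup (a plain Python dict standing in for the FST; the completion lists are the ones from the slide):

```python
# Dict standing in for the FST built from past query logs.
completions_index = {
    "sof": ["software developer", "software engineer"],
    "linkedin sof": [],
}

def candidate_completions(query_prefix):
    """Look up completions for increasingly longer suffixes of the query."""
    words = query_prefix.split()
    candidates = {}
    for start in range(len(words) - 1, -1, -1):
        suffix = " ".join(words[start:])
        candidates[suffix] = completions_index.get(suffix, [])
    return candidates

print(candidate_completions("linkedin sof"))
# {'sof': ['software developer', 'software engineer'], 'linkedin sof': []}
```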

43

Payloads

● Each query autocomplete result can have a payload associated with it.

● A payload holds serialized data useful in scoring the autocomplete result

44

Fuzzy Matching - LinkedIn Autocomplete

45

Fuzzy Matching

● Use a Levenshtein automaton constructed from a word and a maximum edit distance
● Based on the automaton and the letters input to it, we decide whether to continue or not
● Ex.: search for "dpark" (s/d being close on the keyboard) with edit distance 1 = [spark]

An index of {space, spark, moth} represented as a trie
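A brute-force stand-in for the automaton approach (illustrative only): compare the typed prefix against each indexed word's prefix and accept it if the edit distance is within the allowed budget. Real systems use Levenshtein automata so they never enumerate the whole dictionary.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def fuzzy_completions(typed, words, max_edits=1):
    """Return words whose prefix (same length as `typed`) is within max_edits."""
    return [w for w in words
            if edit_distance(typed, w[:len(typed)]) <= max_edits]

print(fuzzy_completions("dpark", ["space", "spark", "moth"]))  # ['spark']
```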

46

[Slides 47-49: animation frames of the fuzzy-matching trie traversal]

Suggestion = Spark

Agenda

● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking instant results
○ Ranking query suggestions
○ Blending
● Hands-on tutorial with data from stackoverflow

50

Ranking Challenge

● Short query prefixes
● Context beyond the query
○ Personalized context
○ Global context
■ Global popularity
■ Trending

51

Hand-Tuned vs. Machine-Learned Ranking

● Hard to manually tune with a very large number of features
● Challenging to personalize
● Learning to rank (LTR) allows leveraging a large volume of click data in an automated way

52

Agenda

● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking instant results
○ Ranking query suggestions
○ Blending
● Hands-on tutorial with data from stackoverflow

53

Features

● Text match
○ Match query terms with different fields on documents

54

Features

● Document Quality
○ Global Popularity
■ Celebrities
○ Spaminess

55

Features

● Social Affinity (personalized features)
○ Network distance between searcher and result
○ Connection Strength
■ Within the same company
■ Common connections
■ From the same school

56

Training Data

● Human judgement
● Challenge:
○ Personalization
○ Scale

57

Training Data

● Log-based
○ Personalized
○ Available in large quantity
● Position Bias
○ Top-K randomization (see the sketch below)
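A minimal sketch of top-K randomization for collecting less position-biased click data (illustrative only; K, the exploration rate, and the logging setup are assumptions, not LinkedIn's actual configuration):

```python
import random

def maybe_randomize_top_k(ranked_results, k=4, exploration_rate=0.05):
    """For a small fraction of queries, shuffle the top-K results so that
    clicks collected on those sessions are not dominated by position bias."""
    results = list(ranked_results)
    if random.random() < exploration_rate:
        top = results[:k]
        random.shuffle(top)
        results[:k] = top
    return results

print(maybe_randomize_top_k(["doc_a", "doc_b", "doc_c", "doc_d", "doc_e"]))
```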

58

Learning to Rank

▪ Pointwise: Reduce ranking to binary classification

59

[Figure: example result lists labeled with + (relevant) / − (non-relevant), e.g. +++−, +−−−, ++−−]


Limitations
▪ Relevant documents associated with different queries are put into the same class

Learning to Rank

▪ Pairwise: Reduce ranking to classification of document pairs w.r.t. the same query
– {(Q1, A>B), (Q2, C>D), (Q3, E>F)}

62


Learning to Rank
▪ Pairwise
– Limitation: does not differentiate inversions at top vs. bottom positions

64

Learning to Rank
▪ Listwise
– Directly operate on ranked lists
– Optimize a listwise objective function, e.g. IR metrics
▪ Mean Average Precision (MAP)
▪ Normalized Discounted Cumulative Gain (NDCG)

65

Agenda

● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking vertical results
○ Ranking query suggestions
○ Blending
● Hands-on tutorial with data from stackoverflow

66

Features

● Query Popularity
○ Candidate completion q = s1, s2, …, sk
○ Likelihood that q is a query in the query corpus, estimated by an N-gram language model (bigram here):

Pr(q) = Pr(s1, s2, …, sk) ≈ Pr(s1) · Pr(s2|s1) · … · Pr(sk|sk−1)
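A minimal sketch of estimating that probability from query-log counts with a bigram model (illustrative; the toy query log and the absence of smoothing are assumptions):

```python
from collections import Counter

query_log = ["software engineer", "software developer", "software engineer jobs"]

unigrams = Counter()
bigrams = Counter()
for q in query_log:
    tokens = q.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))
total = sum(unigrams.values())

def bigram_probability(query):
    """Pr(q) ~= Pr(s1) * prod Pr(s_i | s_{i-1}); no smoothing, so unseen
    terms or bigrams yield probability 0."""
    tokens = query.split()
    p = unigrams[tokens[0]] / total
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
    return p

print(bigram_probability("software engineer"))  # (3/7) * (2/3) ≈ 0.286
```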

67

Features

● Time-sensitive popularity [Shokouhi et al. SIGIR 12]

○ Trending query

○ Periodic Pattern

■ Weekend -> Disneyland
○ Time series: forecasted frequencies

68

Features

● Recency-based suggestion (Personalized feature)

69

Agenda

● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking instant results
○ Ranking query suggestions
○ Blending
● Hands-on tutorial with data from stackoverflow

70

Blending

71

Blending

72

[Diagram: a query prefix is sent to a federator, which fans out to the people instant, company instant, and query autocompletion backends; a blender merges their results]

Blending Challenges

● Different verticals are associated with different signals
○ People: network distance
○ Groups: time of the last edit
○ Query suggestion: edit distance
● Even common features may not be equally predictive across verticals
○ Popularity
○ Text similarity
● Scores might not be comparable across verticals

73

Approaches

● Separate binary classifiers

[Diagram: People results with features f1, f2, f3 scored by Classifier1; Jobs results with features f1, f2, f4 scored by Classifier2]

74

Approaches

● Separate binary classifiers
○ Pros
■ Handle vertical-specific features
■ Handle common features with different predictive powers
○ Cons
■ Need to calibrate output scores of multiple classifiers

75

Approaches

● Learning-to-rank - equal correlation assumption
○ Union of feature schemas, padding zeros into non-applicable features
○ Equal correlation assumption

[Diagram: People (f1 f2 f3) and Jobs (f1 f2 f4) results mapped into a union feature vector - (f1 f2 f3 f4=0) and (f1 f2 f3=0 f4) - and scored by a single model]

76

Approaches

● Learning-to-rank - equal correlation assumption
○ Pros
■ Handle vertical-specific features
■ Comparable output scores across verticals
○ Cons
■ Assume common features are equally predictive of vertical relevance

77

Approaches

● Learning-to-rank - Without equal correlation assumption

[Diagram: People results carry values only in the people-vertical feature slots (f1 f2 f3, job slots zeroed); Jobs results carry values only in the job-vertical slots (f4 f5 f6, people slots zeroed); a single model scores the concatenated vector]

78

Approaches

● Learning-to-rank - without equal correlation assumption (see the sketch below)
○ Pros
■ Handle vertical-specific features
■ Without the equal correlation assumption -> automatically learn the evidence-vertical association
■ Comparable output scores across verticals
○ Cons
■ The number of features is huge
● Overfitting
● Requires a huge amount of training data
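A minimal sketch contrasting the two feature-vector layouts (illustrative only; feature names, dimensions and values are made up):

```python
def union_schema_vector(vertical, features):
    """Equal-correlation layout: shared slots f1..f3 plus one vertical-specific
    slot (f4), zero-padded when not applicable."""
    f4 = features.get("f4", 0.0) if vertical == "jobs" else 0.0
    return [features.get("f1", 0.0), features.get("f2", 0.0),
            features.get("f3", 0.0), f4]

def per_vertical_vector(vertical, features, people_dim=3, jobs_dim=3):
    """No equal-correlation assumption: each vertical gets its own block of
    slots and the other vertical's block is zeroed, so the model can learn a
    separate weight for the same evidence per vertical."""
    people = [features.get(f"p{i}", 0.0) for i in range(1, people_dim + 1)]
    jobs = [features.get(f"j{i}", 0.0) for i in range(1, jobs_dim + 1)]
    if vertical == "people":
        return people + [0.0] * jobs_dim
    return [0.0] * people_dim + jobs

print(union_schema_vector("people", {"f1": 0.8, "f2": 0.3, "f3": 0.5}))
# [0.8, 0.3, 0.5, 0.0]
print(per_vertical_vector("jobs", {"j1": 0.7, "j2": 0.2, "j3": 0.9}))
# [0.0, 0.0, 0.0, 0.7, 0.2, 0.9]
```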

79

Evaluation

● "If you can't measure it, you can't improve it"
● Metrics (see the sketch below)
○ Successful search rate
○ Number of keystrokes per search: query length + clicked result rank
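A tiny sketch of the keystrokes metric as defined above (the session data is illustrative): a user who typed 3 characters and clicked the result at rank 2 is charged 3 + 2 = 5 keystrokes.

```python
def keystrokes_per_search(typed_query, clicked_rank):
    """Keystrokes charged to a search: characters typed plus the rank of the
    clicked result, per the slide's definition."""
    return len(typed_query) + clicked_rank

sessions = [("lin", 1), ("softw", 2)]
avg = sum(keystrokes_per_search(q, r) for q, r in sessions) / len(sessions)
print(avg)  # ((3+1) + (5+2)) / 2 = 5.5
```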

80

Take-Aways

● Speed
○ Instant results: early termination
○ Autocompletion: FST
● Tolerance to spelling errors
● Relevance: go beyond the query prefix
○ Personalized context
○ Global context

81

Agenda

● Terminology and Background
● Indexing & Retrieval
● Ranking
○ Ranking instant results
○ Ranking query suggestions
○ Blending
● Hands-on tutorial with data from stackoverflow

82

Dataset

● Posts and Tags from stackoverflow.com
● Posts are questions posted by users and contain the following attributes:
○ Title
○ Score
● Tags help identify a suitable category for the post and contain the following attributes:
○ Tag Name
○ Count
● Each post can have a maximum of five tags

83

stackoverflow.com

[Screenshot: a Stack Overflow question annotated with its Title, Tags, and Score]

84

stackoverflow.com

[Screenshot: a Stack Overflow question annotated with the Question, its Tags, Score, and tags & counts]

85

The End Product

86

[Screenshot: the finished UI with a search query input box, query autocomplete suggestions, and instant results]

Tools

87

Architecture

88

Assignments● Assignments available on Github

● Each assignment builds on a component of the end product

● Tests are provided at end of each assignment for validation

● Finished files available for reference (if needed)

● Raise your hand if you need help or have a question

89

Assignment 0: Setting up the machine

90

Assignment 1: Building the Instant Search and Autocomplete Index

91

Take-Aways

● The index should be used primarily for retrieval
● Data sources should be kept separate from the index
● Building an index is not instantaneous, hence have replicas in production
● Real-world indexes can seldom be stored in a single shard

92

Assignment 2: Building the Mid-Tier

93

Take-Aways

● Make incremental additions

● Allow for relevance changes to be compared

● Document relevance changes

● Do side by side evaluations

94

Assignment 3: Visualizing the blended result set

95

Assignment 4: Relevance Improvements

96

Summary

● Theoretical understanding of indexing, retrieval and ranking for instant search results and query autocomplete
● Insights and learnings from linkedin.com case studies
● Working end-to-end implementation of query autocomplete and instant results with the stackoverflow.com dataset

97

98