Relevance Ranking


Transcript of Relevance Ranking

Page 1: Relevance Ranking

1

Relevance Ranking

Page 2: Relevance Ranking

2

Introduction

In relational algebra, the response to a query is always an unordered set of qualifying tuples.

Keyword queries are not precise. Instead, rate each document for how likely it is to satisfy the user's information need, and present the results in a ranked list.

Page 3: Relevance Ranking

3

Relevance ranking

Recall and Precision
The Vector-Space Model
A broad class of ranking algorithms based on this model
Relevance Feedback and Rocchio's Method
Probabilistic Relevance Feedback Models
Advanced Issues

Page 4: Relevance Ranking

4

Measures for a search engine
How fast does it index? (number of documents per hour)
How fast does it search? (latency as a function of index size)
Expressiveness of the query language (ability to express complex information needs)
Uncluttered UI
Is it free?

Page 5: Relevance Ranking

5

Measures for a search engine
All of the preceding criteria are measurable.
The key measure: user happiness. What is this?
Speed of response and size of index are factors, but blindingly fast, useless answers won't make a user happy.
We need a way of quantifying user happiness.

Page 6: Relevance Ranking

6

Happiness: elusive to measure

Most common proxy: relevance of search results

But how do you measure relevance? Relevance measurement requires three elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A (usually binary) assessment of either Relevant or Nonrelevant for each query and each document
There is some work on more-than-binary assessments, but binary is the standard.

Page 7: Relevance Ranking

7

Evaluating an IR system
Note: the information need is translated into a query.
Relevance is assessed relative to the information need, not the query.
E.g., information need: I'm looking for information on whether drinking red wine is more effective at reducing your risk of heart attacks than white wine.
Query: wine red white heart attack effective
You evaluate whether the doc addresses the information need, not whether it has these words.

Page 8: Relevance Ranking

8

Standard relevance benchmarks
TREC: the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
Human experts mark, for each query and for each doc, Relevant or Nonrelevant.

Page 9: Relevance Ranking

9

Unranked retrieval evaluation: Precision and Recall
Precision: fraction of retrieved docs that are relevant = P(relevant | retrieved)
Recall: fraction of relevant docs that are retrieved = P(retrieved | relevant)

Precision P = tp / (tp + fp)
Recall R = tp / (tp + fn)

                Relevant               Nonrelevant
Retrieved       tp (true positive)     fp (false positive)
Not Retrieved   fn (false negative)    tn (true negative)
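The following is a minimal sketch (not from the slides) of how the two measures follow from the confusion-matrix counts; the function name and example counts are illustrative.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative counts: 3 relevant docs retrieved, 2 irrelevant retrieved, 1 relevant missed
print(precision_recall(tp=3, fp=2, fn=1))  # (0.6, 0.75)
```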

Page 10: Relevance Ranking

10

Should we instead use the accuracy measure for evaluation?
Given a query, an engine classifies each doc as "Relevant" or "Nonrelevant".
The accuracy of an engine: the fraction of these classifications that are correct.
Accuracy is a commonly used evaluation measure in machine learning classification work.
Why is this not a very useful evaluation measure in IR?

Page 11: Relevance Ranking

11

Why not just use accuracy?
How to build a 99.9999% accurate search engine on a low budget: return nothing for every query. (Since almost all documents are nonrelevant to any given query, nearly every classification is then "correct".)
People doing information retrieval want to find something and have a certain tolerance for junk.

Search for:
0 matching results found.

Page 12: Relevance Ranking

12

Precision/Recall

You can get high recall (but low precision) by retrieving all docs for all queries!

Recall is a non-decreasing function of the number of docs retrieved

In a good system, precision decreases as either the number of docs retrieved or recall increases. This is not a theorem, but a result with strong empirical confirmation.

Page 13: Relevance Ranking

13

Evaluating ranked results: Model

For each query q, an exhaustive set of relevant documents D_q ⊆ D is identified.
The query q is submitted to the query system.
A ranked list of documents (d_1, d_2, …, d_n) is returned.
Corresponding to the list, we can compute a 0/1 relevance list (r_1, r_2, …, r_n): r_i = 1 if d_i ∈ D_q, and r_i = 0 if d_i ∉ D_q.

The recall at rank k ≥ 1 is defined as

    recall(k) = (1 / |D_q|) · Σ_{i=1}^{k} r_i

The precision at rank k is defined as

    precision(k) = (1 / k) · Σ_{i=1}^{k} r_i

Page 14: Relevance Ranking

14

Evaluating ranked results: Example

D_q = {d1, d5, d7, d10}
Retrieved documents: (d1, d10, d15, d5, d4, d7, d22, d2)
Then the relevance list is (1, 1, 0, 1, 0, 1, 0, 0).

Recall(2) = (1/4)·(1+1) = 0.5
Recall(5) = (1/4)·(1+1+0+1+0) = 0.75
Recall(6) = (1/4)·(1+1+0+1+0+1) = 1

Precision(2) = (1/2)·(1+1) = 1
Precision(5) = (1/5)·(1+1+0+1+0) = 0.6
Precision(6) = (1/6)·(1+1+0+1+0+1) = 2/3
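A small Python sketch of these definitions (the function name is ours, not the slides'); it reproduces the numbers in the example above.

```python
def precision_recall_at_k(relevance, k, num_relevant):
    """precision(k) and recall(k) from a 0/1 relevance list, as defined on the previous slide."""
    hits = sum(relevance[:k])
    return hits / k, hits / num_relevant

relevance = [1, 1, 0, 1, 0, 1, 0, 0]        # relevance list from the example, |D_q| = 4
for k in (2, 5, 6):
    p, r = precision_recall_at_k(relevance, k, num_relevant=4)
    print(k, round(p, 3), round(r, 3))       # 2: 1.0/0.5, 5: 0.6/0.75, 6: 0.667/1.0
```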

Page 15: Relevance Ranking

15

Another figure of merit: Average precision
The sum of the precision at each relevant hit position in the response list, divided by the total number of relevant documents.
The average precision is 1 only if the engine retrieves all relevant documents and ranks them ahead of any irrelevant document.

    average precision = (1 / |D_q|) · Σ_{k=1}^{|D|} r_k · precision(k)

Page 16: Relevance Ranking

16

Average precision

Example: D_q = {d1, d5, d7, d10}
Retrieved documents: (d1, d10, d15, d5, d4, d7, d22, d2)
Relevance list: (1, 1, 0, 1, 0, 1, 0, 0)
Average precision = (1/|D_q|)·(1 + 1 + 3/4 + 4/6) = (1/4)·(41/12) ≈ 0.854
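A short sketch of average precision (function name illustrative), checked against the example above:

```python
def average_precision(relevance, num_relevant):
    """Sum of precision(k) at each relevant hit position, divided by the number of relevant docs."""
    total, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k              # precision at this relevant position
    return total / num_relevant

print(round(average_precision([1, 1, 0, 1, 0, 1, 0, 0], num_relevant=4), 3))  # 0.854
```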

Page 17: Relevance Ranking

17

Evaluation at large search engines

For a large corpus in rapid flux, such as the web, it is impossible to determine D_q. Recall is difficult to measure on the web.

Search engines often use precision at top k, e.g., k = 10

. . . or measures that reward you more for getting rank 1 right than for getting rank 10 right.

Page 18: Relevance Ranking

18

Evaluation at large search engines
Search engines also use non-relevance-based measures.
Clickthrough on first result: not very reliable if you look at a single clickthrough, but pretty reliable in the aggregate.
A/B testing

Page 19: Relevance Ranking

19

A/B testing

Purpose: test a single innovation.
Prerequisite: you have a large search engine up and running.
Have most users use the old system.
Divert a small proportion of traffic (e.g., 1%) to the new system that includes the innovation.
Evaluate with an "automatic" measure like clickthrough on first result.
Now we can directly see if the innovation improves user happiness.
Probably the evaluation methodology that large search engines trust most.

Page 20: Relevance Ranking

20

Vector space model

Page 21: Relevance Ranking

21

Problem with Boolean search: feast or famine
Boolean queries often result in either too few (=0) or too many (1000s) results.
Query 1: "standard user dlink 650" → 200,000 hits
Query 2: "standard user dlink 650 no card found" → 0 hits
It takes skill to come up with a query that produces a manageable number of hits.
With a ranked list of documents it does not matter how large the retrieved set is.

Page 22: Relevance Ranking

22

Scoring as the basis of ranked retrieval
We wish to return, in order, the documents most likely to be useful to the searcher.
How can we rank-order the documents in the collection with respect to a query?
Assign a score, say in [0, 1], to each document.
This score measures how well the document and query "match".

Page 23: Relevance Ranking

23

Term-document count matrices
Consider the number of occurrences of a term in a document.
Each document is a count vector in ℕ^|V|: a column below.

            Antony and Cleopatra   Julius Caesar   The Tempest   Hamlet   Othello   Macbeth
Antony      157                    73              0             0        0         0
Brutus      4                      157             0             1        0         0
Caesar      232                    227             0             2        1         1
Calpurnia   0                      10              0             0        0         0
Cleopatra   57                     0               0             0        0         0
mercy       2                      0               3             5        5         1
worser      2                      0               1             1        1         0

Page 24: Relevance Ranking

24

The vector space model

Documents are represented as vectors in a multidimensional Euclidean space.
Each axis in this space corresponds to a term.
The coordinate of document d in the direction corresponding to term t is determined by two quantities: term frequency and inverse document frequency.

Page 25: Relevance Ranking

25

Term frequency tf

The term frequency tft,d of term t in document d is defined as the number of times that t occurs in d.

We want to use tf when computing query-document match scores. But how? Raw term frequency is not what we want:
A document with 10 occurrences of the term is more relevant than a document with one occurrence of the term, but not 10 times more relevant.
Relevance does not increase proportionally with term frequency.

Page 26: Relevance Ranking

26

Term frequency tf

Normalization is needed! There are many normalization methods.

Normalize by the sum of term counts in the document:

    tf_{t,d} = n(d,t) / Σ_{t'} n(d,t')

Log frequency weight:

    tf_{t,d} = 1 + log10 n(d,t)   if n(d,t) > 0,   and 0 otherwise

The Cornell SMART system uses the following equation to normalize tf:

    tf_{t,d} = 1 + log(1 + log n(d,t))   if n(d,t) > 0,   and 0 otherwise

Page 27: Relevance Ranking

27

Document frequency

Rare terms are more informative than frequent terms. Recall stop words.

Consider a term in the query that is rare in the collection (e.g., arachnocentric)

A document containing this term is very likely to be relevant to the query arachnocentric

→ We want a high weight for rare terms like arachnocentric.

Page 28: Relevance Ranking

28

Collection vs. Document frequency

The collection frequency of t is the number of occurrences of t in the collection, counting multiple occurrences.

Example:

Which word is a better search term (and should get a higher weight)?

Word        Collection frequency   Document frequency
insurance   10440                  3997
try         10422                  8760

Page 29: Relevance Ranking

29

Document frequency
Consider a query term that is frequent in the collection (e.g., high, increase, line).
A document containing such a term is more likely to be relevant than a document that doesn't, but it's not a sure indicator of relevance.
→ For frequent terms, we want positive weights for words like high, increase, and line, but lower weights than for rare terms.
We will use document frequency (df) to capture this in the score.
df (≤ N) is the number of documents that contain the term.

Page 30: Relevance Ranking

30

Inverse document frequency

df_t is the document frequency of t: the number of documents that contain t. df is a measure of the informativeness of t.
We define the idf (inverse document frequency) of t by

    idf_t = log10 (N / df_t)

We use log(N/df_t) instead of N/df_t to "dampen" the effect of idf.
Again, the IDF used by the SMART system is

    idf_t = log ((1 + N) / df_t)

Page 31: Relevance Ranking

31

tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf weight: w_{t,d} = tf_{t,d} × idf_t
Best known weighting scheme in information retrieval.
Note: the "-" in tf-idf is a hyphen, not a minus sign! Alternative names: tf.idf, tf x idf.
Increases with the number of occurrences within a document.
Increases with the rarity of the term in the collection.

Page 32: Relevance Ranking

32

Example

       t1    t2    t3
d1     100   10    0
d2     0     5     10
d3     0     10    0
d4     0     0     100
d5     0     0     10
d6     0     0     8
d7     0     0     15
d8     0     7     10
d9     0     100   0

Collection size N = 9
idf_t1 = log((1+9)/1) = 1
idf_t2 = log((1+9)/5) = 0.301
idf_t3 = log((1+9)/6) = 0.222

tf_{t1,d1} = 1 + log(1 + log 100) = 1.477
tf_{t2,d1} = 1 + log(1 + log 10) = 1.301

Therefore, d1 = (1.477 × 1, 1.301 × 0.301, 0 × 0.222) = (1.477, 0.392, 0)
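A sketch of the SMART-style weighting used in this example (base-10 logs assumed; helper names are ours); it reproduces the d1 vector above.

```python
import math

def smart_tf(count: float) -> float:
    """SMART tf: 1 + log(1 + log n(d,t)) if n(d,t) > 0, else 0 (base-10 logs)."""
    return 1 + math.log10(1 + math.log10(count)) if count > 0 else 0.0

def smart_idf(N: int, df: int) -> float:
    """SMART idf: log((1 + N) / df_t), base 10."""
    return math.log10((1 + N) / df)

counts_d1 = {"t1": 100, "t2": 10, "t3": 0}   # row d1 of the table above
df = {"t1": 1, "t2": 5, "t3": 6}             # document frequencies from the table
N = 9                                        # collection size

d1 = [smart_tf(counts_d1[t]) * smart_idf(N, df[t]) for t in ("t1", "t2", "t3")]
print([round(w, 3) for w in d1])             # [1.477, 0.392, 0.0]
```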

Page 33: Relevance Ranking

33

cosine(query, document)

    cos(q, d) = (q · d) / (|q| |d|)
              = Σ_{i=1}^{|V|} q_i·d_i / ( √(Σ_{i=1}^{|V|} q_i²) · √(Σ_{i=1}^{|V|} d_i²) )

i.e., the dot product of the corresponding unit vectors.

q_i is the tf-idf weight of term i in the query; d_i is the tf-idf weight of term i in the document.
cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
It is the proximity measure between the query and the documents.
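A minimal cosine-similarity sketch (names ours); the example scores the d1 vector from the previous slide against a query that weights only t1.

```python
import math

def cosine(q, d):
    """Cosine of the angle between two equal-length tf-idf vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    norm_d = math.sqrt(sum(di * di for di in d))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

# Query weighting only t1, scored against the d1 vector computed above
print(round(cosine([1.0, 0.0, 0.0], [1.477, 0.392, 0.0]), 3))  # 0.967
```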

Page 34: Relevance Ranking

34

Summary – vector space ranking

Represent the query as a weighted tf-idf vector

Represent each document as a weighted tf-idf vector

Compute the cosine similarity score for the query vector and each document vector

Rank documents with respect to the query by score

Return the top K (e.g., K = 10) to the user

Page 35: Relevance Ranking

35

Relevance Feedback
The initial response from a search engine may not satisfy the user's information need.
The average Web query is only two words long. Users can rarely express their information need within two words.
However, if the response list has at least some relevant documents, sophisticated users can learn how to modify their queries by adding or negating additional keywords.
Relevance feedback automates this query refinement process.

Page 36: Relevance Ranking

36

Relevance Feedback: basic idea
Relevance feedback: user feedback on the relevance of docs in an initial set of results.
The user issues a (short, simple) query.
The user marks some results as relevant or non-relevant.
The system computes a better representation of the information need based on the feedback.
Relevance feedback can go through one or more iterations.

Page 37: Relevance Ranking

37

Similar pages

Page 38: Relevance Ranking

38

Relevance Feedback: Example

Image search engine http://nayana.ece.ucsb.edu/imsearch/imsearch.html

Page 39: Relevance Ranking

39

Results for Initial Query

Page 40: Relevance Ranking

40

Relevance Feedback

Page 41: Relevance Ranking

41

Results after Relevance Feedback

Page 42: Relevance Ranking

42

Key concept: Centroid
The centroid is the center of mass of a set of points.
Recall that we represent documents as points in a high-dimensional space.
Definition:

    μ(C) = (1 / |C|) · Σ_{d ∈ C} d

where C is a set of documents.

Page 43: Relevance Ranking

43

Rocchio Algorithm
The Rocchio algorithm uses the vector space model to pick a relevance-feedback query.
Rocchio seeks the query q_opt that maximizes

    q_opt = argmax_q [ cos(q, μ(C_r)) − cos(q, μ(C_nr)) ]

i.e., it tries to separate docs marked relevant and non-relevant, which gives

    q_opt = (1 / |C_r|) · Σ_{d_j ∈ C_r} d_j  −  (1 / |C_nr|) · Σ_{d_j ∈ C_nr} d_j

Page 44: Relevance Ranking

44

The Theoretically Best Query

[Figure: relevant documents (o) and non-relevant documents (x) plotted in the vector space, with the optimal query vector positioned to separate the relevant from the non-relevant documents.]

Page 45: Relevance Ranking

45

Rocchio 1971 Algorithm (SMART)

Used in practice:

    q_m = α·q_0 + β·(1/|D_r|)·Σ_{d_j ∈ D_r} d_j − γ·(1/|D_nr|)·Σ_{d_j ∈ D_nr} d_j

D_r = set of known relevant doc vectors; D_nr = set of known irrelevant doc vectors (different from C_r and C_nr).
q_m = modified query vector; q_0 = original query vector; α, β, γ: weights (hand-chosen or set empirically).
The new query moves toward relevant documents and away from irrelevant documents.
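A sketch of the Rocchio update above using NumPy; the α, β, γ values shown are common illustrative defaults, not prescribed by the slides, and negative weights are clipped to 0 as noted on the "Subtleties" slide below.

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q0 + beta*centroid(D_r) - gamma*centroid(D_nr)."""
    qm = alpha * np.asarray(q0, dtype=float)
    if len(relevant):
        qm = qm + beta * np.mean(np.asarray(relevant, dtype=float), axis=0)
    if len(nonrelevant):
        qm = qm - gamma * np.mean(np.asarray(nonrelevant, dtype=float), axis=0)
    return np.maximum(qm, 0.0)        # negative term weights are set to 0

q0 = [1.0, 0.0, 0.0]
print(rocchio(q0, relevant=[[1.477, 0.392, 0.0]], nonrelevant=[[0.0, 0.0, 0.5]]))
# ≈ [2.108, 0.294, 0.0]  (the negative third component is clipped to 0)
```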

Page 46: Relevance Ranking

46

Relevance feedback on initial query

[Figure: known relevant documents (o) and known non-relevant documents (x) in the vector space; the revised query vector moves from the initial query toward the relevant documents.]

Page 47: Relevance Ranking

47

Subtleties to note
Tradeoff α vs. β/γ: if we have a lot of judged documents, we want higher β and γ.
γ is commonly set to zero:
It's harder for the user to give negative feedback.
It's also harder to use, since relevant documents often form a tight cluster, but non-relevant documents rarely do.
Some weights in the query vector can go negative; negative term weights are ignored (set to 0).

Page 48: Relevance Ranking

48

Relevance Feedback in vector spaces

We can modify the query based on relevance feedback and apply the standard vector space model.
Relevance feedback can improve recall and precision.
Relevance feedback is most useful for increasing recall in situations where recall is important.
Users can be expected to review results and to take time to iterate.

Page 49: Relevance Ranking

49

Evaluation of relevance feedback strategies
Use q_0 and compute a precision/recall graph.
Use q_m and compute a precision/recall graph.
Assessing on all documents in the collection gives spectacular improvements, but… it's cheating! It is partly due to the known relevant documents being ranked higher; we must evaluate with respect to documents not seen by the user.
Use documents in the residual collection (the set of documents minus those assessed relevant).
Measures are usually then lower than for the original query, but this is a more realistic evaluation, and relative performance can be validly compared.

Page 50: Relevance Ranking

50

Evaluation of relevance feedback
Most satisfactory: use two collections, each with its own relevance assessments.
q_0 and the user feedback come from the first collection; q_m is run on the second collection and measured.
Empirically, one round of relevance feedback is often very useful. Two rounds is sometimes marginally useful.

Page 51: Relevance Ranking

51

Evaluation: Caveat
True evaluation of usefulness must compare to other methods taking the same amount of time.
Alternative to relevance feedback: the user revises and resubmits the query.
Users may prefer revision/resubmission to having to judge the relevance of documents.
There is no clear evidence that relevance feedback is the "best use" of the user's time.

Page 52: Relevance Ranking

52

Pseudo relevance feedback
Pseudo-relevance feedback automates the "manual" part of true relevance feedback.
Pseudo-relevance algorithm:
Retrieve a ranked list of hits for the user's query.
Assume that the top k documents are relevant.
Do relevance feedback (e.g., Rocchio).
Works very well on average, but can go horribly wrong for some queries.
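A self-contained sketch of one pseudo-relevance-feedback round under these assumptions (k, α, β and the cosine ranking step are illustrative choices, not from the slides):

```python
import numpy as np

def pseudo_relevance_feedback(q0, doc_vectors, k=5, alpha=1.0, beta=0.75):
    """Rank docs by cosine similarity to q0, assume the top k are relevant,
    and do one Rocchio-style update (no non-relevant set in the pseudo setting)."""
    q = np.asarray(q0, dtype=float)
    docs = np.asarray(doc_vectors, dtype=float)
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12)
    top_k = docs[np.argsort(-sims)[:k]]           # vectors of the k best-scoring docs
    return alpha * q + beta * top_k.mean(axis=0)  # move the query toward their centroid

q0 = [1.0, 0.0, 0.0]
docs = [[1.477, 0.392, 0.0], [0.0, 0.0, 0.5], [0.9, 0.1, 0.0]]
print(pseudo_relevance_feedback(q0, docs, k=2))
```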

Page 53: Relevance Ranking

53

Relevance Feedback on the Web

Some search engines offer a similar/related pages feature (this is a trivial form of relevance feedback): Google (link-based), Stanford WebBase.
But some don't, because it's hard to explain to the average user: Alltheweb, Yahoo.
Relevance feedback is not a commonly available feature on Web search engines: users are not patient enough to give their feedback to the system!

Page 54: Relevance Ranking

54

Probabilistic Relevance Feedback Models
The vector space model is operational:
It gives a precise recipe for computing the relevance of a document with regard to the query.
It does not attempt to justify why relevance should be defined that way.
Probabilistic models can be used to understand the behavior of practical IR systems.

Page 55: Relevance Ranking

55

Probabilistic Relevance Feedback Models
Let R be a Boolean random variable that represents the relevance of document d with regard to query q.
A reasonable order for ranking documents is their odds ratio for relevance, Pr(R = 1 | d, q) / Pr(R = 0 | d, q).

Page 56: Relevance Ranking

56

Probabilistic Relevance Feedback Models
We assume that term occurrences are independent given the query and the value of R.
Let {t} be the universe of terms, and let x_{d,t} ∈ {0,1} reflect whether term t appears in document d or not.
We then get a factorization of the relevance odds over terms.

Page 57: Relevance Ranking

57

Probabilistic Relevance Feedback Models

Page 58: Relevance Ranking

58

Advanced Issues: Spamming
In classical IR corpora, document authors were largely truthful about the purpose of the document: terms found in the document were broadly indicative of content.
However, economic reality on the web and the need to capture eyeballs have led to quite a different culture.

Page 59: Relevance Ranking

59

Spamming

Spammers: those who add popular query terms to pages unrelated to those terms.
For example, adding the terms "Hawaii vacation rental" to a page by making the font color the same as the background color.

Page 60: Relevance Ranking

60

Spamming
In the early days of the Web, search engines used many clues, such as font color and position, to judge a page.
They guarded these secrets zealously, for knowing those secrets would enable spammers to beat the system again easily enough.
With the invention of hyperlink-based ranking techniques, spammers went through a setback phase: the number and popularity of sites that cited a page started to matter quite strongly in determining the rank of that page in response to a query.

Page 61: Relevance Ranking

61

Advanced Issues

Titles, headings, metatags and anchor text
The standard tf-idf framework treats all terms uniformly. On the Web, valuable information may be lost this way!
Different idioms deserve different weights:
Titles: <title>…</title>
Headings: <h2>…</h2>
Font modifiers: <strong>…</strong>, <font …>…</font>
Most search engines respond to these different idioms by assigning different weights.

Page 62: Relevance Ranking

62

Titles, headings, metatags and anchor text
Consider the Web's rich hyperlink structure: succinct descriptions of the content of a page v may often be found in the text of pages u that link to v.
The text in and near the "anchor" construct may be especially significant.
In the World Wide Web Worm system, McBryan built an index where a page v that had not been fetched by the crawler would be indexed using the anchor text on pages u that link to v. Lycos and Google adopted the same approach, increasing their reach by almost a factor of two.

Page 63: Relevance Ranking

63

Titles, headings, metatags and anchor text
(Near) anchor text on u offers valuable editorial judgment about v as well.
The consensus among many authors of pages like u about what terms to use as (near) anchor text is valuable in fighting spam and returning better-ranked responses.

Page 64: Relevance Ranking

64

Advanced Issues: Ranking for complex queries including phrases
Many search systems permit the query to contain single words, phrases, and word inclusions and exclusions.
Explicit inclusions and exclusions are hard Boolean predicates: all responses must satisfy them.
With operators and phrases, the query/documents can no longer be treated as ordinary points in vector space.

Page 65: Relevance Ranking

65

Ranking for complex queries including phrases
Basic approaches:
Positional index
Construct two separate indices: one for single terms and the other for phrases
Regard the phrases as new dimensions added to the vector space
The problem is how to define a "phrase":
Manually
Biword indexes
Derived from the corpus itself using statistical techniques

Page 66: Relevance Ranking

66

Biword indexes
Index every consecutive pair of terms in the text as a phrase.
For example, the text "Friends, Romans, Countrymen" would generate the biwords "friends romans" and "romans countrymen".
Each of these biwords is now a dictionary term.
Two-word phrase query processing is now immediate.
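A tiny sketch of biword generation (the tokenization is simplified and the function name is ours):

```python
def biwords(text: str) -> list[str]:
    """Every consecutive pair of terms, indexed as a single 'phrase' term."""
    terms = [t.strip(",.").lower() for t in text.split()]
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

print(biwords("Friends, Romans, Countrymen"))  # ['friends romans', 'romans countrymen']
```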

Page 67: Relevance Ranking

67

Longer phrase queries
Longer phrases are divided into several biwords: "stanford university palo alto" can be broken into the Boolean query on biwords:
    "stanford university" AND "university palo" AND "palo alto"

Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.

Can have false positives!

Page 68: Relevance Ranking

68

Extended biwords
Parse the indexed text and perform part-of-speech tagging (POST).
Bucket the terms into (say) Nouns (N) and articles/prepositions (X).
Now deem any string of terms of the form N X* N to be an extended biword. Each such extended biword is now made a term in the dictionary.
Example: catcher in the rye
         N       X  X   N
Query processing: parse the query into N's and X's, segment it into enhanced biwords, and look them up in the index.

Page 69: Relevance Ranking

69

Statistical techniques

To decide whether "t1 t2" occurs more often as a phrase than their individual rates of occurrence would suggest, a likelihood ratio test can be used:
Null hypothesis: t1 and t2 are independent.
Alternative hypothesis: t1 and t2 are dependent.

Page 70: Relevance Ranking

70

Likelihood ratio test
Θ is the entire parameter space; Θ0 is the parameter space corresponding to the null hypothesis. Since Θ0 ⊆ Θ, λ ≤ 1.
It is known that −2 log λ is asymptotically χ²-distributed, with degrees of freedom equal to the difference in the dimensions of Θ and Θ0.
If the occurrences of t1 and t2 are independent, λ is almost 1, and −2 log λ is almost 0. The larger the value of −2 log λ, the stronger the dependence.

    λ = max_{p ∈ Θ0} H(p; k) / max_{p ∈ Θ} H(p; k)

Page 71: Relevance Ranking

71

Likelihood ratio test

In the Bernoulli trial setting, with k = (k00, k01, k10, k11):

For the null hypothesis (independence, with marginal probabilities p1 for t1 and p2 for t2):

    H(p1, p2; k00, k01, k10, k11) = ((1−p1)(1−p2))^k00 · ((1−p1)p2)^k01 · (p1(1−p2))^k10 · (p1·p2)^k11

For the alternative hypothesis (cell probabilities p00, p01, p10, p11):

    H(p00, p01, p10, p11; k00, k01, k10, k11) = p00^k00 · p01^k01 · p10^k10 · p11^k11

Page 72: Relevance Ranking

72

Likelihood ratio test

The maximum likelihood estimators:
For the null hypothesis:
    p1 = (k10 + k11) / (k00 + k01 + k10 + k11)
    p2 = (k01 + k11) / (k00 + k01 + k10 + k11)
For the alternative hypothesis:
    p00 = k00 / (k00 + k01 + k10 + k11)
    (the other parameters are obtained in the same way)
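As a hedged sketch, the −2 log λ statistic for this 2×2 independence test can be computed in its equivalent G-test form, 2·Σ observed·ln(observed/expected); the counts in the example call below are made up purely for illustration.

```python
import math

def llr_bigram(k00: int, k01: int, k10: int, k11: int) -> float:
    """-2 log(lambda) for independence of t1 and t2, via the equivalent form
    2 * sum(O * ln(O / E)) over the four cells of the contingency table.
    k11: both terms present, k10: only t1, k01: only t2, k00: neither."""
    n = k00 + k01 + k10 + k11
    cells = {(0, 0): k00, (0, 1): k01, (1, 0): k10, (1, 1): k11}
    row = {0: k00 + k01, 1: k10 + k11}   # t1 absent / present
    col = {0: k00 + k10, 1: k01 + k11}   # t2 absent / present
    g = 0.0
    for (i, j), observed in cells.items():
        expected = row[i] * col[j] / n   # expected count under independence
        if observed > 0:
            g += observed * math.log(observed / expected)
    return 2 * g

# Compare against a chi-square table with 1 degree of freedom (e.g., 3.84 at 95%)
print(round(llr_bigram(k00=570000, k01=3000, k10=2000, k11=100), 1))
```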

Page 73: Relevance Ranking

73

Likelihood ratio test

We can sort the entries by −2 log λ and compare the top rankers against a χ² table to note the confidence with which the null hypothesis is violated.
Dunning reports the following top candidates from some common English text:

−2 log λ   phrase
271        The swiss
264        Can be
257        Previous year
167        Mineral water

Page 74: Relevance Ranking

74

Advanced Issues

Approximate string matching
Even though a large fraction of web pages are written in English, many other languages are also used.
Dialects of English or transliteration from other languages may make many word spellings nonuniform.
An exact character-by-character match between the query terms entered by the user and the keys in the inverted index may miss relevant matches!

Page 75: Relevance Ranking

75

Approximate string matching

Two ways to reduce the problem:
Soundex: use an aggressive conflation mechanism that collapses variant spellings into the same token.
For example, bradley, bartle, bartlett and brodley all share the soundex code B634.
N-gram approach: decompose terms into sequences of n-grams.

Page 76: Relevance Ranking

76

Soundex
A class of heuristics to expand a query into phonetic equivalents.
Language-specific, mainly for names. E.g., Herman ↔ Hermann.
Invented for the U.S. census… in 1918.

Page 77: Relevance Ranking

77

Soundex – typical algorithm
Turn every token to be indexed into a 4-character reduced form.
Do the same with query terms.
Build and search an index on the reduced forms (when the query calls for a soundex match).
http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm#Top

Page 78: Relevance Ranking

78

Soundex – typical algorithm
1. Retain the first letter of the word.
2. Change all occurrences of the following letters to '0' (zero): 'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'.
3. Change letters to digits as follows:
   B, F, P, V → 1
   C, G, J, K, Q, S, X, Z → 2
   D, T → 3
   L → 4
   M, N → 5
   R → 6

Page 79: Relevance Ranking

79

Soundex continued
4. Remove all pairs of consecutive identical digits (keeping one of each pair).
5. Remove all zeros from the resulting string.
6. Pad the resulting string with trailing zeros and return the first four positions, which will be of the form <uppercase letter> <digit> <digit> <digit>.

E.g., Herman becomes H655. Will Hermann generate the same code?
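A sketch that follows the six steps above literally (the official Soundex adds a couple of edge-case rules, e.g., for the digit class of the first letter):

```python
def soundex(word: str) -> str:
    """Soundex per the steps above: keep the first letter, map the rest to digit
    classes, collapse adjacent identical digits, drop zeros, pad/truncate to 4."""
    codes = {**dict.fromkeys("AEIOUHWY", "0"),
             **dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    digits = word[0] + "".join(codes.get(ch, "0") for ch in word[1:])
    collapsed = "".join(d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1])
    return (collapsed[0] + collapsed[1:].replace("0", "") + "000")[:4]

print(soundex("Herman"), soundex("Hermann"))  # H655 H655  (yes, the same code)
```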

Page 80: Relevance Ranking

80

Soundex
Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, …).
How useful is soundex? Not very, for information retrieval.
Okay for "high recall" tasks (e.g., Interpol), though biased to names of certain nationalities.
Zobel and Dart (1996) show that other algorithms for phonetic matching perform much better in the context of IR.

Page 81: Relevance Ranking

81

n-gram
Enumerate all the n-grams in the query string as well as in the lexicon.
Use the n-gram index to retrieve all lexicon terms matching any of the query n-grams.
Threshold by the number of matching n-grams.

Page 82: Relevance Ranking

82

Example with trigrams
Suppose the text is november: its trigrams are nov, ove, vem, emb, mbe, ber.
The query is december: its trigrams are dec, ece, cem, emb, mbe, ber.
So 3 trigrams overlap (of 6 in each term). How can we turn this into a normalized measure of overlap?

Page 83: Relevance Ranking

83

One option – Jaccard coefficient
A commonly-used measure of overlap.
Let X and Y be two sets; then the Jaccard coefficient is

    |X ∩ Y| / |X ∪ Y|

It equals 1 when X and Y have the same elements and 0 when they are disjoint.
X and Y don't have to be of the same size, and the coefficient is always a number between 0 and 1.
Now threshold to decide if you have a match, e.g., if J.C. > 0.8, declare a match.
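A sketch combining the trigram decomposition with the Jaccard coefficient (names ours); it reproduces the november/december example:

```python
def ngrams(term: str, n: int = 3) -> set[str]:
    """The set of character n-grams of a term (trigrams by default)."""
    return {term[i:i + n] for i in range(len(term) - n + 1)}

def jaccard(x: set, y: set) -> float:
    """|X intersect Y| / |X union Y|."""
    return len(x & y) / len(x | y) if (x or y) else 0.0

score = jaccard(ngrams("november"), ngrams("december"))
print(round(score, 2), score > 0.8)   # 0.33 False -> no match at a 0.8 threshold
```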

Page 84: Relevance Ranking

84

N-gram

Looking up the inverted index now becomes a two-stage affair:
First, an index of n-grams is consulted to expand each query term into a set of slightly distorted query terms.
Then, all these terms are submitted to the regular index.