Search: A Basic Overview
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
October 20, 2014
Back in those days
Once upon a time, there were days without search engines. We had access to a much smaller amount of information, and had to find it manually.
Search engine
A user needs some information. Assumption: the required information is present somewhere. A search engine tries to bridge this gap.
How: the user “expresses” the information need as a query; the engine returns a list of documents, or answers by some better means.
Simplest model: the user submits a query – a set of words (terms) – and the search engine returns documents “matching” the query. Assumption: matching the query satisfies the information need. Modern search has come a long way from this simple model, but the fundamentals are still required.
Basic approach
Example documents:
d1: Statistically flying is the safest mode of journey
d2: This is in Indian Statistical Institute, Kolkata, India
d3: Diwali is a huge festival in India
d4: This is autumn
d5: Thank god it is a holiday
d6: There is no end of learning
d7: India’s population is huge

Documents contain terms, and documents are represented by the terms present in them. Match queries and documents by terms. For simplicity: ignore positions and treat each document as a “bag of words”. There may be many matching documents – we need to rank them. Query: india statistics
Vector space model
Each term represents a dimension. Documents are vectors in the term space. The term-document matrix is a very sparse matrix. The query is also a vector in the term space.

term          d1  d2  d3  d4  d5   q
diwali         1   0   0   0   0   0
india          1   0   0   1   1   1
flying         0   1   0   0   0   0
population     0   0   0   1   0   0
autumn         0   0   1   0   0   0
statistical    0   1   0   0   1   1

The similarity of each document d with the query q is measured by the cosine similarity: the dot product, normalized by the norms of the vectors.
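As a quick illustration (not from the slides), here is a minimal Python sketch of cosine ranking over the toy term-document matrix above; the vectors mirror the table, and all names are mine:

```python
import math

# Term order: diwali, india, flying, population, autumn, statistical
docs = {
    "d1": [1, 1, 0, 0, 0, 0],
    "d2": [0, 0, 1, 0, 0, 1],
    "d3": [0, 0, 0, 0, 1, 0],
    "d4": [0, 1, 0, 1, 0, 0],
    "d5": [0, 1, 0, 0, 0, 1],
}
q = [0, 1, 0, 0, 0, 1]   # query "india statistics", stemmed

def cosine(u, v):
    """Dot product normalized by the norms of the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Rank documents by similarity with the query: d5 comes out on top (1.0).
for name, vec in sorted(docs.items(), key=lambda kv: -cosine(kv[1], q)):
    print(name, round(cosine(vec, q), 3))
```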
Scoring function: TF.iDF
How important is a term t in a document d? Approach: take two factors into account:
– With what significance does t occur in d? [term frequency]
– Does t occur in many other documents as well? [document frequency]
– Called TF.iDF: TF × iDF; there are many variants for TF and iDF.

Variants for TF(t, d):
1. Number of times t occurs in d: freq(t, d)
2. Logarithmically scaled frequency: 1 + log(freq(t, d)) for all t in d; 0 otherwise
3. Augmented frequency: 0.5 + 0.5 × freq(t, d) / maxt′ freq(t′, d) – avoids bias towards longer documents: half the score for just being present, the rest a function of the frequency

Inverse document frequency of t: iDF(t) = log(N / DF(t)), where N = total number of documents and DF(t) = number of documents in which t occurs.
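A hedged Python sketch of the three TF variants and iDF above; the formulas are the standard ones assumed in the reconstruction, and all names are illustrative:

```python
import math

def tf_idf(term, doc, corpus, variant="log"):
    """TF.iDF of `term` in `doc` (a token list) over `corpus` (list of docs)."""
    freq = doc.count(term)
    if variant == "raw":                      # variant 1: raw frequency
        tf = float(freq)
    elif variant == "log":                    # variant 2: 1 + log(freq), 0 if absent
        tf = 1 + math.log(freq) if freq > 0 else 0.0
    else:                                     # variant 3: augmented frequency
        max_freq = max(doc.count(t) for t in set(doc))
        tf = 0.5 + 0.5 * freq / max_freq
    df = sum(1 for d in corpus if term in d)  # DF(t): documents containing t
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["statistic", "fly", "safe"], ["india", "statistic"], ["diwali", "india"]]
print(tf_idf("statistic", corpus[0], corpus))   # log-TF x iDF
```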
BM25
Okapi IR system – Okapi BM25. If the query is q = {q1, …, qn}, where the qi are the words in the query:

score(d, q) = Σi iDF(qi) × freq(qi, d) × (k1 + 1) / ( freq(qi, d) + k1 × (1 − b + b × |d| / avgdl) )

where
iDF(qi) = log( (N − DF(qi) + 0.5) / (DF(qi) + 0.5) )
N = total number of documents
|d| = length of document d
avgdl = average length of documents
k1 and b are optimized parameters, usually b = 0.75 and 1.2 ≤ k1 ≤ 2.0

BM25 consistently exhibited better performance than TF.iDF in TREC evaluations.
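A sketch of BM25 scoring under the same definitions; the +1 inside the log is a common smoothing that keeps iDF non-negative, and the names are mine:

```python
import math

def bm25(query, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one document (token list) for a bag-of-words query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N       # average document length
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)           # DF(term)
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))    # smoothed iDF
        f = doc.count(term)                                # freq(term, doc)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["statistic", "fly", "safe"], ["india", "statistic"], ["diwali", "india"]]
print(bm25(["india", "statistic"], corpus[1], corpus))
```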
Relevance
Simple IR model: query, documents, returned results. A relevant document is one that satisfies the information need expressed by the query.
– Merely matching the query terms does not make a document relevant.
– Relevance is a human perception, not a mathematical statement.
– A user may want some statistics on the population of India from the query “india statistics”; the document “Indian Statistical Institute” matches the query terms, but is not relevant.

To evaluate the effectiveness of a system, we need, for each query:
1. Given a result, an assessment of whether it is relevant
2. The set of all relevant results, assessed (pre-validated)
• If the second is available, it serves the purpose of the first as well.

Measures: precision, recall, F-measure (the harmonic mean of precision and recall); a small sketch follows.
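A minimal sketch of these set-based measures for a single query (the example sets are mine):

```python
def evaluate(retrieved, relevant):
    """Precision, recall, and F-measure for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

# e.g. the system returned d2, d5, d3 but only d4 and d5 are truly relevant
print(evaluate(["d2", "d5", "d3"], ["d4", "d5"]))   # (0.333..., 0.5, 0.4)
```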
Inverted index
Standard representation: document → terms. Inverted index: term → documents. For each term t, store the list of the documents in which t occurs.
(Example: the seven documents d1–d7 from before.)
diwali: d3
india: d2 d3 d7
flying: d1
population: d7
autumn: d4
statistical: d1 d2
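For illustration, such postings can be built with a dictionary; the token lists below are an abbreviated, pre-stemmed version of the example documents:

```python
from collections import defaultdict

# Abbreviated, pre-stemmed token lists for the example documents.
docs = {
    "d1": ["statistic", "fly", "safe", "journey"],
    "d2": ["india", "statistic", "institute", "kolkata"],
    "d3": ["diwali", "huge", "festival", "india"],
    "d4": ["autumn"],
    "d7": ["india", "population", "huge"],
}

index = defaultdict(list)            # term -> postings list of doc ids
for doc_id, terms in docs.items():   # dicts preserve insertion order
    for term in sorted(set(terms)):
        index[term].append(doc_id)

print(index["india"])                # ['d2', 'd3', 'd7']
```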
Scores?
Inverted index, with scores
For each term t, store the list of documents in which t occurs, along with a score for each document.
diwali: d3(0.5)
india: d2(0.7) d3(0.3) d7(0.4)
flying: d1(0.3)
population: d7(0.6)
autumn: d4(0.8)
statistical: d1(0.2) d2(0.5)
Note: these scores are dummy values, not computed by any formula.
Positional index
With just documents and scores, we follow the bag-of-words model and cannot perform proximity search or phrase-query search. A positional inverted index also stores the position of each occurrence of term t in each document d where t occurs (a sketch follows the postings below).
diwali: d3(0.5):<1>
india: d2(0.7):<4,8> d3(0.3):<7> d7(0.4):<1>
flying: d1(0.3):<2>
population: d7(0.6):<2>
autumn: d4(0.8):<3>
statistical: d1(0.2):<1> d2(0.5):<5>
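A sketch of a positional index with a naive phrase-query check, using 1-based positions as in the postings above (names are mine):

```python
from collections import defaultdict

# Positional index: term -> {doc id -> list of positions of the term}.
index = defaultdict(dict)

def add_doc(doc_id, tokens):
    for pos, term in enumerate(tokens, start=1):   # 1-based positions
        index[term].setdefault(doc_id, []).append(pos)

add_doc("d3", ["diwali", "is", "a", "huge", "festival", "in", "india"])

def phrase_match(doc_id, phrase):
    """True if the words of `phrase` occur at consecutive positions in doc_id."""
    first = index.get(phrase[0], {}).get(doc_id, [])
    return any(
        all(p + i in index.get(w, {}).get(doc_id, []) for i, w in enumerate(phrase))
        for p in first
    )

print(phrase_match("d3", ["huge", "festival"]))   # True
print(phrase_match("d3", ["festival", "huge"]))   # False
```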
Pre-processing
Removal of stopwords: of, the, and, …
– Modern search does not completely remove stopwords.
– Such words add meaning to sentences as well as to queries.

Stemming: reduce words to their stem (root).
– Statistics, statistically, statistical → statistic (same root).
– Slight loss of information (the form of the word also matters).
– But it unifies differently expressed queries on the same topic.
– Lemmatization: doing this properly, with morphological analysis of the words.

Normalization: unify equivalent words as much as possible.
– U.S.A, USA
– Windows, windows

Stemming, lemmatization, normalization, and synonym finding are all important subfields in their own right! A naive sketch follows.
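A deliberately naive pre-processing sketch; the suffix list is illustrative only, and a real system would use a proper stemmer (e.g. Porter's) or lemmatization:

```python
import re

STOPWORDS = {"of", "the", "and", "is", "in", "a", "it", "this"}
SUFFIXES = ["ally", "ing", "al", "s"]    # crude list, for illustration only

def normalize(token):
    """Lowercase and drop punctuation: 'U.S.A' and 'USA' both become 'usa'."""
    return re.sub(r"[^\w]", "", token.lower())

def stem(token):
    """A deliberately naive stemmer: strip the first matching suffix."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = (normalize(t) for t in text.split())
    return [stem(t) for t in tokens if t and t not in STOPWORDS]

print(preprocess("Statistics of the U.S.A"))   # ['statistic', 'usa']
```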
Creating an inverted index
For each document, write out pairs (term, docId); sort the pairs by term; group them, computing DF per term. (Example over the seven documents d1–d7; a code sketch follows the tables.)

Step 1 – (term, docId) pairs:
statistic 1
fly 1
safe 1
… …
india 2
statistic 2
india 3
… …
india 7

Step 2 – sorted by term:
india 2
india 3
india 7
… …
fly 1
safe 1
statistic 1
statistic 2
… …

Step 3 – grouped, with DF:
india (df=3): 2, 3, 7
fly (df=1): 1
statistic (df=2): 1, 2
… …
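The same three steps as a Python sketch over toy token lists (duplicate pairs are avoided by taking each document's term set):

```python
from itertools import groupby
from operator import itemgetter

docs = {1: ["statistic", "fly", "safe"], 2: ["india", "statistic"],
        3: ["india", "diwali"], 7: ["india", "population"]}

# Step 1: write out (term, docId) pairs, one per distinct term per document.
pairs = [(term, doc_id) for doc_id, terms in docs.items()
         for term in set(terms)]

# Step 2: sort by term (and by doc id within a term).
pairs.sort()

# Step 3: group by term and compute DF = length of the postings list.
index = {}
for term, group in groupby(pairs, key=itemgetter(0)):
    postings = [doc_id for _, doc_id in group]
    index[term] = (len(postings), postings)      # (df, postings)

print(index["india"])    # (3, [2, 3, 7])
```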
Traditional architecture
Indexing pipeline: different types of documents → basic format conversion, parsing → analysis (stemming, normalization, …) → indexing → index.
Query pipeline: user → query → query handler (query parsing) → core query processing (accessing the index, ranking) → results → results handler (displaying results) → user.
Query processing: merge
Postings lists are sorted by doc id, with one pointer into each list. Repeatedly pick the smallest doc id under the pointers, sum its scores across the lists, output it, and advance the corresponding pointers.

List 1: doc 17 (0.3), doc 21 (0.2), doc 25 (0.6), doc 44 (0.1), doc 78 (0.5), doc 83 (0.4), doc 91 (0.1)
List 2: doc 5 (0.6), doc 14 (0.6), doc 17 (0.6), doc 21 (0.3), doc 38 (0.6), doc 83 (0.5)
List 3: doc 10 (0.1), doc 17 (0.7), doc 61 (0.3), doc 65 (0.1), doc 81 (0.2), doc 83 (0.9)

Step by step, the merge emits doc 5 (0.6), then doc 10 (0.1), doc 14 (0.6), doc 17 (0.3 + 0.6 + 0.7 = 1.6), and so on. The merged list, still sorted by doc id:

doc 5 (0.6), doc 10 (0.1), doc 14 (0.6), doc 17 (1.6), doc 21 (0.5), doc 25 (0.6), doc 38 (0.6), doc 44 (0.1), doc 61 (0.3), doc 65 (0.1), doc 78 (0.5), doc 81 (0.2), doc 83 (1.8), doc 91 (0.1)

A (partial) sort then yields the top-2: doc 83 (1.8), doc 17 (1.6). Complexity? k log n. Merge is simple and efficient, with minimal overhead – but the lists have to be scanned fully!
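A sketch of this merge with Python's heapq over the example lists; heapq.merge keeps one pointer per list and always yields the pair with the smallest doc id next:

```python
import heapq
from itertools import groupby

# The example postings lists, sorted by doc id: (doc_id, score) pairs.
list1 = [(17, 0.3), (21, 0.2), (25, 0.6), (44, 0.1), (78, 0.5), (83, 0.4), (91, 0.1)]
list2 = [(5, 0.6), (14, 0.6), (17, 0.6), (21, 0.3), (38, 0.6), (83, 0.5)]
list3 = [(10, 0.1), (17, 0.7), (61, 0.3), (65, 0.1), (81, 0.2), (83, 0.9)]

# k-way merge, still sorted by doc id; groupby then sums scores per doc.
merged = heapq.merge(list1, list2, list3)
totals = [(doc, round(sum(s for _, s in grp), 1))
          for doc, grp in groupby(merged, key=lambda p: p[0])]

# Partial sort for the top-2 via heap-based selection.
print(heapq.nlargest(2, totals, key=lambda p: p[1]))   # [(83, 1.8), (17, 1.6)]
```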
Top-k algorithms
If there are millions of documents in the lists, can the ranking be done without accessing the lists fully?

Exact top-k algorithms (used more in databases):
– The family of threshold algorithms (Ronald Fagin et al.)
– Threshold Algorithm (TA)
– No Random Access algorithm (NRA) [we will discuss this one as an example]
– Combined Algorithm (CA)
– Other follow-up works

Inexact top-k algorithms:
– Exact top-k is not required; the scores are only a “crude” approximation of “relevance” (a human perception)
– Several heuristics
– Further reading: the IR book by Manning, Raghavan and Schütze, Ch. 7
NRA (No Random Access) Algorithm
Postings lists are sorted by score. Fagin's NRA algorithm reads one doc from every list per round (sorted access only) and maintains, for each candidate doc, its current score (worst case) and its best-score: the current score plus, for each list where the doc has not yet been seen, the last score read from that list.

List 1: doc 25 (0.6), doc 78 (0.5), doc 83 (0.4), doc 17 (0.3), doc 21 (0.2), doc 91 (0.1), doc 44 (0.1)
List 2: doc 17 (0.6), doc 38 (0.6), doc 14 (0.6), doc 5 (0.6), doc 83 (0.5), doc 21 (0.3)
List 3: doc 83 (0.9), doc 17 (0.7), doc 61 (0.3), doc 81 (0.2), doc 65 (0.1), doc 10 (0.1)

Round 1: candidates doc 83 [0.9, 2.1], doc 17 [0.6, 2.1], doc 25 [0.6, 2.1]. Min top-2 score: 0.6; maximum score for unseen docs: 0.6 + 0.6 + 0.9 = 2.1. min-top-2 < best-score of candidates: continue.

Round 2: candidates doc 17 [1.3, 1.8], doc 83 [0.9, 2.0], doc 25 [0.6, 1.9], doc 38 [0.6, 1.8], doc 78 [0.5, 1.8]. Min top-2 score: 0.9; maximum score for unseen docs: 0.5 + 0.6 + 0.7 = 1.8. min-top-2 < best-score of candidates: continue.

Round 3: candidates doc 83 [1.3, 1.9], doc 17 [1.3, 1.7], doc 25 [0.6, 1.5], doc 78 [0.5, 1.4]. Min top-2 score: 1.3; maximum score for unseen docs: 0.4 + 0.6 + 0.3 = 1.3. No more new docs can get into the top-2, but extra candidates are left in the queue: continue.

Round 4: doc 17 is now fully seen: 1.6. Candidates doc 83 [1.3, 1.9], doc 25 [0.6, 1.4]. Min top-2 score: 1.3; maximum score for unseen docs: 0.3 + 0.6 + 0.2 = 1.1. min-top-2 < best-score of candidates: continue.

Round 5: doc 83: 1.8, doc 17: 1.6. Min top-2 score: 1.6; maximum score for unseen docs: 0.2 + 0.5 + 0.1 = 0.8. No extra candidate in the queue. Done!
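A simplified, unoptimized NRA sketch over the same lists; it does the per-round bookkeeping shown above but omits pruning and an efficient candidate queue:

```python
def nra(lists, k=2):
    """Fagin's NRA, simplified: sorted access only, maintaining worst/best
    score bounds per seen document."""
    seen = {}      # doc -> {list index: score seen in that list}
    depth = 0      # number of completed rounds
    while True:
        # Read one (doc, score) entry from every list that is not exhausted.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
        depth += 1
        # The last score read in each list bounds all unseen entries below it.
        frontier = [lst[depth - 1][1] if depth <= len(lst) else 0.0
                    for lst in lists]
        bounds = {}
        for doc, parts in seen.items():
            worst = sum(parts.values())
            best = worst + sum(f for i, f in enumerate(frontier) if i not in parts)
            bounds[doc] = (worst, best)
        top = sorted(bounds, key=lambda d: bounds[d][0], reverse=True)[:k]
        min_top_worst = min(bounds[d][0] for d in top)
        others_best = [bounds[d][1] for d in bounds if d not in top]
        # Stop when neither an unseen doc nor a remaining candidate can
        # still exceed the worst score inside the current top-k.
        if sum(frontier) <= min_top_worst and \
                (not others_best or max(others_best) <= min_top_worst):
            return [(d, round(bounds[d][0], 1)) for d in top]

lists = [
    [(25, 0.6), (78, 0.5), (83, 0.4), (17, 0.3), (21, 0.2), (91, 0.1), (44, 0.1)],
    [(17, 0.6), (38, 0.6), (14, 0.6), (5, 0.6), (83, 0.5), (21, 0.3)],
    [(83, 0.9), (17, 0.7), (61, 0.3), (81, 0.2), (65, 0.1), (10, 0.1)],
]
print(nra(lists, k=2))   # [(83, 1.8), (17, 1.6)], stopping after round 5
```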
More approaches:
– Periodically also perform random accesses on documents to reduce the uncertainty (CA)
– Sophisticated scheduling of accesses over the lists
– Crude approximation: NRA may take a long time to stop; just stop after a while with an approximate top-k – who cares whether the results are perfect according to the scores?
References
Primarily the IR book by Manning, Raghavan and Schütze: http://nlp.stanford.edu/IR-book/