Search: A Basic Overview
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
October 20, 2014
Back in those days
Once upon a time, there were days without search engines. We had access to a much smaller amount of information, and had to find it manually.
Search engine
A user needs some information. Assumption: the required information is present somewhere. A search engine tries to bridge this gap.
How: the user “expresses” the information need as a query; the engine returns a list of documents, or answers by some better means.
Simplest model: the user submits a query – a set of words (terms) – and the search engine returns documents “matching” the query. Assumption: matching the query satisfies the information need. Modern search has come a long way from this simple model, but the fundamentals are still required.
Basic approach
Example documents:
d1: Statistically flying is the safest mode of journey
d2: This is in Indian Statistical Institute, Kolkata, India
d3: Diwali is a huge festival in India
d4: This is autumn
d5: Thank god it is a holiday
d6: There is no end of learning
d7: India’s population is huge

Documents contain terms, and documents are represented by the terms present in them. Match queries and documents by terms. For simplicity: ignore positions and treat each document as a “bag of words”. There may be many matching documents – we need to rank them. Query: india statistics
Vector space model
Each term represents a dimension. Documents are vectors in the term space. The term-document matrix is a very sparse matrix. The query is also a vector in the term space.

term          d1  d2  d3  d4  d5   q
diwali         1   0   0   0   0   0
india          1   0   0   1   1   1
flying         0   1   0   0   0   0
population     0   0   0   1   0   0
autumn         0   0   1   0   0   0
statistical    0   1   0   0   1   1

The similarity of each document d with the query q is measured by the cosine similarity: the dot product, normalized by the norms of the vectors.
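As a quick illustration (not from the slides), here is a minimal Python sketch of cosine ranking over the toy term-document matrix above; the vectors mirror the table, and all names are mine:

```python
import math

# Term order: diwali, india, flying, population, autumn, statistical
docs = {
    "d1": [1, 1, 0, 0, 0, 0],
    "d2": [0, 0, 1, 0, 0, 1],
    "d3": [0, 0, 0, 0, 1, 0],
    "d4": [0, 1, 0, 1, 0, 0],
    "d5": [0, 1, 0, 0, 0, 1],
}
q = [0, 1, 0, 0, 0, 1]   # query "india statistics", stemmed

def cosine(u, v):
    """Dot product normalized by the norms of the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Rank documents by similarity with the query: d5 comes out on top (1.0).
for name, vec in sorted(docs.items(), key=lambda kv: -cosine(kv[1], q)):
    print(name, round(cosine(vec, q), 3))
```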
Scoring function: TF.iDF
How important is a term t in a document d? Approach: take two factors into account:
– With what significance does t occur in d? [term frequency]
– Does t occur in many other documents as well? [document frequency]
– Called TF.iDF: TF × iDF; there are many variants for TF and iDF.

Variants for TF(t, d):
1. Number of times t occurs in d: freq(t, d)
2. Logarithmically scaled frequency: 1 + log(freq(t, d)) for all t in d; 0 otherwise
3. Augmented frequency: 0.5 + 0.5 × freq(t, d) / maxt′ freq(t′, d) – avoids bias towards longer documents: half the score for just being present, the rest a function of the frequency

Inverse document frequency of t: iDF(t) = log(N / DF(t)), where N = total number of documents and DF(t) = number of documents in which t occurs.
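A hedged Python sketch of the three TF variants and iDF above; the formulas are the standard ones assumed in the reconstruction, and all names are illustrative:

```python
import math

def tf_idf(term, doc, corpus, variant="log"):
    """TF.iDF of `term` in `doc` (a token list) over `corpus` (list of docs)."""
    freq = doc.count(term)
    if variant == "raw":                      # variant 1: raw frequency
        tf = float(freq)
    elif variant == "log":                    # variant 2: 1 + log(freq), 0 if absent
        tf = 1 + math.log(freq) if freq > 0 else 0.0
    else:                                     # variant 3: augmented frequency
        max_freq = max(doc.count(t) for t in set(doc))
        tf = 0.5 + 0.5 * freq / max_freq
    df = sum(1 for d in corpus if term in d)  # DF(t): documents containing t
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [["statistic", "fly", "safe"], ["india", "statistic"], ["diwali", "india"]]
print(tf_idf("statistic", corpus[0], corpus))   # log-TF x iDF
```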
BM25
Okapi IR system – Okapi BM25. If the query is q = {q1, …, qn}, where the qi are the words in the query:

score(d, q) = Σi iDF(qi) × freq(qi, d) × (k1 + 1) / ( freq(qi, d) + k1 × (1 − b + b × |d| / avgdl) )

where
iDF(qi) = log( (N − DF(qi) + 0.5) / (DF(qi) + 0.5) )
N = total number of documents
|d| = length of document d
avgdl = average length of documents
k1 and b are optimized parameters, usually b = 0.75 and 1.2 ≤ k1 ≤ 2.0

BM25 consistently exhibited better performance than TF.iDF in TREC evaluations.
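A sketch of BM25 scoring under the same definitions; the +1 inside the log is a common smoothing that keeps iDF non-negative, and the names are mine:

```python
import math

def bm25(query, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one document (token list) for a bag-of-words query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N       # average document length
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)           # DF(term)
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))    # smoothed iDF
        f = doc.count(term)                                # freq(term, doc)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["statistic", "fly", "safe"], ["india", "statistic"], ["diwali", "india"]]
print(bm25(["india", "statistic"], corpus[1], corpus))
```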
Relevance
Simple IR model: query, documents, returned results. A relevant document is one that satisfies the information need expressed by the query.
– Merely matching the query terms does not make a document relevant.
– Relevance is a human perception, not a mathematical statement.
– A user may want some statistics on the population of India from the query “india statistics”; the document “Indian Statistical Institute” matches the query terms, but is not relevant.

To evaluate the effectiveness of a system, we need, for each query:
1. Given a result, an assessment of whether it is relevant
2. The set of all relevant results, assessed (pre-validated)
• If the second is available, it serves the purpose of the first as well.

Measures: precision, recall, F-measure (the harmonic mean of precision and recall); a small sketch follows.
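A minimal sketch of these set-based measures for a single query (the example sets are mine):

```python
def evaluate(retrieved, relevant):
    """Precision, recall, and F-measure for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f

# e.g. the system returned d2, d5, d3 but only d4 and d5 are truly relevant
print(evaluate(["d2", "d5", "d3"], ["d4", "d5"]))   # (0.333..., 0.5, 0.4)
```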
Inverted index
Standard representation: document → terms. Inverted index: term → documents. For each term t, store the list of the documents in which t occurs.
(Example: the seven documents d1–d7 from before.)
diwali: d3
india: d2 d3 d7
flying: d1
population: d7
autumn: d4
statistical: d1 d2
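For illustration, such postings can be built with a dictionary; the token lists below are an abbreviated, pre-stemmed version of the example documents:

```python
from collections import defaultdict

# Abbreviated, pre-stemmed token lists for the example documents.
docs = {
    "d1": ["statistic", "fly", "safe", "journey"],
    "d2": ["india", "statistic", "institute", "kolkata"],
    "d3": ["diwali", "huge", "festival", "india"],
    "d4": ["autumn"],
    "d7": ["india", "population", "huge"],
}

index = defaultdict(list)            # term -> postings list of doc ids
for doc_id, terms in docs.items():   # dicts preserve insertion order
    for term in sorted(set(terms)):
        index[term].append(doc_id)

print(index["india"])                # ['d2', 'd3', 'd7']
```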
Scores?
Inverted index, with scores
For each term t, store the list of documents in which t occurs, along with a score for each document.
diwali: d3(0.5)
india: d2(0.7) d3(0.3) d7(0.4)
flying: d1(0.3)
population: d7(0.6)
autumn: d4(0.8)
statistical: d1(0.2) d2(0.5)
Note: these scores are dummy values, not computed by any formula.
Positional index
With just documents and scores, we follow the bag-of-words model and cannot perform proximity search or phrase-query search. A positional inverted index also stores the position of each occurrence of term t in each document d where t occurs (a sketch follows the postings below).
diwali: d3(0.5):<1>
india: d2(0.7):<4,8> d3(0.3):<7> d7(0.4):<1>
flying: d1(0.3):<2>
population: d7(0.6):<2>
autumn: d4(0.8):<3>
statistical: d1(0.2):<1> d2(0.5):<5>
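A sketch of a positional index with a naive phrase-query check, using 1-based positions as in the postings above (names are mine):

```python
from collections import defaultdict

# Positional index: term -> {doc id -> list of positions of the term}.
index = defaultdict(dict)

def add_doc(doc_id, tokens):
    for pos, term in enumerate(tokens, start=1):   # 1-based positions
        index[term].setdefault(doc_id, []).append(pos)

add_doc("d3", ["diwali", "is", "a", "huge", "festival", "in", "india"])

def phrase_match(doc_id, phrase):
    """True if the words of `phrase` occur at consecutive positions in doc_id."""
    first = index.get(phrase[0], {}).get(doc_id, [])
    return any(
        all(p + i in index.get(w, {}).get(doc_id, []) for i, w in enumerate(phrase))
        for p in first
    )

print(phrase_match("d3", ["huge", "festival"]))   # True
print(phrase_match("d3", ["festival", "huge"]))   # False
```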
Pre-processing
Removal of stopwords: of, the, and, …
– Modern search does not completely remove stopwords.
– Such words add meaning to sentences as well as to queries.

Stemming: reduce words to their stem (root).
– Statistics, statistically, statistical → statistic (same root).
– Slight loss of information (the form of the word also matters).
– But it unifies differently expressed queries on the same topic.
– Lemmatization: doing this properly, with morphological analysis of the words.

Normalization: unify equivalent words as much as possible.
– U.S.A, USA
– Windows, windows

Stemming, lemmatization, normalization, and synonym finding are all important subfields in their own right! A naive sketch follows.
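A deliberately naive pre-processing sketch; the suffix list is illustrative only, and a real system would use a proper stemmer (e.g. Porter's) or lemmatization:

```python
import re

STOPWORDS = {"of", "the", "and", "is", "in", "a", "it", "this"}
SUFFIXES = ["ally", "ing", "al", "s"]    # crude list, for illustration only

def normalize(token):
    """Lowercase and drop punctuation: 'U.S.A' and 'USA' both become 'usa'."""
    return re.sub(r"[^\w]", "", token.lower())

def stem(token):
    """A deliberately naive stemmer: strip the first matching suffix."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = (normalize(t) for t in text.split())
    return [stem(t) for t in tokens if t and t not in STOPWORDS]

print(preprocess("Statistics of the U.S.A"))   # ['statistic', 'usa']
```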
Creating an inverted index
For each document, write out pairs (term, docId); sort the pairs by term; group them, computing DF per term. (Example over the seven documents d1–d7; a code sketch follows the tables.)

Step 1 – (term, docId) pairs:
statistic 1
fly 1
safe 1
… …
india 2
statistic 2
india 3
… …
india 7

Step 2 – sorted by term:
india 2
india 3
india 7
… …
fly 1
safe 1
statistic 1
statistic 2
… …

Step 3 – grouped, with DF:
india (df=3): 2, 3, 7
fly (df=1): 1
statistic (df=2): 1, 2
… …
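The same three steps as a Python sketch over toy token lists (duplicate pairs are avoided by taking each document's term set):

```python
from itertools import groupby
from operator import itemgetter

docs = {1: ["statistic", "fly", "safe"], 2: ["india", "statistic"],
        3: ["india", "diwali"], 7: ["india", "population"]}

# Step 1: write out (term, docId) pairs, one per distinct term per document.
pairs = [(term, doc_id) for doc_id, terms in docs.items()
         for term in set(terms)]

# Step 2: sort by term (and by doc id within a term).
pairs.sort()

# Step 3: group by term and compute DF = length of the postings list.
index = {}
for term, group in groupby(pairs, key=itemgetter(0)):
    postings = [doc_id for _, doc_id in group]
    index[term] = (len(postings), postings)      # (df, postings)

print(index["india"])    # (3, [2, 3, 7])
```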
Traditional architecture
Indexing pipeline: different types of documents → basic format conversion, parsing → analysis (stemming, normalization, …) → indexing → index.
Query pipeline: user → query → query handler (query parsing) → core query processing (accessing the index, ranking) → results → results handler (displaying results) → user.
Query processing: merge
Postings lists are sorted by doc id, with one pointer into each list. Repeatedly pick the smallest doc id under the pointers, sum its scores across the lists, output it, and advance the corresponding pointers.

List 1: doc 17 (0.3), doc 21 (0.2), doc 25 (0.6), doc 44 (0.1), doc 78 (0.5), doc 83 (0.4), doc 91 (0.1)
List 2: doc 5 (0.6), doc 14 (0.6), doc 17 (0.6), doc 21 (0.3), doc 38 (0.6), doc 83 (0.5)
List 3: doc 10 (0.1), doc 17 (0.7), doc 61 (0.3), doc 65 (0.1), doc 81 (0.2), doc 83 (0.9)

Step by step, the merge emits doc 5 (0.6), then doc 10 (0.1), doc 14 (0.6), doc 17 (0.3 + 0.6 + 0.7 = 1.6), and so on. The merged list, still sorted by doc id:

doc 5 (0.6), doc 10 (0.1), doc 14 (0.6), doc 17 (1.6), doc 21 (0.5), doc 25 (0.6), doc 38 (0.6), doc 44 (0.1), doc 61 (0.3), doc 65 (0.1), doc 78 (0.5), doc 81 (0.2), doc 83 (1.8), doc 91 (0.1)

A (partial) sort then yields the top-2: doc 83 (1.8), doc 17 (1.6). Complexity? k log n. Merge is simple and efficient, with minimal overhead – but the lists have to be scanned fully!
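A sketch of this merge with Python's heapq over the example lists; heapq.merge keeps one pointer per list and always yields the pair with the smallest doc id next:

```python
import heapq
from itertools import groupby

# The example postings lists, sorted by doc id: (doc_id, score) pairs.
list1 = [(17, 0.3), (21, 0.2), (25, 0.6), (44, 0.1), (78, 0.5), (83, 0.4), (91, 0.1)]
list2 = [(5, 0.6), (14, 0.6), (17, 0.6), (21, 0.3), (38, 0.6), (83, 0.5)]
list3 = [(10, 0.1), (17, 0.7), (61, 0.3), (65, 0.1), (81, 0.2), (83, 0.9)]

# k-way merge, still sorted by doc id; groupby then sums scores per doc.
merged = heapq.merge(list1, list2, list3)
totals = [(doc, round(sum(s for _, s in grp), 1))
          for doc, grp in groupby(merged, key=lambda p: p[0])]

# Partial sort for the top-2 via heap-based selection.
print(heapq.nlargest(2, totals, key=lambda p: p[1]))   # [(83, 1.8), (17, 1.6)]
```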
Top-k algorithms
If there are millions of documents in the lists, can the ranking be done without accessing the lists fully?

Exact top-k algorithms (used more in databases):
– The family of threshold algorithms (Ronald Fagin et al.)
– Threshold Algorithm (TA)
– No Random Access algorithm (NRA) [we will discuss this one as an example]
– Combined Algorithm (CA)
– Other follow-up works

Inexact top-k algorithms:
– Exact top-k is not required; the scores are only a “crude” approximation of “relevance” (a human perception)
– Several heuristics
– Further reading: the IR book by Manning, Raghavan and Schütze, Ch. 7
NRA (No Random Access) Algorithm
Postings lists are sorted by score. Fagin's NRA algorithm reads one doc from every list per round (sorted access only) and maintains, for each candidate doc, its current score (worst case) and its best-score: the current score plus, for each list where the doc has not yet been seen, the last score read from that list.

List 1: doc 25 (0.6), doc 78 (0.5), doc 83 (0.4), doc 17 (0.3), doc 21 (0.2), doc 91 (0.1), doc 44 (0.1)
List 2: doc 17 (0.6), doc 38 (0.6), doc 14 (0.6), doc 5 (0.6), doc 83 (0.5), doc 21 (0.3)
List 3: doc 83 (0.9), doc 17 (0.7), doc 61 (0.3), doc 81 (0.2), doc 65 (0.1), doc 10 (0.1)

Round 1: candidates doc 83 [0.9, 2.1], doc 17 [0.6, 2.1], doc 25 [0.6, 2.1]. Min top-2 score: 0.6; maximum score for unseen docs: 0.6 + 0.6 + 0.9 = 2.1. min-top-2 < best-score of candidates: continue.

Round 2: candidates doc 17 [1.3, 1.8], doc 83 [0.9, 2.0], doc 25 [0.6, 1.9], doc 38 [0.6, 1.8], doc 78 [0.5, 1.8]. Min top-2 score: 0.9; maximum score for unseen docs: 0.5 + 0.6 + 0.7 = 1.8. min-top-2 < best-score of candidates: continue.

Round 3: candidates doc 83 [1.3, 1.9], doc 17 [1.3, 1.7], doc 25 [0.6, 1.5], doc 78 [0.5, 1.4]. Min top-2 score: 1.3; maximum score for unseen docs: 0.4 + 0.6 + 0.3 = 1.3. No more new docs can get into the top-2, but extra candidates are left in the queue: continue.

Round 4: doc 17 is now fully seen: 1.6. Candidates doc 83 [1.3, 1.9], doc 25 [0.6, 1.4]. Min top-2 score: 1.3; maximum score for unseen docs: 0.3 + 0.6 + 0.2 = 1.1. min-top-2 < best-score of candidates: continue.

Round 5: doc 83: 1.8, doc 17: 1.6. Min top-2 score: 1.6; maximum score for unseen docs: 0.2 + 0.5 + 0.1 = 0.8. No extra candidate in the queue. Done!
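A simplified, unoptimized NRA sketch over the same lists; it does the per-round bookkeeping shown above but omits pruning and an efficient candidate queue:

```python
def nra(lists, k=2):
    """Fagin's NRA, simplified: sorted access only, maintaining worst/best
    score bounds per seen document."""
    seen = {}      # doc -> {list index: score seen in that list}
    depth = 0      # number of completed rounds
    while True:
        # Read one (doc, score) entry from every list that is not exhausted.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
        depth += 1
        # The last score read in each list bounds all unseen entries below it.
        frontier = [lst[depth - 1][1] if depth <= len(lst) else 0.0
                    for lst in lists]
        bounds = {}
        for doc, parts in seen.items():
            worst = sum(parts.values())
            best = worst + sum(f for i, f in enumerate(frontier) if i not in parts)
            bounds[doc] = (worst, best)
        top = sorted(bounds, key=lambda d: bounds[d][0], reverse=True)[:k]
        min_top_worst = min(bounds[d][0] for d in top)
        others_best = [bounds[d][1] for d in bounds if d not in top]
        # Stop when neither an unseen doc nor a remaining candidate can
        # still exceed the worst score inside the current top-k.
        if sum(frontier) <= min_top_worst and \
                (not others_best or max(others_best) <= min_top_worst):
            return [(d, round(bounds[d][0], 1)) for d in top]

lists = [
    [(25, 0.6), (78, 0.5), (83, 0.4), (17, 0.3), (21, 0.2), (91, 0.1), (44, 0.1)],
    [(17, 0.6), (38, 0.6), (14, 0.6), (5, 0.6), (83, 0.5), (21, 0.3)],
    [(83, 0.9), (17, 0.7), (61, 0.3), (81, 0.2), (65, 0.1), (10, 0.1)],
]
print(nra(lists, k=2))   # [(83, 1.8), (17, 1.6)], stopping after round 5
```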
More approaches:
– Periodically also perform random accesses on documents to reduce the uncertainty (CA)
– Sophisticated scheduling of accesses over the lists
– Crude approximation: NRA may take a long time to stop; just stop after a while with an approximate top-k – who cares whether the results are perfect according to the scores?
References
Primarily the IR book by Manning, Raghavan and Schütze: http://nlp.stanford.edu/IR-book/