Transcript of CS 572: Information Retrieval
Lecture 9: Language Models for IR (cont’d)
Acknowledgments: Some slides in this lecture were adapted
from Chris Manning (Stanford) and Jin Kim (UMass’12)
2/10/2016 - CS 572: Information Retrieval, Spring 2016
New: IR based on Language Model (LM)
[Figure: an information need is expressed as a query; each document d1, d2, …, dn in the collection has its own language model Md1, Md2, …, Mdn, and retrieval is viewed as generation: rank documents by P(Q | Md).]
• A common search heuristic is to use words that you expect to find in matching documents as your query. (Why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!)
• The LM approach directly exploits that idea!
Probabilistic Language Modeling
• Goal: compute the probability of a document, a sentence, or sequence of words:
P(W) = P(w1,w2,w3,w4,w5…wn)
• Related task: probability of an upcoming word:
P(w5|w1,w2,w3,w4)
• A model that computes either of these:
P(W) or P(wn|w1,w2…wn-1) is called a language model.
• Better: the grammar! But “language model” or “LM” is the standard term.
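A minimal sketch (not from the lecture) of the chain-rule idea with a bigram approximation; the toy corpus, names, and resulting probability are made up for illustration:

    # Bigram MLE model: P(W) ~= prod_i P(w_i | w_{i-1}), estimated from counts.
    from collections import Counter

    corpus = [["<s>", "i", "like", "cheese", "pizza", "</s>"],
              ["<s>", "i", "like", "pepperoni", "pizza", "</s>"]]

    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))

    def sentence_prob(sentence):
        """P(W) under the bigram MLE model (0 if any bigram is unseen)."""
        p = 1.0
        for w1, w2 in zip(sentence, sentence[1:]):
            if bigrams[(w1, w2)] == 0:          # unseen bigram: zero without smoothing
                return 0.0
            p *= bigrams[(w1, w2)] / unigrams[w1]
        return p

    print(sentence_prob(["<s>", "i", "like", "cheese", "pizza", "</s>"]))  # 0.5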
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
  – Assign higher probability to “real” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” sentences
• We train parameters of our model on a training set.
• We test the model’s performance on data we haven’t seen.
  – A test set is an unseen dataset that is different from our training set, totally unused.
  – An evaluation metric tells us how well our model does on the test set.
Training on the test set
• We can’t allow test sentences into the training set
• Otherwise we would assign those sentences an artificially high probability when we see them in the test set
• “Training on the test set”
• Bad science!
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
– Put each model in a task
• spelling corrector, speech recognizer, IR system
– Run the task, get an accuracy for A and for B
• How many misspelled words corrected properly
• How many relevant/non-relevant docs retrieved
– Compare accuracy for A and B
• Problematic!
– Time consuming (re-index docs/re-run search/user study) – can take days or weeks
– Difficult to pinpoint problems in complex system/task
Intrinsic Evaluation: Perplexity
– Perplexity is a bad approximation of extrinsic performance
  • unless the test data looks just like the training data
  • so it is generally only useful in pilot experiments
– But it is helpful to think about.
• The Shannon Game:
– How well can we predict the next word?
– Unigrams are terrible at this game. (Why?)
• A better model of a text is one which assigns a higher probability to the word that actually occurs.
I always order pizza with cheese and ____
The 33rd President of the US was ____
I saw a ____

Candidate continuations for the first blank:
mushrooms 0.1
pepperoni 0.1
anchovies 0.01
….
fried rice 0.0001
….
and 1e-100
Intrinsic Evaluation: Perplexity
Perplexity (formal definition)
Perplexity is the inverse probability of the test set, normalized by the number of words:

  PP(W) = P(w1 w2 … wN)^(-1/N) = ( 1 / P(w1 w2 … wN) )^(1/N)

Chain rule:

  PP(W) = ( ∏_{i=1..N} 1 / P(wi | w1 … wi-1) )^(1/N)

For bigrams:

  PP(W) = ( ∏_{i=1..N} 1 / P(wi | wi-1) )^(1/N)

Minimizing perplexity is the same as maximizing probability.
The best language model is one that best predicts an unseen test set, i.e., the one that gives the highest P(sentence).
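A minimal sketch (not from the lecture) of computing perplexity in log space for an arbitrary conditional bigram model; the helper names are illustrative:

    import math

    def perplexity(test_words, bigram_prob):
        """PP(W) = P(w1..wN)^(-1/N), computed in log space for stability.
        bigram_prob(w, prev) is any conditional probability function;
        we condition on the first word, so n counts the predicted words."""
        log_prob = sum(math.log(bigram_prob(w, prev))
                       for prev, w in zip(test_words, test_words[1:]))
        n = len(test_words) - 1
        return math.exp(-log_prob / n)

    # A model that gives every word uniform probability 1/1000 has perplexity ~1000.
    print(perplexity(["a", "b", "c", "d"], lambda w, prev: 1 / 1000))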
Perplexity as branching factor
• Suppose a sentence consists of random digits
• What is the perplexity of this sentence according to a model that assigns P=1/10 to each digit?
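A quick worked answer (not on the slide): PP(W) = ((1/10)^N)^(-1/N) = 10, so the perplexity equals the branching factor of 10 equally likely digits.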
Lower perplexity = better model
• Training: 38 million words; test: 1.5 million words (WSJ)

  N-gram Order:  Unigram   Bigram   Trigram
  Perplexity:        962      170       109
The perils of overfitting
• N-grams only work well for word prediction if the test corpus looks like the training corpus
– In real life, it often doesn’t
– We need to train robust models that generalize!
– One kind of generalization: Zeros!
• Things that don’t ever occur in the training set
–But occur in the test set
Summary: Discounts for Smoothing
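For reference, the standard textbook forms these discounts take (a sketch; the lecture’s exact variants may differ):
• Add-1 (Laplace): P(wi | wi-1) = ( c(wi-1, wi) + 1 ) / ( c(wi-1) + V )
• Add-k: P(wi | wi-1) = ( c(wi-1, wi) + k ) / ( c(wi-1) + kV )
• Absolute discounting: subtract a fixed discount d from every nonzero count and redistribute the saved mass to a lower-order model (the basis of Kneser-Ney)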
Smoothing: Interpolation
Smoothing: Basic Interpolation Model
• General formulation of the LM for IR
– The user has a document in mind, and generates the query from this document.
– The equation represents the probability that the document that the user had in mind was in fact this one.
  p(Q, d) = p(d) · ∏_{t ∈ Q} ( (1 − λ) p(t) + λ p(t | Md) )

where p(t) is the general (collection) language model and p(t | Md) is the individual-document model.
Jelinek-Mercer Smoothing
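A minimal sketch of Jelinek-Mercer (fixed-coefficient interpolation), assuming plain term-count dictionaries for the document and the collection; the function and argument names are illustrative, and λ is placed on the document model to match the interpolation formula above:

    def p_jm(word, doc_counts, doc_len, coll_counts, coll_len, lam=0.5):
        """Jelinek-Mercer: p(w|d) = lam * p_ml(w|d) + (1 - lam) * p(w|C)."""
        p_ml_doc = doc_counts.get(word, 0) / doc_len
        p_coll = coll_counts.get(word, 0) / coll_len
        return lam * p_ml_doc + (1 - lam) * p_coll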
Dirichlet Smoothing
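A matching sketch of Dirichlet-prior smoothing (same assumed count dictionaries); μ = 2000 is a commonly cited default for IR, not a value from the slide:

    def p_dirichlet(word, doc_counts, doc_len, coll_counts, coll_len, mu=2000):
        """Dirichlet prior: p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
        p_coll = coll_counts.get(word, 0) / coll_len
        return (doc_counts.get(word, 0) + mu * p_coll) / (doc_len + mu)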
How to set the lambdas?
• Use a held-out corpus
• Choose λs to maximize the probability of held-out data:
  – Fix the N-gram probabilities (on the training data)
  – Then search for λs that give the largest probability to the held-out set (or the lowest perplexity); see the sketch below

[Data split: Training Data | Held-Out Data | Test Data]
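A minimal grid-search sketch for a single interpolation weight; the helper names (p_bigram, p_unigram) and the iterable of held-out bigrams are assumptions for illustration:

    import math

    def choose_lambda(held_out_bigrams, p_bigram, p_unigram, grid=None):
        """Pick the lambda that maximizes held-out log-likelihood.
        p_bigram(w, prev) and p_unigram(w) are fixed on the training data;
        assumes p_unigram(w) > 0 for every held-out word."""
        grid = grid or [i / 10 for i in range(1, 10)]
        def loglik(lam):
            return sum(math.log(lam * p_bigram(w, prev) + (1 - lam) * p_unigram(w))
                       for prev, w in held_out_bigrams)
        return max(grid, key=loglik)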
Huge web-scale n-grams
• How to deal with, e.g., Google N-gram corpus
• Pruning
  – Only store N-grams with count > threshold
    • Remove singletons of higher-order n-grams
  – Entropy-based pruning
• Efficiency
  – Efficient data structures like tries
  – Bloom filters: approximate language models
  – Store words as indexes, not strings
    • Use Huffman coding to fit large numbers of words into two bytes
  – Quantize probabilities (4-8 bits instead of 8-byte float)
Smoothing for Web-scale N-grams
• “Stupid backoff” (Brants et al. 2007)
• No discounting, just use relative frequencies
  S(wi | wi-k+1 … wi-1) =
    count(wi-k+1 … wi) / count(wi-k+1 … wi-1)    if count(wi-k+1 … wi) > 0
    0.4 · S(wi | wi-k+2 … wi-1)                  otherwise

  S(wi) = count(wi) / N
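A minimal recursive sketch of stupid backoff over raw n-gram counts; the count-dictionary layout is an assumption, and the result is a score, not a normalized probability:

    def stupid_backoff(words, i, k, counts, total_unigrams, alpha=0.4):
        """S(w_i | previous k words); `counts` maps n-gram tuples of any
        order to raw counts, and the caller must ensure i >= k."""
        if k == 0:                               # base case: unigram relative frequency
            return counts.get((words[i],), 0) / total_unigrams
        context = tuple(words[i - k:i])
        ngram = context + (words[i],)
        if counts.get(ngram, 0) > 0 and counts.get(context, 0) > 0:
            return counts[ngram] / counts[context]
        return alpha * stupid_backoff(words, i, k - 1, counts, total_unigrams, alpha)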
N-gram Smoothing Summary
• Add-1 smoothing:
– OK for text categorization, not for language modeling
• The most commonly used method in NLP:
– Extended Interpolated Kneser-Ney (see textbook)
• For very large N-grams like the Web:
– Stupid backoff
• For IR: variants of interpolation, discriminative models (choose Lambda to maximize retrieval metrics, not perplexity)
Language Modeling Toolkits
• SRILM
– http://www.speech.sri.com/projects/srilm/
• KenLM
– https://kheafield.com/code/kenlm/
Google N-Gram Release, August 2006
…
Google N-Gram Release
• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Higher Order LMs for IR
Models of Text Generation
Ranking with Language Models
Ranking with LMs: Main Components
• Query probability: what is the probability of generating the given query from a language model?
• Document probability: what is the probability of generating the given document from a language model?
• Model comparison: how “close” are two language models?
Ranking Using LMs: Multinomial
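For reference, a standard multinomial formulation (a sketch, not necessarily the slide’s exact notation): treat the query as a bag of words and score p(Q | Md) = ∏_{i=1..m} p(qi | Md); the multinomial coefficient can be dropped because it does not affect the ranking.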
Ranking with LMs: Multi-Bernoulli
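For reference, a standard multi-Bernoulli formulation (again a sketch): model each vocabulary word as present or absent in the query, p(Q | Md) = ∏_{w ∈ Q} p(w | Md) · ∏_{w ∉ Q} ( 1 − p(w | Md) ).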
Score: Query Likelihood
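A minimal scoring sketch: rank documents by the log query likelihood, plugging in any smoothed estimator such as the Dirichlet or Jelinek-Mercer sketches above (the argument layout is an assumption carried over from those sketches):

    import math

    def query_likelihood_score(query_terms, doc_counts, doc_len,
                               coll_counts, coll_len, p_smoothed):
        """score(Q, d) = log p(Q | Md) = sum over query terms of log p(t | Md)."""
        return sum(math.log(p_smoothed(t, doc_counts, doc_len, coll_counts, coll_len))
                   for t in query_terms)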
Score 2: Document Likelihood
Score: Likelihood ratio (odds)
Score: Model Comparison
Kullback-Leibler Divergence
• Relative entropy between the two distributions
• Cost in bits of coding using Q when true distribution is P
  DKL(P ‖ Q) = Σ_i P(i) ( log P(i) − log Q(i) )

  (compare with entropy: H(P(x)) = − Σ_i P(i) log P(i))
Kullback-Leibler Divergence
  DKL(P ‖ Q) = Σ_i P(i) log( P(i) / Q(i) )
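A minimal sketch of using KL divergence for ranking; the word-to-probability dictionary representation and the epsilon guard are assumptions:

    import math

    def kl_divergence(p, q, eps=1e-12):
        """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)); p and q map words to
        probabilities, with eps guarding against zeros in q."""
        return sum(pi * math.log(pi / max(q.get(w, 0.0), eps))
                   for w, pi in p.items() if pi > 0)

    # For retrieval, score each document by -kl_divergence(query_model, doc_model):
    # smaller divergence between the smoothed models means a better match.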
Two-stage Smoothing [Zhai & Lafferty 02]
  P(w|d) = (1 − λ) · ( c(w,d) + μ · p(w|C) ) / ( |d| + μ )  +  λ · p(w|U)

Stage 1 (Dirichlet prior, Bayesian): the μ term, built on the collection LM p(w|C), explains unseen words in the document.
Stage 2 (two-component mixture): the λ term, built on the user background model p(w|U), explains noise in the query; p(w|U) can be approximated by p(w|C).
Structured Document Retrieval [Ogilvie & Callan 03]
  Q = q1 q2 … qm

  p(Q | D, R=1) = ∏_{i=1..m} p(qi | D, R=1)
                = ∏_{i=1..m} Σ_{j=1..k} s(Dj | D, R=1) · p(qi | Dj, R=1)

[Figure: document D consists of parts D1, D2, D3, …, Dk, e.g., Title, Abstract, Body-Part1, Body-Part2, …]

– Want to combine different parts of a document with appropriate weights
– Anchor text can be treated as a “part” of a document
– Applicable to XML retrieval
– The “part selection” probability s(Dj | D, R=1) serves as the weight for Dj and can be trained using EM
– Generation process: select a part Dj and generate a query word using Dj
LMs for IR: Rules of Thumb
LMs vs. vector space model (1)
LMs have some things in common with vector space models.
Term frequency is directly in the model. But it is not scaled in LMs.
Probabilities are inherently “length-normalized”. Cosine normalization does something similar for vector space.
Mixing document and collection frequencies has an effect similar to idf.
Terms rare in the general collection, but common in some documents will have a greater influence on the ranking.
LMs vs. vector space model (2)
LMs vs. vector space model: commonalities
  – Term frequency is directly in the model.
  – Probabilities are inherently “length-normalized”.
  – Mixing document and collection frequencies has an effect similar to idf.
LMs vs. vector space model: differences
  – LMs: based on probability theory
  – Vector space: based on similarity, a geometric/linear algebra notion
  – Collection frequency vs. document frequency
  – Details of term frequency, length normalization, etc.
Vector space (tf-idf) vs. LM
The language modeling approach always does better in these experiments … but note that where the approach shows significant gains is at higher levels of recall.
LM vs. Prob. Model for IR
• The main difference is whether “Relevance” figures explicitly in the model or not
  – The LM approach attempts to do away with modeling relevance
• The LM approach assumes that documents and expressions of information problems are of the same type
• Computationally tractable, intuitively appealing
LM vs. Prob. Model for IR
• Problems of the basic LM approach
  – Assumption of equivalence between document and information problem representation is unrealistic
  – Very simple models of language
  – Relevance feedback is difficult to integrate, as are user preferences and other general issues of relevance
  – Can’t easily accommodate phrases, passages, Boolean operators
• Current extensions focus on putting relevance back into the model, etc.
Query Clarity
• Ambiguity makes queries difficult: American Airlines? or Alcoholics Anonymous?
• Clarity score ~ low ambiguity
• Cronen-Townsend et al., SIGIR 2002
• Compare a language model
  – over the relevant documents for a query
  – over all possible documents
• The more different these are, the clearer the query is
• “programming perl” vs. “the”
Clarity score
  Clarity(Q) = Σ_{w ∈ V} P(w | Q) · log2( P(w | Q) / Pcoll(w) )
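A minimal sketch of computing the clarity score from word-to-probability dictionaries (the dictionary representation and the zero guard are assumptions):

    import math

    def clarity(query_model, collection_model, eps=1e-12):
        """Clarity(Q) = sum_w p(w|Q) * log2( p(w|Q) / p_coll(w) ), i.e. the KL
        divergence (in bits) between the query model and the collection model."""
        return sum(p_wq * math.log2(p_wq / max(collection_model.get(w, 0.0), eps))
                   for w, p_wq in query_model.items() if p_wq > 0)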
Predicting Query Difficulty [Cronen-Townsend et al. 02]
• Observations:
  – Discriminative queries tend to be easier
  – Comparison of the query model and the collection model can indicate how discriminative a query is
• Method:
  – Define “query clarity” as the KL-divergence between an estimated query model (or relevance model) and the collection LM
  – An enriched query LM can be estimated by exploiting pseudo feedback (e.g., relevance model)
• Correlation between the clarity scores and retrieval performance
  clarity(Q) = Σ_w p(w | Q) · log( p(w | Q) / p(w | Collection) )
Clarity scores on TREC-7 collection
Can use many more features
• http://www.slideshare.net/DavidCarmel/sigir12-tutorial-query-perfromance-prediction-for-ir