Page 1

Language Models for IR

Debapriyo Majumdar

Information Retrieval

Indian Statistical Institute Kolkata

Spring 2015

Credit for several slides to Jimmy Lin (Maryland) and the course website for CS276A (Stanford)

Page 2

Standard Probabilistic IR

[Diagram: an information need is expressed as a query, which is matched against documents d1, d2, ..., dn from the document collection by estimating P(R | Q, d), the probability of relevance given the query and a document.]

Page 3

IR based on Language Model (LM)

[Diagram: an information need is expressed as a query; each document d1, d2, ..., dn in the collection is viewed as generating the query through its language model Md1, Md2, ..., Mdn, with probability P(Q | Md).]

A common search heuristic is to use words that you expect to find in matching documents as your query – why, I saw Sergey Brin advocating that strategy on late night TV one night in my hotel room, so it must be good!

The LM approach directly exploits that idea!

Page 4

Formal Language Model

Traditional generative model: generates strings
– Finite state machines or regular grammars, etc.

Generates:
I wish
I wish I wish
I wish I wish I wish
I wish I wish I wish I wish

Does not generate:
wish I wish

Page 5

Stochastic Language Models

An alphabet ∑. What is the probability of some string being generated?

Model M:
0.2   the
0.1   a
0.01  man
0.01  woman
0.03  said
0.02  likes

the man likes the woman
0.2   0.01  0.02  0.2   0.01

Probability? Multiply? Yes, if the words occur independently!

P(the man likes the woman | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008
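A minimal Python sketch (not part of the original slides) of this calculation: it scores a string under the unigram model M from the table above, assuming the words occur independently.

```python
# Unigram model M from the slide above.
model_m = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
           "said": 0.03, "likes": 0.02}

def string_probability(text, model):
    """P(text | M) as a product of per-word probabilities (independence assumption)."""
    prob = 1.0
    for word in text.split():
        prob *= model.get(word, 0.0)  # a word outside the model gets probability 0
    return prob

print(string_probability("the man likes the woman", model_m))  # ~8e-08
```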

Page 6

Stochastic Language Models

A statistical model for generating text: a probability distribution over strings in a given language.

Under model M, the probability of a string w1 w2 w3 w4 factors by the chain rule:

P(w1 w2 w3 w4 | M) = P(w1 | M) × P(w2 | M, w1) × P(w3 | M, w1 w2) × P(w4 | M, w1 w2 w3)

Page 7

Unigram and higher-order models

P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w1 w2) P(w4 | w1 w2 w3)

Unigram Language Models
P(w1) P(w2) P(w3) P(w4)      Easy. Effective!

Bigram (generally, n-gram) Language Models
P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)

Other Language Models
– Grammar-based models (PCFGs), etc.
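A minimal sketch (whitespace tokenization and MLE counts assumed, not from the slides) contrasting a unigram model with a bigram model estimated from the "I wish" sample of the previous slide:

```python
from collections import Counter

def train_unigram(tokens):
    # P(w) = count(w) / total number of tokens
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def train_bigram(tokens):
    # P(w2 | w1) = count(w1 w2) / count(w1 used as a history)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])
    return {(w1, w2): c / history_counts[w1] for (w1, w2), c in pair_counts.items()}

tokens = "I wish I wish I wish I wish".split()
print(train_unigram(tokens))   # {'I': 0.5, 'wish': 0.5}
print(train_bigram(tokens))    # {('I', 'wish'): 1.0, ('wish', 'I'): 1.0}
```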

Page 8

Using Language Models in IR

Treat each document as the basis for a model (e.g., unigram statistics)
Rank document d based on P(d | q)

P(d | q) = P(q | d) × P(d) / P(q)

– P(q) is the same for all documents, so ignore
– P(d) [the prior] is often treated as the same for all d
  • But we could use criteria like authority, length, genre
– P(q | d) is the probability of q given d's model

Very general formal approach

Page 9

The fundamental problem of LMs

Usually we don't know the model M
– But have a sample of text representative of that model

Estimate a language model M from the sample, then compute the observation probability P(observation | M(sample))

Page 10

Ranking with Language Models

Build a model MD for every document D
Rank document d based on P(MD | q)

Using Bayes' Theorem:

P(MD | q) = P(q | MD) P(MD) / P(q)

P(q) is the same for all documents; doesn't change ranks
P(MD) [the prior] is assumed to be the same for all d

Same as ranking by P(q | MD)

Page 11

What does it mean?

Ranking by P(MD | q), asking each of model1, model2, ..., modeln "Hey, what's the probability this query came from you?", is the same as ranking by P(q | MD), asking each model "Hey, what's the probability that you generated this query?"

Page 12

Ranking Models?

Ranking by P(q | MD), asking each of model1, model2, ..., modeln "Hey, what's the probability that you generated this query?", where modeli is a model of documenti, is the same as ranking the documents themselves.

Page 13

Building Document Models

How do we build a language model M for a document?

Physical metaphor: what's in the urn? What colored balls, and how many of each?

Page 14

Simple Approach

Simply count the frequencies in the document = maximum likelihood estimate

P(w | MS) = #(w, S) / |S|

#(w, S) = number of times w occurs in S
|S| = length of S

[Urn example: from an observed sequence S of colored balls, the estimated model M gives P(ball) = 1/2, 1/4, 1/4 for the three observed colors.]
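A minimal sketch of this maximum likelihood estimate, using the example sentence from the earlier slide as the observed sequence S:

```python
from collections import Counter

def mle_model(tokens):
    # P(w | MS) = #(w, S) / |S|
    counts = Counter(tokens)
    length = len(tokens)
    return {w: count / length for w, count in counts.items()}

sequence_s = "the man likes the woman".split()
print(mle_model(sequence_s))  # {'the': 0.4, 'man': 0.2, 'likes': 0.2, 'woman': 0.2}
```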

Page 15

Query generation probability (1)

Ranking formula:

p(Q, d) = p(d) p(Q | d) ≈ p(d) p(Q | Md)

The probability of producing the query given the language model of document d using MLE is:

p(Q | Md) = ∏ (t in Q) p_mle(t | Md) = ∏ (t in Q) tf(t, d) / dl_d

Unigram assumption: given a particular language model, the query terms occur independently

Md : language model of document d
tf(t, d) : raw tf of term t in document d
dl_d : total number of tokens in document d
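A minimal sketch of query-likelihood ranking with this MLE unigram model: score(Q, d) = product over t in Q of tf(t, d) / dl_d. The two toy documents are illustrative placeholders, not from the slides.

```python
from collections import Counter

def query_likelihood(query_tokens, doc_tokens):
    tf = Counter(doc_tokens)     # tf(t, d)
    dl = len(doc_tokens)         # dl_d
    score = 1.0
    for t in query_tokens:
        score *= tf[t] / dl      # zero if t never occurs in d (see the next slide)
    return score

docs = {
    "d1": "the man likes the woman".split(),
    "d2": "the woman said the man likes the man".split(),
}
query = "man likes".split()
ranking = sorted(docs, key=lambda d: query_likelihood(query, docs[d]), reverse=True)
print(ranking)  # documents ordered by P(q | Md)
```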

Page 16

Zero-Frequency Problem

Suppose some event is not in our observation S
– Model will assign zero probability to that event

[Urn example: the model M estimated from sequence S gives P(ball) = 1/2, 1/4, 1/4 for the three observed colors. A sequence containing a never-observed color then gets probability (1/2) × (1/4) × 0 × (1/4) = 0 !!]

Page 17

Why is this a bad idea?

Modeling a document
– Just because a word didn't appear doesn't mean it'll never appear…
– But safe to assume that unseen words are rare

Think of the document model as a topic
– There are many documents that can be written about a single topic
– We're trying to figure out what the model is based on just one document

Practical effect: assigning zero probability to unseen words forces exact match
– But partial matches are useful also!

Analogy: fishes in the sea

Page 18

Smoothing

Maximum Likelihood Estimate:

p_ML(w) = count of w / count of all words

The solution: "smooth" the word probabilities

[Plot: P(w) versus w, showing the maximum likelihood estimate alongside a smoothed probability distribution.]

Page 19

How do you smooth?

Assign some small probability to unseen events
– But remember to take away "probability mass" from other events

Simplest example: for words you didn't see, pretend you saw each of them once

Other more sophisticated methods:
– Absolute discounting
– Linear interpolation, Jelinek-Mercer
– Dirichlet, Witten-Bell
– Good-Turing
– …

Lots of performance to be gotten out of smoothing!
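A minimal sketch of the simplest smoothing mentioned above (add-one, i.e. Laplace smoothing over a fixed vocabulary); the vocabulary here is an illustrative assumption, not from the slides.

```python
from collections import Counter

def laplace_model(doc_tokens, vocabulary):
    counts = Counter(doc_tokens)
    denom = len(doc_tokens) + len(vocabulary)   # total mass after adding 1 per word
    return {w: (counts[w] + 1) / denom for w in vocabulary}

vocab = {"the", "a", "man", "woman", "said", "likes"}
model = laplace_model("the man likes the woman".split(), vocab)
print(model["said"])  # an unseen word now gets a small nonzero probability
```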

Page 20

Mixture model

P(w | d) = λ Pmle(w | Md) + (1 – λ) Pmle(w | Mc)

Mixes the probability from the document with the general collection frequency of the word.

Correctly setting λ is very important
A high value of λ makes the search "conjunctive-like" – suitable for short queries
A low value is more suitable for long queries
Can tune λ to optimize performance
– Perhaps make it dependent on document size (cf. Dirichlet prior or Witten-Bell smoothing)
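A minimal sketch of this mixture (Jelinek-Mercer smoothing) used to score a query: P(w | d) = λ Pmle(w | Md) + (1 – λ) Pmle(w | Mc), with Mc estimated from the whole collection. The toy document and collection are illustrative assumptions.

```python
from collections import Counter

def mixture_query_score(query_tokens, doc_tokens, collection_tokens, lam=0.5):
    doc_tf, doc_len = Counter(doc_tokens), len(doc_tokens)
    col_tf, col_len = Counter(collection_tokens), len(collection_tokens)
    score = 1.0
    for t in query_tokens:
        p_doc = doc_tf[t] / doc_len              # Pmle(t | Md)
        p_col = col_tf[t] / col_len              # Pmle(t | Mc)
        score *= lam * p_doc + (1 - lam) * p_col
    return score

doc = "the man likes the woman".split()
collection = doc + "the woman said a man said the woman likes a man".split()
print(mixture_query_score("man said".split(), doc, collection, lam=0.5))
```

Note that "said" never occurs in the document, yet the query still receives a nonzero score because the collection model contributes probability mass; this is exactly the partial-match behavior the zero-frequency slide asked for.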

Page 21

Basic mixture model summary

General formulation of the LM for IR:
– The user has a document in mind, and generates the query from this document.
– The equation represents the probability that the document that the user had in mind was in fact this one.

In the mixture, Pmle(w | Md) is the individual-document model and Pmle(w | Mc) is the general language model.

Page 22

LM vs. Prob. Model for IR

The main difference is whether "Relevance" figures explicitly in the model or not
– LM approach attempts to do away with modeling relevance

LM approach assumes that documents and expressions of information problems are of the same type

Computationally tractable, intuitively appealing

Page 23

LM vs. Prob. Model for IR

Problems of basic LM approach
– Assumption of equivalence between document and information problem representation is unrealistic
– Very simple models of language
– Relevance feedback is difficult to integrate, as are user preferences, and other general issues of relevance
– Can't easily accommodate phrases, passages, Boolean operators

Current extensions focus on putting relevance back into the model, etc.

Page 24

Language models: pro & con

Novel way of looking at the problem of text retrieval based on probabilistic language modeling
• Conceptually simple and explanatory
• Formal mathematical model
• Natural use of collection statistics, not heuristics (almost…)

– LMs provide effective retrieval and can be improved to the extent that the following conditions can be met
  • Our language models are accurate representations of the data.
  • Users have some sense of term distribution.*
– *Or we get more sophisticated with translation model

Page 25

Comparison With Vector Space

There's some relation to traditional tf.idf models:
– (unscaled) term frequency is directly in model
– the probabilities do length normalization of term frequencies
– the effect of doing a mixture with overall collection frequencies is a little like idf: terms rare in the general collection but common in some documents will have a greater influence on the ranking

Page 26

Comparison With Vector Space

Similar in some ways
– Term weights based on frequency
– Terms often used as if they were independent
– Inverse document/collection frequency used
– Some form of length normalization useful

Different in others
– Based on probability rather than similarity
  • Intuitions are probabilistic rather than geometric
– Details of use of document length and term, document, and collection frequency differ

Page 27

Resources

J.M. Ponte and W.B. Croft. 1998. A language modelling approach to information retrieval. In SIGIR 21.

D. Hiemstra. 1998. A linguistically motivated probabilistic model of information retrieval. ECDL 2, pp. 569–584.

A. Berger and J. Lafferty. 1999. Information retrieval as statistical translation. SIGIR 22, pp. 222–229.

D.R.H. Miller, T. Leek, and R.M. Schwartz. 1999. A hidden Markov model information retrieval system. SIGIR 22, pp. 214–221.

[Several relevant newer papers at SIGIR 23–25, 2000–2002.]

Workshop on Language Modeling and Information Retrieval, CMU 2001. http://la.lti.cs.cmu.edu/callan/Workshops/lmir01/ .

The Lemur Toolkit for Language Modeling and Information Retrieval. http://www-2.cs.cmu.edu/~lemur/ . CMU/Umass LM and IR system in C(++), currently actively developed.