Probabilistic Information Retrieval - Sumit Bhatia


Probabilistic Information Retrieval

Sumit Bhatia

July 16, 2009


Overview

1 Introduction: Information Retrieval, IR Models, Probability Basics

2 Probabilistic Ranking Principle: Document Ranking Problem, Probability Ranking Principle

3 The Binary Independence Model

4 OKAPI

5 Discussion


Information Retrieval (IR) Process

1 User has some information needs

2 Information Need → Query using Query Representation

3 Documents → Document Representation

4 IR system matches the two representations to determine the documents that satisfy the user’s information needs.


Boolean Retrieval Model

Query = Boolean expression of terms, e.g., Mitra AND Giles

Document = Term-document Matrix

$A_{ij} = 1$ iff the $i$-th term is present in the $j$-th document.

“Bag of words”

No Ranking

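As a concrete illustration (not part of the original slides), here is a minimal Python sketch of the Boolean model: a binary term-document incidence matrix over three invented documents, queried with Mitra AND Giles by intersecting the matching document sets.

```python
# Toy Boolean retrieval over a binary term-document incidence matrix (illustrative only).
docs = {
    "d1": "mitra and giles study information retrieval",
    "d2": "giles writes about digital libraries",
    "d3": "mitra works on query expansion",
}

vocab = sorted({t for text in docs.values() for t in text.split()})
# incidence[t][d] = 1 iff term t appears in document d ("bag of words", counts ignored).
incidence = {t: {d: int(t in text.split()) for d, text in docs.items()} for t in vocab}

def boolean_and(*terms):
    """Return the set of documents containing every query term (no ranking)."""
    hits = set(docs)
    for t in terms:
        hits &= {d for d, present in incidence.get(t, {}).items() if present}
    return hits

print(boolean_and("mitra", "giles"))  # -> {'d1'}
```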


Vector Space Model

Query = free-text query, e.g., Mitra Giles

Query and Document → vectors in “term space”

Cosine similarity between the query and document vectors indicates how similar they are

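A small sketch (again with invented documents) of vector-space ranking: the query and documents become term-frequency vectors, and documents are sorted by cosine similarity. A real system would typically use tf-idf weights; raw counts are used here to keep the example short.

```python
import math
from collections import Counter

docs = {
    "d1": "mitra and giles study information retrieval",
    "d2": "giles writes about digital libraries",
    "d3": "mitra works on query expansion",
}

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(c * v[t] for t, c in u.items() if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

query = Counter("mitra giles".split())
doc_vectors = {d: Counter(text.split()) for d, text in docs.items()}

# Rank documents by decreasing cosine similarity to the query.
for d in sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]), reverse=True):
    print(d, round(cosine(query, doc_vectors[d]), 3))
```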


Information Retrieval (IR) Process - Revisited

1 User has some information needs

2 Information Need → Query using Query Representation

3 Documents → Document Representation

4 IR system matches the two representations to determine the documents that satisfy the user’s information needs.

Problem!

Both Query and Document Representations are Uncertain


Probability Basics

Chain Rule:

$P(A, B) = P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A)$

Partition Rule:

$P(B) = P(A, B) + P(\bar{A}, B)$

Bayes Rule:

$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} = \left[\frac{P(B|A)}{\sum_{X \in \{A,\bar{A}\}} P(B|X)\,P(X)}\right] P(A)$$

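A quick numeric check of the rules above, with arbitrary illustrative probabilities; the denominator P(B) is obtained via the partition rule.

```python
# Arbitrary illustrative numbers: prior P(A) and likelihoods P(B|A), P(B|~A).
p_a = 0.3
p_b_given_a = 0.8
p_b_given_not_a = 0.2

# Partition rule: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' rule: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # -> 0.632
```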


Document Ranking Problem

Problem Statement

Given a set of documents $D = \{d_1, d_2, \ldots, d_n\}$ and a query $q$, in what order should the subset of relevant documents $D_r = \{d_{r_1}, d_{r_2}, \ldots, d_{r_m}\}$ be returned to the user?

Hint: We want the best document to be at rank 1, second best to be at rank 2, and so on.

Solution

Rank by the probability of relevance of the document w.r.t. the information need (query), i.e., rank by $P(R = 1|d, q)$.


Probability Ranking Principle

Probability Ranking Principle (Rijsbergen, 1979)

If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.

Observation 1: Ranking by the PRP maximizes the expected number of relevant documents in the top $k$ (expected precision at rank $k$).


Probability Ranking Principle

Case 1: 1/0 Loss =⇒ No selection/retrieval costs.

Bayes’ Optimal Decision Rule:

$d$ is relevant iff $P(R = 1|d, q) > P(R = 0|d, q)$

Theorem 1

PRP is optimal, in the sense that it minimizes the expected loss (Bayes risk) under 1/0 loss.

Case 2: PRP with differential retrieval costs

Rank $d$ ahead of $d'$ whenever

$$C_1 \cdot P(R = 1|d, q) + C_0 \cdot P(R = 0|d, q) \le C_1 \cdot P(R = 1|d', q) + C_0 \cdot P(R = 0|d', q)$$

where $C_1$ is the cost of retrieving a relevant document and $C_0$ the cost of retrieving a non-relevant one.

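A sketch of both cases with assumed (made-up) relevance probabilities: under 1/0 loss a document is retrieved iff P(R = 1|d, q) > P(R = 0|d, q), and with retrieval costs C1 (relevant) and C0 (non-relevant) documents are ranked by increasing expected cost.

```python
# Assumed relevance probabilities P(R=1|d,q) for three hypothetical documents.
p_rel = {"d1": 0.9, "d2": 0.4, "d3": 0.7}

# Case 1: 1/0 loss -- retrieve d iff P(R=1|d,q) > P(R=0|d,q), i.e. P(R=1|d,q) > 0.5.
retrieved = [d for d, p in p_rel.items() if p > 1 - p]

# Case 2: differential costs; C1 = cost of retrieving a relevant document,
# C0 = cost of retrieving a non-relevant one. Rank by increasing expected cost.
C1, C0 = 0.0, 1.0
ranking = sorted(p_rel, key=lambda d: C1 * p_rel[d] + C0 * (1 - p_rel[d]))

print(retrieved)  # -> ['d1', 'd3']
print(ranking)    # -> ['d1', 'd3', 'd2'] (same order as ranking by P(R=1|d,q) when C1 < C0)
```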


Binary Independence Model (BIM)

Assumptions:

1 Binary: documents are represented as binary incidence vectors of terms, $\vec{d} = (d_1, d_2, \ldots, d_n)$, where $d_i = 1$ iff term $i$ is present in $d$, else it is 0.

2 Independence: terms occur in documents independently of each other.

3 Relevance of a document is independent of the relevance of other documents. [1]

Implications:

1 Many documents have the same representation.

2 No association between terms is considered.

[1] This is the assumption for the PRP in general.


Binary Independence Model (BIM)

We wish to compute $P(R|d, q)$. We do it in terms of the term incidence vectors $\vec{d}$ and $\vec{q}$, i.e., we compute $P(R|\vec{d}, \vec{q})$. Using Bayes’ rule, we have:

$$P(R = 1|\vec{d}, \vec{q}) = \frac{P(\vec{d}|R = 1, \vec{q})\, P(R = 1|\vec{q})}{P(\vec{d}|\vec{q})} \qquad (1)$$

$$P(R = 0|\vec{d}, \vec{q}) = \frac{P(\vec{d}|R = 0, \vec{q})\, P(R = 0|\vec{q})}{P(\vec{d}|\vec{q})} \qquad (2)$$

Here $P(R = 1|\vec{q})$ and $P(R = 0|\vec{q})$ are the prior relevance probabilities.


Binary Independence Model

Computing the odds ratio, we get:

$$O(R|\vec{d}, \vec{q}) = \frac{P(R = 1|\vec{q})}{P(R = 0|\vec{q})} \times \frac{P(\vec{d}|R = 1, \vec{q})}{P(\vec{d}|R = 0, \vec{q})} \qquad (3)$$

The first term is document independent! What about the second term? Applying the Naive Bayes (term independence) assumption:

$$O(R|\vec{d}, \vec{q}) \propto \prod_{t=1}^{m} \frac{P(d_t|R = 1, \vec{q})}{P(d_t|R = 0, \vec{q})} \qquad (4)$$


Binary Independence Model

Observation 1: A term is either present in a document or not.

$$O(R|\vec{d}, \vec{q}) \propto \prod_{t: d_t = 1} \frac{P(d_t = 1|R = 1, \vec{q})}{P(d_t = 1|R = 0, \vec{q})} \cdot \prod_{t: d_t = 0} \frac{P(d_t = 0|R = 1, \vec{q})}{P(d_t = 0|R = 0, \vec{q})} \qquad (5)$$

                           R = 1      R = 0
Term present (d_t = 1)     p_t        u_t
Term absent  (d_t = 0)     1 - p_t    1 - u_t


Binary Independence Model

Assumption: A term not in the query is equally likely to occur in relevant and non-relevant documents.

$$O(R|\vec{d}, \vec{q}) \propto \prod_{t: d_t = q_t = 1} \frac{p_t}{u_t} \cdot \prod_{t: d_t = 0,\, q_t = 1} \frac{1 - p_t}{1 - u_t} \qquad (6)$$

Manipulating:

$$O(R|\vec{d}, \vec{q}) \propto \prod_{t: d_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \cdot \prod_{t: q_t = 1} \frac{1 - p_t}{1 - u_t} \qquad (7)$$

The second product is constant for a given query!


Binary Independence Model

$$RSV_d = \log \prod_{t: d_t = q_t = 1} \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \qquad (8)$$

$$RSV_d = \sum_{t: d_t = q_t = 1} \log \frac{p_t (1 - u_t)}{u_t (1 - p_t)} \qquad (9)$$

Docs        R = 1     R = 0                 Total
d_t = 1     s         n - s                 n
d_t = 0     S - s     (N - n) - (S - s)     N - n
Total       S         N - S                 N

Substituting (with 0.5 added to each count for smoothing), we get:

$$RSV_d = \sum_{t: d_t = q_t = 1} \log \frac{(s + \tfrac{1}{2}) / (S - s + \tfrac{1}{2})}{(n - s + \tfrac{1}{2}) / (N - n - S + s + \tfrac{1}{2})} \qquad (10)$$

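A sketch of Eq. (10): the smoothed per-term weight computed from the contingency counts s, S, n, N, summed over query terms that occur in the document. The counts below are invented for illustration; in practice s and S come from relevance judgments or must be approximated.

```python
import math

def bim_weight(s, S, n, N):
    """Smoothed log odds-ratio weight of Eq. (10) for one query term present in the
    document: s = relevant docs containing the term, S = relevant docs,
    n = docs containing the term, N = total docs (0.5 added to avoid zeros)."""
    rel_odds = (s + 0.5) / (S - s + 0.5)
    nonrel_odds = (n - s + 0.5) / (N - n - S + s + 0.5)
    return math.log(rel_odds / nonrel_odds)

def rsv(query_terms, doc_terms, stats):
    """RSV_d: sum the weights of terms present in both the query and the document."""
    return sum(bim_weight(*stats[t]) for t in query_terms if t in doc_terms and t in stats)

# Invented collection statistics: term -> (s, S, n, N).
stats = {"mitra": (8, 10, 50, 10000), "giles": (6, 10, 200, 10000)}
print(round(rsv({"mitra", "giles"}, {"mitra", "giles", "libraries"}, stats), 2))
```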


Observations

Probabilities for non-relevant documents can be approximated by collection statistics:

$$\log \frac{1 - u_t}{u_t} = \log \frac{N - n}{n} \approx \log \frac{N}{n} = \text{IDF}!$$

It is not so simple for relevant documents:
– Estimate from known relevant documents (not always available).
– Assume $p_t$ is constant, which is equivalent to IDF weighting only.

Difficulties in probability estimation and the drastic assumptions make it hard to achieve good retrieval performance.


OKAPI Weighting Scheme

BIM does not consider term frequencies and document length.

The BM25 weighting scheme (Okapi weighting) was developed to build a probabilistic model sensitive to these quantities.

BM25 today is widely used and has shown good performance in a number of practical systems.


OKAPI Weighting Scheme

$$RSV_d = \sum_{t \in q} \left[ \log \frac{N}{df_t} \times \frac{(k_1 + 1)\, tf_{td}}{k_1 \left( (1 - b) + b \cdot \frac{l_d}{l_{av}} \right) + tf_{td}} \times \frac{(k_3 + 1)\, tf_{tq}}{k_3 + tf_{tq}} \right]$$

where:
$N$ is the total number of documents,
$df_t$ is the document frequency of term $t$, i.e., the number of documents that contain $t$,
$tf_{td}$ is the frequency of term $t$ in document $d$,
$tf_{tq}$ is the frequency of term $t$ in query $q$,
$l_d$ is the length of document $d$,
$l_{av}$ is the average document length,
$k_1$, $k_3$ and $b$ are constants which are generally set to 2, 2 and 0.75 respectively.

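A compact sketch of the BM25 score above, using the same symbols; the query-term-frequency factor with k3 is omitted here, as is common for short queries where tf_tq = 1. The statistics are invented for illustration and would normally come from the index.

```python
import math

def bm25_weight(tf_td, df_t, N, l_d, l_av, k1=2.0, b=0.75):
    """BM25 contribution of one query term: idf times a saturating,
    length-normalized term-frequency factor (k3 query factor dropped)."""
    idf = math.log(N / df_t)
    tf_part = (k1 + 1) * tf_td / (k1 * ((1 - b) + b * l_d / l_av) + tf_td)
    return idf * tf_part

def bm25_score(query_terms, doc_tf, l_d, df, N, l_av):
    """Sum BM25 weights over query terms that occur in the document."""
    return sum(bm25_weight(doc_tf[t], df[t], N, l_d, l_av)
               for t in query_terms if t in doc_tf and t in df)

# Invented statistics: document frequencies, term frequencies, document lengths.
df = {"mitra": 50, "giles": 200}
doc_tf = {"mitra": 3, "giles": 1}
print(round(bm25_score({"mitra", "giles"}, doc_tf, l_d=120, df=df, N=10000, l_av=150), 2))
```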


What Next?

Similarity between terms and documents - is this sufficient?

JAVA: Coffee or Computer Language or Place?

Time and Location of user?

Different users might want different documents for the same query?


What Next?

Maximum Marginal Relevance [CG98] – Rank documents so as to minimize similarity between returned documents.

Result Diversification [Wan09]
– Rank documents so as to maximize mean relevance, given a variance level.
– The variance here determines the risk the user is willing to take.


References

[CG98] Jaime Carbonell and Jade Goldstein, The use of MMR, diversity-based reranking for reordering documents and producing summaries, SIGIR, 1998, pp. 335–336.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

[Wan09] Jun Wang, Mean-variance analysis: A new document ranking theory in information retrieval, Advances in Information Retrieval, 2009, pp. 4–16.


QUESTIONS???
