Classic IR Models: The Binary Independence Retrieval (BIR) Model
The Probabilistic Model
1
Basic Assumptions
• There is an ideal answer set (relevant documents) for a given
user query.
• We do not know the description of the ideal set (its
properties).
• We have index terms that have semantics that can
characterize the properties of the ideal answer set.
• We make an initial guess about the ideal set at query time.
• We give the answer to the user and “hopefully” we get some
feedback that would allow us to further refine the description
of the ideal set.
2
Basic Assumptions, cont’d
• Given a query q and a document d, the PM will try to
estimate the probability that the user will find d
relevant to q.
• There is a subset of documents in the collection that
are relevant to the query q. We call that subset rel.
• There is also a subset that contains the documents that are
non-relevant to q. We call that subset nonrel.
3
Basic Assumptions, cont’d
• Initially, documents and queries are
represented by binary weights of terms (1/0).
• The similarity of document d to the query q is
defined as the ratio:
• Sim(d, q) = P(rel | d) / P(nonrel | d)
4
Probabilistic Model
• Assume:
– P(rel | ret)
– P(nonrel | ret)
• Discrimination = P(rel | ret) / P(nonrel | ret)   (*1)
• If Discrimination > 1, then GOOD guess.
• A document is represented by terms.
• We relate probability to term occurrences in the document.
5
Probabilistic Model
• From Bayes' Theorem: P(A | B) = P(B | A) P(A) / P(B)
• It becomes: P(rel | ret) = P(ret | rel) P(rel) / P(ret)   (*2)
• And: P(nonrel | ret) = P(ret | nonrel) P(nonrel) / P(ret)   (*3)
6
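As a sanity check on (*2), the following sketch verifies on made-up toy counts (not from the slides) that Bayes' theorem recovers P(rel | ret) from P(ret | rel), P(rel), and P(ret):

```python
from fractions import Fraction

# Hypothetical toy counts (assumption for illustration): 10 documents,
# 4 relevant, 5 retrieved, 3 of the retrieved documents are relevant.
N, n_rel, n_ret, n_rel_and_ret = 10, 4, 5, 3

p_rel = Fraction(n_rel, N)                         # P(rel)
p_ret = Fraction(n_ret, N)                         # P(ret)
p_ret_given_rel = Fraction(n_rel_and_ret, n_rel)   # P(ret | rel)

# Bayes (*2): P(rel | ret) = P(ret | rel) * P(rel) / P(ret)
p_rel_given_ret = p_ret_given_rel * p_rel / p_ret

# Agrees with the direct count-based estimate:
assert p_rel_given_ret == Fraction(n_rel_and_ret, n_ret)
print(p_rel_given_ret)  # 3/5
```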
Cont’d
• From (*1), (*2), and (*3):
• Discrimination = [P(ret | rel) / P(ret | nonrel)] · [P(rel) / P(nonrel)]   (*4)
• P(rel) / P(nonrel) is a constant for all documents for a given query.
• Since we assume that terms in a document are statistically independent, we can represent a document as a product of term probabilities.
7
Cont’d
• From (*4):
• P(ret | rel) = P(t1 | rel) · P(t2 | rel) · P(t3 | rel) · … · P(tn | rel)
• P(ret | nonrel) = P(t1 | nonrel) · P(t2 | nonrel) · … · P(tn | nonrel)
• Discrimination = Π_i [P(ti | rel) / P(ti | nonrel)] · [P(rel) / P(nonrel)]
• Convert to logs:   (*5)
• Discrimination = Σ_i log [P(ti | rel) / P(ti | nonrel)] + log [P(rel) / P(nonrel)]
8
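Formulas (*4) and (*5) can be sketched as small scoring functions. The dictionaries `p_t_rel` and `p_t_nonrel` stand for per-term probability estimates and are placeholders of my own, not values from the slides:

```python
import math

def discrimination(doc_terms, p_t_rel, p_t_nonrel, p_rel, p_nonrel):
    """Product form (*4): prod_i P(ti|rel)/P(ti|nonrel) * P(rel)/P(nonrel)."""
    disc = p_rel / p_nonrel
    for t in doc_terms:
        disc *= p_t_rel[t] / p_t_nonrel[t]
    return disc

def log_discrimination(doc_terms, p_t_rel, p_t_nonrel, p_rel, p_nonrel):
    """Log form (*5): sum_i log[P(ti|rel)/P(ti|nonrel)] + log[P(rel)/P(nonrel)]."""
    return (sum(math.log(p_t_rel[t] / p_t_nonrel[t]) for t in doc_terms)
            + math.log(p_rel / p_nonrel))
```

Both forms produce the same ranking because log is monotonic; the log form avoids floating-point underflow when a document has many terms.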
Cont’d
• Example:
• Q(t1, t2); assume that d2 and d4 are relevant.
• Plug the information below into formula (*4)/(*5).

      t1  t2
  d1   1   0
  d2   0   1
  d3   1   0
  d4   1   1
  d5   0   1
9
Example, cont’d
• P(t1 | rel) = 1/2,  P(t2 | rel) = 1,  P(t1 | nonrel) = 2/3,  P(t2 | nonrel) = 1/3
• P(rel) = 2/5,  P(nonrel) = 3/5,  so P(rel) / P(nonrel) = 2/3
• Disc(d1) = [P(t1 | rel) / P(t1 | nonrel)] · [P(rel) / P(nonrel)] = (1/2)/(2/3) · (2/3) = 1/2
• Disc(d4) = [P(t1 | rel) · P(t2 | rel)] / [P(t1 | nonrel) · P(t2 | nonrel)] · [P(rel) / P(nonrel)] = (1/2 · 1)/(2/3 · 1/3) · (2/3) = 3/2
• Disc(d2) = 2,  Disc(d3) = 1/2,  Disc(d5) = 2
10
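The worked example above can be checked mechanically. This sketch rebuilds the incidence table and recomputes each Disc value with exact rational arithmetic (the helper names are my own):

```python
from fractions import Fraction

docs = {"d1": {"t1"}, "d2": {"t2"}, "d3": {"t1"}, "d4": {"t1", "t2"}, "d5": {"t2"}}
relevant = {"d2", "d4"}
query = ["t1", "t2"]
N = len(docs)
nonrel = set(docs) - relevant

def p_term(t, subset):
    # Fraction of documents in `subset` that contain term t.
    return Fraction(sum(t in docs[d] for d in subset), len(subset))

p_rel = Fraction(len(relevant), N)   # 2/5
p_non = Fraction(len(nonrel), N)     # 3/5

def disc(d):
    # Formula (*4), using only the query terms present in document d.
    r = p_rel / p_non
    for t in query:
        if t in docs[d]:
            r *= p_term(t, relevant) / p_term(t, nonrel)
    return r

for d in sorted(docs):
    print(d, disc(d))
# d1 1/2, d2 2, d3 1/2, d4 3/2, d5 2 — matching the slide.
```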
Example, cont’d
• Ranking:
• d2 (Disc = 2)
• d5 (Disc = 2)
• d4 (Disc = 3/2)    Relevant
---- Threshold > 1 ----
• d1 (Disc = 1/2)
• d3 (Disc = 1/2)    Non-Relevant
• d4 is ranked lower because it has a bad index term (t1).
• Why is t1 bad? Because it appears in non-relevant documents (d1, d3).
11
Probabilistic Model, Problems
• How to estimate the initial values?
• How to get feedback about relevancy?
12
Probabilistic Model, Problems
• How to estimate the initial values?
• How to get feedback about relevancy?
• In practice, start with:
– P(ti | rel) = 0.5
– P(ti | nonrel) = ni / N
• ni is the number of documents where ti occurs.
• N is the number of all documents in the collection.
• Now we rank documents that contain query terms.
13
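The initial estimates are simple enough to sketch directly; `docs` below reuses the toy collection style of the earlier example (my choice, not prescribed by the slides):

```python
from fractions import Fraction

def initial_estimates(term, docs):
    """Initial guesses before any relevance feedback:
    P(ti | rel) = 0.5,  P(ti | nonrel) = ni / N."""
    N = len(docs)                                  # collection size
    n_i = sum(term in d for d in docs.values())    # ni: docs containing ti
    return Fraction(1, 2), Fraction(n_i, N)

# On the 5-document table from the example (t1 occurs in d1, d3, d4):
docs = {"d1": {"t1"}, "d2": {"t2"}, "d3": {"t1"}, "d4": {"t1", "t2"}, "d5": {"t2"}}
print(initial_estimates("t1", docs))  # (Fraction(1, 2), Fraction(3, 5))
```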
Training to Improve Ranking
• Vi is the number of documents containing term ti among those initially retrieved by the PM (e.g. the top r docs).
• V is the number of all documents retrieved (top r).
• Re-estimate:
– P(ti | rel) = Vi / V
– P(ti | nonrel) = (ni − Vi) / (N − V)
• If Vi = 0 or V = N, then we have a problem (zero or undefined ratios), so add a 0.5 correction:
– P(ti | rel) = (Vi + 0.5) / (V + 1)
– P(ti | nonrel) = (ni − Vi + 0.5) / (N − V + 1)
Do we still need user feedback? 14
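The re-estimation step with the 0.5 correction can be sketched as follows, with Vi, V, ni, and N as defined above:

```python
def reestimate(V_i, V, n_i, N):
    """Smoothed re-estimates after an initial retrieval round:
    P(ti | rel)    ~ (Vi + 0.5) / (V + 1)
    P(ti | nonrel) ~ (ni - Vi + 0.5) / (N - V + 1)
    The 0.5 keeps both estimates strictly between 0 and 1,
    even in the degenerate cases Vi = 0 or Vi = V."""
    p_rel = (V_i + 0.5) / (V + 1)
    p_nonrel = (n_i - V_i + 0.5) / (N - V + 1)
    return p_rel, p_nonrel

# e.g. a term absent from all 3 retrieved docs (Vi = 0) still gets
# a nonzero P(ti | rel):
print(reestimate(0, 3, 2, 10))  # (0.125, 0.3125)
```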
Training to Improve Ranking, cont’d
• We can further use the fraction ni / N as an
adjustment factor.
• The above yields:
– P(ti | rel) = (Vi + ni/N) / (V + 1)
– P(ti | nonrel) = (ni − Vi + ni/N) / (N − V + 1)
15
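The ni/N-adjusted variant differs from the 0.5 correction only in the adjustment term; a minimal sketch:

```python
def reestimate_adjusted(V_i, V, n_i, N):
    """Re-estimates using ni/N as the adjustment factor instead of 0.5:
    P(ti | rel)    ~ (Vi + ni/N) / (V + 1)
    P(ti | nonrel) ~ (ni - Vi + ni/N) / (N - V + 1)
    Frequent terms get a proportionally larger correction."""
    adj = n_i / N
    return (V_i + adj) / (V + 1), (n_i - V_i + adj) / (N - V + 1)
```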
Probabilistic Model Conclusion
• The need to guess the initial values.
• The frequency of terms inside documents is not taken into account.
• Terms are assumed to be independent.
• Ultimately,
– the Vector Space model outperforms the Probabilistic Model.
16
Readings
• Chapter 4 (Optional)
• Chapter 6
– Sections: 6.2, 6.3, and 6.4.
• Chapter 7.
• Chapter 11
– Section 11.3 (The Binary Independence Model)
17