Classic IR Models: The Binary Independence Retrieval (BIR) Model
The Probabilistic Model
1
Basic Assumptions
• There is an ideal answer set (relevant documents) for a given
user query.
• We do not know the description of the ideal set (its
properties).
• We have index terms that have semantics that can
characterize the properties of the ideal answer set.
• We make an initial guess about the ideal set at query time.
• We give the answer to the user and “hopefully” we get some
feedback that would allow us to further refine the description
of the ideal set.
2
Basic Assumptions, cont’d
• Given a query q and a document d, the PM will try to
estimate the probability that the user will find d
relevant to q.
• There is a subset of documents in the collection that
are relevant to the query q. We call that subset rel.
• There is also a subset that contains the documents that are
non-relevant to q. We call that subset nonrel.
3
Basic Assumptions, cont’d
• Initially, documents and queries are
represented by binary weights of terms (1/0).
• The similarity of document d to the query q is
defined as the ratio:
• Sim(d, q) = P(rel | d) / P(nonrel | d)
4
Probabilistic Model
• Assume:
– P(rel | ret)
– P(nonrel | ret)
• Discrimination = P(rel | ret) / P(nonrel | ret)   (*1)
• If Discrimination > 1, then GOOD guess.
• A document is represented by terms.
• We relate probability to term occurrences in the document.
5
Probabilistic Model
• From Bayes' Theorem: P(A | B) = P(B | A) P(A) / P(B)
• It becomes: P(rel | ret) = P(ret | rel) P(rel) / P(ret)   (*2)
• And: P(nonrel | ret) = P(ret | nonrel) P(nonrel) / P(ret)   (*3)
6
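As a sanity check on (*2), the following sketch verifies on made-up toy counts (not from the slides) that Bayes' theorem recovers P(rel | ret) from P(ret | rel), P(rel), and P(ret):

```python
from fractions import Fraction

# Hypothetical toy counts (assumption for illustration): 10 documents,
# 4 relevant, 5 retrieved, 3 of the retrieved documents are relevant.
N, n_rel, n_ret, n_rel_and_ret = 10, 4, 5, 3

p_rel = Fraction(n_rel, N)                         # P(rel)
p_ret = Fraction(n_ret, N)                         # P(ret)
p_ret_given_rel = Fraction(n_rel_and_ret, n_rel)   # P(ret | rel)

# Bayes (*2): P(rel | ret) = P(ret | rel) * P(rel) / P(ret)
p_rel_given_ret = p_ret_given_rel * p_rel / p_ret

# Agrees with the direct count-based estimate:
assert p_rel_given_ret == Fraction(n_rel_and_ret, n_ret)
print(p_rel_given_ret)  # 3/5
```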
Cont’d
• From (*1), (*2), and (*3):
• Discrimination = [P(ret | rel) / P(ret | nonrel)] · [P(rel) / P(nonrel)]   (*4)
• P(rel) / P(nonrel) is a constant for all documents for a given query.
• Since we assume that terms in a document are statistically independent, we can represent a document as a product of term probabilities.
7
Cont’d
• From (*4):
• P(ret | rel) = P(t1 | rel) · P(t2 | rel) · P(t3 | rel) · … · P(tn | rel)
• P(ret | nonrel) = P(t1 | nonrel) · P(t2 | nonrel) · … · P(tn | nonrel)
• Discrimination = Π_i [P(ti | rel) / P(ti | nonrel)] · [P(rel) / P(nonrel)]
• Convert to logs:   (*5)
• Discrimination = Σ_i log [P(ti | rel) / P(ti | nonrel)] + log [P(rel) / P(nonrel)]
8
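Formulas (*4) and (*5) can be sketched as small scoring functions. The dictionaries `p_t_rel` and `p_t_nonrel` stand for per-term probability estimates and are placeholders of my own, not values from the slides:

```python
import math

def discrimination(doc_terms, p_t_rel, p_t_nonrel, p_rel, p_nonrel):
    """Product form (*4): prod_i P(ti|rel)/P(ti|nonrel) * P(rel)/P(nonrel)."""
    disc = p_rel / p_nonrel
    for t in doc_terms:
        disc *= p_t_rel[t] / p_t_nonrel[t]
    return disc

def log_discrimination(doc_terms, p_t_rel, p_t_nonrel, p_rel, p_nonrel):
    """Log form (*5): sum_i log[P(ti|rel)/P(ti|nonrel)] + log[P(rel)/P(nonrel)]."""
    return (sum(math.log(p_t_rel[t] / p_t_nonrel[t]) for t in doc_terms)
            + math.log(p_rel / p_nonrel))
```

Both forms produce the same ranking because log is monotonic; the log form avoids floating-point underflow when a document has many terms.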
Cont’d
• Example:
• Q(t1, t2); assume that d2 and d4 are relevant.
• Plug the information below into formula (*4)/(*5).

      t1  t2
  d1   1   0
  d2   0   1
  d3   1   0
  d4   1   1
  d5   0   1
9
Example, cont’d
• P(t1 | rel) = 1/2,  P(t2 | rel) = 1,  P(t1 | nonrel) = 2/3,  P(t2 | nonrel) = 1/3
• P(rel) = 2/5,  P(nonrel) = 3/5,  so P(rel) / P(nonrel) = 2/3
• Disc(d1) = [P(t1 | rel) / P(t1 | nonrel)] · [P(rel) / P(nonrel)] = (1/2)/(2/3) · (2/3) = 1/2
• Disc(d4) = [P(t1 | rel) · P(t2 | rel)] / [P(t1 | nonrel) · P(t2 | nonrel)] · [P(rel) / P(nonrel)] = (1/2 · 1)/(2/3 · 1/3) · (2/3) = 3/2
• Disc(d2) = 2,  Disc(d3) = 1/2,  Disc(d5) = 2
10
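The worked example above can be checked mechanically. This sketch rebuilds the incidence table and recomputes each Disc value with exact rational arithmetic (the helper names are my own):

```python
from fractions import Fraction

docs = {"d1": {"t1"}, "d2": {"t2"}, "d3": {"t1"}, "d4": {"t1", "t2"}, "d5": {"t2"}}
relevant = {"d2", "d4"}
query = ["t1", "t2"]
N = len(docs)
nonrel = set(docs) - relevant

def p_term(t, subset):
    # Fraction of documents in `subset` that contain term t.
    return Fraction(sum(t in docs[d] for d in subset), len(subset))

p_rel = Fraction(len(relevant), N)   # 2/5
p_non = Fraction(len(nonrel), N)     # 3/5

def disc(d):
    # Formula (*4), using only the query terms present in document d.
    r = p_rel / p_non
    for t in query:
        if t in docs[d]:
            r *= p_term(t, relevant) / p_term(t, nonrel)
    return r

for d in sorted(docs):
    print(d, disc(d))
# d1 1/2, d2 2, d3 1/2, d4 3/2, d5 2 — matching the slide.
```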
Example, cont’d
• Ranking:
• d2 (Disc = 2)
• d5 (Disc = 2)
• d4 (Disc = 3/2)    Relevant
---- Threshold > 1 ----
• d1 (Disc = 1/2)
• d3 (Disc = 1/2)    Non-Relevant
• d4 is ranked lower because it has a bad index term (t1).
• Why is t1 bad? Because it appears in non-relevant documents (d1, d3).
11
Probabilistic Model, Problems
• How to estimate the initial values?
• How to get feedback about relevancy?
12
Probabilistic Model, Problems
• How to estimate the initial values?
• How to get feedback about relevancy?
• In practice, start with:
– P(ti | rel) = 0.5
– P(ti | nonrel) = ni / N
• ni is the number of documents where ti occurs.
• N is the number of all documents in the collection.
• Now we rank documents that contain query terms.
13
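The initial estimates are simple enough to sketch directly; `docs` below reuses the toy collection style of the earlier example (my choice, not prescribed by the slides):

```python
from fractions import Fraction

def initial_estimates(term, docs):
    """Initial guesses before any relevance feedback:
    P(ti | rel) = 0.5,  P(ti | nonrel) = ni / N."""
    N = len(docs)                                  # collection size
    n_i = sum(term in d for d in docs.values())    # ni: docs containing ti
    return Fraction(1, 2), Fraction(n_i, N)

# On the 5-document table from the example (t1 occurs in d1, d3, d4):
docs = {"d1": {"t1"}, "d2": {"t2"}, "d3": {"t1"}, "d4": {"t1", "t2"}, "d5": {"t2"}}
print(initial_estimates("t1", docs))  # (Fraction(1, 2), Fraction(3, 5))
```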
Training to Improve Ranking
• Vi is the number of documents containing term ti among those initially retrieved by the PM (e.g. the top r docs).
• V is the number of all documents retrieved (top r).
• Re-estimate:
– P(ti | rel) = Vi / V
– P(ti | nonrel) = (ni − Vi) / (N − V)
• If Vi = 0 or V = N, then we have a problem (zero or undefined ratios), so add a 0.5 correction:
– P(ti | rel) = (Vi + 0.5) / (V + 1)
– P(ti | nonrel) = (ni − Vi + 0.5) / (N − V + 1)
Do we still need user feedback? 14
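The re-estimation step with the 0.5 correction can be sketched as follows, with Vi, V, ni, and N as defined above:

```python
def reestimate(V_i, V, n_i, N):
    """Smoothed re-estimates after an initial retrieval round:
    P(ti | rel)    ~ (Vi + 0.5) / (V + 1)
    P(ti | nonrel) ~ (ni - Vi + 0.5) / (N - V + 1)
    The 0.5 keeps both estimates strictly between 0 and 1,
    even in the degenerate cases Vi = 0 or Vi = V."""
    p_rel = (V_i + 0.5) / (V + 1)
    p_nonrel = (n_i - V_i + 0.5) / (N - V + 1)
    return p_rel, p_nonrel

# e.g. a term absent from all 3 retrieved docs (Vi = 0) still gets
# a nonzero P(ti | rel):
print(reestimate(0, 3, 2, 10))  # (0.125, 0.3125)
```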
Training to Improve Ranking, cont’d
• We can further use the fraction ni / N as an
adjustment factor.
• The above yields:
– P(ti | rel) = (Vi + ni/N) / (V + 1)
– P(ti | nonrel) = (ni − Vi + ni/N) / (N − V + 1)
15
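The ni/N-adjusted variant differs from the 0.5 correction only in the adjustment term; a minimal sketch:

```python
def reestimate_adjusted(V_i, V, n_i, N):
    """Re-estimates using ni/N as the adjustment factor instead of 0.5:
    P(ti | rel)    ~ (Vi + ni/N) / (V + 1)
    P(ti | nonrel) ~ (ni - Vi + ni/N) / (N - V + 1)
    Frequent terms get a proportionally larger correction."""
    adj = n_i / N
    return (V_i + adj) / (V + 1), (n_i - V_i + adj) / (N - V + 1)
```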
Probabilistic Model Conclusion
• The need to guess the initial values.
• The frequency of terms inside documents is not taken into account.
• Terms are assumed to be independent.
• Ultimately,
– the Vector Space model outperforms the Probabilistic Model.
16
Readings
• Chapter 4 (Optional)
• Chapter 6
– Sections: 6.2, 6.3, and 6.4.
• Chapter 7.
• Chapter 11
– Section 11.3 (The Binary Independence Model)
17