C.Watters csci6403

Classical IR Models

Goal

• Hit set of relevant documents

• Ranked set

• Best match

• Answer

Models

• Boolean (based on set theory)

– Fuzzy logic

– Extended Boolean

• Vector Space (based on algebra)

– Latent semantic networks

– Neural networks

• Probabilistic

– Inference networks

– Belief networks

• Hypertext

Retrieval

• Ad hoc

• Repeated

– Filter

– Selective dissemination of information (SDI)

– Profile

• Browsing

Index terms $K = \{k_1, \ldots, k_n\}$

Migration to Australia

This page introduces information about migrating to Australia (as a migrant or refugee), which means travelling to Australia with a visa that gives you the right to live permanently in Australia.

Please note: if you plan to visit Australia (that is, not stay permanently), and you want to work, please read the information about temporary entry.
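As a rough illustration of how a set of index terms K might be drawn from the sample page above, here is a minimal sketch. The stop-word list and the `index_terms` helper are assumptions made for this example, not part of the course material.

```python
import re

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"a", "about", "as", "in", "or", "that", "the", "this", "to", "with"}

def index_terms(text):
    """Lower-case the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}

sample = ("Migration to Australia. This page introduces information about "
          "migrating to Australia (as a migrant or refugee).")
K = index_terms(sample)
print(sorted(K))
```

Note that "migration", "migrating" and "migrant" stay separate terms here; a real system would usually apply stemming, which this sketch leaves out.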

Index Term Weights

• For each index term $k_i$ in document $d_j$ a weight $w_{i,j}$ is assigned, (0..1)

• Generally assumed to be independent

• What does this tell us?

– (0,1)

– (0..1)

Document as Set of Terms

• Document is represented by set of terms

• $d_j = \{w_{1,j}, w_{2,j}, w_{3,j}, \ldots, w_{n,j}\}$

• Where $w_{1,j}$ is the weight of term 1 in document j

• So what if:

– $w_{1,j} = 0$

– $w_{1,j} = 1$

– $w_{1,j} = .2$

Inverted File

• Term -> { occurrences}

• Organized for fast access by term

• Plus any extra information you need for your retrieval algorithm

• Size??
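A minimal sketch of an inverted file, assuming whitespace tokenization and integer document ids (both are choices made for this example). Each term maps to a sorted list of the documents it occurs in.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical toy collection.
docs = {1: "australia migrant", 4: "australia visa", 5: "migrant worker"}
inverted = build_inverted_file(docs)
print(inverted["australia"])
```

On the size question: this structure holds one entry per distinct term plus one posting per (term, document) pair, so it grows with the vocabulary and with the number of documents each term appears in. Any extra information (positions, weights) adds to each posting.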

Boolean

• Based on set theory using index terms

– Term weights: $w_{i,j} \in \{0,1\}$

– Document vector: $d_j = (0,1,0,\ldots)$

– Boolean query: AND, OR, NOT

– Q = t1 AND t2 OR t3

• Australia AND work AND papers

• Australia AND visa

• Australia OR visa

Boolean Representation

• $\mathrm{sim}(d_j, q) \in \{0, 1\}$

• Sample ($t_1$ = Australia, $t_2$ = visa, $t_3$ = outback)

– $d_1 = (0,1,0)$

– $d_2 = (1,1,0)$

– $d_3 = (0,1,1)$

• Australia and visa: $\mathrm{sim}(d_1, q) =$

• Australia or visa: $\mathrm{sim}(d_2, q) =$

• Australia not visa: $\mathrm{sim}(d_3, q) =$
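The three exercises above can be checked with a short sketch. It assumes the term order given on the slide (position 0 = Australia, 1 = visa, 2 = outback) and reads "Australia not visa" as "Australia AND NOT visa"; the function names are made up for this example.

```python
# Binary document vectors from the slide: (Australia, visa, outback).
d1 = (0, 1, 0)
d2 = (1, 1, 0)
d3 = (0, 1, 1)

def australia_and_visa(d):
    return int(bool(d[0]) and bool(d[1]))

def australia_or_visa(d):
    return int(bool(d[0]) or bool(d[1]))

def australia_not_visa(d):
    # Assumption: "not" means AND NOT.
    return int(bool(d[0]) and not bool(d[1]))

print(australia_and_visa(d1), australia_or_visa(d2), australia_not_visa(d3))
```

Under these assumptions d1 fails the AND query because it lacks Australia, d2 satisfies the OR query, and d3 fails the NOT query for the same reason as d1.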

Index Structure

• Australia:1,4,77

• Migrant: 1,5,87,97,123

• Visa:4,19, 55, 97

• Algorithm???
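One possible answer to the algorithm question, sketched under the assumption that each posting list is kept sorted by document id: an AND query can then be answered by a linear merge of the two lists.

```python
def intersect(p1, p2):
    """Merge two sorted posting lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# Posting lists as given on the slide.
index = {
    "australia": [1, 4, 77],
    "migrant": [1, 5, 87, 97, 123],
    "visa": [4, 19, 55, 97],
}
print(intersect(index["australia"], index["visa"]))
```

The merge touches each posting at most once, so its cost is proportional to the combined length of the two lists.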

Complex queries

• (red or blue) and (sedan or (suv and ford))

• Efficiency?
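A sketch of how the nested query might be evaluated with set operations. The posting sets below are invented for illustration; only the query shape comes from the slide.

```python
# Hypothetical postings: term -> set of document ids.
postings = {
    "red": {1, 2, 5},
    "blue": {3, 6},
    "sedan": {1, 3, 4},
    "suv": {2, 5, 6},
    "ford": {5, 6, 7},
}

# (red or blue) and (sedan or (suv and ford))
colour = postings["red"] | postings["blue"]
body = postings["sedan"] | (postings["suv"] & postings["ford"])
result = colour & body
print(sorted(result))
```

On efficiency, a common heuristic (not stated on the slide) is to evaluate AND sub-expressions over the smallest sets first, since an intersection can never be larger than its smallest operand, while OR sub-expressions can only grow.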

Problems

• Misinterpretation of query by users

• Mouse device

• Binary weights used for index terms

• Red BMW Convertible

• Elimination of partial results

• Binary results

– Document either fits or doesn’t

– Too few or too many results

Dominance of this model

• Simple to implement

• Simple to use

• Examples?

Vector Space Model

• Relax binary weight restriction

• Allow partial matches

• Provide ranking of results

• Goal: determine the degree of similarity between each document and the query

• Given n possible index terms

• For each document

• ith term in jth document

• Has term weight in jth doc $w_{i,j} \in [0, 1]$

• Giving $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$

• For each query term

• kth term has a query weight

• $w_{k,q} \in [0, 1]$

• Giving $q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$

• Calculate similarities

• Rank

• Use threshold

• Q=Heat (.8) Film(.3) H’wood(.5)

• Result / Order

• Boolean result?

Index Term Weights

• Given a set of documents

• Goals

– Find features that describe document X

– Find features that differentiate doc X from Y

• IR treats documents as clusters (bags) of terms

– Intra-cluster similarity

– Inter-cluster dissimilarity

Intra document term similarity

• Raw frequency of terms within the doc

• tf or term frequency factor

• Problems

– Common words

– Size of document

• Normalized tf: $f_{i,j} = \dfrac{\mathrm{freq}_{i,j}}{\max_k \mathrm{freq}_{k,j}}$
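A small sketch of the normalized term frequency, dividing each raw count by the count of the most frequent term in the same document. The token list is invented, and the function assumes a non-empty document.

```python
from collections import Counter

def normalized_tf(tokens):
    """Return freq(term) / max freq over all terms in the document."""
    freq = Counter(tokens)
    max_freq = max(freq.values())  # assumes at least one token
    return {term: count / max_freq for term, count in freq.items()}

# Hypothetical tokenized document.
tf = normalized_tf(["australia", "visa", "australia", "migrate", "australia", "visa"])
print(tf)
```

Dividing by the maximum keeps every value in (0, 1] regardless of document length, which addresses the document-size problem listed above. It does nothing about common words; that is the job of idf.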

Inter Document Dissimilarity

• Measure frequency of terms across doc set

• idf or inverse document frequency

• $\mathrm{idf}_i = \log \dfrac{N}{n_i}$

• N is number of documents

• $n_i$ is number of documents with term $k_i$

• Dampens the effect of increases in set size
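To see the dampening effect, here is a sketch that holds the document frequency fixed and grows the collection. The slide does not give a log base, so the natural log is an assumption, as are the collection sizes.

```python
import math

def idf(N, n_i):
    """Inverse document frequency; natural log assumed."""
    return math.log(N / n_i)

# A term found in 10 documents, in collections of increasing size.
for N in (500, 5_000, 50_000):
    print(N, round(idf(N, 10), 3))
```

Each tenfold increase in N adds the same constant (log 10) to the idf rather than multiplying it by ten.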

So

• Term frequency -> more is better

• Document frequency -> less is better

• Together accentuate difference

• Migrate 3 times (10 docs out of 500)

• Australia 5 times (400 docs out of 500)
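The two terms above can be compared with a short sketch. Several things are assumed here because the slide does not state them: the weight is the product tf × idf, the log is natural, and Australia (5 occurrences) is the most frequent term in the document.

```python
import math

N = 500        # documents in the collection (from the slide)
MAX_FREQ = 5   # assumption: the most frequent term in this document occurs 5 times

def tf_idf(freq, n_i):
    """Normalized tf times idf, natural log assumed."""
    return (freq / MAX_FREQ) * math.log(N / n_i)

w_migrate = tf_idf(3, 10)      # 3 occurrences, in 10 of 500 documents
w_australia = tf_idf(5, 400)   # 5 occurrences, in 400 of 500 documents
print(round(w_migrate, 3), round(w_australia, 3))
```

Although Australia occurs more often in the document, Migrate ends up with the much larger weight because it is rare across the collection.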

OK

• Use term weights to calculate

• Document to document similarity

• (more high weight terms in common)

• And

• Query to Document similarity

• (query terms are high weight terms in doc)

Document-Document Similarity

Example

• Document 1: Australia sample document

– Australia weight .05

– Migrate weight .56

• Document 2: Geese Migration

– Geese weight .45

– Migrate weight .55

Vector Structure

• Doc1: .1, 0,0,.4, 0, 0, 0,.8,.7, 0,.7,.7

• Doc2: .1,.1,0,.1, 0,.8,.7,.9,.7,.1,.2,.3

• Doc3: .4,.1,0, 0,.9,.5,.5, 0, 0, 0,.9,.7

• Algorithm???
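One possible algorithm, sketched with the three vectors above: score every pair of documents by the sum of products of their term weights. Using the inner product here is an assumption, chosen because it has the same form as the query-document sum used elsewhere in these slides; the document-document formula itself did not survive in the transcript.

```python
doc1 = [.1, 0, 0, .4, 0, 0, 0, .8, .7, 0, .7, .7]
doc2 = [.1, .1, 0, .1, 0, .8, .7, .9, .7, .1, .2, .3]
doc3 = [.4, .1, 0, 0, .9, .5, .5, 0, 0, 0, .9, .7]

def dot(u, v):
    """Sum of products of corresponding term weights."""
    return sum(a * b for a, b in zip(u, v))

pairs = {
    ("doc1", "doc2"): dot(doc1, doc2),
    ("doc1", "doc3"): dot(doc1, doc3),
    ("doc2", "doc3"): dot(doc2, doc3),
}
best = max(pairs, key=pairs.get)
print(best, round(pairs[best], 2))
```

Only positions where both documents have a non-zero weight contribute, so a sparse representation that skips zeros would give the same scores with less work.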

Query Document Similarity

• $\mathrm{Sim}(D, Q) = \sum_i w_{i,q} \cdot w_{i,d}$

• So query = Australia (.5) Geese (.8)

• Sim(doc1,Q)=

• Sim(doc2,Q)=
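The two blanks can be worked out with a short sketch, using the weights from the earlier Example slide (Document 1: Australia .05, Migrate .56; Document 2: Geese .45, Migrate .55). Storing each vector as a dictionary is a choice made for this example; terms missing from a document are treated as weight 0.

```python
doc1 = {"australia": 0.05, "migrate": 0.56}
doc2 = {"geese": 0.45, "migrate": 0.55}
query = {"australia": 0.5, "geese": 0.8}

def sim(doc, q):
    """Sum over query terms of query weight times document weight."""
    return sum(w_q * doc.get(term, 0.0) for term, w_q in q.items())

print(round(sim(doc1, query), 3), round(sim(doc2, query), 3))
```

Document 1 scores 0.5 × 0.05 = 0.025 and Document 2 scores 0.8 × 0.45 = 0.36, so the geese document ranks first even though both documents weight Migrate highly, because Migrate is not in the query.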

Doing Better!

• Augmented schemes

• Vector space similarity measures

Query Term Weights

• Natural language query

• I am doing a paper on shipping for my class at Dalhousie. Are there any reports from this university on deep sea shipping.

• Frequency

• Part of speech
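A rough sketch of the frequency option: weight each query term by how often it appears in the natural language query, after dropping stop words. The stop-word list is invented for this example, and part-of-speech weighting is not attempted.

```python
import re
from collections import Counter

# Hypothetical stop-word list for this one query.
STOP = {"i", "am", "doing", "a", "on", "for", "my", "at",
        "are", "there", "any", "from", "this"}

query = ("I am doing a paper on shipping for my class at Dalhousie. "
         "Are there any reports from this university on deep sea shipping.")

tokens = [t for t in re.findall(r"[a-z]+", query.lower()) if t not in STOP]
counts = Counter(tokens)
max_count = max(counts.values())
weights = {term: count / max_count for term, count in counts.items()}
print(weights)
```

Here "shipping" occurs twice and gets the top weight, while "dalhousie" and "university" each occur once, although a reader would recognize that they refer to the same institution.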

Using Similarity: Partial Matches

• Given weights $w_{i,j}$ and $w_{i,q}$, $\mathrm{sim}(q, d_j) \in [0, 1]$

• Every document has a similarity value to every query

• E.g., Dalhousie shipping

• What does OR mean

• What does AND mean

• How to manage this

Using Similarity: Ranking

• Order results by similarity value

• Dalhousie Shipping ??

• Query and document term weights

Similarity of Q to Docs (Normalize)

[Figure: document vector $d_j$ and query vector $q$; $\mathrm{sim}(d_j, q)$ = cosine of the angle between them]
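The formula for this slide did not survive in the transcript. A hedged reconstruction, assuming the standard cosine measure over the document and query weight vectors defined earlier (the angle symbol $\theta$ is introduced here):

```latex
\mathrm{sim}(d_j, q) = \cos\theta
  = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert}
  = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}
         {\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}} \; \sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}
```

Dividing by the two vector lengths is the normalization step: a long document with many high weights no longer outranks a short one simply because of its size.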

So why do we need a vector???

Other similarity measurements

Cosine Similarity

C = number of terms in common, A = number of terms in i, B = number of terms in j

Dice similarity Measure

C = number of terms in common, A = number of terms in i, B = number of terms in j

Jaccard Similarity Measure

C = number of terms in common, A = number of terms in i, B = number of terms in j
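The formulas for the cosine, Dice and Jaccard slides are missing from the transcript. The sketch below assumes their usual set-based forms in terms of A, B and C: cosine = C / sqrt(A·B), Dice = 2C / (A + B), Jaccard = C / (A + B − C). The two term sets are invented, and both are assumed to be non-empty.

```python
import math

def set_measures(terms_i, terms_j):
    """Set-based similarity measures between two non-empty term sets."""
    A, B = len(terms_i), len(terms_j)
    C = len(terms_i & terms_j)  # terms in common
    return {
        "cosine": C / math.sqrt(A * B),
        "dice": 2 * C / (A + B),
        "jaccard": C / (A + B - C),
    }

# Hypothetical term sets for documents i and j.
m = set_measures({"australia", "migrate", "visa"}, {"geese", "migrate"})
print(m)
```

All three range from 0 (no shared terms) to 1 (identical sets) and differ only in how they penalize the terms that are not shared.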

Vector Space Model

• Advantages

– Allows partial matches

– Allows ranking

• Disadvantages

– Need whole doc set to determine weights

– Extra computation

– Terms are assumed to be independent

NeoClassical Models

• Probabilistic model

• Boolean variations

– Fuzzy set model

– Extended Boolean

• Vector space variations

– Generalized vector space (term dependency)

– Latent Semantic indexing

– Neural net models