Relevance Feedback


Transcript of Relevance Feedback

Page 1: Relevance Feedback

Relevance Feedback Main Idea:

Modify the existing query based on relevance judgments: extract terms from relevant documents and add them to the query, and/or re-weight the terms already in the query.

Two main approaches:
- Automatic (pseudo-relevance feedback)
- Users select relevant documents

Either way, users/system may select terms from an automatically generated list.

Page 2: Relevance Feedback

Relevance Feedback

Usually do both:
- expand the query with new terms
- re-weight the terms in the query

There are many variations:
- usually positive weights for terms from relevant docs
- sometimes negative weights for terms from non-relevant docs
- remove terms that appear ONLY in non-relevant documents

Page 3: Relevance Feedback

Relevance Feedback for Vector Model

In the "ideal" case where we know the relevant documents a priori, the optimal query is

$$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$$

where C_r is the set of documents that are truly relevant to q, and N is the total number of documents.

Page 4: Relevance Feedback

Rocchio Method

$$\vec{Q}_1 = \alpha \vec{Q}_0 + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j \;-\; \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$$

Q_0 is the initial query; Q_1 is the query after one iteration. D_r is the set of relevant docs and D_n the set of non-relevant docs. Typically α = 1, β = 0.75, γ = 0.25.

Other variations are possible, but performance is similar.
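As a concrete illustration, here is a minimal sketch of one Rocchio iteration in Python with NumPy, assuming queries and documents are dense term-weight vectors (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def rocchio(q0, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.25):
    """One Rocchio iteration: move the query toward the centroid of the
    relevant docs and away from the centroid of the non-relevant docs."""
    q1 = alpha * q0
    if len(rel_docs) > 0:
        q1 = q1 + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs) > 0:
        q1 = q1 - gamma * np.mean(nonrel_docs, axis=0)
    # A common variation clips negative term weights to zero.
    return np.maximum(q1, 0.0)
```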

Page 5: Relevance Feedback

Rocchio/Vector Illustration

[Figure: Q0, D1, D2, Q′, and Q″ plotted in the two-dimensional term space, axes "information" and "retrieval", both from 0 to 1.0.]

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)

Q′ = ½·Q0 + ½·D1 = (0.45, 0.55)
Q″ = ½·Q0 + ½·D2 = (0.80, 0.20)

Page 6: Relevance Feedback

Example Rocchio Calculation

Relevant docs:
R1 = (.030, .000, .000, .025, .025, .050, .000, .000, .120)
R2 = (.020, .009, .020, .002, .050, .025, .100, .100, .120)

Non-relevant doc:
S1 = (.030, .010, .020, .000, .005, .025, .000, .020, .000)

Original query:
Q = (.000, .000, .000, .000, .500, .000, .450, .000, .950)

Constants: α = 1, β = 0.75, γ = 0.25

Rocchio calculation:

$$\vec{Q}_{new} = \vec{Q} + \frac{0.75}{2}\left(\vec{R}_1 + \vec{R}_2\right) - 0.25\,\vec{S}_1$$

Resulting feedback query:
Q_new = (.011, .000875, .002, .01, .527, .022, .488, .033, 1.04)
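Plugging the slide's vectors into the formula confirms the result; a quick check with NumPy:

```python
import numpy as np

Q  = np.array([0, 0, 0, 0, .500, 0, .450, 0, .950])
R1 = np.array([.030, 0, 0, .025, .025, .050, 0, 0, .120])
R2 = np.array([.020, .009, .020, .002, .050, .025, .100, .100, .120])
S1 = np.array([.030, .010, .020, 0, .005, .025, 0, .020, 0])

Q_new = Q + (0.75 / 2) * (R1 + R2) - 0.25 * S1
print(Q_new)
# -> rounds to (.011, .000875, .002, .01, .527, .022, .488, .033, 1.04)
```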

Page 7: Relevance Feedback

Rocchio Method

Rocchio automatically re-weights terms and adds in new terms (from relevant docs); one has to be careful when using negative terms. Rocchio is not a machine learning algorithm.

Most methods perform similarly; results are heavily dependent on the test collection. Machine learning methods are proving to work better than standard IR approaches like Rocchio.

Page 8: Relevance Feedback

Relevance feedback in Probabilistic Model

$$sim(d_j, q) \;\sim\; \sum_i w_{i,q}\, w_{i,j} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$$

How do we get the probabilities P(k_i | R) and P(k_i | R̄)? Initial estimates based on assumptions:

P(k_i | R) = 0.5
P(k_i | R̄) = n_i / N, where n_i is the number of docs that contain k_i

Use this initial guess to retrieve an initial ranking, then improve upon this initial ranking.

Page 9: Relevance Feedback

Improving the Initial Ranking

Using the same sim(d_j, q) formula as before, let V be the set of docs initially retrieved (treated as relevant) and V_i the subset of retrieved docs that contain k_i. Re-evaluate the estimates:

P(k_i | R) = |V_i| / |V|
P(k_i | R̄) = (n_i - |V_i|) / (N - |V|)

Repeat recursively.

Page 10: Relevance Feedback

Improving the Initial Ranking

To avoid problems with |V| = 1 and |V_i| = 0, add a small adjustment factor:

P(k_i | R) = (|V_i| + 0.5) / (|V| + 1)
P(k_i | R̄) = (n_i - |V_i| + 0.5) / (N - |V| + 1)

Alternatively, use n_i/N as the adjustment factor:

P(k_i | R) = (|V_i| + n_i/N) / (|V| + 1)
P(k_i | R̄) = (n_i - |V_i| + n_i/N) / (N - |V| + 1)
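A small sketch of the re-estimation step using the smoothed formulas above (variable names n_i, N, V, V_i follow the slide; the functions themselves are illustrative):

```python
import math

def reestimate(n_i, N, V, V_i):
    """Smoothed estimates of P(ki|R) and P(ki|~R) from the initially
    retrieved set of V docs, V_i of which contain term ki."""
    p_r = (V_i + 0.5) / (V + 1)
    p_nr = (n_i - V_i + 0.5) / (N - V + 1)
    return p_r, p_nr

def term_log_odds(p_r, p_nr):
    """Term ki's log-odds contribution to sim(dj, q)."""
    return math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)
```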

Page 11: Relevance Feedback

Using Relevance Feedback

Relevance feedback is known to improve results in TREC-like conditions (no user involved). What about with a user in the loop? How might you measure this? Precision/recall figures need to be computed over the unseen documents.

Page 12: Relevance Feedback

Relevance Feedback Summary

Iterative query modification can improve precision and recall for a standing query. In at least one study, users were able to make good choices by seeing which terms were suggested for relevance feedback and selecting among them.

Page 13: Relevance Feedback

Query Expansion

Add terms that are closely related to the query terms, to improve precision and recall. Two variants:
- Local: only analyze the closeness among the set of documents that are returned
- Global: consider all the documents in the corpus a priori

How to decide which terms are closely related? THESAURI!!
- Hand-coded thesauri (Roget and his brothers)
- Automatically generated thesauri:
  - Correlation-based (association, nearness)
  - Similarity-based (terms as vectors in doc space)
  - Statistical (clustering techniques)

Page 14: Relevance Feedback

Correlation/Co-occurrence Analysis

Co-occurrence analysis: terms that are related to terms in the original query may be added to the query. Two terms are related if they have high co-occurrence in documents.

Let n be the number of documents; n1 and n2 the number of documents containing terms t1 and t2, respectively; and m the number of documents containing both t1 and t2.

If t1 and t2 are independent:

$$\frac{m}{n} = \frac{n_1}{n} \times \frac{n_2}{n}$$

If t1 and t2 are correlated:

$$\frac{m}{n} \gg \frac{n_1}{n} \times \frac{n_2}{n}$$

(measure the degree of correlation by how far the observed co-occurrence exceeds the expected one)
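This test can be phrased as a ratio of observed to expected co-occurrence; a small hypothetical helper (not from the slides, assuming n1 and n2 are both nonzero):

```python
def correlation_degree(n, n1, n2, m):
    """Observed/expected co-occurrence of t1 and t2: about 1 when the
    terms are independent, much larger when they are correlated."""
    expected = (n1 / n) * (n2 / n)
    return (m / n) / expected
```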

Page 15: Relevance Feedback

Association Clusters

Let M_ij be the term-document matrix: for the full corpus (global), or for the docs in the set of initial results (local). (Sometimes stems are used instead of terms.) The correlation matrix is C = M·Mᵀ (term-doc × doc-term = term-term).

Un-normalized association matrix:

$$c_{u,v} = \sum_{d_j} f_{t_u, d_j} \times f_{t_v, d_j}$$

Normalized association matrix:

$$s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$$

The n-th association cluster for a term t_u is the set of terms t_v such that the s_{u,v} are the n largest values among s_{u,1}, s_{u,2}, …, s_{u,k}.

Page 16: Relevance Feedback

Example

Term-document matrix:

       d1  d2  d3  d4  d5  d6  d7
K1      2   1   0   2   1   1   0
K2      0   0   1   0   2   2   5
K3      1   0   3   0   4   0   0

Correlation matrix:

11   4   6
 4  34  11
 6  11  26

Normalized correlation matrix:

1.0    0.097  0.193
0.097  1.0    0.224
0.193  0.224  1.0

For example, s_{1,2} = c_{1,2} / (c_{1,1} + c_{2,2} - c_{1,2}) = 4 / (11 + 34 - 4) ≈ 0.097.

The 1st association cluster for K2 is K3.
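The same numbers fall out of a few lines of NumPy (a sketch; the matrix M is transcribed from the slide, and the slide truncates 0.0976/0.1935 to 0.097/0.193):

```python
import numpy as np

M = np.array([[2, 1, 0, 2, 1, 1, 0],   # K1
              [0, 0, 1, 0, 2, 2, 5],   # K2
              [1, 0, 3, 0, 4, 0, 0]])  # K3

C = M @ M.T                              # correlation matrix
d = np.diag(C)
S = C / (d[:, None] + d[None, :] - C)    # normalized association matrix
print(C)           # [[11  4  6] [ 4 34 11] [ 6 11 26]]
print(S.round(3))  # off-diagonals 0.098, 0.194, 0.224
```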

Page 17: Relevance Feedback

Scalar Clusters

Consider the normalized association matrix S. The "association vector" A_u of term u is (s_{u,1}, s_{u,2}, …, s_{u,k}).

To measure neighborhood-induced correlation between terms, take the cosine between the association vectors of terms u and v:

$$s'_{u,v} = \frac{\vec{A}_u \cdot \vec{A}_v}{|\vec{A}_u|\,|\vec{A}_v|}$$

Even if terms u and v have low direct correlation, they may be transitively correlated (e.g., a term w has high correlation with both u and v).

The n-th scalar cluster for a term t_u is the set of terms t_v such that the s'_{u,v} are the n largest values among s'_{u,1}, s'_{u,2}, …, s'_{u,k}.

Page 18: Relevance Feedback

Example

Same term-document matrix and normalized correlation matrix as before:

       d1  d2  d3  d4  d5  d6  d7
K1      2   1   0   2   1   1   0
K2      0   0   1   0   2   2   5
K3      1   0   3   0   4   0   0

1.0    0.097  0.193
0.097  1.0    0.224
0.193  0.224  1.0

Taking pairwise cosines between the association vectors (the slide shows a Lisp trace of COSINE-METRIC calls, e.g. (COSINE-METRIC (1.0 0.0976 0.1935) (0.0976 1.0 0.2245)) returning 0.2265) gives the scalar (neighborhood) cluster matrix:

1.0    0.226  0.383
0.226  1.0    0.435
0.383  0.435  1.0

The 1st scalar cluster for K2 is still K3.
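Continuing the NumPy sketch from the association-cluster example, the scalar cluster matrix is just the row-wise cosines of S (the slide truncates 0.4357 to 0.435):

```python
norms = np.linalg.norm(S, axis=1)
S_scalar = (S @ S.T) / np.outer(norms, norms)
print(S_scalar.round(3))
# [[1.    0.226 0.383]
#  [0.226 1.    0.436]
#  [0.383 0.436 1.   ]]
```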

Page 19: Relevance Feedback

Metric Clusters

Let r(t_i, t_j) be the minimum distance (in terms of number of separating words) between t_i and t_j in any single document (infinity if they never occur together in a document). Define the cluster matrix by s_{u,v} = 1 / r(t_u, t_v).

The n-th metric cluster for a term t_u is the set of terms t_v such that the s_{u,v} are the n largest values among s_{u,1}, s_{u,2}, …, s_{u,k}.

r(t_i, t_j) is also useful for proximity queries and phrase queries. (An average distance can be used instead of the minimum.)
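A sketch of the metric-cluster similarity for two terms, given tokenized documents (helper names are hypothetical; here "distance" is the difference in token positions):

```python
def min_word_distance(tokens, t1, t2):
    """Minimum distance between t1 and t2 in one tokenized document;
    infinity if either term is absent."""
    p1 = [i for i, w in enumerate(tokens) if w == t1]
    p2 = [i for i, w in enumerate(tokens) if w == t2]
    if not p1 or not p2:
        return float("inf")
    return min(abs(i - j) for i in p1 for j in p2)

def metric_sim(docs, t1, t2):
    """s_uv = 1 / r(t1, t2), r being the minimum over all documents."""
    r = min(min_word_distance(d, t1, t2) for d in docs)
    return 0.0 if r == float("inf") else 1.0 / r
```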

Page 20: Relevance Feedback

Similarity Thesaurus

The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence. It is obtained by considering the terms as concepts in a concept space: each term is indexed by the documents in which it appears. Terms assume the original role of documents, while documents are interpreted as indexing elements.

Page 21: Relevance Feedback

Motivation

[Figure: query q and terms k_a, k_b, k_v, k_i, k_j as points in the concept space.]

Page 22: Relevance Feedback

Similarity Thesaurus

Terminology:
- t: number of terms in the collection
- N: number of documents in the collection
- f_{i,j}: frequency of occurrence of the term k_i in the document d_j
- t_j: vocabulary of document d_j (number of distinct terms in d_j)
- itf_j: inverse term frequency for document d_j

Inverse term frequency for document d_j:

$$itf_j = \log \frac{t}{t_j}$$

To k_i is associated a vector $\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})$, where

$$w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{f_{i,j}}{\max_u f_{u,j}}\right) itf_j}{\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5\,\frac{f_{i,l}}{\max_u f_{u,l}}\right)^2 itf_l^2}}$$

Idea: it is no surprise if the Oxford dictionary mentions the word!
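A sketch of the concept-space term vectors under these definitions, assuming a dense term-document frequency matrix F with no empty documents and no all-zero term rows (the function name is illustrative):

```python
import numpy as np

def term_vectors(F):
    """Rows of the returned matrix are the vectors k_i = (w_i1, ..., w_iN).
    F is a (t terms x N docs) raw frequency matrix."""
    t, N = F.shape
    t_j = (F > 0).sum(axis=0)                    # vocabulary size per doc
    itf = np.log(t / t_j)                        # inverse term frequency
    tf = np.where(F > 0, 0.5 + 0.5 * F / F.max(axis=0), 0.0)
    W = tf * itf                                 # numerator of w_ij
    return W / np.linalg.norm(W, axis=1, keepdims=True)
```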

Page 23: Relevance Feedback

Similarity Thesaurus

The relationship between two terms k_u and k_v is computed as a correlation factor c_{u,v} given by

$$c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}$$

The global similarity thesaurus is built through the computation of the correlation factor c_{u,v} for each pair of indexing terms [k_u, k_v] in the collection. This is expensive, but it is possible to do incremental updates.

(Similar to the scalar-clusters idea, but with the tf/itf weighting defining the term vectors.)
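Continuing the sketch above, the full table of correlation factors is a single matrix product:

```python
K = term_vectors(F)   # F: term-document frequency matrix, as above
C = K @ K.T           # C[u, v] is the correlation factor c_{u,v}
```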

Page 24: Relevance Feedback

Query Expansion with the Global Thesaurus

Query expansion proceeds in three steps:
1. Represent the query in the concept space used for representation of the index terms.
2. Based on the global similarity thesaurus, compute a similarity sim(q, k_v) between each term k_v correlated to the query terms and the whole query q.
3. Expand the query with the top r ranked terms according to sim(q, k_v).

Page 25: Relevance Feedback

Query Expansion - step one

To the query q is associated a vector $\vec{q}$ in the term-concept space, given by

$$\vec{q} = \sum_{k_i \in q} w_{i,q}\, \vec{k}_i$$

where w_{i,q} is a weight associated with the index-query pair [k_i, q].

Page 26: Relevance Feedback

Query Expansion - step two

Compute a similarity sim(q, k_v) between each term k_v and the user query q:

$$sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_u \in q} w_{u,q}\, c_{u,v}$$

where c_{u,v} is the correlation factor.

Page 27: Relevance Feedback

Query Expansion - step three

Add the top r ranked terms according to sim(q, k_v) to the original query q to form the expanded query q′. To each expansion term k_v in the query q′ is assigned a weight w_{v,q′} given by

$$w_{v,q'} = \frac{sim(q, k_v)}{\sum_{k_u \in q} w_{u,q}}$$

The expanded query q′ is then used to retrieve new documents for the user.
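Putting the three steps together as a sketch (C is the term-term correlation matrix from before; names and the dict-based query representation are illustrative):

```python
def expand_query(query_weights, C, r=5):
    """query_weights: {term_id: w_iq}. Returns the expanded query q'
    with the top-r correlated terms added."""
    q_terms = set(query_weights)
    # Step 2: sim(q, kv) = sum_u w_uq * c_uv over the query terms.
    sim = {v: sum(w * C[u, v] for u, w in query_weights.items())
           for v in range(C.shape[0]) if v not in q_terms}
    # Step 3: add the top-r terms, weighted by sim(q, kv) / sum_u w_uq.
    norm = sum(query_weights.values())
    expanded = dict(query_weights)
    for v in sorted(sim, key=sim.get, reverse=True)[:r]:
        expanded[v] = sim[v] / norm
    return expanded
```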

Page 28: Relevance Feedback

Statistical Thesaurus Formulation

Expansion terms must be low-frequency terms. However, it is difficult to cluster low-frequency terms. Idea: cluster documents into classes instead, and use the low-frequency terms in these documents to define the thesaurus classes. The clustering algorithm must produce small and tight clusters.

Page 29: Relevance Feedback

A Clustering Algorithm (Complete Link)

This is a document clustering algorithm which produces small and tight clusters (a minimal code sketch follows the figure):
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [C_u, C_v] with the highest inter-cluster similarity.
4. Merge the clusters C_u and C_v.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.

The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents.

[Figure: dendrogram over D1-D4: C1 and C3 merge at similarity 0.99 into C1,3; C2 joins at 0.29 (C1,3,2); C4 joins at 0.00 (C1,3,2,4).]
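A naive sketch of the algorithm above (sim is any pairwise document similarity function; efficiency is beside the point here):

```python
def complete_link(n_docs, sim, stop_sim=0.0):
    """Agglomerative clustering where cluster similarity is the MINIMUM
    similarity over all cross-cluster document pairs (complete link)."""
    clusters = [{d} for d in range(n_docs)]
    history = []                             # records the merge hierarchy
    def cl_sim(a, b):
        return min(sim(x, y) for x in a for y in b)
    while len(clusters) > 1:
        i, j = max(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: cl_sim(clusters[p[0]], clusters[p[1]]))
        s = cl_sim(clusters[i], clusters[j])
        if s < stop_sim:                     # stop criterion
            break
        clusters[i] |= clusters.pop(j)       # merge Cu and Cv
        history.append((sorted(clusters[i]), s))
    return clusters, history
```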

Page 30: Relevance Feedback

Selecting the Terms that Compose Each Class

Given the document cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected as follows. Obtain from the user three parameters:
- TC: threshold class
- NDC: number of documents in a class
- MIDF: minimum inverse document frequency


Page 31: Relevance Feedback

Selecting the Terms that Compose Each Class

Use the parameter TC as a threshold value for determining which document clusters will be used to generate thesaurus classes: sim(C_u, C_v) has to surpass TC if the documents in the clusters C_u and C_v are to be selected as sources of terms for a thesaurus class.

Use the parameter NDC as a limit on the size of clusters (number of documents) to be considered. A low value of NDC might restrict the selection to the smaller cluster C_{u+v}.

Page 32: Relevance Feedback

Selecting the Terms that Compose Each Class

Consider the set of documents in each document cluster pre-selected above. Only the lower-frequency terms are used as sources of terms for the thesaurus classes. The parameter MIDF defines the minimum value of inverse document frequency for any term which is selected to participate in a thesaurus class.

Page 33: Relevance Feedback

Query Expansion Based on a Statistical Thesaurus

Use the thesaurus classes for query expansion. Compute an average term weight wt_C for each thesaurus class C:

$$\overline{wt}_C = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|}$$

Page 34: Relevance Feedback

Query Expansion Based on a Statistical Thesaurus

wt_C can be used to compute a thesaurus class weight w_C as

$$w_C = \frac{\overline{wt}_C}{|C|^{0.5}}$$
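As a small sketch, given the weights of a class's member terms (the function name is illustrative):

```python
def class_weight(term_weights):
    """Average term weight wt_C, then w_C = wt_C / |C| ** 0.5."""
    wt_c = sum(term_weights) / len(term_weights)
    return wt_c / len(term_weights) ** 0.5
```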

Page 35: Relevance Feedback

Query Expansion Sample

TC = 0.90, NDC = 2.00, MIDF = 0.2

sim(1,3) = 0.99
sim(1,2) = 0.40
sim(2,3) = 0.29
sim(4,1) = 0.00
sim(4,2) = 0.00
sim(4,3) = 0.00

Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A

[Figure: the same cluster hierarchy as before; only C1,3 (similarity 0.99) surpasses TC.]

idf A = 0.0, idf B = 0.3, idf C = 0.12, idf D = 0.12, idf E = 0.60

Only the class built from {Doc1, Doc3} passes TC and NDC, and of its terms only B has idf ≥ MIDF, so B is added:

q = A E E  →  q′ = A B E E

Page 36: Relevance Feedback

Query Expansion Based on a Statistical Thesaurus

Problems with this approach: initialization of the parameters TC, NDC, and MIDF. TC depends on the collection; inspection of the cluster hierarchy is almost always necessary to assist with the setting of TC. A high value of TC might yield classes with too few terms.

Page 37: Relevance Feedback

Conclusion

A thesaurus is an efficient method to expand queries. The computation is expensive, but it is executed only once. Query expansion based on a similarity thesaurus may use high-frequency terms to expand the query; query expansion based on a statistical thesaurus needs well-defined parameters.

Page 38: Relevance Feedback

Using Correlation for Term Change

- Low frequency to medium frequency: by synonym recognition
- High frequency to medium frequency: by phrase recognition