Relevance Feedback
Transcript of Relevance Feedback
Relevance Feedback Main Idea:
Modify the existing query based on relevance judgements: extract terms from relevant documents and add them to the query, and/or re-weight the terms already in the query.
Two main approaches:
- Automatic (pseudo-relevance feedback)
- User-mediated: users select relevant documents, or users/system select terms from an automatically-generated list
Relevance Feedback
Usually do both: expand the query with new terms and re-weight the terms already in the query.
There are many variations:
- usually positive weights for terms from relevant docs
- sometimes negative weights for terms from non-relevant docs
- remove terms that appear ONLY in non-relevant documents
Relevance Feedback for Vector Model
In the "ideal" case where we know the relevant documents a priori:

  Q_opt = (1/|Cr|) Σ_{dj ∈ Cr} dj  −  (1/(N − |Cr|)) Σ_{dj ∉ Cr} dj

where Cr = the set of documents that are truly relevant to Q, and N = the total number of documents.
Rocchio Method
  Q1 = α·Q0 + (β/|Dr|) Σ_{dj ∈ Dr} dj − (γ/|Dn|) Σ_{dj ∈ Dn} dj

Q0 is the initial query; Q1 is the query after one iteration.
Dr is the set of relevant docs; Dn is the set of irrelevant docs.
Typically α = 1, β = 0.75, γ = 0.25.
Other variations are possible, but performance is similar.
Rocchio/Vector Illustration
[Figure: query and document vectors plotted in a 2-D term space, axes "retrieval" (x) and "information" (y), scale 0 to 1.0]

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science      = (0.2, 0.8)
D2 = retrieval systems        = (0.9, 0.1)

Q' = ½·Q0 + ½·D1 = (0.45, 0.55)
Q" = ½·Q0 + ½·D2 = (0.80, 0.20)
Example Rocchio Calculation

Original query:  Q  = (0.00, 0.00, 0.00, 0.00, 0.500, 0.00, 0.450, 0.00, 0.950)
Relevant docs:   R1 = (0.020, 0.009, 0.020, 0.002, 0.050, 0.025, 0.100, 0.100, 0.120)
                 R2 = (0.030, 0.00, 0.00, 0.025, 0.025, 0.050, 0.00, 0.00, 0.120)
Non-rel doc:     S1 = (0.030, 0.010, 0.020, 0.00, 0.005, 0.025, 0.00, 0.020, 0.00)
Constants: α = 1, β = 0.75, γ = 0.25

Rocchio calculation:
  Q_new = α·Q + (β/2)·(R1 + R2) − (γ/1)·S1

Resulting feedback query:
  Q_new = (0.011, 0.000875, 0.002, 0.010, 0.527, 0.022, 0.488, 0.033, 1.04)
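The calculation above can be sketched directly in Python; the vectors and constants are read from the slide, and the rounded weights of the resulting feedback query follow from the standard Rocchio formula.

```python
# Vectors from the example (9 index terms), as plain lists.
q  = [0.0, 0.0, 0.0, 0.0, 0.500, 0.0, 0.450, 0.0, 0.950]              # original query
r1 = [0.020, 0.009, 0.020, 0.002, 0.050, 0.025, 0.100, 0.100, 0.120]  # relevant doc 1
r2 = [0.030, 0.0, 0.0, 0.025, 0.025, 0.050, 0.0, 0.0, 0.120]          # relevant doc 2
s1 = [0.030, 0.010, 0.020, 0.0, 0.005, 0.025, 0.0, 0.020, 0.0]        # non-relevant doc

def rocchio(q, rel, nonrel, alpha=1.0, beta=0.75, gamma=0.25):
    """alpha*q + beta*(centroid of relevant docs) - gamma*(centroid of non-relevant docs)."""
    new_q = []
    for i in range(len(q)):
        pos = sum(d[i] for d in rel) / len(rel)
        neg = sum(d[i] for d in nonrel) / len(nonrel)
        new_q.append(alpha * q[i] + beta * pos - gamma * neg)
    return new_q

q_new = rocchio(q, [r1, r2], [s1])
# e.g. q_new[4] = 0.500 + 0.375*(0.050 + 0.025) - 0.25*0.005 = 0.527 (to 3 decimals)
```

Note how the non-relevant document pulls weight away from its terms while the relevant centroid adds weight to shared terms.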
Rocchio Method
Rocchio automatically re-weights terms and adds in new terms (from relevant docs); one has to be careful when using negative weights.
Rocchio is not a machine learning algorithm. Most such methods perform similarly, and results are heavily dependent on the test collection. Machine learning methods are proving to work better than standard IR approaches like Rocchio.
Relevance Feedback in the Probabilistic Model

  sim(dj,q) ~ Σi wiq · wij · ( log[ P(ki|R) / (1 − P(ki|R)) ] + log[ (1 − P(ki|¬R)) / P(ki|¬R) ] )

How do we obtain the probabilities P(ki|R) and P(ki|¬R)?
Initial estimates, based on assumptions:
  P(ki|R) = 0.5
  P(ki|¬R) = ni/N, where ni is the number of docs that contain ki
Use this initial guess to retrieve an initial ranking, then improve upon it.

Improving the Initial Ranking
Let V be the set of docs initially retrieved (assumed relevant) and Vi the subset of V that contain ki. Re-evaluate the estimates:
  P(ki|R) = Vi / V
  P(ki|¬R) = (ni − Vi) / (N − V)
Repeat recursively.

To avoid problems with V = 1 and Vi = 0, smooth the estimates:
  P(ki|R) = (Vi + 0.5) / (V + 1)
  P(ki|¬R) = (ni − Vi + 0.5) / (N − V + 1)
Also used:
  P(ki|R) = (Vi + ni/N) / (V + 1)
  P(ki|¬R) = (ni − Vi + ni/N) / (N − V + 1)
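A minimal sketch of the smoothed re-estimation step; the document counts in the usage lines are made-up illustrative numbers, not from the slides.

```python
import math

def term_weight(ni, N, Vi, V):
    """Log-odds contribution of term ki, using the smoothed estimates
    P(ki|R) = (Vi + 0.5)/(V + 1) and P(ki|~R) = (ni - Vi + 0.5)/(N - V + 1)."""
    p_r = (Vi + 0.5) / (V + 1)
    p_nr = (ni - Vi + 0.5) / (N - V + 1)
    return math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)

# A term in 4 of 10 retrieved docs but only 40 of 1000 overall: strong positive evidence.
w_strong = term_weight(ni=40, N=1000, Vi=4, V=10)
# A term in 1 of 10 retrieved docs but 400 of 1000 overall: negative evidence.
w_weak = term_weight(ni=400, N=1000, Vi=1, V=10)
```

Terms over-represented in the retrieved set relative to the whole collection get large positive weights, which is exactly what drives the improved re-ranking.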
Using Relevance Feedback
Known to improve results in TREC-like conditions (no user involved).
What about with a user in the loop? How might you measure this? Precision/recall figures for the unseen documents need to be computed.

Relevance Feedback Summary
Iterative query modification can improve precision and recall for a standing query.
In at least one study, users were able to make good choices by seeing which terms were suggested for relevance feedback and selecting among them.
Query Expansion
Add terms that are closely related to the query terms, to improve precision and recall. Two variants:
- Local: only analyze the closeness among the set of documents that are returned
- Global: consider all the documents in the corpus a priori
How to decide which terms are closely related? THESAURI!!
- Hand-coded thesauri (Roget and his brothers)
- Automatically generated thesauri:
  - correlation based (association, nearness)
  - similarity based (terms as vectors in doc space)
  - statistical (clustering techniques)
Correlation/Co-occurrence Analysis
Co-occurrence analysis: terms that are related to terms in the original query may be added to the query. Two terms are related if they have high co-occurrence in documents.
Let n be the number of documents;
n1 and n2 be the number of documents containing terms t1 and t2;
m be the number of documents containing both t1 and t2.

If t1 and t2 are independent:  m/n = (n1/n)·(n2/n), i.e., m = n1·n2/n
If t1 and t2 are correlated:   m >> n1·n2/n

How much m exceeds n1·n2/n measures the degree of correlation.
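The independence check above can be sketched as a one-line ratio; the counts in the usage line are made-up illustrative numbers.

```python
def cooccurrence_ratio(n, n1, n2, m):
    """Observed co-occurrence m versus the count n1*n2/n expected if t1 and t2
    were independent; a ratio well above 1 suggests the terms are correlated."""
    return m / (n1 * n2 / n)

# Two terms, each in 100 of 10,000 docs; independence predicts ~1 shared doc.
r = cooccurrence_ratio(n=10000, n1=100, n2=100, m=40)   # 40x the expected count
```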
Association Clusters
Let Mij be the term-document matrix
- for the full corpus (global), or
- for the docs in the set of initial results (local).
(Also, sometimes stems are used instead of terms.)
Correlation matrix C = M·Mᵀ (term-doc × doc-term = term-term)

Un-normalized association matrix:
  Cuv = Σ_dj f_{tu,dj} · f_{tv,dj}
Normalized association matrix:
  Suv = Cuv / (Cuu + Cvv − Cuv)

The nth association cluster for a term tu is the set of terms tv such that Suv are the n largest values among Su1, Su2, …, Suk.
Example

Term-document matrix:
       d1 d2 d3 d4 d5 d6 d7
  K1    2  1  0  2  1  1  0
  K2    0  0  1  0  2  2  5
  K3    1  0  3  0  4  0  0

Correlation matrix C:        Normalized correlation matrix S:
  11  4  6                     1.0   0.097 0.193
   4 34 11                     0.097 1.0   0.224
   6 11 26                     0.193 0.224 1.0

e.g., S12 = C12 / (C11 + C22 − C12) = 4 / (11 + 34 − 4) ≈ 0.097

The 1st association cluster for K2 is K3.
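The matrices in this example can be recomputed in a few lines of Python, using the term-document matrix from the slide.

```python
# Term-document matrix from the example (3 terms x 7 docs).
M = [
    [2, 1, 0, 2, 1, 1, 0],   # K1
    [0, 0, 1, 0, 2, 2, 5],   # K2
    [1, 0, 3, 0, 4, 0, 0],   # K3
]
k, n = len(M), len(M[0])

# Un-normalized association matrix: C = M * M^T.
C = [[sum(M[u][d] * M[v][d] for d in range(n)) for v in range(k)] for u in range(k)]

# Normalized association matrix: S_uv = C_uv / (C_uu + C_vv - C_uv).
S = [[C[u][v] / (C[u][u] + C[v][v] - C[u][v]) for v in range(k)] for u in range(k)]
```

The diagonal of S is 1 by construction, and S[1][2] ≈ 0.224 is the largest off-diagonal entry in K2's row, which is why K3 is K2's first association cluster.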
Scalar Clusters
Consider the normalized association matrix S.
The "association vector" Au of term u is (Su1, Su2, …, Suk).
Even if terms u and v have low correlation, they may be transitively correlated (e.g., a term w has high correlation with both u and v). To measure this neighborhood-induced correlation between terms, take the cosine between the association vectors of terms u and v:

  Suv = (Au · Av) / (|Au| · |Av|)

The nth scalar cluster for a term tu is the set of terms tv such that Suv are the n largest values among Su1, Su2, …, Suk.
Example

Term-document matrix:
       d1 d2 d3 d4 d5 d6 d7
  K1    2  1  0  2  1  1  0
  K2    0  0  1  0  2  2  5
  K3    1  0  3  0  4  0  0

Normalized correlation matrix:
  1.0   0.097 0.193
  0.097 1.0   0.224
  0.193 0.224 1.0

Association vector A_K1 = (1.0, 0.097, 0.193)
USER(43): (neighborhood normatrix)
0: (COSINE-METRIC (1.0 0.09756097 0.19354838) (1.0 0.09756097 0.19354838))
0: returned 1.0
0: (COSINE-METRIC (1.0 0.09756097 0.19354838) (0.09756097 1.0 0.2244898))
0: returned 0.22647195
0: (COSINE-METRIC (1.0 0.09756097 0.19354838) (0.19354838 0.2244898 1.0))
0: returned 0.38323623
0: (COSINE-METRIC (0.09756097 1.0 0.2244898) (1.0 0.09756097 0.19354838))
0: returned 0.22647195
0: (COSINE-METRIC (0.09756097 1.0 0.2244898) (0.09756097 1.0 0.2244898))
0: returned 1.0
0: (COSINE-METRIC (0.09756097 1.0 0.2244898) (0.19354838 0.2244898 1.0))
0: returned 0.43570948
0: (COSINE-METRIC (0.19354838 0.2244898 1.0) (1.0 0.09756097 0.19354838))
0: returned 0.38323623
0: (COSINE-METRIC (0.19354838 0.2244898 1.0) (0.09756097 1.0 0.2244898))
0: returned 0.43570948
0: (COSINE-METRIC (0.19354838 0.2244898 1.0) (0.19354838 0.2244898 1.0))
0: returned 1.0
Scalar (neighborhood) cluster matrix:
  1.0   0.226 0.383
  0.226 1.0   0.435
  0.383 0.435 1.0

The 1st scalar cluster for K2 is still K3.
Metric Clusters
Let r(tu,tv) be the minimum distance (in number of separating words) between tu and tv in any single document (infinity if they never occur together in a document). Define the cluster matrix Suv = 1/r(tu,tv). Variants use the average distance instead of the minimum.
The nth metric cluster for a term tu is the set of terms tv such that Suv are the n largest values among Su1, Su2, …, Suk.
Note that r(tu,tv) is also useful for proximity queries and phrase queries.
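A sketch of the metric-cluster matrix built from term positions; the two example sentences are made-up, not from the slides.

```python
from collections import defaultdict

def metric_correlation(docs):
    """S_uv = 1/r(tu, tv), where r is the minimum number of separating words
    between tu and tv over all documents; pairs that never co-occur in any
    document are simply absent (their distance would be infinity)."""
    best = {}
    for doc in docs:
        positions = defaultdict(list)
        for pos, term in enumerate(doc.split()):
            positions[term].append(pos)
        terms = sorted(positions)
        for i, tu in enumerate(terms):
            for tv in terms[i + 1:]:
                d = min(abs(p - q) for p in positions[tu] for q in positions[tv])
                best[(tu, tv)] = min(best.get((tu, tv), d), d)
    return {pair: 1.0 / d for pair, d in best.items()}

S = metric_correlation(["query expansion improves recall",
                        "thesaurus based query expansion"])
```

Adjacent terms ("query expansion") get the maximum score 1.0; distant terms get proportionally less.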
Similarity Thesaurus
The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence. It is obtained by considering that the terms are concepts in a concept space: each term is indexed by the documents in which it appears. Terms assume the original role of documents, while documents are interpreted as indexing elements.

[Figure: motivation — query terms Ka, Kb and candidate expansion terms Ki, Kj, Kv plotted in the concept space around the query Q]
Similarity Thesaurus
Terminology:
  t: number of terms in the collection
  N: number of documents in the collection
  f(i,j): frequency of occurrence of the term ki in the document dj
  t(j): vocabulary of document dj (number of distinct terms in dj)
  itf(j): inverse term frequency for document dj

Inverse term frequency for document dj:
  itf(j) = log( t / t(j) )
Idea: it is no surprise if the Oxford dictionary mentions the word! A document that contains almost every term tells us little about any one of them.

To each term ki is associated a vector
  ki = (w(i,1), w(i,2), …, w(i,N))
where
  w(i,j) = ( (0.5 + 0.5·f(i,j)/max_j f(i,j)) · itf(j) )
           / sqrt( Σ_{l=1..N} (0.5 + 0.5·f(i,l)/max_l f(i,l))² · itf(l)² )
Similarity Thesaurus
The relationship between two terms ku and kv is computed as a correlation factor c(u,v) given by
  c(u,v) = ku · kv = Σ_{dj} w(u,j) · w(v,j)
The global similarity thesaurus is built through the computation of the correlation factor c(u,v) for each pair of indexing terms [ku,kv] in the collection.
This is expensive, but incremental updates are possible.
The idea is similar to scalar clusters, but with the tf/itf weighting defining the term vectors.
Query Expansion with the Global Thesaurus
Query expansion is done in three steps, as follows:
1. Represent the query in the concept space used for representation of the index terms.
2. Based on the global similarity thesaurus, compute a similarity sim(q,kv) between each term kv correlated to the query terms and the whole query q.
3. Expand the query with the top r ranked terms according to sim(q,kv).
Query Expansion - step one
To the query q is associated a vector q in the term-concept space, given by
  q = Σ_{ki ∈ q} w(i,q) · ki
where w(i,q) is the weight associated with the index-query pair [ki,q].
Query Expansion - step two
Compute a similarity sim(q,kv) between each term kv and the user query q:
  sim(q,kv) = q · kv = Σ_{ku ∈ q} w(u,q) · c(u,v)
where c(u,v) is the correlation factor.
Query Expansion - step three
Add the top r ranked terms according to sim(q,kv) to the original query q to form the expanded query q'.
To each expansion term kv in the query q' is assigned a weight w(v,q') given by
  w(v,q') = sim(q,kv) / Σ_{ku ∈ q} w(u,q)
The expanded query q' is then used to retrieve new documents for the user.
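The three steps can be sketched end to end; the correlation factors and query weights below are hypothetical illustrative numbers, not values from the slides.

```python
# Toy global similarity thesaurus: c(u,v) correlation factors (hypothetical).
c = {
    ("retrieval", "information"): 0.4,
    ("retrieval", "search"): 0.7,
    ("information", "search"): 0.3,
}
def corr(u, v):
    return 1.0 if u == v else c.get((u, v), c.get((v, u), 0.0))

# Step 1: the query in the concept space (weights w(i,q) for its terms).
w_q = {"retrieval": 0.8, "information": 0.6}

# Step 2: sim(q, kv) = sum over query terms ku of w(u,q) * c(u,v).
vocabulary = ["retrieval", "information", "search"]
sim = {v: sum(w * corr(u, v) for u, w in w_q.items()) for v in vocabulary}

# Step 3: add the top-r non-query terms, weighted w(v,q') = sim(q,kv) / sum of w(u,q).
r = 1
expansion = sorted((v for v in vocabulary if v not in w_q),
                   key=lambda v: sim[v], reverse=True)[:r]
weights = {v: sim[v] / sum(w_q.values()) for v in expansion}
```

Here "search" is added with weight sim(q,search)/(0.8+0.6), so expansion terms never outweigh the original query terms.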
Statistical Thesaurus Formulation
Expansion terms must be low-frequency terms. However, it is difficult to cluster low-frequency terms. Idea: cluster documents into classes instead, and use the low-frequency terms in these documents to define the thesaurus classes. This requires a clustering algorithm that produces small and tight clusters.
A Clustering Algorithm (Complete Link)
This is a document clustering algorithm which produces small and tight clusters:
1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters [Cu,Cv] with the highest inter-cluster similarity.
4. Merge the clusters Cu and Cv.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.
The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents.

[Figure: example hierarchy over D1..D4 — D1 and D3 merge at similarity 0.99 (C1,3), then D2 joins at 0.29 (C1,3,2), then D4 at 0.00 (C1,3,2,4)]
Selecting the Terms that Compose Each Class
Given the document cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected as follows. Obtain from the user three parameters:
  TC: threshold class
  NDC: number of documents in a class
  MIDF: minimum inverse document frequency
Use the parameter TC as a threshold value for determining the document clusters that will be used to generate thesaurus classes: sim(Cu,Cv) has to surpass TC if the documents in the clusters Cu and Cv are to be selected as sources of terms for a thesaurus class.
Use the parameter NDC as a limit on the size of clusters (number of documents) to be considered. A low value of NDC might restrict the selection to the smaller cluster Cu+v.
Consider the set of documents in each document cluster pre-selected above. Only the lower-frequency terms are used as sources of terms for the thesaurus classes. The parameter MIDF defines the minimum value of inverse document frequency for any term which is selected to participate in a thesaurus class.
Query Expansion based on a Statistical Thesaurus
Use the thesaurus classes for query expansion. Compute an average term weight wtc for each thesaurus class C:
  wtc = ( Σ_{i=1..|C|} w(i,C) ) / |C|
wtc can be used to compute a thesaurus class weight wc as
  wc = wtc / |C|^0.5
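A minimal sketch, assuming the class weight divides the average term weight by the square root of the class size (the example weights are made-up numbers).

```python
def class_weight(term_weights):
    """Assumed formula: wc = wtc / |C|**0.5, where wtc is the average
    term weight of the class; larger classes are downweighted."""
    wtc = sum(term_weights) / len(term_weights)   # average term weight for class C
    return wtc / len(term_weights) ** 0.5         # wc = wtc / |C|^0.5

wc = class_weight([0.3, 0.5, 0.4])   # wtc = 0.4, |C| = 3
```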
Query Expansion Sample
TC = 0.90, NDC = 2.00, MIDF = 0.2

Doc1 = D, D, A, B, C, A, B, C
Doc2 = E, C, E, A, A, D
Doc3 = D, C, B, B, D, A, B, C, A
Doc4 = A

sim(1,3) = 0.99   sim(1,2) = 0.40   sim(2,3) = 0.29
sim(4,1) = 0.00   sim(4,2) = 0.00   sim(4,3) = 0.00

[Cluster hierarchy: D1 and D3 merge at 0.99, then D2 joins at 0.29, then D4 at 0.00]

idf A = 0.0, idf B = 0.3, idf C = 0.12, idf D = 0.12, idf E = 0.60

q = A E E  →  expanded query q' = A B E E
Query Expansion based on a Statistical Thesaurus
Problems with this approach:
- initialization of the parameters TC, NDC and MIDF
- TC depends on the collection; inspection of the cluster hierarchy is almost always necessary for assisting with the setting of TC
- a high value of TC might yield classes with too few terms
Conclusion
A thesaurus is an efficient method to expand queries. The computation is expensive, but it is executed only once. Query expansion based on a similarity thesaurus may use high-frequency terms to expand the query, while query expansion based on a statistical thesaurus needs well-defined parameters.
Using correlation for term change:
- low frequency to medium frequency: by synonym recognition
- high to medium frequency: by phrase recognition