Slides Chap05
Transcript of Slides Chap05
-
8/13/2019 Slides Chap05
1/104
Modern Information Retrieval
Chapter 5: Relevance Feedback and Query Expansion
Introduction
A Framework for Feedback Methods
Explicit Relevance Feedback
Explicit Feedback Through Clicks
Implicit Feedback Through Local Analysis
Implicit Feedback Through Global Analysis
Trends and Research Issues
Chap 05: Relevance Feedback and Query Expansion, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition p. 1
Introduction
Most users find it difficult to formulate queries that are well designed for retrieval purposes
Yet, most users often need to reformulate their queries to obtain the results of their interest
Thus, the first query formulation should be treated as an initial attempt to retrieve relevant information
Documents initially retrieved could be analyzed for relevance and used to improve the initial query
Introduction
The process of query modification is commonly referred to as:
relevance feedback, when the user provides information on relevant documents to a query, or
query expansion, when information related to the query is used to expand it
We refer to both of them as feedback methods
Two basic approaches of feedback methods:
explicit feedback, in which the information for query reformulation is provided directly by the users, and
implicit feedback, in which the information for query reformulation is implicitly derived by the system
A Framework for Feedback Methods
A Framework
Consider a set of documents Dr that are known to be relevant to the current query q
In relevance feedback, the documents in Dr are used to transform q into a modified query qm
However, obtaining information on documents relevant to a query requires the direct interference of the user
Most users are unwilling to provide this information, particularly in the Web
A Framework
Because of this high cost, the idea of relevance feedback has been relaxed over the years
Instead of asking the users for the relevant documents, we could:
Look at documents they have clicked on; or
Look at terms belonging to the top documents in the result set
In both cases, it is expected that the feedback cycle will produce results of higher quality
A Framework
A feedback cycle is composed of two basic steps:
Determine feedback information that is either related or expected to be related to the original query q, and
Determine how to transform query q to take this information effectively into account
The first step can be accomplished in two distinct ways:
Obtain the feedback information explicitly from the users
Obtain the feedback information implicitly from the query results or from external sources such as a thesaurus
A Framework
In an explicit relevance feedback cycle, the feedback information is provided directly by the users
However, collecting feedback information is expensive and time consuming
In the Web, user clicks on search results constitute a new source of feedback information
A click indicates a document that is of interest to the user in the context of the current query
Notice that a click does not necessarily indicate a document that is relevant to the query
Explicit Feedback Information
A Framework
In an implicit relevance feedback cycle, the feedback information is derived implicitly by the system
There are two basic approaches for compiling implicit feedback information:
local analysis, which derives the feedback information from the top ranked documents in the result set
global analysis, which derives the feedback information from external sources such as a thesaurus
Implicit Feedback Information
Explicit Relevance Feedback
Explicit Relevance Feedback
In a classic relevance feedback cycle, the user is presented with a list of the retrieved documents
Then, the user examines them and marks those that are relevant
In practice, only the top 10 (or 20) ranked documents need to be examined
The main idea consists of
selecting important terms from the documents that have been identified as relevant, and
enhancing the importance of these terms in a new query formulation
Explicit Relevance Feedback
Expected effect: the new query will be moved towards the relevant docs and away from the non-relevant ones
Early experiments have shown good improvements in precision for small test collections
Relevance feedback presents the following characteristics:
it shields the user from the details of the query reformulation process (all the user has to provide is a relevance judgement)
it breaks down the whole searching task into a sequence of small steps which are easier to grasp
The Rocchio Method
The Rocchio Method
Documents identified as relevant (to a given query) have similarities among themselves
Further, non-relevant docs have term-weight vectors which are dissimilar from those of the relevant documents
The basic idea of the Rocchio Method is to reformulate the query such that it gets:
closer to the neighborhood of the relevant documents in the vector space, and
away from the neighborhood of the non-relevant documents
The Rocchio Method
Let us define terminology regarding the processing of a given query q, as follows:
Dr: set of relevant documents among the documents retrieved
Nr: number of documents in set Dr
Dn: set of non-relevant docs among the documents retrieved
Nn: number of documents in set Dn
Cr: set of relevant docs among all documents in the collection
N: number of documents in the collection
α, β, γ: tuning constants
The Rocchio Method
Consider that the set Cr is known in advance
Then, the best query vector for distinguishing the relevant from the non-relevant docs is given by

\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \; - \; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j

where
|Cr| refers to the cardinality of the set Cr
\vec{d}_j is a weighted term vector associated with document dj, and
\vec{q}_{opt} is the optimal weighted term vector for query q
The Rocchio Method
However, the set Cr is not known a priori
To solve this problem, we can formulate an initial query and incrementally change the initial query vector
The Rocchio Method
There are three classic and similar ways to calculate the modified query \vec{q}_m, as follows:

Standard_Rocchio:  \vec{q}_m = \alpha \vec{q} + \frac{\beta}{N_r} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{N_n} \sum_{\vec{d}_j \in D_n} \vec{d}_j

Ide_Regular:  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j

Ide_Dec_Hi:  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \, max\_rank(D_n)

where max_rank(Dn) is the highest ranked non-relevant doc
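To make the formula concrete, here is a minimal sketch of the Standard_Rocchio reformulation, representing term-weight vectors as Python dicts (the function name, the dict representation, and the default values of α, β, γ are our illustrative choices, not from the book):

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Standard_Rocchio: qm = alpha*q + (beta/Nr)*sum(Dr) - (gamma/Nn)*sum(Dn).

    Vectors are dicts mapping term -> weight; negative weights are clipped out,
    so the modified query only keeps positively weighted terms.
    """
    qm = {t: alpha * w for t, w in query.items()}
    for d in relevant:                      # move towards the relevant docs
        for t, w in d.items():
            qm[t] = qm.get(t, 0.0) + beta * w / len(relevant)
    for d in non_relevant:                  # move away from the non-relevant docs
        for t, w in d.items():
            qm[t] = qm.get(t, 0.0) - gamma * w / len(non_relevant)
    return {t: w for t, w in qm.items() if w > 0}

q  = {"jaguar": 1.0}
dr = [{"jaguar": 0.5, "car": 0.8}]          # judged relevant
dn = [{"jaguar": 0.4, "animal": 0.9}]       # judged non-relevant
qm = rocchio(q, dr, dn)
```

Note how the term "car", absent from the original query, enters the modified query: this is the query expansion effect of the vector methods.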
The Rocchio Method
Three different setups of the parameters in the Rocchio formula are as follows:
α = 1, proposed by Rocchio
α = β = γ = 1, proposed by Ide
γ = 0, which yields a positive feedback strategy
The current understanding is that the three techniques yield similar results
The main advantages of the above relevance feedback techniques are simplicity and good results:
Simplicity: modified term weights are computed directly from the set of retrieved documents
Good results: the modified query vector does reflect a portion of the intended query semantics (observed experimentally)
Relevance Feedback for the Probabilistic Model
A Probabilistic Method
The probabilistic model ranks documents for a query q according to the probabilistic ranking principle
The similarity of a document dj to a query q in the probabilistic model can be expressed as

sim(d_j, q) \sim \sum_{k_i \in q \wedge k_i \in d_j} \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)

where
P(ki|R) stands for the probability of observing the term ki in the set R of relevant documents
P(ki|R̄) stands for the probability of observing the term ki in the set R̄ of non-relevant docs
A Probabilistic Method
Initially, the equation above cannot be used because P(ki|R) and P(ki|R̄) are unknown
Different methods for estimating these probabilities automatically were discussed in Chapter 3
With user feedback information, these probabilities are estimated in a slightly different way
For the initial search (when there are no retrieved documents yet), assumptions often made include:
P(ki|R) is constant for all terms ki (typically 0.5)
the term probability distribution P(ki|R̄) can be approximated by the distribution in the whole collection
A Probabilistic Method
These two assumptions yield:

P(k_i|R) = 0.5 \qquad P(k_i|\bar{R}) = \frac{n_i}{N}

where ni stands for the number of documents in the collection that contain the term ki
Substituting into the similarity equation, we obtain

sim_{initial}(d_j, q) = \sum_{k_i \in q \wedge k_i \in d_j} \log \frac{N - n_i}{n_i}

For the feedback searches, the accumulated statistics on relevance are used to evaluate P(ki|R) and P(ki|R̄)
A Probabilistic Method
Let nr,i be the number of documents in set Dr that contain the term ki
Then, the probabilities P(ki|R) and P(ki|R̄) can be approximated by

P(k_i|R) = \frac{n_{r,i}}{N_r} \qquad P(k_i|\bar{R}) = \frac{n_i - n_{r,i}}{N - N_r}

Using these approximations, the similarity equation can be rewritten as

sim(d_j, q) = \sum_{k_i \in q \wedge k_i \in d_j} \left( \log \frac{n_{r,i}}{N_r - n_{r,i}} + \log \frac{N - N_r - (n_i - n_{r,i})}{n_i - n_{r,i}} \right)
A Probabilistic Method
Notice that here, contrary to the Rocchio Method, no query expansion occurs
The same query terms are reweighted using feedback information provided by the user
The formula above poses problems for certain small values of Nr and nr,i
For this reason, a 0.5 adjustment factor is often added to the estimation of P(ki|R) and P(ki|R̄):

P(k_i|R) = \frac{n_{r,i} + 0.5}{N_r + 1} \qquad P(k_i|\bar{R}) = \frac{n_i - n_{r,i} + 0.5}{N - N_r + 1}
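The feedback reweighting with the 0.5 adjustment can be sketched as follows (a hypothetical helper; the function and parameter names are ours):

```python
import math

def feedback_term_weight(n_i, nr_i, N, Nr):
    """Probabilistic feedback weight for a query term ki, using the
    0.5-adjusted estimates:
      p = P(ki|R)   ~ (nr_i + 0.5) / (Nr + 1)
      u = P(ki|~R)  ~ (n_i - nr_i + 0.5) / (N - Nr + 1)
    weight = log(p / (1 - p)) + log((1 - u) / u)
    """
    p = (nr_i + 0.5) / (Nr + 1)
    u = (n_i - nr_i + 0.5) / (N - Nr + 1)
    return math.log(p / (1 - p)) + math.log((1 - u) / u)

# a term occurring in 3 of 5 relevant docs but in only 10 of 1000 docs overall
w = feedback_term_weight(n_i=10, nr_i=3, N=1000, Nr=5)
```

A term frequent among the relevant documents but rare in the collection gets a high weight; the same relevant-set frequency for a very common term yields a much lower weight.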
A Probabilistic Method
The main advantage of this feedback method is the derivation of new weights for the query terms
The disadvantages include:
document term weights are not taken into account during the feedback loop;
weights of terms in the previous query formulations are disregarded; and
no query expansion is used (the same set of index terms in the original query is reweighted over and over again)
Thus, this method does not in general operate as effectively as the vector modification methods
Evaluation of Relevance Feedback
Evaluation of Relevance Feedback
Consider the modified query vector qm produced by expanding q with relevant documents, according to the Rocchio formula
Evaluation of qm:
Compare the documents retrieved by qm with the set of relevant documents for q
In general, the results show spectacular improvements
However, a part of this improvement results from the higher ranks assigned to the relevant docs used to expand q into qm
Since the user has seen these docs already, such evaluation is unrealistic
The Residual Collection
A more realistic approach is to evaluate qm considering only the residual collection
We call residual collection the set of all docs minus the set of feedback docs provided by the user
Then, the recall-precision figures for qm tend to be lower than the figures for the original query vector q
This is not a limitation because the main purpose of the process is to compare distinct relevance feedback strategies
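The residual-collection evaluation can be sketched as follows (a minimal illustration; the function and document names are ours): remove the feedback documents from both the ranking and the relevant set before computing precision and recall.

```python
def residual_precision_recall(ranking, relevant, feedback_docs, k=10):
    """Evaluate a ranking on the residual collection: the collection minus
    the feedback docs the user has already judged."""
    seen = set(feedback_docs)
    residual_rank = [d for d in ranking if d not in seen]
    residual_rel = set(relevant) - seen
    top = residual_rank[:k]
    hits = sum(1 for d in top if d in residual_rel)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(residual_rel) if residual_rel else 0.0
    return precision, recall

p, r = residual_precision_recall(
    ranking=["d1", "d2", "d3", "d4", "d5"],
    relevant={"d1", "d3", "d5"},
    feedback_docs={"d1"},          # already judged by the user
    k=3,
)
```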
Explicit Feedback Through Clicks
Explicit Feedback Through Clicks
Web search engine users not only inspect the answers to their queries, they also click on them
The clicks reflect preferences for particular answers in the context of a given query
They can be collected in large numbers without interfering with the user actions
The immediate question is whether they also reflect relevance judgements on the answers
Under certain restrictions, the answer is affirmative, as we now discuss
Eye Tracking
Clickthrough data provides limited information on the user behavior
One approach to complement information on user behavior is to use eye tracking devices
Such commercially available devices can be used to determine the area of the screen the user is focused on
The approach allows correctly detecting the area of the screen of interest to the user in 60-90% of the cases
Further, the cases for which the method does not work can be determined
Eye Tracking
Eye movements can be classified in four types: fixations, saccades, pupil dilation, and scan paths
Fixations are a gaze at a particular area of the screen lasting for 200-300 milliseconds
This time interval is large enough to allow effective brain capture and interpretation of the image displayed
Fixations are the ocular activity normally associated with visual information acquisition and processing
That is, fixations are key to interpreting user behavior
Relevance Judgements
To evaluate the quality of the results, eye tracking is not appropriate
This evaluation requires selecting a set of test queries and determining relevance judgements for them
This is also the case if we intend to evaluate the quality of the signal produced by clicks
User Behavior
Eye tracking experiments have shown that users scan the query results from top to bottom
The users inspect the first and second results right away, within the second or third fixation
Further, they tend to scan the top 5 or top 6 answers thoroughly, before scrolling down to see other answers
User Behavior
Percentage of times each one of the top results was viewed and clicked on by a user, for 10 test tasks and 29 subjects (Thorsten Joachims et al, http://portal.acm.org/citation.cfm?id=1229179.1229181)
User Behavior
We notice that the users inspect the top 2 answers almost equally, but they click three times more on the first
This might be indicative of a user bias towards the search engine
That is, the users tend to trust the search engine in recommending a top result that is relevant
User Behavior
This can be better understood by presenting test subjects with two distinct result sets:
the normal ranking returned by the search engine, and
a modified ranking in which the top 2 results have their positions swapped
Analysis suggests that the user displays a trust bias in the search engine that favors the top result
That is, the position of the result has great influence on the user's decision to click on it
Clicks as a Metric of Preferences
Thus, it is clear that interpreting clicks as a direct indicative of relevance is not the best approach
More promising is to interpret clicks as a metric of user preferences
For instance, a user can look at a result and decide to skip it to click on a result that appears lower
In this case, we say that the user prefers the result clicked on to the result shown higher in the ranking
This type of preference relation takes into account:
the results clicked on by the user
the results that were inspected and not clicked on
Clicks within a Same Query
To interpret clicks as user preferences, we adopt the following definitions
Given a ranking function R(qi, dj), let rk be the kth ranked result
That is, r1, r2, r3 stand for the first, the second, and the third top results, respectively
Further, let r̂k indicate that the user has clicked on the kth result
Define a preference function rk > rk−n, 0 < k − n < k, that states that, according to the click actions of the user, the kth top result is preferable to the (k−n)th result
Clicks within a Same Query
To illustrate, consider the following example regarding the click behavior of a user (clicked results are marked with a hat):

r1  r2  r̂3  r4  r̂5  r6  r7  r8  r9  r̂10

This behavior does not allow us to make definitive statements about the relevance of results r3, r5, and r10
However, it does allow us to make statements on the relative preferences of this user
Two distinct strategies to capture the preference relations in this case are as follows:
Skip-Above: if r̂k, then rk > rk−n, for all rk−n that was not clicked
Skip-Previous: if r̂k and rk−1 has not been clicked, then rk > rk−1
Clicks within a Same Query
To illustrate, consider again the following example regarding the click behavior of a user (clicks on r3, r5, and r10):

r1  r2  r̂3  r4  r̂5  r6  r7  r8  r9  r̂10

According to the Skip-Above strategy, we have, for the click on r3:
r3 > r2; r3 > r1
And, according to the Skip-Previous strategy, we have:
r3 > r2
We notice that the Skip-Above strategy produces more preference relations than the Skip-Previous strategy
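A literal reading of the two strategies can be sketched as follows, with clicked ranks given as a set of integers and each preference returned as a pair (preferred rank, less preferred rank); the function names are ours:

```python
def skip_above(clicked):
    """Skip-Above: each clicked result is preferred to every higher-ranked
    result that was not clicked."""
    prefs = []
    for k in sorted(clicked):
        for j in range(1, k):
            if j not in clicked:
                prefs.append((k, j))
    return prefs

def skip_previous(clicked):
    """Skip-Previous: a clicked result is preferred to the result immediately
    above it, if that result was not clicked."""
    return [(k, k - 1) for k in sorted(clicked) if k > 1 and (k - 1) not in clicked]

clicks = {3, 5, 10}          # the example above: clicks on r3, r5, r10
above = skip_above(clicks)   # includes (3, 1), (3, 2), (5, 4), ...
prev = skip_previous(clicks)
```

As the slides note, Skip-Above generates strictly more preference relations than Skip-Previous for the same click set.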
Clicks within a Same Query
Empirical results indicate that user clicks are in agreement with judgements on the relevance of results in roughly 80% of the cases:
Both the Skip-Above and the Skip-Previous strategies produce preference relations
If we swap the first and second results, the clicks still reflect preference relations, for both strategies
If we reverse the order of the top 10 results, the clicks still reflect preference relations, for both strategies
Thus, the clicks of the users can be used as a strong indicative of personal preferences
Further, they can also be used as a strong indicative of the relative relevance of the results for a given query
Clicks within a Query Chain
The discussion above was restricted to the context of a single query
However, in practice, users issue more than one query in their search for answers to a same task
The set of queries associated with a same task can be identified in live query streams
This set constitutes what is referred to as a query chain
The purpose of analysing query chains is to produce new preference relations
Clicks within a Query Chain
To illustrate, consider that two result sets in a same query chain led to the following click actions (clicks on s2 and s5):

r1  r2  r3  r4  r5  r6  r7  r8  r9  r10
s1  ŝ2  s3  s4  ŝ5  s6  s7  s8  s9  s10

where
rj refers to an answer in the first result set
sj refers to an answer in the second result set
In this case, the user only clicked on the second and fifth answers of the second result set
Clicks within a Query Chain
Two distinct strategies to capture the preference relations in this case are as follows:
Top-One-No-Click-Earlier: if there is a k such that ŝk, then sj > r1, for j ≤ 10
Top-Two-No-Click-Earlier: if there is a k such that ŝk, then sj > r1 and sj > r2, for j ≤ 10
According to the first strategy, the following preferences are produced by the click of the user on result s2:
s1 > r1; s2 > r1; s3 > r1; s4 > r1; s5 > r1; . . .
According to the second strategy, we have:
s1 > r1; s2 > r1; s3 > r1; s4 > r1; s5 > r1; . . .
s1 > r2; s2 > r2; s3 > r2; s4 > r2; s5 > r2; . . .
Clicks within a Query Chain
We notice that the second strategy produces twice as many preference relations as the first
These preference relations must be compared with the relevance judgements of the human assessors
The following conclusions were derived:
Both strategies produce preference relations in agreement with the relevance judgements in roughly 80% of the cases
Similar agreements are observed even if we swap the first and second results
Similar agreements are observed even if we reverse the order of the results
Clicks within a Query Chain
These results suggest:
The users provide negative feedback on whole result sets (by not clicking on them)
The users learn with the process and reformulate better queries on the subsequent iterations
Click-based Ranking
Click-based Ranking
Clickthrough information can be used to improve the ranking
This can be done by learning a modified ranking function from click-based preferences
One approach is to use support vector machines (SVMs) to learn the ranking function
Click-based Ranking
In this case, preference relations are transformed into inequalities among weighted term vectors representing the ranked documents
These inequalities are then translated into an SVM optimization problem
The solution of this optimization problem yields the optimal weights for the document terms
The approach proposes the combination of different retrieval functions with different weights
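As an illustrative stand-in for the SVM formulation, the following sketch learns a linear ranking function from preference pairs by subgradient descent on a pairwise hinge loss with L2 shrinkage; the feature values, learning rate, and regularization constant are invented for the example, not taken from the book:

```python
def learn_ranking_weights(pairs, dim, epochs=200, lr=0.1, c=0.1):
    """Learn w such that w . x_pref > w . x_other for each preference pair,
    by subgradient descent on the pairwise hinge loss with an L2 shrinkage
    step (a primal approximation of the ranking-SVM idea)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for pref, other in pairs:
            diff = [a - b for a, b in zip(pref, other)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin < 1.0:                        # violated (or weak) pair
                w = [wi + lr * di for wi, di in zip(w, diff)]
            w = [wi * (1.0 - lr * c) for wi in w]   # regularization shrinkage
    return w

# each doc is a feature vector, e.g. (retrieval_score, title_match) -- illustrative
pairs = [((0.9, 1.0), (0.4, 0.0)),   # clicked doc preferred over skipped doc
         ((0.7, 1.0), (0.8, 0.0))]   # preferred despite slightly lower score
w = learn_ranking_weights(pairs, dim=2)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
```

The learned weights combine the underlying retrieval functions so that every recorded preference is ranked correctly, which is the spirit of the click-based ranking approach.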
Implicit Feedback Through Local Analysis
Local analysis
Local analysis consists in deriving feedback information from the documents retrieved for a given query q
This is similar to a relevance feedback cycle but done without assistance from the user
Two local strategies are discussed here: local clustering and local context analysis
Local Clustering
Local Clustering
Adoption of clustering techniques for query expansion has been a basic approach in information retrieval
The standard procedure is to quantify term correlations and then use the correlated terms for query expansion
Term correlations can be quantified by using global structures, such as association matrices
However, global structures might not adapt well to the local context defined by the current query
To deal with this problem, local clustering can be used, as we now discuss
Association Clusters
For a given query q, let
Dl: local document set, i.e., set of documents retrieved by q
Nl: number of documents in Dl
Vl: local vocabulary, i.e., set of all distinct words in Dl
fi,j: frequency of occurrence of a term ki in a document dj ∈ Dl
M = [mij]: term-document matrix with |Vl| rows and Nl columns
mij = fi,j: an element of matrix M
MT: transpose of M
The matrix C = M MT is a local term-term correlation matrix
Association Clusters
Each element cu,v ∈ C expresses a correlation between terms ku and kv
This relationship between the terms is based on their joint co-occurrences inside documents of the collection
The higher the number of documents in which the two terms co-occur, the stronger is this correlation
Correlation strengths can be used to define local clusters of neighbor terms
Terms in a same cluster can then be used for query expansion
We consider three types of clusters here: association clusters, metric clusters, and scalar clusters
Association Clusters
An association cluster is computed from a local correlation matrix C
For that, we re-define the correlation factors cu,v between any pair of terms ku and kv, as follows:

c_{u,v} = \sum_{d_j \in D_l} f_{u,j} \times f_{v,j}

In this case the correlation matrix is referred to as a local association matrix
The motivation is that terms that co-occur frequently inside documents have a synonymity association
Association Clusters
The correlation factors cu,v and the association matrix C are said to be unnormalized
An alternative is to normalize the correlation factors:

c_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}

In this case the association matrix C is said to be normalized
Association Clusters
Given a local association matrix C, we can use it to build local association clusters as follows
Let Cu(n) be a function that returns the n largest factors cu,v ∈ C, where v varies over the set of local terms and v ≠ u
Then, Cu(n) defines a local association cluster, a neighborhood, around the term ku
Given a query q, we are normally interested in finding clusters only for the |q| query terms
This means that such clusters can be computed efficiently at query time
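The unnormalized association matrix and the neighborhood Cu(n) can be sketched as follows (a toy illustration; the document contents are invented, and documents are given as token lists):

```python
from collections import Counter

def association_matrix(docs):
    """Unnormalized local association matrix: c(u,v) = sum_j f(u,j) * f(v,j),
    computed over the local document set."""
    c = Counter()
    for doc in docs:
        freqs = Counter(doc)
        for u in freqs:
            for v in freqs:
                c[(u, v)] += freqs[u] * freqs[v]
    return c

def neighborhood(c, term, n):
    """Cu(n): the n terms v != term with the largest factors c(term, v)."""
    cands = [(v, val) for (u, v), val in c.items() if u == term and v != term]
    cands.sort(key=lambda kv: (-kv[1], kv[0]))   # by factor, ties by name
    return [v for v, _ in cands[:n]]

docs = [["ir", "query", "expansion", "query"],
        ["ir", "query", "feedback"],
        ["ranking", "feedback"]]
c = association_matrix(docs)
nbrs = neighborhood(c, "query", 2)   # terms most associated with "query"
```

In a feedback cycle, only the |q| rows for the query terms would be needed, so the computation can be restricted to those terms at query time.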
Metric Clusters
Association clusters do not take into account where the terms occur in a document
However, two terms that occur in a same sentence tend to be more correlated
A metric cluster re-defines the correlation factors cu,v as a function of their distances in documents
Metric Clusters
Let k_u(n, j) be a function that returns the n-th occurrence of term k_u in document d_j
Further, let r(k_u(n, j), k_v(m, j)) be a function that computes the distance between
the n-th occurrence of term k_u in document d_j; and
the m-th occurrence of term k_v in document d_j
We define,

c_{u,v} = \sum_{d_j \in D_l} \sum_{n} \sum_{m} \frac{1}{r(k_u(n, j), k_v(m, j))}

In this case the correlation matrix is referred to as a local metric matrix
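A minimal sketch of the metric factor, assuming each document is given simply as a list of tokens (the example documents are illustrative):

```python
def metric_correlation(docs, ku, kv):
    """Metric correlation factor for two distinct terms ku and kv: for
    every pair of their occurrences inside the same document, add the
    inverse of the word distance (pairs in distinct documents add 0)."""
    c = 0.0
    for tokens in docs:  # each document as a list of tokens
        pos_u = [i for i, t in enumerate(tokens) if t == ku]
        pos_v = [i for i, t in enumerate(tokens) if t == kv]
        c += sum(1.0 / abs(pu - pv) for pu in pos_u for pv in pos_v)
    return c

docs = [["query", "expansion", "improves", "query", "recall"]]
# occurrence distances are 1 and 2, so c = 1/1 + 1/2 = 1.5
```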
Notice that if k_u and k_v are in distinct documents we take their distance to be infinity
Variations of the above expression for c_{u,v} have been reported in the literature, such as 1/r^2(k_u(n, j), k_v(m, j))
The metric correlation factor c_{u,v} quantifies absolute inverse distances and is said to be unnormalized
Thus, the local metric matrix C is said to be unnormalized
An alternative is to normalize the correlation factor
For instance,

c_{u,v} = \frac{c_{u,v}}{\text{total number of } [k_u, k_v] \text{ pairs considered}}

In this case the local metric matrix C is said to be normalized
Scalar Clusters
The correlation between two local terms can also be defined by comparing the neighborhoods of the two terms
The idea is that two terms with similar neighborhoods have some synonymity relationship
In this case we say that the relationship is indirect or induced by the neighborhood
We can quantify this relationship by comparing the neighborhoods of the terms through a scalar measure
For instance, the cosine of the angle between the two vectors is a popular scalar similarity measure
Let
\vec{s}_u = (c_{u,x_1}, c_{u,x_2}, \ldots, c_{u,x_n}): vector of neighborhood correlation values for the term k_u
\vec{s}_v = (c_{v,y_1}, c_{v,y_2}, \ldots, c_{v,y_m}): vector of neighborhood correlation values for the term k_v
Define,

c_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \, |\vec{s}_v|}

In this case the correlation matrix C is referred to as a local scalar matrix
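Since the induced correlation is just the cosine between two neighborhood vectors, it can be sketched directly (the example vectors are illustrative):

```python
import math

def scalar_correlation(su, sv):
    """Cosine between two neighborhood-correlation vectors represented as
    {term: correlation} dicts; terms absent from a dict count as zero."""
    dot = sum(w * sv.get(t, 0.0) for t, w in su.items())
    nu = math.sqrt(sum(w * w for w in su.values()))
    nv = math.sqrt(sum(w * w for w in sv.values()))
    return dot / (nu * nv) if nu and nv else 0.0

su = {"a": 1.0, "b": 2.0}
sv = {"a": 2.0, "b": 4.0}   # parallel neighborhoods -> cosine 1.0
```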
The local scalar matrix C is said to be induced by the neighborhood
Let C_u(n) be a function that returns the n largest c_{u,v} values in a local scalar matrix C, with v \neq u
Then, C_u(n) defines a scalar cluster around term k_u
Neighbor Terms
Terms that belong to clusters associated with the query terms can be used to expand the original query
Such terms are called neighbors of the query terms and are characterized as follows
A term k_v that belongs to a cluster C_u(n), associated with another term k_u, is said to be a neighbor of k_u
Often, neighbor terms represent distinct keywords that are correlated by the current query context
Consider the problem of expanding a given user query q with neighbor terms
One possibility is to expand the query as follows
For each term k_u \in q, select m neighbor terms from the cluster C_u(n) and add them to the query
This can be expressed as follows:

q_m = q \cup \{k_v \mid k_v \in C_u(n), k_u \in q\}

Hopefully, the additional neighbor terms k_v will retrieve new relevant documents
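The expansion rule above can be sketched as follows, assuming a cluster function C_u(n) is already available (the neighbor table here is illustrative):

```python
def expand_query(query, cluster, n=5, m=3):
    """q_m = q ∪ {k_v | k_v ∈ C_u(n), k_u ∈ q}: for each query term,
    add up to m neighbors taken from its size-n cluster."""
    expanded = list(query)
    for ku in query:
        for kv in cluster(ku, n)[:m]:
            if kv not in expanded:
                expanded.append(kv)
    return expanded

# Toy cluster function backed by a fixed neighbor table
neighbors = {"car": ["auto", "vehicle", "motor"], "insurance": ["policy"]}
cluster = lambda t, n: neighbors.get(t, [])[:n]
q_m = expand_query(["car", "insurance"], cluster)
# -> ["car", "insurance", "auto", "vehicle", "motor", "policy"]
```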
The set C_u(n) might be composed of terms obtained using either normalized or unnormalized correlation factors
Query expansion is important because it tends to improve recall
However, the larger number of documents to rank also tends to lower precision
Thus, query expansion needs to be exercised with great care and fine tuned for the collection at hand
Local Context Analysis
The local clustering techniques are based on the set of documents retrieved for a query
A distinct approach is to search for term correlations in the whole collection
Global techniques usually involve the building of a thesaurus that encodes term relationships in the whole collection
The terms are treated as concepts and the thesaurus is viewed as a concept relationship structure
The building of a thesaurus usually considers the use of small contexts and phrase structures
Local context analysis is an approach that combines global and local analysis
It is based on the use of noun groups, i.e., a single noun, two nouns, or three adjacent nouns in the text
Noun groups selected from the top ranked documents are treated as document concepts
However, instead of documents, passages are used for determining term co-occurrences
Passages are text windows of fixed size
More specifically, the local context analysis procedure operates in three steps
First, retrieve the top n ranked passages using the original query
Second, for each concept c in the passages compute the similarity sim(q, c) between the whole query q and the concept c
Third, the top m ranked concepts, according to sim(q, c), are added to the original query q
A weight computed as 1 - 0.9 \times i/m is assigned to each concept c, where
i: position of c in the concept ranking
m: number of concepts to add to q
The terms in the original query q might be stressed by assigning a weight equal to 2 to each of them
Of these three steps, the second one is the most complex and the one which we now discuss
The similarity sim(q, c) between each concept c and the original query q is computed as follows

sim(q, c) = \prod_{k_i \in q} \left( \delta + \frac{\log(f(c, k_i) \times idf_c)}{\log n} \right)^{idf_i}

where n is the number of top ranked passages considered and \delta is a small constant (typically 0.1) that avoids a zero similarity
The function f(c, k_i) quantifies the correlation between the concept c and the query term k_i and is given by

f(c, k_i) = \sum_{j=1}^{n} pf_{i,j} \times pf_{c,j}

where
pf_{i,j} is the frequency of term k_i in the j-th passage; and
pf_{c,j} is the frequency of the concept c in the j-th passage
Notice that this is the correlation measure defined for association clusters, but adapted for passages
The inverse document frequency factors are computed as

idf_i = \max\left(1, \frac{\log_{10}(N/np_i)}{5}\right)

idf_c = \max\left(1, \frac{\log_{10}(N/np_c)}{5}\right)

where
N is the number of passages in the collection
np_i is the number of passages containing the term k_i; and
np_c is the number of passages containing the concept c
The idf_i factor in the exponent is introduced to emphasize infrequent query terms
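Putting the three formulas together, the concept score can be sketched as below (with δ = 0.1; the floor on f(c, k_i) for non-co-occurring terms is our own guard, not part of the original formulation, and the example counts are illustrative):

```python
import math

def lca_sim(query_terms, concept, passages, N, np, delta=0.1):
    """Local context analysis score sim(q, c).
    passages: {term: frequency} dicts for the top n ranked passages.
    N: passages in the collection; np[t]: passages containing t."""
    n = len(passages)
    idf = lambda t: max(1.0, math.log10(N / np[t]) / 5)
    sim = 1.0
    for ki in query_terms:
        # f(c, ki): association-cluster correlation over the passages
        f = sum(p.get(ki, 0) * p.get(concept, 0) for p in passages)
        f = max(f, 1e-6)  # guard against log(0)
        sim *= (delta + math.log(f * idf(concept)) / math.log(n)) ** idf(ki)
    return sim

passages = [{"q1": 2, "c": 1}, {"q1": 1, "c": 2}, {"c": 1}]
score = lca_sim(["q1"], "c", passages, N=100, np={"q1": 50, "c": 10})
```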
The procedure above for computing sim(q, c) is a non-trivial variant of tf-idf ranking
It has been adjusted for operation with TREC data and did not work so well with a different collection
Thus, it is important to have in mind that tuning might be required for operation with a different collection
Implicit Feedback Through Global Analysis
Global Context Analysis
The methods of local analysis extract information from the local set of documents retrieved to expand the query
An alternative approach is to expand the query using information from the whole set of documents, a strategy usually referred to as global analysis
We distinguish two global analysis procedures:
Query expansion based on a similarity thesaurus
Query expansion based on a statistical thesaurus
Query Expansion based on a Similarity Thesaurus
Similarity Thesaurus
We now discuss a query expansion model based on a global similarity thesaurus constructed automatically
The similarity thesaurus is based on term to term relationships rather than on a matrix of co-occurrence
Special attention is paid to the selection of terms for expansion and to the reweighting of these terms
Terms for expansion are selected based on their similarity to the whole query
A similarity thesaurus is built using term to term relationships
These relationships are derived by considering that the terms are concepts in a concept space
In this concept space, each term is indexed by the documents in which it appears
Thus, terms assume the original role of documents while documents are interpreted as indexing elements
Let,
t: number of terms in the collection
N: number of documents in the collection
f_{i,j}: frequency of term k_i in document d_j
t_j: number of distinct index terms in document d_j
Then,

itf_j = \log \frac{t}{t_j}

is the inverse term frequency for document d_j (analogous to inverse document frequency)
Within this framework, with each term k_i is associated a vector \vec{k}_i given by

\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})

These weights are computed as follows

w_{i,j} = \frac{\left(0.5 + 0.5 \frac{f_{i,j}}{\max_j(f_{i,j})}\right) \times itf_j}{\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5 \frac{f_{i,l}}{\max_l(f_{i,l})}\right)^2 \times itf_l^2}}

where \max_j(f_{i,j}) computes the maximum of all f_{i,j} factors for the i-th term
The relationship between two terms k_u and k_v is computed as a correlation factor c_{u,v} given by

c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}
The global similarity thesaurus is given by the scalar term-term matrix composed of correlation factors c_{u,v}
This global similarity thesaurus has to be computed only once and can be updated incrementally
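A compact sketch of the thesaurus construction, assuming the term-document frequencies are available as nested dicts (and that no document contains every term, so itf_j > 0; all names are illustrative):

```python
import math

def term_vectors(f):
    """Term vectors k_i = (w_{i,1}, ..., w_{i,N}) from a term-document
    frequency map f: {term: {doc: freq}}, using the itf-based weights."""
    docs = sorted({d for row in f.values() for d in row})
    t = len(f)
    # itf_j = log(t / t_j), with t_j the number of distinct terms in d_j
    itf = {d: math.log(t / sum(1 for row in f.values() if d in row))
           for d in docs}
    vectors = {}
    for ki, row in f.items():
        fmax = max(row.values())
        w = {d: (0.5 + 0.5 * row.get(d, 0) / fmax) * itf[d] for d in docs}
        norm = math.sqrt(sum(x * x for x in w.values()))
        vectors[ki] = {d: x / norm for d, x in w.items()}
    return vectors

def correlation(vectors, ku, kv):
    """Thesaurus entry c_{u,v} = k_u · k_v."""
    return sum(w * vectors[kv][d] for d, w in vectors[ku].items())

vectors = term_vectors({"a": {"d1": 2}, "b": {"d1": 1, "d2": 3}, "c": {"d2": 1}})
```

Since each vector is normalized, the self-correlation c_{u,u} is 1 and all entries lie in [0, 1] for non-negative weights.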
Given the global similarity thesaurus, query expansion is done in three steps as follows
First, represent the query in the same vector space used for representing the index terms
Second, compute a similarity sim(q, k_v) between each term k_v correlated to the query terms and the whole query q
Third, expand the query with the top r ranked terms according to sim(q, k_v)
For the first step, the query is represented by a vector \vec{q} given by

\vec{q} = \sum_{k_i \in q} w_{i,q} \vec{k}_i

where w_{i,q} is a term-query weight computed using the equation for w_{i,j}, but with q in place of d_j
For the second step, the similarity sim(q, k_v) is computed as

sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_i \in q} w_{i,q} \times c_{i,v}
A term k_v might be closer to the whole query centroid \vec{q}_C than to the individual query terms
Thus, terms selected here might be distinct from those selected by previous global analysis methods
For the third step, the top r ranked terms are added to the query q to form the expanded query q_m
To each expansion term k_v in query q_m is assigned a weight w_{v,q_m} given by

w_{v,q_m} = \frac{sim(q, k_v)}{\sum_{k_i \in q} w_{i,q}}
The expanded query q_m is then used to retrieve new documents
This technique has yielded improved retrieval performance (in the range of 20%) with three different collections
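The three expansion steps can be sketched on top of the correlation factors c_{u,v} (the query weights and thesaurus below are illustrative):

```python
def expand_with_thesaurus(query_weights, c, r=2):
    """Expand a query using a global similarity thesaurus.
    query_weights: {term: w_iq}; c: {term: {term: correlation}}."""
    # sim(q, kv) = sum over query terms ki of w_iq * c_iv
    sims = {}
    for ki, wiq in query_weights.items():
        for kv, civ in c.get(ki, {}).items():
            if kv not in query_weights:
                sims[kv] = sims.get(kv, 0.0) + wiq * civ
    # keep the top-r terms, each weighted by sim(q, kv) / sum of w_iq
    total = sum(query_weights.values())
    top = sorted(sims, key=sims.get, reverse=True)[:r]
    expanded = dict(query_weights)
    expanded.update({kv: sims[kv] / total for kv in top})
    return expanded

qw = {"car": 1.0}
thesaurus = {"car": {"auto": 0.8, "vehicle": 0.6, "tree": 0.1}}
q_m = expand_with_thesaurus(qw, thesaurus)   # adds "auto" and "vehicle"
```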
Consider a document d_j which is represented in the term vector space by \vec{d}_j = \sum_{k_i \in d_j} w_{i,j} \vec{k}_i
Assume that the query q is expanded to include all the t index terms (properly weighted) in the collection
Then, the similarity sim(q, d_j) between d_j and q can be computed in the term vector space by

sim(q, d_j) \propto \sum_{k_v \in d_j} \sum_{k_u \in q} w_{v,j} \times w_{u,q} \times c_{u,v}
The previous expression is analogous to the similarity formula in the generalized vector space model
Thus, the generalized vector space model can be interpreted as a query expansion technique
The two main differences are
the weights are computed differently
only the top r ranked terms are used
Query Expansion based on a Statistical Thesaurus
Global Statistical Thesaurus
We now discuss a query expansion technique based on a global statistical thesaurus
The approach is quite distinct from the one based on a similarity thesaurus
The global thesaurus is composed of classes that group correlated terms in the context of the whole collection
Such correlated terms can then be used to expand the original user query
To be effective, the terms selected for expansion must have high term discrimination values
This implies that they must be low frequency terms
However, it is difficult to cluster low frequency terms due to the small amount of information about them
To circumvent this problem, documents are clustered into classes
The low frequency terms in these documents are then used to define thesaurus classes
A document clustering algorithm that produces small and tight clusters is the complete link algorithm:
1. Initially, place each document in a distinct cluster
2. Compute the similarity between all pairs of clusters
3. Determine the pair of clusters [C_u, C_v] with the highest inter-cluster similarity
4. Merge the clusters C_u and C_v
5. Verify a stop criterion (if this criterion is not met then go back to step 2)
6. Return a hierarchy of clusters
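The steps above can be sketched as below; for brevity the sketch returns the flat clustering at a stop threshold rather than the full merge hierarchy, and the document similarity function is illustrative:

```python
def complete_link(docs, sim, threshold):
    """Agglomerative complete-link clustering of document ids.
    Cluster similarity = minimum pairwise document similarity."""
    clusters = [[d] for d in docs]        # 1. one document per cluster
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):    # 2-3. most similar cluster pair
            for j in range(i + 1, len(clusters)):
                s = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:              # 5. stop criterion
            break
        i, j = pair                       # 4. merge C_u and C_v
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

points = {"d1": 0.0, "d2": 0.1, "d3": 5.0}
sim = lambda a, b: 1.0 / (1.0 + abs(points[a] - points[b]))
clusters = complete_link(["d1", "d2", "d3"], sim, threshold=0.5)
# -> [["d1", "d2"], ["d3"]]
```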
The similarity between two clusters is defined as the minimum of the similarities between two documents not in the same cluster
To compute the similarity between documents in a pair, the cosine formula of the vector model is used
As a result of this minimality criterion, the resultant clusters tend to be small and tight
Consider that the whole document collection has been clustered using the complete link algorithm
The figure below illustrates a portion of the whole cluster hierarchy generated by the complete link algorithm, where the inter-cluster similarities are shown in the ovals
[Figure: clusters C_u and C_v merge at similarity 0.15, and the resulting cluster merges with C_z at similarity 0.11]
The terms that compose each class of the global thesaurus are selected as follows
Obtain from the user three parameters:
TC: threshold class
NDC: number of documents in a class
MIDF: minimum inverse document frequency
Parameter TC determines the document clusters that will be used to generate thesaurus classes
Two clusters C_u and C_v are selected when TC is surpassed by sim(C_u, C_v)
Use NDC as a limit on the number of documents in the clusters
For instance, if both C_{u+v} and C_{u+v+z} are selected then the parameter NDC might be used to decide between the two
MIDF defines the minimum value of IDF for any term which is selected to participate in a thesaurus class
Given that the thesaurus classes have been built, they can be used for query expansion
For this, an average term weight \overline{wt}_C for each thesaurus class C is computed as follows

\overline{wt}_C = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|}

where
|C| is the number of terms in the thesaurus class C, and
w_{i,C} is a weight associated with the term-class pair [k_i, C]
This average term weight can then be used to compute a thesaurus class weight w_C as

w_C = \frac{\overline{wt}_C}{|C| \times 0.5}
The above weight formulations have been verified through experimentation and have yielded good results