Slides Chap05
Transcript of Slides Chap05
-
8/13/2019 Slides Chap05
1/104
Modern Information Retrieval
Chapter 5: Relevance Feedback and Query Expansion
Introduction
A Framework for Feedback Methods
Explicit Relevance Feedback
Explicit Feedback Through Clicks
Implicit Feedback Through Local Analysis
Implicit Feedback Through Global Analysis
Trends and Research Issues
Chap 05: Relevance Feedback and Query Expansion, Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition p. 1
Introduction
Most users find it difficult to formulate queries that are well designed for retrieval purposes
Yet, most users often need to reformulate their queries to obtain the results of their interest
Thus, the first query formulation should be treated as an initial attempt to retrieve relevant information
Documents initially retrieved could be analyzed for relevance and used to improve the initial query
Introduction
The process of query modification is commonly referred to as:
relevance feedback, when the user provides information on relevant documents to a query, or
query expansion, when information related to the query is used to expand it
We refer to both of them as feedback methods
Two basic approaches of feedback methods:
explicit feedback, in which the information for query reformulation is provided directly by the users, and
implicit feedback, in which the information for query reformulation is implicitly derived by the system
A Framework for Feedback Methods
A Framework
Consider a set of documents Dr that are known to be relevant to the current query q
In relevance feedback, the documents in Dr are used to transform q into a modified query qm
However, obtaining information on documents relevant to a query requires the direct interference of the user
Most users are unwilling to provide this information, particularly in the Web
A Framework
Because of this high cost, the idea of relevance feedback has been relaxed over the years
Instead of asking the users for the relevant documents, we could:
Look at documents they have clicked on; or
Look at terms belonging to the top documents in the result set
In both cases, it is expected that the feedback cycle will produce results of higher quality
A Framework
A feedback cycle is composed of two basic steps:
Determine feedback information that is either related or expected to be related to the original query q, and
Determine how to transform query q to take this information effectively into account
The first step can be accomplished in two distinct ways:
Obtain the feedback information explicitly from the users
Obtain the feedback information implicitly from the query results or from external sources such as a thesaurus
A Framework
In an explicit relevance feedback cycle, the feedback information is provided directly by the users
However, collecting feedback information is expensive and time consuming
In the Web, user clicks on search results constitute a new source of feedback information
A click indicates a document that is of interest to the user in the context of the current query
Notice that a click does not necessarily indicate a document that is relevant to the query
Explicit Feedback Information
A Framework
In an implicit relevance feedback cycle, the feedback information is derived implicitly by the system
There are two basic approaches for compiling implicit feedback information:
local analysis, which derives the feedback information from the top ranked documents in the result set
global analysis, which derives the feedback information from external sources such as a thesaurus
Implicit Feedback Information
Explicit Relevance Feedback
Explicit Relevance Feedback
In a classic relevance feedback cycle, the user is presented with a list of the retrieved documents
Then, the user examines them and marks those that are relevant
In practice, only the top 10 (or 20) ranked documents need to be examined
The main idea consists of
selecting important terms from the documents that have been identified as relevant, and
enhancing the importance of these terms in a new query formulation
Explicit Relevance Feedback
Expected effect: the new query will be moved towards the relevant docs and away from the non-relevant ones
Early experiments have shown good improvements in precision for small test collections
Relevance feedback presents the following characteristics:
it shields the user from the details of the query reformulation process (all the user has to provide is a relevance judgement)
it breaks down the whole searching task into a sequence of small steps which are easier to grasp
The Rocchio Method
The Rocchio Method
Documents identified as relevant (to a given query) have similarities among themselves
Further, non-relevant docs have term-weight vectors which are dissimilar from those of the relevant documents
The basic idea of the Rocchio Method is to reformulate the query such that it gets:
closer to the neighborhood of the relevant documents in the vector space, and
away from the neighborhood of the non-relevant documents
The Rocchio Method
Let us define terminology regarding the processing of a given query q, as follows:
Dr: set of relevant documents among the documents retrieved
Nr: number of documents in set Dr
Dn: set of non-relevant docs among the documents retrieved
Nn: number of documents in set Dn
Cr: set of relevant docs among all documents in the collection
N: number of documents in the collection
α, β, γ: tuning constants
The Rocchio Method
Consider that the set Cr is known in advance
Then, the best query vector for distinguishing the relevant from the non-relevant docs is given by

\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \; - \; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j

where
|Cr| refers to the cardinality of the set Cr
\vec{d}_j is a weighted term vector associated with document dj, and
\vec{q}_{opt} is the optimal weighted term vector for query q
The Rocchio Method
However, the set Cr is not known a priori
To solve this problem, we can formulate an initial query and incrementally change the initial query vector
The Rocchio Method
There are three classic and similar ways to calculate the modified query \vec{q}_m, as follows:

Standard_Rocchio:  \vec{q}_m = \alpha \vec{q} + \frac{\beta}{N_r} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{N_n} \sum_{\vec{d}_j \in D_n} \vec{d}_j

Ide_Regular:  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j

Ide_Dec_Hi:  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \, max\_rank(D_n)

where max_rank(Dn) is the highest ranked non-relevant doc
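To make the formula concrete, here is a minimal sketch of the Standard_Rocchio reformulation, representing term-weight vectors as Python dicts (the function name, the dict representation, and the default values of α, β, γ are our illustrative choices, not from the book):

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Standard_Rocchio: qm = alpha*q + (beta/Nr)*sum(Dr) - (gamma/Nn)*sum(Dn).

    Vectors are dicts mapping term -> weight; negative weights are clipped out,
    so the modified query only keeps positively weighted terms.
    """
    qm = {t: alpha * w for t, w in query.items()}
    for d in relevant:                      # move towards the relevant docs
        for t, w in d.items():
            qm[t] = qm.get(t, 0.0) + beta * w / len(relevant)
    for d in non_relevant:                  # move away from the non-relevant docs
        for t, w in d.items():
            qm[t] = qm.get(t, 0.0) - gamma * w / len(non_relevant)
    return {t: w for t, w in qm.items() if w > 0}

q  = {"jaguar": 1.0}
dr = [{"jaguar": 0.5, "car": 0.8}]          # judged relevant
dn = [{"jaguar": 0.4, "animal": 0.9}]       # judged non-relevant
qm = rocchio(q, dr, dn)
```

Note how the term "car", absent from the original query, enters the modified query: this is the query expansion effect of the vector methods.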
The Rocchio Method
Three different setups of the parameters in the Rocchio formula are as follows:
α = 1, proposed by Rocchio
α = β = γ = 1, proposed by Ide
γ = 0, which yields a positive feedback strategy
The current understanding is that the three techniques yield similar results
The main advantages of the above relevance feedback techniques are simplicity and good results:
Simplicity: modified term weights are computed directly from the set of retrieved documents
Good results: the modified query vector does reflect a portion of the intended query semantics (observed experimentally)
Relevance Feedback for the Probabilistic Model
A Probabilistic Method
The probabilistic model ranks documents for a query q according to the probabilistic ranking principle
The similarity of a document dj to a query q in the probabilistic model can be expressed as

sim(d_j, q) \sim \sum_{k_i \in q \wedge k_i \in d_j} \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)

where
P(ki|R) stands for the probability of observing the term ki in the set R of relevant documents
P(ki|R̄) stands for the probability of observing the term ki in the set R̄ of non-relevant docs
A Probabilistic Method
Initially, the equation above cannot be used because P(ki|R) and P(ki|R̄) are unknown
Different methods for estimating these probabilities automatically were discussed in Chapter 3
With user feedback information, these probabilities are estimated in a slightly different way
For the initial search (when there are no retrieved documents yet), assumptions often made include:
P(ki|R) is constant for all terms ki (typically 0.5)
the term probability distribution P(ki|R̄) can be approximated by the distribution in the whole collection
A Probabilistic Method
These two assumptions yield:

P(k_i|R) = 0.5 \qquad P(k_i|\bar{R}) = \frac{n_i}{N}

where ni stands for the number of documents in the collection that contain the term ki
Substituting into the similarity equation, we obtain

sim_{initial}(d_j, q) = \sum_{k_i \in q \wedge k_i \in d_j} \log \frac{N - n_i}{n_i}

For the feedback searches, the accumulated statistics on relevance are used to evaluate P(ki|R) and P(ki|R̄)
A Probabilistic Method
Let nr,i be the number of documents in set Dr that contain the term ki
Then, the probabilities P(ki|R) and P(ki|R̄) can be approximated by

P(k_i|R) = \frac{n_{r,i}}{N_r} \qquad P(k_i|\bar{R}) = \frac{n_i - n_{r,i}}{N - N_r}

Using these approximations, the similarity equation can be rewritten as

sim(d_j, q) = \sum_{k_i \in q \wedge k_i \in d_j} \left( \log \frac{n_{r,i}}{N_r - n_{r,i}} + \log \frac{N - N_r - (n_i - n_{r,i})}{n_i - n_{r,i}} \right)
A Probabilistic Method
Notice that here, contrary to the Rocchio Method, no query expansion occurs
The same query terms are reweighted using feedback information provided by the user
The formula above poses problems for certain small values of Nr and nr,i
For this reason, a 0.5 adjustment factor is often added to the estimation of P(ki|R) and P(ki|R̄):

P(k_i|R) = \frac{n_{r,i} + 0.5}{N_r + 1} \qquad P(k_i|\bar{R}) = \frac{n_i - n_{r,i} + 0.5}{N - N_r + 1}
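The feedback reweighting with the 0.5 adjustment can be sketched as follows (a hypothetical helper; the function and parameter names are ours):

```python
import math

def feedback_term_weight(n_i, nr_i, N, Nr):
    """Probabilistic feedback weight for a query term ki, using the
    0.5-adjusted estimates:
      p = P(ki|R)   ~ (nr_i + 0.5) / (Nr + 1)
      u = P(ki|~R)  ~ (n_i - nr_i + 0.5) / (N - Nr + 1)
    weight = log(p / (1 - p)) + log((1 - u) / u)
    """
    p = (nr_i + 0.5) / (Nr + 1)
    u = (n_i - nr_i + 0.5) / (N - Nr + 1)
    return math.log(p / (1 - p)) + math.log((1 - u) / u)

# a term occurring in 3 of 5 relevant docs but in only 10 of 1000 docs overall
w = feedback_term_weight(n_i=10, nr_i=3, N=1000, Nr=5)
```

A term frequent among the relevant documents but rare in the collection gets a high weight; the same relevant-set frequency for a very common term yields a much lower weight.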
A Probabilistic Method
The main advantage of this feedback method is the derivation of new weights for the query terms
The disadvantages include:
document term weights are not taken into account during the feedback loop;
weights of terms in the previous query formulations are disregarded; and
no query expansion is used (the same set of index terms in the original query is reweighted over and over again)
Thus, this method does not in general operate as effectively as the vector modification methods
Evaluation of Relevance Feedback
Evaluation of Relevance Feedback
Consider the modified query vector qm produced by expanding q with relevant documents, according to the Rocchio formula
Evaluation of qm:
Compare the documents retrieved by qm with the set of relevant documents for q
In general, the results show spectacular improvements
However, a part of this improvement results from the higher ranks assigned to the relevant docs used to expand q into qm
Since the user has seen these docs already, such evaluation is unrealistic
The Residual Collection
A more realistic approach is to evaluate qm considering only the residual collection
We call residual collection the set of all docs minus the set of feedback docs provided by the user
Then, the recall-precision figures for qm tend to be lower than the figures for the original query vector q
This is not a limitation because the main purpose of the process is to compare distinct relevance feedback strategies
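The residual-collection evaluation can be sketched as follows (a minimal illustration; the function and document names are ours): remove the feedback documents from both the ranking and the relevant set before computing precision and recall.

```python
def residual_precision_recall(ranking, relevant, feedback_docs, k=10):
    """Evaluate a ranking on the residual collection: the collection minus
    the feedback docs the user has already judged."""
    seen = set(feedback_docs)
    residual_rank = [d for d in ranking if d not in seen]
    residual_rel = set(relevant) - seen
    top = residual_rank[:k]
    hits = sum(1 for d in top if d in residual_rel)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(residual_rel) if residual_rel else 0.0
    return precision, recall

p, r = residual_precision_recall(
    ranking=["d1", "d2", "d3", "d4", "d5"],
    relevant={"d1", "d3", "d5"},
    feedback_docs={"d1"},          # already judged by the user
    k=3,
)
```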
Explicit Feedback Through Clicks
Explicit Feedback Through Clicks
Web search engine users not only inspect the answers to their queries, they also click on them
The clicks reflect preferences for particular answers in the context of a given query
They can be collected in large numbers without interfering with the user actions
The immediate question is whether they also reflect relevance judgements on the answers
Under certain restrictions, the answer is affirmative, as we now discuss
Eye Tracking
Clickthrough data provides limited information on the user behavior
One approach to complement information on user behavior is to use eye tracking devices
Such commercially available devices can be used to determine the area of the screen the user is focused on
The approach allows correctly detecting the area of the screen of interest to the user in 60-90% of the cases
Further, the cases for which the method does not work can be determined
Eye Tracking
Eye movements can be classified in four types: fixations, saccades, pupil dilation, and scan paths
Fixations are a gaze at a particular area of the screen lasting for 200-300 milliseconds
This time interval is large enough to allow effective brain capture and interpretation of the image displayed
Fixations are the ocular activity normally associated with visual information acquisition and processing
That is, fixations are key to interpreting user behavior
Relevance Judgements
To evaluate the quality of the results, eye tracking is not appropriate
This evaluation requires selecting a set of test queries and determining relevance judgements for them
This is also the case if we intend to evaluate the quality of the signal produced by clicks
User Behavior
Eye tracking experiments have shown that users scan the query results from top to bottom
The users inspect the first and second results right away, within the second or third fixation
Further, they tend to scan the top 5 or top 6 answers thoroughly, before scrolling down to see other answers
User Behavior
Percentage of times each one of the top results was viewed and clicked on by a user, for 10 test tasks and 29 subjects (Thorsten Joachims et al, http://portal.acm.org/citation.cfm?id=1229179.1229181)
User Behavior
We notice that the users inspect the top 2 answers almost equally, but they click three times more on the first
This might be indicative of a user bias towards the search engine
That is, the users tend to trust the search engine in recommending a top result that is relevant
User Behavior
This can be better understood by presenting test subjects with two distinct result sets:
the normal ranking returned by the search engine, and
a modified ranking in which the top 2 results have their positions swapped
Analysis suggests that the user displays a trust bias in the search engine that favors the top result
That is, the position of the result has great influence on the user's decision to click on it
Clicks as a Metric of Preferences
Thus, it is clear that interpreting clicks as a direct indicative of relevance is not the best approach
More promising is to interpret clicks as a metric of user preferences
For instance, a user can look at a result and decide to skip it to click on a result that appears lower
In this case, we say that the user prefers the result clicked on to the result shown higher in the ranking
This type of preference relation takes into account:
the results clicked on by the user
the results that were inspected and not clicked on
Clicks within a Same Query
To interpret clicks as user preferences, we adopt the following definitions
Given a ranking function R(qi, dj), let rk be the kth ranked result
That is, r1, r2, r3 stand for the first, the second, and the third top results, respectively
Further, let r̂k indicate that the user has clicked on the kth result
Define a preference function rk > rk−n, 0 < k − n < k, that states that, according to the click actions of the user, the kth top result is preferable to the (k−n)th result
Clicks within a Same Query
To illustrate, consider the following example regarding the click behavior of a user (clicked results are marked with a hat):

r1  r2  r̂3  r4  r̂5  r6  r7  r8  r9  r̂10

This behavior does not allow us to make definitive statements about the relevance of results r3, r5, and r10
However, it does allow us to make statements on the relative preferences of this user
Two distinct strategies to capture the preference relations in this case are as follows:
Skip-Above: if r̂k, then rk > rk−n, for all rk−n that was not clicked
Skip-Previous: if r̂k and rk−1 has not been clicked, then rk > rk−1
Clicks within a Same Query
To illustrate, consider again the following example regarding the click behavior of a user (clicks on r3, r5, and r10):

r1  r2  r̂3  r4  r̂5  r6  r7  r8  r9  r̂10

According to the Skip-Above strategy, we have, for the click on r3:
r3 > r2; r3 > r1
And, according to the Skip-Previous strategy, we have:
r3 > r2
We notice that the Skip-Above strategy produces more preference relations than the Skip-Previous strategy
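A literal reading of the two strategies can be sketched as follows, with clicked ranks given as a set of integers and each preference returned as a pair (preferred rank, less preferred rank); the function names are ours:

```python
def skip_above(clicked):
    """Skip-Above: each clicked result is preferred to every higher-ranked
    result that was not clicked."""
    prefs = []
    for k in sorted(clicked):
        for j in range(1, k):
            if j not in clicked:
                prefs.append((k, j))
    return prefs

def skip_previous(clicked):
    """Skip-Previous: a clicked result is preferred to the result immediately
    above it, if that result was not clicked."""
    return [(k, k - 1) for k in sorted(clicked) if k > 1 and (k - 1) not in clicked]

clicks = {3, 5, 10}          # the example above: clicks on r3, r5, r10
above = skip_above(clicks)   # includes (3, 1), (3, 2), (5, 4), ...
prev = skip_previous(clicks)
```

As the slides note, Skip-Above generates strictly more preference relations than Skip-Previous for the same click set.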
Clicks within a Same Query
Empirical results indicate that user clicks are in agreement with judgements on the relevance of results in roughly 80% of the cases:
Both the Skip-Above and the Skip-Previous strategies produce preference relations
If we swap the first and second results, the clicks still reflect preference relations, for both strategies
If we reverse the order of the top 10 results, the clicks still reflect preference relations, for both strategies
Thus, the clicks of the users can be used as a strong indicative of personal preferences
Further, they can also be used as a strong indicative of the relative relevance of the results for a given query
Clicks within a Query Chain
The discussion above was restricted to the context of a single query
However, in practice, users issue more than one query in their search for answers to a same task
The set of queries associated with a same task can be identified in live query streams
This set constitutes what is referred to as a query chain
The purpose of analysing query chains is to produce new preference relations
Clicks within a Query Chain
To illustrate, consider that two result sets in a same query chain led to the following click actions (clicks on s2 and s5):

r1  r2  r3  r4  r5  r6  r7  r8  r9  r10
s1  ŝ2  s3  s4  ŝ5  s6  s7  s8  s9  s10

where
rj refers to an answer in the first result set
sj refers to an answer in the second result set
In this case, the user only clicked on the second and fifth answers of the second result set
Clicks within a Query Chain
Two distinct strategies to capture the preference relations in this case are as follows:
Top-One-No-Click-Earlier: if there is a k such that ŝk, then sj > r1, for j ≤ 10
Top-Two-No-Click-Earlier: if there is a k such that ŝk, then sj > r1 and sj > r2, for j ≤ 10
According to the first strategy, the following preferences are produced by the click of the user on result s2:
s1 > r1; s2 > r1; s3 > r1; s4 > r1; s5 > r1; . . .
According to the second strategy, we have:
s1 > r1; s2 > r1; s3 > r1; s4 > r1; s5 > r1; . . .
s1 > r2; s2 > r2; s3 > r2; s4 > r2; s5 > r2; . . .
Clicks within a Query Chain
We notice that the second strategy produces twice as many preference relations as the first
These preference relations must be compared with the relevance judgements of the human assessors
The following conclusions were derived:
Both strategies produce preference relations in agreement with the relevance judgements in roughly 80% of the cases
Similar agreements are observed even if we swap the first and second results
Similar agreements are observed even if we reverse the order of the results
Clicks within a Query Chain
These results suggest:
The users provide negative feedback on whole result sets (by not clicking on them)
The users learn with the process and reformulate better queries on the subsequent iterations
Click-based Ranking
Click-based Ranking
Clickthrough information can be used to improve the ranking
This can be done by learning a modified ranking function from click-based preferences
One approach is to use support vector machines (SVMs) to learn the ranking function
Click-based Ranking
In this case, preference relations are transformed into inequalities among weighted term vectors representing the ranked documents
These inequalities are then translated into an SVM optimization problem
The solution of this optimization problem yields the optimal weights for the document terms
The approach proposes the combination of different retrieval functions with different weights
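As an illustrative stand-in for the SVM formulation, the following sketch learns a linear ranking function from preference pairs by subgradient descent on a pairwise hinge loss with L2 shrinkage; the feature values, learning rate, and regularization constant are invented for the example, not taken from the book:

```python
def learn_ranking_weights(pairs, dim, epochs=200, lr=0.1, c=0.1):
    """Learn w such that w . x_pref > w . x_other for each preference pair,
    by subgradient descent on the pairwise hinge loss with an L2 shrinkage
    step (a primal approximation of the ranking-SVM idea)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for pref, other in pairs:
            diff = [a - b for a, b in zip(pref, other)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin < 1.0:                        # violated (or weak) pair
                w = [wi + lr * di for wi, di in zip(w, diff)]
            w = [wi * (1.0 - lr * c) for wi in w]   # regularization shrinkage
    return w

# each doc is a feature vector, e.g. (retrieval_score, title_match) -- illustrative
pairs = [((0.9, 1.0), (0.4, 0.0)),   # clicked doc preferred over skipped doc
         ((0.7, 1.0), (0.8, 0.0))]   # preferred despite slightly lower score
w = learn_ranking_weights(pairs, dim=2)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
```

The learned weights combine the underlying retrieval functions so that every recorded preference is ranked correctly, which is the spirit of the click-based ranking approach.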
Implicit Feedback Through Local Analysis
Local analysis
Local analysis consists in deriving feedback information from the documents retrieved for a given query q
This is similar to a relevance feedback cycle but done without assistance from the user
Two local strategies are discussed here: local clustering and local context analysis
Local Clustering
Local Clustering
Adoption of clustering techniques for query expansion has been a basic approach in information retrieval
The standard procedure is to quantify term correlations and then use the correlated terms for query expansion
Term correlations can be quantified by using global structures, such as association matrices
However, global structures might not adapt well to the local context defined by the current query
To deal with this problem, local clustering can be used, as we now discuss
Association Clusters
For a given query q, let
Dl: local document set, i.e., set of documents retrieved by q
Nl: number of documents in Dl
Vl: local vocabulary, i.e., set of all distinct words in Dl
fi,j: frequency of occurrence of a term ki in a document dj ∈ Dl
M = [mij]: term-document matrix with |Vl| rows and Nl columns
mij = fi,j: an element of matrix M
MT: transpose of M
The matrix C = M MT is a local term-term correlation matrix
Association Clusters
Each element cu,v ∈ C expresses a correlation between terms ku and kv
This relationship between the terms is based on their joint co-occurrences inside documents of the collection
The higher the number of documents in which the two terms co-occur, the stronger is this correlation
Correlation strengths can be used to define local clusters of neighbor terms
Terms in a same cluster can then be used for query expansion
We consider three types of clusters here: association clusters, metric clusters, and scalar clusters
Association Clusters
An association cluster is computed from a local correlation matrix C
For that, we re-define the correlation factors cu,v between any pair of terms ku and kv, as follows:

c_{u,v} = \sum_{d_j \in D_l} f_{u,j} \times f_{v,j}

In this case the correlation matrix is referred to as a local association matrix
The motivation is that terms that co-occur frequently inside documents have a synonymity association
Association Clusters
The correlation factors cu,v and the association matrix C are said to be unnormalized
An alternative is to normalize the correlation factors:

c_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}

In this case the association matrix C is said to be normalized
Association Clusters
Given a local association matrix C, we can use it to build local association clusters as follows
Let Cu(n) be a function that returns the n largest factors cu,v ∈ C, where v varies over the set of local terms and v ≠ u
Then, Cu(n) defines a local association cluster, a neighborhood, around the term ku
Given a query q, we are normally interested in finding clusters only for the |q| query terms
This means that such clusters can be computed efficiently at query time
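The unnormalized association matrix and the neighborhood Cu(n) can be sketched as follows (a toy illustration; the document contents are invented, and documents are given as token lists):

```python
from collections import Counter

def association_matrix(docs):
    """Unnormalized local association matrix: c(u,v) = sum_j f(u,j) * f(v,j),
    computed over the local document set."""
    c = Counter()
    for doc in docs:
        freqs = Counter(doc)
        for u in freqs:
            for v in freqs:
                c[(u, v)] += freqs[u] * freqs[v]
    return c

def neighborhood(c, term, n):
    """Cu(n): the n terms v != term with the largest factors c(term, v)."""
    cands = [(v, val) for (u, v), val in c.items() if u == term and v != term]
    cands.sort(key=lambda kv: (-kv[1], kv[0]))   # by factor, ties by name
    return [v for v, _ in cands[:n]]

docs = [["ir", "query", "expansion", "query"],
        ["ir", "query", "feedback"],
        ["ranking", "feedback"]]
c = association_matrix(docs)
nbrs = neighborhood(c, "query", 2)   # terms most associated with "query"
```

In a feedback cycle, only the |q| rows for the query terms would be needed, so the computation can be restricted to those terms at query time.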
Metric Clusters
Association clusters do not take into account where the terms occur in a document
However, two terms that occur in a same sentence tend to be more correlated
A metric cluster re-defines the correlation factors cu,v as a function of their distances in documents
Metric Clusters
Let k_u(n, j) be a function that returns the n-th occurrence of term k_u in document d_j
Further, let r(k_u(n, j), k_v(m, j)) be a function that computes the distance between
the n-th occurrence of term k_u in document d_j; and
the m-th occurrence of term k_v in document d_j
We define,

c_{u,v} = \sum_{d_j \in D_l} \sum_{n} \sum_{m} \frac{1}{r(k_u(n, j), k_v(m, j))}

In this case the correlation matrix is referred to as a local metric matrix
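A minimal sketch of the metric factor, assuming each document is given simply as a list of tokens (the example documents are illustrative):

```python
def metric_correlation(docs, ku, kv):
    """Metric correlation factor for two distinct terms ku and kv: for
    every pair of their occurrences inside the same document, add the
    inverse of the word distance (pairs in distinct documents add 0)."""
    c = 0.0
    for tokens in docs:  # each document as a list of tokens
        pos_u = [i for i, t in enumerate(tokens) if t == ku]
        pos_v = [i for i, t in enumerate(tokens) if t == kv]
        c += sum(1.0 / abs(pu - pv) for pu in pos_u for pv in pos_v)
    return c

docs = [["query", "expansion", "improves", "query", "recall"]]
# occurrence distances are 1 and 2, so c = 1/1 + 1/2 = 1.5
```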
Notice that if k_u and k_v are in distinct documents we take their distance to be infinity
Variations of the above expression for c_{u,v} have been reported in the literature, such as 1/r^2(k_u(n, j), k_v(m, j))
The metric correlation factor c_{u,v} quantifies absolute inverse distances and is said to be unnormalized
Thus, the local metric matrix C is said to be unnormalized
An alternative is to normalize the correlation factor
For instance,

c_{u,v} = \frac{c_{u,v}}{\text{total number of } [k_u, k_v] \text{ pairs considered}}

In this case the local metric matrix C is said to be normalized
Scalar Clusters
The correlation between two local terms can also be defined by comparing the neighborhoods of the two terms
The idea is that two terms with similar neighborhoods have some synonymity relationship
In this case we say that the relationship is indirect or induced by the neighborhood
We can quantify this relationship by comparing the neighborhoods of the terms through a scalar measure
For instance, the cosine of the angle between the two vectors is a popular scalar similarity measure
Let
\vec{s}_u = (c_{u,x_1}, c_{u,x_2}, \ldots, c_{u,x_n}): vector of neighborhood correlation values for the term k_u
\vec{s}_v = (c_{v,y_1}, c_{v,y_2}, \ldots, c_{v,y_m}): vector of neighborhood correlation values for the term k_v
Define,

c_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \, |\vec{s}_v|}

In this case the correlation matrix C is referred to as a local scalar matrix
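Since the induced correlation is just the cosine between two neighborhood vectors, it can be sketched directly (the example vectors are illustrative):

```python
import math

def scalar_correlation(su, sv):
    """Cosine between two neighborhood-correlation vectors represented as
    {term: correlation} dicts; terms absent from a dict count as zero."""
    dot = sum(w * sv.get(t, 0.0) for t, w in su.items())
    nu = math.sqrt(sum(w * w for w in su.values()))
    nv = math.sqrt(sum(w * w for w in sv.values()))
    return dot / (nu * nv) if nu and nv else 0.0

su = {"a": 1.0, "b": 2.0}
sv = {"a": 2.0, "b": 4.0}   # parallel neighborhoods -> cosine 1.0
```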
The local scalar matrix C is said to be induced by the neighborhood
Let C_u(n) be a function that returns the n largest c_{u,v} values in a local scalar matrix C, with v \neq u
Then, C_u(n) defines a scalar cluster around term k_u
Neighbor Terms
Terms that belong to clusters associated with the query terms can be used to expand the original query
Such terms are called neighbors of the query terms and are characterized as follows
A term k_v that belongs to a cluster C_u(n), associated with another term k_u, is said to be a neighbor of k_u
Often, neighbor terms represent distinct keywords that are correlated by the current query context
Consider the problem of expanding a given user query q with neighbor terms
One possibility is to expand the query as follows
For each term k_u \in q, select m neighbor terms from the cluster C_u(n) and add them to the query
This can be expressed as follows:

q_m = q \cup \{k_v \mid k_v \in C_u(n), k_u \in q\}

Hopefully, the additional neighbor terms k_v will retrieve new relevant documents
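The expansion rule above can be sketched as follows, assuming a cluster function C_u(n) is already available (the neighbor table here is illustrative):

```python
def expand_query(query, cluster, n=5, m=3):
    """q_m = q ∪ {k_v | k_v ∈ C_u(n), k_u ∈ q}: for each query term,
    add up to m neighbors taken from its size-n cluster."""
    expanded = list(query)
    for ku in query:
        for kv in cluster(ku, n)[:m]:
            if kv not in expanded:
                expanded.append(kv)
    return expanded

# Toy cluster function backed by a fixed neighbor table
neighbors = {"car": ["auto", "vehicle", "motor"], "insurance": ["policy"]}
cluster = lambda t, n: neighbors.get(t, [])[:n]
q_m = expand_query(["car", "insurance"], cluster)
# -> ["car", "insurance", "auto", "vehicle", "motor", "policy"]
```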
The set C_u(n) might be composed of terms obtained using either normalized or unnormalized correlation factors
Query expansion is important because it tends to improve recall
However, the larger number of documents to rank also tends to lower precision
Thus, query expansion needs to be exercised with great care and fine tuned for the collection at hand
Local Context Analysis
The local clustering techniques are based on the set of documents retrieved for a query
A distinct approach is to search for term correlations in the whole collection
Global techniques usually involve the building of a thesaurus that encodes term relationships in the whole collection
The terms are treated as concepts and the thesaurus is viewed as a concept relationship structure
The building of a thesaurus usually considers the use of small contexts and phrase structures
Local context analysis is an approach that combines global and local analysis
It is based on the use of noun groups, i.e., a single noun, two nouns, or three adjacent nouns in the text
Noun groups selected from the top ranked documents are treated as document concepts
However, instead of documents, passages are used for determining term co-occurrences
Passages are text windows of fixed size
More specifically, the local context analysis procedure operates in three steps
First, retrieve the top n ranked passages using the original query
Second, for each concept c in the passages compute the similarity sim(q, c) between the whole query q and the concept c
Third, the top m ranked concepts, according to sim(q, c), are added to the original query q
A weight computed as 1 - 0.9 \times i/m is assigned to each concept c, where
i: position of c in the concept ranking
m: number of concepts to add to q
The terms in the original query q might be stressed by assigning a weight equal to 2 to each of them
Of these three steps, the second one is the most complex and the one which we now discuss
The similarity sim(q, c) between each concept c and the original query q is computed as follows

sim(q, c) = \prod_{k_i \in q} \left( \delta + \frac{\log(f(c, k_i) \times idf_c)}{\log n} \right)^{idf_i}

where n is the number of top ranked passages considered and \delta is a small constant (typically 0.1) that avoids a zero similarity
The function f(c, k_i) quantifies the correlation between the concept c and the query term k_i and is given by

f(c, k_i) = \sum_{j=1}^{n} pf_{i,j} \times pf_{c,j}

where
pf_{i,j} is the frequency of term k_i in the j-th passage; and
pf_{c,j} is the frequency of the concept c in the j-th passage
Notice that this is the correlation measure defined for association clusters, but adapted for passages
The inverse document frequency factors are computed as

idf_i = \max\left(1, \frac{\log_{10}(N/np_i)}{5}\right)

idf_c = \max\left(1, \frac{\log_{10}(N/np_c)}{5}\right)

where
N is the number of passages in the collection
np_i is the number of passages containing the term k_i; and
np_c is the number of passages containing the concept c
The idf_i factor in the exponent is introduced to emphasize infrequent query terms
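Putting the three formulas together, the concept score can be sketched as below (with δ = 0.1; the floor on f(c, k_i) for non-co-occurring terms is our own guard, not part of the original formulation, and the example counts are illustrative):

```python
import math

def lca_sim(query_terms, concept, passages, N, np, delta=0.1):
    """Local context analysis score sim(q, c).
    passages: {term: frequency} dicts for the top n ranked passages.
    N: passages in the collection; np[t]: passages containing t."""
    n = len(passages)
    idf = lambda t: max(1.0, math.log10(N / np[t]) / 5)
    sim = 1.0
    for ki in query_terms:
        # f(c, ki): association-cluster correlation over the passages
        f = sum(p.get(ki, 0) * p.get(concept, 0) for p in passages)
        f = max(f, 1e-6)  # guard against log(0)
        sim *= (delta + math.log(f * idf(concept)) / math.log(n)) ** idf(ki)
    return sim

passages = [{"q1": 2, "c": 1}, {"q1": 1, "c": 2}, {"c": 1}]
score = lca_sim(["q1"], "c", passages, N=100, np={"q1": 50, "c": 10})
```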
The procedure above for computing sim(q, c) is a non-trivial variant of tf-idf ranking
It has been adjusted for operation with TREC data and did not work so well with a different collection
Thus, it is important to have in mind that tuning might be required for operation with a different collection
Implicit Feedback Through Global Analysis
Global Context Analysis
The methods of local analysis extract information from the local set of documents retrieved to expand the query
An alternative approach is to expand the query using information from the whole set of documents, a strategy usually referred to as global analysis
We distinguish two global analysis procedures:
Query expansion based on a similarity thesaurus
Query expansion based on a statistical thesaurus
Query Expansion based on a Similarity Thesaurus
Similarity Thesaurus
We now discuss a query expansion model based on a global similarity thesaurus constructed automatically
The similarity thesaurus is based on term to term relationships rather than on a matrix of co-occurrence
Special attention is paid to the selection of terms for expansion and to the reweighting of these terms
Terms for expansion are selected based on their similarity to the whole query
A similarity thesaurus is built using term to term relationships
These relationships are derived by considering that the terms are concepts in a concept space
In this concept space, each term is indexed by the documents in which it appears
Thus, terms assume the original role of documents while documents are interpreted as indexing elements
Let,
t: number of terms in the collection
N: number of documents in the collection
f_{i,j}: frequency of term k_i in document d_j
t_j: number of distinct index terms in document d_j
Then,

itf_j = \log \frac{t}{t_j}

is the inverse term frequency for document d_j (analogous to inverse document frequency)
Within this framework, with each term k_i is associated a vector \vec{k}_i given by

\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})

These weights are computed as follows

w_{i,j} = \frac{\left(0.5 + 0.5 \frac{f_{i,j}}{\max_j(f_{i,j})}\right) \times itf_j}{\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5 \frac{f_{i,l}}{\max_l(f_{i,l})}\right)^2 \times itf_l^2}}

where \max_j(f_{i,j}) computes the maximum of all f_{i,j} factors for the i-th term
The relationship between two terms k_u and k_v is computed as a correlation factor c_{u,v} given by

c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}
The global similarity thesaurus is given by the scalar term-term matrix composed of correlation factors c_{u,v}
This global similarity thesaurus has to be computed only once and can be updated incrementally
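A compact sketch of the thesaurus construction, assuming the term-document frequencies are available as nested dicts (and that no document contains every term, so itf_j > 0; all names are illustrative):

```python
import math

def term_vectors(f):
    """Term vectors k_i = (w_{i,1}, ..., w_{i,N}) from a term-document
    frequency map f: {term: {doc: freq}}, using the itf-based weights."""
    docs = sorted({d for row in f.values() for d in row})
    t = len(f)
    # itf_j = log(t / t_j), with t_j the number of distinct terms in d_j
    itf = {d: math.log(t / sum(1 for row in f.values() if d in row))
           for d in docs}
    vectors = {}
    for ki, row in f.items():
        fmax = max(row.values())
        w = {d: (0.5 + 0.5 * row.get(d, 0) / fmax) * itf[d] for d in docs}
        norm = math.sqrt(sum(x * x for x in w.values()))
        vectors[ki] = {d: x / norm for d, x in w.items()}
    return vectors

def correlation(vectors, ku, kv):
    """Thesaurus entry c_{u,v} = k_u · k_v."""
    return sum(w * vectors[kv][d] for d, w in vectors[ku].items())

vectors = term_vectors({"a": {"d1": 2}, "b": {"d1": 1, "d2": 3}, "c": {"d2": 1}})
```

Since each vector is normalized, the self-correlation c_{u,u} is 1 and all entries lie in [0, 1] for non-negative weights.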
Given the global similarity thesaurus, query expansion is done in three steps as follows
First, represent the query in the same vector space used for representing the index terms
Second, compute a similarity sim(q, k_v) between each term k_v correlated to the query terms and the whole query q
Third, expand the query with the top r ranked terms according to sim(q, k_v)
For the first step, the query is represented by a vector \vec{q} given by

\vec{q} = \sum_{k_i \in q} w_{i,q} \vec{k}_i

where w_{i,q} is a term-query weight computed using the equation for w_{i,j}, but with q in place of d_j
For the second step, the similarity sim(q, k_v) is computed as

sim(q, k_v) = \vec{q} \cdot \vec{k}_v = \sum_{k_i \in q} w_{i,q} \times c_{i,v}
A term k_v might be closer to the whole query centroid \vec{q}_C than to the individual query terms
Thus, terms selected here might be distinct from those selected by previous global analysis methods
For the third step, the top r ranked terms are added to the query q to form the expanded query q_m
To each expansion term k_v in query q_m is assigned a weight w_{v,q_m} given by

w_{v,q_m} = \frac{sim(q, k_v)}{\sum_{k_i \in q} w_{i,q}}
The expanded query q_m is then used to retrieve new documents
This technique has yielded improved retrieval performance (in the range of 20%) with three different collections
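The three expansion steps can be sketched on top of the correlation factors c_{u,v} (the query weights and thesaurus below are illustrative):

```python
def expand_with_thesaurus(query_weights, c, r=2):
    """Expand a query using a global similarity thesaurus.
    query_weights: {term: w_iq}; c: {term: {term: correlation}}."""
    # sim(q, kv) = sum over query terms ki of w_iq * c_iv
    sims = {}
    for ki, wiq in query_weights.items():
        for kv, civ in c.get(ki, {}).items():
            if kv not in query_weights:
                sims[kv] = sims.get(kv, 0.0) + wiq * civ
    # keep the top-r terms, each weighted by sim(q, kv) / sum of w_iq
    total = sum(query_weights.values())
    top = sorted(sims, key=sims.get, reverse=True)[:r]
    expanded = dict(query_weights)
    expanded.update({kv: sims[kv] / total for kv in top})
    return expanded

qw = {"car": 1.0}
thesaurus = {"car": {"auto": 0.8, "vehicle": 0.6, "tree": 0.1}}
q_m = expand_with_thesaurus(qw, thesaurus)   # adds "auto" and "vehicle"
```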
Consider a document d_j which is represented in the term vector space by \vec{d}_j = \sum_{k_i \in d_j} w_{i,j} \vec{k}_i
Assume that the query q is expanded to include all the t index terms (properly weighted) in the collection
Then, the similarity sim(q, d_j) between d_j and q can be computed in the term vector space by

sim(q, d_j) \propto \sum_{k_v \in d_j} \sum_{k_u \in q} w_{v,j} \times w_{u,q} \times c_{u,v}
The previous expression is analogous to the similarity formula in the generalized vector space model
Thus, the generalized vector space model can be interpreted as a query expansion technique
The two main differences are
the weights are computed differently
only the top r ranked terms are used
Query Expansion based on a Statistical Thesaurus
Global Statistical Thesaurus
We now discuss a query expansion technique based on a global statistical thesaurus
The approach is quite distinct from the one based on a similarity thesaurus
The global thesaurus is composed of classes that group correlated terms in the context of the whole collection
Such correlated terms can then be used to expand the original user query
To be effective, the terms selected for expansion must have high term discrimination values
This implies that they must be low frequency terms
However, it is difficult to cluster low frequency terms due to the small amount of information about them
To circumvent this problem, documents are clustered into classes
The low frequency terms in these documents are then used to define thesaurus classes
A document clustering algorithm that produces small and tight clusters is the complete link algorithm:
1. Initially, place each document in a distinct cluster
2. Compute the similarity between all pairs of clusters
3. Determine the pair of clusters [C_u, C_v] with the highest inter-cluster similarity
4. Merge the clusters C_u and C_v
5. Verify a stop criterion (if this criterion is not met then go back to step 2)
6. Return a hierarchy of clusters
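The steps above can be sketched as below; for brevity the sketch returns the flat clustering at a stop threshold rather than the full merge hierarchy, and the document similarity function is illustrative:

```python
def complete_link(docs, sim, threshold):
    """Agglomerative complete-link clustering of document ids.
    Cluster similarity = minimum pairwise document similarity."""
    clusters = [[d] for d in docs]        # 1. one document per cluster
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):    # 2-3. most similar cluster pair
            for j in range(i + 1, len(clusters)):
                s = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:              # 5. stop criterion
            break
        i, j = pair                       # 4. merge C_u and C_v
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

points = {"d1": 0.0, "d2": 0.1, "d3": 5.0}
sim = lambda a, b: 1.0 / (1.0 + abs(points[a] - points[b]))
clusters = complete_link(["d1", "d2", "d3"], sim, threshold=0.5)
# -> [["d1", "d2"], ["d3"]]
```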
The similarity between two clusters is defined as the minimum of the similarities between two documents not in the same cluster
To compute the similarity between documents in a pair, the cosine formula of the vector model is used
As a result of this minimality criterion, the resultant clusters tend to be small and tight
Consider that the whole document collection has been clustered using the complete link algorithm
The figure below illustrates a portion of the whole cluster hierarchy generated by the complete link algorithm, where the inter-cluster similarities are shown in the ovals
[Figure: clusters C_u and C_v merge at similarity 0.15, and the resulting cluster merges with C_z at similarity 0.11]
The terms that compose each class of the global thesaurus are selected as follows
Obtain from the user three parameters:
TC: threshold class
NDC: number of documents in a class
MIDF: minimum inverse document frequency
Parameter TC determines the document clusters that will be used to generate thesaurus classes
Two clusters C_u and C_v are selected when TC is surpassed by sim(C_u, C_v)
Use NDC as a limit on the number of documents in the clusters
For instance, if both C_{u+v} and C_{u+v+z} are selected then the parameter NDC might be used to decide between the two
MIDF defines the minimum value of IDF for any term which is selected to participate in a thesaurus class
Given that the thesaurus classes have been built, they can be used for query expansion
For this, an average term weight \overline{wt}_C for each thesaurus class C is computed as follows

\overline{wt}_C = \frac{\sum_{i=1}^{|C|} w_{i,C}}{|C|}

where
|C| is the number of terms in the thesaurus class C, and
w_{i,C} is a weight associated with the term-class pair [k_i, C]
This average term weight can then be used to compute a thesaurus class weight w_C as

w_C = \frac{\overline{wt}_C}{|C| \times 0.5}
The above weight formulations have been verified through experimentation and have yielded good results