Minimal Test Collections for Retrieval Evaluation
B. Carterette, J. Allan, R. Sitaraman
University of Massachusetts Amherst
SIGIR 2006
Outline
Introduction
Previous Work
Intuition and Theory
Experimental Setup and Results
Discussion
Conclusion
Introduction
Information retrieval system evaluation requires test collections:
– corpora of documents, sets of topics, and relevance judgments
Stable, fine-grained evaluation metrics take both precision and recall into account, and require large sets of judgments.
– At best inefficient, at worst infeasible
Introduction
The TREC conferences
– Goal: building test collections that are reusable
– Pooling process: top results from many system runs are judged
Reusability is not always a major concern.
– TREC-style topics may not suit a specific task
– Dynamic collections such as the web
Previous Work
The pooling method has been shown to be sufficient for research purposes.
[Soboroff, 2001]
– random assignment of relevance to documents in a pool gives a decent ranking of systems
[Sanderson, 2004]
– ranking systems reliably from a set of judgments obtained from a single system or from iterating relevance feedback runs
[Carterette, 2005]
– proposing an algorithm to achieve high rank correlation with a very small set of judgments
Average Precision
Let $x_i$ be a Boolean indicator of the relevance of document $i$.
$$
AP = \frac{1}{|R|} \sum_{d \in R} \mathrm{prec@rank}(d)
   = \frac{1}{|R|} \sum_{r=1}^{n} x_r \, \frac{1}{r} \sum_{i=1}^{r} x_i
   = \frac{1}{|R|} \sum_{r=1}^{n} \sum_{i=1}^{r} \frac{1}{r} \, x_r x_i
$$
$$
AP = \frac{1}{|R|} \sum_{i=1}^{n} \sum_{j \ge i} a_{ij} \, x_i x_j,
\qquad a_{ij} = \frac{1}{\max\{rank(i), rank(j)\}}
$$
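The equivalence between the usual definition of AP and the pairwise form with coefficients 1/max{rank(i), rank(j)} can be checked numerically. The following is a minimal sketch, not code from the talk; the function names and the example list are illustrative, and it assumes every relevant document appears in the ranked list, so that |R| is the number of relevant documents in the list.

```python
def average_precision(relevance):
    """AP of a ranked list of 0/1 relevance indicators (rank 1 first)."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, x in enumerate(relevance, start=1):
        if x:
            hits += 1
            total += hits / rank  # precision at the rank of a relevant document
    return total / num_relevant


def average_precision_pairwise(relevance):
    """Same quantity as (1/|R|) * sum_i sum_{j>=i} a_ij * x_i * x_j."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    n = len(relevance)
    total = 0.0
    for i in range(n):
        for j in range(i, n):
            a_ij = 1.0 / (max(i, j) + 1)  # list positions are 0-based, ranks 1-based
            total += a_ij * relevance[i] * relevance[j]
    return total / num_relevant


# Illustrative ranked list: relevant documents at ranks 1, 3, and 6.
ranked = [1, 0, 1, 0, 0, 1]
ap_standard = average_precision(ranked)
ap_pairwise = average_precision_pairwise(ranked)
```

Both functions give 13/18 for this list: the precisions at ranks 1, 3, and 6 are 1, 2/3, and 1/2, and their mean is 13/18.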
Intuition
The difference in AP between two ranked lists has the same pairwise form, where $a_{ij}$ and $b_{ij}$ are the coefficients of the first and second list:
$$
\Delta AP = AP_1 - AP_2 = \frac{1}{|R|} \sum_{i=1}^{n} \sum_{j \ge i} c_{ij} \, x_i x_j,
\qquad c_{ij} = a_{ij} - b_{ij}
$$
Let $S$ be the set of judged relevant documents and $T$ the set of unjudged documents, and suppose $\Delta AP > 0$. Judging can stop (stopping condition) once
$$
\sum_{i,j \in S} c_{ij} \, x_i x_j \;>\;
\sum_{\substack{i,j \in T \text{ or } i \in S,\, j \in T \\ \text{and } c_{ij} < 0}} |c_{ij}|
$$
Intuitively, we want to increase the LHS by finding relevant documents and decrease the RHS by finding irrelevant documents.
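The stopping condition can be sketched as a comparison of two sums over document pairs: the coefficients of pairs already known to be relevant, against the worst-case negative coefficients of pairs that still involve an unjudged document. The code below is an illustration under stated assumptions, not the authors' implementation: the rankings, document names, and function names are hypothetical, and each unordered pair (including i = j) is counted once.

```python
from itertools import combinations_with_replacement


def coefficient(i, j, rank_1, rank_2):
    """c_ij = a_ij - b_ij, with a and b equal to 1/max(rank) in each list."""
    a_ij = 1.0 / max(rank_1[i], rank_1[j])
    b_ij = 1.0 / max(rank_2[i], rank_2[j])
    return a_ij - b_ij


def stopping_condition(rank_1, rank_2, judgments):
    """Return (met, lhs, rhs); judgments maps doc -> 1, 0, or is missing if unjudged."""
    docs = sorted(rank_1)
    relevant = {d for d in docs if judgments.get(d) == 1}
    unjudged = {d for d in docs if judgments.get(d) is None}
    open_docs = relevant | unjudged  # judged-irrelevant documents contribute nothing
    lhs = rhs = 0.0
    for i, j in combinations_with_replacement(docs, 2):
        c = coefficient(i, j, rank_1, rank_2)
        if i in relevant and j in relevant:
            lhs += c
        elif i in open_docs and j in open_docs and c < 0:
            rhs += -c  # at least one of i, j is unjudged here
    return lhs > rhs, lhs, rhs


# Hypothetical rankings of four documents by two systems (1 = top rank).
rank_1 = {"A": 1, "B": 2, "C": 3, "D": 4}
rank_2 = {"C": 1, "D": 2, "A": 3, "B": 4}

before = stopping_condition(rank_1, rank_2, {"A": 1})
after = stopping_condition(rank_1, rank_2, {"A": 1, "C": 0})
```

With only A judged relevant the condition fails (LHS 2/3 against RHS 5/4); judging C irrelevant removes its negative coefficients from the RHS (now 1/3) and the condition holds.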
An Optimal Algorithm
THEOREM 1: If $p_i = p$ for all $i$ (where $p_i$ is the probability that document $i$ is relevant), the set $S$ maximizes
$$
E\Big[\sum_{i,j \in S} c_{ij} \, x_i x_j\Big].
$$
AP is Normally Distributed
Given a set of relevance judgments, we use the normal cumulative distribution function (cdf) to find P(ΔAP ≤ 0), the confidence that ΔAP ≤ 0.
Figure 1: We simulated two ranked lists of 100 documents. Setting pi=0.5, we randomly generated 5000 sets of relevance judgments and calculated ΔAP for each set. The Anderson-Darling goodness of fit test concludes that we cannot reject the hypothesis that the sample came from a normal distribution.
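A simulation in the spirit of Figure 1 can be sketched with the Python standard library. This is an assumption-laden illustration rather than the original experiment: the second ranked list is a random shuffle with a fixed seed, the normal distribution is fitted to the sampled ΔAP values (the talk's method works from the judgments directly), and no Anderson-Darling test is run. It only compares the fitted normal cdf at 0 with the empirical fraction of samples at or below 0.

```python
import random
from statistics import NormalDist, mean, stdev


def average_precision(relevance):
    """AP of a ranked list of 0/1 relevance indicators (rank 1 first)."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, x in enumerate(relevance, start=1):
        if x:
            hits += 1
            total += hits / rank
    return total / num_relevant


rng = random.Random(0)            # fixed seed so the run is repeatable
n_docs, n_trials, p = 100, 5000, 0.5

list_1 = list(range(n_docs))      # system 1 ranks documents 0..99 in order
list_2 = list_1[:]
rng.shuffle(list_2)               # system 2: a hypothetical different ordering

deltas = []
for _ in range(n_trials):
    # One random set of relevance judgments, each document relevant with probability p.
    relevant = [1 if rng.random() < p else 0 for _ in range(n_docs)]
    ap_1 = average_precision([relevant[d] for d in list_1])
    ap_2 = average_precision([relevant[d] for d in list_2])
    deltas.append(ap_1 - ap_2)

fitted = NormalDist(mean(deltas), stdev(deltas))
confidence = fitted.cdf(0.0)                          # P(delta AP <= 0) under the fit
empirical = sum(d <= 0 for d in deltas) / n_trials    # same quantity, by counting
```

If ΔAP is close to normal, `confidence` and `empirical` should agree closely; a formal goodness-of-fit test would be the next step.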
Application to MAP
Because topics are independent,
$$
E[MAP] = \frac{1}{|T|} \sum_{t \in T} E[AP_t],
\qquad
Var[MAP] = \frac{1}{|T|^2} \sum_{t \in T} Var[AP_t]
$$
where $T$ here denotes the set of topics.
Then if each $AP_t$ is normally distributed, MAP is normally distributed as well.
Each (topic, document) pair is treated as a unique “document”.
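The aggregation over independent topics amounts to averaging the per-topic means and dividing the summed variances by the squared number of topics. The sketch below applies this to the difference in MAP between two systems and reads off P(ΔMAP ≤ 0) from the normal cdf; the per-topic numbers and the function name are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from math import sqrt
from statistics import NormalDist


def map_difference_confidence(topic_means, topic_variances):
    """Return (mean, variance, P(delta MAP <= 0)) from per-topic moments of delta AP."""
    num_topics = len(topic_means)
    mean = sum(topic_means) / num_topics
    variance = sum(topic_variances) / num_topics ** 2
    confidence = NormalDist(mean, sqrt(variance)).cdf(0.0)
    return mean, variance, confidence


# Hypothetical per-topic expectations and variances of delta AP.
means = [0.02, -0.01, 0.03]
variances = [0.0004, 0.0009, 0.0003]
mean, variance, confidence = map_difference_confidence(means, variances)
```

Here the mean is 0.04/3 and the standard deviation is also 0.04/3, so zero lies one standard deviation below the mean and the confidence that ΔMAP ≤ 0 is about 0.159.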
Outline of the Experiment
1) We ran eight retrieval systems on a set of baseline topics for which we had full sets of judgments.
2) Six annotators were asked to develop 60 new topics; these were run on the same eight systems.
3) The annotators then judged documents selected by the algorithm.
The Baseline
Baseline topics
– Used to estimate the system performance
– The 2005 Robust/HARD track topics and ad hoc topics 301 through 450
Corpora
– Aquaint for the Robust topics, 1 million articles
– TREC disks 4 and 5 for the ad hoc topics, 50,000 articles
Retrieval systems
– Six freely-available retrieval systems: Indri, Lemur, Lucene, mg, Smart, and Zettair
Experiment Results
2200 relevance judgments obtained in 2.5 hours
– About 4.5 per system per topic on average
– About 2.5 per minute per annotator; the rate in TREC is 2.2
The systems are ranked by the expected value of MAP (εMAP):
$$
\varepsilon MAP = \sum_{t} E[AP_t],
\qquad
E[AP_t] \approx \frac{1}{\sum_i p_i} \Big( \sum_i c_{ii} \, p_i + \sum_i \sum_{j > i} c_{ij} \, p_i p_j \Big)
$$
where $p_i = 1$ if document $i$ has been judged relevant, 0 if irrelevant, and 0.5 otherwise. Here $c_{ij}$ is the single-system coefficient $1/\max\{rank(i), rank(j)\}$.
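The expected AP of one ranked list under partial judgments can be sketched as follows: judged documents get probability 1 or 0, unjudged documents get 0.5, and the pairwise sum is normalised by the expected number of relevant documents. The helper names and the example list are assumptions made for illustration.

```python
def to_probabilities(judgments):
    """Map judgments (1, 0, or None for unjudged) to relevance probabilities."""
    return [0.5 if j is None else float(j) for j in judgments]


def expected_ap(probabilities):
    """Approximate E[AP]; probabilities[r] belongs to the document at rank r + 1."""
    mass = sum(probabilities)  # expected number of relevant documents
    if mass == 0:
        return 0.0
    n = len(probabilities)
    total = 0.0
    for i in range(n):
        total += probabilities[i] / (i + 1)  # diagonal term c_ii * p_i
        for j in range(i + 1, n):
            # off-diagonal term c_ij * p_i * p_j, with max rank = j + 1
            total += probabilities[i] * probabilities[j] / (j + 1)
    return total / mass


# Ranked list with rank 1 judged relevant, rank 3 judged irrelevant, others unjudged.
partial = expected_ap(to_probabilities([1, None, 0, None]))
# With complete judgments the estimate reduces to ordinary AP.
complete = expected_ap(to_probabilities([1, 0, 1]))
```

For the partially judged list the estimate is 0.90625; for the fully judged list it equals the ordinary AP of 5/6. Summing this quantity over topics gives the score used to rank systems.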
Experiment Results
Table 1: True MAPs of eight systems over 200 baseline topics, and expected MAP, with 95% confidence intervals, over 60 new topics. Horizontal lines indicate “bin” divisions determined by statistical significance.
Experiment Results
Figure 2: Confidence increases as more judgments are made.
Discussion
Simulations using the Robust 2005 topics and the NIST judgments evaluate the performance of the algorithm.
Some questions are explored:
– To what degree are the results dependent on the algorithm rather than the evaluation metric?
– How many judgments are required to differentiate a single pair of ranked lists with 95% confidence?
– How does confidence vary as more judgments are made?
– Are test collections produced by our algorithm reusable?
Comparing εMAP and MAP
Simulation: after several documents have been judged, εMAP and MAP on all systems are calculated and compared with the true ranking by Kendall’s tau correlation.
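Kendall's tau compares two rankings of the same systems by counting pairs ordered the same way (concordant) and pairs ordered oppositely (discordant). A minimal sketch of the tie-free form follows; the system names and scores are hypothetical, not results from the talk.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau between two score dictionaries over the same systems."""
    systems = list(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1   # both scorings order s and t the same way
        elif product < 0:
            discordant += 1   # the two scorings disagree on the order
    num_pairs = len(systems) * (len(systems) - 1) / 2
    return (concordant - discordant) / num_pairs


# Hypothetical true MAP and estimated scores for four systems.
true_map = {"s1": 0.30, "s2": 0.25, "s3": 0.20, "s4": 0.10}
estimate = {"s1": 0.28, "s2": 0.29, "s3": 0.18, "s4": 0.12}
tau = kendall_tau(true_map, estimate)
```

Only the pair (s1, s2) is swapped, so five of the six pairs are concordant and tau is (5 - 1) / 6 = 2/3. A value of 1 means the estimated ranking matches the true ranking exactly.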
How Many Judgments?
The number of judgments that must be made in comparing two systems depends on how similar they are.
Fig. 4: Absolute difference in true AP for Robust 2005 topics vs. number of judgments to 95% confidence for pairs of ranked lists for individual topics.
Confidence over Time
Incremental pooling: all documents in a pool of depth k will be judged.
The pool of depth 10 contains 2228 documents and 569 of them are not in the algorithmic set.
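A depth-k pool is the union of the top k documents from every run, and incremental pooling grows that union one depth at a time. The sketch below covers a single topic with hypothetical runs and a hypothetical algorithm-selected set; it shows how the documents that are in the pool but not in the algorithmic set could be counted.

```python
def pool_at_depth(runs, depth):
    """Union of the top `depth` documents from every run (single topic)."""
    pooled = set()
    for ranked_docs in runs.values():
        pooled.update(ranked_docs[:depth])
    return pooled


def incremental_pools(runs, max_depth):
    """Yield (depth, newly added documents) as the pool deepens."""
    seen = set()
    for depth in range(1, max_depth + 1):
        current = pool_at_depth(runs, depth)
        yield depth, current - seen
        seen = current


# Hypothetical runs and a hypothetical set chosen by the selection algorithm.
runs = {
    "sysA": ["d1", "d2", "d3", "d4"],
    "sysB": ["d2", "d5", "d1", "d6"],
}
algorithmic_set = {"d1", "d5", "d6"}

depth_2_pool = pool_at_depth(runs, 2)
only_in_pool = depth_2_pool - algorithmic_set
new_by_depth = {depth: sorted(added) for depth, added in incremental_pools(runs, 3)}
```

In this toy case the depth-2 pool holds three documents, one of which (d2) the algorithmic set does not contain. Across many topics, the same set difference would be taken per topic and the counts summed.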
Reusability of Test Collections
One of the eight systems is removed to build test collections.
All eight systems are ranked by εMAP, setting $p_i$ to the ratio of relevant documents in the test collection.
Table 4: Reusability of test collections. The 8th system is always placed in the correct spot or swapped with the next (statistically indistinguishable) system.
Conclusion
A new perspective on average precision leads to an algorithm for selecting the documents to judge so that retrieval systems can be evaluated in minimal time.
After only six hours of annotation time, we had achieved a ranking with 90% confidence.
A direction for future work is extending the analysis to other evaluation metrics for different tasks.
Another direction is estimating probabilities of relevance.