Minimal Test Collections for Retrieval Evaluation B. Carterette, J. Allan, R. Sitaraman University...

Minimal Test Collections for Retrieval Evaluation

B. Carterette, J. Allan, R. Sitaraman
University of Massachusetts Amherst

SIGIR 2006

Outline

Introduction
Previous Work
Intuition and Theory
Experimental Setup and Results
Discussion
Conclusion


Introduction

Information retrieval system evaluation requires test collections:
– corpora of documents, sets of topics, and relevance judgments

Stable, fine-grained evaluation metrics take both precision and recall into account and require large sets of judgments.
– At best inefficient, at worst infeasible


The TREC conferences
– Goal: building test collections that are reusable
– Pooling process: top results from many system runs are judged

Reusability is not always a major concern.
– TREC-style topics may not suit a specific task
– Dynamic collections such as the web

Previous Work
The pooling method has been shown to be sufficient for research purposes.

[Soboroff, 2001]
– random assignment of relevance to documents in a pool gives a decent ranking of systems

[Sanderson, 2004]
– ranking systems reliably from a set of judgments obtained from a single system or from iterated relevance feedback runs

[Carterette, 2005]
– proposing an algorithm to achieve high rank correlation with a very small set of judgments

Intuition and Theory
Average Precision

Let x_i be a Boolean indicator of the relevance of document i.

$$AP = \frac{1}{|R|} \sum_{d \in R} \mathrm{prec@rank}(d)$$

$$AP = \frac{1}{|R|} \sum_{r=1}^{n} x_r \, \frac{1}{r} \sum_{i=1}^{r} x_i = \frac{1}{|R|} \sum_{r=1}^{n} \sum_{i=1}^{r} \frac{1}{r} \, x_r x_i$$

$$AP = \frac{1}{|R|} \sum_{i=1}^{n} \sum_{j \ge i} a_{ij} \, x_i x_j, \qquad a_{ij} = \frac{1}{\max\{rank(i), rank(j)\}}$$
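The equivalence between the usual definition of AP and the double-sum form with a_ij = 1/max{rank(i), rank(j)} can be checked numerically. The following is a minimal sketch, not the authors' code; the function names and the example ranking are illustrative assumptions, and documents are identified with their rank positions.

```python
def average_precision(relevance):
    """AP from the standard definition: mean precision@rank over relevant documents.

    `relevance` is a list of 0/1 indicators x_1..x_n in rank order.
    """
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, x in enumerate(relevance, start=1):
        if x:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / num_relevant


def average_precision_quadratic(relevance):
    """AP as the double sum over i and j >= i of a_ij * x_i * x_j."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    n = len(relevance)
    total = 0.0
    for i in range(n):
        for j in range(i, n):
            a_ij = 1.0 / max(i + 1, j + 1)  # ranks are 1-based
            total += a_ij * relevance[i] * relevance[j]
    return total / num_relevant


# Hypothetical ranking: relevant documents at ranks 1, 3 and 4.
ranking = [1, 0, 1, 1, 0]
ap_standard = average_precision(ranking)
ap_quadratic = average_precision_quadratic(ranking)
```

Both functions return the same value for any 0/1 ranking, which is what lets AP be treated as a quadratic form in the relevance indicators.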

Intuition

$$\Delta AP = AP_1 - AP_2 = \frac{1}{|R|} \sum_{i=1}^{n} \sum_{j \ge i} c_{ij} \, x_i x_j, \qquad c_{ij} = a_{ij} - b_{ij}$$

where a_ij and b_ij are the coefficients of the two ranked lists.

Let S be the set of judged relevant documents and T the set of unjudged documents, and suppose ΔAP > 0. Judging can stop (stopping condition) once

$$\sum_{i,j \in S} c_{ij} \, x_i x_j > \Big| \sum_{\substack{i,j \in T \text{ or } i \in S,\, j \in T \\ c_{ij} < 0}} c_{ij} \Big|$$

Intuitively, we want to increase the LHS by finding relevant documents and decrease the RHS by finding irrelevant documents.
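A small sketch of ΔAP written with difference coefficients c_ij = a_ij - b_ij, where a_ij and b_ij are 1/max{rank(i), rank(j)} under each of the two rankings. The document ids, rankings and judgments below are hypothetical, and both systems are assumed to rank the same set of documents.

```python
def ap_from_ranks(ranks, relevance):
    """Standard AP for one system; `ranks` maps document id -> 1-based rank."""
    num_relevant = sum(relevance.values())
    ordered = sorted(ranks, key=ranks.get)
    hits, total = 0, 0.0
    for doc in ordered:
        if relevance[doc]:
            hits += 1
            total += hits / ranks[doc]
    return total / num_relevant


def delta_ap(ranks_a, ranks_b, relevance):
    """AP_1 - AP_2 as (1/|R|) * sum over i, j >= i of c_ij * x_i * x_j."""
    docs = sorted(relevance)
    num_relevant = sum(relevance.values())
    total = 0.0
    for pos, i in enumerate(docs):
        for j in docs[pos:]:  # each unordered pair once, including i == j
            a_ij = 1.0 / max(ranks_a[i], ranks_a[j])
            b_ij = 1.0 / max(ranks_b[i], ranks_b[j])
            total += (a_ij - b_ij) * relevance[i] * relevance[j]
    return total / num_relevant


# Hypothetical example: system B returns the documents in reverse order.
ranks_a = {"d1": 1, "d2": 2, "d3": 3, "d4": 4}
ranks_b = {"d1": 4, "d2": 3, "d3": 2, "d4": 1}
relevance = {"d1": 1, "d2": 0, "d3": 1, "d4": 0}

difference = delta_ap(ranks_a, ranks_b, relevance)
direct = ap_from_ranks(ranks_a, relevance) - ap_from_ranks(ranks_b, relevance)
```

Only pairs of relevant documents contribute, so a judgment of "irrelevant" removes every term involving that document, which is the effect the intuition above relies on.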

An Optimal Algorithm

THEOREM 1: If p_i = p for all i, the set S maximizes

$$E\Big[\sum_{i,j \in S} c_{ij} \, x_i x_j\Big].$$

AP is Normally Distributed

Given a set of relevance judgments, we use the normal cumulative distribution function (cdf) to find P(ΔAP ≤ 0), the confidence that ΔAP ≤ 0.

Figure 1: We simulated two ranked lists of 100 documents. Setting p_i = 0.5, we randomly generated 5000 sets of relevance judgments and calculated ΔAP for each set. The Anderson-Darling goodness-of-fit test concludes that we cannot reject the hypothesis that the sample came from a normal distribution.
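The simulation described in the figure caption can be approximated with the standard library alone. This is a rough sketch under stated assumptions: the two ranked lists are random permutations of the same 100 documents, the seed is arbitrary, and a normal distribution is simply fitted to the sample mean and standard deviation. It does not run the Anderson-Darling test.

```python
import random
from statistics import NormalDist, mean, stdev


def average_precision(ranking, relevance):
    """AP of a ranked list of document ids given a dict of 0/1 relevance."""
    num_relevant = sum(relevance.values())
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if relevance[doc]:
            hits += 1
            total += hits / rank
    return total / num_relevant


def simulate_delta_ap(num_docs=100, trials=5000, p=0.5, seed=0):
    """Draw `trials` random judgment sets and return the ΔAP of two fixed lists."""
    rng = random.Random(seed)
    docs = list(range(num_docs))
    list_a, list_b = docs[:], docs[:]
    rng.shuffle(list_a)  # hypothetical ranking of system 1
    rng.shuffle(list_b)  # hypothetical ranking of system 2
    samples = []
    for _ in range(trials):
        # Each document is relevant independently with probability p.
        relevance = {d: int(rng.random() < p) for d in docs}
        samples.append(average_precision(list_a, relevance)
                       - average_precision(list_b, relevance))
    return samples


samples = simulate_delta_ap()
fitted = NormalDist(mean(samples), stdev(samples))
confidence_not_better = fitted.cdf(0.0)  # estimate of P(ΔAP <= 0)
```

With a fitted normal in hand, the confidence that one system is no better than the other is a single cdf evaluation at zero.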

Application to MAP

Because topics are independent, with T the set of topics,

$$E[MAP] = \frac{1}{|T|} \sum_{t \in T} E[AP_t], \qquad Var[MAP] = \frac{1}{|T|^2} \sum_{t \in T} Var[AP_t]$$

Then if each AP_t is normally distributed, MAP is normally distributed as well.

Each (topic, document) pair is treated as a unique “document”.
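As a brief illustration of how per-topic moments combine under independence, the sketch below averages per-topic means and divides the summed variances by |T|². The per-topic numbers are made up for illustration, and `map_distribution` is a hypothetical helper name.

```python
from statistics import NormalDist


def map_distribution(topic_means, topic_variances):
    """Normal approximation for the mean over independent topics."""
    num_topics = len(topic_means)
    mean_map = sum(topic_means) / num_topics
    var_map = sum(topic_variances) / num_topics ** 2
    return NormalDist(mean_map, var_map ** 0.5)


# Hypothetical per-topic expectation and variance of ΔAP for three topics.
topic_means = [0.10, -0.05, 0.20]
topic_variances = [0.04, 0.01, 0.09]

delta_map = map_distribution(topic_means, topic_variances)
confidence_not_better = delta_map.cdf(0.0)  # P(ΔMAP <= 0) under the approximation
```

The variance shrinks as topics are added, which is why confidence in a MAP difference can be high even when individual topics are uncertain.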

Experimental Setup and Results
Outline of the Experiment

1) We ran eight retrieval systems on a set of baseline topics for which we had full sets of judgments.

2) Six annotators were asked to develop 60 new topics; these were run on the same eight systems.

3) The annotators then judged documents selected by the algorithm.

The Baseline

Baseline topics
– Used to estimate the system performance
– The 2005 Robust/HARD track topics and ad hoc topics 301 through 450

Corpora
– AQUAINT for the Robust topics, 1 million articles
– TREC disks 4 & 5 for the ad hoc topics, 50,000 articles

Retrieval systems
– Six freely-available retrieval systems: Indri, Lemur, Lucene, mg, Smart, and Zettair

Experiment Results

2200 relevance judgments obtained in 2.5 hours
– About 4.5 per system per topic on average
– About 2.5 per minute per annotator

The rate is 2.2 in TREC.
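The two averages follow from the counts given on the slides (2200 judgments, eight systems, 60 topics, six annotators, 2.5 hours). A quick check, assuming every annotator worked for the full 2.5 hours:

```python
judgments = 2200
systems, topics = 8, 60
annotators = 6
minutes = 2.5 * 60  # assumption: each annotator worked the full 2.5 hours

per_system_per_topic = judgments / (systems * topics)
per_minute_per_annotator = judgments / (minutes * annotators)
```

These come out slightly above 4.5 and slightly below 2.5 respectively, consistent with the rounded figures quoted above.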

The systems are ranked by expected value of MAP:

$$EMAP = \sum_{t} E[AP_t], \qquad E[AP_t] = \frac{1}{\sum_i p_i} \Big( \sum_i c_{ii} \, p_i + \sum_{j > i} c_{ij} \, p_i p_j \Big)$$

where p_i = 1 if document i has been judged relevant, 0 if irrelevant, and 0.5 otherwise.
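A minimal sketch of the expected-AP computation for a single ranked list with partial judgments. It assumes the coefficient for a pair of documents is 1/max{rank(i), rank(j)}, as in the average precision slide, and uses the slide's rule for p_i (1 if judged relevant, 0 if judged irrelevant, 0.5 if unjudged). The helper names and the example judgments are hypothetical.

```python
def probabilities_from_judgments(judgments):
    """Map True -> 1.0, False -> 0.0 and None (unjudged) -> 0.5."""
    return [0.5 if j is None else float(j) for j in judgments]


def expected_ap(probabilities):
    """Expected AP of one ranked list; `probabilities` is p_1..p_n in rank order."""
    total_p = sum(probabilities)
    if total_p == 0:
        return 0.0
    n = len(probabilities)
    # Diagonal terms: coefficient 1/rank(i) times p_i.
    diagonal = sum(p / (i + 1) for i, p in enumerate(probabilities))
    # Off-diagonal terms with j > i: coefficient 1/rank(j) times p_i * p_j.
    pairs = sum(probabilities[i] * probabilities[j] / (j + 1)
                for i in range(n) for j in range(i + 1, n))
    return (diagonal + pairs) / total_p


# Hypothetical list: rank 1 relevant, rank 2 unjudged, rank 3 irrelevant, rank 4 relevant.
p = probabilities_from_judgments([True, None, False, True])
estimate = expected_ap(p)

# With every document judged, the estimate reduces to ordinary AP.
fully_judged = expected_ap(probabilities_from_judgments([True, False, True, True, False]))
```

Summing such estimates over topics gives the quantity used to rank the systems.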


Table 1: True MAPs of eight systems over 200 baseline topics, and expected MAP, with 95% confidence intervals, over 60 new topics. Horizontal lines indicate “bin” divisions determined by statistical significance.


Figure 2: Confidence increases as more judgments are made.

Discussion
Some simulations using the Robust 2005 topics and judgments by NIST evaluate performance of the algorithm.

Some questions are explored:
– To what degree are the results dependent on the algorithm rather than the evaluation metric?
– How many judgments are required to differentiate a single pair of ranked lists with 95% confidence?
– How does confidence vary as more judgments are made?
– Are test collections produced by our algorithm reusable?

Comparing εMAP and MAP

Simulation: after several documents have been judged, εMAP and MAP are calculated for all systems, and the resulting rankings are compared with the true ranking using Kendall’s tau correlation.
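Kendall's tau between an estimated and a true system ranking can be sketched in a few lines. The slides do not say which variant was used; this example assumes the simple tau-a form (concordant minus discordant pairs over all pairs), and the system names and scores are hypothetical.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score dicts over the same systems."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        agreement = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1
        # Tied pairs count as neither concordant nor discordant.
    num_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / num_pairs


# Hypothetical true MAP and estimated scores for four systems.
true_map = {"A": 0.30, "B": 0.25, "C": 0.20, "D": 0.10}
estimated = {"A": 0.28, "B": 0.29, "C": 0.15, "D": 0.12}

tau = kendall_tau(true_map, estimated)
```

A value of 1 means the two rankings agree on every pair of systems; each swapped pair lowers the score by 2 divided by the number of pairs.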

How Many Judgments?

The number of judgments that must be made in comparing two systems depends on how similar they are.

Fig. 4: Absolute difference in true AP for Robust 2005 topics vs. number of judgments to 95% confidence for pairs of ranked lists for individual topics.

Confidence over Time

Incremental pooling: all documents in a pool of depth k will be judged.

The pool of depth 10 contains 2228 documents, 569 of which are not in the algorithmic set.
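For comparison, depth-k pooling can be sketched as a union of each run's top k documents, with incremental pooling adding only the documents that are new at each depth. The run names and document ids here are hypothetical, and this is an illustration of the general technique rather than the exact procedure used in the experiments.

```python
def pool_at_depth(runs, depth):
    """Union of the top `depth` documents from every run."""
    pooled = set()
    for ranking in runs.values():
        pooled.update(ranking[:depth])
    return pooled


def incremental_pools(runs, max_depth):
    """Yield (depth, documents newly added to the pool at that depth)."""
    judged = set()
    for depth in range(1, max_depth + 1):
        new_docs = pool_at_depth(runs, depth) - judged
        judged |= new_docs
        yield depth, new_docs


# Hypothetical ranked lists from three systems for one topic.
runs = {
    "sys1": ["d1", "d2", "d3"],
    "sys2": ["d2", "d4", "d1"],
    "sys3": ["d5", "d1", "d2"],
}

steps = list(incremental_pools(runs, max_depth=3))
```

Because overlapping results are judged only once, the pool grows more slowly than depth times the number of systems.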

Reusability of Test Collections

One of the eight systems is removed when building each test collection.

All eight systems are then ranked by εMAP, setting p_i as the ratio of relevant documents in the test collection.

Table 4: Reusability of test collections. The 8th system is always placed in the correct spot or swapped with the next (statistically indistinguishable) system.

Conclusion
A new perspective on average precision leads to an algorithm for selecting which documents to judge so that retrieval systems can be evaluated in minimal time.

After only six hours of annotation time, we had achieved a ranking with 90% confidence.

A direction for future work is extending the analysis to other evaluation metrics for different tasks.

Another direction is estimating probabilities of relevance.
