IR System Evaluation
Farhad Oroumchian
• System-centered strategy
  – Given documents, queries, and relevance judgments
  – Try several variations on the retrieval system
  – Measure which gets more good docs near the top
• User-centered strategy
  – Given several users, and at least 2 retrieval systems
  – Have each user try the same task on both systems
  – Measure which system works the “best”
Which is the Best Rank Order?
[Figure: eight candidate rank orders, labeled a through h, shown as columns with relevant documents marked “R” at different rank positions]
Measures of Effectiveness
• Good measures of effectiveness should
  – Capture some aspect of what the user wants
  – Have predictive value for other situations
    • Different queries, different document collections
  – Be easily replicated by other researchers
  – Be expressed as a single number
    • Allows two systems to be easily compared
• No measures of effectiveness are that good!
Some Assumptions
• Unchanging, known queries
  – The same queries are used by each system
• Binary relevance
  – Every document is either relevant or it is not
• Unchanging, known relevance
  – The relevance of each doc to each query is known
    • But only used for evaluation, not retrieval!
• Focus on effectiveness, not efficiency
Exact Match MOE
• Precision
  – How much of what was found is relevant?
    • Often of interest, particularly for interactive searching
• Recall
  – How much of what is relevant was found?
    • Particularly important for law and patent searches
• Fallout
  – How much of what was irrelevant was rejected?
    • Useful when collections of different sizes are compared
The Contingency Table
| Doc \ Action | Retrieved            | Not Retrieved       |
|--------------|----------------------|---------------------|
| Relevant     | Relevant Retrieved   | Relevant Rejected   |
| Not relevant | Irrelevant Retrieved | Irrelevant Rejected |

Precision = Relevant Retrieved / Retrieved
Recall = Relevant Retrieved / Relevant
Fallout = Irrelevant Rejected / Not Relevant
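The three exact-match measures read straight off the contingency table. A minimal sketch, using hypothetical counts (10 documents, 4 relevant, 5 retrieved of which 3 are relevant):

```python
def precision(rel_ret, irr_ret):
    """Relevant Retrieved / Retrieved."""
    return rel_ret / (rel_ret + irr_ret)

def recall(rel_ret, rel_rej):
    """Relevant Retrieved / Relevant."""
    return rel_ret / (rel_ret + rel_rej)

def fallout(irr_rej, irr_ret):
    """Irrelevant Rejected / Not Relevant (as defined on this slide)."""
    return irr_rej / (irr_rej + irr_ret)

# Hypothetical query: 10 docs, 4 relevant; the system retrieves 5,
# of which 3 are relevant (so 1 relevant doc is rejected).
print(precision(3, 2))   # 0.6
print(recall(3, 1))      # 0.75
print(fallout(4, 2))     # 0.666...
```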
MOE for Ranked Retrieval
• Start with the first relevant doc on the list
  – Compute recall and precision at that point
• Move down the list, 1 relevant doc at a time
  – Computing recall and precision at each point
• Plot precision for every value of recall
  – Interpolate with a nonincreasing step function
• Repeat for several queries
• Average the plots at every point
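The procedure above can be sketched for a hypothetical 10-document ranking with 4 relevant documents (the document IDs are made up for illustration):

```python
def pr_points(ranking, relevant):
    """(recall, precision) at the rank of each relevant document found."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

def interpolate(points, level):
    """Nonincreasing step interpolation: best precision at recall >= level."""
    ps = [p for r, p in points if r >= level]
    return max(ps) if ps else 0.0

ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d6", "d10"}
pts = pr_points(ranking, relevant)
# pts -> [(0.25, 1.0), (0.5, 2/3), (0.75, 0.5), (1.0, 0.4)]
```

Taking the maximum precision at any recall at or beyond each level is what makes the interpolated curve a nonincreasing step function.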
The Precision-Recall Curve

[Figure: a ranked list of 10 documents, 4 relevant and 6 not relevant; precision and recall computed at each of ranks 1 through 10; two plots of precision versus recall, one raw and one interpolated with a nonincreasing step function]
Single-Number MOE
• Precision at a fixed number of documents
  – Precision at 10 docs is the “AltaVista measure”
• Precision at a given level of recall
  – Adjusts for the total number of relevant docs
• Average precision
  – Average of precision at recall=0.0, 0.1, …, 1.0
  – Area under the precision/recall curve
• Breakeven point
  – Point where precision = recall
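A sketch of two of these measures, precision at a fixed cutoff and the 11-point average precision. The ranking, relevant set, and (recall, precision) points are hypothetical, the ones a 10-document list with 4 relevant documents would produce:

```python
def precision_at_k(ranking, relevant, k):
    """Precision after the top k documents."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def eleven_point_average(points):
    """Average interpolated precision at recall = 0.0, 0.1, ..., 1.0."""
    def interp(level):
        ps = [p for r, p in points if r >= level]
        return max(ps) if ps else 0.0
    return sum(interp(i / 10) for i in range(11)) / 11

ranking = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d6", "d10"}
# (recall, precision) at each relevant document's rank
points = [(0.25, 1.0), (0.5, 2 / 3), (0.75, 0.5), (1.0, 0.4)]

p10 = precision_at_k(ranking, relevant, 10)   # 0.4
ap = eleven_point_average(points)             # about 0.65
```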
[Figure: a precision-recall curve annotated with the single-number MOE: precision at recall=0.1, precision at 10 docs, average precision, and the breakeven point]
Single-Number MOE Weaknesses
• Precision at 10 documents
  – Pays no attention to recall
• Precision at constant recall
  – A specific recall fraction is rarely the user’s goal
• Breakeven point
  – Nobody ever searches at the breakeven point
• Average precision
  – Users typically operate near an extreme of the curve
    • So the average is not very informative
Why Choose Average Precision?
• It is easy to trade between recall and precision
  – Adding related query terms improves recall
    • But naive query expansion techniques kill precision
  – Limiting matches by part-of-speech helps precision
    • But it almost always hurts recall
• Comparisons should give some weight to both
  – Average precision is a principled way to do this
    • Rewards improvements in either factor
How Much is Enough?
• The maximum average precision is 1.0
  – But inter-rater reliability is 0.8 or less
  – So 0.8 is a practical upper bound at every point
    • Precision 0.8 is sometimes seen at low recall
• Two goals
  – Achieve a meaningful amount of improvement
    • This is a judgment call, and depends on the application
  – Achieve that improvement reliably across queries
    • This can be verified using statistical tests
Statistical Significance Tests
• How sure can you be that an observed difference doesn’t simply result from the particular queries you chose?

Experiment 1
| Query   | System A | System B |
|---------|----------|----------|
| 1       | 0.20     | 0.40     |
| 2       | 0.21     | 0.41     |
| 3       | 0.22     | 0.42     |
| 4       | 0.19     | 0.39     |
| 5       | 0.17     | 0.37     |
| 6       | 0.20     | 0.40     |
| 7       | 0.21     | 0.41     |
| Average | 0.20     | 0.40     |

Experiment 2
| Query   | System A | System B |
|---------|----------|----------|
| 1       | 0.02     | 0.76     |
| 2       | 0.39     | 0.07     |
| 3       | 0.16     | 0.37     |
| 4       | 0.58     | 0.21     |
| 5       | 0.04     | 0.02     |
| 6       | 0.09     | 0.91     |
| 7       | 0.12     | 0.46     |
| Average | 0.20     | 0.40     |
The Sign Test
• Compare the average precision for each query
  – Note which system produces the bigger value
• Assume that either system is equally likely to produce the bigger value for any query
• Compute the probability of the outcome you got
  – Any statistics package contains the formula for this
• Probabilities < 0.05 are “statistically significant”
  – But they still need to pass the “meaningful” test!
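The sign test needs nothing beyond the binomial formula. A sketch using only the standard library, applied to the win/loss splits from the two experiments above (System B wins every query in Experiment 1, but only 4 of 7 in Experiment 2):

```python
from math import comb

def sign_test_p(wins_a, wins_b):
    """Two-sided sign test: probability of a win/loss split at least this
    extreme if each system is equally likely to win any query (ties dropped)."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Experiment 1: System B wins all 7 queries -> significant
print(sign_test_p(0, 7))   # 0.015625
# Experiment 2: System A wins 3, System B wins 4 -> not significant
print(sign_test_p(3, 4))   # 1.0
```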
The Student’s t-Test
• More powerful than the sign test
  – If the assumptions are satisfied
• Compute the average precision difference
  – On a query-by-query basis, for enough queries to approximate a normal distribution
• Assume that the queries are independent
• Compute the probability of the outcome you got
  – Again, any statistics package can be used
• A probability < 0.05 is “statistically significant”
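A sketch of the paired t statistic using only the standard library, applied to the Experiment 2 scores above. A statistics package would also return the p-value; here, comparing against the two-tailed critical value for 6 degrees of freedom makes the same point:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic on the per-query average-precision differences."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

# Experiment 2 from the earlier slide: the averages differ (0.20 vs 0.40),
# but the per-query differences vary widely.
sys_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
sys_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]
t = paired_t(sys_b, sys_a)   # about 1.12
# The two-tailed critical value for df = 6 at p = 0.05 is 2.447, so this
# difference is not statistically significant despite the doubled average.
```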
Obtaining Relevance Judgments
• Exhaustive assessment can be too expensive
  – TREC has 50 queries for >1 million docs each year
• Random sampling won’t work either
  – If relevant docs are rare, none may be found!
• IR systems can help focus the sample
  – Each system finds some relevant documents
  – Different systems find different relevant documents
  – Together, enough systems will find most of them
Pooled Assessment Methodology
• Each system submits its top 1000 documents
• The top 100 documents from each are judged
  – All are placed in a single pool
  – Duplicates are eliminated
  – Placed in an arbitrary order to avoid bias
• Evaluated by the person that wrote the query
• Unevaluated documents are assumed not relevant
  – Overlap evaluation shows diminishing returns
• Compute average precision over all 1000 docs
TREC Overview
• Documents are typically distributed in April
• Topics are distributed June 1
  – Queries are formed from topics using standard rules
• Top 1000 selections are due August 15
• Special interest track results are due 2 weeks later
  – Cross-language IR, Spoken Document Retrieval, …
• Relevance judgments available in October
• Results presented in late November each year
Concerns About Precision/Recall
• Statistical significance may be meaningless
• Average precision won’t reveal curve shape
  – Averaging over recall washes out information
• How can you know the quality of the pool?
• How to extrapolate to other collections?
Project Test Collection
• Articles of law from the Dadeh Pardazi legal system
• 41 queries from the legal system
• Relevance judgments keyed to ItemID
  – Relevance is on a 0–4 scale
    • 0 = completely irrelevant
    • 1 = irrelevant
    • 2 = somewhat relevant
    • 3 = relevant
    • 4 = completely relevant
  – Usually 3 and 4 are treated as relevant, and 0, 1, and 2 as irrelevant
Project Overview
• Install or write the software
• Choose the parameters
  – Stopwords, stemming, term weights, etc.
• Index the document collection
  – This may require some format-specific tweaking
• Run the 20 queries
• Compute average precision and other measures
• Test the query-length effect for statistical significance
Team Project User Studies
• Measure the value of some part of the interface
  – e.g., selection interface with and without titles
• Choose a dependent variable to measure
  – e.g., number of documents examined
• Run a pilot study with users from your team
  – Fine-tune your experimental procedure
• Run the experiment with at least 3 subjects
  – From outside your team (may be in the class)