Information Retrieval
Quality of a Search Engine
Is it good? How fast does it index?
- Number of documents/hour (at a given average document size)
How fast does it search?
- Latency as a function of index size
Expressiveness of the query language
Measures for a search engine
All of the preceding criteria are measurable. The key measure, however, is user happiness: fast but useless answers won't make a user happy.
Happiness is elusive to measure. The commonest approach is to measure the relevance of search results. How do we measure it? It requires 3 elements:
1. A benchmark document collection
2. A benchmark suite of queries
3. A binary assessment of either Relevant or Irrelevant for each query-doc pair
Evaluating an IR system: standard benchmarks
TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years.
Other doc collections: marked by human experts, for each query and for each doc, as Relevant or Irrelevant.
On the Web everything is more complicated, since we cannot mark the entire corpus!
General scenario
[Figure: Venn diagram of the Retrieved and Relevant sets of documents within the collection]
Precision: % of retrieved docs that are relevant [the issue: how much "junk" is found]
Recall: % of relevant docs that are retrieved [the issue: how much of the "info" is found]
How to compute them
Precision: fraction of retrieved docs that are relevant: P = tp / (tp + fp)
Recall: fraction of relevant docs that are retrieved: R = tp / (tp + fn)

                | Relevant            | Not Relevant
  Retrieved     | tp (true positive)  | fp (false positive)
  Not Retrieved | fn (false negative) | tn (true negative)
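As a sketch of the two formulas above, precision and recall can be computed directly from the retrieved and relevant document sets (the doc IDs below are purely illustrative):

```python
def precision_recall(retrieved, relevant):
    # tp: docs both retrieved and relevant; fp: retrieved but not relevant;
    # fn: relevant but not retrieved. (tn is not needed for P and R.)
    tp = len(retrieved & relevant)
    fp = len(retrieved - relevant)
    fn = len(relevant - retrieved)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical judgments: 3 docs retrieved, 4 docs relevant, 2 in common.
retrieved = {"d1", "d2", "d3"}
relevant = {"d2", "d3", "d4", "d5"}
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 2/3 and 2/4 = 0.5
```

Note that a full evaluation would average these values over the whole benchmark query suite.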
Some considerations
You can get high recall (but low precision) by retrieving all docs for all queries!
Recall is a non-decreasing function of the number of docs retrieved, while precision usually decreases.
Precision vs. Recall
[Figures: Venn diagrams of the Retrieved and Relevant sets, illustrating four scenarios]
- Highest precision, very low recall
- Lowest precision and recall
- Low precision and very high recall
- Very high precision and recall
Precision-Recall curve
We measure precision at various levels of recall. Note: it is an AVERAGE over many queries.
[Figure: precision vs. recall plot with measured points]
A common picture
[Figure: a typical precision-recall plot over many queries]
Interpolated precision
If you can increase precision by increasing recall, then you should get to count that: the interpolated precision at recall level r is the maximum precision observed at any recall level ≥ r.
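A minimal sketch of this rule, assuming the measured curve is given as hypothetical (recall, precision) points taken while walking down a ranked result list:

```python
def interpolated_precision(points, r):
    # Interpolated precision at recall level r: the maximum precision
    # observed at any recall >= r (0.0 if no such point exists).
    return max((p for rec, p in points if rec >= r), default=0.0)

# Hypothetical (recall, precision) measurements.
points = [(0.2, 1.0), (0.4, 0.67), (0.6, 0.5), (0.8, 0.45), (1.0, 0.3)]
print(interpolated_precision(points, 0.5))  # max precision over recall >= 0.5 -> 0.5
```

This is what smooths the raw sawtooth-shaped curve into a non-increasing one.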
Other measures
Precision at fixed depth: most appropriate for web search, e.g. precision in the top 10 results.
11-point interpolated average precision: the standard measure for TREC. Take the interpolated precision at 11 recall levels, from 0% to 100% in steps of 10%, and average them.
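The 11-point measure can be sketched by reusing the interpolation rule above; again the (recall, precision) points are hypothetical measurements, one list per query before averaging:

```python
def eleven_point_avg_precision(points):
    # points: list of (recall, precision) pairs measured down a ranked list.
    levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    interp = [max((p for rec, p in points if rec >= r), default=0.0)
              for r in levels]
    return sum(interp) / len(interp)

# Toy curve: precision 1.0 up to recall 0.5, then 0.5 up to recall 1.0.
print(eleven_point_avg_precision([(0.5, 1.0), (1.0, 0.5)]))  # (6*1.0 + 5*0.5)/11
```

The TREC figure is then the mean of this value over all benchmark queries.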
F measure
Combined measure (weighted harmonic mean of precision and recall):

  1/F = α·(1/P) + (1−α)·(1/R),  i.e.  F = (β² + 1)·P·R / (β²·P + R),  with β² = (1−α)/α

People usually use the balanced F1 measure, i.e., with β = 1 (α = ½), thus 1/F = ½·(1/P + 1/R), i.e. F1 = 2PR / (P + R).
Use this if you need to optimize a single measure that balances precision and recall.
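The weighted harmonic mean above translates directly into code; this is a sketch using the β parameterization:

```python
def f_measure(p, r, beta=1.0):
    # Weighted harmonic mean of precision p and recall r:
    # F = (beta^2 + 1) * p * r / (beta^2 * p + r).
    # beta = 1 gives the balanced F1 measure.
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.5, 0.5))  # balanced F1 of equal P and R -> 0.5
```

Because the harmonic mean is dominated by the smaller of P and R, a system cannot score well on F1 by sacrificing one for the other.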