
2004.09.28 SLIDE 1IS 202 – FALL 2004

Lecture 9: IR Evaluation

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2004
http://www.sims.berkeley.edu/academics/courses/is202/f04/

SIMS 202:

Information Organization

and Retrieval

2004.09.28 SLIDE 2IS 202 – FALL 2004

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2004.09.28 SLIDE 3IS 202 – FALL 2004

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2004.09.28 SLIDE 4IS 202 – FALL 2004

Probability Ranking Principle

• “If a reference retrieval system’s response to each request is a ranking of the documents in the collections in the order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.”

Stephen E. Robertson, J. Documentation 1977

2004.09.28 SLIDE 5IS 202 – FALL 2004

Model 1 – Maron and Kuhns

• Concerned with estimating probabilities of relevance at the point of indexing:
  – If a patron came with a request using term ti, what is the probability that she/he would be satisfied with document Dj?

2004.09.28 SLIDE 6IS 202 – FALL 2004

Model 2

• Documents have many different properties; some documents have all the properties that the patron asked for, and other documents have only some or none of the properties. If the inquiring patron were to examine all of the documents in the collection she/he might find that some having all the sought after properties were relevant, but others (with the same properties) were not relevant. And conversely, he/she might find that some of the documents having none (or only a few) of the sought after properties were relevant, others not. The function of a document retrieval system is to compute the probability that a document is relevant, given that it has one (or a set) of specified properties.

Robertson, Maron & Cooper, 1982

2004.09.28 SLIDE 7IS 202 – FALL 2004

Model 2 – Robertson & Sparck Jones

Given a term t and a query q:

                             Document Relevance
                               +            -
  Document         +           r            n-r            n
  Indexing         -           R-r          N-n-R+r        N-n
                                R            N-R            N

2004.09.28 SLIDE 8IS 202 – FALL 2004

Robertson-Sparck Jones Weights

• Retrospective formulation

$$w = \log \frac{r/(R-r)}{(n-r)/(N-n-R+r)}$$

2004.09.28 SLIDE 9IS 202 – FALL 2004

Robertson-Sparck Jones Weights

• Predictive formulation

$$w^{(1)} = \log \frac{(r+0.5)/(R-r+0.5)}{(n-r+0.5)/(N-n-R+r+0.5)}$$
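To make the predictive weight concrete, here is a minimal sketch in Python; the counts passed in at the bottom are invented purely for illustration and follow the contingency table two slides back:

```python
import math

def rsj_weight(r, n, R, N):
    """Predictive Robertson-Sparck Jones weight with the 0.5 corrections above.
    r = relevant documents indexed by the term, n = documents indexed by the term,
    R = relevant documents for the query, N = documents in the collection."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# Invented counts: a term occurring in 8 of 10 relevant documents but in only
# 100 of 10,000 documents overall receives a large positive weight.
print(rsj_weight(r=8, n=100, R=10, N=10_000))   # about 5.9
```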

2004.09.28 SLIDE 10IS 202 – FALL 2004

Probabilistic Models: Some Unifying Notation

• D = All present and future documents

• Q = All present and future queries

• (Di,Qj) = A document query pair

• x = class of similar documents, x ⊆ D

• y = class of similar queries, y ⊆ Q

• Relevance (R) is a relation:

  R = {(Di, Qj) | Di ∈ D, Qj ∈ Q, document Di is judged relevant by the user submitting Qj}

2004.09.28 SLIDE 11IS 202 – FALL 2004

Probabilistic Models

• Model 1 -- Probabilistic Indexing, P(R|y,Di)

• Model 2 -- Probabilistic Querying, P(R|Qj,x)

• Model 3 -- Merged Model, P(R| Qj, Di)

• Model 0 -- P(R|y,x)

• Probabilities are estimated based on prior usage or relevance estimation

2004.09.28 SLIDE 12IS 202 – FALL 2004

Probabilistic Models

[Figure: the document space D, containing class x and document Di, and the query space Q, containing class y and query Qj]

2004.09.28 SLIDE 13IS 202 – FALL 2004

Logistic Regression

• Another approach to estimating probability of relevance

• Based on work by William Cooper, Fred Gey and Daniel Dabney

• Builds a regression model for relevance prediction based on a set of training data

• Uses less restrictive independence assumptions than Model 2
  – Linked Dependence
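The slide names the approach but not the mechanics, so here is a toy sketch of fitting a logistic regression that maps matching features of a query/document pair to a probability of relevance. The features and training data below are invented, and this is not the specific clue set used in the Cooper/Gey/Dabney work:

```python
from sklearn.linear_model import LogisticRegression

# Invented features per (query, document) pair:
# [number of query terms present in the document, log of document length]
X = [[0, 4.1], [1, 5.0], [2, 4.7], [3, 5.2], [1, 6.0], [4, 4.9]]
y = [0, 0, 1, 1, 0, 1]   # 1 = judged relevant in the (made-up) training data

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[2, 5.0]])[0][1])   # estimated P(relevance)
```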

2004.09.28 SLIDE 14IS 202 – FALL 2004

Logistic Regression

[Figure: plot of relevance (0–100) against term frequency in document (0–60)]

2004.09.28 SLIDE 15IS 202 – FALL 2004

Relevance Feedback

• Main Idea:
  – Modify existing query based on relevance judgements
    • Extract terms from relevant documents and add them to the query
    • And/or re-weight the terms already in the query
  – Two main approaches:
    • Automatic (pseudo-relevance feedback)
    • Users select relevant documents
  – Users/system select terms from an automatically-generated list

2004.09.28 SLIDE 16IS 202 – FALL 2004

Rocchio Method

$$Q_1 = Q_0 + \frac{\beta}{n_1}\sum_{i=1}^{n_1} R_i - \frac{\gamma}{n_2}\sum_{i=1}^{n_2} S_i$$

where
  Q0 = the vector for the initial query
  Ri = the vector for relevant document i
  Si = the vector for non-relevant document i
  n1 = the number of relevant documents chosen
  n2 = the number of non-relevant documents chosen
  β and γ tune the importance of relevant and non-relevant terms
    (in some studies best set to 0.75 and 0.25)

2004.09.28 SLIDE 17IS 202 – FALL 2004

Rocchio/Vector Illustration

[Figure: vector-space plot over the terms “Information” and “Retrieval” (axes 0 to 1.0), showing D1, D2, Q0, Q’, and Q”]

Q0 = retrieval of information = (0.7, 0.3)
D1 = information science = (0.2, 0.8)
D2 = retrieval systems = (0.9, 0.1)

Q’ = ½ Q0 + ½ D1 = (0.45, 0.55)
Q” = ½ Q0 + ½ D2 = (0.80, 0.20)
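As a sanity check on the illustration, here is a minimal sketch that blends the query vector with one feedback document using equal ½ weights, a simplified special case of the Rocchio update above (no non-relevant documents, a single fixed weight):

```python
def mix(q, d, w_query=0.5, w_doc=0.5):
    """Blend a query vector with one relevant document vector."""
    return tuple(w_query * qi + w_doc * di for qi, di in zip(q, d))

Q0 = (0.7, 0.3)   # "retrieval of information"
D1 = (0.2, 0.8)   # "information science"
D2 = (0.9, 0.1)   # "retrieval systems"

print(mix(Q0, D1))   # ~(0.45, 0.55), i.e. Q'
print(mix(Q0, D2))   # ~(0.80, 0.20), i.e. Q"
```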

2004.09.28 SLIDE 18IS 202 – FALL 2004

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2004.09.28 SLIDE 19IS 202 – FALL 2004

IR Evaluation

• Why Evaluate?

• What to Evaluate?

• How to Evaluate?

2004.09.28 SLIDE 20IS 202 – FALL 2004

Why Evaluate?

• Determine if the system is desirable

• Make comparative assessments
  – Is system X better than system Y?

• Others?

2004.09.28 SLIDE 21IS 202 – FALL 2004

What to Evaluate?

• How much of the information need is satisfied

• How much was learned about a topic

• Incidental learning:
  – How much was learned about the collection
  – How much was learned about other topics

• Can serendipity be measured?

• How inviting the system is

2004.09.28 SLIDE 22IS 202 – FALL 2004

Relevance (revisited)

• In what ways can a document be relevant to a query?
  – Answer precise question precisely
  – Partially answer question
  – Suggest a source for more information
  – Give background information
  – Remind the user of other knowledge
  – Others...

2004.09.28 SLIDE 23IS 202 – FALL 2004

Relevance (revisited)

• How relevant is the document?
  – For this user, for this information need
    • Subjective, but
    • Measurable to some extent
  – How often do people agree a document is relevant to a query?

• How well does it answer the question?
  – Complete answer? Partial?
  – Background information?
  – Hints for further exploration?

2004.09.28 SLIDE 24IS 202 – FALL 2004

What to Evaluate?

• What can be measured that reflects users’ ability to use the system? (Cleverdon 66)
  – Coverage of information
  – Form of presentation
  – Effort required/ease of use
  – Time and space efficiency
  – Recall
    • Proportion of relevant material actually retrieved
  – Precision
    • Proportion of retrieved material actually relevant
  (Recall and precision together measure effectiveness)

2004.09.28 SLIDE 25IS 202 – FALL 2004

Relevant vs. Retrieved

[Figure: Venn diagram showing the retrieved set and the relevant set within all docs]

2004.09.28 SLIDE 26IS 202 – FALL 2004

Precision vs. Recall

$$\text{Recall} = \frac{|\text{RelRetrieved}|}{|\text{Rel in Collection}|}$$

$$\text{Precision} = \frac{|\text{RelRetrieved}|}{|\text{Retrieved}|}$$

[Figure: Venn diagram showing the retrieved set and the relevant set within all docs]
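A minimal sketch of the two definitions in code, treating the retrieved results and the relevant documents as plain sets of IDs (the IDs below are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall computed from sets of document IDs."""
    hits = retrieved & relevant                 # RelRetrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}            # what the system returned
relevant = {"d2", "d4", "d7", "d9", "d11"}      # relevant docs in the collection
print(precision_recall(retrieved, relevant))    # (0.5, 0.4)
```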

2004.09.28 SLIDE 27IS 202 – FALL 2004

Why Precision and Recall?

Get as much good stuff as possible while at the same time getting as little junk as possible

2004.09.28 SLIDE 28IS 202 – FALL 2004

Retrieved vs. Relevant Documents

Very high precision, very low recall

[Figure: Venn diagram of the retrieved and relevant sets for this case]

2004.09.28 SLIDE 29IS 202 – FALL 2004

Retrieved vs. Relevant Documents

Very low precision, very low recall (0 in fact)

[Figure: Venn diagram of the retrieved and relevant sets for this case]

2004.09.28 SLIDE 30IS 202 – FALL 2004

Retrieved vs. Relevant Documents

High recall, but low precision

[Figure: Venn diagram of the retrieved and relevant sets for this case]

2004.09.28 SLIDE 31IS 202 – FALL 2004

Retrieved vs. Relevant Documents

High precision, high recall (at last!)

[Figure: Venn diagram of the retrieved and relevant sets for this case]

2004.09.28 SLIDE 32IS 202 – FALL 2004

Precision/Recall Curves

• There is a well-known tradeoff between Precision and Recall

• So we typically measure Precision at different (fixed) levels of Recall

• Note: this is an AVERAGE over MANY queries

[Figure: precision (y-axis) vs. recall (x-axis), with measured points at several recall levels]
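To make "precision at fixed recall levels" concrete, here is a minimal sketch for a single query. The ranking and judgments are hypothetical; TREC-style evaluation uses eleven recall levels (0.0 through 1.0) and averages the resulting curves over many queries:

```python
def precision_at_recall_levels(ranking, relevant, levels=(0.25, 0.5, 0.75, 1.0)):
    """For one query, report the best precision achieved at or beyond each
    recall level (a simple form of interpolation)."""
    hits = 0
    points = []   # (recall, precision) after each retrieved document
    for k, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return {lvl: max((p for r, p in points if r >= lvl), default=0.0)
            for lvl in levels}

ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]   # hypothetical ranked output
relevant = {"d7", "d9", "d2"}
print(precision_at_recall_levels(ranking, relevant))
```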

2004.09.28 SLIDE 33IS 202 – FALL 2004

Precision/Recall Curves

• Difficult to determine which of these two hypothetical results is better:

[Figure: two hypothetical precision/recall curves]

2004.09.28 SLIDE 34IS 202 – FALL 2004

TREC (Manual Queries)

2004.09.28 SLIDE 35IS 202 – FALL 2004

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2004.09.28 SLIDE 36IS 202 – FALL 2004

Document Cutoff Levels

• Another way to evaluate:
  – Fix the number of documents retrieved at several levels:
    • Top 5
    • Top 10
    • Top 20
    • Top 50
    • Top 100
    • Top 500
  – Measure precision at each of these levels
  – (Possibly) take the average over levels

• This is a way to focus on how well the system ranks the first k documents
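A minimal sketch of precision at a document cutoff; the ranked list and judgments are hypothetical, and in practice the value at each cutoff is averaged over many queries:

```python
def precision_at_k(ranking, relevant, k):
    """Precision over the top-k documents of a ranked list."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]   # hypothetical ranked output
relevant = {"d7", "d9", "d2"}
for k in (5, 10):
    print(k, precision_at_k(ranking, relevant, k))   # 0.4 at 5, 0.3 at 10
```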

2004.09.28 SLIDE 37IS 202 – FALL 2004

Problems with Precision/Recall

• Can’t know true recall value
  – Except in small collections

• Precision/Recall are related
  – A combined measure sometimes more appropriate

• Assumes batch mode
  – Interactive IR is important and has different criteria for successful searches
  – We will touch on this in the UI section

• Assumes that a strict rank ordering matters

2004.09.28 SLIDE 38IS 202 – FALL 2004

Relation to Contingency Table

                            Doc is Relevant    Doc is NOT relevant
  Doc is retrieved                 a                    b
  Doc is NOT retrieved             c                    d

• Accuracy: (a+d) / (a+b+c+d)
• Precision: a/(a+b)
• Recall: ?
• Why don’t we use Accuracy for IR Evaluation? (Assuming a large collection)
  – Most docs aren’t relevant
  – Most docs aren’t retrieved
  – Inflates the accuracy value
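The inflation effect is easy to see with made-up counts: because d (not relevant, not retrieved) dwarfs everything else in a large collection, accuracy stays near 1 even when precision and recall are poor. (Recall, the "?" above, is a/(a+c).)

```python
# Hypothetical counts for one query in a collection of one million documents
a, b = 20, 80            # retrieved:     relevant / not relevant
c, d = 30, 999_870       # not retrieved: relevant / not relevant

accuracy = (a + d) / (a + b + c + d)
precision = a / (a + b)
recall = a / (a + c)
print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f}")
# accuracy is ~0.9999 even though precision is 0.20 and recall is 0.40
```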

2004.09.28 SLIDE 39IS 202 – FALL 2004

The E-Measure

Combine Precision and Recall into one number (van Rijsbergen 79)

P = precision
R = recall
β = measure of relative importance of P or R. For example:
  β = 1 means the user is equally interested in precision and recall
  β = ∞ means the user doesn’t care about precision
  β = 0 means the user doesn’t care about recall

$$E = 1 - \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}}, \qquad \alpha = \frac{1}{\beta^2 + 1}$$
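A minimal sketch of the formula with made-up precision and recall values; at β = 1 it equals one minus the harmonic-mean F measure on the next slide, and the extreme β values behave as described above:

```python
def e_measure(P, R, beta):
    """van Rijsbergen's E, with alpha = 1 / (beta**2 + 1) as above."""
    alpha = 1 / (beta ** 2 + 1)
    return 1 - 1 / (alpha * (1 / P) + (1 - alpha) * (1 / R))

# Made-up values: P = 0.5, R = 0.4
print(e_measure(0.5, 0.4, beta=1))      # ~0.556, i.e. 1 - F
print(e_measure(0.5, 0.4, beta=0))      # 0.5 = 1 - P: recall is ignored
print(e_measure(0.5, 0.4, beta=1e6))    # ~0.6 = 1 - R: precision is ignored
```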

2004.09.28 SLIDE 40IS 202 – FALL 2004

F Measure (Harmonic Mean)

$$F(j) = \frac{2}{\frac{1}{r(j)} + \frac{1}{P(j)}}$$

where r(j) is the recall for the j-th document and P(j) is the precision for the j-th document
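The same made-up values run through the harmonic mean (a minimal sketch; in ranked evaluation r(j) and P(j) would be recomputed after each document j in the ranking):

```python
def f_measure(recall, precision):
    """Harmonic mean of recall and precision (both assumed > 0)."""
    return 2 / (1 / recall + 1 / precision)

print(f_measure(0.4, 0.5))   # ~0.444, consistent with 1 - E at beta = 1
```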

2004.09.28 SLIDE 41IS 202 – FALL 2004

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2004.09.28 SLIDE 42IS 202 – FALL 2004

Test Collections

• Cranfield 2
  – 1400 Documents, 221 Queries
  – 200 Documents, 42 Queries
• INSPEC – 542 Documents, 97 Queries
• UKCIS – >10,000 Documents, multiple sets, 193 Queries
• ADI – 82 Documents, 35 Queries
• CACM – 3204 Documents, 50 Queries
• CISI – 1460 Documents, 35 Queries
• MEDLARS (Salton) – 273 Documents, 18 Queries

2004.09.28 SLIDE 43IS 202 – FALL 2004

TREC

• Text REtrieval Conference/Competition
  – Run by NIST (National Institute of Standards & Technology) -- http://trec.nist.gov
  – 13th TREC in mid-November
• Collection: >6 gigabytes (5 CD-ROMs), >1.5 million docs
  – Newswire & full-text news (AP, WSJ, Ziff, FT)
  – Government documents (Federal Register, Congressional Record)
  – Radio transcripts (FBIS) in multiple languages
  – Web “subsets” (“Large Web” separate, with 18.5 million pages of Web data – 100 GB)
    • New GOV2 collection nearly 1 TB
  – Patents

2004.09.28 SLIDE 44IS 202 – FALL 2004

TREC (cont.)

• Queries + Relevance Judgments
  – Queries devised and judged by “Information Specialists”
  – Relevance judgments done only for those documents retrieved—not entire collection!
• Competition
  – Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
  – Results judged on precision and recall, going up to a recall level of 1000 documents
• Following slides are from TREC overviews by Ellen Voorhees of NIST

2004.09.28 SLIDE 45IS 202 – FALL 2004

2004.09.28 SLIDE 46IS 202 – FALL 2004

2004.09.28 SLIDE 47IS 202 – FALL 2004

2004.09.28 SLIDE 48IS 202 – FALL 2004

2004.09.28 SLIDE 49IS 202 – FALL 2004

2004.09.28 SLIDE 50IS 202 – FALL 2004

Sample TREC Query (Topic)

<num> Number: 168
<title> Topic: Financing AMTRAK

<desc> Description:
A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK)

<narr> Narrative:
A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

2004.09.28 SLIDE 51IS 202 – FALL 2004

2004.09.28 SLIDE 52IS 202 – FALL 2004

2004.09.28 SLIDE 53IS 202 – FALL 2004

2004.09.28 SLIDE 54IS 202 – FALL 2004

2004.09.28 SLIDE 55IS 202 – FALL 2004

2004.09.28 SLIDE 56IS 202 – FALL 2004

TREC

• Benefits:
  – Made research systems scale to large collections (at least pre-WWW “large”)
  – Allows for somewhat controlled comparisons
• Drawbacks:
  – Emphasis on high recall, which may be unrealistic for what many users want
  – Very long queries, also unrealistic
  – Comparisons still difficult to make, because systems are quite different on many dimensions
  – Focus on batch ranking rather than interaction
    • There is an interactive track, but not a lot is being learned, given the constraints of the TREC evaluation process

2004.09.28 SLIDE 57IS 202 – FALL 2004

TREC is Changing

• Emphasis on specialized “tracks”
  – Interactive track
  – Natural Language Processing (NLP) track
  – Multilingual tracks (Chinese, Spanish, Arabic)
  – Filtering track
  – High-Precision
  – High-Performance
  – Very-Large Scale (terabyte track)

• http://trec.nist.gov/

2004.09.28 SLIDE 58IS 202 – FALL 2004

Other Test Forums/Collections

• CLEF (Cross-Language Evaluation Forum)
  – Collections in English, French, German, Spanish, and Italian, with new languages being added (Russian, Finnish, etc.). Primarily European.

• NTCIR (NII-NACSIS Test Collection for IR Systems)
  – Primarily Japanese, Chinese, and Korean, with partial English

• INEX (Initiative for the Evaluation of XML Retrieval)
  – Main track uses about 525 MB of XML data from IEEE. Combines structure and content.

2004.09.28 SLIDE 59IS 202 – FALL 2004

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2004.09.28 SLIDE 60IS 202 – FALL 2004

Blair and Maron 1985

• A classic study of retrieval effectiveness
  – Earlier studies were on unrealistically small collections
• Studied an archive of documents for a lawsuit
  – ~350,000 pages of text
  – 40 queries
  – Focus on high recall
  – Used IBM’s STAIRS full-text system
• Main Result:
  – The system retrieved less than 20% of the relevant documents for a particular information need
  – Lawyers thought they had 75%
• But many queries had very high precision

2004.09.28 SLIDE 61IS 202 – FALL 2004

Blair and Maron (cont.)

• How they estimated recall
  – Generated partially random samples of unseen documents
  – Had users (unaware these were random) judge them for relevance

• Other results:
  – Two lawyers’ searches had similar performance
  – Lawyers’ recall was not much different from paralegals’
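The sampling logic behind the recall estimate (first bullet above) can be sketched in a few lines. Every count below is invented, and, as the slide notes, the study's samples were only partially random rather than uniform, so this is an idealized version:

```python
# Estimate recall by judging a random sample of the documents the system
# did NOT retrieve and extrapolating the number of missed relevant documents.
retrieved_relevant = 40        # relevant documents actually found
unretrieved_total = 300_000    # documents never retrieved for this query
sample_size = 500              # random sample of unretrieved documents judged
sample_relevant = 1            # of which this many turned out to be relevant

estimated_missed = sample_relevant / sample_size * unretrieved_total
estimated_recall = retrieved_relevant / (retrieved_relevant + estimated_missed)
print(f"estimated recall = {estimated_recall:.2f}")   # about 0.06 here
```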

2004.09.28 SLIDE 62IS 202 – FALL 2004

Blair and Maron (cont.)

• Why recall was low
  – Users can’t foresee exact words and phrases that will indicate relevant documents
    • “accident” referred to by those responsible as: “event,” “incident,” “situation,” “problem,” …
    • Differing technical terminology
    • Slang, misspellings
  – Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied

2004.09.28 SLIDE 63IS 202 – FALL 2004

Lecture Overview

• Review
  – Probabilistic IR

• Evaluation of IR systems
  – Precision vs. Recall
  – Cutoff Points and other measures
  – Test Collections/TREC
  – Blair & Maron Study
  – Discussion

Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

2004.09.28 SLIDE 64IS 202 – FALL 2004

An Evaluation of Retrieval Effectiveness (Blair & Maron)

• Questions from Shufei Lei

• Blair and Maron concluded that a full-text retrieval system such as IBM’s STAIRS was ineffective because Recall was very low (average 20%) when searching for documents in a large database of documents (about 40,000 documents). However, the lawyers who were asked to perform this test were quite satisfied with the results of their search. Think about how you search the web today. How do you evaluate the effectiveness of a full-text retrieval system (user satisfaction or Recall rate)?

• The design of the full-text retrieval system is based on the assumption that it is a simple matter for users to foresee the exact words and phrases that will be used in the documents they will find useful, and only in those documents. The authors pointed out some factors that invalidate this assumption: misspellings, using different terms to refer to the same event, synonyms, etc. What can we do to help overcome these problems?

2004.09.28 SLIDE 65IS 202 – FALL 2004

Rave Reviews (Belew)

• Questions from Scott Fisher

• What are the drawbacks of using an "expert" to evaluate documents in a collection for relevance?

• RAVEUnion follows the pooling procedure used by many evaluators. What is a weakness of this procedure? How do the RAVE researchers try to overcome this weakness?

2004.09.28 SLIDE 66IS 202 – FALL 2004

A Case for Interaction (Koenemann & Belkin)

• Questions from Lulu Guo

• It is reported that people thought that using the feedback component as a suggestion device made them "lazy" since the task of generating terms was replaced by selecting terms. Is there any potential problem with this "laziness"?

• In evaluating the effectiveness of the second search task, the authors reported median precision (M) instead of mean (X bar) precision. What's the difference between the two, and which do you think is more appropriate?

2004.09.28 SLIDE 67IS 202 – FALL 2004

Work Tasks and Socio-Cognitive Relevance (Hjorland & Christensen)

• Questions from Kelly Snow

• Schizophrenia research has a number of different theories (psychosocial, biochemical) leading to different courses of treatment. According to the reading, finding a 'focus' is crucial for the search process. When prevailing consensus has not been reached, how might a Google-like page-rank approach be a benefit? How might it pose problems?

• The article discusses relevance ranking by the user as a subjective measure. Relevance ranking can be a reflection of a user's uncertainty about an item's relevance. It can also reflect relevance to a specific situation at a certain time - A document might be relevant for discussion with a colleague but not for clinical treatment. Does this insight change the way you've been thinking about relevance as discussed in the course so far?

2004.09.28 SLIDE 68IS 202 – FALL 2004

Social Information Filtering (Shardanand & Maes)

• Questions from Judd Antin

• Would carelessly rating albums or artists 'break' Ringo? Why or why not? How would you break Ringo if you wanted to?

• Is the accuracy or precision of predicted target values a good measure of system performance? What good is a social filtering system if it never provides information which leads to new or different behavior? How do we measure performance in a practical sense?

• One important criticism of Social Information Filtering is that it does not situate information in its sociocultural context - that liking or disliking a piece of music is an evolving relationship between music and the listening environment. So, in this view, Social Information Filtering fails because a quantitative, statistical measure of preference is not enough to account for the reality of any individual user's preference. How might a system account for this failing? Would it be enough to include additional metadata such as 'Mood,' 'Genre,' 'First Impression,' etc.?

2004.09.28 SLIDE 69IS 202 – FALL 2004

Next Time

• WEDNESDAY 10:30-11:30 RM 205: A Gentle (re)introduction to Math notation and IR.

• Thursday: In-class workshop on IR Evaluation – Bring computer and/or calculator!

• Readings
  – MIR Chapter 10