Transcript of a presentation by Ellen Voorhees, NIST

Page 1: Ellen Voorhees NIST

Evaluating Answers to Definition Questions
in HLT-NAACL 2003
&
Overview of TREC 2003 Question Answering Track
in TREC 2003

Ellen Voorhees
NIST

Page 2

QA Tracks in NIST

Pilot evaluation in the ARDA AQUAINT program (fall 2002). The purpose of each pilot is to develop an effective evaluation methodology for systems that answer a certain kind of question.

The paper in HLT-NAACL 2003 is about the Definition Pilot.

Page 3

QA Tracks in NIST (Cont.)

TREC 2003 QA Track (August 2003)
- Passage task: systems returned a single text snippet in response to factoid questions.
- Main task: contains factoid, list, and definition questions. The final score is a combination of the scores for the separate question types.

Page 4

Definition Questions

Ask for the definition or explanation of a term, or an introduction to a person or an organization: "What is mold?" and "Who is Colin Powell?"

Answers are longer texts and vary widely, so it is not easy to evaluate system performance. Precision? Recall? Exactness?

Page 5

Example of Response of Def Q

"Who is Christopher Reeve?" System responses:
- Actor
- the actor who was paralyzed when he fell off his horse
- the name attraction
- stars on Sunday in ABC's remake of "Rear Window
- was injured in a show jumping accident and has become a spokesman for the cause

Page 6

First Round of Def Pilot

- 8 runs (ABCDEFGH); multiple answers allowed for each question in one run; no length limit
- Two assessors (the author of the questions and one other person)
- Two kinds of scores (0-10 pt.):
  - Content score: higher if more useful and less misleading information
  - Organization score: higher if useful information appears earlier
- Final score is a combination with more emphasis on the content score

Page 7

Result of 1st Round of Def Pilot

Ranking of runs:
  Author: F A D E B G C H
  Other:  F A E G D B H C

Scores varied across assessors (different interpretations of "organization score"), but the organization score was strongly correlated with the content score. Some relative ranking was shown.

Page 8

Second Round of Def Pilot

Goal: develop a more quantitative evaluation of system responses.

"Information nuggets": pieces of (atomic) information about the target of the question.

What assessors do:
1. Create a list of info nuggets
2. Decide which nuggets are vital (must appear in a good definition)
3. Mark which nuggets appear in a system response

Page 9

Example of Assessment

a) list of concepts (* = vital):
  1 * actor
  2 * accident
  3 * treatment/therapy
  4   spinal cord injury activist
  5   written an autobiography
  6   human embryo research activist

b) system response (matched concepts in brackets):
  [1]   Actor!
  [1,2] the actor who was paralyzed when he fell off his horse
  [ ]   the name attraction
  [1]   stars on Sunday in ABC's remake of "Rear Window
  [2,4] was injured in a show jumping accident and has become a spokesman for the cause

Concept recall is quite straightforward: the ratio of concepts retrieved.

Precision is hard to define. (It is hard to divide text into concepts; the denominator is unknown.)

Using only recall to evaluate systems is untenable. (Entire documents get full recall.)

Page 10

Approximation to Precision

Borrowed from DUC (Harman and Over, 2002): an allowance of 100 (non-space) characters for each nugget retrieved, with a penalty if the length of the response exceeds the allowance:

  Precision = 1 - (length - allowance)/length

In the previous example, allowance = 4*100 and length = 175, thus precision = 1.

Page 11

Final Score

Recall is computed only over vital nuggets (2/3 in the previous example). Precision is computed over all nuggets.

Let
  r   be the number of vital nuggets returned in a response;
  a   be the number of acceptable (non-vital but in the list) nuggets returned in a response;
  R   be the total number of vital nuggets in the assessor's list;
  len be the number of non-white-space characters in an answer string, summed over all answer strings in the response.

Then
  recall    = r / R
  allowance = 100 * (r + a)
  precision = 1                           if len < allowance
              1 - (len - allowance)/len   otherwise
  F         = 26 * precision * recall / (25 * precision + recall)   (i.e., F with β = 5)
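The scoring above can be sketched in Python. The function name and argument layout are my own, but the arithmetic follows the slide's definitions (the track used β = 5):

```python
def nugget_f(r, a, R, length, beta=5):
    """Nugget-based F-score for one definition response.

    r      -- number of vital nuggets returned
    a      -- number of acceptable (non-vital but listed) nuggets returned
    R      -- total vital nuggets in the assessor's list
    length -- non-whitespace characters in the response ("len" above)
    beta   -- recall weight; the track used beta = 5
    """
    recall = r / R
    allowance = 100 * (r + a)
    if length < allowance:
        precision = 1.0
    else:
        precision = 1.0 - (length - allowance) / length
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denom
```

For instance, a response returning 2 of 3 vital nuggets plus 1 acceptable nugget in 175 characters has allowance 300 > 175, so precision = 1 and F = 26·(2/3)/(25 + 2/3) ≈ 0.675.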

Page 12

Result of 2nd Round of Def Pilot: F-measure

  F = (β^2 + 1) * P * R / (β^2 * P + R)

Different β values result in different F-measure rankings; β = 5 approximates the ranking of the first round.

Run  F (author)   Run  F (other)   Run  Length   Note
F    0.688        F    0.757       F     935.6   more verbose
A    0.606        A    0.687       A    1121.2   more verbose
D    0.568        G    0.671       D     281.8
G    0.562        D    0.669       G     164.5   relatively terse
E    0.555        E    0.657       E     533.9
B    0.467        B    0.522       B    1236.5   complete sentence
C    0.349        C    0.384       C      84.7
H    0.330        H    0.365       H      33.7   single snippet

Rankings are stable!

Page 13

Def Task in TREC QA

50 questions:
- 30 for persons (e.g., Andrea Bocelli, Ben Hur)
- 10 for organizations (e.g., Friends of the Earth)
- 10 for other things (e.g., TB, feng shui)

Scenario: The questioner is an adult, a native speaker of English, and an "average" reader of US newspapers. In reading an article, the user has come across a term that they would like to find out more about. They may have some basic idea of what the term means either from the context of the article (for example, a bandicoot must be a type of animal) or basic background knowledge (Ulysses S. Grant was a US president). They are not experts in the domain of the target, and therefore are not seeking esoteric details (e.g., not a zoologist looking to distinguish the different species in genus Perameles).

Page 14

Result of Def Task, QA Track

Run Tag        Submitter                                  F(β=5)   Ave Length
BBN2003C       BBN                                        0.555    2059.20
nusmml03r2     National University of Singapore (Yang)    0.473    1478.74
isi03a         University of Southern California, ISI     0.461    1404.78
LCCmainS03     Language Computer Corp.                    0.442    1407.82
cuaqdef2003    Univ. of Colorado/Columbia Univ.           0.338    1685.60
irstqa2003d    ITC-irst                                   0.318     431.26
UAmsT03M1      University of Amsterdam                    0.315    2815.08
MITCSAIL03a    Massachusetts Institute of Technology      0.309     620.28
shef12simple   University of Sheffield                    0.236     338.42
UIowaQA0303    University of Iowa                         0.231    3039.26
CMUJAV2003     Carnegie Mellon University (Javelin)       0.216     182.34
FDUT12QA3      Fudan University                           0.192     203.54
piq002         University of Pisa                         0.185      89.52
IBM2003b       IBM Research (Prager)                      0.177     223.16
ntt2003qam1    NTT Communication Science Labs             0.169    2219.24

Page 15

Analysis of TREC QA Track

Fidelity: the extent to which the evaluation measures what it is intended to measure.
  In TREC: the extent to which the abstraction captures (some of) the issues of the real task.

Reliability: the extent to which an evaluation result can be trusted.
  In TREC: an evaluation ranks a better system ahead of a worse system.

Page 16

Definition Task Fidelity

It is unclear whether the average user strongly prefers recall (the evaluation assumes so, since β = 5), and it seems that longer responses receive higher scores.

To determine how selective a system is, compare against baselines:
- Baseline: return all sentences in the corpus containing the target.
- Smarter baseline (BBN): like the baseline, but the overlap between sentences is small.

Page 17

Definition Task Fidelity (Cont.)

Run Tag        F(β=1)   F(β=2)   F(β=3)   F(β=4)   F(β=5)
BBN2003C       0.310    0.423    0.493    0.532    0.555
SENT-BASE      0.205    0.315    0.400    0.456    0.493
nusmml03r2     0.261    0.360    0.421    0.454    0.473
isi03a         0.270    0.353    0.409    0.442    0.461
LCCmainS03     0.332    0.374    0.408    0.429    0.442
cuaqdef2003    0.187    0.256    0.299    0.324    0.338
irstqa2003d    0.310    0.310    0.314    0.316    0.318
UAmsT03M1      0.163    0.215    0.259    0.292    0.315
MITCSAIL03a    0.296    0.298    0.304    0.307    0.309
shef12simple   0.195    0.211    0.224    0.232    0.236
UIowaQA0303    0.156    0.188    0.210    0.223    0.231
CMUJAV2003     0.246    0.223    0.218    0.217    0.216
FDUT12QA3      0.214    0.196    0.193    0.192    0.192
piq002         0.234    0.198    0.189    0.186    0.185
IBM2003b       0.209    0.186    0.180    0.178    0.177
ntt2003qam1    0.145    0.151    0.159    0.165    0.169

No conclusion about the β value can be drawn; at least β = 5 matches the user need in the pilot.

Page 18

Definition Task Reliability

Noise or error:
- Human mistakes in judgment
- Different opinions from different assessors
- The question set

Evaluating the effect of different opinions:
- Two assessors create two different nugget sets.
- Runs are scored using the two nugget lists.
- The stability of the rankings is measured by Kendall's τ.
- The τ score is 0.848 (considered stable if τ > 0.9): not good enough.
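Kendall's τ over two system rankings can be computed directly from concordant and discordant pairs. A minimal sketch (no tie handling, so this is the simple τ-a variant; the helper name is my own):

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall's tau between two score dicts over the same set of runs.

    Counts run pairs ordered the same way (concordant) vs. swapped
    (discordant) by the two scorings; assumes no tied scores.
    """
    runs = sorted(scores_a)
    concordant = discordant = 0
    for x, y in combinations(runs, 2):
        s = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(runs) * (len(runs) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Applied to the two assessors' F-scores from the second-round pilot (where only runs D and G swap), this gives 26/28 ≈ 0.93; the 0.848 figure on this slide comes from rescoring with the two TREC nugget lists, whose per-run scores are not reproduced here.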

Page 19

Example of Different Nugget Lists: "What is a golden parachute?"

Assessor 1:
  1 vital  Agreement between companies and top executives
  2 vital  Provides remuneration to executives who lose jobs
  3 vital  Remuneration is usually very generous
  4        Encourages execs not to resist takeover beneficial to shareholders
  5        Incentive for execs to join companies
  6        Arrangement for which IRS can impose excise tax

Assessor 2:
  1 vital  provides remuneration to executives who lose jobs
  2 vital  assures officials of rich compensation if lose job due to takeover
  3 vital  contract agreement between companies and their top executives
  4        aids in hiring and retention
  5        encourages officials not to resist a merger
  6        IRS can impose taxes

Page 20

Definition Task Reliability (Cont.)

If two large question sets of the same size are used, the F-measure scores of a system should be similar.

Simulation of such an evaluation:
- Randomly create two question sets of the required size.
- Define the error rate as the percentage of rank swaps.
- Group by the difference in F(β=5).

Page 21

Page 22

Definition Task Reliability (Cont.)

Most errors (rank swaps) happen in the small-difference groups. A difference > 0.123 is required to have confidence in F(β=5). More questions are needed in the test set to increase the sensitivity while remaining equally confident in the result.

Page 23

List Task

List questions have multiple possible answers, e.g., "List the names of chewing gums." No target number is specified. The final answer list for a question is the collection of correct answers found in the corpus.

Scored with instance precision (IP) and instance recall (IR):
  F = 2*IP*IR/(IP+IR)
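A sketch of that score, using plain set membership in place of the assessors' correctness judgments (the run output and answer key below are hypothetical):

```python
def list_task_f(returned, answer_key):
    """Instance-precision/instance-recall F for a list question.

    returned   -- answer instances a system returned
    answer_key -- the final answer list (all correct answers in the corpus)
    """
    returned = set(returned)
    key = set(answer_key)
    correct = returned & key
    if not returned or not key or not correct:
        return 0.0
    ip = len(correct) / len(returned)  # instance precision
    ir = len(correct) / len(key)       # instance recall
    return 2 * ip * ir / (ip + ir)
```

E.g., returning {Trident, Orbit, Gumby} against a four-gum key containing Trident and Orbit gives IP = 2/3, IR = 1/2, F = 4/7.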

Page 24

Example of Final Answer List

1915: List the names of chewing gums.
  Stimorol     Orbit     Winterfresh   Double Bubble
  Dirol        Trident   Spearmint     Bazooka
  Doublemint   Dentyne   Freedent      Hubba Bubba
  Juicy Fruit  Big Red   Chiclets      Nicorette

Page 25

Other Tasks

Passage task: return a short (<250 character) span of text containing an answer; the text must be an extract of a document.
Factoid task: exact answers.

The passage task is evaluated separately. The final score of the main task is
  FinalScore = 1/2*FactoidScore + 1/4*ListScore + 1/4*DefScore
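The combination is a simple weighted average in which the factoid component carries half the weight:

```python
def main_task_score(factoid_score, list_score, def_score):
    """Final main-task score per the slide's weights: factoid counts double."""
    return 0.5 * factoid_score + 0.25 * list_score + 0.25 * def_score
```

So a system scoring 0.8 on factoid and 0.4 on both list and definition questions gets a final score of 0.6.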

Page 26