An Analysis of the AskMSR Question-Answering System
Eric Brill, Susan Dumais, and Michelle Banko
Microsoft Research
From Proceedings of the EMNLP Conference, 2002
Goals
Evaluate contributions of components
Explore strategies for predicting when answers are incorrect
AskMSR – What Sets It Apart
Dependency on data redundancy
No sophisticated linguistic analyses
– Of questions
– Of answers
TREC Question Answering Track
Fact-based, short-answer questions
– How many calories are there in a Big Mac?
– Who killed Abraham Lincoln?
– How tall is Mount Everest?
562, in case you're wondering
Motivation for much of recent work in QA
Other Approaches
POS tagging
Parsing
Named entity extraction
Semantic relations
Dictionaries
WordNet
AskMSR Approach
Web as a “gigantic data repository”
Different from other systems using the web
– Simplicity & efficiency
No complex parsing
No entity extraction, for queries or best-matching web pages
No local caching
Claim: techniques used in approach to short-answer tasks are more broadly applicable
Some QA Difficulties
Single, small information source
– Likely only one answer exists
Source with a small # of answer formulations
– Complex relations between Q & A
Lexical, syntactic, and semantic relations
Anaphora, synonymy, alternate syntactic formulations, and indirect answers make this difficult
Answer Redundancy
Greater answer redundancy in the source
– More likely: a simple relation between Q & A exists
– Less likely: need to deal with the difficulties facing NLP systems
System Architecture
Query Reformulation
Rewrite the question
– Substring of the declarative answer
– Weighted
– “when was the paper clip invented?” → “the paper clip was invented”
Produce less precise rewrites
– Greater chance of matching
– Backoff to simple ANDing of non-stop words
Query Reformulation (cont.)
String-based manipulations
No parser
No POS tagging
Small lexicon for possible POS and morphological variants
Rewrite rules created by hand (sketch below)
Associated weights chosen by hand
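To make the rewrite idea concrete, here is a minimal Python sketch of hand-written rewrite rules with weights and the AND backoff. The rules, weights, and stop-word list are invented for illustration; the actual AskMSR rules are not given on these slides.

import re

# Hypothetical rewrite rules: (pattern, template, weight).  A higher weight marks
# a more precise rewrite; the AND backoff below gets the lowest weight.
REWRITE_RULES = [
    (r"^when was (?P<x>.+) invented\?$", '"{x} was invented"', 5.0),
    (r"^who (?P<v>\w+) (?P<x>.+)\?$",    '"{x} was {v} by"',   4.0),
    (r"^how tall is (?P<x>.+)\?$",       '"{x} is"',           3.0),
]

STOP_WORDS = {"the", "a", "an", "of", "in", "is", "was", "who", "when", "how"}

def reformulate(question):
    """Return (query, weight) pairs: precise rewrites first, then the AND backoff."""
    q = question.lower().strip()
    queries = []
    for pattern, template, weight in REWRITE_RULES:
        m = re.match(pattern, q)
        if m:
            queries.append((template.format(**m.groupdict()), weight))
    # Backoff: simple ANDing of the non-stop words.
    content = [w for w in re.findall(r"\w+", q) if w not in STOP_WORDS]
    queries.append((" AND ".join(content), 1.0))
    return queries

print(reformulate("When was the paper clip invented?"))
# [('"the paper clip was invented"', 5.0), ('paper AND clip AND invented', 1.0)]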
N-gram Mining
Formulate rewrites as search-engine queries
Collect and analyze page summaries
Why use summaries?
– Efficiency
– Contain search terms, plus some context
N-grams collected from summaries
N-gram Mining (cont.)
Extract 1-, 2-, 3-grams from each summary (sketch below)
– Score by the weight of the rewrite that retrieved it
Sum scores across all summaries containing the n-gram
– Frequency within a single summary is not counted
Final score for an n-gram reflects
– Weights associated with rewrite rules
– # of unique summaries it appears in
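A rough sketch of the mining step as described above (Python). The summary/weight pairs are placeholders; this is not the authors' implementation.

from collections import defaultdict

def mine_ngrams(summaries):
    """summaries: (summary_text, weight_of_rewrite_that_retrieved_it) pairs.
    An n-gram's score sums rewrite weights over the unique summaries it occurs in;
    repeated occurrences inside a single summary are not counted again."""
    scores = defaultdict(float)
    for text, rewrite_weight in summaries:
        tokens = text.split()
        seen = set()
        for n in (1, 2, 3):
            for i in range(len(tokens) - n + 1):
                seen.add(tuple(tokens[i:i + n]))
        for ngram in seen:            # one contribution per summary
            scores[ngram] += rewrite_weight
    return scores

scores = mine_ngrams([("the paper clip was invented in 1899", 5.0),
                      ("paper clip invented 1899 patent", 1.0)])
print(scores[("invented",)])   # 6.0 -> appears in both summaries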
N-gram Filtering
Use handwritten filter rules
Question type assignment
– e.g. who, what, how
Choose a set of filters based on question type
Rescore n-grams based on the presence of features relevant to those filters
N-gram Filtering (cont.)
15 simple filters (sketch below)
– Based on human knowledge of
Question types
Answer domain
– Surface string features
Capitalization
Digits
Handcrafted regular-expression patterns
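A toy illustration of question-type filters of this flavor (Python). The filter set and the 2x boost are invented; the paper's 15 hand-built filters are not reproduced here.

import re

# Hypothetical surface-string filters keyed by question type; each returns True
# when a candidate n-gram has the features that question type prefers.
FILTERS = {
    "how many": lambda s: bool(re.search(r"\d", s)),        # prefer digits
    "who":      lambda s: s[:1].isupper(),                  # prefer capitalized words
    "when":     lambda s: bool(re.search(r"\b\d{4}\b", s)), # prefer a year
}

def question_type(question):
    q = question.lower()
    for qtype in FILTERS:
        if q.startswith(qtype):
            return qtype
    return "other"

def rescore(question, ngram_scores):
    """Boost n-grams whose surface features match the question type's filter."""
    check = FILTERS.get(question_type(question))
    if check is None:
        return dict(ngram_scores)
    return {ng: s * (2.0 if check(ng) else 1.0) for ng, s in ngram_scores.items()}

print(rescore("How many dogs pull a sled in the Iditarod?",
              {"pool of 16 dogs": 3.0, "Alaskan": 4.0}))
# {'pool of 16 dogs': 6.0, 'Alaskan': 4.0}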
N-gram Tiling
Merge similar answers
Create longer answers from overlapping smaller answer fragments
– “A B C” + “B C D” → “A B C D”
Greedy algorithm (sketch below)
– Start with the top-scoring n-gram, check lower-scoring n-grams for tiling potential
– If they can be tiled, replace the higher-scoring n-gram with the tiled n-gram and remove the lower-scoring one
– Stop when nothing more can be tiled
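A simplified greedy tiling sketch along these lines (Python). The overlap test and the way scores are combined are illustrative guesses, not the authors' exact procedure.

def tile_pair(a, b):
    """If b overlaps the tail of a (or vice versa), return the merged token list."""
    for x, y in ((a, b), (b, a)):
        for k in range(min(len(x), len(y)), 0, -1):
            if x[-k:] == y[:k]:
                return x + y[k:]
    return None

def tile_answers(candidates):
    """Greedily merge overlapping (tokens, score) candidates, highest score first."""
    answers = sorted(candidates, key=lambda c: c[1], reverse=True)
    i = 0
    while i < len(answers):
        merged = False
        for j in range(i + 1, len(answers)):
            tiled = tile_pair(answers[i][0], answers[j][0])
            if tiled is not None:
                # Replace the higher-scoring n-gram with the tiled one and
                # drop the lower-scoring one, then rescan.
                answers[i] = (tiled, answers[i][1] + answers[j][1])
                del answers[j]
                merged = True
                break
        if not merged:
            i += 1
    return answers

# "A B C" + "B C D" -> "A B C D"
print(tile_answers([(["A", "B", "C"], 2.0), (["B", "C", "D"], 1.0)]))
# [(['A', 'B', 'C', 'D'], 3.0)]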
Experiments
First 500 TREC-9 queries
Use scoring patterns provided by NIST (sketch after this list)
– Modified some patterns to accommodate web answers not in TREC
– More specific answers allowed
Edward J. Smith vs. Edward Smith
– More general answers not allowed
Smith vs. Edward Smith
– Simple substitutions allowed
9 months vs. nine months
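A sketch of pattern-based answer scoring in the spirit of the modifications above (Python). The patterns are made-up examples, not the actual NIST/TREC-9 answer key.

import re

# Hypothetical answer patterns: "Edward J. Smith" (more specific) matches,
# bare "Smith" (more general) does not, and "nine" is explicitly allowed
# as a substitution for "9".
PATTERNS = [r"\bEdward(\s+\w\.)?\s+Smith\b", r"\b(9|nine)\s+months\b"]

def matches_answer_key(answer, patterns=PATTERNS):
    return any(re.search(p, answer, flags=re.IGNORECASE) for p in patterns)

print(matches_answer_key("Edward J. Smith"))   # True
print(matches_answer_key("Smith"))             # False
print(matches_answer_key("nine months"))       # True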
Experiments (cont.)
Time differences between the Web & TREC
– “Who is the president of Bolivia?”
– Did NOT modify the answer key
– Would make comparison with earlier TREC results impossible (instead of difficult?)
Changes influence absolute scores, not relative performance
Experiments (cont.)
Automatic runs
– Start with the queries
– Generate a ranked list of 5 answers
Use Google as the search engine
– Query-relevant summaries for n-gram mining efficiency
Answers are at most 50 bytes long
– Typically shorter
“Basic” System Performance
Backwards notion of “basic”
– Current system, all modules implemented
– Default settings
Mean Reciprocal Rank (MRR): 0.507 (computation sketched below)
61% of questions answered correctly
Average answer length: 12 bytes
Impossible to compare precisely with TREC-9 groups, but still very good performance
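For reference, MRR as it is standardly computed over a question set; a minimal sketch with invented per-question ranks.

def mean_reciprocal_rank(first_correct_ranks):
    """first_correct_ranks[i] is the rank (1-5) of the first correct answer for
    question i, or None when no correct answer appears in the top 5."""
    return sum(1.0 / r for r in first_correct_ranks if r) / len(first_correct_ranks)

# Invented ranks for four questions: (1/1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458
print(mean_reciprocal_rank([1, 3, None, 2]))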
Component Contributions
Query Rewrite Contribution
More precise queries get higher weights
All rewrites weighted equally: MRR drops 3.6%
Only the backoff AND query: MRR drops 11.2%
Rewrites capitalize on web redundancy
Could use more specific regular-expression matching
N-gram Filtering Contribution
1-, 2-, 3-grams from the 100 best-matching summaries
Filter by question type
– “How many dogs pull a sled in the Iditarod?”
– Question type prefers a number
– “Run”, “Alaskan”, “dog racing”, “many mush” ranked lower than “pool of 16 dogs” (correct answer)
No filtering – MRR drops 17.9%
N-gram Tiling Contribution
Benefits of tiling
– Substrings take up only one answer slot
e.g. “San”, “Francisco”, “San Francisco”
– Longer answers can never be found with only tri-grams
e.g. “light amplification by [stimulated] emission of radiation”
No tiling – MRR drops 14.2%
Component Combinations
Only a weighted sum of occurrences of 1-, 2-, 3-grams: MRR drops 47.5%
Simple statistical system
– No linguistic knowledge or processing
– Only AND queries
– No filtering, but (statistical) tiling kept
– MRR drops 33%, to 0.338
Component Combinations
Statistical system: good performance?
– Reasonable on an absolute scale?
– One TREC-9 50-byte run performed better
All components contribute to accuracy
– Precise weights of rewrites are unimportant
– N-gram tiling: a “poor man's named-entity recognizer”
– Biggest contribution from filters/selection
Component Combinations
Claim: “Because of the effectiveness of our tiling algorithm…we do not need to use any named entity recognition components.”
– By having filters with capitalization info (section 2.3, 2nd paragraph), aren't they doing some NE recognition?
Component Problems
Component Problems (cont.)
No correct answer in the top 5 hypotheses
23% of errors: not knowing units
– “How fast can Bill's Corvette go?” (mph or km/h?)
34% (Time, Correct): time problems, or answer not in the TREC-9 answer key
16% from shortcomings in n-gram tiling
Number retrieval (5%): a query limitation
Component Problems (cont.)
12%: beyond the current system paradigm
– Can't be fixed with minor enhancements
– Is this really so? Or have they been easy on themselves in error attribution?
9%: no discussion
Knowing When…
There is some cost for answering incorrectly
The system can choose not to answer instead of giving an incorrect answer
– How likely is it that a hypothesis is correct?
TREC: no distinction between a wrong answer and no answer
Deploying a real system: a trade-off between precision & recall
Knowing When… (cont.)
The answer score is an ad hoc combination of hand-tuned weights
Is it possible to induce a useful precision-recall (ROC) curve when answers don't have meaningful probabilities?
What is an ROC (Receiver Operating Characteristic) curve?
ROC
From http://www-csli.stanford.edu/~schuetze/roc.html (Hinrich Schütze, co-author of Foundations of Statistical Natural Language Processing)
ROC (cont.)
Determining Likelihood
Ideal: determine the likelihood of a correct answer based only on the question
If possible, such questions can be skipped
Use a decision tree based on a set of features from the question string
– 1-, 2-grams, question type
– Sentence length, longest word length
– # capitalized words, # stop words
– Ratio of stop words to non-stop words
Decision Tree/Diagnostic Tool
Performs worst on how questions
Performs best on short who questions with many stop words
Induce an ROC curve from the decision tree (sketch below)
– Sort leaf nodes from highest probability of being correct to lowest
– Gain precision by not answering questions with the highest probability of error
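A minimal sketch of inducing such a precision/answered trade-off curve from the tree's leaf statistics (Python). The leaf probabilities and counts are invented placeholders.

# Each leaf: (probability the answer is correct at this leaf, # of questions routed to it).
leaves = [(0.85, 40), (0.60, 120), (0.35, 200), (0.10, 140)]

def precision_vs_answered(leaves):
    """Answer leaves in order of decreasing correctness probability;
    report (fraction of questions answered, precision on those answered)."""
    total = sum(n for _, n in leaves)
    answered = correct = 0.0
    points = []
    for p, n in sorted(leaves, key=lambda leaf: leaf[0], reverse=True):
        answered += n
        correct += p * n
        points.append((answered / total, correct / answered))
    return points

for frac, prec in precision_vs_answered(leaves):
    print(f"answer {frac:.0%} of questions -> precision {prec:.2f}")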
Decision Tree–Query
Decision Tree–Query Results
Decision tree trained on TREC-9
Tested on TREC-10
Overfits the training data: insufficient generalization
Decision Tree–Query Training
Decision Tree–Query Test
Answer Correctness/Score
Ad hoc score based on
– # of retrieved passages the n-gram occurs in
– weight of the rewrite used to retrieve the passage
– what filters apply
– effects of n-gram tiling
Correlation between whether a correct answer appears in the top-5 output and…
Correct Answer In Top 5
…and the score of the system's first-ranked answer (correlation sketch below)
– Correlation coefficient: 0.363
– Without time-sensitive questions: 0.401
…and the score of the first-ranked answer minus the second
– Correlation coefficient: 0.270
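A sketch of how such a correlation can be computed, treating top-5 correctness as a 0/1 variable against the answer score (Python). The scores and labels below are invented, not the paper's data.

from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; with 0/1 ys this is the point-biserial correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented example: first-answer scores vs. whether a correct answer was in the top 5.
scores  = [12.0, 3.5, 8.0, 1.2, 9.5, 0.8]
in_top5 = [1.0,  0.0, 1.0, 0.0, 1.0, 0.0]
print(round(pearson(scores, in_top5), 3))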
Answer #1 Score – Train
Answer #1 Score – Test
Other Likelihood Indicators
Snippets gathered for each question
– AND queries
– More refined exact string match rewrites
MRR and snippets
– All snippets from AND: 0.238
– 11 to 100 from non-AND: 0.612
– 100 to 400 from non-AND: 0.628
But wasn’t MRR for “base” system 0.507?
Another Decision Tree
Features of the first decision tree, plus
– Score of the #1 answer
– State of the system during processing
Total # of matching passages
# of non-AND matching passages
Filters applied
Weight of the best rewrite rule yielding matching passages
Others
Decision Tree–All Features
Decision Tree–All Train
Decision Tree–All Test
Decision Tree–All
Gives a useful ROC curve on test data
Outperformed by the Answer #1 Score
Though outperformed by the simpler ad hoc technique, it is still useful as a diagnostic tool
Conclusions
Novel approach to QA
Careful analysis of the contributions of major system components
Analysis of factors behind errors
Approach for learning when the system is likely to answer incorrectly
– Allowing system designers to decide when to trade recall for precision
My Conclusions
Claim: techniques used in approach to short-answer tasks are more broadly applicable
Reality: “We are currently exploring whether these techniques can be extended beyond short answer QA to more complex cases of information access.”
My Conclusions (cont.)
“…we do not need to use any named entity recognition components.”
– Filters with capitalization info = NE recognition
12% of errors beyond the system paradigm
– Still wonder: is this really so?
9% of errors: no discussion
Ad hoc method outperforms the decision tree
– Did they merely do a good job of designing the system, of assigning weights, etc.?
– Did they get lucky?