Page 1:

Finding Similar Questions in large Question and Answer Archives

Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee
ACM CIKM '05

Presented by Mat Kelly
CS895 – Web-based Information Retrieval

Old Dominion University
December 13, 2011

Question Answering from Frequently Asked Question Files

Robin D. Burke, Kristian J. Hammond, Vladimir Kulyukin, Steven L. Lytinen, Noriko Tomuro and Scott Schoenberg

AI Magazine, Summer 1997

Page 2:

What is FAQ Finder?

• Matches answers to questions already asked in a site’s FAQ file

• 4 assumptions:
1. Information is in QA format
2. All information needed to determine the relevance of a QA pair can be found in the QA pair itself
3. The Q half of the QA pair is most relevant for matching the user's question
4. Broad, shallow knowledge of language is sufficient for question matching

Page 3:

How Does It Work?

• Uses the SMART IR system to narrow the search to relevant FAQ files

• Iterates through the QA pairs in the FAQ file, comparing each against the user's question and computing a score m from 3 metrics:
– Statistical term-vector similarity score t
– Semantic similarity score s
– Coverage score c

m = T·t + S·s + C·c

T, S and C are constant weights that adjust the system's reliance on each metric.
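The weighted combination above can be sketched as a one-line function. The weight values below are illustrative; the slide only states that T, S and C are tunable constants:

```python
def combined_score(t, s, c, T=1.0, S=1.0, C=1.0):
    """Weighted combination of the term-vector score (t), semantic
    score (s), and coverage score (c); T, S, C adjust how much the
    system relies on each metric."""
    return T * t + S * s + C * c

# Equal weights simply sum the three metric scores.
print(combined_score(0.6, 0.8, 0.5))  # 1.9
```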

Page 4:

Calculating Similarity

• Each QA pair is represented as a term vector with a significance value for each term in the pair
• Significance value = tfidf
– n (term frequency) = # of times the term appears in the QA pair
– m = # of QA pairs in the file in which the term appears
– M = total # of QA pairs in the file
– tfidf = n × log(M/m)
• Evaluates the relative rarity of a term within documents, used as a factor to weight the frequency of the term in a document
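The significance value from the slide can be sketched directly; M as the total number of QA pairs in the file is the standard idf reading of the formula:

```python
import math

def tfidf(n, m, M):
    """Significance value: n = term frequency in the QA pair,
    m = # of QA pairs in the file containing the term,
    M = total # of QA pairs in the file."""
    return n * math.log(M / m)

# A term confined to 2 of 100 pairs is weighted more heavily than
# one spread across half the pairs, at equal term frequency.
print(tfidf(3, 2, 100) > tfidf(3, 50, 100))  # True
```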

Page 5:

Nuances

• There are many ways to express the same question
– Synonymous terms are often both used in large documents, so such variation has little effect there
• However, FAQ Finder matches on a small # of terms, so the system needs a means of matching synonyms
– "How do I reboot my system?"
– "What do I do when my computer crashes?"
– Causal relationship resolved with WordNet

Page 6:

WordNet

• Semantic network of English words
• Provides relations between words and synonym sets, and between synonym sets themselves
• FAQ Finder uses it through a marker-passing algorithm
– Compares each word in the user's question to each word in the FAQ file question
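A minimal sketch of the marker-passing idea: search outward from one word through a semantic network until the other word is reached. The toy graph below stands in for WordNet; its words and links are illustrative, not WordNet data:

```python
from collections import deque

# Toy synonym graph standing in for WordNet's relation networks.
GRAPH = {
    "reboot": {"restart"},
    "restart": {"reboot", "start"},
    "crash": {"fail"},
    "fail": {"crash", "stop"},
    "system": {"computer"},
    "computer": {"system", "machine"},
}

def semantic_distance(a, b, max_depth=3):
    """Breadth-first 'marker passing': return the shortest link
    distance between two words, or None if no path exists within
    max_depth links."""
    if a == b:
        return 0
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        word, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for nxt in GRAPH.get(word, ()):
            if nxt == b:
                return depth + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None

print(semantic_distance("system", "computer"))  # 1
```

A shorter path between two words indicates a stronger semantic connection, which feeds the semantic similarity score s.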

Page 7:

WordNet (cont…)

• Not a single semantic network; different sub-networks exist for nouns, verbs, etc.
• Syntactically ambiguous words (e.g. "run") appear in more than one network
• Simply relying on the default word sense worked as well as any more sophisticated technique

Page 8:

Evaluating Performance

• Corpus from the log file of the system's use, May–Dec 1996
• 241 questions used
• Manually scanned; found answers for 138 questions and left 103 questions unanswered
• Assumes there is a single correct QA pair
• Because this task differs from the conventional IR problem, recall and precision must be redefined

Page 9:

Why Redefine Recall & Precision?

• RECALL – typically measures the % of relevant docs that are retrieved for a query
• PRECISION – typically measures the % of retrieved docs that are relevant
• Since there is only one correct doc, these are not independent
• e.g. if a query returns 5 QA pairs:
– FAQ Finder achieves either 100% recall and 20% precision, OR 0% recall and 0% precision
– If no answer exists, precision = 0% and recall is undefined

Page 10:

Redefining Recall & Precision

• Recall_new = % of questions for which FAQ Finder returns the correct answer when one exists
– Does not penalize the system if there is >1 correct answer (unlike the original definition)
• Instead of precision, calculate rejection
• Rejection = % of questions FAQ Finder correctly reports as being unanswered
– Adjusted by setting a cutoff point for the minimum allowable match score
• There is still a tradeoff between rejection and recall
– Rejection threshold too high: some correct answers are eliminated
– Rejection threshold too low: incorrect answers are given to the user when no answer exists
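The redefined metrics can be sketched as follows; the dictionary keys are illustrative labels, not the paper's notation:

```python
def recall_and_rejection(questions):
    """questions: list of dicts with keys 'has_answer' (an answer
    exists in the FAQ file), 'returned_correct' (the system returned
    it), and 'rejected' (the system reported the question as
    unanswered)."""
    answerable = [q for q in questions if q["has_answer"]]
    unanswerable = [q for q in questions if not q["has_answer"]]
    # Recall_new: correct answer returned when one exists.
    recall = sum(q["returned_correct"] for q in answerable) / len(answerable)
    # Rejection: unanswerable questions correctly rejected.
    rejection = sum(q["rejected"] for q in unanswerable) / len(unanswerable)
    return recall, rejection

qs = [
    {"has_answer": True, "returned_correct": True, "rejected": False},
    {"has_answer": True, "returned_correct": False, "rejected": False},
    {"has_answer": False, "returned_correct": False, "rejected": True},
    {"has_answer": False, "returned_correct": False, "rejected": False},
]
print(recall_and_rejection(qs))  # (0.5, 0.5)
```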

Page 11:

Results

• The correct file appears 88% of the time within the top 5 files returned, and 48% of the time in first position
– Equates to 88% recall, 23% precision
• The system confidently returns garbage when there is no correct answer in the file

Page 12:

Ablation Study

• Evaluation of the different components in the matching scheme by disabling them:
1. QA pairs selected randomly from the FAQ file
2. Coverage score used by itself
3. Semantic scores from WordNet used alone
4. Term-vector comparison used in isolation

Page 13:

Conditions’ Contributions

• WordNet and the statistical technique both contribute strongly
• Their combination yields results better than either individually

Page 14:

Where FAQ Finder Fails

• The biggest culprit when an answer is not found is undue weight given to semantically useless words
– "Where can I find woodworking plans for a futon?"
– "woodworking" is weighted as strongly as "futon"
– "futon" should be much more important inside the woodworking FAQ than "woodworking", which applies to everything there
• Other problem: violation of the assumptions about FAQ files

Page 15:

Conclusion

• When there is an existing collection of Qs & As, question answering can be reduced to matching new questions against existing QA pairs
• The power of the approach comes from FAQ Finder's use of highly organized knowledge sources that are designed to answer commonly asked Qs

Page 16:

Citing Paper’s Objectives

• Find questions in the archive semantically similar to the user's question
• Resolve two problems:
– Two questions with the same meaning may use very different wording
– Similarity measures developed for document retrieval work poorly when there is little word overlap

Page 17:

Approaches Toward The Word Mismatch Problem

1. Use knowledge databases as machine-readable dictionaries (the requirement from the first paper)
– Current quality and structure are insufficient
2. Employ manual rules and templates
– Expensive and hard to scale to large collections
3. Use statistical techniques from IR and natural language processing
– Most promising, given enough training data

Page 18:

Problems with the Statistical Approach

• Need: a large # of semantically similar but lexically different sentence or question pairs
– No such collection exists on a large scale
• Researchers artificially generate collections through methods like translation and subsequent reverse translation
• The paper proposes an automatic way to build collections of semantically similar questions from existing Q&A archives

Page 19:

Question & Answer Archives

• Naver – leading portal site in S. Korea. Example QA pair:

Question Title: How to make multi-booting systems?
Question Body: I am using Windows 98. I'd like to multi-boot with Windows XP. How can I do this?
Answer: You must partition your hard disk, then install Windows 98 first. If there is no problem with Windows 98, then install Windows XP on…

• Avg length of question title = 5.8 words
• Avg question body = 49 words
• Avg answer = 179 words
• Made 2 test collections from the archive:
– A: 6.8M QA pairs across all categories
– B: 68k QA pairs from the "Computer Novice" category

Page 20:

• Need: sets of topics with relevance judgments
– 2 sets of 50 QA pairs randomly selected
• First set from Collection A, chosen across all categories
• Second set from Collection B, chosen from the "Computer Novice" category
• Each pair converted to a topic:
– QTITLE → short query
– QBODY → long query
– Answer → supplemental query, used only in the relevance judgment procedure

Page 21:

Find Relevant QA Pairs

• Given a topic, employ the TREC pooling technique
• 18 different retrieval results generated by varying the retrieval algorithm, query type, and search field
• Retrieval models such as Okapi BM25, query likelihood, and overlap coefficient used
• Pooled the top 20 QA pairs from each run, then did manual relevance judgments
– As long as it is semantically identical or very similar to the query, a QA pair is considered relevant
– If no relevant QA pairs are found for a given topic, manually browse the collection to find ≥1 QA pair
• Result: 785 relevant QA pairs for A, 1,557 for B

Page 22:

Verifying Field Importance

• Prev. research: similarity between questions is more important than similarity between Qs & As in the FAQ retrieval task
• Exp. 1: Search only the QTitle field
• Exp. 2: Only QBody
• Exp. 3: Only Answer
• For all exps, use the query likelihood model with Dirichlet smoothing, and Okapi BM25

Regardless of retrieval model, the best performance comes from searching the question title field. The performance gaps relative to the other fields are significant.

Page 23:

Collecting Semantically Similar Questions

• Many people don't search to see if a Q has already been asked; instead they ask a semantically similar Q
• Assumption: if two answers are similar, then the corresponding Qs are semantically similar but lexically different

Sample semantically similar questions with little word overlap:
– "I'd like to insert music into Powerpoint." / "How can I link sounds in Powerpoint?"
– "How can I shut down my system in Dos-mode." / "How to turn off computers in Dos-mode"
– "Photo transfer from cell phones to computers." / "How to move photos taken by cell phones."

Page 24:

Algorithm

• Consider 4 popular document similarity measures:
1. Cosine similarity with the vector space model
2. Negative KL divergence between language models
3. Output score of the query likelihood model
4. Score of the Okapi model

Page 25:

Finding a Similarity Measure: The Cosine Similarity Model

• Lengths of answers vary considerably
– Some very short (factoids)
– Others very long (copied and pasted from the web)
• Any similarity measure affected by length is not appropriate
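To illustrate the length issue, the sketch below (toy vocabulary, illustrative only) compares raw term overlap, which grows with answer length, against cosine similarity, which normalizes it away:

```python
import math
from collections import Counter

def dot(a, b):
    """Raw term-overlap score between two word-count vectors."""
    return sum(a[w] * b[w] for w in a if w in b)

def cosine(a, b):
    """Length-normalized similarity between two word-count vectors."""
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot(a, b) / (na * nb)

short = Counter("install windows xp".split())
# The same content repeated, as in long copy-and-paste answers.
long_ = Counter(("install windows xp " * 10).split())

print(dot(short, long_) > dot(short, short))   # True: raw overlap rewards length
print(cosine(short, long_) == cosine(short, short))  # True: cosine does not
```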

Page 26:

Finding a Similarity Measure: Negative KL Divergence & Okapi

• Values are not symmetric and are not probabilities
– A pair of answers with a higher negative KL divergence than another pair does not necessarily have a stronger semantic connection
• Hard to rank pairs
• The Okapi model has similar problems

Page 27:

Finding a Similarity Measure: Query Likelihood Model

• Score is a probability
• Can be compared across different answer pairs
• Scores are NOT symmetric

Page 28:

Overcoming Problems

• Using ranks instead of scores was more effective
– If answer A retrieves answer B at rank r1, and answer B retrieves answer A at rank r2, then the similarity between the two answers is the reverse harmonic mean of the two ranks:

sim(A, B) = 1/2 · (1/r1 + 1/r2)

– Use the query likelihood model to calculate the initial ranks
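The reverse harmonic mean of the two ranks is a one-liner; the example values are illustrative:

```python
def sim(r1, r2):
    """Reverse harmonic mean of the two retrieval ranks: answer A
    retrieves B at rank r1, and B retrieves A at rank r2."""
    return 0.5 * (1.0 / r1 + 1.0 / r2)

print(sim(1, 1))    # 1.0: mutual top-ranked answers, strongest link
print(sim(1, 200))  # 0.5025: a one-sided match scores much lower
```

Because both ranks appear as reciprocals, a pair only scores highly when each answer retrieves the other near the top, which makes the measure symmetric and comparable across pairs.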

Page 29:

Experiments & Results

• 68,000 × 67,999 / 2 possible answer pairs from the 68,000 QA pairs in Collection B
• All ranked using the established measure
• Empirically set threshold = 0.005
– Judges whether a pair is related or not
– Higher threshold = smaller but better-quality collections
– To acquire enough training samples, the threshold cannot be too high
• 331,965 pairs have scores above the threshold

Page 30:

Word Translation Probabilities

• The question pair collection is treated as a parallel corpus
• IBM Model 1
– Does not require any linguistic knowledge of the source/target language; treats every word alignment equally
– Probability of translating source word s to target word t:

P(t|s) = λs · Σ_{i=1..N} c(t|s; Ji)

– λs = normalization factor, so the probabilities sum to 1
– N = # of training samples
– Ji = the ith pair in the training set

Page 31:

Word Translation Probabilities (cont)

• {s1, …, sn} = words in the source sentence of Ji
• #(t, Ji) = number of times t occurs in Ji
• Still need: the old translation probabilities

c(t|s; Ji) = [ P(t|s) / (P(t|s1) + … + P(t|sn)) ] · #(t, Ji) · #(s, Ji)

• Initialize the translation probabilities with random values, then estimate new ones
– Repeat until the probabilities converge
– The procedure always converges to the same final solution [1]

[1] P. F. Brown, V. J. D. Pietra, S. A. D. Pietra and R. L. Mercer. The mathematics of statistical machine translation: parameter estimation. Comput. Linguist., 19(2):263–311, 1993.
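The estimation loop on these two slides can be sketched as a minimal IBM Model 1 EM procedure. This is a simplified illustration of the count/re-estimate cycle, not the paper's exact implementation:

```python
import random
from collections import defaultdict

def train_ibm1(pairs, iters=10, seed=0):
    """Train word translation probabilities P(t|s) over (source,
    target) sentence pairs (each a list of words): random
    initialization, then repeated expected-count collection and
    re-estimation until (approximate) convergence."""
    rng = random.Random(seed)
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    # Random initial P(t|s), normalized per source word.
    p = {}
    for s in {s for src, _ in pairs for s in src}:
        weights = {t: rng.random() for t in tgt_vocab}
        z = sum(weights.values())
        p[s] = {t: w / z for t, w in weights.items()}
    for _ in range(iters):
        counts = defaultdict(lambda: defaultdict(float))
        totals = defaultdict(float)
        # E-step: expected alignment counts c(t|s; Ji), weighting
        # each (s, t) by P(t|s) relative to all source words in Ji.
        for src, tgt in pairs:
            for t in tgt:
                denom = sum(p[s][t] for s in src)
                for s in src:
                    c = p[s][t] / denom
                    counts[s][t] += c
                    totals[s] += c
        # M-step: re-estimate P(t|s) from the expected counts.
        p = {s: {t: counts[s][t] / totals[s] for t in counts[s]}
             for s in counts}
    return p

probs = train_ibm1([(["reboot", "pc"], ["restart", "computer"]),
                    (["reboot", "windows"], ["restart", "system"])],
                   iters=30)
# "reboot" consistently co-occurs with "restart", so it receives the
# highest translation probability among "reboot"'s candidates.
```

Because the Model 1 likelihood is concave, the loop converges toward the same solution regardless of the random start, matching the slide's claim.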

Page 32:

Experiments & Results(Word Translation)

• Removed stop words
• The collection of 331,965 Q pairs was duplicated by switching the source and target parts, then used as input
• Usually, the most similar word to a given word is the word itself
• Semantic relationships found: e.g. "bmp" was found to be similar to "jpg" and "gif"

Page 33:

Question Retrieval

• How to get from Q titles and word translation probabilities to retrieval?
• Similarity between query and document:

sim(Q, D) = P(Q|D) = Π_{w ∈ Q} P(w|D)

• To avoid zero probabilities and estimate more accurate language models, smooth with the collection:

P(w|D) = (1 − λ) Pml(w|D) + λ Pml(w|C)

– term w is generated from document D or collection C
• In the translation model, this becomes:

P(w|D) = (1 − λ) Σ_{t ∈ D} T(w|t) Pml(t|D) + λ Pml(w|C)
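The translation-based language model above can be sketched as follows. The word counts, translation table, and λ value are illustrative; a real system would estimate T from the trained pair collection:

```python
def p_w_given_d(w, doc, coll, T, lam=0.5):
    """P(w|D) = (1 - lam) * sum over t in D of T(w|t) * Pml(t|D)
                + lam * Pml(w|C).
    doc and coll are word -> count dicts; T maps (w, t) pairs to
    translation probabilities."""
    dlen = sum(doc.values())
    clen = sum(coll.values())
    trans = sum(T.get((w, t), 0.0) * (n / dlen) for t, n in doc.items())
    return (1 - lam) * trans + lam * coll.get(w, 0) / clen

def sim(query, doc, coll, T, lam=0.5):
    """sim(Q, D) = P(Q|D) = product of P(w|D) over the query words."""
    score = 1.0
    for w in query:
        score *= p_w_given_d(w, doc, coll, T, lam)
    return score

# A question title containing "restart" can still match the query
# word "reboot" through the translation table.
doc = {"restart": 1, "computer": 1}
coll = {"restart": 2, "computer": 2, "reboot": 1}
T = {("reboot", "restart"): 0.5}
print(p_w_given_d("reboot", doc, coll, T)
      > p_w_given_d("reboot", doc, coll, {}))  # True
```

This is exactly how the model bridges lexical mismatch: the translation sum contributes probability mass for query words the document never contains.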

Page 34:

Experiments & Results(Question Retrieval)

• 50 short queries from Collection B, searching only the title field
• Similarities between query Q and Q titles calculated
• Compare the model's performance with the vector space model with cosine similarity, Okapi BM25, and the query likelihood language model

Page 35:

Experiments & Results cont…(Question Retrieval)

Model              Cosine   LM      Okapi   Trans
MAP                0.183    0.258   0.251   0.314
R-Precision @ 5    0.368    0.492   0.476   0.520
R-Precision @ 10   0.310    0.456   0.436   0.480

• The approach outperforms the other baseline models at all recall levels
• QL and Okapi show comparable performance
• In all evaluations, the approach outperforms the other models

Page 36:

Conclusions and Seminal Paper Relevance

• A retrieval model based on translation probabilities learned from the archive significantly outperforms the other approaches in finding semantically similar questions, despite lexical mismatch
• Learning translation probabilities by measuring the similarity of answers is a much more robust approach for finding similar QA pairs, with fewer prerequisites on the corpus

Page 37:

References

• Burke, R. D., Hammond, K. J., Kulyukin, V. A., Lytinen, S. L., Tomuro, N., & Schoenberg, S. (1997). Question answering from frequently asked question files: Experience with the FAQ Finder system (Tech. Rep.). Chicago, IL, USA.

• Jiwoon Jeon, W. Bruce Croft, and Joon Ho Lee. 2005. Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM international conference on Information and knowledge management (CIKM '05). ACM, New York, NY, USA, 84-90.