A Word Clustering Approach for Language Model-based Sentence Retrieval in Question Answering Systems


A Word Clustering Approach for Language Model-based Sentence Retrieval in Question Answering Systems
Saeedeh Momtazi, Dietrich Klakow
University of Saarland, Germany (CIKM'09)

Advisor: Dr. Koh, Jia-Ling
Speaker: Cho, Chin-Wei
Date: 2010.03.01

Outline

• Introduction
• Term Clustering Model
• Term Clustering Algorithm
• N-gram Models
• Experiments
• Conclusion

Introduction

•Open domain QA has become one of the most actively investigated topics in natural language processing.

• A user receives an exact answer to their question rather than being overwhelmed with a large number of retrieved documents, which they must then sort through to find the desired answer.

Introduction

• In a complete QA system, document retrieval is an important component which should provide a list of candidate documents to be analyzed by the rest of the system.

•Retrieved documents are much larger than the required answer, and topic changes typically occur within a single document.

• Relevant information is most often found in one or two sentences.

Introduction

•Hence, it is essential to split the text into smaller segments, such as sentences, and rank them in a sentence retrieval step.

•Retrieved sentences are then further processed using a variety of techniques to extract the final answers.

Introduction

• Simple word-based unigram:
▫ all terms are treated independently
▫ no relationships between words are considered
▫ a search is performed for only the exact literal words present in the query
▫ it fails to retrieve other relevant information

Introduction

• Q: “Who invented the car?”
• A1: “Between 1832 and 1839, Robert Anderson of Scotland invented the first crude electric car carriage.”
…
• A2: “Nicolas-Joseph Cugnot built the first self-propelled mechanical vehicle.”
• A3: “An automobile powered by his own engine was built by Karl Benz in 1885 and granted a patent.”

Introduction

•For this reason, it is desirable to have a more sophisticated model to capture the semantics of sentences rather than just the term distributions.

• Improvements in system performance from such schemes have proven challenging:
▫ the difficulty of estimating term relationships
▫ the difficulty of integrating both exact match and term relationships in a single weighting scheme

Term Clustering Model

•We use a class-based LM by applying term clustering to capture term relationships.

• LM-based sentence retrieval
▫ the probability P(Q|S) of generating query Q conditioned on the observation of sentence S is first calculated.
▫ Sentences are ranked in descending order of this probability.

Term Clustering Model

•For word-based unigrams, the probability of the query Q given the sentence S is estimated based on query terms:
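The equation on this slide is an image in the original; the standard query-likelihood form it describes (with q_1, …, q_|Q| the query terms) is:

```latex
P(Q \mid S) = \prod_{i=1}^{|Q|} P(q_i \mid S)
```

where P(q_i | S) is the (smoothed) relative frequency of q_i in the sentence.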

Term Clustering Model

•For class-based unigrams, P(Q|S) is computed based on the cluster to which each query term belongs:
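The class-based equation is likewise an image in the original; assuming a hard clustering C(·) that maps each term to its cluster, the form the slide describes is:

```latex
P(Q \mid S) = \prod_{i=1}^{|Q|} P\bigl(C(q_i) \mid S\bigr)
```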

Term Clustering Model - Overview

• For each sentence in the word-based model, the sentence terms are extracted.

“Between 1832 and 1839, Robert Anderson of Scotland invented the first crude electric car carriage.” -> “invented”, “car”, ….

• Then the sentence unigram is created from the terms.
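The two steps above (extract the sentence terms, then build a sentence unigram and score the query with it) can be sketched as follows. The Jelinek-Mercer smoothing against a collection model is an assumption for illustration, not something the slides specify:

```python
import math
from collections import Counter

def unigram_score(query_terms, sentence_terms, coll_counts, coll_total, lam=0.5):
    """log P(Q|S) under a word-based sentence unigram,
    smoothed with the collection model (Jelinek-Mercer, assumed here)."""
    counts = Counter(sentence_terms)
    n = len(sentence_terms)
    score = 0.0
    for w in query_terms:
        if coll_counts[w] == 0:          # skip out-of-vocabulary query terms
            continue
        p_sent = counts[w] / n
        p_coll = coll_counts[w] / coll_total
        score += math.log(lam * p_sent + (1 - lam) * p_coll)
    return score

# toy usage: rank two of the example answer sentences for the query
sentences = [
    "robert anderson of scotland invented the first crude electric car carriage".split(),
    "nicolas-joseph cugnot built the first self propelled mechanical vehicle".split(),
]
coll_counts = Counter(w for s in sentences for w in s)
coll_total = sum(coll_counts.values())
query = "who invented the car".split()
ranked = sorted(sentences,
                key=lambda s: unigram_score(query, s, coll_counts, coll_total),
                reverse=True)
```

The sentence that literally contains “invented” and “car” ranks first, while the “vehicle” sentence gets no credit for the semantic relationship, which is exactly the limitation the class-based model targets.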

Term Clustering Model - Overview

• In the class-based model, by contrast, a map between terms and clusters is created using a term clustering technique.

• In this case, instead of extracting the terms, the clusters which the terms belong to are extracted.

(“car”, “vehicle”, “automobile”)

(“built”, “invented”)

• Then the sentence unigram is created from the clusters.

Term Clustering Model - Overview

• Finally, P(Q|S) is computed based on the new model of S, and the sentences are ranked according to this probability.
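A minimal sketch of the class-based scoring, assuming a hypothetical term-to-cluster table built from the slide's examples (the cluster names are invented for illustration, and the same Jelinek-Mercer smoothing as in the word-based sketch is assumed):

```python
import math
from collections import Counter

# hypothetical term -> cluster map, following the slide's examples
CLUSTER = {"car": "VEHICLE", "vehicle": "VEHICLE", "automobile": "VEHICLE",
           "built": "MAKE", "invented": "MAKE"}

def to_classes(terms):
    """Map each term to its cluster id; unclustered terms stay as themselves."""
    return [CLUSTER.get(t, t) for t in terms]

def class_unigram_score(query_terms, sentence_terms, coll_counts, coll_total, lam=0.5):
    """log P(Q|S) where both Q and S are projected onto clusters first."""
    q = to_classes(query_terms)
    s = to_classes(sentence_terms)
    counts, n = Counter(s), len(s)
    score = 0.0
    for c in q:
        if coll_counts[c] == 0:          # skip out-of-vocabulary clusters
            continue
        score += math.log(lam * counts[c] / n + (1 - lam) * coll_counts[c] / coll_total)
    return score

# usage: the "vehicle" sentence now matches the query term "car"
sentences = [
    "nicolas-joseph cugnot built the first self propelled mechanical vehicle".split(),
    "a patent was granted in 1885".split(),
]
coll_counts = Counter(c for s in sentences for c in to_classes(s))
coll_total = sum(coll_counts.values())
query = "who invented the car".split()
scores = [class_unigram_score(query, s, coll_counts, coll_total) for s in sentences]
```

Because “car”, “vehicle”, “invented”, and “built” collapse onto shared clusters, the first sentence now scores higher even though it shares no literal content word with the query.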

Term Clustering Algorithm

• The Brown algorithm clusters terms in a bottom-up fashion, greedily merging the pair of clusters whose merge best preserves the Average Mutual Information (AMI) between adjacent clusters.
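A heavily simplified sketch of this idea on a toy corpus (the real Brown algorithm uses incremental AMI updates and frequency-ordered merging to stay tractable; here every candidate merge is naively re-evaluated from scratch):

```python
import math
from collections import Counter

def ami(bigrams, assign):
    """Average mutual information between adjacent cluster pairs."""
    pair = Counter((assign[a], assign[b]) for a, b in bigrams)
    n = sum(pair.values())
    left, right = Counter(), Counter()
    for (c1, c2), k in pair.items():
        left[c1] += k
        right[c2] += k
    return sum((k / n) * math.log(k * n / (left[c1] * right[c2]))
               for (c1, c2), k in pair.items())

def brown_clusters(tokens, k):
    """Bottom-up: start with one cluster per word type and greedily apply
    the merge that keeps AMI highest, until k clusters remain."""
    bigrams = list(zip(tokens, tokens[1:]))
    assign = {w: w for w in set(tokens)}
    while len(set(assign.values())) > k:
        clusters = sorted(set(assign.values()))
        best_merge, best_ami = None, -math.inf
        for i, c1 in enumerate(clusters):
            for c2 in clusters[i + 1:]:
                trial = {w: (c1 if c == c2 else c) for w, c in assign.items()}
                a = ami(bigrams, trial)
                if a > best_ami:
                    best_ami, best_merge = a, (c1, c2)
        c1, c2 = best_merge
        assign = {w: (c1 if c == c2 else c) for w, c in assign.items()}
    return assign

# toy corpus: "car" and "vehicle" occur in identical contexts, so merging
# them loses no mutual information and is chosen first
tokens = "the car moved the vehicle moved the car moved the vehicle moved".split()
assign = brown_clusters(tokens, 3)
```

Words with identical left and right contexts (“car”, “vehicle”) end up in the same cluster, which is precisely the distributional behavior the slide's example clusters rely on.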


• significantly, highly, strongly, fairly
• shipping, carrying
• Jazz, symphony, theater

•For example, seeing the term “highly” in a query, the class-based model is able to retrieve sentences that do not have this term but are nonetheless relevant to the query in that the cluster also contains other terms like “strongly”.

Interpolation Model

• The class-based model may
▫ increase system recall, in that it is able to find more relevant sentences.
▫ decrease system precision, by retrieving more irrelevant sentences.

• We use the linear interpolation of our new class-based model and the word-based model:
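The interpolation equation is an image in the original; it has the usual linear-mixture form, with a weight λ ∈ [0, 1] tuned empirically (the symbol name is an assumption):

```latex
P(q_i \mid S) = \lambda\, P_{\mathrm{class}}(q_i \mid S) + (1 - \lambda)\, P_{\mathrm{word}}(q_i \mid S)
```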

N-gram Models

• Unigram (N=1) is most often used in information retrieval applications
▫ the word dependencies are not considered

• To overcome this problem, it is necessary to consider longer word contexts, such as those used in the bigram (N=2) and trigram (N=3).
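For reference, a word-based bigram conditions each query term on its predecessor (the trigram additionally conditions on q_{i-2}):

```latex
P(Q \mid S) = \prod_{i=1}^{|Q|} P(q_i \mid q_{i-1}, S)
```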

N-gram Models

• It is very difficult to use higher-order n-grams for sentence retrieval.
▫ data sparsity at the sentence level is much more pronounced than at the document level.
▫ the probability of seeing large patterns is very low.

• To overcome this problem, we applied the class-based bigram and trigram:
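The class-based n-gram equations are images in the original; with C(·) the term-to-cluster map, the class-based bigram can be written as follows (the trigram conditions on the two preceding clusters):

```latex
P(Q \mid S) = \prod_{i=1}^{|Q|} P\bigl(C(q_i) \mid C(q_{i-1}), S\bigr)
```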

Experiments

• Data set
▫ TREC 2007 questions from the QA track
▫ A set of 1,960 sentences was used as input to our system.
▫ The lexicon contained nearly 10,000 terms, which included all words present in the questions and the sentences to be processed.


Conclusion

• We found that the class-based model using term clustering is an effective approach to addressing the data sparsity, exact matching, and term independence problems.

•Linear interpolation of a class-based bigram and a word-based unigram was investigated.

•Our results indicated a significant improvement of sentence retrieval mean average precision from 23.62% to 29.91%.