
Recovering Question Answering Errors via Query Revision

Semih Yavuz and Izzeddin Gur and Yu Su and Xifeng Yan
Department of Computer Science, University of California, Santa Barbara

{syavuz,izzeddingur,ysu,xyan}@cs.ucsb.edu

Abstract

Existing factoid QA systems often lack a post-inspection component that can help models recover from their own mistakes. In this work, we propose to cross-check the corresponding KB relations behind the predicted answers and identify potential inconsistencies. Instead of developing a new model that accepts evidence collected from these relations, we choose to plug it back into the original questions directly and check whether the revised question makes sense or not. A bidirectional LSTM is applied to encode revised questions. We develop a scoring mechanism over the revised question encodings to refine the predictions of a base QA system. This approach can improve the F1 score of STAGG (Yih et al., 2015), one of the leading QA systems, from 52.5% to 53.9% on the WEBQUESTIONS data.

1 Introduction

With the recent advances in building large-scale knowledge bases (KBs) such as Freebase (Bollacker et al., 2008), DBpedia (Auer et al., 2007), and YAGO (Suchanek et al., 2007) that contain the world's factual information, KB-based question answering has been receiving increasing research attention. Traditional semantic parsing is one of the most promising approaches to this problem; it maps questions onto logical forms using logical languages such as CCG (Kwiatkowski et al., 2013; Reddy et al., 2014; Choi et al., 2015; Reddy et al., 2016) or DCS (Berant et al., 2013; Berant and Liang, 2014, 2015), or directly onto query graphs (Yih et al., 2015) with predicates closely related to the KB schema. Recently, neural network based models have also been applied to question answering (Bordes et al., 2015; Yih et al., 2015; Xu et al., 2016a,b).

Figure 1: Sketch of our approach. Elements in solid round rectangles are KB relation labels. The relation on the left is correct, but the base QA system predicts the one on the right. Dotted rectangles represent revised questions with relation labels plugged in. The left revised question looks semantically closer to the original question and is more consistent in itself. Hence, it shall be ranked higher than the right one.

While these approaches have yielded successful results, they often lack a post-inspection component that can help models recover from their own mistakes. Table 1 shows the potential improvement we could achieve if such a component existed. Can we leverage textual evidence related to the predicted answers to recover from a prediction error? In this work, we show that it is possible.

Our strategy is to cross-check the corresponding KB relations behind the predicted answers and identify potential inconsistencies. As an intermediate step, we define question revision as a tailored transformation of the original question using textual evidence collected from these relations in a knowledge base, and we check whether the revised questions make sense or not. Figure 1 illustrates the idea on the example question "what did Mary Wollstonecraft fight for?". Obviously, "what [area of activism] did [activist] fight for?" looks more consistent than "what [profession] did [person] fight for?". We build a model that prefers the former. This model is specialized for comparing the revised questions and checking which one makes better sense, not for answering the revised questions. This strategy differentiates our work from many existing QA studies.


Refinement            F1     # Refined Qs
STAGG                 52.5   -
w/ Best Alternative   58.9   639

Table 1: What if we knew the questions on which the system makes mistakes? The best alternative is computed by replacing STAGG's predictions on incorrectly answered questions with its second top-ranked candidate.

Given a question, we first create its revisions with respect to candidate KB relations. We encode the question revisions using a bidirectional LSTM. A scoring mechanism over these encodings is jointly trained with the LSTM parameters, with the objective that the question revised by a correct KB relation scores higher than those revised by other candidate KB relations by a certain confidence margin. We evaluate our method using STAGG (Yih et al., 2015) as the base question answering system. Our approach improves the F1 performance of STAGG from 52.5% to 53.9% on the benchmark dataset WEBQUESTIONS (Berant et al., 2013). Certainly, one can develop specialized LSTMs that directly accommodate textual evidence without revising questions. We have modified QA-LSTM and ATTENTIVE-LSTM (Tan et al., 2016) accordingly (see Section 4). However, so far the performance is not as good as that of the question revision approach.

2 Question Revisions

We formalize three kinds of question revisions, namely entity-centric, answer-centric, and relation-centric, which revise the question with respect to evidence from the topic entity type, the answer type, and the relation description, respectively. As illustrated in Figure 2, we design the revisions to capture generalizations at different granularities while preserving the question structure.

Let s_r (e.g., Activist) and o_r (e.g., ActivismIssue) denote the subject and object types of a KB relation r (e.g., AreaOfActivism), respectively. Let α (type.object.name) denote a function returning the textual description of a KB element (e.g., a relation, entity, or type). Assuming that a candidate answer set is retrieved by executing a KB relation r from a topic entity in the question, we can uniquely identify the types of the topic entity and the answer for this hypothesis by s_r and o_r, respectively. It is also possible that a chain of relations r = r_1 r_2 ... r_k is used to retrieve an answer set from a topic entity. When k = 2, by abuse of notation, we define s_{r_1 r_2} = s_{r_1}, o_{r_1 r_2} = o_{r_2}, and α(r_1 r_2) = concat(α(r_1), α(r_2)).

Figure 2: Illustration of the different question revision strategies on the running example w.r.t. the KB relation activism.activist.area_of_activism.

Let m : (q, r) → q' denote a mapping from a given question q = [w_1, w_2, ..., w_L] and a KB relation r to a revised question q'. We denote the index spans of the wh-words (e.g., "what") and the topic entity (e.g., "Mary Wollstonecraft") by [i_s, i_e] and [j_s, j_e], respectively.

Entity-Centric (EC). Entity-centric question revision aims at a generalization at the entity level. We construct it by replacing the topic entity tokens with its type. For the running example, it becomes "what did [activist] fight for". Formally, m_EC(q, r) = [w_{[1:j_s-1]}; α(s_r); w_{[j_e+1:L]}].

Answer-Centric (AC). It is constructed by augmenting the wh-words of the entity-centric question revision with the answer type. The running example is revised to "[what activism issue] did [activist] fight for". Formally, we define it as m_AC(q, r) = [w'_{[1:i_e]}; α(o_r); w'_{[i_e+1:L]}], where the w'_i are the tokens of the entity-centric revised question.

Relation-Centric (RC). Here we augment the wh-words with the relation description instead. This form of question revision has the most expressive power in distinguishing between the KB relations in the question context, but it can suffer more from training data sparsity. For the running example, it maps to "[what area of activism] did [activist] fight for". Formally, it is defined as m_RC(q, r) = [w'_{[1:i_e]}; α(r); w'_{[i_e+1:L]}].
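To make the three revision operators concrete, here is a minimal Python sketch of how they could be implemented. It only illustrates the definitions above: the string arguments standing in for α(s_r), α(o_r), and α(r), as well as the span handling, are assumptions for illustration rather than the authors' code.

```python
def revise(tokens, wh_span, ent_span, alpha_s, alpha_o, alpha_r, kind="AC"):
    """Build an EC/AC/RC question revision from a tokenized question.

    tokens:   question words, e.g. ["what", "did", "mary", "wollstonecraft", "fight", "for"]
    wh_span:  (i_s, i_e) inclusive 0-based span of the wh-words, e.g. (0, 0)
    ent_span: (j_s, j_e) inclusive 0-based span of the topic entity, e.g. (2, 3)
    alpha_s / alpha_o / alpha_r: textual descriptions of the subject type,
        object type, and relation, i.e. the paper's alpha(s_r), alpha(o_r), alpha(r).
    """
    js, je = ent_span
    # Entity-centric (EC): replace the topic entity with its type.
    ec = tokens[:js] + alpha_s.split() + tokens[je + 1:]
    if kind == "EC":
        return ec
    # AC/RC: augment the wh-words of the EC revision with extra evidence.
    # (The wh-words precede the entity, so their indices are unchanged in ec.)
    _, ie = wh_span
    evidence = alpha_o if kind == "AC" else alpha_r
    return ec[:ie + 1] + evidence.split() + ec[ie + 1:]


q = "what did mary wollstonecraft fight for".split()
print(revise(q, (0, 0), (2, 3), "activist", "activism issue",
             "area of activism", kind="AC"))
# -> ['what', 'activism', 'issue', 'did', 'activist', 'fight', 'for']
```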

3 Model

3.1 Task Formulation

Given a question q, we first run an existing QA system to answer q. Suppose it returns r as the top predicted relation and r' is a candidate relation that is ranked lower. Our objective is to decide whether there is a need to replace r with r'. We formulate this task as finding a scoring function s(q, r) and a confidence margin threshold t,

\text{replace}(r, r', q) =
\begin{cases}
1, & \text{if } s(q, r') - s(q, r) \geq t \\
0, & \text{otherwise}
\end{cases}
\qquad (1)

which makes the replacement decision.
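As a sanity check, Equation 1 translates directly into a few lines of Python. The scoring function s is assumed to be the trained model described in the following subsections; nothing here is specific to the authors' implementation.

```python
def replace_decision(s, q, r, r_prime, t):
    """Eq. (1): keep the base prediction r unless the lower-ranked candidate
    r_prime beats it by at least the confidence margin t."""
    return 1 if s(q, r_prime) - s(q, r) >= t else 0
```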

3.2 Encoding Question Revisions

Let q' = (w'_1, w'_2, ..., w'_l) denote a question revision. We first encode all the words into a d-dimensional vector space using an embedding matrix. Let e_i denote the embedding of word w'_i. To obtain contextual embeddings for the words, we use a bidirectional LSTM:

\overrightarrow{h}_i = \text{LSTM}_{\text{fwd}}(\overrightarrow{h}_{i-1}, e_i) \qquad (2)

\overleftarrow{h}_i = \text{LSTM}_{\text{bwd}}(\overleftarrow{h}_{i+1}, e_i) \qquad (3)

with \overrightarrow{h}_0 = 0 and \overleftarrow{h}_{l+1} = 0. We combine the forward and backward contextual embeddings by h_i = concat(\overrightarrow{h}_i, \overleftarrow{h}_i). We then generate the final encoding of the revised question q' by enc(q') = concat(h_1, h_l).
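The encoder, together with the linear scoring layer w^T enc(·) described in the next subsection, is straightforward to sketch. The snippet below is a minimal PyTorch rendering of Equations 2-3 and enc(q') = concat(h_1, h_l); the original implementation used TensorFlow, and the class name, dimensions, and the absence of padding/masking are simplifying assumptions.

```python
import torch
import torch.nn as nn

class RevisionEncoder(nn.Module):
    """Sketch of the bidirectional LSTM encoder: enc(q') = concat(h_1, h_l),
    where each h_i concatenates the forward and backward states (Eqs. 2-3)."""

    def __init__(self, vocab_size, dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)            # word embeddings e_i
        self.lstm = nn.LSTM(dim, dim, bidirectional=True,
                            batch_first=True)                  # forward + backward LSTM
        self.w = nn.Linear(4 * dim, 1, bias=False)             # scoring vector w

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) word indices of a revised question m(q, r)
        h, _ = self.lstm(self.embed(token_ids))                # (batch, seq_len, 2*dim)
        enc = torch.cat([h[:, 0, :], h[:, -1, :]], dim=-1)     # concat(h_1, h_l)
        return self.w(enc).squeeze(-1)                         # s(q, r) = w^T enc(m(q, r))
```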

3.3 Training Objective

Score Function. Given a question revision mapping m, a question q, and a relation r, our scoring function is defined as s(q, r) = w^T enc(m(q, r)), where w is a model parameter that is jointly learned with the LSTM parameters.

Loss Function. Let T = {(q, a_q)} denote a set of training questions paired with their true answer sets. Let U(q) denote the set of all candidate KB relations for question q. Let f(q, r) denote the F1 value of the answer set obtained by relation r when compared to a_q. For each candidate relation r ∈ U(q) with a positive F1 value, we define

N(q, r) = \{ r' \in U(q) : f(q, r) > f(q, r') \} \qquad (4)

as the set of its negative relations for question q. Similar to the hinge loss in (Bordes et al., 2014), we define the objective function J(θ, w, E) as

\sum_{(q, r, r')} \max\big(0, \delta_\lambda(q, r, r') - (s(q, r) - s(q, r'))\big) \qquad (5)

where the sum is taken over all valid (q, r, r') triplets and the penalty margin is defined as \delta_\lambda(q, r, r') = \lambda (f(q, r) - f(q, r')).

We use this loss function because: i) it allows us to exploit partially correct answers via F1 scores, and ii) training with it updates the model parameters towards putting a large margin between the scores of correct (r) and incorrect (r') relations, which is naturally aligned with our prediction refinement objective defined in Equation 1.
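For concreteness, a PyTorch sketch of the ranking objective in Equations 4-5 for a single question follows. The data layout (dictionaries keyed by candidate relation) and the unbatched loop are assumptions made for readability, not the authors' training code.

```python
import torch

def ranking_loss(model, revision_ids, f1, lam=1.0):
    """Hinge loss of Eq. (5) for one question.

    revision_ids: dict mapping each candidate relation r in U(q) to the token
        ids (LongTensor) of the revised question m(q, r).
    f1: dict mapping each candidate relation r to its answer-set F1 f(q, r).
    lam: the penalty margin scalar lambda.
    """
    scores = {r: model(ids.unsqueeze(0)).squeeze(0) for r, ids in revision_ids.items()}
    terms = []
    for r, fr in f1.items():
        if fr <= 0:                               # only relations with positive F1
            continue
        for r2, fr2 in f1.items():
            if fr > fr2:                          # r2 is a negative of r, Eq. (4)
                delta = lam * (fr - fr2)          # delta_lambda(q, r, r2)
                terms.append(torch.clamp(delta - (scores[r] - scores[r2]), min=0))
    return torch.stack(terms).sum() if terms else torch.tensor(0.0)
```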

4 Alternative Solutions

Our approach directly integrates additional textual evidence with the question itself, which can be processed by any sequence-oriented model and can benefit from its future updates without significant modification. However, we could also design models that take this textual evidence into specific consideration, without even appealing to question revision. We have explored this option and tried two methods that closely follow QA-LSTM and ATTENTIVE-LSTM (Tan et al., 2016). The latter model achieves the state of the art for passage-level question answer matching. Unlike our approach, they encode questions and the evidence for candidate answers in parallel, and measure the semantic similarity between them using cosine distance. The effectiveness of these architectures has been shown in other studies (Neculoiu et al., 2016; Hermann et al., 2015; Chen et al., 2016; Mueller and Thyagarajan, 2016) as well.

We adapt these models to our setting as follows: (1) the textual evidence α(s_r) (equivalent of the EC revision), α(o_r) (equivalent of the AC revision), or α(r) (equivalent of the RC revision) of a candidate KB relation r is used in place of a candidate answer a in the original model; (2) we replace the entity mention with a universal #entity# token as in (Yih et al., 2015), because individual entities are rare and uninformative for semantic similarity; (3) we train the score function sim(q, r) using the objective defined in Eq. 5. Further details of the alternative solutions can be found in the Appendix.
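The contrast with the revision approach is easiest to see in the scoring step: instead of scoring one blended sequence, the alternative models score a (question, evidence) pair by cosine similarity. A hedged sketch, assuming some sequence encoder that returns a fixed-size vector per input (sharing one encoder for both sides is also an assumption here):

```python
import torch.nn.functional as F

def sim(encoder, question_ids, evidence_ids):
    """QA-LSTM-style alternative scorer: the question (entity mention replaced
    by a #entity# token) and the relation's textual evidence alpha(s_r),
    alpha(o_r), or alpha(r) are encoded in parallel and compared with cosine
    similarity, rather than being blended into a single revised question."""
    return F.cosine_similarity(encoder(question_ids), encoder(evidence_ids), dim=-1)
```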

5 Experiments

Datasets. For evaluation, we use WEBQUESTIONS (Berant et al., 2013), a benchmark dataset for QA on Freebase. It contains 5,810 questions whose answers are annotated from Freebase using Amazon Mechanical Turk. We also use SIMPLEQUESTIONS (Bordes et al., 2015), a collection of 108,442 question/Freebase-fact pairs, for training data augmentation in some of our experiments, which is denoted by +SimpleQ. in the results.


Method                               F1
(Berant et al., 2013)                35.7
(Yao and Van Durme, 2014)            33.0
(Berant and Liang, 2014)             39.9
(Bao et al., 2014)                   37.5
(Bordes et al., 2014)                39.2
(Yang et al., 2014)                  41.3
(Dong et al., 2015)                  40.8
(Yao, 2015)                          44.3
(Berant and Liang, 2015)             49.7
STAGG (Yih et al., 2015)             52.5
(Reddy et al., 2016)                 50.3
(Xu et al., 2016b)                   53.3
(Xu et al., 2016a)                   53.8
QUESREV on STAGG                     53.9

Ensemble
STAGG-RANK (Yavuz et al., 2016)      54.0
QUESREV on STAGG-RANK                54.3

Table 2: Comparison of our question revision approach (QUESREV) on STAGG with a variety of recent KB-QA systems.

Training Data Preparation. WEBQUESTIONS only provides question-answer pairs along with annotated topic entities. We generate the candidates U(q) for each question q by retrieving 1-hop and 2-hop KB relations r from the annotated topic entity e in Freebase. For each relation r, we query (e, r, ?) against Freebase and retrieve the candidate answers r_a. Then, we compute f(q, r) by comparing the answer set r_a with the annotated answers.
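Since f(q, r) drives both the negative sampling in Eq. 4 and the margin in Eq. 5, a small self-contained Python helper for the answer-set F1 is given below; the exact normalization the authors used is not specified, so treat this as an assumed, straightforward definition.

```python
def answer_f1(predicted, gold):
    """f(q, r): F1 between the answer set retrieved by relation r and the
    annotated gold answers for question q."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)          # answers that match the annotation
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```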

5.1 Implementation Details

Word embeddings are initialized with pretrained GloVe (Pennington et al., 2014) vectors [1] and updated during training. We set the dimension of the word embeddings and the size of the LSTM hidden layer to be equal and experiment with values in {50, 100, 200, 300}. We apply dropout regularization on both the input and output of the LSTM encoder with probability 0.5. We hand-tuned the penalty margin scalar λ to 1. The model parameters are optimized using Adam (Kingma and Ba, 2015) with a batch size of 32. We implemented our models in TensorFlow (Abadi et al., 2016).

[1] http://nlp.stanford.edu/projects/glove/

To refine the predictions of a base QA system, we take its second top-ranked prediction as the refinement candidate r' and employ the decision mechanism in Equation 1. The confidence margin threshold t is tuned by grid search on the training data after the score function is trained. The QUESREV-AC + RC model is obtained by a linear combination of the QUESREV-AC and QUESREV-RC models, which is formally defined in Appendix B. To evaluate the alternative solutions for prediction refinement, we apply the same decision mechanism replace(r, r', q) with the trained sim(q, r) of Section 4 as the score function.

Refinement Model                 WebQ.   + SimpleQ.
QA-LSTM (equiv. EC)              51.9    52.5
QA-LSTM (equiv. AC)              52.4    52.9
QA-LSTM (equiv. RC)              52.6    53.0
ATTENTIVE-LSTM (equiv. EC)       52.2    52.6
ATTENTIVE-LSTM (equiv. AC)       52.7    53.0
ATTENTIVE-LSTM (equiv. RC)       52.9    53.1
QUESREV-EC                       52.9    52.8
QUESREV-AC                       53.5    53.6
QUESREV-RC                       53.2    53.8
QUESREV-AC + RC                  53.3    53.9

Table 3: F1 performance of variants of our model QUESREV and of the alternative solutions on the base QA system STAGG.

We use a dictionary [2] to identify wh-words in a question. We find topic entity spans using the Stanford NER tagger (Manning et al., 2014). If there are multiple matches, we use the first matching span for both.
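The paper only states that the threshold t is chosen by grid search on the training data after the scoring function is trained; a plausible sketch of that selection step, with an assumed grid of candidate margins, looks as follows.

```python
def tune_threshold(examples, s, candidates=None):
    """Grid search for the confidence margin t of Eq. (1) on training data.

    examples: list of (q, r, r_prime, f1_r, f1_r_prime) tuples, where r is the
        base system's top prediction and r_prime its second-ranked candidate.
    s: the trained scoring function s(q, r).
    Returns the t that maximizes average F1 after applying the replacement rule.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(0, 201)]   # assumed grid of margins
    best_t, best_f1 = 0.0, -1.0
    for t in candidates:
        total = 0.0
        for q, r, rp, f1_r, f1_rp in examples:
            use_rp = s(q, rp) - s(q, r) >= t            # Eq. (1)
            total += f1_rp if use_rp else f1_r
        avg = total / len(examples)
        if avg > best_f1:
            best_t, best_f1 = t, avg
    return best_t
```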

5.2 Results

Table 2 presents the main result of our prediction refinement model applied to STAGG's output. Our approach improves the performance of a strong base QA system by 1.4% and achieves 53.9% in F1 measure, which is slightly better than the state-of-the-art KB-QA system (Xu et al., 2016a). However, it is important to note that Xu et al. (2016a) use the DBpedia knowledge base in addition to Freebase, as well as the Wikipedia corpus, which we do not utilize. Moreover, applying our approach to the STAGG predictions reranked by (Yavuz et al., 2016), referred to as STAGG-RANK in Table 2, leads to a further improvement over a strong ensemble baseline. These results suggest that our system captures signals orthogonal to the ones exploited in the base QA models. The improvements of QUESREV over both STAGG and STAGG-RANK are statistically significant.

In Table 3, we present variants of our approach. We observe that the AC model yields the best refinement results when trained only on WEBQUESTIONS data (the WebQ. column). This empirical observation is intuitively expected because AC has more generalization power than RC, which might make it more robust to training data sparsity. This intuition is further justified by the observation that augmenting the training data with SIMPLEQUESTIONS improves the performance of the RC model the most, as it has more expressive power.

[2] what, who, where, which, when, how


Example Predictions and Replacements

1. What position did vince lombardi play in college?
   STAGG: person.(education).institution
     - what position did person play in college
   QUESREV-EC: football_player.position_s
     - what position did american football player play in college

2. What did mary wollstonecraft fight for?
   STAGG: person.profession
     - what profession did person fight for
   QUESREV-AC: activist.area_of_activism
     - what activism issue did activist fight for

3. Where does the zambezi river start?
   STAGG: river.mouth
     - where mouth does the river start
   QUESREV-RC: river.origin
     - where origin does the river start

4. Where was anne boleyn executed?
   STAGG: person.place_of_birth
     - where place of birth was person executed
   QUESREV-RC: deceased_person.place_of_death
     - where place of death was deceased person executed

Table 4: Example questions and the corresponding predictions of STAGG (Yih et al., 2015) before and after using the replacements proposed by QUESREV. The colors red and blue indicate wrong and correct, respectively. Domain names of KB relations are dropped for brevity. person.(education).institution is used as shorthand for the 2-hop relation person.education-education.institution in Example 1.

Although both QA-LSTM and ATTENTIVE-LSTM lead to successful prediction refinements on STAGG, the question revision approach consistently outperforms both of the alternative solutions. This suggests that incorporating the new textual evidence by naturally blending it into the question context leads to a better mechanism for checking the consistency of KB relations with the question. One could argue that part of the improvements of the refinement models over STAGG in Table 3 may be due to model ensembling. However, the performance gap between QUESREV and the alternative solutions allows us to isolate this effect for the query revision approach.

6 Related Work

One of the promising approaches for KB-QA is semantic parsing, which uses a logical language such as CCG (Kwiatkowski et al., 2013; Reddy et al., 2014; Choi et al., 2015) or DCS (Berant et al., 2013; Berant and Liang, 2014, 2015) to find the right grounding of natural language in a knowledge base. Another major line of work (Bordes et al., 2014; Yih et al., 2015; Xu et al., 2016b) exploits vector space embeddings in different ways to directly measure the semantic similarity between questions and candidate answer subgraphs in the KB. In this work, we propose a post-inspection step that can help existing KB-QA systems recover from answer prediction errors.

Our work is conceptually related to query expansion, a well-explored technique (Qiu and Frei, 1993; Navigli and Velardi, 2003; Riezler et al., 2007; Fang, 2008; Sordoni et al., 2014; Almasri et al., 2016; Diaz et al., 2016) in the information retrieval area. The high-level idea behind query expansion is to reformulate the original query to improve retrieval performance. Our approach, on the other hand, reformulates questions using candidate answers and exploits the revised questions to measure the quality of the answers. Hence, it should rather be treated as a post-processing component.

7 Conclusion

We present a prediction refinement approach for question answering over knowledge bases. We introduce question revision as a tailored augmentation of the question via various textual evidence from KB relations. We exploit the revised questions as a way to reexamine the consistency of candidate KB relations with the question itself. We show that our method improves the quality of the answers produced by STAGG on the WEBQUESTIONS dataset.

Acknowledgements

The authors would like to thank the anonymous reviewers for their thoughtful comments. This research was sponsored in part by the Army Research Laboratory under cooperative agreement W911NF-09-2-0053, NSF IIS 1528175, and NSF CCF 1548848. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notice herein.

Appendix

See supplementary notes.


References

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16).

Mohannad Almasri, Catherine Berrut, and Jean-Pierre Chevallet. 2016. A comparison of deep learning based query expansion with pseudo-relevance feedback and mutual information. In European Conference on Information Retrieval (ECIR).

Soren Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In International Semantic Web Conference (ISWC).

Junwei Bao, Nan Duan, Ming Zhou, and Tiejun Zhao. 2014. Knowledge-based question answering as machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL).

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In Annual Meeting of the Association for Computational Linguistics (ACL).

Jonathan Berant and Percy Liang. 2015. Imitation learning of agenda-based semantic parsers. Transactions of the Association for Computational Linguistics (TACL).

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In ACM SIGMOD International Conference on Management of Data.

Antoine Bordes, Sumit Chopra, and Jason Weston. 2014. Question answering with subgraph embeddings. ArXiv.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. ArXiv.

Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Annual Meeting of the Association for Computational Linguistics (ACL).

Eunsol Choi, Tom Kwiatkowski, and Luke Zettlemoyer. 2015. Scalable semantic parsing with partial ontologies. In Annual Meeting of the Association for Computational Linguistics (ACL).

Fernando Diaz, Bhaskar Mitra, and Nick Craswell. 2016. Query expansion with locally-trained word embeddings. In Annual Meeting of the Association for Computational Linguistics (ACL).

Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Question answering over Freebase with multi-column convolutional neural networks. In Annual Meeting of the Association for Computational Linguistics (ACL).

Nan Duan. 2013. Minimum Bayes risk based answer re-ranking for question answering. In Annual Meeting of the Association for Computational Linguistics (ACL).

Hui Fang. 2008. A re-examination of query expansion using lexical resources. In Annual Meeting of the Association for Computational Linguistics (ACL).

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In Empirical Methods in Natural Language Processing (EMNLP).

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Jonas Mueller and Aditya Thyagarajan. 2016. Siamese recurrent architectures for learning sentence similarity. In AAAI Conference on Artificial Intelligence (AAAI).

Roberto Navigli and Paola Velardi. 2003. An analysis of ontology-based query expansion strategies. In European Conference on Machine Learning (ECML).

Paul Neculoiu, Maarten Versteegh, and Mihai Rotaru. 2016. Learning text similarity with siamese recurrent networks. In Annual Meeting of the Association for Computational Linguistics (ACL).

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP).

Yonggang Qiu and H.P. Frei. 1993. Concept based query expansion. In International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).


Siva Reddy, Mirella Lapata, and Mark Steedman. 2014. Large-scale semantic parsing without question-answer pairs. Transactions of the Association for Computational Linguistics (TACL).

Siva Reddy, Oscar Tackstrom, Michael Collins, Tom Kwiatkowski, Dipanjan Das, Mark Steedman, and Mirella Lapata. 2016. Transforming dependency structures to logical forms for semantic parsing. Transactions of the Association for Computational Linguistics (TACL).

Stefan Riezler, Alexander Vasserman, Ioannis Tsochantaridis, Vibhu Mittal, and Yi Liu. 2007. Statistical machine translation for query expansion in answer retrieval. In Annual Meeting of the Association for Computational Linguistics (ACL).

Alessandro Sordoni, Yoshua Bengio, and Jian-Yun Nie. 2014. Learning concept embeddings for query expansion by quantum entropy minimization. In AAAI Conference on Artificial Intelligence (AAAI).

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. YAGO: A core of semantic knowledge. In World Wide Web (WWW).

Ming Tan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2016. Improved representation learning for question answer matching. In Annual Meeting of the Association for Computational Linguistics (ACL).

Alberto Tellez-Valero, Manuel Montes-y-Gomez, Luis Villasenor-Pineda, and Anselmo Penas. 2008. Improving question answering by combining multiple systems via answer validation. In International Conference on Computational Linguistics and Intelligent Text Processing (CICLING).

Adam Trischler, Zheng Ye, Xingdi Yuan, and Kaheer Suleman. 2016. Natural language comprehension with the EpiReader. In Empirical Methods in Natural Language Processing (EMNLP).

Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016a. Hybrid question answering over knowledge base and free text. In International Conference on Computational Linguistics (COLING).

Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016b. Question answering on Freebase via relation extraction and textual evidence. In Annual Meeting of the Association for Computational Linguistics (ACL).

Min-Chul Yang, Nan Duan, Ming Zhou, and Hae-Chang Rim. 2014. Joint relational embeddings for knowledge-based question answering. In Empirical Methods in Natural Language Processing (EMNLP).

Xuchen Yao. 2015. Lean question answering over Freebase from scratch. In North American Chapter of the Association for Computational Linguistics (NAACL).

Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In Annual Meeting of the Association for Computational Linguistics (ACL).

Semih Yavuz, Izzeddin Gur, Yu Su, Mudhakar Srivatsa, and Xifeng Yan. 2016. Improving semantic parsing via answer type inference. In Empirical Methods in Natural Language Processing (EMNLP).

Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. 2015. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Annual Meeting of the Association for Computational Linguistics (ACL).