
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2296–2309, Florence, Italy, July 28 – August 2, 2019. ©2019 Association for Computational Linguistics


    Multi-Hop Paragraph Retrieval for Open-Domain Question Answering

Yair Feldman and Ran El-Yaniv
Department of Computer Science

Technion – Israel Institute of Technology
Haifa, Israel

    {yairf11, rani}@cs.technion.ac.il

    Abstract

This paper is concerned with the task of multi-hop open-domain Question Answering (QA). This task is particularly challenging since it requires the simultaneous performance of textual reasoning and efficient searching. We present a method for retrieving multiple supporting paragraphs, nested amidst a large knowledge base, which contain the necessary evidence to answer a given question. Our method iteratively retrieves supporting paragraphs by forming a joint vector representation of both a question and a paragraph. The retrieval is performed by considering contextualized sentence-level representations of the paragraphs in the knowledge source. Our method achieves state-of-the-art performance over two well-known datasets, SQuAD-Open and HotpotQA, which serve as our single- and multi-hop open-domain QA benchmarks, respectively.[1]

    1 Introduction

Textual Question Answering (QA) is the task of answering natural language questions given a set of contexts from which the answers to these questions can be inferred. This task, which falls under the domain of natural language understanding, has been attracting massive interest due to extremely promising results that were achieved using deep learning techniques. These results were made possible by the recent creation of a variety of large-scale QA datasets, such as TriviaQA (Joshi et al., 2017) and SQuAD (Rajpurkar et al., 2016). The latest state-of-the-art methods are even capable of outperforming humans on certain tasks (Devlin et al., 2018).[2]

The basic and arguably the most popular task of QA is often referred to as Reading Comprehension

[1] Code is available at https://github.com/yairf11/MUPPET
[2] https://rajpurkar.github.io/SQuAD-explorer/

(RC), in which each question is paired with a relatively small number of paragraphs (or documents) from which the answer can potentially be inferred. The objective in RC is to extract the correct answer from the given contexts or, in some cases, deem the question unanswerable (Rajpurkar et al., 2018). Most large-scale RC datasets, however, are built in such a way that the answer can be inferred using a single paragraph or document. This kind of reasoning is termed single-hop reasoning, since it requires reasoning over a single piece of evidence. A more challenging task, called multi-hop reasoning, is one that requires combining evidence from multiple sources (Talmor and Berant, 2018; Welbl et al., 2018; Yang et al., 2018). Figure 1 provides an example of a question requiring multi-hop reasoning. To answer the question, one must first infer from the first context that Alex Ferguson is the manager in question, and only then can the answer to the question be inferred with any confidence from the second context.

Another setting for QA is open-domain QA, in which questions are given without any accompanying contexts, and one is required to locate the contexts relevant to the questions in a large knowledge source (e.g., Wikipedia), and then extract the correct answer using an RC component. This task has recently seen a resurgence following the work of Chen et al. (2017), who used a TF-IDF based retriever to find potentially relevant documents, followed by a neural RC component that extracted the most probable answer from the retrieved documents. While this methodology performs reasonably well for questions requiring single-hop reasoning, its performance decreases significantly when used for open-domain multi-hop reasoning.

We propose a new approach to accomplishing this task, called iterative multi-hop retrieval, in which one iteratively retrieves the necessary evidence


Question: The football manager who recruited David Beckham managed Manchester United during what timeframe?
Context 1: The 1995–96 season was Manchester United's fourth season in the Premier League ... Their triumph was made all the more remarkable by the fact that Alex Ferguson ... had drafted in young players like Nicky Butt, David Beckham, Paul Scholes and the Neville brothers, Gary and Phil.
Context 2: Sir Alexander Chapman Ferguson, CBE (born 31 December 1941) is a Scottish former football manager and player who managed Manchester United from 1986 to 2013. He is regarded by many players, managers and analysts to be one of the greatest and most successful managers of all time.

Figure 1: An example of a question and its answer contexts from the HotpotQA dataset requiring multi-hop reasoning and retrieval. The first reasoning hop is highlighted in green, the second hop in purple, and the entity connecting the two is highlighted in blue bold italics. In the first reasoning hop, one has to infer that the manager in question is Alex Ferguson. Without this knowledge, the second context cannot possibly be retrieved with confidence, as the question could refer to any of the club's managers throughout its history. Therefore, an iterative retrieval is needed in order to correctly retrieve this context pair.

to answer a question. We believe this iterative framework is essential for answering multi-hop questions, due to the nature of their reasoning requirements.

    Our main contributions are the following:

• We propose a novel multi-hop retrieval approach, which we believe is imperative for truly solving the open-domain multi-hop QA task.

• We show the effectiveness of our approach, which achieves state-of-the-art results in both single- and multi-hop open-domain QA benchmarks.

• We also propose using sentence-level representations for retrieval, and show the possible benefits of this approach over paragraph-level representations.

While there are several works that discuss solutions for multi-hop reasoning (Dhingra et al., 2018; Zhong et al., 2019), to the best of our knowledge, this work is the first to propose a viable solution for open-domain multi-hop QA.

    2 Task Definition

We define the open-domain QA task by a triplet (KS, Q, A), where KS = {P_1, P_2, ..., P_|KS|} is a background knowledge source and P_i = (p_1, p_2, ..., p_{l_i}) is a textual paragraph consisting of l_i tokens, Q = (q_1, q_2, ..., q_m) is a textual question consisting of m tokens, and A = (a_1, a_2, ..., a_n) is a textual answer consisting of n tokens, typically a span of tokens p_{j_1}, ..., p_{j_n} in some P_i ∈ KS, or optionally a choice from a predefined set of possible answers. The objective of this task is to find the answer A to the question Q using the background knowledge source KS. Formally speaking, our task is to learn a function φ such that A = φ(Q, KS).

Single-Hop Retrieval  In the classic and most simple form of QA, questions are formulated in such a way that the evidence required to answer them may be contained in a single paragraph, or even in a single sentence. Thus, in the open-domain setting, it might be sufficient to retrieve a single relevant paragraph P_i ∈ KS using the information present in the given question Q, and have a reading comprehension model extract the answer A from P_i. We call this task variation single-hop retrieval.

Multi-Hop Retrieval  In contrast to the single-hop case, there are types of questions whose answers can only be inferred by using at least two different paragraphs. The ability to reason with information taken from more than one paragraph is known in the literature as multi-hop reasoning (Welbl et al., 2018). In multi-hop reasoning, not only might the evidence be spread across multiple paragraphs, but it is often necessary to first read a subset of these paragraphs in order to extract the useful information from the other paragraphs, which might otherwise be understood as not completely relevant to the question. This situation becomes even more difficult in the open-domain setting, where one must first find an initial evidence paragraph in order to be able to retrieve the rest. This is demonstrated in Figure 1, where one can observe that the second context alone may appear to be irrelevant to the question at hand and the information in the first context is necessary to retrieve the second part of the evidence correctly.

We extend the multi-hop reasoning ability to the open-domain setting, referring to it as multi-hop retrieval, in which the evidence paragraphs are


retrieved in an iterative fashion. We focus on this task and limit ourselves to the case where two iterations of retrieval are necessary and sufficient.

    3 Methodology

Our solution, which we call MUPPET (multi-hop paragraph retrieval), relies on the following basic scheme consisting of two main components: (a) a paragraph and question encoder, and (b) a paragraph reader. The encoder is trained to encode paragraphs into d-dimensional vectors, and to encode questions into search vectors in the same vector space. Then, a maximum inner product search (MIPS) algorithm is applied to find the most similar paragraphs to a given question. Several algorithms exist for fast (and possibly approximate) MIPS, such as the one proposed by Johnson et al. (2017). The most similar paragraphs are then passed to the paragraph reader, which, in turn, extracts the most probable answer to the question.

It is critical that the paragraph encodings do not depend on the questions. This enables storing precomputed paragraph encodings and executing efficient MIPS when given a new search vector. Without this property, any new question would require the processing of the complete knowledge source (or a significant part of it).

To support multi-hop retrieval, we propose the following extension to the basic scheme. Given a question Q, we first obtain its encoding q ∈ R^d using the encoder. Then, we transform it into a search vector q_s ∈ R^d, which is used to retrieve the top-k relevant paragraphs {P_1^Q, P_2^Q, ..., P_k^Q} ⊂ KS using MIPS. In each subsequent retrieval iteration, we use the paragraphs retrieved in its previous iteration to reformulate the search vector. This produces k new search vectors, {q̃_1^s, q̃_2^s, ..., q̃_k^s}, where q̃_i^s ∈ R^d, which are used in the same manner as in the first iteration to retrieve the next top-k paragraphs, again using MIPS. This method can be seen as performing a beam search of width k in the encoded paragraphs' space. A high-level view of the described solution is given in Figure 2.
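To make the retrieval loop concrete, the following is a minimal sketch of this two-iteration beam search over precomputed paragraph encodings. It is not the paper's implementation: brute-force inner products stand in for a real MIPS index such as the one of Johnson et al. (2017), and encode_q and reformulate are hypothetical stand-ins for the trained encoder and reformulation component.

    # Sketch of MUPPET-style two-iteration retrieval as a beam search of width k.
    import numpy as np

    rng = np.random.default_rng(0)
    d, num_paragraphs, k = 64, 1000, 8           # encoding size, corpus size, beam width
    P = rng.normal(size=(num_paragraphs, d))     # precomputed paragraph encodings

    def encode_q(question):
        # hypothetical question encoder: would produce the search vector q_s
        return rng.normal(size=d)

    def reformulate(q_s, p):
        # hypothetical reformulation component: conditions the search vector
        # on a paragraph retrieved in the previous iteration
        return 0.5 * q_s + 0.5 * p

    def mips_top_k(query, top_k):
        scores = P @ query                       # exact maximum inner product search
        return np.argsort(-scores)[:top_k]

    # Iteration 1: retrieve the top-k paragraphs for the question's search vector.
    q_s = encode_q("The football manager who recruited David Beckham ...")
    first_hop = mips_top_k(q_s, k)

    # Iteration 2: every retrieved paragraph reformulates the query, yielding a
    # beam of width k over paragraph pairs.
    pairs = [(int(i), int(j))
             for i in first_hop
             for j in mips_top_k(reformulate(q_s, P[i]), k)]
    print(len(pairs), "candidate paragraph pairs for the reader")

In a real system, P would hold the cached sentence-level encodings described next, and the candidate pairs would be scored by the relevance layer before being passed to the reader.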

    3.1 Paragraph and Question Encoder

We define f, our encoder model, in the following way. Given a paragraph P consisting of k sentences (s_1, s_2, ..., s_k) and m tokens (t_1, t_2, ..., t_m), such that s_i = (t_{i_1}, t_{i_2}, ..., t_{i_l}), where l is the length of the sentence, our encoder generates k respective d-dimensional encodings (s_1, s_2, ..., s_k) = f(P), one for each sentence. This is in contrast to previous work in paragraph retrieval in which only a single fixed-size representation is used for each paragraph (Lee et al., 2018; Das et al., 2019). The encodings are created by passing (t_1, t_2, ..., t_m) through the following layers.

Word Embedding  We use the same embedding layer as the one suggested by Clark and Gardner (2018). Each token t is embedded into a vector t using both character-level and word-level information. The word-level embedding t_w is obtained via pretrained word embeddings. The character-level embedding of a token t with l_t characters (t_1^c, t_2^c, ..., t_{l_t}^c) is obtained in the following manner: each character t_i^c is embedded into a fixed-size vector t_i^c. We then pass each token's character embeddings through a one-dimensional convolutional neural network, followed by max-pooling over the filter dimension. This produces a fixed-size character-level representation for each token, t_c = max(CNN(t_1^c, t_2^c, ..., t_{l_t}^c)). Finally, we concatenate the word-level and character-level embeddings to form the final word representation, t = [t_w; t_c].
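As an illustration only, here is a small PyTorch sketch of such a token-embedding layer, with a character CNN and max-pooling concatenated to a word embedding; the vocabulary sizes are placeholders, and in practice the word table would be initialized from pretrained vectors.

    # Sketch of the word + character token embedding: t = [t_w ; t_c].
    import torch
    import torch.nn as nn

    class TokenEmbedder(nn.Module):
        def __init__(self, word_vocab=5000, char_vocab=100,
                     word_dim=300, char_dim=20, char_filters=100, kernel=5):
            super().__init__()
            self.word_emb = nn.Embedding(word_vocab, word_dim)   # would be loaded from GloVe
            self.char_emb = nn.Embedding(char_vocab, char_dim)
            self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel, padding=kernel // 2)

        def forward(self, word_ids, char_ids):
            # word_ids: (batch, m) ; char_ids: (batch, m, max_chars)
            t_w = self.word_emb(word_ids)                         # (batch, m, word_dim)
            b, m, c = char_ids.shape
            t_c = self.char_emb(char_ids).view(b * m, c, -1)      # (b*m, chars, char_dim)
            t_c = self.char_cnn(t_c.transpose(1, 2))              # (b*m, filters, chars)
            t_c = t_c.max(dim=-1).values.view(b, m, -1)           # max-pool over characters
            return torch.cat([t_w, t_c], dim=-1)                  # t = [t_w ; t_c]

    emb = TokenEmbedder()
    words = torch.randint(0, 5000, (2, 12))        # 2 paragraphs, 12 tokens each
    chars = torch.randint(0, 100, (2, 12, 16))     # up to 16 characters per token
    print(emb(words, chars).shape)                 # torch.Size([2, 12, 400])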

Recurrent Layer  After obtaining the word representations, we use a bidirectional GRU (Cho et al., 2014) to process the paragraph and obtain the contextualized word representations,

(c_1, c_2, ..., c_m) = BiGRU(t_1, t_2, ..., t_m).

Sentence-wise max-pooling  Finally, we chunk the contextualized representations of the paragraph tokens into their corresponding sentence groups, and apply max-pooling over the time dimension of each sentence group to obtain the paragraph's d-dimensional sentence representations, s_i = max(c_{i_1}, c_{i_2}, ..., c_{i_l}). A high-level outline of the sentence encoder is shown in Figure 3a, where we can see a series of m tokens being passed through the aforementioned layers, producing k sentence representations.

The encoding q of a question Q is computed similarly, such that q = f(Q). Note that we produce a single vector for any given question, thus the max-pooling operation is applied over all question words at once, disregarding sentence information.
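A minimal sketch of the contextualization and sentence-wise pooling steps follows, assuming sentence boundaries are given as (start, end) token offsets; the dimensions are illustrative rather than the d = 1024 used in the paper.

    # Sketch: BiGRU over the paragraph tokens, then a max-pool per sentence span.
    import torch
    import torch.nn as nn

    d = 8                                             # encoding size (1024 in the paper)
    bigru = nn.GRU(input_size=6, hidden_size=d // 2, bidirectional=True, batch_first=True)

    tokens = torch.randn(1, 10, 6)                    # one paragraph, 10 token embeddings
    sentence_spans = [(0, 4), (4, 10)]                # two sentences: tokens [0,4) and [4,10)

    contextual, _ = bigru(tokens)                     # (1, 10, d)
    sent_reps = torch.stack(
        [contextual[0, s:e].max(dim=0).values for s, e in sentence_spans]
    )                                                 # (num_sentences, d)
    print(sent_reps.shape)                            # torch.Size([2, 8])

For a question, the same max-pool would simply be taken over all of its tokens at once, producing a single vector q.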


    Figure 2: A high-level overview of our solution, MUPPET.

    (a) Sentence Encoder (b) Reformulation Component

Figure 3: Architecture of the main components of our paragraph and question encoder. (a) Our sentence encoder architecture. The model receives a series of tokens as input and produces a sequence of sentence representations. (b) Our reformulation component architecture. This layer receives contextualized representations of a question and a paragraph, and produces a reformulated representation of the question.

Reformulation Component  The reformulation component receives a paragraph P and a question Q, and produces a single vector q̃. First, contextualized word representations are obtained using the same embedding and recurrent layers used for the initial encoding, (c_1^q, c_2^q, ..., c_{n_q}^q) for Q and (c_1^p, c_2^p, ..., c_{n_p}^p) for P. We then pass the contextualized representations through a bidirectional attention layer, which we adopt from Clark and Gardner (2018). The attention between question word i and paragraph word j is computed as:

a_ij = w_1^a · c_i^q + w_2^a · c_j^p + w_3^a · (c_i^q ⊙ c_j^p),

Context: One of the most famous people born in Warsaw was Maria Skłodowska-Curie, who achieved international recognition for her research on radioactivity and was the first female recipient of the Nobel Prize. Famous musicians include Władysław Szpilman and Frédéric Chopin. Though Chopin was born in the village of Żelazowa Wola, about 60 km (37 mi) from Warsaw, he moved to the city with his family when he was seven months old. Casimir Pulaski, a Polish general and hero of the American Revolutionary War, was born here in 1745.
Question 1: What was Maria Curie the first female recipient of?
Question 2: How old was Chopin when he moved to Warsaw with his family?

Figure 4: An example from the SQuAD dataset of a paragraph that acts as the context for two different questions. Question 1 and its evidence (highlighted in purple) have little relation to question 2 and its evidence (highlighted in green). This motivates our method of storing sentence-wise encodings instead of a single representation for an entire paragraph.

where w_1^a, w_2^a, w_3^a ∈ R^d are learned vectors. For each question word, we compute the vector a_i:

α_ij = e^{a_ij} / Σ_{j=1}^{n_p} e^{a_ij},   a_i = Σ_{j=1}^{n_p} α_ij · c_j^p.

A paragraph-to-question vector a^p is computed as follows:

m_i = max_{1≤j≤n_p} a_ij,   β_i = e^{m_i} / Σ_{i=1}^{n_q} e^{m_i},   a^p = Σ_{i=1}^{n_q} β_i · c_i^q.

We concatenate c_i^q, a_i, c_i^q ⊙ a_i and a^p ⊙ a_i and pass the result through a linear layer with ReLU


activations to compute the final bidirectional attention vectors. We also use a residual connection where we process these representations with a bidirectional GRU and another linear layer with ReLU activations. Finally, we sum the outputs of the two linear layers. As before, we derive the d-dimensional reformulated question representation q̃ using a max-pooling layer on the outputs of the residual layer. A high-level outline of the reformulation layer is given in Figure 3b, where m contextualized token representations of the question and n contextualized token representations of the paragraph are passed through the component's layers to produce the reformulated question representation, q̃.
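The compact PyTorch sketch below (illustrative dimensions, hypothetical parameter names) traces this component end to end: the attention scores a_ij, the question-to-paragraph vectors a_i, the paragraph-to-question vector a^p, the linear+ReLU projection, the BiGRU residual branch, and the final max-pool into q̃.

    # Sketch of the reformulation component (bidirectional attention + residual).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d = 8
    nq, np_ = 5, 7                                    # question / paragraph lengths
    cq = torch.randn(nq, d)                           # contextualized question tokens
    cp = torch.randn(np_, d)                          # contextualized paragraph tokens
    w1, w2, w3 = (nn.Parameter(torch.randn(d)) for _ in range(3))   # w_1^a, w_2^a, w_3^a

    # a_ij = w_1^a · c_i^q + w_2^a · c_j^p + w_3^a · (c_i^q ⊙ c_j^p)
    a = (cq @ w1)[:, None] + (cp @ w2)[None, :] + (cq * w3) @ cp.T  # (nq, np_)

    alpha = F.softmax(a, dim=1)                       # attention over paragraph words
    a_i = alpha @ cp                                  # question-to-paragraph vectors (nq, d)

    beta = F.softmax(a.max(dim=1).values, dim=0)      # paragraph-to-question weights (nq,)
    a_p = beta @ cq                                   # single vector a^p of size (d,)

    x = torch.cat([cq, a_i, cq * a_i, a_p * a_i], dim=-1)           # (nq, 4d)
    lin1 = nn.Linear(4 * d, d)
    h = F.relu(lin1(x))

    res_gru = nn.GRU(d, d // 2, bidirectional=True, batch_first=True)
    lin2 = nn.Linear(d, d)
    h2 = F.relu(lin2(res_gru(h.unsqueeze(0))[0].squeeze(0)))        # residual branch

    q_tilde = (h + h2).max(dim=0).values              # reformulated question vector (d,)
    print(q_tilde.shape)                              # torch.Size([8])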

Relevance Scores  Given the sentence representations (s_1, s_2, ..., s_k) of a paragraph P, and the question encoding q for Q, the relevance score of P with respect to a question Q is calculated in the following way:

rel(Q, P) = max_{i=1,...,k} σ([s_i; s_i ⊙ q; s_i · q; q] · [w_1; w_2; w_3; w_4] + b),

where w_1, w_2, w_4 ∈ R^d and w_3, b ∈ R are learned parameters.

A similar max-pooling encoding approach, along with the scoring layer's structure, were proposed by Conneau et al. (2017), who showed their efficacy on various sentence-level tasks. We find this sentence-wise formulation to be beneficial because it suffices for one sentence in a paragraph to be relevant to a question for the whole paragraph to be considered as relevant. This allows more fine-grained representations for paragraphs and more accurate retrieval. An example of the benefits of using this kind of sentence-level model is given in Figure 4, where we see two questions answered by two different sentences. Our model allows each question to be similar only to parts of the paragraph, and not necessarily to all of it.

Search Vector Derivation  Recall that our retrieval algorithm is based on executing a MIPS in the paragraph encoding space. To derive such a search vector from the question encoding q, we observe that:

rel(Q, P) ∝ max_{i=1,...,k} s_i^T (w_1 + w_2 ⊙ q + w_3 · q).

Therefore, the final search vector of a question Q is q_s = w_1 + w_2 ⊙ q + w_3 · q. The same equations apply when predicting the relevance score for the second retrieval iteration, in which case q is swapped with q̃.

Training and Loss Functions  Each training sample consists of a question and two paragraphs, (Q, P^1, P^2), where P^1 corresponds to a paragraph retrieved in the first iteration, and P^2 corresponds to a paragraph retrieved in the second iteration using the reformulated vector q̃. P^1 is considered relevant if it constitutes one of the necessary evidence paragraphs to answer the question. P^2 is considered relevant only if P^1 and P^2 together constitute the complete set of evidence paragraphs needed to answer the question. Both iterations have the same form of loss functions, and the model is trained by optimizing the sum of the iterations' losses.

Our training objective for each iteration is composed of two components: a binary cross-entropy loss function and a ranking loss function. The cross-entropy loss is defined as follows:

L_CE = -(1/N) Σ_{i=1}^{N} [ y_i log(rel(Q_i, P_i)) + (1 - y_i) log(1 - rel(Q_i, P_i)) ],

where y_i ∈ {0, 1} is a binary label indicating the true relevance of P_i to Q_i in the iteration in which rel(Q_i, P_i) is calculated, and N is the number of samples in the current batch.

The ranking loss is computed in the following manner. First, for each question Q_i in a given batch, we find the mean of the scores given to positive and negative paragraphs for each question, q_i^pos = (1/M_1) Σ_{j=1}^{M_1} rel(Q_i, P_j) and q_i^neg = (1/M_2) Σ_{j=1}^{M_2} rel(Q_i, P_j), where M_1 and M_2 are the number of positive and negative samples for Q_i, respectively. We then define the margin ranking loss (Socher et al., 2013) as

L_R = (1/M) Σ_{i=1}^{M} max(0, γ - q_i^pos + q_i^neg),   (1)

where M is the number of distinct questions in the current batch, and γ is a hyperparameter. The final objective is the sum of the two losses:

L = L_CE + λ · L_R,   (2)

where λ is a hyperparameter.
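A small sketch of these two loss terms on a toy batch, with rel(Q_i, P_j) replaced by fixed scores; γ and λ are the hyperparameters from Equations (1) and (2).

    # Sketch of the per-iteration training objective: L = L_CE + λ * L_R.
    import torch
    import torch.nn.functional as F

    gamma, lam = 1.0, 1.0                      # margin γ and weight λ (hyperparameters)

    # toy batch: 2 questions, each with a few scored paragraphs and binary labels
    scores = [torch.tensor([0.9, 0.7, 0.2]), torch.tensor([0.6, 0.3, 0.1, 0.4])]
    labels = [torch.tensor([1., 1., 0.]),    torch.tensor([1., 0., 0., 0.])]

    all_scores, all_labels = torch.cat(scores), torch.cat(labels)
    l_ce = F.binary_cross_entropy(all_scores, all_labels)       # L_CE over the batch

    rank_terms = []
    for s, y in zip(scores, labels):
        q_pos = s[y == 1].mean()                                 # mean positive score
        q_neg = s[y == 0].mean()                                 # mean negative score
        rank_terms.append(torch.clamp(gamma - q_pos + q_neg, min=0.0))
    l_r = torch.stack(rank_terms).mean()                         # L_R, Eq. (1)

    loss = l_ce + lam * l_r                                      # Eq. (2)
    print(float(l_ce), float(l_r), float(loss))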


We note that we found it slightly beneficial to incorporate pretrained ELMo (Peters et al., 2018) embeddings in our model. For more details on the implementation and training process, please refer to Appendix C.

3.2 Paragraph Reader

The paragraph reader receives as input a question Q and a paragraph P and extracts the most probable answer span to Q from P. We use the S-norm model proposed by Clark and Gardner (2018). A detailed description of the model is given in Appendix A.

Training  An input sample for the paragraph reader consists of a question and a single context (Q, P). We optimize the same negative log-likelihood function used in the S-norm model for the span start boundaries:

L_start = -log( Σ_{j∈P_Q} Σ_{k∈A_j} e^{s_{kj}} / Σ_{j∈P_Q} Σ_{i=1}^{n_j} e^{s_{ij}} ),

where P_Q is the set of paragraphs paired with the same question Q, A_j is the set of tokens that start an answer span in the j-th paragraph, and s_{ij} is the score given to the i-th token in the j-th paragraph. The same formulation is used for the span end boundaries, so that the final objective function is the sum of the two: L_span = L_start + L_end.
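A minimal sketch of the shared-norm start loss, in which the start-token scores of all paragraphs paired with the same question share a single normalization; the scores and answer positions are random stand-ins.

    # Sketch of the S-norm span-start loss; L_end is computed analogously.
    import torch

    # two paragraphs paired with the same question, with per-token start scores
    scores = [torch.randn(6), torch.randn(9)]
    answer_starts = [[2], []]                 # gold answer spans start only in paragraph 0

    numerator = torch.logsumexp(
        torch.cat([s[idx] for s, idx in zip(scores, answer_starts)]), dim=0
    )
    denominator = torch.logsumexp(torch.cat(scores), dim=0)
    l_start = -(numerator - denominator)      # L_start
    print(float(l_start))

Because the normalization runs over all paired paragraphs, the model is pushed to assign low scores everywhere in irrelevant paragraphs, not only to pick the best span within a relevant one.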

    4 Experiments and Results

We test our approach on two datasets, and measure end-to-end QA performance using the standard exact match (EM) and F1 metrics, as well as the metrics proposed by Yang et al. (2018) for the HotpotQA dataset (see Appendix B).

4.1 Datasets

HotpotQA  Yang et al. (2018) introduced a dataset of Wikipedia-based questions, which require reasoning over multiple paragraphs to find the correct answer. The dataset also includes hard supervision on sentence-level supporting facts, which encourages the model to give explainable answer predictions. Two benchmark settings are available for this dataset: (1) a distractor setting, in which the reader is given a question as well as a set of paragraphs that includes both the supporting facts and irrelevant paragraphs; (2) a full wiki setting, which is an open-domain version of the dataset. We use this dataset as our benchmark for the multi-hop retrieval setting. Several extensions must be added to the reader from Section 3.2 in order for it to be suitable for the HotpotQA dataset. A detailed description of our proposed extensions is given in Appendix B.

SQuAD-Open  Chen et al. (2017) decoupled the questions from their corresponding contexts in the original SQuAD dataset (Rajpurkar et al., 2016), and formed an open-domain version of the dataset by defining an entire Wikipedia dump to be the background knowledge source from which the answer to the question should be extracted. We use this dataset to test the effectiveness of our method in a classic single-hop retrieval setting.

    4.2 Experimental Setup

Search Hyperparameters  For our experiments in the multi-hop setting, we used a width of 8 in the first retrieval iteration. In all our experiments, unless stated otherwise, the reader is fed the top 45 paragraphs, through which it reasons independently and finds the most probable answers. In addition, we found it beneficial to limit the search space of our MIPS retriever to a subset of the knowledge source, which is determined by a TF-IDF heuristic retriever. We define n_i to be the size of the search space for retrieval iteration i. As we will see, there is a trade-off for choosing various values of n_i. A large value of n_i offers the possibility of higher recall, whereas a small value of n_i introduces less noise in the form of irrelevant paragraphs.
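As an illustration of this search-space reduction, the sketch below first narrows the corpus to the top-n_i TF-IDF candidates and only then runs the (here exact) inner-product search over their encodings; the corpus, the encodings, and the retrieve helper are all toy stand-ins, and a plain scikit-learn TF-IDF stands in for the bigram-hashing retriever of Chen et al. (2017).

    # Sketch: restrict the MIPS search space to the top-n_i TF-IDF candidates.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    paragraphs = ["alex ferguson managed manchester united",
                  "david beckham was drafted in by the manager",
                  "warsaw is the capital of poland",
                  "frederic chopin was born near warsaw"]
    encodings = np.random.default_rng(0).normal(size=(len(paragraphs), 32))  # from the encoder

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(paragraphs)

    def retrieve(question, q_s, n_i=2, k=1):
        q_tfidf = vectorizer.transform([question])
        candidates = np.argsort(-(tfidf @ q_tfidf.T).toarray().ravel())[:n_i]
        scores = encodings[candidates] @ q_s              # MIPS restricted to candidates
        return candidates[np.argsort(-scores)[:k]]

    print(retrieve("who recruited david beckham", np.full(32, 0.1)))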

Knowledge Sources  For HotpotQA, our knowledge source is the same Wikipedia version used by Yang et al. (2018).[3] This version is a set of all of the first paragraphs in the entire Wikipedia. For SQuAD-Open, we use the same Wikipedia dump used by Chen et al. (2017). For both knowledge sources, the TF-IDF based retriever we use for search space reduction is the one proposed by Chen et al. (2017), which uses bigram hashing and TF-IDF matching. We note that in the HotpotQA Wikipedia version each document is a single paragraph, while in SQuAD-Open, the full Wikipedia documents are used.

[3] It has recently come to our attention that during our work, some details of the Wikipedia version have changed. Due to time limitations, we use the initial version description.


Setting     Method                        Answer EM  Answer F1  Sup Fact EM  Sup Fact F1  Joint EM  Joint F1

distractor  Baseline (Yang et al., 2018)      44.44      58.28        21.95        66.66     11.56     40.86
distractor  Our Reader                        51.56      65.32        44.54        75.27     28.68     54.08

full wiki   Baseline (Yang et al., 2018)      24.68      34.36         5.28        40.98      2.54     17.73
full wiki   TF-IDF + Reader                   27.55      36.58        10.75        42.45      7.00     21.47
full wiki   MUPPET (sentence-level)           30.20      39.43        16.57        46.13     11.38     26.55
full wiki   MUPPET (paragraph-level)          31.07      40.42        17.00        47.71     11.76     27.62

Table 1: Primary results for HotpotQA (dev set). At the top of the table, we compare our Paragraph Reader to the baseline model of Yang et al. (2018) (as of writing this paper, no other published results are available other than the baseline results). At the bottom, we compare the end-to-end performance on the full wiki setting. TF-IDF + Reader refers to using the TF-IDF based retriever without our MIPS retriever. MUPPET (sentence-level) refers to our approach with sentence-level representations, and MUPPET (paragraph-level) refers to our approach with paragraph-level representations. For both sentence- and paragraph-level results, we set n_1 = 32 and n_2 = 512.

Method                                      EM    F1

DrQA (Chen et al., 2017)                    28.4   -
DrQA (Chen et al., 2017) (multitask)        29.8   -
R^3 (Wang et al., 2018a)                    29.1  37.5
DS-QA (Lin et al., 2018)                    28.7  36.6
Par. Ranker + Full Agg. (Lee et al., 2018)  30.2   -
Minimal (Min et al., 2018)                  34.7  42.6
Multi-step (Das et al., 2019)               31.9  39.2
BERTserini (Yang et al., 2019)              38.6  46.1

TF-IDF + Reader                             34.6  41.6
MUPPET (sentence-level)                     39.3  46.2
MUPPET (paragraph-level)                    35.6  42.5

    Table 2: Primary results for SQuAD-Open.

    4.3 Results

Primary Results  Tables 1 and 2 show our main results on the HotpotQA and SQuAD-Open datasets, respectively. In the HotpotQA distractor setting, our paragraph reader greatly improves the results of the baseline reader, increasing the joint EM and F1 scores by 17.12 (148%) and 13.22 (32%) points, respectively. In the full wiki setting, we compare three methods of retrieval: (1) TF-IDF, in which only the TF-IDF heuristic is used. The reader is fed all possible paragraph pairs from the top-10 paragraphs. (2) Sentence-level, in which we use MUPPET with sentence-level encodings. (3) Paragraph-level, in which we use MUPPET with paragraph-level encodings (no sentence information). We can see that both methods significantly outperform the naïve TF-IDF retriever, indicating the efficacy of our approach. As of writing this paper, we are placed second in the HotpotQA full wiki setting (test set)

leaderboard.[4] For SQuAD-Open, our sentence-level method established state-of-the-art results, improving the current non-BERT (Devlin et al., 2018) state of the art by 4.6 (13%) and 3.6 (8%) EM and F1 points, respectively. This shows that our encoder can be useful not only for multi-hop questions, but also for single-hop questions.

Retrieval Recall Analysis  We analyze the performance of the TF-IDF retriever for HotpotQA in Figure 5a. We can see that the retriever succeeds in retrieving at least one of the gold paragraphs for each question (above 90% with the top-32 paragraphs), but fails at retrieving both gold paragraphs. This demonstrates the necessity of an efficient multi-hop retrieval approach to aid or replace classic information retrieval methods.

Effect of Narrowing the Search Space  In Figures 5b and 5c, we show the performance of our method as a function of the size of the search space of the last retrieval iteration. For SQuAD-Open, the TF-IDF retriever initially retrieves a set of documents, which are then split into paragraphs to form the search space. Each search space of top-k paragraphs limits the potential recall of the model to that of the top-k paragraphs retrieved by the TF-IDF retriever. This proves to be suboptimal for very small values of k, as the performance of the TF-IDF retriever is not good enough. Our models, however, fail to benefit from increasing the search space indefinitely, hinting that they are not as robust to noise as we would want them to be.

[4] March 5, 2019. Leaderboard available at https://hotpotqa.github.io/


    (a) TF-IDF retrieval results (b) SQuAD-Open (c) HotpotQA

Figure 5: Various results based on the TF-IDF retriever. (a) Retrieval results of the TF-IDF heuristic retriever on HotpotQA. At Least One @ k is the number of questions for which at least one of the paragraphs containing the supporting facts is retrieved in the top-k paragraphs. Potentially Perfect @ k is the number of questions for which both of the paragraphs containing the supporting facts are retrieved in the top-k paragraphs. (b) and (c) Performance analysis on the SQuAD-Open and HotpotQA datasets, respectively, as more documents/paragraphs are retrieved by the TF-IDF heuristic retriever. Note that for SQuAD-Open each document contains several paragraphs, and the reader is fed the top-k TF-IDF ranked paragraphs from within the documents in the search space.

Effectiveness of Sentence-Level Encodings  Our method proposes using sentence-level encodings for paragraph retrieval. We test the significance of this approach in Figures 5b and 5c. While sentence-level encodings seem to be vital for improving state-of-the-art results on SQuAD-Open, the same cannot be said about HotpotQA. We hypothesize that this is a consequence of the way the datasets were created. In SQuAD, each paragraph serves as the context of several questions, as shown in Figure 4. This leads to questions being asked about facts less essential to the gist of the paragraph, and thus they would not be encapsulated in a single paragraph representation. In HotpotQA, however, most of the paragraphs in the training set serve as the context of at most one question.

    5 Related Work

Chen et al. (2017) first introduced the use of neural methods to the task of open-domain QA using a textual knowledge source. They proposed DrQA, a pipeline approach with two components: a TF-IDF based retriever, and a multi-layer neural network that was trained to find an answer span given a question and a paragraph. In an attempt to improve the retrieval of the TF-IDF based component, many existing works have used Distant Supervision (DS) to further re-rank the retrieved paragraphs (Htut et al., 2018; Yan et al., 2018). Wang et al. (2018a) used reinforcement learning to train a re-ranker and an RC component in an end-to-end manner, and showed its advantage

over the use of DS alone. Min et al. (2018) trained a sentence selector and demonstrated the effectiveness of reading minimal contexts instead of complete documents. As DS can often lead to wrong labeling, Lin et al. (2018) suggested a denoising method for alleviating this problem. While these methods have proved to increase performance in various open-domain QA datasets, their re-ranking approach is limited in the number of paragraphs it can process, as it requires the joint reading of a question with all possible paragraphs. This is in contrast to our approach, in which all paragraph representations are precomputed to allow efficient large-scale retrieval. There are some works that adopted a similar precomputation scheme. Lee et al. (2018) learned an encoding function for questions and paragraphs and ranked paragraphs by their dot-product similarity with the question. Many of their improvements, however, can be attributed to the incorporation of answer aggregation methods as suggested by Wang et al. (2018b) in their model, which enhanced their results significantly. Seo et al. (2018) proposed phrase-indexed QA (PI-QA), a new formulation of the QA task that requires the independent encoding of answers and questions. The question encodings are then used to retrieve the correct answers by performing MIPS. This is more of a challenge task rather than a solution for open-domain QA. A recent work by Das et al. (2019) proposed a new framework for open-domain QA that employs a multi-step interaction between a retriever and a reader. This interactive framework


is used to refine a question representation in order for the retrieval to be more accurate. Their method is complementary to ours – the interactive framework is used to enhance retrieval performance for single-hop questions, and does not handle the multi-hop domain.

Another line of work reminiscent of our method is that of Memory Networks (Weston et al., 2015). Memory Networks consist of an array of cells, each capable of storing a vector, and four modules (input, update, output and response) that allow the manipulation of the memory for the task at hand. Many variations of Memory Networks have been proposed, such as end-to-end Memory Networks (Sukhbaatar et al., 2015), Key-Value Memory Networks (Miller et al., 2016), and Hierarchical Memory Networks (Chandar et al., 2016).

    6 Concluding Remarks

We present MUPPET, a novel method for multi-hop paragraph retrieval, and show its efficacy in both single- and multi-hop QA datasets. One difficulty in the open-domain multi-hop setting is the lack of supervision, a difficulty that in the single-hop setting is alleviated to some extent by using distant supervision. We hope to tackle this problem in future work to allow learning more than two retrieval iterations. An interesting improvement to our approach would be to allow the retriever to automatically determine whether or not more retrieval iterations are needed. A promising direction could be a multi-task approach, in which both single- and multi-hop datasets are learned jointly. We leave this for future work.

    Acknowledgments

This research was partially supported by the Israel Science Foundation (grant No. 710/18).

References

Sarath Chandar, Sungjin Ahn, Hugo Larochelle, Pascal Vincent, Gerald Tesauro, and Yoshua Bengio. 2016. Hierarchical memory networks. arXiv preprint arXiv:1605.07427.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In ACL.

Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.

Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In ACL.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP.

Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2019. Multi-step retriever-reader interaction for scalable open-domain question answering. In ICLR.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2018. Neural models for reasoning over multiple mentions using coreference. In NAACL-HLT.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In NIPS.

Phu Mon Htut, Samuel R. Bowman, and Kyunghyun Cho. 2018. Training a ranking function for open-domain question answering. In NAACL-HLT.

Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. CoRR, abs/1702.08734.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.

Jinhyuk Lee, Seongjun Yun, Hyunjae Kim, Miyoung Ko, and Jaewoo Kang. 2018. Ranking paragraphs for improving answer recall in open-domain question answering. In EMNLP.

Yankai Lin, Haozhe Ji, Zhiyuan Liu, and Maosong Sun. 2018. Denoising distantly supervised open-domain question answering. In ACL.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-value memory networks for directly reading documents. In EMNLP.

Sewon Min, Victor Zhong, Richard Socher, and Caiming Xiong. 2018. Efficient and robust question answering from minimal context over documents. In ACL.


Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL-HLT.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In ACL.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP.

Minjoon Seo, Tom Kwiatkowski, Ankur P. Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2018. Phrase-indexed question answering: A new challenge for scalable document comprehension. In EMNLP.

Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In NIPS.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. JMLR.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In NIPS.

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In NAACL-HLT.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018a. R^3: Reinforced ranker-reader for open-domain question answering. In AAAI.

Shuohang Wang, Mo Yu, Jing Jiang, Wei Zhang, Xiaoxiao Guo, Shiyu Chang, Zhiguo Wang, Tim Klinger, Gerald Tesauro, and Murray Campbell. 2018b. Evidence aggregation for answer re-ranking in open-domain question answering. In ICLR.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. TACL.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory networks. In ICLR.

Ming Yan, Jiangnan Xia, Chen Wu, Bin Bi, Zhongzhou Zhao, Ji Zhang, Luo Si, Rui Wang, Wei Wang, and Haiqing Chen. 2018. A deep cascade model for multi-document reading comprehension. CoRR, abs/1811.11374.

Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, and Jimmy Lin. 2019. End-to-end open-domain question answering with BERTserini. arXiv preprint arXiv:1902.01718.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. CoRR, abs/1212.5701.

Victor Zhong, Caiming Xiong, Nitish Keskar, and Richard Socher. 2019. Coarse-grain fine-grain coattention network for multi-evidence question answering. In ICLR.


    A Paragraph Reader

In this section we describe in detail the reader mentioned in Section 3.2. The paragraph reader receives as input a question Q and a paragraph P and extracts the most probable answer span to Q from P. We use the shared-norm model presented by Clark and Gardner (2018), which we refer to as S-norm. The model's architecture is quite similar to the one we used for the encoder. First, we process Q and P separately to obtain their contextualized token representations, in the same manner as used in the encoder. We then pass the contextualized representations through a bidirectional attention layer similar to the one defined in the reformulation layer of the encoder, with the only difference being that the roles of the question and the paragraph are switched. As before, we further pass the bidirectional attention representations through a residual connection, this time using a self-attention layer between the bidirectional GRU and the linear layer. The self-attention mechanism is similar to the bidirectional attention layer, only now it is between the paragraph and itself. Therefore, question-to-paragraph attention is not used, and we set a_ij = −∞ if i = j. The summed outputs of the residual connection are passed to the prediction layer. The inputs to the prediction layer are passed through a bidirectional GRU followed by a linear layer that predicts the answer span start scores. The hidden layers of that GRU are concatenated with the input and passed through another bidirectional GRU and linear layer to predict the answer span end scores.

Training  An input sample for the paragraph reader consists of a question and a single context (Q, P). We optimize the same negative log-likelihood function used in the S-norm model for the span start boundaries:

L_start = -log( Σ_{j∈P_Q} Σ_{k∈A_j} e^{s_{kj}} / Σ_{j∈P_Q} Σ_{i=1}^{n_j} e^{s_{ij}} ),

where P_Q is the set of paragraphs paired with the same question Q, A_j is the set of tokens that start an answer span in the j-th paragraph, and s_{ij} is the score given to the i-th token in the j-th paragraph. The same formulation is used for the span end boundaries, so that the final objective function is the sum of the two: L_span = L_start + L_end.

B Paragraph Reader Extension for HotpotQA

HotpotQA presents the challenge of not only predicting an answer span, but also yes/no answers. This is a combination of span-based questions and multiple-choice questions. In addition, one is also required to provide explainability to the answer predictions by predicting the supporting facts leading to the answer. We extend the paragraph reader from Section 3.2 to support these predictions in the following manner.

Yes/No Prediction  We argue that one can decide whether the answer to a given question should be span-based or yes/no-based without looking at any context at all. Therefore, we first create a fixed-size vector representing the question using max-pooling over the first bidirectional GRU's states of the question. We pass this representation through a linear layer that predicts whether this is a yes/no-based question or a span-based question. If span-based, we predict the answer span from the context using the original span prediction layer. If yes/no-based, we encode the question-aware context representations to a fixed-size vector by performing max-pooling over the outputs of the residual self-attention layer. As before, we then pass this vector through a linear layer to predict a yes/no answer.

Supporting Fact Prediction  As a context's supporting facts for a question are at the sentence-level, we encode the question-aware context representations to fixed-size sentence representations by passing the outputs of the residual self-attention layer through another bidirectional GRU, followed by performing max-pooling over the sentence groups of the GRU's outputs. Each sentence representation is then passed through a multilayer perceptron with a single hidden layer equipped with ReLU activations to predict whether it is indeed a supporting fact or not.

Training  An input sample for the paragraph reader consists of a question and a single context, (Q, P). Nevertheless, as HotpotQA requires multiple paragraphs to answer a question, we define P to be the concatenation of these paragraphs.

Our objective function comprises four loss functions, corresponding to the four possible predictions of our model. For the span-based prediction we use L_span, as before. We use a similar


negative log-likelihood loss for the answer type prediction (whether the answer should be span-based or yes/no-based) and for a yes/no answer prediction:

L_type = -log( Σ_{j∈P_Q} e^{s_j^type} / Σ_{j∈P_Q} (e^{s_j^binary} + e^{s_j^span}) ),

L_yes/no = -log( Σ_{j∈P_Q} e^{s_j^yes/no} / Σ_{j∈P_Q} (e^{s_j^yes} + e^{s_j^no}) ),

where P_Q is the set of paragraphs paired with the same question Q, and e^{s_j^binary}, e^{s_j^span} and e^{s_j^type} are the likelihood scores of the j-th question-paragraph pair being a binary yes/no-based type, a span-based type, and its true type, respectively. e^{s_j^yes}, e^{s_j^no} and e^{s_j^yes/no} are the likelihood scores of the j-th question-paragraph pair having the answer 'yes', the answer 'no', and its true answer, respectively. For span-based questions, L_yes/no is defined to be zero, and vice versa.

For the supporting fact prediction, we use a binary cross-entropy loss on each sentence, L_sp. The final loss function is the sum of these four objectives:

L_hotpot = L_span + L_type + L_yes/no + L_sp.

During inference, the supporting facts prediction is taken only from the paragraph from which the answer is predicted.

Metrics  Three sets of metrics were proposed by Yang et al. (2018) to evaluate performance on the HotpotQA dataset. The first set of metrics focuses on evaluating the answer span. For this purpose the exact match (EM) and F1 metrics are used, as suggested by Rajpurkar et al. (2016). The second set of metrics focuses on the explainability of the models, by evaluating the supporting facts directly using the EM and F1 metrics on the set of supporting fact sentences. The final set of metrics combines the evaluation of answer spans and supporting facts as follows. For each example, given its precision and recall on the answer span (P^(ans), R^(ans)) and the supporting facts (P^(sup), R^(sup)), respectively, the joint F1 is calculated as

P^(joint) = P^(ans) · P^(sup),   R^(joint) = R^(ans) · R^(sup),

Joint F1 = 2 · P^(joint) · R^(joint) / (P^(joint) + R^(joint)).

The joint EM is 1 only if both tasks achieve an exact match and otherwise 0. Intuitively, these metrics penalize systems that perform poorly on either task. All metrics are evaluated example-by-example, and then averaged over examples in the evaluation set.
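A short sketch of the joint metric computation for a single example; f1 and joint_f1 are hypothetical helper names.

    # Sketch: joint F1 multiplies answer and supporting-fact precision/recall.
    def f1(p, r):
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    def joint_f1(p_ans, r_ans, p_sup, r_sup):
        p_joint = p_ans * p_sup
        r_joint = r_ans * r_sup
        return f1(p_joint, r_joint)

    # e.g., a perfect answer but only half of the supporting facts recovered
    print(joint_f1(1.0, 1.0, 1.0, 0.5))   # ~0.667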

    C Implementation Details

We use the Stanford CoreNLP toolkit (Manning et al., 2014) for tokenization. We implement all our models using TensorFlow.

Architecture Details  For the word-level embeddings, we use the GloVe 300-dimensional embeddings pretrained on the 840B Common Crawl corpus (Pennington et al., 2014). For the character-level embeddings, we use 20-dimensional character embeddings, and use a 1-dimensional CNN with 100 filters of size 5, with a dropout (Srivastava et al., 2014) rate of 0.2.

For the encoder, we also concatenate ELMo (Peters et al., 2018) embeddings with a dropout rate of 0.5 and the token representations from the output of the embedding layer to form the final token representations, before processing them through the first bidirectional GRU. We use the ELMo weights pretrained on the 5.5B dataset.[5] To speed up computations, we cache the context-independent token representations of all tokens that appear at least once in the titles of the HotpotQA Wikipedia version, or appear at least five times in the entire Wikipedia version. Words not in this vocabulary are given a fixed OOV vector. We use a learned weighted average of all three ELMo layers. Variational dropout (Gal and Ghahramani, 2016), where the same dropout mask is applied at each time step, is applied on the inputs of all recurrent layers with a dropout rate of 0.2. We set the encoding size to be d = 1024.

For the paragraph reader used for HotpotQA, we use a state size of 150 for the bidirectional GRUs. The size of the hidden layer in the MLP used for supporting fact prediction is set to 150 as well. Here again variational dropout with a dropout rate of 0.2 is applied on the inputs of all recurrent layers and attention mechanisms. The reader used for SQuAD is the shared-norm model trained on the SQuAD dataset by Clark and Gardner (2018).[6]

[5] Available at https://allennlp.org/elmo
[6] Available at https://github.com/allenai/document-qa


Training Details  We train all our models using the Adadelta optimizer (Zeiler, 2012) with a learning rate of 1.0 and ρ = 0.95.

SQuAD-Open: The training data is gathered as follows. For each question in the original SQuAD dataset, the original paragraph given as the question's context is considered as the single relevant (positive) paragraph. We gather ∼12 irrelevant (negative) paragraphs for each question in the following manner:

• The three paragraphs with the highest TF-IDF similarity to the question in the same SQuAD document as the relevant paragraph (excluding the relevant paragraph). The same method is applied to retrieve the three paragraphs most similar to the relevant paragraph.

• The two paragraphs with the highest TF-IDF similarity to the question from the set of all first paragraphs in the entire Wikipedia (excluding the relevant paragraph's article). The same method is applied to retrieve the two paragraphs most similar to the relevant paragraph.

• Two randomly sampled paragraphs from the entire Wikipedia.

Questions that contain only stop-words are dropped, as they are most likely too dependent on the original context and not suitable for open-domain QA. In each epoch, a question appears as a training sample four times; once with the relevant paragraph, and three times with randomly sampled irrelevant paragraphs.

We train with a batch size of 45, and do not use the ranking loss by setting λ = 0 in Equation (2). We limit the length of the paragraphs to 600 tokens.
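As a rough illustration of this sampling scheme (not the paper's code), the sketch below assembles ~12 negatives per question from TF-IDF neighbours of the question and of the positive paragraph, plus random paragraphs; top_k_tfidf is a hypothetical helper that approximates the DrQA bigram-hashing retriever with a plain scikit-learn TF-IDF, and the corpora are tiny stand-ins.

    # Sketch: gathering negative paragraphs for one SQuAD-Open question.
    import random
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def top_k_tfidf(query, pool, k):
        vec = TfidfVectorizer().fit(pool + [query])
        sims = cosine_similarity(vec.transform([query]), vec.transform(pool)).ravel()
        return [pool[i] for i in sims.argsort()[::-1][:k]]

    def gather_negatives(question, positive, same_doc, first_paragraphs, all_wiki):
        negatives = []
        negatives += top_k_tfidf(question, same_doc, 3)          # same document, vs. question
        negatives += top_k_tfidf(positive, same_doc, 3)          # same document, vs. positive
        negatives += top_k_tfidf(question, first_paragraphs, 2)  # whole Wikipedia, vs. question
        negatives += top_k_tfidf(positive, first_paragraphs, 2)  # whole Wikipedia, vs. positive
        negatives += random.sample(all_wiki, 2)                  # random paragraphs
        return negatives

    same_doc = ["para about topic a", "para about topic b", "para about topic c", "para about topic d"]
    firsts = ["first para x", "first para y", "first para z"]
    wiki = same_doc + firsts
    print(len(gather_negatives("what is topic a about", "para about topic a", same_doc, firsts, wiki)))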

HotpotQA: The paragraphs used for training the encoder are the gold and distractor paragraphs supplied in the original HotpotQA training set. As mentioned in Section 3.1, each training sample consists of a question and two paragraphs, (Q, P^1, P^2), where P^1 corresponds to a paragraph retrieved in the first iteration, and P^2 corresponds to a paragraph retrieved in the second iteration. For each question, we create the following sample types:

1. Gold: The two paragraphs are the two gold paragraphs of the question. Both P^1 and P^2 are considered positive.

2. First gold, second distractor: P^1 is one of the gold paragraphs and considered positive, while P^2 can be a random paragraph from the training set, the same as P^1, or one of the distractors, with probabilities 0.05, 0.1 and 0.85, respectively. P^2 is considered negative.

3. First distractor, second gold: P^1 is either one of the distractors or a random paragraph from the training set, with probabilities 0.9 and 0.1, respectively. P^2 is one of the gold paragraphs. Both P^1 and P^2 are considered negative.

4. All distractors: Both P^1 and P^2 are sampled from the question's distractors, and are considered negative.

5. Gold from another question: A gold paragraph pair taken from another question; both paragraphs are considered negative.

The use of the sample types from the above list is motivated as follows. Sample type 1 is the only one that contains purely positive examples and hence is mandatory. Sample type 2 is necessary to allow the model to learn a valuable reformulation, which does not give a relevance score based solely on the first paragraph. Sample type 3 is complementary to type 2; it allows the model to learn that a paragraph pair is irrelevant if the first paragraph is irrelevant, regardless of the second. Sample type 4 is used for random negative sampling, which is the most common case of all. Sample type 5 is used to guarantee the model does not determine relevancy solely based on the paragraph pair, but also based on the question.

In each training batch, we include three samples for each question in the batch: a single gold sample (type 1), and two samples from the other four types, with sample probabilities of 0.35, 0.35, 0.25 and 0.05, respectively.

We use a batch size of 75 (25 unique questions). We set the margin to be γ = 1 in Equation (1) and λ = 1 in Equation (2), for both prediction iterations. We limit the length of the paragraphs to 600 tokens.

HotpotQA Reader: The reader receives a question and a concatenation of a paragraph pair as input. Each training batch consists of three samples with three different paragraph pairs for each question: a single gold pair, which is the


two gold paragraphs of the question, and two randomly sampled paragraph pairs from the set of the distractors and one of the gold paragraphs of the question. We label the correct answer spans to be every text span that has an exact match with the ground truth answer, even in the distractor paragraphs. We use a batch size of 75 (25 unique questions), and limit the length of the paragraphs (before concatenation) to 600 tokens.