
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1832–1846, Vancouver, Canada, July 30 - August 4, 2017. ©2017 Association for Computational Linguistics.

https://doi.org/10.18653/v1/P17-1168

Gated-Attention Readers for Text Comprehension

Bhuwan Dhingra∗ Hanxiao Liu∗ Zhilin Yang William W. Cohen Ruslan Salakhutdinov

School of Computer Science, Carnegie Mellon University

{bdhingra,hanxiaol,zhiliny,wcohen,rsalakhu}@cs.cmu.edu

Abstract

In this paper we study the problem of answering cloze-style questions over documents. Our model, the Gated-Attention (GA) Reader1, integrates a multi-hop architecture with a novel attention mechanism, which is based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enables the reader to build query-specific representations of tokens in the document for accurate answer selection. The GA Reader obtains state-of-the-art results on three benchmarks for this task: the CNN & Daily Mail news stories and the Who Did What dataset. The effectiveness of multiplicative interaction is demonstrated by an ablation study, and by comparing to alternative compositional operators for implementing the gated-attention.

1 Introduction

A recent trend to measure progress towards machine reading is to test a system's ability to answer questions about a document it has to comprehend. Towards this end, several large-scale datasets of cloze-style questions over a context document have been introduced recently, which allow the training of supervised machine learning systems (Hermann et al., 2015; Hill et al., 2016; Onishi et al., 2016). Such datasets can be easily constructed automatically, and the unambiguous nature of their queries provides an objective benchmark to measure a system's performance at text comprehension.

∗ BD and HL contributed equally to this work.
1 Source code is available on github: https://github.com/bdhingra/ga-reader

Deep learning models have been shown to outperform traditional shallow approaches on text comprehension tasks (Hermann et al., 2015). The success of many recent models can be attributed primarily to two factors: (1) Multi-hop architectures (Weston et al., 2015; Sordoni et al., 2016; Shen et al., 2016) allow a model to scan the document and the question iteratively for multiple passes. (2) Attention mechanisms (Chen et al., 2016; Hermann et al., 2015), borrowed from the machine translation literature (Bahdanau et al., 2014), allow the model to focus on appropriate subparts of the context document. Intuitively, the multi-hop architecture allows the reader to incrementally refine token representations, and the attention mechanism re-weights different parts of the document according to their relevance to the query.

The effectiveness of multi-hop reasoning and of attention has so far been explored orthogonally in the literature. In this paper, we focus on combining both in a complementary manner, by designing a novel attention mechanism which gates the evolving token representations across hops. More specifically, unlike existing models where the query attention is applied either token-wise (Hermann et al., 2015; Kadlec et al., 2016; Chen et al., 2016; Hill et al., 2016) or sentence-wise (Weston et al., 2015; Sukhbaatar et al., 2015) to allow weighted aggregation, the Gated-Attention (GA) module proposed in this work allows the query to directly interact with each dimension of the token embeddings at the semantic level, and is applied layer-wise as an information filter during the multi-hop representation learning process. Such fine-grained attention enables our model to learn conditional token representations w.r.t. the given question, leading to accurate answer selections.

We show in our experiments that the proposed GA reader, despite its relative simplicity, consistently improves over a variety of strong baselines on three benchmark datasets. Our key contribution, the GA module, provides a significant improvement for large datasets. Qualitatively, visualization of the attentions at intermediate layers of the GA reader shows that in each layer the GA reader attends to distinct salient aspects of the query which help in determining the answer.

2 Related Work

The cloze-style QA task involves tuples of the form (d, q, a, C), where d is a document (context), q is a query over the contents of d in which a phrase is replaced with a placeholder, and a is the answer to q, which comes from a set of candidates C. In this work we consider datasets where each candidate c ∈ C has at least one token which also appears in the document. The task can then be described as: given a document-query pair (d, q), find a ∈ C which answers q. Below we provide an overview of representative neural network architectures which have been applied to this problem.

LSTMs with Attention: Several architectures introduced in Hermann et al. (2015) employ LSTM units to compute a combined document-query representation g(d, q), which is used to rank the candidate answers. These include the Deep LSTM Reader, which performs a single forward pass through the concatenated (document, query) pair to obtain g(d, q); the Attentive Reader, which first computes a document vector d(q) by a weighted aggregation of words according to attentions based on q, and then combines d(q) and q to obtain their joint representation g(d(q), q); and the Impatient Reader, where the document representation is built incrementally. The architecture of the Attentive Reader has been simplified recently in the Stanford Attentive Reader, where shallower recurrent units were used with a bilinear form for the query-document attention (Chen et al., 2016).

Attention Sum: The Attention-Sum (AS) Reader (Kadlec et al., 2016) uses two bi-directional GRU networks (Cho et al., 2015) to encode both d and q into vectors. A probability distribution over the entities in d is obtained by computing dot products between q and the entity embeddings and taking a softmax. Then, an aggregation scheme named pointer-sum attention is further applied to sum the probabilities of the same entity, so that frequent entities in the document will be favored compared to rare ones. Building on the AS Reader, the Attention-over-Attention (AoA) Reader (Cui et al., 2017) introduces a two-way attention mechanism where the query and the document are mutually attentive to each other.

Multi-hop Architectures: Memory Networks (MemNets) were proposed in Weston et al. (2015), where each sentence in the document is encoded to a memory by aggregating nearby words. Attention over the memory slots given the query is used to compute an overall memory and to renew the query representation over multiple iterations, allowing certain types of reasoning over the salient facts in the memory and the query. Neural Semantic Encoders (NSE) (Munkhdalai & Yu, 2017a) extended MemNets by introducing a write operation which can evolve the memory over time during the course of reading. Iterative reasoning has been found effective in several more recent models, including the Iterative Attentive Reader (Sordoni et al., 2016) and ReasoNet (Shen et al., 2016). The latter allows dynamic reasoning steps and is trained with reinforcement learning.

Other related works include the Dynamic Entity Representation network (DER) (Kobayashi et al., 2016), which builds dynamic representations of the candidate answers while reading the document and accumulates the information about an entity by max-pooling; EpiReader (Trischler et al., 2016), which consists of two networks, where one proposes a small set of candidate answers and the other reranks the proposed candidates conditioned on the query and the context; and the Bi-Directional Attention Flow network (BiDAF) (Seo et al., 2017), which adopts a multi-stage hierarchical architecture along with a flow-based attention mechanism. Bajgar et al. (2016) showed a 10% improvement on the CBT corpus (Hill et al., 2016) by training the AS Reader on an augmented training set of about 14 million examples, making a case for the community to exploit data abundance. The focus of this paper, however, is on designing models which exploit the available data efficiently.

3 Gated-Attention Reader

Our proposed GA readers perform multiple hops over the document (context), similar to the Memory Networks architecture (Sukhbaatar et al., 2015). Multi-hop architectures mimic the multi-step comprehension process of human readers, and have shown promising results in several recent models for text comprehension (Sordoni et al., 2016; Kumar et al., 2016; Shen et al., 2016). The contextual representations in GA readers, namely the embeddings of words in the document, are iteratively refined across hops until reaching a final attention-sum module (Kadlec et al., 2016) which maps the contextual representations in the last hop to a probability distribution over candidate answers.

The attention mechanism has been introduced recently to model human focus, leading to significant improvements in machine translation and image captioning (Bahdanau et al., 2014; Mnih et al., 2014). In reading comprehension tasks, ideally, the semantic meanings carried by the contextual embeddings should be aware of the query across hops. As an example, human readers are able to keep the question in mind during multiple passes of reading, to successively mask away information irrelevant to the query. However, existing neural network readers are restricted to either attending to tokens (Hermann et al., 2015; Chen et al., 2016) or to entire sentences (Weston et al., 2015), with the assumption that certain sub-parts of the document are more important than others. In contrast, we propose a finer-grained model which attends to components of the semantic representation being built up by the GRU. The new attention mechanism, called gated-attention, is implemented via multiplicative interactions between the query and the contextual embeddings, and is applied per hop to act as a fine-grained information filter during the multi-step reasoning. The filters weigh individual components of the vector representation of each token in the document separately.

The design of gated-attention layers is motivated by the effectiveness of multiplicative interaction among vector-space representations, e.g., in various types of recurrent units (Hochreiter & Schmidhuber, 1997; Wu et al., 2016) and in relational learning (Yang et al., 2014; Kiros et al., 2014). While other types of compositional operators are possible, such as concatenation or addition (Mitchell & Lapata, 2008), we find that multiplication has strong empirical performance (Section 4.3), where query representations naturally serve as information filters across hops.

3.1 Model Details

Several components of the model use a Gated Recurrent Unit (GRU) (Cho et al., 2015) which maps an input sequence $X = [x_1, x_2, \ldots, x_T]$ to an output sequence $H = [h_1, h_2, \ldots, h_T]$ as follows:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r),$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z),$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h(r_t \odot h_{t-1}) + b_h),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$

where $\odot$ denotes the Hadamard product, or element-wise multiplication. $r_t$ and $z_t$ are called the reset and update gates respectively, and $\tilde{h}_t$ the candidate output. A Bi-directional GRU (Bi-GRU) processes the sequence in both forward and backward directions to produce two sequences $[h_1^f, h_2^f, \ldots, h_T^f]$ and $[h_1^b, h_2^b, \ldots, h_T^b]$, which are concatenated at the output:

$$\overleftrightarrow{\mathrm{GRU}}(X) = [h_1^f \| h_T^b, \ldots, h_T^f \| h_1^b] \quad (1)$$

where $\overleftrightarrow{\mathrm{GRU}}(X)$ denotes the full output of the Bi-GRU obtained by concatenating each forward state $h_i^f$ and backward state $h_{T-i+1}^b$ at step $i$, given the input $X$. Note that $\overleftrightarrow{\mathrm{GRU}}(X)$ is a matrix in $\mathbb{R}^{2n_h \times T}$, where $n_h$ is the number of hidden units in the GRU.

Let $X^{(0)} = [x_1^{(0)}, x_2^{(0)}, \ldots, x_{|D|}^{(0)}]$ denote the token embeddings of the document, which are also the inputs at layer 1 for the document reader below, and $Y = [y_1, y_2, \ldots, y_{|Q|}]$ denote the token embeddings of the query. Here $|D|$ and $|Q|$ denote the document and query lengths respectively.
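For concreteness, the following is a minimal NumPy sketch of these recurrences and of the concatenation in equation (1). The parameter container, initialization scale, and toy dimensions are our own illustrative choices, not the authors' implementation (which used Theano/Lasagne; see Appendix A).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUParams:
    """Illustrative parameter container (not the authors' code): W_* act on the
    input x_t, U_* on the previous state h_{t-1}, b_* are biases."""
    def __init__(self, n_in, n_h, rng, scale=0.1):
        self.Wr = rng.normal(0, scale, (n_h, n_in))
        self.Wz = rng.normal(0, scale, (n_h, n_in))
        self.Wh = rng.normal(0, scale, (n_h, n_in))
        self.Ur = rng.normal(0, scale, (n_h, n_h))
        self.Uz = rng.normal(0, scale, (n_h, n_h))
        self.Uh = rng.normal(0, scale, (n_h, n_h))
        self.br = np.zeros(n_h)
        self.bz = np.zeros(n_h)
        self.bh = np.zeros(n_h)

def gru_step(p, x_t, h_prev):
    """One GRU step: reset gate r_t, update gate z_t, candidate state, output h_t."""
    r_t = sigmoid(p.Wr @ x_t + p.Ur @ h_prev + p.br)
    z_t = sigmoid(p.Wz @ x_t + p.Uz @ h_prev + p.bz)
    h_tilde = np.tanh(p.Wh @ x_t + p.Uh @ (r_t * h_prev) + p.bh)
    return (1.0 - z_t) * h_prev + z_t * h_tilde

def gru(p, X):
    """Run a GRU over a sequence X of shape (T, n_in); returns states of shape (T, n_h)."""
    h = np.zeros(p.br.shape[0])
    states = []
    for x_t in X:
        h = gru_step(p, x_t, h)
        states.append(h)
    return np.stack(states)

def bi_gru(p_fwd, p_bwd, X):
    """Full Bi-GRU output of equation (1): the forward state at position i is
    concatenated with the backward state computed after reading from T down to i,
    giving a matrix of shape (2 * n_h, T)."""
    h_f = gru(p_fwd, X)              # forward pass, aligned to positions 1..T
    h_b = gru(p_bwd, X[::-1])[::-1]  # backward pass, re-aligned to input order
    return np.concatenate([h_f, h_b], axis=1).T

# Toy usage with random embeddings.
rng = np.random.default_rng(0)
T, n_in, n_h = 5, 8, 4
X = rng.normal(size=(T, n_in))
out = bi_gru(GRUParams(n_in, n_h, rng), GRUParams(n_in, n_h, rng), X)
print(out.shape)  # (2 * n_h, T), matching the 2 n_h x T shape stated above
```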

3.1.1 Multi-Hop Architecture

Fig. 1 illustrates the Gated-Attention (GA) reader. The model reads the document and the query over $K$ horizontal layers, where layer $k$ receives the contextual embeddings $X^{(k-1)}$ of the document from the previous layer. The document embeddings are transformed by taking the full output of a document Bi-GRU (indicated in blue in Fig. 1):

$$D^{(k)} = \overleftrightarrow{\mathrm{GRU}}_D^{(k)}(X^{(k-1)}) \quad (2)$$

At the same time, a layer-specific query representation is computed as the full output of a separate query Bi-GRU (indicated in green in Figure 1):

$$Q^{(k)} = \overleftrightarrow{\mathrm{GRU}}_Q^{(k)}(Y) \quad (3)$$

Next, Gated-Attention is applied to $D^{(k)}$ and $Q^{(k)}$ to compute the inputs for the next layer, $X^{(k)}$:

$$X^{(k)} = \mathrm{GA}(D^{(k)}, Q^{(k)}) \quad (4)$$

where GA is defined in the following subsection.
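The layer structure of equations (2)-(4) can be summarized by a short loop. This is a structural sketch only: `bi_gru` is the Bi-GRU helper sketched earlier, and `gated_attention` stands for the GA module defined in the next subsection (a NumPy sketch of it follows equation (9)).

```python
def multi_hop_reader(X0, Y, doc_params, query_params, K, bi_gru, gated_attention):
    """Structural sketch of the K-layer loop in equations (2)-(4).

    X0:              document token embeddings X^(0), one row per token
    Y:               query token embeddings, one row per token
    doc_params:      K pairs of (forward, backward) parameters for the document Bi-GRUs
    query_params:    K pairs of (forward, backward) parameters for the query Bi-GRUs
    bi_gru:          the Bi-GRU helper sketched above, returning a (2*n_h, T) matrix
    gated_attention: the GA module of Section 3.1.2
    """
    X = X0
    for k in range(K):
        D_k = bi_gru(*doc_params[k], X)    # equation (2): document representation at layer k
        Q_k = bi_gru(*query_params[k], Y)  # equation (3): layer-specific query representation
        if k < K - 1:
            # Equation (4): gate each document token with the query; the gated vectors
            # (transposed back to one row per token) are the inputs to the next layer.
            X = gated_attention(D_k, Q_k).T
    # The final-layer outputs D^(K) and Q^(K) feed the answer-prediction step (Section 3.1.3).
    return D_k, Q_k
```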


Figure 1: Gated-Attention Reader. Dashed lines represent dropout connections.

3.1.2 Gated-Attention Module

For brevity, let us drop the superscript $k$ in this subsection, as we are focusing on a particular layer. For each token $d_i$ in $D$, the GA module forms a token-specific representation of the query $q_i$ using soft attention, and then multiplies the query representation element-wise with the document token representation. Specifically, for $i = 1, \ldots, |D|$:

$$\alpha_i = \mathrm{softmax}(Q^\top d_i) \quad (5)$$
$$q_i = Q \alpha_i$$
$$x_i = d_i \odot q_i \quad (6)$$

In equation (6) we use the multiplication operator to model the interactions between $d_i$ and $q_i$. In the experiments section, we also report results for other choices of gating functions, including addition $x_i = d_i + q_i$ and concatenation $x_i = d_i \| q_i$.

3.1.3 Answer Prediction

Let $q_\ell^{(K)} = q_\ell^f \| q_{T-\ell+1}^b$ be an intermediate output of the final layer query Bi-GRU at the location $\ell$ of the cloze token in the query, and $D^{(K)} = \overleftrightarrow{\mathrm{GRU}}_D^{(K)}(X^{(K-1)})$ be the full output of the final layer document Bi-GRU. To obtain the probability that a particular token in the document answers the query, we take an inner product between these two and pass it through a softmax layer:

$$s = \mathrm{softmax}((q_\ell^{(K)})^\top D^{(K)}) \quad (7)$$

where the vector $s$ defines a probability distribution over the $|D|$ tokens in the document. The probability of a particular candidate $c \in C$ as being the answer is then computed by aggregating the probabilities of all document tokens which appear in $c$ and renormalizing over the candidates:

$$\Pr(c \mid d, q) \propto \sum_{i \in I(c,d)} s_i \quad (8)$$

where $I(c, d)$ is the set of positions where a token in $c$ appears in the document $d$. This aggregation operation is the same as the pointer-sum attention applied in the AS Reader (Kadlec et al., 2016).

Finally, the candidate with maximum probability is selected as the predicted answer:

$$a^* = \operatorname{argmax}_{c \in C} \Pr(c \mid d, q). \quad (9)$$

During the training phase, model parameters of GA are updated w.r.t. a cross-entropy loss between the predicted probabilities and the true answers.
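The following NumPy sketch illustrates equations (5)-(9). The candidate bookkeeping (a mapping from each candidate to its token positions, i.e., the set $I(c, d)$) and the toy inputs are our own illustrative conventions rather than the authors' code.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def gated_attention(D, Q):
    """GA module: D is (2*n_h, |D|), Q is (2*n_h, |Q|).
    For each document token d_i, form a token-specific query summary q_i by soft
    attention over the query (eq. 5), then gate d_i element-wise with it (eq. 6)."""
    X = np.zeros_like(D)
    for i in range(D.shape[1]):
        d_i = D[:, i]
        alpha_i = softmax(Q.T @ d_i)   # eq. (5): attention over query tokens
        q_i = Q @ alpha_i              # token-specific query representation
        X[:, i] = d_i * q_i            # eq. (6): multiplicative gating
        # Alternatives studied in Section 4.3: d_i + q_i (sum) or
        # np.concatenate([d_i, q_i]) (concatenation).
    return X

def answer_probabilities(D_K, q_cloze, candidates):
    """Pointer-sum answer scoring.
    D_K:        (2*n_h, |D|) final-layer document representation
    q_cloze:    (2*n_h,) final-layer query state at the cloze position
    candidates: dict mapping each candidate to the list of document positions
                where one of its tokens appears (the set I(c, d) in eq. 8)."""
    s = softmax(D_K.T @ q_cloze)                                 # eq. (7): distribution over tokens
    scores = {c: s[pos].sum() for c, pos in candidates.items()}  # eq. (8) numerator
    Z = sum(scores.values())
    return {c: v / Z for c, v in scores.items()}                 # renormalize over candidates

# Toy usage with random representations and two hypothetical candidates.
rng = np.random.default_rng(1)
D_K = rng.normal(size=(8, 20))
q_cloze = rng.normal(size=8)
probs = answer_probabilities(D_K, q_cloze, {"obama": [3, 11], "clinton": [7]})
print(max(probs, key=probs.get))   # eq. (9): argmax over candidates
```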

3.1.4 Further Enhancements

Character-level Embeddings: Given a token $w$ from the document or query, its vector space representation is computed as $x = L(w) \| C(w)$. $L(w)$ retrieves the word embedding for $w$ from a lookup table $L \in \mathbb{R}^{|V| \times n_l}$, whose rows hold a vector for each unique token in the vocabulary. We also utilize a character composition model $C(w)$ which generates an orthographic embedding of the token. Such embeddings have previously been shown to be helpful for tasks like Named Entity Recognition (Yang et al., 2016) and for dealing with OOV tokens at test time (Dhingra et al., 2016). The embedding $C(w)$ is generated by taking the final outputs $z_{n_c}^f$ and $z_{n_c}^b$ of a Bi-GRU applied to embeddings from a lookup table of characters in the token, and applying a linear transformation:

$$z = z_{n_c}^f \| z_{n_c}^b$$
$$C(w) = W z + b$$

Question Evidence Common Word Feature (qe-comm): Li et al. (2016) recently proposed a simple token-level indicator feature which significantly boosts reading comprehension performance in some cases. For each token in the document we construct a one-hot vector $f_i \in \{0, 1\}^2$ indicating its presence in the query. It can be incorporated into the GA Reader by assigning a feature lookup table $F \in \mathbb{R}^{n_F \times 2}$ (we use $n_F = 2$), taking the feature embedding $e_i = f_i^\top F$ and appending it to the inputs of the last layer document Bi-GRU as $x_i^{(K)} \| f_i$ for all $i$. We conducted several experiments both with and without this feature and observed some interesting trends, which are discussed below. Henceforth, we refer to this feature as the qe-comm feature, or just feature.
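The qe-comm indicator itself is straightforward to construct. The sketch below is our own illustration, assuming whitespace-tokenized text and one particular ordering of the two one-hot positions.

```python
import numpy as np

def qe_comm_features(doc_tokens, query_tokens):
    """For each document token, a one-hot vector in {0,1}^2 indicating presence
    in the query: index 1 fires if the token appears in the query, index 0 otherwise."""
    query_vocab = set(query_tokens)
    F = np.zeros((len(doc_tokens), 2), dtype=np.float32)
    for i, tok in enumerate(doc_tokens):
        F[i, int(tok in query_vocab)] = 1.0
    return F

doc = "the president signed the new financial regulations".split()
query = "XXX signed the regulations".split()
print(qe_comm_features(doc, query))
```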

4 Experiments and Results

4.1 Datasets

We evaluate the GA reader on five large-scale datasets recently proposed in the literature. The first two, CNN and Daily Mail news stories2, consist of articles from the popular CNN and Daily Mail websites (Hermann et al., 2015). A query over each article is formed by removing an entity from the short summary which follows the article. Further, entities within each article were anonymized to make the task purely a comprehension one. N-gram statistics, for instance, computed over the entire corpus are no longer useful in such an anonymized corpus.

The next two datasets are formed from two different subsets of the Children's Book Test (CBT)3 (Hill et al., 2016). Documents consist of 20 contiguous sentences from the body of a popular children's book, and queries are formed by deleting a token from the 21st sentence. We only focus on subsets where the deleted token is either a common noun (CN) or a named entity (NE), since simple language models already give human-level performance on the other types (cf. Hill et al., 2016).

2https://github.com/deepmind/rc-data

3http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz

The final dataset is Who Did What4 (WDW) (Onishi et al., 2016), constructed from the LDC English Gigaword newswire corpus. First, article pairs which appeared around the same time and with overlapping entities are chosen, and then one article forms the document and a cloze query is constructed from the other. Missing tokens are always person named entities. Questions which are easily answered by simple baselines are filtered out, to make the task more challenging. There are two versions of the training set: a small but focused "Strict" version and a large but noisy "Relaxed" version. We report results on both settings, which share the same validation and test sets. Statistics of all the datasets used in our experiments are summarized in the Appendix (Table 5).

4.2 Performance Comparison

Tables 1 and 3 show a comparison of the performance of the GA Reader with previously published results on the WDW and the CNN, Daily Mail, and CBT datasets respectively. The numbers reported for the GA Reader are for single best models, though we compare to both ensembles and single models from prior work. GA Reader-- refers to an earlier version of the model, unpublished but described in a preprint, with the following differences: (1) it does not utilize token-specific attentions within the GA module, as described in equation (5), (2) it does not use a character composition model, (3) it is initialized with word embeddings pretrained on the corpus itself rather than GloVe. A detailed analysis of these differences is studied in the next section. Here we present 4 variants of the latest GA Reader, using combinations of whether the qe-comm feature is used (+feature) or not, and whether the word lookup table L(w) is updated during training or fixed to its initial value. Other hyperparameters are listed in Appendix A.

Interestingly, we observe that feature engineering leads to significant improvements for the WDW and CBT datasets, but not for the CNN and Daily Mail datasets. We note that anonymization of the latter datasets means that there is already some feature engineering (it adds hints about whether a token is an entity), and that these datasets are much larger than the other four. In machine learning it is common to see the effect of feature engineering diminish with increasing data size. Similarly, fixing the word embeddings provides an improvement for the WDW and CBT datasets, but not for CNN and Daily Mail.

4https://tticnlp.github.io/who_did_what/


Table 1: Validation/Test accuracy (%) on the WDW dataset for both "Strict" and "Relaxed" settings. Results with "†" are from previously published works.

Model                       | Strict Val | Strict Test | Relaxed Val | Relaxed Test
----------------------------|------------|-------------|-------------|-------------
Human †                     | –          | 84          | –           | –
Attentive Reader †          | –          | 53          | –           | 55
AS Reader †                 | –          | 57          | –           | 59
Stanford AR †               | –          | 64          | –           | 65
NSE †                       | 66.5       | 66.2        | 67.0        | 66.7
GA-- †                      | –          | 57          | –           | 60.0
GA (update L(w))            | 67.8       | 67.0        | 67.0        | 66.6
GA (fix L(w))               | 68.3       | 68.0        | 69.6        | 69.1
GA (+feature, update L(w))  | 70.1       | 69.5        | 70.9        | 71.0
GA (+feature, fix L(w))     | 71.6       | 71.2        | 72.6        | 72.6

Table 2: Top: Performance of different gating functions. Bottom: Effect of varying the number of hops K. Results on WDW without using the qe-comm feature and with fixed L(w).

Gating Function | Val  | Test
----------------|------|-----
Sum             | 64.9 | 64.5
Concatenate     | 64.4 | 63.7
Multiply        | 68.3 | 68.0

K        | Val  | Test
---------|------|-----
1 (AS) † | –    | 57
2        | 65.6 | 65.6
3        | 68.3 | 68.0
4        | 68.3 | 68.2

This is not surprising given that the latter datasets are larger and less prone to overfitting.

Comparing with prior work, on the WDW dataset the basic version of the GA Reader outperforms all previously published models when trained on the Strict setting. By adding the qe-comm feature the performance increases by 3.2% and 3.5% on the Strict and Relaxed settings respectively, setting a new state of the art on this dataset. On the CNN and Daily Mail datasets the GA Reader leads to an improvement of 3.2% and 4.3% respectively over the best previous single models. They also outperform previous ensemble models, setting a new state of the art for both datasets. For CBT-NE, the GA Reader with the qe-comm feature outperforms all previous single and ensemble models except the AS Reader trained on the much larger BookTest Corpus (Bajgar et al., 2016). Lastly, on CBT-CN the GA Reader with the qe-comm feature outperforms all previously published single models except the NSE and the AS Reader trained on a larger corpus. For each of the 4 datasets on which GA achieves the top performance, we conducted one-sample proportion tests to test whether GA is significantly better than the second-best baseline. The p-values are 0.319 for CNN, <0.00001 for Daily Mail, 0.028 for CBT-NE, and <0.00001 for WDW. In other words, GA statistically significantly outperforms all other baselines on 3 out of those 4 datasets at the 5% significance level. The results could be even more significant under paired tests; however, we did not have access to the predictions from the baselines.
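For reference, a one-sample proportion test of this kind needs only the two accuracies and the test-set size. The sketch below uses a one-sided normal approximation with hypothetical inputs; the authors' exact test variant and resulting p-values may differ.

```python
import math

def one_sample_proportion_test(acc_model, acc_baseline, n):
    """One-sided z-test: is the model's accuracy (a proportion over n test
    examples) significantly larger than the baseline accuracy, treated as
    the null-hypothesis proportion p0?"""
    p0 = acc_baseline
    se = math.sqrt(p0 * (1.0 - p0) / n)          # standard error under H0
    z = (acc_model - p0) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided upper-tail p-value
    return z, p_value

# Hypothetical inputs: 71.2% vs. 66.2% accuracy on a 10,000-example test set
# (the WDW test size from Table 5); the reported p-value in the text may differ
# depending on the exact test variant used.
print(one_sample_proportion_test(0.712, 0.662, 10000))
```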

4.3 GA Reader Analysis

In this section we do an ablation study to see the effect of Gated-Attention. We compare the GA Reader as described here to a model which is exactly the same in all aspects, except that it passes the document embeddings D(k) in each layer directly to the inputs of the next layer without using the GA module. In other words, X(k) = D(k) for all k > 0. This model ends up using only one query GRU at the output layer for selecting the answer from the document. We compare these two variants both with and without the qe-comm feature on the CNN and WDW datasets, for three subsets of the training data: 50%, 75% and 100%. Test set accuracies for these settings are shown in Figure 2. On CNN, when tested without feature engineering, we observe that GA provides a significant boost in performance compared to the variant without GA. When tested with the feature it still gives an improvement, but the improvement is significant only with 100% training data. On WDW-Strict, which is a third of the size of CNN, without the feature we see an improvement when using GA versus not using it, which becomes significant as the training set size increases. When tested with the feature on WDW, for a small data size the variant without GA does better than the one with GA, but as the dataset size increases they become equivalent. We conclude that GA provides a boost in the absence of feature engineering, or as the training set size increases.

Next we look at the question of how to gate intermediate document reader states from the query, i.e., what operation to use in equation (6).


Table 3: Validation/Test accuracy (%) on CNN, Daily Mail and CBT. Results marked with "†" are from previously published works. Results marked with "‡" were obtained by training on a larger training set. Best performance on standard training sets is in bold, and on larger training sets in italics.

Model                                   | CNN (Val / Test) | Daily Mail (Val / Test) | CBT-NE (Val / Test) | CBT-CN (Val / Test)
----------------------------------------|------------------|-------------------------|---------------------|--------------------
Humans (query) †                        | – / –            | – / –                   | – / 52.0            | – / 64.4
Humans (context + query) †              | – / –            | – / –                   | – / 81.6            | – / 81.6
LSTMs (context + query) †               | – / –            | – / –                   | 51.2 / 41.8         | 62.6 / 56.0
Deep LSTM Reader †                      | 55.0 / 57.0      | 63.3 / 62.2             | – / –               | – / –
Attentive Reader †                      | 61.6 / 63.0      | 70.5 / 69.0             | – / –               | – / –
Impatient Reader †                      | 61.8 / 63.8      | 69.0 / 68.0             | – / –               | – / –
MemNets †                               | 63.4 / 66.8      | – / –                   | 70.4 / 66.6         | 64.2 / 63.0
AS Reader †                             | 68.6 / 69.5      | 75.0 / 73.9             | 73.8 / 68.6         | 68.8 / 63.4
DER Network †                           | 71.3 / 72.9      | – / –                   | – / –               | – / –
Stanford AR (relabeling) †              | 73.8 / 73.6      | 77.6 / 76.6             | – / –               | – / –
Iterative Attentive Reader †            | 72.6 / 73.3      | – / –                   | 75.2 / 68.6         | 72.1 / 69.2
EpiReader †                             | 73.4 / 74.0      | – / –                   | 75.3 / 69.7         | 71.5 / 67.4
AoA Reader †                            | 73.1 / 74.4      | – / –                   | 77.8 / 72.0         | 72.2 / 69.4
ReasoNet †                              | 72.9 / 74.7      | 77.6 / 76.6             | – / –               | – / –
NSE †                                   | – / –            | – / –                   | 78.2 / 73.2         | 74.3 / 71.9
BiDAF †                                 | 76.3 / 76.9      | 80.3 / 79.6             | – / –               | – / –
MemNets (ensemble) †                    | 66.2 / 69.4      | – / –                   | – / –               | – / –
AS Reader (ensemble) †                  | 73.9 / 75.4      | 78.7 / 77.7             | 76.2 / 71.0         | 71.1 / 68.9
Stanford AR (relabeling, ensemble) †    | 77.2 / 77.6      | 80.2 / 79.2             | – / –               | – / –
Iterative Attentive Reader (ensemble) † | 75.2 / 76.1      | – / –                   | 76.9 / 72.0         | 74.1 / 71.0
EpiReader (ensemble) †                  | – / –            | – / –                   | 76.6 / 71.8         | 73.6 / 70.6
AS Reader (+BookTest) † ‡               | – / –            | – / –                   | 80.5 / 76.2         | 83.2 / 80.8
AS Reader (+BookTest, ensemble) † ‡     | – / –            | – / –                   | 82.3 / 78.4         | 85.7 / 83.7
GA--                                    | 73.0 / 73.8      | 76.7 / 75.7             | 74.9 / 69.0         | 69.0 / 63.9
GA (update L(w))                        | 77.9 / 77.9      | 81.5 / 80.9             | 76.7 / 70.1         | 69.8 / 67.3
GA (fix L(w))                           | 77.9 / 77.8      | 80.4 / 79.6             | 77.2 / 71.4         | 71.6 / 68.0
GA (+feature, update L(w))              | 77.3 / 76.9      | 80.7 / 80.0             | 77.2 / 73.3         | 73.0 / 69.8
GA (+feature, fix L(w))                 | 76.7 / 77.4      | 80.0 / 79.3             | 78.5 / 74.9         | 74.4 / 70.7

Table 2 (top) shows the performance on the WDW dataset for three common choices: sum (x = d + q), concatenate (x = d ‖ q) and multiply (x = d ⊙ q). Empirically we find that element-wise multiplication does significantly better than the other two, which justifies our motivation to "filter" out document features which are irrelevant to the query.

At the bottom of Table 2 we show the effect of varying the number of hops K of the GA Reader on the final performance. We note that for K = 1, our model is equivalent to the AS Reader without any GA modules. We see a steep and steady rise in accuracy as the number of hops is increased from K = 1 to 3, beyond which it remains constant. This is a common trend in machine learning as model complexity is increased; however, we note that a multi-hop architecture is important for achieving high performance on this task, and we provide further evidence for this in the next section.

4.4 Ablation Study for Model Components

Table 4 shows accuracy on WDW when removing one component at a time. The steepest reduction is observed when we replace pretrained GloVe vectors with those pretrained on the corpus itself. GloVe vectors were trained on a large corpus of about 6 billion tokens (Pennington et al., 2014), and provide an important source of prior knowledge for the model.


Figure 2: Performance in accuracy with and without the Gated-Attention module over different training sizes. p-values for an exact one-sided McNemar's test are given inside the parentheses for each setting.

[Figure 2 contains four bar-chart panels comparing "No Gating" and "With Gating" test accuracy at 50%, 75% and 100% of the training data: CNN (w/o qe-comm feature), p-values <0.01, <0.01, <0.01; CNN (w qe-comm feature), p-values 0.07, 0.13, <0.01; WDW (w/o qe-comm feature), p-values 0.28, <0.01, <0.01; WDW (w qe-comm feature), p-values <0.01, 0.42, 0.27.]

Table 4: Ablation study on the WDW dataset, without using the qe-comm feature and with fixed L(w). Results marked with "†" are from Onishi et al. (2016).

Model                      | Val  | Test
---------------------------|------|-----
GA                         | 68.3 | 68.0
−char                      | 66.9 | 66.9
−token-attentions (eq. 5)  | 65.7 | 65.0
−glove, +corpus            | 64.0 | 62.5
GA-- †                     | –    | 57

Note that the strongest baseline on WDW, NSE (Munkhdalai & Yu, 2017b), also uses pretrained GloVe vectors, hence the comparison is fair in that respect. Next, we observe a substantial drop when removing token-specific attentions over the query in the GA module, which allow gating individual tokens in the document only by the parts of the query relevant to that token rather than by the overall query representation. Finally, removing the character embeddings, which were only used for WDW and CBT, leads to a reduction of about 1% in performance.

4.5 Attention Visualization

To gain insight into the reading process employed by the model we analyzed the attention distributions at intermediate layers of the reader. Figure 3 shows an example from the validation set of the WDW dataset (several more are in the Appendix). In each figure, the left and middle plots visualize attention over the query (equation 5) for candidates in the document after layers 1 and 2 respectively. The right plot shows the attention of the cloze placeholder (XXX) in the query over candidates in the document at the final layer. The full document, query and correct answer are shown at the bottom.

A generic pattern observed in these examples is that in intermediate layers, candidates in the document (shown along the rows) tend to pick out salient tokens in the query which provide clues about the cloze, and in the final layer the candidate with the highest match with these tokens is selected as the answer. In Figure 3 there is high attention of the correct answer on financial regulatory standards in the first layer, and on us president in the second layer. The incorrect answer, in contrast, only attends to one of these aspects, and hence receives a lower score in the final layer despite the n-gram overlap it has with the cloze token in the query. Importantly, different layers tend to focus on different tokens in the query, supporting the hypothesis that the multi-hop architecture of the GA Reader is able to combine distinct pieces of information to answer the query.

5 Conclusion

We presented the Gated-Attention reader for answering cloze-style questions over documents. The GA reader features a novel multiplicative gating mechanism, combined with a multi-hop architecture. Our model achieves state-of-the-art performance on several large-scale benchmark datasets, with improvements of more than 4% over competitive baselines. Our model design is backed up by an ablation study showing statistically significant improvements from using Gated-Attention as information filters.


Figure 3: Layer-wise attention visualization of GA Reader trained on WDW-Strict. See text for details.

We also showed empirically that multiplicative gating is superior to the addition and concatenation operations for implementing gated-attention, though a theoretical justification remains part of future research goals. Analysis of document and query attentions in intermediate layers of the reader further reveals that the model iteratively attends to different aspects of the query to arrive at the final answer. In this paper we have focused on text comprehension, but we believe that the Gated-Attention mechanism may benefit other tasks as well where multiple sources of information interact.

Acknowledgments

This work was funded by NSF under CCF-1414030 and by Google Research.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Ondrej Bajgar, Rudolf Kadlec, and Jan Kleindienst. Embracing data abundance: BookTest dataset for reading comprehension. arXiv preprint arXiv:1610.00956, 2016.

Danqi Chen, Jason Bolton, and Christopher D. Manning. A thorough examination of the CNN/Daily Mail reading comprehension task. ACL, 2016.

Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. ACL, 2015.

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, and Guoping Hu. Attention-over-attention neural networks for reading comprehension. ACL, 2017.

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. Tweet2Vec: Character-based distributed representations for social media. ACL, 2016.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1684–1692, 2015.

Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. The Goldilocks principle: Reading children's books with explicit memory representations. ICLR, 2016.

Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Rudolf Kadlec, Martin Schmid, Ondrej Bajgar, and Jan Kleindienst. Text understanding with the attention sum reader network. ACL, 2016.

Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.

Ryan Kiros, Richard Zemel, and Ruslan R. Salakhutdinov. A multiplicative model for learning distributed text-based attribute representations. In Advances in Neural Information Processing Systems, pp. 2348–2356, 2014.

Sosuke Kobayashi, Ran Tian, Naoaki Okazaki, and Kentaro Inui. Dynamic entity representations with max-pooling improves machine reading. In NAACL-HLT, 2016.

Ankit Kumar, Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce, Peter Ondruska, Ishaan Gulrajani, and Richard Socher. Ask me anything: Dynamic memory networks for natural language processing. ICML, 2016.

Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, and Wei Xu. Dataset and neural recurrent sequence labeling model for open-domain factoid question answering. arXiv preprint arXiv:1607.06275, 2016.

Jeff Mitchell and Mirella Lapata. Vector-based models of semantic composition. In ACL, pp. 236–244, 2008.

Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pp. 2204–2212, 2014.

Tsendsuren Munkhdalai and Hong Yu. Neural semantic encoders. EACL, 2017a.

Tsendsuren Munkhdalai and Hong Yu. Reasoning with memory augmented neural networks for language comprehension. ICLR, 2017b.

Takeshi Onishi, Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. Who Did What: A large-scale person-centered cloze dataset. EMNLP, 2016.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. ICML (3), 28:1310–1318, 2013.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014. URL http://www.aclweb.org/anthology/D14-1162.

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. ICLR, 2017.

Yelong Shen, Po-Sen Huang, Jianfeng Gao, and Weizhu Chen. ReasoNet: Learning to stop reading in machine comprehension. arXiv preprint arXiv:1609.05284, 2016.

Alessandro Sordoni, Phillip Bachman, and Yoshua Bengio. Iterative alternating neural attention for machine reading. arXiv preprint arXiv:1606.02245, 2016.

Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in Neural Information Processing Systems, pp. 2431–2439, 2015.

Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.

Adam Trischler, Zheng Ye, Xingdi Yuan, and Kaheer Suleman. Natural language comprehension with the EpiReader. EMNLP, 2016.

Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. ICLR, 2015.

Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan Salakhutdinov. On multiplicative integration with recurrent neural networks. Advances in Neural Information Processing Systems, 2016.

Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Learning multi-relational semantics using neural-embedding models. NIPS Workshop on Learning Semantics, 2014.

Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270, 2016.

A Implementation Details

Our model was implemented using the Theano (Theano Development Team, 2016) and Lasagne5 Python libraries. We used stochastic gradient descent with ADAM updates for optimization, which combines classical momentum and adaptive gradients (Kingma & Ba, 2015). The batch size was 32 and the initial learning rate was 5 × 10^{-4}, which was halved every epoch after the second epoch. The same setting is applied to all models and datasets. We also used gradient clipping with a threshold of 10 to stabilize GRU training (Pascanu et al., 2013). We set the number of layers K to 3 for all experiments. The number of hidden units for the character GRU was set to 50. The remaining two hyperparameters, the size of the document and query GRUs and the dropout rate, were tuned on the validation set, and their optimal values are shown in Table 6. In general, the optimal GRU size increases and the dropout rate decreases as the corpus size increases.
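The schedule and clipping described above can be sketched as follows; this is our own illustration of the stated settings, not the authors' Theano training code.

```python
import numpy as np

def learning_rate(epoch, base_lr=5e-4):
    """Initial rate 5e-4, halved every epoch after the second epoch
    (epochs are numbered from 1 here)."""
    return base_lr if epoch <= 2 else base_lr * 0.5 ** (epoch - 2)

def clip_gradient(grad, threshold=10.0):
    """Rescale the gradient if its L2 norm exceeds the clipping threshold
    (norm clipping in the style of Pascanu et al., 2013)."""
    norm = np.linalg.norm(grad)
    return grad if norm <= threshold else grad * (threshold / norm)

print([round(learning_rate(e), 6) for e in range(1, 6)])
print(np.linalg.norm(clip_gradient(np.ones(400))))  # a 400-dim all-ones gradient has norm 20, clipped to 10
```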

The word lookup table was initialized with 100d GloVe vectors6 (Pennington et al., 2014), and OOV tokens at test time were assigned unique random vectors. We empirically observed that initializing with pre-trained embeddings gives higher performance compared to random initialization for all datasets.

5https://lasagne.readthedocs.io/en/latest/

6http://nlp.stanford.edu/projects/glove/


Table 5: Dataset statistics.

               | CNN     | Daily Mail | CBT-NE  | CBT-CN  | WDW-Strict | WDW-Relaxed
---------------|---------|------------|---------|---------|------------|------------
# train        | 380,298 | 879,450    | 108,719 | 120,769 | 127,786    | 185,978
# validation   | 3,924   | 64,835     | 2,000   | 2,000   | 10,000     | 10,000
# test         | 3,198   | 53,182     | 2,500   | 2,500   | 10,000     | 10,000
# vocab        | 118,497 | 208,045    | 53,063  | 53,185  | 347,406    | 308,602
max doc length | 2,000   | 2,000      | 1,338   | 1,338   | 3,085      | 3,085

Table 6: Hyperparameter settings for each dataset. dim() indicates hidden state size of GRU.

Hyperparameter | CNN | Daily Mail | CBT-NE | CBT-CN | WDW-Strict | WDW-Relaxed
---------------|-----|------------|--------|--------|------------|------------
Dropout        | 0.2 | 0.1        | 0.4    | 0.4    | 0.3        | 0.3
dim(GRU)       | 256 | 256        | 128    | 128    | 128        | 128

Furthermore, for smaller datasets (WDW and CBT) we found that fixing these embeddings to their pretrained values led to higher test performance, possibly since it avoids overfitting. We do not use the character composition model for CNN and Daily Mail, since their entities (and hence candidate answers) are anonymized to generic tokens. For the other datasets the character lookup table was randomly initialized with 25d vectors. All other parameters were initialized to their default values as specified in the Lasagne library.

B Attention Plots


Figure 4: Layer-wise attention visualization of GA Reader trained on WDW-Strict. See text for details.


Figure 5: Layer-wise attention visualization of GA Reader trained on WDW-Strict. See text for details.


Figure 6: Layer-wise attention visualization of GA Reader trained on WDW-Strict. See text for details.


Figure 7: Layer-wise attention visualization of GA Reader trained on WDW-Strict. See text for details.
