Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5415–5428July 5 - 10, 2020. c©2020 Association for Computational Linguistics
PeTra: A Sparsely Supervised Memory Model for People Tracking
Shubham Toshniwal1, Allyson Ettinger2, Kevin Gimpel1, Karen Livescu1
1 Toyota Technological Institute at Chicago  2 Department of Linguistics, University of Chicago
{shtoshni, kgimpel, klivescu}@ttic.edu, [email protected]
Abstract
We propose PeTra, a memory-augmented neural network designed to track entities in its memory slots. PeTra is trained using sparse annotation from the GAP pronoun resolution dataset and outperforms a prior memory model on the task while using a simpler architecture. We empirically compare key modeling choices, finding that we can simplify several aspects of the design of the memory module while retaining strong performance. To measure the people tracking capability of memory models, we (a) propose a new diagnostic evaluation based on counting the number of unique entities in text, and (b) conduct a small scale human evaluation to compare evidence of people tracking in the memory logs of PeTra relative to a previous approach. PeTra is highly effective in both evaluations, demonstrating its ability to track people in its memory despite being trained with limited annotation.
1 Introduction
Understanding text narratives requires maintaining and resolving entity references over arbitrary-length spans. Current approaches for coreference resolution (Clark and Manning, 2016b; Lee et al., 2017, 2018; Wu et al., 2019) scale quadratically (without heuristics) with the length of the text, and hence are impractical for long narratives. These models are also cognitively implausible, lacking the incrementality of human language processing (Tanenhaus et al., 1995; Keller, 2010). Memory models with finite memory and online/quasi-online entity resolution have linear runtime complexity, offering more scalability, cognitive plausibility, and interpretability.
Memory models can be viewed as general problem solvers with external memory mimicking a Turing tape (Graves et al., 2014, 2016). Some of the earliest applications of memory networks in language understanding were for question answering, where the external memory simply stored all of the word/sentence embeddings for a document (Sukhbaatar et al., 2015; Kumar et al., 2016). To endow more structure and interpretability to memory, key-value memory networks were introduced by Miller et al. (2016). The key-value architecture has since been used for narrative understanding and other tasks where the memory is intended to learn to track entities while being guided by varying degrees of supervision (Henaff et al., 2017; Liu et al., 2018a,b, 2019a).
We propose a new memory model, PeTra, for entity tracking and coreference resolution, inspired by the recent Referential Reader model (Liu et al., 2019a) but substantially simpler. Experiments on the GAP (Webster et al., 2018) pronoun resolution task show that PeTra outperforms the Referential Reader with fewer parameters and a simpler architecture. Importantly, while Referential Reader performance degrades with larger memory, PeTra improves with increases in memory capacity (before saturation), which should enable tracking of a larger number of entities. We conduct experiments to assess various memory architecture decisions, such as learning of memory initialization and separation of memory slots into key/value pairs.
To test the interpretability of memory models' entity tracking, we propose a new diagnostic evaluation based on entity counting, a task that the models are not explicitly trained for, using a small amount of annotated data. Additionally, we conduct a small scale human evaluation to assess the quality of people tracking based on model memory logs. PeTra substantially outperforms the Referential Reader on both measures, indicating better and more interpretable tracking of people.1
1 Code available at https://github.com/shtoshni92/petra
Figure 1: Illustration of memory cell updates in an example sentence ("The character Amelia Shepherd, portrayed by ... Caterina Scorsone, visits ... her friend Addison"), where IG = ignore, OW = overwrite, CR = coref. (Token labels in the figure: The→IG, character→IG, Amelia→OW, Shepherd→CR, Caterina→OW, Scorsone→CR, her→CR, friend→IG, Addison→OW.) Different patterns indicate the different entities, and an empty pattern indicates that the cell has not been used. The updated memory cells at each time step are highlighted.
2 Model
Figure 2 depicts PeTra, which consists of three components: an input encoder that, given the tokens, generates the token embeddings; a memory module that tracks information about the entities present in the text; and a controller network that acts as an interface between the encoder and the memory.
Figure 2: Proposed model. (The diagram shows input tokens w_t fed to a BERT encoder, GRU hidden states h_t, controller outputs e_t, o_t, c_t, and the memory transition M_{t-1} → M_t.)
2.1 Input Encoder
Given a document consisting of a sequence of tokens {w_1, · · · , w_T}, we first pass the document through a fixed pretrained BERT model (Devlin et al., 2019) to extract contextual token embeddings. Next, the BERT-based token embeddings are fed into a single-layer unidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014) running left-to-right to get task-specific token embeddings {h_1, · · · , h_T}.
2.2 Memory
The memory M_t consists of N memory cells. The i-th memory cell state at time step t consists of a tuple (m^i_t, u^i_t), where the vector m^i_t represents the content of the memory cell, and the scalar u^i_t ∈ [0, 1] represents its recency of usage. A high value of u^i_t is intended to mean that the cell is tracking an entity that has been recently mentioned.
Initialization Memory cells are initialized to the null tuple, i.e. (0, 0); thus, our memory is parameter-free. This is in contrast with previous entity tracking models such as EntNet (Henaff et al., 2017) and the Referential Reader (Liu et al., 2019a), where memory initialization is learned and the cells are represented with separate key and value vectors. We will later discuss variants of our memory with some of these changes.
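As a concrete sketch (our own illustration, not the authors' code; the names `init_memory` and `hidden_dim` are hypothetical), the parameter-free memory is simply N (content vector, usage scalar) tuples initialized to the null tuple:

```python
# Minimal sketch of PeTra's parameter-free memory: N cells, each a
# (content vector m^i, usage scalar u^i) tuple initialized to (0, 0).
# Illustrative names only.

def init_memory(num_cells: int, hidden_dim: int):
    """Return N memory cells, each (zero vector, usage scalar 0.0)."""
    return [([0.0] * hidden_dim, 0.0) for _ in range(num_cells)]

memory = init_memory(num_cells=4, hidden_dim=3)
assert all(u == 0.0 for _, u in memory)                 # no cell used yet
assert all(all(x == 0.0 for x in m) for m, _ in memory)  # all contents zero
```

Because nothing here is learned, the total parameter count of the model does not grow with the number of cells.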
2.3 Controller
At each time step t, the controller network determines whether token t is part of an entity span and, if so, whether the token is coreferent with any of the entities already being tracked by the memory. Depending on these two variables, there are three possible actions:
(i) IGNORE: The token is not part of any entity span, in which case we simply ignore it.

(ii) OVERWRITE: The token is part of an entity span but is not already being tracked in the memory.

(iii) COREF: The token is part of an entity span and the entity is being tracked in the memory.
Therefore, the two ways of updating the memory are OVERWRITE and COREF. There is a strict ordering constraint between the two operations: OVERWRITE precedes COREF, because it is not possible to corefer with a memory cell that is not yet tracking anything. That is, the COREF operation cannot be applied to a previously unwritten memory cell, i.e. one with u^i_t = 0. Figure 1 illustrates an idealized version of this process.
Next we describe in detail the computation of the probabilities of the two operations for each memory cell at each time step t.
First, the entity mention probability e_t, which reflects the probability that the current token w_t is part of an entity mention, is computed by:

e_t = σ(MLP1(h_t))    (1)

where MLP1 is a multi-layer perceptron and σ is the logistic function.
Overwrite and Coref If the current token w_t is part of an entity mention, we need to determine whether it corresponds to an entity currently being tracked by the memory or not. For this we compute the similarity between the token embedding h_t and the contents of the memory cells currently tracking entities. For the i-th memory cell with memory vector m^i_{t-1}, the similarity with h_t is given by:

sim^i_t = MLP2([h_t; m^i_{t-1}; h_t ⊙ m^i_{t-1}; u^i_{t-1}])    (2)

where MLP2 is a second MLP and ⊙ is the Hadamard (elementwise) product. The usage scalar u^i_{t-1} in the above expression provides a notion of distance between the last mention of the entity in cell i and the potential current mention. The higher the value of u^i_{t-1}, the more likely there was a recent mention of the entity being tracked by the cell. Thus u^i_{t-1} provides an alternative to the distance-based features commonly used in pairwise scores for spans (Lee et al., 2017).
Given the entity mention probability e_t and similarity score sim^i_t, we define the coref score cs^i_t as:

cs^i_t = sim^i_t − ∞ · 1[u^i_{t-1} = 0]    (3)

where the second term ensures that the model does not predict coreference with a memory cell that has not been previously used, something not enforced by Liu et al. (2019a).2 Assuming the coref score for a new entity to be 0,3 we compute the coref probability c^i_t and the new entity probability n_t as follows:

[c^1_t, · · · , c^N_t, n_t] = e_t · softmax([cs^1_t, · · · , cs^N_t, 0])    (4)
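A small sketch of this masked-softmax step (our own illustration; `coref_probs`, `sims`, and `usages` are hypothetical names): unused cells are masked to −∞, a free score of 0 is appended for the "new entity" option, and the softmax output is scaled by e_t.

```python
import math

# Sketch of Eqs. (3)-(4): mask unused cells (u = 0) with -inf, append the
# free new-entity score 0, softmax, then scale by the mention probability e_t.
# Illustrative, not the authors' implementation.

def coref_probs(sims, usages, e_t):
    """Return ([c^1_t, ..., c^N_t], n_t)."""
    scores = [s if u > 0 else float("-inf") for s, u in zip(sims, usages)]
    scores.append(0.0)                       # coref score for a new entity
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    probs = [e_t * x / z for x in exps]
    return probs[:-1], probs[-1]

c, n = coref_probs(sims=[2.0, 0.5], usages=[0.9, 0.0], e_t=0.8)
assert abs(sum(c) + n - 0.8) < 1e-9   # cell + new-entity probs sum to e_t
assert c[1] == 0.0                    # an unused cell gets zero coref mass
```

Note that the coref and new-entity probabilities together account for exactly the mention probability mass e_t, so a non-mention token contributes (almost) nothing to any memory update.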
Based on the memory usage scalars u^i_t and the new entity probability n_t, the overwrite probability for each memory cell is determined as follows:

o^i_t = n_t · 1[i = argmin_j u^j_{t-1}]    (5)

Thus we pick the cell with the lowest usage scalar u^j_{t-1} to OVERWRITE. In case of a tie, a cell is picked randomly among the ones with the lowest usage scalar. The above operation is non-differentiable, so during training we instead use

o^i_t = n_t · GS((1 − u^i_{t-1}) / τ)_i    (6)

where GS(·) refers to Gumbel-Softmax (Jang et al., 2017), which makes overwrites differentiable.

2 A threshold higher than 0 can also be used to limit coreference to only more recent mentions.
3 The new entity coref score is a free variable that can be assigned any value, since only the relative value matters.
For each memory cell, the memory vector is updated based on the three possibilities of ignoring the current token, being coreferent with the token, or considering the token to represent a new entity (causing an overwrite):

m^i_t = (1 − (o^i_t + c^i_t)) · m^i_{t-1}    [IGNORE]
        + o^i_t · h_t    [OVERWRITE]
        + c^i_t · MLP3([h_t; m^i_{t-1}])    [COREF]    (7)
In this expression, the coreference term takes into account both the previous cell vector m^i_{t-1} and the current token representation h_t, while the overwrite term is based only on h_t. In contrast to a similar memory update equation in the Referential Reader, which employs a pair of GRUs and MLPs for each memory cell, our update uses just MLP3, which is memory cell-agnostic.

Finally, the memory usage scalar is updated as

u^i_t = min(1, o^i_t + c^i_t + γ · u^i_{t-1})    (8)

where γ ∈ (0, 1) is the decay rate for the usage scalar. Thus the usage scalar u^i_t keeps decaying with time unless the memory is updated via OVERWRITE or COREF, in which case the value is increased to reflect the memory cell's recent use.
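The per-cell update of Eqs. (7)-(8) can be sketched as follows (our own illustration; MLP3 is stood in by a simple elementwise average of the two vectors, and all names are hypothetical):

```python
# Sketch of the per-cell memory update (Eqs. (7)-(8)). A simple average
# stands in for the learned, cell-agnostic MLP3. Illustrative names only.

def update_cell(m_prev, u_prev, h_t, o, c, gamma=0.98):
    """Blend IGNORE / OVERWRITE / COREF by their probabilities o and c."""
    coref_vec = [(h + m) / 2.0 for h, m in zip(h_t, m_prev)]  # MLP3 stand-in
    m_new = [(1.0 - (o + c)) * m + o * h + c * cv
             for m, h, cv in zip(m_prev, h_t, coref_vec)]
    u_new = min(1.0, o + c + gamma * u_prev)   # Eq. (8): decay unless updated
    return m_new, u_new

# A pure overwrite copies h_t into the cell and saturates the usage scalar.
m, u = update_cell(m_prev=[0.0, 0.0], u_prev=0.0, h_t=[1.0, 2.0], o=1.0, c=0.0)
assert m == [1.0, 2.0] and u == 1.0
```

With o = c = 0 the content is untouched and the usage scalar simply decays by the factor γ, which is what lets u^i_t act as a recency feature.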
Memory Variants In vanilla PeTra, each memory cell is represented as a single vector and the memory is parameter-free, so the total number of model parameters is independent of the memory size. This is a property that is shared with, for example, differentiable neural computers (Graves et al., 2016). On the other hand, recent models for entity tracking, such as the EntNet (Henaff et al., 2017) and the Referential Reader (Liu et al., 2019a), learn
memory initialization parameters and separate the memory cell into key-value pairs. To compare these memory cell architectures, we investigate the following two variants of PeTra:

1. PeTra + Learned Initialization: memory cells are initialized at t = 0 to learned parameter vectors.

2. PeTra + Fixed Key: a fixed number of dimensions of each memory cell is initialized with learned parameters and kept fixed throughout the document read, as in EntNet (Henaff et al., 2017).

Apart from initialization, the initial cell vectors are also used to break ties for overwrites in Eqs. (5) and (6) when deciding among unused cells (with u^i_t = 0). The criterion for breaking the tie is the similarity score computed using Eq. (2).
2.4 Coreference Link Probability

The probability that the tokens w_{t1} and w_{t2} are coreferential according to, say, cell i of the memory depends on three things: (a) w_{t1} is identified as part of an entity mention and is either overwritten to cell i or is part of an earlier coreference chain for an entity tracked by cell i, (b) cell i is not overwritten by any other entity mention from t = t1 + 1 to t = t2, and (c) w_{t2} is also predicted to be part of an entity mention and is coreferential with cell i. Combining these factors and marginalizing over the cell index results in the following expression for the coreference link probability:

P_CL(w_{t1}, w_{t2}) = Σ_{i=1}^{N} (o^i_{t1} + c^i_{t1}) · Π_{j=t1+1}^{t2} (1 − o^i_j) · c^i_{t2}    (9)
2.5 Losses
The GAP (Webster et al., 2018) training dataset is small and provides sparse supervision, with labels for only two coreference links per instance. In order to compensate for this lack of supervision, we use a heuristic loss L_ent over entity mention probabilities in combination with the end task loss L_coref for coreference. The two losses are combined with a tunable hyperparameter λ, resulting in the following total loss: L = L_coref + λ L_ent.
2.5.1 Coreference Loss
The coreference loss is the binary cross entropy between the ground truth labels for mention pairs and the coreference link probability P_CL in Eq. (9). Eq. (9) expects a pair of tokens while the annotations are on pairs of spans, so we compute the loss for all ground truth token pairs:

L_coref = Σ_{(s_a, s_b, y_ab) ∈ G} Σ_{w_a ∈ s_a} Σ_{w_b ∈ s_b} H(y_ab, P_CL(w_a, w_b))

where G is the set of annotated span pairs and H(p, q) represents the cross entropy of the distribution q relative to the distribution p.
Apart from the ground truth labels, we use "implied labels" in the coreference loss calculation. For handling multi-token spans, we assume that all tokens following the head token are coreferential with the head token (self-links). We infer more supervision based on knowledge of the setup of the GAP task. Each GAP instance has two candidate names and a pronoun mention, with supervision provided for the {name, pronoun} pairs. By design the two names are different, and therefore we use them as a negative coreference pair.
Even after the addition of this implied supervision, our coreference loss calculation is restricted to the three mention spans in each training instance; therefore, the running time is O(T) for finite-sized mention spans. In contrast, Liu et al. (2019a) compute the above coreference loss for all token pairs (assuming a negative label for all pairs outside of the mentions), which results in a runtime of O(T^3) due to the O(T^2) pairs and O(T) computation per pair, and thus will scale poorly to long documents.
2.5.2 Entity Mention Loss
We use the inductive bias that most tokens do not correspond to entities by imposing a loss on the average of the entity mention probabilities predicted across time steps, after masking out the labeled entity spans. For a training instance where spans s_A and s_B correspond to the person mentions and span s_P is a pronoun, the entity mention loss is

L_ent = (Σ_{t=1}^{T} e_t · m_t) / (Σ_{t=1}^{T} m_t)

where m_t = 0 if w_t ∈ s_A ∪ s_B ∪ s_P and m_t = 1 otherwise.
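This masked average can be sketched in a few lines (our own illustration; `entity_mention_loss` and its arguments are hypothetical names):

```python
# Sketch of the entity mention loss: the mean predicted mention probability
# over tokens OUTSIDE the labeled spans (mask m_t = 1 there, 0 inside).
# Illustrative names only.

def entity_mention_loss(e, mask):
    num = sum(e_t * m_t for e_t, m_t in zip(e, mask))
    den = sum(mask)
    return num / den

# Token 0 is inside a labeled span, so only tokens 1 and 2 contribute.
loss = entity_mention_loss(e=[0.9, 0.2, 0.4], mask=[0, 1, 1])
assert abs(loss - 0.3) < 1e-9
```

Minimizing this pushes mention probabilities toward zero on unlabeled tokens while leaving the labeled spans free to be predicted as mentions.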
Each GAP instance has only 3 labeled entity mention spans, but the text typically has other entity mentions that are not labeled. Unlabeled entity mentions will be inhibited by this loss. However, on average there are far more tokens outside entity spans than inside the spans. In experiments without
this loss, we observed that the model is susceptible to predicting a high entity probability for all tokens while still performing well on the end task of pronoun resolution. We are interested in tracking people beyond just the entities that are labeled in the GAP task, for which this loss is very helpful.
3 Experimental Setup
3.1 Data
GAP is a gender-balanced pronoun resolution dataset introduced by Webster et al. (2018). Each instance consists of a small snippet of text from Wikipedia, two spans corresponding to candidate names along with a pronoun span, and two binary labels indicating the coreference relationship between the pronoun and the two candidate names. Relative to other popular coreference datasets (Pradhan et al., 2012; Chen et al., 2018), GAP is comparatively small and sparsely annotated. We choose GAP because its small size allows us to do extensive experiments.
3.2 Model Details
For the input BERT embeddings, we concatenate either the last four layers of BERTBASE, or layers 19–22 of BERTLARGE, since those layers have been found to carry the most information related to coreference (Liu et al., 2019b). The BERT embeddings are fed to a 300-dimensional GRU model, which matches the dimensionality of the memory vectors.
We vary the number of memory cells N from 2 to 20. The decay rate γ for the memory usage scalar is 0.98. The MLPs used for predicting the entity probability and similarity score consist of two 300-dimensional ReLU hidden layers. For the Fixed Key variant of PeTra we use 20 dimensions for the learned key vector and the remaining 280 dimensions as the value vector.
3.3 Training
All models are trained for a maximum of 100 epochs with the Adam optimizer (Kingma and Ba, 2015). The learning rate is initialized to 10^-3 and is reduced by half, until a minimum of 10^-4, whenever there is no improvement in validation performance for the last 5 epochs. Training stops when there is no improvement in validation performance for the last 15 epochs. The temperature τ of the Gumbel-Softmax distribution used in the OVERWRITE operation is initialized to 1 and halved every 10 epochs. The coreference loss terms in Section 2.5.1 are weighted differently for different coreference links: (a) self-link losses for multi-token spans are given a weight of 1, (b) positive coreference link losses are weighted by 5, and (c) negative coreference link losses are multiplied by 50. To prevent overfitting, we (a) use early stopping based on validation performance, and (b) apply dropout at a rate of 0.5 on the output of the GRU model. Finally, we choose λ = 0.1 to weight the entity prediction loss described in Section 2.5.2.
3.4 People Tracking Evaluation
One of the goals of this work is to develop memory models that not only do well on the coreference resolution task, but are also interpretable in the sense that the memory cells actually track entities. Hence, in addition to reporting the standard metrics on GAP, we consider two other ways to evaluate memory models.
As our first task, we propose an auxiliary entity-counting task. We take 100 examples from the GAP validation set and annotate them with the number of unique people mentioned in them.4 We test the models by predicting the number of people from their memory logs, as explained in Section 3.5. The motivation behind this exercise is that if a memory model is truly tracking entities, then its memory usage logs should allow us to recover this information.
To assess the people tracking performance more holistically, we conduct a human evaluation in which we ask annotators to assess the memory models on people tracking performance, defined as: (a) detecting references to people, including pronouns, and (b) maintaining a 1-to-1 correspondence between people and memory cells. For this study, we pick the best run (among 5 runs) of PeTra and the Referential Reader for the 8-cell configuration using BERTBASE (PeTra: 81 F1; Referential Reader: 79 F1). Next we randomly pick 50 documents (without replacement) from the GAP dev set and split those into groups of 10 to get 5 evaluation sets. We shuffle the original 50 documents and follow the same steps to get another 5 evaluation sets. In the end, we have a total of 10 evaluation sets with 10 documents each, where each unique document belongs to exactly 2 evaluation sets.
We recruit 10 annotators for the 10 evaluation sets. The annotators are shown memory log visualizations as in Figure 5, and instructed to compare
4 In the GAP dataset, the only relevant entities are people.
Figure 3: Mean F1 score on the GAP validation set as a function of the number of memory cells. (a) BERTBASE; (b) BERTLARGE.
the models on their people tracking performance (detailed instructions in Appendix A.3). For each document, the annotators are presented with the memory logs of the two models (ordered randomly) and asked whether they prefer the first model, prefer the second model, or have no preference (neutral).
3.5 Inference
GAP Given a pronoun span s_P and two candidate name spans s_A and s_B, we have to predict binary labels for potential coreference links between (s_A, s_P) and (s_B, s_P). Thus, for a pair of entity spans, say s_A and s_P, we predict the coreference link probability as:

P_CL(s_A, s_P) = max_{w_A ∈ s_A, w_P ∈ s_P} P_CL(w_A, w_P)

where P_CL(w_A, w_P) is calculated using the procedure described in Section 2.4.⁵ The final binary prediction is made by comparing the probability against a threshold.
Counting unique people For the test of unique people counting, we discretize the overwrite operation, which corresponds to new entities, against a threshold α and sum over all tokens and all memory cells to predict the count as follows:

# unique people = Σ_{t=1}^{T} Σ_{i=1}^{N} 1[o^i_t ≥ α]
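A minimal sketch of this count (our own illustration; `count_unique_people` and its arguments are hypothetical names):

```python
# Sketch of the unique-people count: threshold the overwrite probabilities
# (overwrites correspond to new entities) and count over all tokens and
# cells. Illustrative names only.

def count_unique_people(overwrites, alpha):
    """overwrites[t][i] = o^i_t; count entries >= alpha."""
    return sum(1 for row in overwrites for o_it in row if o_it >= alpha)

# Rows = time steps, columns = memory cells: two confident overwrites.
o = [[0.9, 0.0], [0.1, 0.0], [0.05, 0.85]]
assert count_unique_people(o, alpha=0.5) == 2
```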
3.6 Evaluation Metrics
For GAP we evaluate models using F-score.6 First, we pick a threshold from the set {0.01, 0.02, · · · , 1.00} which maximizes the validation F-score. This threshold is then used to evaluate performance on the GAP test set.

For the interpretability task of counting unique people, we choose a threshold that minimizes the absolute difference between the ground truth count and the predicted count, summed over the 100 annotated examples. We select the best threshold from the set {0.01, 0.02, · · · , 1.00}. The metric is then the number of errors corresponding to the best threshold.7

5 The computation of this probability includes the mention detection steps required by Webster et al. (2018).
6 GAP also includes evaluation related to gender bias, but this is not a focus of this paper so we do not report it.
3.7 Baselines
The Referential Reader (Liu et al., 2019a) is the most relevant baseline in the literature, and the most similar to PeTra. The numbers reported by Liu et al. (2019a) are obtained by a version of the model using BERTBASE with only two memory cells. To compare against PeTra for other configurations, we retrain the Referential Reader using the code made available by the authors.8
We also report the results of Joshi et al. (2019) and Wu et al. (2019), although these numbers are not comparable since both of them train on the much larger OntoNotes corpus and just test on GAP.
4 Results
4.1 GAP results
We train all the memory models, including the Referential Reader, with memory size varying over {2, 4, · · · , 20} memory cells for both BERTBASE and BERTLARGE, with each configuration being trained 5 times. Figure 3 shows the performance of the

7 Note that the error we report is therefore a best-case result. We are not proposing a way of counting unique people in new test data, but rather using this task for analysis.
8 https://github.com/liufly/refreader
Figure 4: Error in counting unique people as a function of the number of memory cells; lower is better. (a) BERTBASE; (b) BERTLARGE.
                         BERTBASE      BERTLARGE
PeTra                    81.5 ± 0.6    85.3 ± 0.6
 + Learned Init.         80.9 ± 0.7    84.4 ± 1.2
 + Fixed Key             81.1 ± 0.7    85.1 ± 0.8
Ref. Reader              78.9 ± 1.3    83.7 ± 0.8
Ref. Reader (2019a)      78.8          -
Joshi et al. (2019)      82.8          85.0
Wu et al. (2019)         -             87.5 (SpanBERT)

Table 1: Results (%F1) on the GAP test set.
models on the GAP validation set as a function of memory size. The Referential Reader outperforms PeTra (and its memory variants) when using a small number of memory cells, but its performance starts degrading after 4 and 6 memory cells for BERTBASE and BERTLARGE, respectively. PeTra and its memory variants, in contrast, keep improving with increased memory size (before saturating at a higher number of cells) and outperform the best Referential Reader performance for all memory sizes ≥ 6 cells. With larger numbers of memory cells, we see higher variance, but the curves for PeTra and its memory variants are still consistently higher than those of the Referential Reader.

Among the different memory variants of PeTra, when using BERTBASE the performances are comparable, with no clear advantage for any particular choice. For BERTLARGE, however, vanilla PeTra has a clear edge for almost all memory sizes, suggesting the limited utility of initialization. The results show that PeTra works well without learning vectors for initializing the key or memory cell contents. Rather, we can remove the key/value distinction and simply initialize all memory cells with the zero vector.
To evaluate on the GAP test set, we pick the memory size corresponding to the best validation performance for all memory models. Table 1 shows that the trends from validation hold true for test as well, with PeTra outperforming the Referential Reader and the other memory variants of PeTra.
4.2 Counting unique people
Figure 4 shows the results for the proposed interpretability task of counting unique people. For both BERTBASE and BERTLARGE, PeTra achieves the lowest error count. Interestingly, from Figure 4b we can see that for ≥ 14 memory cells, the other memory variants of PeTra perform worse than the Referential Reader while being better on the GAP validation task (see Figure 3b). This shows that a better performing model is not necessarily better at tracking people.
                  BERTBASE    BERTLARGE
PeTra             0.76        0.69
 + Learned Init   0.72        0.60
 + Fixed Key      0.72        0.65
Ref. Reader       0.49        0.54

Table 2: Spearman's correlation between GAP validation F1 and negative error count for unique people.
To test the relationship between the GAP task and the proposed interpretability task, we compute the correlation between the GAP F-score and the negative error count for unique people for each model separately.9 Table 2 shows the Spearman's correlation between these measures. For all models we see a positive correlation, indicating that a dip in coreference performance corresponds to an increase in error on counting unique people. The correlations for PeTra are especially high, again suggesting its greater interpretability.
9 Each correlation is computed over 50 runs (5 runs each for 10 memory sizes).
Amelia Shepherd1, M.D. is a fictional character on the ABC American television medical drama Private Practice, and the spinoff series' progenitor show, Grey's Anatomy, portrayed by Caterina Scorsone2. In her1 debut appearance in season three, Amelia1 visited her former sister-in-law, Addison Montgomery3, and became a partner at the Oceanside Wellness Group.
(a) A successful run of PeTra with 4 memory cells. The model accurately links all the mentions of "Amelia" to the same memory cell while also detecting other people in the discourse.
Bethenny1 calls a meeting to get everyone on the same page, but Jason2 is hostile with the group, making things worse and forcing Bethenny1 to play referee. Emotions are running high with Bethenny1's assistant, Julie3, who breaks down at a lunch meeting when asked if she3 is committed to the company for the long haul.
(b) Memory log of PeTra with 8 memory cells. The model correctly links "she" and "Julie" but fails at linking the three "Bethenny" mentions, and also fails at detecting "Jason".
Figure 5: Visualization of memory logs for different configurations of PeTra. The documents have their GAP annotations highlighted in red (italics) and blue (bold), with blue (bold) corresponding to the right answer. For illustration purposes only, we highlight all the spans corresponding to mentions of people and mark cluster indices as subscripts. In the plot, the X-axis corresponds to document tokens, and the Y-axis corresponds to memory cells. Each memory cell has the OW = OVERWRITE and CR = COREF labels. Darker color implies a higher value. We skip text, indicated via ellipsis, when the model doesn't detect people for extended lengths of text.
4.3 Human Evaluation for People Tracking
Model          Preference (in %)
PeTra          74
Ref. Reader     8
Neutral        18

Table 3: Human evaluation results for people tracking.
Table 3 summarizes the results of the human evaluation for people tracking. The annotators prefer PeTra in 74% of cases and the Referential Reader in only 8% of instances (see Appendix A.4 for visualizations comparing the two). Thus, PeTra easily outperforms the Referential Reader on this task even though they are quite close on the GAP evaluation. The annotators agree on 68% of the documents, disagree between PeTra and Neutral for 24% of the documents, and disagree between PeTra and the Referential Reader for the remaining 8% of documents. For more details, see Appendix A.2.
4.4 Model Runs
We visualize two runs of PeTra with different configurations in Figure 5. For both instances the model gets the right pronoun resolution, but clearly in Figure 5b the model fails at correctly tracking repeated mentions of "Bethenny". We believe these errors happen because (a) GAP supervision is limited to pronoun-proper name pairs, so the model is never explicitly supervised to link proper names, and (b) there is a lack of span-level features, which hurts the model when a name is split across multiple tokens.
5 Related Work
There are several strands of related work, including prior work on developing neural models with external memory, as well as variants that focus on modeling entities and entity relations, and neural models for coreference resolution.
Memory-augmented models. Neural network architectures with external memory include memory networks (Weston et al., 2015; Sukhbaatar et al., 2015), neural Turing machines (Graves et al., 2014), and differentiable neural computers (Graves et al., 2016). This paper focuses on models with inductive biases that produce particular structures in the memory, specifically those related to entities.
Models for tracking and relating entities. A number of existing models have targeted entity tracking and coreference links for a variety of tasks. EntNet (Henaff et al., 2017) aims to track entities via a memory model. EntityNLM (Ji et al., 2017) represents entities dynamically within a neural language model. Hoang et al. (2018) augment a reading comprehension model to track entities, incorporating a set of auxiliary losses to encourage capturing of reference relations in the text. Dhingra et al. (2018) introduce a modified GRU layer designed to aggregate information across coreferent mentions.
Memory models for NLP tasks. Memory models have been applied to several other NLP tasks in addition to coreference resolution, including targeted aspect-based sentiment analysis (Liu et al., 2018b), machine translation (Maruf and Haffari, 2018), narrative modeling (Liu et al., 2018a), and dialog state tracking (Perez and Liu, 2017). Our study of architectural choices for memory may also be relevant to models for these tasks.
Neural models for coreference resolution. Several neural models have been developed for coreference resolution, most of them focused on modeling pairwise interactions among mentions or spans in a document (Wiseman et al., 2015; Clark and Manning, 2016a; Lee et al., 2017, 2018). These models use heuristics to avoid computing scores for all possible span pairs in a document, an operation which is quadratic in the document length T, assuming a maximum span length. Memory models for coreference resolution, including our model, differ by seeking to store information about entities in memory cells and then modeling the relationship between a token and a memory cell. This reduces computation from O(T^2) to O(TN), where N is the number of memory cells, allowing memory models to be applied to longer texts by using the global entity information. Past work (Wiseman et al., 2016) has used global features, but in conjunction with other features to score span pairs.
Referential Reader. Most closely related to the present work is the Referential Reader (Liu et al., 2019a), which uses a memory model to perform coreference resolution incrementally. We significantly simplify this model to accomplish the same goal with far fewer parameters.
6 Conclusion and Future Work
We propose a new memory model for entity tracking, which is trained using sparse coreference resolution supervision. The proposed model outperforms a previous approach with far fewer parameters and a simpler architecture. We propose a new diagnostic evaluation and conduct a human evaluation to test the interpretability of the model, and find that our model again does better on this evaluation. In future work, we plan to extend this work to longer documents, such as the recently released dataset of Bamman et al. (2019).
Acknowledgments
This material is based upon work supported by the National Science Foundation under Award Nos. 1941178 and 1941160. We thank the ACL reviewers, Sam Wiseman, and Mrinmaya Sachan for their valuable feedback. We thank Fei Liu and Jacob Eisenstein for answering questions regarding the Referential Reader. Finally, we want to thank all the annotators at TTIC who participated in the human evaluation study.
References

David Bamman, Olivia Lewke, and Anya Mansoor. 2019. An Annotated Dataset of Coreference in English Literature. arXiv preprint arXiv:1912.01140.

Hong Chen, Zhenhua Fan, Hao Lu, Alan Yuille, and Shu Rong. 2018. PreCo: A Large-scale Dataset in Preschool Vocabulary for Coreference Resolution. In EMNLP.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In EMNLP.

Kevin Clark and Christopher D. Manning. 2016a. Deep Reinforcement Learning for Mention-Ranking Coreference Models. In EMNLP.

Kevin Clark and Christopher D. Manning. 2016b. Improving Coreference Resolution by Learning Entity-Level Distributed Representations. In ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.

Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2018. Neural Models for Reasoning over Multiple Mentions Using Coreference. In NAACL.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476.

Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. 2017. Tracking the world state with recurrent entity networks. In ICLR.

Luong Hoang, Sam Wiseman, and Alexander Rush. 2018. Entity Tracking Improves Cloze-style Reading Comprehension. In EMNLP.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In ICLR.

Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A. Smith. 2017. Dynamic Entity Representations in Neural Language Models. In EMNLP.

Mandar Joshi, Omer Levy, Luke Zettlemoyer, and Daniel Weld. 2019. BERT for Coreference Resolution: Baselines and Analysis. In EMNLP.

Frank Keller. 2010. Cognitively Plausible Models of Human Language Processing. In ACL.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end Neural Coreference Resolution. In EMNLP.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-Order Coreference Resolution with Coarse-to-Fine Inference. In NAACL-HLT.

Fei Liu, Trevor Cohn, and Timothy Baldwin. 2018a. Narrative Modeling with Memory Chains and Semantic Supervision. In ACL.

Fei Liu, Trevor Cohn, and Timothy Baldwin. 2018b. Recurrent Entity Networks with Delayed Memory Update for Targeted Aspect-Based Sentiment Analysis. In NAACL-HLT.

Fei Liu, Luke Zettlemoyer, and Jacob Eisenstein. 2019a. The Referential Reader: A recurrent entity network for anaphora resolution. In ACL.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019b. Linguistic Knowledge and Transferability of Contextual Representations. In NAACL-HLT.

Sameen Maruf and Gholamreza Haffari. 2018. Document Context Neural Machine Translation with Memory Networks. In ACL.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-Value Memory Networks for Directly Reading Documents. In EMNLP.

Julien Perez and Fei Liu. 2017. Dialog state tracking, a machine reading approach using Memory Network. In EACL.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, CoNLL '12.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In NeurIPS.
MK Tanenhaus, MJ Spivey-Knowlton, KM Eberhard, and JC Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217).

Kellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. 2018. Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns. In TACL, volume 6.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory Networks. In ICLR.

Sam Wiseman, Alexander M. Rush, Stuart Shieber, and Jason Weston. 2015. Learning Anaphoricity and Antecedent Ranking Features for Coreference Resolution. In ACL-IJCNLP.

Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber. 2016. Learning Global Features for Coreference Resolution. In NAACL.

Wei Wu, Fei Wang, Arianna Yuan, Fei Wu, and Jiwei Li. 2019. Coreference Resolution as Query-based Span Prediction. arXiv preprint arXiv:1911.01746.
A Appendix
A.1 Best Runs vs. Worst Runs
As Table 1 shows, there is significant variance in the performance of these memory models. To analyze how the best runs diverge from the worst runs, we examine how the controller network uses the different memory cells in terms of overwrites. For this analysis, we choose the best and worst among the 5 runs for each configuration, as determined by GAP validation performance. For the selected runs, we calculate the KL divergence of the average overwrite probability distribution from the uniform distribution and average it for each model type. Table 4 shows that for the memory variants Learned Init and Fixed Key, the worst runs overwrite some memory cells much more than others (high average KL divergence). Note that both PeTra and the Referential Reader are by design intended to have no preference for any particular memory cell (which the numbers support), hence the low KL divergence.
                      Avg KL-div
                  Best run   Worst run
PeTra               0.00       0.01
+ Learned Init.     0.3        0.83
+ Fixed Key         0.2        0.8
Ref. Reader         0.05       0.04

Table 4: A comparison of best runs vs. worst runs.
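The diagnostic above can be sketched as follows (a minimal illustration with hypothetical probability values, not the paper's actual implementation): the KL divergence of a model's average per-cell overwrite distribution from the uniform distribution over the N memory cells.

```python
# Sketch of the A.1 diagnostic: KL(p || uniform) over N memory cells.
# A value near 0 means the model uses all cells evenly; a large value
# means overwrites are concentrated in a few cells.
import numpy as np

def kl_from_uniform(overwrite_probs):
    """KL divergence (in nats) of a cell-usage distribution from uniform."""
    p = np.asarray(overwrite_probs, dtype=float)
    p = p / p.sum()                        # normalize to a distribution
    u = np.full_like(p, 1.0 / len(p))      # uniform over N cells
    mask = p > 0                           # 0 * log(0/u) = 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / u[mask])))

balanced = [1 / 8] * 8                               # even use of 8 cells
skewed = [0.5, 0.3, 0.1, 0.05, 0.03, 0.01, 0.005, 0.005]
assert kl_from_uniform(balanced) < 1e-9  # uniform usage -> KL ~ 0
assert kl_from_uniform(skewed) > 0.5     # concentrated usage -> high KL
```

Under this measure, the low values for PeTra and the Referential Reader in Table 4 correspond to near-uniform cell usage.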
A.2 Human Evaluation Results

The agreement matrix for the human evaluation study described in Section 4.3 is shown in Figure 6. This agreement matrix is a result of the two annotations per document that we obtain via the setup described in Section 3.4. Note that the annotations come from two sets of annotators rather than two individual annotators. This is also the reason why we don't report standard inter-annotator agreement coefficients.
Figure 6: Agreement matrix for the human evaluation study (rows: Annotation 1; columns: Annotation 2; labels: PeTra, Ref Reader, Neutral).
A.3 Instructions for Human Evaluation

The detailed instructions for the human evaluation study described in Section 4.3 are shown in Figure 7. We simplified certain memory-model-specific terms such as "overwrite" to "new person" since the study was really about people tracking.
A.4 Comparative visualization of memory logs of PeTra and the Referential Reader

Figures 8 and 9 compare the memory logs of PeTra and the Referential Reader.
• In this user study we will be comparing memory models at tracking people.

• What are memory models? Memory models are neural networks coupled with an external memory which can be used for reading/writing.

• (IMPORTANT) What does it mean to track people for memory models?

  – Detect all references to people, which includes pronouns.
  – A 1-to-1 correspondence between people and memory cells, i.e., all references corresponding to a person should be associated with the same memory cell AND each memory cell should be associated with at most 1 person.

• The memory models use the following scores (which are visualized) to indicate the tracking decisions:

  – New Person Probability (Cell i): Probability that the token refers to a new person (not introduced in the text till now) and we start tracking it in cell i.
  – Coreference Probability (Cell i): Probability that the token refers to a person already being tracked in cell i.

• The objective of this study is to compare the models on the interpretability of their memory logs, i.e., are the models actually tracking entities or not. You can choose how you weigh the different requirements for tracking people (from 3).

• For this study, you will compare two memory models with 8 memory cells (represented via 8 rows). The models are ordered randomly for each instance.

• For each document, you can choose model A or model B, or stay neutral in case both the models perform similarly.
Figure 7: Instructions for the human evaluation study.
Neef¹ took an individual silver medal at the 1994 European Cup behind Russia's Svetlana Goncharenko² and returned the following year to win gold. She¹ was a finalist individually at the 1994 European Championships and came sixth for Scotland at the 1994 Commonwealth Games.

(a) GAP validation instance 293. The ground truth GAP annotation is indicated via superscript indices.
(b) Memory log of PeTra with 8 memory cells. PeTra uses only 2 memory cells for the 2 unique people, namely Neef and Svetlana Goncharenko, and correctly resolves the pronoun.
(c) Memory log of the Referential Reader with 8 memory cells. The Referential Reader does successfully resolve the pronoun in the topmost memory cell, but it ends up tracking Neef in as many as 4 memory cells.
Figure 8: Both models only weakly detect "Svetlana Goncharenko", which could be due to a lack of span modeling.
Fripp¹ has performed Soundscapes in several situations: * Fripp¹ has featured Soundscapes on various King Crimson albums. He¹ has also released pure Soundscape recordings as well: * On May 4, 2006, Steve Ball² invited Robert Fripp¹ back to the Microsoft campus for a second full day of work on Windows Vista following up on his¹ first visit in the Fall of 2005.

(a) GAP validation instance 17. The ground truth GAP annotation is indicated via superscript indices.
(b) Memory log of PeTra with 8 memory cells. PeTra is quite accurate at tracking Robert Fripp, but it misses connecting "Fripp" from the earlier part of the document to "Robert Fripp".
(c) Memory log of the Referential Reader with 8 memory cells. The Referential Reader completely misses all the mentions in the first half of the document (which is not penalized in GAP evaluations, where the relevant annotations are typically towards the end of the document). Apart from this, the model ends up tracking Robert Fripp in as many as 6 memory cells, and Steve Ball in 3 memory cells.
Figure 9: PeTra clearly performs better than the Referential Reader at people tracking for this instance. PeTra's output is sparser, detects more relevant mentions, and is better at maintaining a 1-to-1 correspondence between memory cells and people.