Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5415–5428July 5 - 10, 2020. c©2020 Association for Computational Linguistics
PeTra: A Sparsely Supervised Memory Model for People Tracking
Shubham Toshniwal1, Allyson Ettinger2, Kevin Gimpel1, Karen Livescu1
1 Toyota Technological Institute at Chicago  2 Department of Linguistics, University of Chicago
{shtoshni, kgimpel, klivescu}@ttic.edu, [email protected]
Abstract
We propose PeTra, a memory-augmented neural network designed to track entities in its memory slots. PeTra is trained using sparse annotation from the GAP pronoun resolution dataset and outperforms a prior memory model on the task while using a simpler architecture. We empirically compare key modeling choices, finding that we can simplify several aspects of the design of the memory module while retaining strong performance. To measure the people tracking capability of memory models, we (a) propose a new diagnostic evaluation based on counting the number of unique entities in text, and (b) conduct a small scale human evaluation to compare evidence of people tracking in the memory logs of PeTra relative to a previous approach. PeTra is highly effective in both evaluations, demonstrating its ability to track people in its memory despite being trained with limited annotation.
1 Introduction
Understanding text narratives requires maintaining and resolving entity references over arbitrary-length spans. Current approaches for coreference resolution (Clark and Manning, 2016b; Lee et al., 2017, 2018; Wu et al., 2019) scale quadratically (without heuristics) with the length of the text, and hence are impractical for long narratives. These models are also cognitively implausible, lacking the incrementality of human language processing (Tanenhaus et al., 1995; Keller, 2010). Memory models with finite memory and online/quasi-online entity resolution have linear runtime complexity, offering more scalability, cognitive plausibility, and interpretability.
Memory models can be viewed as general problem solvers with external memory mimicking a Turing tape (Graves et al., 2014, 2016). Some of the earliest applications of memory networks in language understanding were for question answering, where the external memory simply stored all of the word/sentence embeddings for a document (Sukhbaatar et al., 2015; Kumar et al., 2016). To endow more structure and interpretability to memory, key-value memory networks were introduced by Miller et al. (2016). The key-value architecture has since been used for narrative understanding and other tasks where the memory is intended to learn to track entities while being guided by varying degrees of supervision (Henaff et al., 2017; Liu et al., 2018a,b, 2019a).
We propose a new memory model, PeTra, for entity tracking and coreference resolution, inspired by the recent Referential Reader model (Liu et al., 2019a) but substantially simpler. Experiments on the GAP (Webster et al., 2018) pronoun resolution task show that PeTra outperforms the Referential Reader with fewer parameters and a simpler architecture. Importantly, while Referential Reader performance degrades with larger memory, PeTra improves with increases in memory capacity (before saturation), which should enable tracking of a larger number of entities. We conduct experiments to assess various memory architecture decisions, such as learning of memory initialization and separation of memory slots into key/value pairs.
To test the interpretability of memory models' entity tracking, we propose a new diagnostic evaluation based on entity counting, a task that the models are not explicitly trained for, using a small amount of annotated data. Additionally, we conduct a small scale human evaluation to assess the quality of people tracking based on model memory logs. PeTra substantially outperforms the Referential Reader on both measures, indicating better and more interpretable tracking of people.1
1 Code available at https://github.com/shtoshni92/petra
Figure 1: Illustration of memory cell updates in an example sentence ("The character Amelia Shepherd, portrayed by ... Caterina Scorsone, visits ... her friend Addison"), where IG = ignore, OW = overwrite, CR = coref. (Token labels in the figure: The→IG, character→IG, Amelia→OW, Shepherd→CR, Caterina→OW, Scorsone→CR, her→CR, friend→IG, Addison→OW.) Different patterns indicate the different entities, and an empty pattern indicates that the cell has not been used. The updated memory cells at each time step are highlighted.
2 Model
Figure 2 depicts PeTra, which consists of three components: an input encoder that, given the tokens, generates the token embeddings; a memory module that tracks information about the entities present in the text; and a controller network that acts as an interface between the encoder and the memory.
Figure 2: Proposed model. (The diagram shows input tokens w_t fed to a BERT encoder, GRU hidden states h_t, controller outputs e_t, o_t, c_t, and the memory transition M_{t-1} → M_t.)
2.1 Input Encoder
Given a document consisting of a sequence of tokens {w_1, · · · , w_T}, we first pass the document through a fixed pretrained BERT model (Devlin et al., 2019) to extract contextual token embeddings. Next, the BERT-based token embeddings are fed into a single-layer unidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014) running left-to-right to get task-specific token embeddings {h_1, · · · , h_T}.
2.2 Memory
The memory M_t consists of N memory cells. The i-th memory cell state at time step t consists of a tuple (m^i_t, u^i_t), where the vector m^i_t represents the content of the memory cell, and the scalar u^i_t ∈ [0, 1] represents its recency of usage. A high value of u^i_t is intended to mean that the cell is tracking an entity that has been recently mentioned.
Initialization Memory cells are initialized to the null tuple, i.e. (0, 0); thus, our memory is parameter-free. This is in contrast with previous entity tracking models such as EntNet (Henaff et al., 2017) and the Referential Reader (Liu et al., 2019a), where memory initialization is learned and the cells are represented with separate key and value vectors. We will later discuss variants of our memory with some of these changes.
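As a concrete sketch (our own illustration, not the authors' code; the names `init_memory` and `hidden_dim` are hypothetical), the parameter-free memory is simply N (content vector, usage scalar) tuples initialized to the null tuple:

```python
# Minimal sketch of PeTra's parameter-free memory: N cells, each a
# (content vector m^i, usage scalar u^i) tuple initialized to (0, 0).
# Illustrative names only.

def init_memory(num_cells: int, hidden_dim: int):
    """Return N memory cells, each (zero vector, usage scalar 0.0)."""
    return [([0.0] * hidden_dim, 0.0) for _ in range(num_cells)]

memory = init_memory(num_cells=4, hidden_dim=3)
assert all(u == 0.0 for _, u in memory)                 # no cell used yet
assert all(all(x == 0.0 for x in m) for m, _ in memory)  # all contents zero
```

Because nothing here is learned, the total parameter count of the model does not grow with the number of cells.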
2.3 Controller
At each time step t, the controller network determines whether token t is part of an entity span and, if so, whether the token is coreferent with any of the entities already being tracked by the memory. Depending on these two variables, there are three possible actions:
(i) IGNORE: The token is not part of any entity span, in which case we simply ignore it.

(ii) OVERWRITE: The token is part of an entity span but is not already being tracked in the memory.

(iii) COREF: The token is part of an entity span and the entity is being tracked in the memory.
Therefore, the two ways of updating the memory are OVERWRITE and COREF. There is a strict ordering constraint between the two operations: OVERWRITE precedes COREF, because it is not possible to corefer with a memory cell that is not yet tracking anything. That is, the COREF operation cannot be applied to a previously unwritten memory cell, i.e. one with u^i_t = 0. Figure 1 illustrates an idealized version of this process.
Next we describe in detail the computation of the probabilities of the two operations for each memory cell at each time step t.
First, the entity mention probability e_t, which reflects the probability that the current token w_t is part of an entity mention, is computed by:

e_t = σ(MLP1(h_t))    (1)

where MLP1 is a multi-layer perceptron and σ is the logistic function.
Overwrite and Coref If the current token w_t is part of an entity mention, we need to determine whether it corresponds to an entity currently being tracked by the memory or not. For this we compute the similarity between the token embedding h_t and the contents of the memory cells currently tracking entities. For the i-th memory cell with memory vector m^i_{t-1}, the similarity with h_t is given by:

sim^i_t = MLP2([h_t; m^i_{t-1}; h_t ⊙ m^i_{t-1}; u^i_{t-1}])    (2)

where MLP2 is a second MLP and ⊙ is the Hadamard (elementwise) product. The usage scalar u^i_{t-1} in the above expression provides a notion of distance between the last mention of the entity in cell i and the potential current mention. The higher the value of u^i_{t-1}, the more likely there was a recent mention of the entity being tracked by the cell. Thus u^i_{t-1} provides an alternative to the distance-based features commonly used in pairwise scores for spans (Lee et al., 2017).
Given the entity mention probability e_t and similarity score sim^i_t, we define the coref score cs^i_t as:

cs^i_t = sim^i_t − ∞ · 1[u^i_{t-1} = 0]    (3)

where the second term ensures that the model does not predict coreference with a memory cell that has not been previously used, something not enforced by Liu et al. (2019a).2 Assuming the coref score for a new entity to be 0,3 we compute the coref probability c^i_t and the new entity probability n_t as follows:

[c^1_t, · · · , c^N_t, n_t] = e_t · softmax([cs^1_t, · · · , cs^N_t, 0])    (4)
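A small sketch of this masked-softmax step (our own illustration; `coref_probs`, `sims`, and `usages` are hypothetical names): unused cells are masked to −∞, a free score of 0 is appended for the "new entity" option, and the softmax output is scaled by e_t.

```python
import math

# Sketch of Eqs. (3)-(4): mask unused cells (u = 0) with -inf, append the
# free new-entity score 0, softmax, then scale by the mention probability e_t.
# Illustrative, not the authors' implementation.

def coref_probs(sims, usages, e_t):
    """Return ([c^1_t, ..., c^N_t], n_t)."""
    scores = [s if u > 0 else float("-inf") for s, u in zip(sims, usages)]
    scores.append(0.0)                       # coref score for a new entity
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    probs = [e_t * x / z for x in exps]
    return probs[:-1], probs[-1]

c, n = coref_probs(sims=[2.0, 0.5], usages=[0.9, 0.0], e_t=0.8)
assert abs(sum(c) + n - 0.8) < 1e-9   # cell + new-entity probs sum to e_t
assert c[1] == 0.0                    # an unused cell gets zero coref mass
```

Note that the coref and new-entity probabilities together account for exactly the mention probability mass e_t, so a non-mention token contributes (almost) nothing to any memory update.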
Based on the memory usage scalars u^i_t and the new entity probability n_t, the overwrite probability for each memory cell is determined as follows:

o^i_t = n_t · 1[i = argmin_j u^j_{t-1}]    (5)

Thus we pick the cell with the lowest usage scalar u^j_{t-1} to OVERWRITE. In case of a tie, a cell is picked randomly among the ones with the lowest usage scalar. The above operation is non-differentiable, so during training we instead use

o^i_t = n_t · GS((1 − u^i_{t-1}) / τ)_i    (6)

where GS(·) refers to Gumbel-Softmax (Jang et al., 2017), which makes overwrites differentiable.

2 A threshold higher than 0 can also be used to limit coreference to only more recent mentions.
3 The new entity coref score is a free variable that can be assigned any value, since only the relative value matters.
For each memory cell, the memory vector is updated based on the three possibilities of ignoring the current token, being coreferent with the token, or considering the token to represent a new entity (causing an overwrite):

m^i_t = (1 − (o^i_t + c^i_t)) · m^i_{t-1}    [IGNORE]
        + o^i_t · h_t    [OVERWRITE]
        + c^i_t · MLP3([h_t; m^i_{t-1}])    [COREF]    (7)
In this expression, the coreference term takes into account both the previous cell vector m^i_{t-1} and the current token representation h_t, while the overwrite term is based only on h_t. In contrast to a similar memory update equation in the Referential Reader, which employs a pair of GRUs and MLPs for each memory cell, our update uses just MLP3, which is memory cell-agnostic.

Finally, the memory usage scalar is updated as

u^i_t = min(1, o^i_t + c^i_t + γ · u^i_{t-1})    (8)

where γ ∈ (0, 1) is the decay rate for the usage scalar. Thus the usage scalar u^i_t keeps decaying with time unless the memory is updated via OVERWRITE or COREF, in which case the value is increased to reflect the memory cell's recent use.
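The per-cell update of Eqs. (7)-(8) can be sketched as follows (our own illustration; MLP3 is stood in by a simple elementwise average of the two vectors, and all names are hypothetical):

```python
# Sketch of the per-cell memory update (Eqs. (7)-(8)). A simple average
# stands in for the learned, cell-agnostic MLP3. Illustrative names only.

def update_cell(m_prev, u_prev, h_t, o, c, gamma=0.98):
    """Blend IGNORE / OVERWRITE / COREF by their probabilities o and c."""
    coref_vec = [(h + m) / 2.0 for h, m in zip(h_t, m_prev)]  # MLP3 stand-in
    m_new = [(1.0 - (o + c)) * m + o * h + c * cv
             for m, h, cv in zip(m_prev, h_t, coref_vec)]
    u_new = min(1.0, o + c + gamma * u_prev)   # Eq. (8): decay unless updated
    return m_new, u_new

# A pure overwrite copies h_t into the cell and saturates the usage scalar.
m, u = update_cell(m_prev=[0.0, 0.0], u_prev=0.0, h_t=[1.0, 2.0], o=1.0, c=0.0)
assert m == [1.0, 2.0] and u == 1.0
```

With o = c = 0 the content is untouched and the usage scalar simply decays by the factor γ, which is what lets u^i_t act as a recency feature.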
Memory Variants In vanilla PeTra, each memory cell is represented as a single vector and the memory is parameter-free, so the total number of model parameters is independent of the memory size. This is a property that is shared with, for example, differentiable neural computers (Graves et al., 2016). On the other hand, recent models for entity tracking, such as the EntNet (Henaff et al., 2017) and the Referential Reader (Liu et al., 2019a), learn
memory initialization parameters and separate the memory cell into key-value pairs. To compare these memory cell architectures, we investigate the following two variants of PeTra:

1. PeTra + Learned Initialization: memory cells are initialized at t = 0 to learned parameter vectors.

2. PeTra + Fixed Key: a fixed number of dimensions of each memory cell is initialized with learned parameters and kept fixed throughout the document read, as in EntNet (Henaff et al., 2017).

Apart from initialization, the initial cell vectors are also used to break ties for overwrites in Eqs. (5) and (6) when deciding among unused cells (with u^i_t = 0). The criterion for breaking the tie is the similarity score computed using Eq. (2).
2.4 Coreference Link Probability

The probability that the tokens w_{t1} and w_{t2} are coreferential according to, say, cell i of the memory depends on three things: (a) w_{t1} is identified as part of an entity mention and is either overwritten to cell i or is part of an earlier coreference chain for an entity tracked by cell i, (b) cell i is not overwritten by any other entity mention from t = t1 + 1 to t = t2, and (c) w_{t2} is also predicted to be part of an entity mention and is coreferential with cell i. Combining these factors and marginalizing over the cell index results in the following expression for the coreference link probability:

P_CL(w_{t1}, w_{t2}) = Σ_{i=1}^{N} (o^i_{t1} + c^i_{t1}) · Π_{j=t1+1}^{t2} (1 − o^i_j) · c^i_{t2}    (9)
2.5 Losses
The GAP (Webster et al., 2018) training dataset is small and provides sparse supervision, with labels for only two coreference links per instance. In order to compensate for this lack of supervision, we use a heuristic loss L_ent over entity mention probabilities in combination with the end task loss L_coref for coreference. The two losses are combined with a tunable hyperparameter λ, resulting in the following total loss: L = L_coref + λ L_ent.
2.5.1 Coreference Loss
The coreference loss is the binary cross entropy between the ground truth labels for mention pairs and the coreference link probability P_CL in Eq. (9). Eq. (9) expects a pair of tokens while the annotations are on pairs of spans, so we compute the loss for all ground truth token pairs:

L_coref = Σ_{(s_a, s_b, y_ab) ∈ G} Σ_{w_a ∈ s_a} Σ_{w_b ∈ s_b} H(y_ab, P_CL(w_a, w_b))

where G is the set of annotated span pairs and H(p, q) represents the cross entropy of the distribution q relative to the distribution p.
Apart from the ground truth labels, we use "implied labels" in the coreference loss calculation. For handling multi-token spans, we assume that all tokens following the head token are coreferential with the head token (self-links). We infer more supervision based on knowledge of the setup of the GAP task. Each GAP instance has two candidate names and a pronoun mention, with supervision provided for the {name, pronoun} pairs. By design the two names are different, and therefore we use them as a negative coreference pair.
Even after the addition of this implied supervision, our coreference loss calculation is restricted to the three mention spans in each training instance; therefore, the running time is O(T) for finite-sized mention spans. In contrast, Liu et al. (2019a) compute the above coreference loss for all token pairs (assuming a negative label for all pairs outside of the mentions), which results in a runtime of O(T^3) due to the O(T^2) pairs and O(T) computation per pair, and thus will scale poorly to long documents.
2.5.2 Entity Mention Loss
We use the inductive bias that most tokens do not correspond to entities by imposing a loss on the average of the entity mention probabilities predicted across time steps, after masking out the labeled entity spans. For a training instance where spans s_A and s_B correspond to the person mentions and span s_P is a pronoun, the entity mention loss is

L_ent = (Σ_{t=1}^{T} e_t · m_t) / (Σ_{t=1}^{T} m_t)

where m_t = 0 if w_t ∈ s_A ∪ s_B ∪ s_P and m_t = 1 otherwise.
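This masked average can be sketched in a few lines (our own illustration; `entity_mention_loss` and its arguments are hypothetical names):

```python
# Sketch of the entity mention loss: the mean predicted mention probability
# over tokens OUTSIDE the labeled spans (mask m_t = 1 there, 0 inside).
# Illustrative names only.

def entity_mention_loss(e, mask):
    num = sum(e_t * m_t for e_t, m_t in zip(e, mask))
    den = sum(mask)
    return num / den

# Token 0 is inside a labeled span, so only tokens 1 and 2 contribute.
loss = entity_mention_loss(e=[0.9, 0.2, 0.4], mask=[0, 1, 1])
assert abs(loss - 0.3) < 1e-9
```

Minimizing this pushes mention probabilities toward zero on unlabeled tokens while leaving the labeled spans free to be predicted as mentions.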
Each GAP instance has only 3 labeled entity mention spans, but the text typically has other entity mentions that are not labeled. Unlabeled entity mentions will be inhibited by this loss. However, on average there are far more tokens outside entity spans than inside the spans. In experiments without
this loss, we observed that the model is susceptible to predicting a high entity probability for all tokens while still performing well on the end task of pronoun resolution. We are interested in tracking people beyond just the entities that are labeled in the GAP task, for which this loss is very helpful.
3 Experimental Setup
3.1 Data
GAP is a gender-balanced pronoun resolution dataset introduced by Webster et al. (2018). Each instance consists of a small snippet of text from Wikipedia, two spans corresponding to candidate names along with a pronoun span, and two binary labels indicating the coreference relationship between the pronoun and the two candidate names. Relative to other popular coreference datasets (Pradhan et al., 2012; Chen et al., 2018), GAP is comparatively small and sparsely annotated. We choose GAP because its small size allows us to do extensive experiments.
3.2 Model Details
For the input BERT embeddings, we concatenate either the last four layers of BERTBASE, or layers 19–22 of BERTLARGE, since those layers have been found to carry the most information related to coreference (Liu et al., 2019b). The BERT embeddings are fed to a 300-dimensional GRU model, which matches the dimensionality of the memory vectors.
We vary the number of memory cells N from 2 to 20. The decay rate γ for the memory usage scalar is 0.98. The MLPs used for predicting the entity probability and similarity score consist of two 300-dimensional ReLU hidden layers. For the Fixed Key variant of PeTra we use 20 dimensions for the learned key vector and the remaining 280 dimensions as the value vector.
3.3 Training
All models are trained for a maximum of 100 epochs with the Adam optimizer (Kingma and Ba, 2015). The learning rate is initialized to 10^-3 and is reduced by half, until a minimum of 10^-4, whenever there is no improvement in validation performance for the last 5 epochs. Training stops when there is no improvement in validation performance for the last 15 epochs. The temperature τ of the Gumbel-Softmax distribution used in the OVERWRITE operation is initialized to 1 and halved every 10 epochs. The coreference loss terms in Section 2.5.1 are weighted differently for different coreference links: (a) self-link losses for multi-token spans are given a weight of 1, (b) positive coreference link losses are weighted by 5, and (c) negative coreference link losses are multiplied by 50. To prevent overfitting, we (a) use early stopping based on validation performance, and (b) apply dropout at a rate of 0.5 on the output of the GRU model. Finally, we choose λ = 0.1 to weight the entity prediction loss described in Section 2.5.2.
3.4 People Tracking Evaluation
One of the goals of this work is to develop memory models that not only do well on the coreference resolution task, but are also interpretable in the sense that the memory cells actually track entities. Hence, in addition to reporting the standard metrics on GAP, we consider two other ways to evaluate memory models.
As our first task, we propose an auxiliary entity-counting task. We take 100 examples from the GAP validation set and annotate them with the number of unique people mentioned in them.4 We test the models by predicting the number of people from their memory logs, as explained in Section 3.5. The motivation behind this exercise is that if a memory model is truly tracking entities, then its memory usage logs should allow us to recover this information.
To assess the people tracking performance more holistically, we conduct a human evaluation in which we ask annotators to assess the memory models on people tracking performance, defined as: (a) detecting references to people, including pronouns, and (b) maintaining a 1-to-1 correspondence between people and memory cells. For this study, we pick the best run (among 5 runs) of PeTra and the Referential Reader for the 8-cell configuration using BERTBASE (PeTra: 81 F1; Referential Reader: 79 F1). Next we randomly pick 50 documents (without replacement) from the GAP dev set and split those into groups of 10 to get 5 evaluation sets. We shuffle the original 50 documents and follow the same steps to get another 5 evaluation sets. In the end, we have a total of 10 evaluation sets with 10 documents each, where each unique document belongs to exactly 2 evaluation sets.
We recruit 10 annotators for the 10 evaluation sets. The annotators are shown memory log visualizations as in Figure 5, and instructed to compare
4 In the GAP dataset, the only relevant entities are people.
Figure 3: Mean F1 score on the GAP validation set as a function of the number of memory cells. (a) BERTBASE; (b) BERTLARGE.
the models on their people tracking performance (detailed instructions in Appendix A.3). For each document, the annotators are presented with the memory logs of the two models (ordered randomly) and asked whether they prefer the first model, prefer the second model, or have no preference (neutral).
3.5 Inference
GAP Given a pronoun span s_P and two candidate name spans s_A and s_B, we have to predict binary labels for potential coreference links between (s_A, s_P) and (s_B, s_P). Thus, for a pair of entity spans, say s_A and s_P, we predict the coreference link probability as:

P_CL(s_A, s_P) = max_{w_A ∈ s_A, w_P ∈ s_P} P_CL(w_A, w_P)

where P_CL(w_A, w_P) is calculated using the procedure described in Section 2.4.⁵ The final binary prediction is made by comparing the probability against a threshold.
Counting unique people For the test of unique people counting, we discretize the overwrite operation, which corresponds to new entities, against a threshold α and sum over all tokens and all memory cells to predict the count as follows:

# unique people = Σ_{t=1}^{T} Σ_{i=1}^{N} 1[o^i_t ≥ α]
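A minimal sketch of this count (our own illustration; `count_unique_people` and its arguments are hypothetical names):

```python
# Sketch of the unique-people count: threshold the overwrite probabilities
# (overwrites correspond to new entities) and count over all tokens and
# cells. Illustrative names only.

def count_unique_people(overwrites, alpha):
    """overwrites[t][i] = o^i_t; count entries >= alpha."""
    return sum(1 for row in overwrites for o_it in row if o_it >= alpha)

# Rows = time steps, columns = memory cells: two confident overwrites.
o = [[0.9, 0.0], [0.1, 0.0], [0.05, 0.85]]
assert count_unique_people(o, alpha=0.5) == 2
```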
3.6 Evaluation Metrics
For GAP we evaluate models using F-score.6 First, we pick a threshold from the set {0.01, 0.02, · · · , 1.00} which maximizes the validation F-score. This threshold is then used to evaluate performance on the GAP test set.

For the interpretability task of counting unique people, we choose a threshold that minimizes the absolute difference between the ground truth count and the predicted count, summed over the 100 annotated examples. We select the best threshold from the set {0.01, 0.02, · · · , 1.00}. The metric is then the number of errors corresponding to the best threshold.7

5 The computation of this probability includes the mention detection steps required by Webster et al. (2018).
6 GAP also includes evaluation related to gender bias, but this is not a focus of this paper so we do not report it.
3.7 Baselines
The Referential Reader (Liu et al., 2019a) is the most relevant baseline in the literature, and the most similar to PeTra. The numbers reported by Liu et al. (2019a) are obtained by a version of the model using BERTBASE with only two memory cells. To compare against PeTra for other configurations, we retrain the Referential Reader using the code made available by the authors.8
We also report the results of Joshi et al. (2019) and Wu et al. (2019), although these numbers are not comparable since both of them train on the much larger OntoNotes corpus and just test on GAP.
4 Results
4.1 GAP results
We train all the memory models, including the Referential Reader, with memory size varying over {2, 4, · · · , 20} memory cells for both BERTBASE and BERTLARGE, with each configuration being trained 5 times. Figure 3 shows the performance of the

7 Note that the error we report is therefore a best-case result. We are not proposing a way of counting unique people in new test data, but rather using this task for analysis.
8 https://github.com/liufly/refreader
Figure 4: Error in counting unique people as a function of the number of memory cells; lower is better. (a) BERTBASE; (b) BERTLARGE.
                         BERTBASE      BERTLARGE
PeTra                    81.5 ± 0.6    85.3 ± 0.6
 + Learned Init.         80.9 ± 0.7    84.4 ± 1.2
 + Fixed Key             81.1 ± 0.7    85.1 ± 0.8
Ref. Reader              78.9 ± 1.3    83.7 ± 0.8
Ref. Reader (2019a)      78.8          -
Joshi et al. (2019)      82.8          85.0
Wu et al. (2019)         -             87.5 (SpanBERT)

Table 1: Results (%F1) on the GAP test set.
models on the GAP validation set as a function of memory size. The Referential Reader outperforms PeTra (and its memory variants) when using a small number of memory cells, but its performance starts degrading after 4 and 6 memory cells for BERTBASE and BERTLARGE, respectively. PeTra and its memory variants, in contrast, keep improving with increased memory size (before saturating at a higher number of cells) and outperform the best Referential Reader performance for all memory sizes ≥ 6 cells. With larger numbers of memory cells, we see higher variance, but the curves for PeTra and its memory variants are still consistently higher than those of the Referential Reader.

Among the different memory variants of PeTra, when using BERTBASE the performances are comparable, with no clear advantage for any particular choice. For BERTLARGE, however, vanilla PeTra has a clear edge for almost all memory sizes, suggesting the limited utility of initialization. The results show that PeTra works well without learning vectors for initializing the key or memory cell contents. Rather, we can remove the key/value distinction and simply initialize all memory cells with the zero vector.
To evaluate on the GAP test set, we pick the memory size corresponding to the best validation performance for all memory models. Table 1 shows that the trends from validation hold true for test as well, with PeTra outperforming the Referential Reader and the other memory variants of PeTra.
4.2 Counting unique people
Figure 4 shows the results for the proposed interpretability task of counting unique people. For both BERTBASE and BERTLARGE, PeTra achieves the lowest error count. Interestingly, from Figure 4b we can see that for ≥ 14 memory cells, the other memory variants of PeTra perform worse than the Referential Reader while being better on the GAP validation task (see Figure 3b). This shows that a better performing model is not necessarily better at tracking people.
                  BERTBASE    BERTLARGE
PeTra             0.76        0.69
 + Learned Init   0.72        0.60
 + Fixed Key      0.72        0.65
Ref. Reader       0.49        0.54

Table 2: Spearman's correlation between GAP validation F1 and negative error count for unique people.
To test the relationship between the GAP task and the proposed interpretability task, we compute the correlation between the GAP F-score and the negative error count for unique people for each model separately.9 Table 2 shows the Spearman's correlation between these measures. For all models we see a positive correlation, indicating that a dip in coreference performance corresponds to an increase in error on counting unique people. The correlations for PeTra are especially high, again suggesting its greater interpretability.
9 Each correlation is computed over 50 runs (5 runs each for 10 memory sizes).
Amelia Shepherd1, M.D. is a fictional character on the ABC American television medical drama Private Practice, and the spinoff series' progenitor show, Grey's Anatomy, portrayed by Caterina Scorsone2. In her1 debut appearance in season three, Amelia1 visited her former sister-in-law, Addison Montgomery3, and became a partner at the Oceanside Wellness Group.
(a) A successful run of PeTra with 4 memory cells. The model accurately links all the mentions of "Amelia" to the same memory cell while also detecting other people in the discourse.
Bethenny1 calls a meeting to get everyone on the same page, but Jason2 is hostile with the group, making things worse and forcing Bethenny1 to play referee. Emotions are running high with Bethenny1's assistant, Julie3, who breaks down at a lunch meeting when asked if she3 is committed to the company for the long haul.
(b) Memory log of PeTra with 8 memory cells. The model correctly links "she" and "Julie" but fails at linking the three "Bethenny" mentions, and also fails at detecting "Jason".
Figure 5: Visualization of memory logs for different configurations of PeTra. The documents have their GAP annotations highlighted in red (italics) and blue (bold), with blue (bold) corresponding to the right answer. For illustration purposes only, we highlight all the spans corresponding to mentions of people and mark cluster indices as subscripts. In the plot, the X-axis corresponds to document tokens, and the Y-axis corresponds to memory cells. Each memory cell has the OW = OVERWRITE and CR = COREF labels. Darker color implies a higher value. We skip text, indicated via ellipsis, when the model doesn't detect people for extended lengths of text.
4.3 Human Evaluation for People Tracking
Model          Preference (in %)
PeTra          74
Ref. Reader     8
Neutral        18

Table 3: Human evaluation results for people tracking.
Table 3 summarizes the results of the human evaluation for people tracking. The annotators prefer PeTra in 74% of cases and the Referential Reader in only 8% of instances (see Appendix A.4 for visualizations comparing the two). Thus, PeTra easily outperforms the Referential Reader on this task even though they are quite close on the GAP evaluation. The annotators agree on 68% of the documents, disagree between PeTra and Neutral for 24% of the documents, and disagree between PeTra and the Referential Reader for the remaining 8% of documents. For more details, see Appendix A.2.
4.4 Model Runs
We visualize two runs of PeTra with different configurations in Figure 5. For both instances the model gets the right pronoun resolution, but clearly in Figure 5b the model fails at correctly tracking repeated mentions of "Bethenny". We believe these errors happen because (a) GAP supervision is limited to pronoun-proper name pairs, so the model is never explicitly supervised to link proper names, and (b) there is a lack of span-level features, which hurts the model when a name is split across multiple tokens.
5 Related Work
There are several strands of related work, including prior work on developing neural models with external memory, as well as variants that focus on modeling entities and entity relations, and neural models for coreference resolution.
Memory-augmented models. Neural network architectures with external memory include memory networks (Weston et al., 2015; Sukhbaatar et al., 2015), neural Turing machines (Graves et al., 2014), and differentiable neural computers (Graves et al., 2016). This paper focuses on models with inductive biases that produce particular structures in the memory, specifically those related to entities.
Models for tracking and relating entities. A number of existing models have targeted entity tracking and coreference links for a variety of tasks. EntNet (Henaff et al., 2017) aims to track entities via a memory model. EntityNLM (Ji et al., 2017) represents entities dynamically within a neural language model. Hoang et al. (2018) augment a reading comprehension model to track entities, incorporating a set of auxiliary losses to encourage capturing of reference relations in the text. Dhingra et al. (2018) introduce a modified GRU layer designed to aggregate information across coreferent mentions.
Memory models for NLP tasks. Memory models have been applied to several other NLP tasks in addition to coreference resolution, including targeted aspect-based sentiment analysis (Liu et al., 2018b), machine translation (Maruf and Haffari, 2018), narrative modeling (Liu et al., 2018a), and dialog state tracking (Perez and Liu, 2017). Our study of architectural choices for memory may also be relevant to models for these tasks.
Neural models for coreference resolution. Several neural models have been developed for coreference resolution, most of them focused on modeling pairwise interactions among mentions or spans in a document (Wiseman et al., 2015; Clark and Manning, 2016a; Lee et al., 2017, 2018). These models use heuristics to avoid computing scores for all possible span pairs in a document, an operation which is quadratic in the document length T, assuming a maximum span length. Memory models for coreference resolution, including our model, differ by seeking to store information about entities in memory cells and then modeling the relationship between a token and a memory cell. This reduces computation from O(T^2) to O(TN), where N is the number of memory cells, allowing memory models to be applied to longer texts by using the global entity information. Past work (Wiseman et al., 2016) has used global features, but in conjunction with other features to score span pairs.
Referential Reader. Most closely related to the present work is the Referential Reader (Liu et al., 2019a), which uses a memory model to perform coreference resolution incrementally. We significantly simplify this model to accomplish the same goal with far fewer parameters.
6 Conclusion and Future Work
We propose a new memory model for entity tracking, which is trained using sparse coreference resolution supervision. The proposed model outperforms a previous approach with far fewer parameters and a simpler architecture. We propose a new diagnostic evaluation and conduct a human evaluation to test the interpretability of the model, and find that our model again does better on this evaluation. In future work, we plan to extend this work to longer documents, such as the recently released dataset of Bamman et al. (2019).
Acknowledgments
This material is based upon work supported by the National Science Foundation under Award Nos. 1941178 and 1941160. We thank the ACL reviewers, Sam Wiseman, and Mrinmaya Sachan for their valuable feedback. We thank Fei Liu and Jacob Eisenstein for answering questions regarding the Referential Reader. Finally, we want to thank all the annotators at TTIC who participated in the human evaluation study.
References

David Bamman, Olivia Lewke, and Anya Mansoor. 2019. An Annotated Dataset of Coreference in English Literature. arXiv preprint arXiv:1912.01140.

Hong Chen, Zhenhua Fan, Hao Lu, Alan Yuille, and Shu Rong. 2018. PreCo: A Large-scale Dataset in Preschool Vocabulary for Coreference Resolution. In EMNLP.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In EMNLP.

Kevin Clark and Christopher D. Manning. 2016a. Deep Reinforcement Learning for Mention-Ranking Coreference Models. In EMNLP.

Kevin Clark and Christopher D. Manning. 2016b. Improving Coreference Resolution by Learning Entity-Level Distributed Representations. In ACL.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.

Bhuwan Dhingra, Qiao Jin, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2018. Neural Models for Reasoning over Multiple Mentions Using Coreference. In NAACL.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural Turing machines. arXiv preprint arXiv:1410.5401.

Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. 2016. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476.

Mikael Henaff, Jason Weston, Arthur Szlam, Antoine Bordes, and Yann LeCun. 2017. Tracking the world state with recurrent entity networks. In ICLR.

Luong Hoang, Sam Wiseman, and Alexander Rush. 2018. Entity Tracking Improves Cloze-style Reading Comprehension. In EMNLP.

Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical Reparameterization with Gumbel-Softmax. In ICLR.

Yangfeng Ji, Chenhao Tan, Sebastian Martschat, Yejin Choi, and Noah A. Smith. 2017. Dynamic Entity Representations in Neural Language Models. In EMNLP.

Mandar Joshi, Omer Levy, Luke Zettlemoyer, and Daniel Weld. 2019. BERT for Coreference Resolution: Baselines and Analysis. In EMNLP.

Frank Keller. 2010. Cognitively Plausible Models of Human Language Processing. In ACL.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.

Ankit Kumar, Ozan Irsoy, Peter Ondruska, Mohit Iyyer, James Bradbury, Ishaan Gulrajani, Victor Zhong, Romain Paulus, and Richard Socher. 2016. Ask me anything: Dynamic memory networks for natural language processing. In ICML.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end Neural Coreference Resolution. In EMNLP.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-Order Coreference Resolution with Coarse-to-Fine Inference. In NAACL-HLT.

Fei Liu, Trevor Cohn, and Timothy Baldwin. 2018a. Narrative Modeling with Memory Chains and Semantic Supervision. In ACL.

Fei Liu, Trevor Cohn, and Timothy Baldwin. 2018b. Recurrent Entity Networks with Delayed Memory Update for Targeted Aspect-Based Sentiment Analysis. In NAACL-HLT.

Fei Liu, Luke Zettlemoyer, and Jacob Eisenstein. 2019a. The Referential Reader: A recurrent entity network for anaphora resolution. In ACL.

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019b. Linguistic Knowledge and Transferability of Contextual Representations. In NAACL-HLT.

Sameen Maruf and Gholamreza Haffari. 2018. Document Context Neural Machine Translation with Memory Networks. In ACL.

Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. 2016. Key-Value Memory Networks for Directly Reading Documents. In EMNLP.

Julien Perez and Fei Liu. 2017. Dialog state tracking, a machine reading approach using Memory Network. In EACL.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, CoNLL '12.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In NeurIPS.
MK Tanenhaus, MJ Spivey-Knowlton, KM Eberhard, and JC Sedivy. 1995. Integration of visual and linguistic information in spoken language comprehension. Science, 268(5217).

Kellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. 2018. Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns. In TACL, volume 6.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2015. Memory Networks. In ICLR.

Sam Wiseman, Alexander M. Rush, Stuart Shieber, and Jason Weston. 2015. Learning Anaphoricity and Antecedent Ranking Features for Coreference Resolution. In ACL-IJCNLP.

Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber. 2016. Learning Global Features for Coreference Resolution. In NAACL.

Wei Wu, Fei Wang, Arianna Yuan, Fei Wu, and Jiwei Li. 2019. Coreference Resolution as Query-based Span Prediction. arXiv preprint arXiv:1911.01746.
A Appendix
A.1 Best Runs vs. Worst Runs
As Table 1 shows, there is significant variance in the performance of these memory models. To analyze how the best runs diverge from the worst runs, we examine how the controller network uses the different memory cells in terms of overwrites. For this analysis, we choose the best and worst among the 5 runs for each configuration, as determined by GAP validation performance. For the selected runs, we calculate the KL divergence of the average overwrite probability distribution from the uniform distribution and average it for each model type. Table 4 shows that for the memory variants Learned Init and Fixed Key, the worst runs overwrite some memory cells much more than others (high average KL divergence). Note that both PeTra and the Referential Reader are by design intended to have no preference for any particular memory cell (which the numbers support), hence the low KL divergence.
                      Avg KL-div
                  Best run   Worst run
PeTra               0.00       0.01
+ Learned Init.     0.3        0.83
+ Fixed Key         0.2        0.8
Ref. Reader         0.05       0.04

Table 4: A comparison of best runs vs. worst runs.
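The diagnostic above can be sketched as follows (a minimal illustration with hypothetical probability values, not the paper's actual implementation): the KL divergence of a model's average per-cell overwrite distribution from the uniform distribution over the N memory cells.

```python
# Sketch of the A.1 diagnostic: KL(p || uniform) over N memory cells.
# A value near 0 means the model uses all cells evenly; a large value
# means overwrites are concentrated in a few cells.
import numpy as np

def kl_from_uniform(overwrite_probs):
    """KL divergence (in nats) of a cell-usage distribution from uniform."""
    p = np.asarray(overwrite_probs, dtype=float)
    p = p / p.sum()                        # normalize to a distribution
    u = np.full_like(p, 1.0 / len(p))      # uniform over N cells
    mask = p > 0                           # 0 * log(0/u) = 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / u[mask])))

balanced = [1 / 8] * 8                               # even use of 8 cells
skewed = [0.5, 0.3, 0.1, 0.05, 0.03, 0.01, 0.005, 0.005]
assert kl_from_uniform(balanced) < 1e-9  # uniform usage -> KL ~ 0
assert kl_from_uniform(skewed) > 0.5     # concentrated usage -> high KL
```

Under this measure, the low values for PeTra and the Referential Reader in Table 4 correspond to near-uniform cell usage.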
A.2 Human Evaluation Results

The agreement matrix for the human evaluation study described in Section 4.3 is shown in Figure 6. This agreement matrix is a result of the two annotations per document that we obtain via the setup described in Section 3.4. Note that the annotations come from two sets of annotators rather than two individual annotators. This is also the reason why we don't report standard inter-annotator agreement coefficients.
Figure 6: Agreement matrix for the human evaluation study (rows: Annotation 1; columns: Annotation 2; labels: PeTra, Ref Reader, Neutral).
A.3 Instructions for Human Evaluation

The detailed instructions for the human evaluation study described in Section 4.3 are shown in Figure 7. We simplified certain memory-model-specific terms such as "overwrite" to "new person" since the study was really about people tracking.
A.4 Comparative visualization of memory logs of PeTra and the Referential Reader

Figures 8 and 9 compare the memory logs of PeTra and the Referential Reader.
• In this user study we will be comparing memory models at tracking people.

• What are memory models? Memory models are neural networks coupled with an external memory which can be used for reading/writing.

• (IMPORTANT) What does it mean to track people for memory models?

  – Detect all references to people, which includes pronouns.
  – A 1-to-1 correspondence between people and memory cells, i.e., all references corresponding to a person should be associated with the same memory cell AND each memory cell should be associated with at most 1 person.

• The memory models use the following scores (which are visualized) to indicate the tracking decisions:

  – New Person Probability (Cell i): Probability that the token refers to a new person (not introduced in the text till now) and we start tracking it in cell i.
  – Coreference Probability (Cell i): Probability that the token refers to a person already being tracked in cell i.

• The objective of this study is to compare the models on the interpretability of their memory logs, i.e., are the models actually tracking entities or not. You can choose how you weigh the different requirements for tracking people (from 3).

• For this study, you will compare two memory models with 8 memory cells (represented via 8 rows). The models are ordered randomly for each instance.

• For each document, you can choose model A or model B, or stay neutral in case both the models perform similarly.
Figure 7: Instructions for the human evaluation study.
Neef¹ took an individual silver medal at the 1994 European Cup behind Russia's Svetlana Goncharenko² and returned the following year to win gold. She¹ was a finalist individually at the 1994 European Championships and came sixth for Scotland at the 1994 Commonwealth Games.

(a) GAP validation instance 293. The ground truth GAP annotation is indicated via superscript indices.
(b) Memory log of PeTra with 8 memory cells. PeTra uses only 2 memory cells for the 2 unique people, namely Neef and Svetlana Goncharenko, and correctly resolves the pronoun.
(c) Memory log of the Referential Reader with 8 memory cells. The Referential Reader does successfully resolve the pronoun in the topmost memory cell, but it ends up tracking Neef in as many as 4 memory cells.
Figure 8: Both models only weakly detect "Svetlana Goncharenko", which could be due to a lack of span modeling.
Fripp¹ has performed Soundscapes in several situations: * Fripp¹ has featured Soundscapes on various King Crimson albums. He¹ has also released pure Soundscape recordings as well: * On May 4, 2006, Steve Ball² invited Robert Fripp¹ back to the Microsoft campus for a second full day of work on Windows Vista following up on his¹ first visit in the Fall of 2005.

(a) GAP validation instance 17. The ground truth GAP annotation is indicated via superscript indices.
(b) Memory log of PeTra with 8 memory cells. PeTra is quite accurate at tracking Robert Fripp, but it misses connecting "Fripp" from the earlier part of the document to "Robert Fripp".
(c) Memory log of the Referential Reader with 8 memory cells. The Referential Reader completely misses all the mentions in the first half of the document (which is not penalized in GAP evaluations, where the relevant annotations are typically towards the end of the document). Apart from this, the model ends up tracking Robert Fripp in as many as 6 memory cells, and Steve Ball in 3 memory cells.
Figure 9: PeTra clearly performs better than the Referential Reader at people tracking for this instance. PeTra's output is sparser, detects more relevant mentions, and is better at maintaining a 1-to-1 correspondence between memory cells and people.