
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4896–4907, November 16–20, 2020. © 2020 Association for Computational Linguistics


Assessing Phrasal Representation and Composition in Transformers

Lang Yu
Department of Computer Science

University of [email protected]

Allyson Ettinger
Department of Linguistics

University of [email protected]

Abstract

Deep transformer models have pushed performance on NLP tasks to new limits, suggesting sophisticated treatment of complex linguistic inputs, such as phrases. However, we have limited understanding of how these models handle representation of phrases, and whether this reflects sophisticated composition of phrase meaning like that done by humans. In this paper, we present systematic analysis of phrasal representations in state-of-the-art pre-trained transformers. We use tests leveraging human judgments of phrase similarity and meaning shift, and compare results before and after control of word overlap, to tease apart lexical effects versus composition effects. We find that phrase representation in these models relies heavily on word content, with little evidence of nuanced composition. We also identify variations in phrase representation quality across models, layers, and representation types, and make corresponding recommendations for usage of representations from these models.

1 Introduction

A fundamental component of language understanding is the capacity to combine meaning units into larger units—a phenomenon known as composition—and to do so in a way that reflects the nuances of meaning as understood by humans. Transformers (Vaswani et al., 2017) have shown impressive performance in NLP, particularly transformers using pre-training, like BERT (Devlin et al., 2019) and GPT (Radford et al., 2018, 2019), suggesting that these models may be succeeding at composition of complex meanings. However, because transformers (like other contextual embedding models) typically maintain representations for every token, it is unclear how and at what points they might be combining word meanings into phrase meanings. This contrasts with models that incorporate explicit phrasal composition into their architecture, e.g. RNNG (Dyer et al., 2016; Kim et al., 2019), recursive models for semantic composition (Socher et al., 2013), or transformers with attention-based composition modules (Yin et al., 2020).

In this paper we take steps to clarify the nature of phrasal representation in transformers. We focus on representation of two-word phrases, and we prioritize identifying and teasing apart two important but distinct notions: how faithfully the models are representing information about the words that make up the phrase, and how faithfully the models are representing the nuances of the composed phrase meaning itself, over and above a simple account of the component words. To do this, we begin with existing methods for testing how well representations align with human judgments of meaning similarity: similarity correlations and paraphrase classification. We then introduce controlled variants of these datasets, removing cues of word overlap, in order to distinguish effects of word content from effects of more sophisticated composition. We complement these phrase similarity analyses with classic sense selection tests of phrasal composition (Kintsch, 2001).

We apply these tests for systematic analysis of several state-of-the-art transformers: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), DistilBERT (Sanh et al., 2019), XLNet (Yang et al., 2019b) and XLM-RoBERTa (Conneau et al., 2019). We run the tests in layerwise fashion, to establish the evolution of phrase information as layers progress, and we test various tokens and token combinations as phrase representations. We find that when word overlap is not controlled, models show strong correspondence with human judgments, with noteworthy patterns of variation across models, layers, and representation types. However, we find that correspondence drops substantially once word overlap is controlled, suggesting that although these transformers contain faithful representations of the lexical content of phrases, there is little evidence that these representations capture sophisticated details of meaning composition beyond word content. Based on the observed representation patterns, we make recommendations for selection of representations from these models. All code and controlled datasets are made available for replication and application to additional models.1



2 Related work

This paper contributes to a growing body of work on analysis of neural network models. Much work has studied recurrent neural network language models (Linzen et al., 2016; Wilcox et al., 2018; Chowdhury and Zamparelli, 2018; Gulordava et al., 2018; Futrell et al., 2019) and sentence encoders (Adi et al., 2016; Conneau et al., 2018; Ettinger et al., 2016). Our work builds in particular on analysis of information encoded in contextualized token representations (Bacon and Regier, 2019; Tenney et al., 2019b; Peters et al., 2018; Hewitt and Manning, 2019; Klafka and Ettinger, 2020) and in different layers of transformers (Tenney et al., 2019a; Jawahar et al., 2019). The BERT model has been a particular focus of analysis work since its introduction. Previous work has focused on analyzing the attention mechanism (Vig and Belinkov, 2019; Clark et al., 2019), parameters (Roberts et al., 2020; Radford et al., 2019; Raffel et al., 2020) and embeddings (Shwartz and Dagan, 2019; Liu et al., 2019a). We build on this work with a particular, controlled focus on the evolution of phrasal representation in a variety of state-of-the-art transformers.

Composition has been a topic of frequent interest when examining neural networks and their representations. One common practice relies on analysis of internal representations via downstream tasks (Baan et al., 2019; Ettinger et al., 2018; Conneau et al., 2019; Nandakumar et al., 2019; McCoy et al., 2019). One line of work analyzes word interactions in neural networks' internal gates as the composition signal (Saphra and Lopez, 2020; Murdoch et al., 2018), extending the Contextual Decomposition algorithm proposed by Jumelet et al. (2019). Another notable branch of work constructs synthetic datasets of small size to investigate compositionality in neural networks (Liska et al., 2018; Hupkes et al., 2018; Baan et al., 2019). Some work controls for word content, as we do, to study composition at the sentence level (Ettinger et al., 2018; Dasgupta et al., 2018). We complement this work with a targeted and systematic study of phrase-level representations in transformers, with a focus on teasing apart lexical properties versus reflections of accurate compositional phrase meaning.

1 Datasets and code available at https://github.com/yulang/phrasal-composition-in-transformers

Our work relates closely to classic work on two-word phrases, which has used methods like landmark tests (Kintsch, 2001; Mitchell and Lapata, 2008, 2010), or compared against distribution-based phrase representations (Baroni and Zamparelli, 2010; Fyshe et al., 2015). Our work also draws on work using correlation with similarity judgments (Finkelstein et al., 2001; Gerz et al., 2016; Hill et al., 2015; Conneau and Kiela, 2018) and paraphrase classification (Ganitkevitch et al., 2013; Wang et al., 2018; Zhang et al., 2019; Yang et al., 2019a) to assess quality of models and representations. We build on this work by combining these methods together, applying them to a systematic analysis of transformers and their components, and introducing controlled variants of existing tasks to isolate accurate composition of phrase meaning from capturing of lexical information.

3 Testing phrase meaning similarity

Our methods begin with familiar approaches for assessing representations via meaning similarity: correlation with human phrase similarity judgments, and ability to identify paraphrases. The goal is to gauge the extent to which models arrive at representations reflecting the nuances of composed phrase meaning understood by humans. We draw on existing datasets, and begin by testing models on the original versions of these datasets—then we tease apart effects of word content from effects of more sophisticated meaning composition by introducing controlled variants of the datasets. The reasoning is that strong correlations with human similarity judgments, or strong paraphrase classification performance, could be influenced by artifacts that are not reflective of accurate phrase meaning composition per se. In particular, we may see strong performance simply on the basis of the amount of overlap in word content between phrases. To address this possibility, we create controlled datasets in which word overlap is no longer a cue to similarity.

As a starting point we focus on two-word phrases, as these are the smallest phrasal unit and the most conducive to these types of lexical controls, and because this allows us to leverage larger amounts of annotated phrase similarity data.


Normal Examples
Source Phrase      Target Phrase & Score
average person     ordinary citizen (0.724)
                   person average (0.518)
                   country (0.255)

AB-BA Examples
Source Phrase      Target Phrase & Score
law school         school law (0.382)
adult female       female adult (0.812)
arms control       control arms (0.473)

Table 1: Examples of correlation items. Numbers in parentheses are similarity scores between target phrase and source phrase. The upper half shows normal examples, and the lower half shows controlled items.


3.1 Phrase similarity correlation

We first evaluate phrase representations by assessing their alignment with human judgments of phrase meaning similarity. For testing this correspondence, we use the BiRD (Asaadi et al., 2019) dataset. BiRD is a bigram relatedness dataset designed to evaluate composition, consisting of 3,345 bigram pairs (examples in Table 1), with source phrases paired with numerous target phrases, and human-rated similarity scores ranging from 0 to 1.

In addition to testing on the full dataset, we design a controlled experiment to remove effects of word overlap, by filtering the dataset to pairs in which the two phrases consist of the same words. We refer to these pairs as “AB-BA” pairs (following terminology of the authors of the BiRD dataset), and show examples in the lower half of Table 1.
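As a rough illustration of this control, the filtering reduces to keeping only pairs whose phrases contain exactly the same words. The following is a minimal sketch, not the authors' released code, and the tuple format is assumed for illustration:

```python
def is_ab_ba(source: str, target: str) -> bool:
    """True if the two phrases consist of the same words in a different order,
    e.g. 'law school' / 'school law'."""
    src_words, trg_words = source.lower().split(), target.lower().split()
    return src_words != trg_words and sorted(src_words) == sorted(trg_words)

# Hypothetical (source phrase, target phrase, human score) triples:
pairs = [("law school", "school law", 0.382),
         ("average person", "ordinary citizen", 0.724)]
ab_ba_pairs = [p for p in pairs if is_ab_ba(p[0], p[1])]  # keeps only the first pair
```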

We run similarity tests as follows: given a model $M$ with layers $L$, for the $i$th layer $l_i \in L$ and a source-target phrase pair, we compute representations of the source phrase $p^i_{rep}(src)$ and the target phrase $p^i_{rep}(trg)$, where $rep$ is a representation type from Section 4, and we compute their cosine similarity $\cos(p^i_{rep}(src), p^i_{rep}(trg))$. The Pearson correlation $r_i$ of layer $l_i$ is then computed between the cosine similarities and the human-rated scores over all source-target pairs.
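The per-layer computation can be summarized with a short sketch, assuming the phrase vectors for one layer and representation type have already been extracted; the variable names are illustrative, not from the released code:

```python
import numpy as np
from scipy.stats import pearsonr

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def layer_correlation(src_reps, trg_reps, human_scores) -> float:
    """src_reps, trg_reps: lists of phrase vectors for one layer and one
    representation type; human_scores: BiRD ratings for the same pairs."""
    cosines = [cosine(s, t) for s, t in zip(src_reps, trg_reps)]
    r, _ = pearsonr(cosines, human_scores)  # Pearson r for this layer
    return r
```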

3.2 Paraphrase classification

We further investigate the nature of phrase representations by testing their capacity to support binary paraphrase classification. This test allows us to explore whether we will see better alignment with human judgments of meaning similarity if we use more complicated operations than cosine similarity comparison. For the classification tasks, we draw on PPDB 2.0 (Pavlick et al., 2015), a widely-used database consisting of paraphrases with scores generated by a regression model. To formulate our binary classification task, after filtering out low-quality paraphrases (discussed in Section 5), we use phrase pairs (source phrase, target phrase) from PPDB as positive pairs, and randomly sample phrases from the complete PPDB dataset to form negative pairs (source phrase, random phrase).

Because word overlap is also a likely cue for paraphrase classification, we filter to a controlled version of this dataset as well, as illustrated in Table 2. We formulate the controlled experiment here as holding word overlap between source phrase and target phrase to be exactly 50% for both positive and negative samples. Our choice of 50% word overlap in this case is necessary for construction of a sufficiently large, balanced classification dataset (AB-BA pairs in PPDB are too few to support classifier training, and AB-BA pairs are more likely to be non-paraphrases). Note, however, that by controlling word overlap to be exactly 50% for all phrase pairs, we still hold constant the amount of word overlap between phrases, which is the cue that we wish to remove. As an additional control, each source phrase is paired with an equal number of paraphrases and non-paraphrases, to avoid the classifier inferring labels based on phrase identity.
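One plausible reading of the 50% criterion, sketched below with hypothetical phrase pairs (the exact overlap computation used by the authors is not spelled out here), is to keep only pairs in which exactly half of the target phrase's words also appear in the source phrase:

```python
def overlap_fraction(source: str, target: str) -> float:
    """Fraction of target-phrase word types that also occur in the source phrase."""
    src, trg = set(source.lower().split()), set(target.lower().split())
    return len(src & trg) / len(trg)

candidate_pairs = [  # hypothetical (source, target, label) triples after preprocessing
    ("communication infrastructure", "telecommunications infrastructure", 1),
    ("communication infrastructure", "data infrastructure", 0),
    ("are crucial", "is absolutely vital", 1),
]
# Keep only pairs at exactly 50% overlap, for positives and negatives alike.
controlled = [(s, t, y) for (s, t, y) in candidate_pairs if overlap_fraction(s, t) == 0.5]
```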

Formally, for each model layer $l_i$ and representation type $rep$, we train

$CLF^i_{rep} = MLP([pair^i_{rep}])$

where $pair^i_{rep}$ represents the embedding concatenation of each source phrase and target phrase:

$pair^i_{rep} = [p^i_{rep}(src); p^i_{rep}(trg)]$

The classifier is trained on binary classification of whether the concatenated inputs represent paraphrases.
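A minimal PyTorch sketch of such a classifier is given below, using the single 256-unit hidden layer with ReLU described in Section 5; the optimizer and learning rate are illustrative assumptions rather than the paper's settings, and the softmax is applied implicitly through the cross-entropy loss:

```python
import torch
import torch.nn as nn

class ParaphraseClassifier(nn.Module):
    """Single-hidden-layer MLP over concatenated [source; target] phrase embeddings."""
    def __init__(self, embed_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),  # input is the concatenated pair
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),              # two logits: paraphrase / non-paraphrase
        )

    def forward(self, src_emb: torch.Tensor, trg_emb: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([src_emb, trg_emb], dim=-1)  # pair = [p(src); p(trg)]
        return self.net(pair)

# Hypothetical usage with 768-dimensional phrase representations:
clf = ParaphraseClassifier(embed_dim=768)
loss_fn = nn.CrossEntropyLoss()  # combines log-softmax with negative log-likelihood
optimizer = torch.optim.Adam(clf.parameters(), lr=1e-3)
```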

4 Representation types

A variety of approaches have been taken for representing sentences and phrases when all tokens output contextualized representations, as in our tested transformers. To clarify the phrasal information present in different forms of phrase representation, we experiment with a number of different combinations of token embeddings as representation types.

Formally, let $[T_0, \cdots, T_k]$ be an input sequence of length $k+1$, with corresponding embeddings at the $i$th layer $[e^i_0, \cdots, e^i_k]$. Assume the phrase spans the sequence $[a, b]$, where $0 \le a \le b \le k$. Because two-word phrases are atypical inputs for these models, we experiment both with inputs of the two-word phrases alone (“phrase-only”) and with inputs in which the phrases are embedded in sentences (“context-available”). This is illustrated in Figure 1 along with the phrase representation types.


Normal Examples
Source Phrase                   Target Phrase
are crucial                     is absolutely vital (pos)
                                was a matter of concern (neg)
                                is an essential part (pos)
                                are exacerbating (neg)

Controlled Examples
Source Phrase                   Target Phrase
communication infrastructure    telecommunications infrastructure (pos)
                                data infrastructure (neg)

Table 2: Examples of classification items. Classification labels between target phrase and source phrase are in parentheses. The upper half shows normal examples, and the lower half shows controlled items.

Figure 1: Example input sequences (BERT format). CLS is a special token at the beginning of the sequence. Tokens in yellow correspond to Head-Word. Avg-Phrase contains the element-wise average of phrase word embeddings. Avg-All averages embeddings of all tokens.


We test the following forms of phrase representation, drawn from each model and layer separately:

CLS Depending on the specific model, this special token can be the first or last token of the input sequence (i.e. $e^i_0$ or $e^i_k$). In many applications, this token is used to represent the full input sequence.

Head-Word In each phrase, the head word is the semantic center of the phrase. For instance, in the phrase “public service”, “service” is the head word, expressing the central meaning of the phrase, while “public” is a modifier. Because phrase heads are not annotated in our datasets, we approximate the head by taking the embedding of the final word of the phrase. This representation is proposed as a potential representation of the whole phrase, if information is being composed into a central word:

$p^i_{hw} = e^i_b$

Avg-Phrase For this representation type we average the embeddings of the tokens in the target phrase (dashed box in Figure 1). This type of averaging of token embeddings is a common means of aggregate representation (Wieting et al., 2015):

$p^i_{ap} = \frac{1}{b-a+1} \sum_{x=a}^{b} e^i_x$

Avg-All Expanding beyond the tokens in “Avg-Phrase”, this representation averages embeddings from the full input sequence:

$p^i_{aa} = \frac{1}{k+1} \sum_{x=0}^{k} e^i_x$

SEP With some variation between models, the SEP token is typically a separator for distinguishing input sentences, and is often the last token ($e^i_k$) or second-to-last token ($e^i_{k-1}$) of a sequence.
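The five representation types can be read directly off one layer's token embeddings. The sketch below assumes a BERT-style layout (CLS first, SEP last) and a phrase whose tokens span positions a through b; for models with a different special-token layout, or after subword tokenization, the indices would need adjusting:

```python
import numpy as np

def phrase_representations(layer_embs: np.ndarray, a: int, b: int) -> dict:
    """layer_embs: (sequence_length, dim) token embeddings for one layer;
    the phrase spans tokens a..b inclusive (BERT-style CLS/SEP positions assumed)."""
    return {
        "CLS": layer_embs[0],
        "Head-Word": layer_embs[b],                      # final word of the phrase as approximate head
        "Avg-Phrase": layer_embs[a:b + 1].mean(axis=0),  # average over the phrase span
        "Avg-All": layer_embs.mean(axis=0),              # average over the full input sequence
        "SEP": layer_embs[-1],
    }

# Toy usage: 6 tokens with 8-dimensional embeddings, phrase spanning tokens 1-2.
reps = phrase_representations(np.random.randn(6, 8), a=1, b=2)
```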

5 Experimental setup

Embeddings of each token are obtained by feeding input sequences through pre-trained contextual encoders. We investigate the “base” version of five transformers: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019b), DistilBERT (Sanh et al., 2019), XLNet (Yang et al., 2019b) and XLM-RoBERTa (Conneau et al., 2019). For the models analyzed in this paper, we are using the implementation of Wolf et al. (2019),2 which is based on PyTorch (Paszke et al., 2019).

2 https://github.com/huggingface/transformers
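For reference, a minimal sketch of obtaining layer-wise token embeddings with a recent version of that library follows; the model name is only an example, and the phrase-span alignment needed after subword tokenization is not shown:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("public service", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple of (num_layers + 1) tensors of shape
# (batch, sequence_length, hidden_dim); index 0 is the input embedding layer.
hidden_states = outputs.hidden_states
layer_1_embeddings = hidden_states[1][0]  # token embeddings from layer 1
```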


Figure 2: Correlation on BiRD dataset, phrase-only input setting. First row shows results on the full dataset, and second row on controlled AB-BA pairs. Layer 0 corresponds to the input embeddings passed to the model.

For correlation analysis, we first use the complete BiRD dataset, consisting of 3,345 phrase pairs.3 We then test with our controlled subset of the data, consisting of 410 AB-BA pairs. For classification tasks, we first do preprocessing on PPDB 2.0,4 filtering out pairs containing hyperlinks, non-alphabetical symbols, and trivial paraphrases based on abbreviation or tense change. For our initial classification test, we use 13,050 source-target phrase pairs (of varying word overlap) from this preprocessed dataset. We then test with our controlled dataset, consisting of 11,770 source-target phrase pairs (each with precisely 50% word overlap). For each paraphrase classification task, 25% of selected data is reserved for testing. We use a multi-layer perceptron classifier with a single hidden layer of size 256 with ReLU activation, and a softmax layer to generate binary labels. We use a relatively simple classifier following the reasoning of Adi et al. (2016), since this allows examination of how easily extractable the information is in these representations.

For both correlation and classification tasks, we experiment with phrase-only inputs and context-available (full-sentence) inputs. To obtain sentence contexts, we search for instances of source phrases in a Wikipedia dump, and extract sentences containing them. For a given phrase pair, target phrases are embedded in the same sentence context as the source phrase, to avoid effects of varying sentence position between phrases of a given pair.5
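Concretely, constructing a context-available input for a target phrase amounts to substituting it into the sentence extracted for its source phrase; the helper below and the example sentence are hypothetical, not from the released code:

```python
def embed_in_context(context_sentence: str, source_phrase: str, target_phrase: str) -> str:
    """Place the target phrase in the same sentence context as its source phrase."""
    return context_sentence.replace(source_phrase, target_phrase, 1)

sentence = "The average person spends hours commuting every week."  # hypothetical extracted sentence
paired_input = embed_in_context(sentence, "average person", "ordinary citizen")
# -> "The ordinary citizen spends hours commuting every week."
```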

3 http://saifmohammad.com/WebPages/BiRD.html
4 http://paraphrase.org
5 Because context sentences are extracted based on source phrases, our use of the same context for source and target phrases can give rise to unnatural contextual fit for target phrases. We consider this acceptable for the sake of controlling sentence position—and if anything, differences in contextual fit may aid models in distinguishing more and less similar phrases. The slight boost observed on the full datasets (for Avg-Phrase) suggests that the sentence contexts do provide the intended benefit from using input of a more natural size.

6 Results

6.1 Similarity correlation

Full dataset The top row of Figure 2 shows correlation results on the full BiRD dataset for all models, layers, and representation types, with phrase-only inputs. Among representation types, Avg-Phrase and Avg-All consistently achieve the highest correlations across models and layers. In all models but DistilBERT, correlation of Avg-Phrase and Avg-All peaks at layer 1 and decreases in subsequent layers with minor fluctuations. Head-Word and SEP both show weaker, but non-trivial, correlations. The CLS token is of note with a consistent rapid rise as layers progress, suggesting that it quickly takes on properties of the words of the phrase. For all models but DistilBERT, CLS token correlations peak in middle layers and then decline.

Model-wise, XLM-RoBERTa shows the weakest overall correlations, potentially due to the fact that it is trained to infer input language and to handle multiple languages. BERT retains fairly consistent correlations across layers, while RoBERTa and XLNet show rapid declines as layers progress, suggesting that these models increasingly incorporate information that deviates from human intuitions about phrase similarity. DistilBERT, despite being of smaller size, demonstrates competitive correlation. The CLS token in DistilBERT is notable for its continuing rise in correlation strength across layers. This suggests that DistilBERT in particular makes use of the CLS token to encode phrase information, and unlike other models, its representations retain the relevant properties to the final layer.



Figure 3: Correlation on BiRD dataset with phrases embedded in sentence context (context-available input setting).


Controlled dataset Turning to our controlled AB-BA dataset, we examine the extent to which the above correlations indicate sophisticated phrasal composition versus effective encoding of information about phrases' component words. The bottom row of Figure 2 shows the correlations on this controlled subset. We see that performance of all models drops significantly, often with roughly zero correlation. Avg-All and Avg-Phrase no longer dominate the correlations, suggesting that these representations capture word information, but not higher-level compositional information. XLM-RoBERTa and XLNet show particularly low correlations, suggesting heavier reliance on word content. Notably, the CLS tokens in RoBERTa and DistilBERT stand out with comparatively strong correlations in later layers. This suggests that the rise that we see in CLS correlations for DistilBERT in particular may correspond to some real compositional signal in this token, and for this model the CLS token may in fact correspond to something more like a representation of the meaning of the full input sequence. The Avg-Phrase representation for RoBERTa also makes a comparatively strong showing.

Including sentence context Figure 3 shows the correlations when target phrases are embedded as part of a sentence context, rather than in isolation. As can be expected, Avg-Phrase is now consistently the highest in correlation on the full dataset—other tokens are presumably more impacted by the presence of additional words in the context. We also see that the Avg-Phrase correlations no longer drop so dramatically in later layers, suggesting that when given full sentence inputs, models retain more word properties in later layers than when given only phrases. This general trend holds also for Avg-All and Head-Word representations.

In the AB-BA setting, we see that presence of context does boost overall correlation with human judgment. Of note is XLM-RoBERTa's Avg-Phrase, which without sentence context has zero correlation in the AB-BA setting, but which with sentence context reaches our highest observed AB-BA correlations in its final layers. However, even with context, the strongest correlation across models is still less than 0.3. It is still the case, then, that correlation on the controlled data degrades significantly relative to the full dataset. This indicates that even when phrases are input within sentence contexts, phrase representations in transformers reflect heavy reliance on word content, largely missing additional nuances of compositional phrase meaning.

6.2 Paraphrase classification

Full dataset Results for our full paraphrase classification dataset, with phrase-only inputs, are shown in the top row of Figure 4. Accuracies are overall very high, and we see generally similar patterns to the correlation tasks. Best accuracy is achieved by using Avg-Phrase and Avg-All representations. RoBERTa, XLM-RoBERTa, and XLNet show decreasing performance for top-performing representations in later layers, while BERT and DistilBERT remain more consistent across layers. Performance of CLS requires a few layers to peak, with top performance around middle layers, and in some models shows poor performance in later layers. SEP shows unstable performance compared to other representations, especially in DistilBERT and RoBERTa.


Figure 4: Classification accuracy on PPDB dataset (phrase-only input setting). First row shows classification accuracy on the original dataset, and second row shows accuracy on the controlled dataset.

Figure 5: Classification accuracy on PPDB dataset with phrases embedded in sentence context. First row shows classification accuracy on the original dataset, and second row shows accuracy on the controlled dataset.


Controlled dataset The bottom row of Figure 4 shows classification accuracy when word overlap is held constant. Consistent with the drop in correlations on the controlled AB-BA experiments above, classification performance of all models drops down to only slightly above chance performance of 50%. This suggests that the high classification performance on the full dataset relies largely on word overlap information, and that there is little higher-level phrase meaning information to aid classification in the absence of the overlap cue. We see in some cases a very slight trend such that classification accuracy increases a bit toward middle layers—so to the extent that there is any compositional phrase information being captured, it may increase within representations in the middle layers. Overall, the consistency of these results with those of the correlation analysis suggests that the apparent lack of accurate compositional meaning information in our tested phrase representations is not simply a result of cosine correlations being inappropriate for picking up on correspondences.

Including sentence context Figure 5 shows the classification results for representations of phrases embedded in sentence contexts. The patterns largely align with our observations from the correlation task. Performance on the full dataset is still high, with Avg-Phrase now showing consistently highest performance, being least influenced by the presence of new context words. In the controlled setting, we see the same substantial drop in performance relative to the full dataset—there is very slight improvement over the phrase-only representations, but the highest accuracy among all models is still around 0.6. Thus, the inclusion of sentence context again does not provide any additional evidence for sophisticated compositional meaning information in the tested phrase representations.


            horse ran    color ran
gallop      POS          NEG
dissolve    NEG          POS

Table 3: An example of the landmark experiment for the verb “ran”. Representations are expected to have higher cosine similarity between the phrase and the landmark word marked “POS”.


7 Qualitative analysis: sense disambiguation

The above analyses rely on testing models' sensitivity to meaning similarity between two phrases. In this section we complement these analyses with another test aimed at assessing phrasal composition: testing models' ability to select the correct senses of polysemous words in a composed phrase, as proposed by Kintsch (2001). Each test item consists of a) a central verb, b) two subject-verb phrases that pick out different senses of the verb, and c) two landmark words, each associating with one of the target senses of the verb. Table 3 shows an example with central verb “ran” and phrases “horse ran”/“color ran”. The corresponding landmark words are “gallop”, which associates with “horse ran”, and “dissolve”, which associates with “color ran”. The reasoning is that composition should select the correct verb meaning, shifting representations of the central verbs—and of the phrase as a whole—toward landmarks with closer meaning. For this example, models should produce phrase embeddings such that “horse ran” is closer to “gallop” and “color ran” is closer to “dissolve”. We use the items introduced in Kintsch (2001), which consist of a total of 4 sets of landmark tests. We feed landmarks and phrases respectively through each transformer, without context, to generate corresponding representations $p^i_{rep}$ for each layer $l_i$ and representation type $rep$. Cosine similarity between each phrase-landmark pair is computed and compared against expected similarities.
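The per-item decision reduces to a cosine comparison between the phrase representation and the two landmark representations, as in this minimal sketch (variable names are illustrative, not from the released code):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def landmark_correct(phrase_rep, correct_landmark_rep, incorrect_landmark_rep) -> bool:
    """True if the phrase representation is closer (by cosine) to the landmark word
    for the intended verb sense than to the landmark for the other sense,
    e.g. the representation of 'horse ran' closer to 'gallop' than to 'dissolve'."""
    return cosine(phrase_rep, correct_landmark_rep) > cosine(phrase_rep, incorrect_landmark_rep)
```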

Figure 6 shows the percentage of phrases that fall closer to the correct landmark word than to the incorrect one, averaged over 16 phrase-landmark word pairs. We see strong overall performance across models, suggesting that the information needed for this task is successfully captured by these models' representations. Additionally, we see that the patterns largely mirror the results above for correlation and classification on uncontrolled datasets. Particularly, Avg-Phrase and Avg-All show comparatively strong performance across models. RoBERTa and XLNet show stronger performance in early layers, dropping off in later layers, while BERT and DistilBERT show more consistency across layers. XLM-RoBERTa and XLNet show lower performance overall.

For this verb sense disambiguation analysis, the Head-Word token is of note because it corresponds to the central verb of interest, so its sense can only be distinguished by its combination with the other word of the phrase. XLM-RoBERTa has the weakest performance with Head-Word, while BERT and DistilBERT demonstrate strong disambiguation with this token. As for the CLS token, RoBERTa produces the highest quality representation at layer 1, and BERT outperforms other models starting from layer 6, with DistilBERT also showing strong performance across layers.

Notably, the observed parallels to our correlation and classification results are in alignment with the uncontrolled rather than the controlled versions of those tests. So while these parallels lend further credence to the general observations that we make about phrase representation patterns across models, layers, and representation types, it is worth noting that these landmark composition tests may be susceptible to lexical effects similar to those controlled for above. Since these test items are too few to filter with the above methods, we leave in-depth investigation of this question to future work.

8 Discussion

The analyses reported above yield two primary takeaways. First, they shed light on the nature of these models' phrase representations, and the extent to which they reflect word content versus phrasal composition. At many points in these models there is non-trivial alignment with human judgments of phrase similarity, paraphrase classification, and verb sense selection. However, when we control our correlation and classification tests to remove the cue of word overlap, we see little evidence that the representations reflect sophisticated phrase composition beyond what can be gleaned from word content. While we see strong performance on classic sense selection items designed to test phrase composition, the observed results largely parallel those from the uncontrolled versions of the correlation and classification analyses, suggesting that success on this landmark test may reflect lexical properties more than sophisticated composition. Given the importance of systematic meaning composition for robust and flexible language understanding, based on these results we predict that we will see corresponding weaknesses as more tests emerge for these models' handling of subtle meaning differences in downstream tasks.


Figure 6: Landmark experiments. The y-axis denotes the percentage of samples that are shifted towards the correct landmark words in each layer. Missing bars occur when representations are independent of input at layer 0, such that cosine similarity between phrases and landmarks will always be 1.


Our systematic examination of models, layers and representation types yields a second takeaway in the form of practical implications for selecting and extracting representations from these models. For faithful representations of word content, Avg-Phrase is generally the strongest candidate. If only the phrase is embedded, drawing from earlier layers is best in RoBERTa, XLM-RoBERTa, and XLNet, while middle layers are better in BERT, and later layers in DistilBERT. If the phrase is input as part of a sentence, middle layers are generally best across models. Though the CLS token is often interpreted to represent a full input sequence, we find it to be a poor phrase representation even with phrase-only input, with the notable exception of the final layer of DistilBERT.

As for representations that reflect true phrase meaning composition, we have established that such representations may not currently be available in these models. However, to the extent that we do see weak evidence of potential compositional meaning sensitivity, this appears to be strongest in DistilBERT's CLS token in final layers, in RoBERTa's Avg-Phrase representation in later layers, and in XLM-RoBERTa's Avg-Phrase representation from later layers only when the phrase is contained within a sentence context.

9 Conclusions and future directions

We have systematically investigated the nature of phrase representations in state-of-the-art transformers. Teasing apart sensitivity to word content versus phrase meaning composition, we find strong sensitivity across models when it comes to word content encoding, but little evidence of sophisticated phrase composition. The observed sensitivity patterns across models, layers, and representation types shed light on practical considerations for extracting phrase representations from these models.

Future work can apply these tests to a broader range of models, and continue to develop controlled tests that target encoding of complex compositional meanings, both for two-word phrases and for larger meaning units. We hope that our findings will stimulate further work on leveraging the power of these generalized transformers while improving their capacity to capture compositional meaning.

Acknowledgments

We would like to thank three anonymous reviewers for valuable questions and suggestions for improving this paper. We also thank members of the University of Chicago CompLing Lab, and the Toyota Technological Institute at Chicago, for helpful comments and feedback on this work. This material is based upon work supported by the National Science Foundation under Award No. 1941160.


References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2016. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. arXiv preprint arXiv:1608.04207.

Shima Asaadi, Saif Mohammad, and Svetlana Kiritchenko. 2019. Big bird: A large, fine-grained, bigram relatedness dataset for examining semantic composition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 505–516.

Joris Baan, Jana Leible, Mitja Nikolaus, David Rau, Dennis Ulmer, Tim Baumgartner, Dieuwke Hupkes, and Elia Bruni. 2019. On the realization of compositionality in neural networks. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 127–137.

Geoff Bacon and Terry Regier. 2019. Does bert agree? evaluating knowledge of structure dependence through agreement relations. arXiv preprint arXiv:1908.09892.

Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1183–1193. Association for Computational Linguistics.

Shammur Absar Chowdhury and Roberto Zamparelli. 2018. RNN simulations of grammaticality judgments on long-distance dependencies. In Proceedings of the 27th International Conference on Computational Linguistics, pages 133–144.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What does bert look at? an analysis of bert's attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau and Douwe Kiela. 2018. Senteval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136.

Ishita Dasgupta, Demi Guo, Andreas Stuhlmuller, Samuel J Gershman, and Noah D Goodman. 2018. Evaluating compositionality in sentence embeddings. arXiv preprint arXiv:1802.04302.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A Smith. 2016. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209.

Allyson Ettinger, Ahmed Elgohary, Colin Phillips, and Philip Resnik. 2018. Assessing composition in sentence vector representations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1790–1801.

Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134–139.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, pages 406–414.

Richard Futrell, Ethan Wilcox, Takashi Morita, Peng Qian, Miguel Ballesteros, and Roger Levy. 2019. Neural language models as psycholinguistic subjects: Representations of syntactic state. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 32–42.

Alona Fyshe, Leila Wehbe, Partha Talukdar, Brian Murphy, and Tom Mitchell. 2015. A compositional and interpretable semantic space. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 32–41.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. Ppdb: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764.


Daniela Gerz, Ivan Vulic, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. Simverb-3500: A large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2173–2182.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

John Hewitt and Christopher D Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.

Dieuwke Hupkes, Anand Singh, Kris Korrel, German Kruszewski, and Elia Bruni. 2018. Learning compositionally through attentive guidance. arXiv preprint arXiv:1805.09657.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does bert learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657.

Jaap Jumelet, Willem Zuidema, and Dieuwke Hupkes. 2019. Analysing neural language models: Contextual decomposition reveals default reasoning in number and gender assignment. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 1–11.

Yoon Kim, Alexander M Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gabor Melis. 2019. Unsupervised recurrent neural network grammars. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1105–1117.

Walter Kintsch. 2001. Predication. Cognitive Science, 25(2):173–202.

Josef Klafka and Allyson Ettinger. 2020. Spying on your neighbors: Fine-grained probing of contextual embeddings for information about surrounding words. arXiv preprint arXiv:2005.01810.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Adam Liska, German Kruszewski, and Marco Baroni. 2018. Memorize or generalize? searching for a compositional rnn in a haystack. arXiv preprint arXiv:1802.06467.

Nelson F Liu, Matt Gardner, Yonatan Belinkov, Matthew E Peters, and Noah A Smith. 2019a. Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1073–1094.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Tom McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3428–3448.

Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In Proceedings of ACL-08: HLT, pages 236–244.

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

W James Murdoch, Peter J Liu, and Bin Yu. 2018. Beyond word importance: Contextual decomposition to extract interactions from lstms. In International Conference on Learning Representations.

Navnita Nandakumar, Timothy Baldwin, and Bahar Salehi. 2019. How well do embedding models capture non-compositionality? a view from multiword expressions. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP, pages 27–34.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. Ppdb 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 425–430.


Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/languageunderstanding paper.pdf.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Naomi Saphra and Adam Lopez. 2020. Word interdependence exposes how lstms compose representations. arXiv preprint arXiv:2004.13195.

Vered Shwartz and Ido Dagan. 2019. Still a pain in the neck: Evaluating text representations on lexical composition. Transactions of the Association for Computational Linguistics, 7:403–419.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019a. Bert rediscovers the classical nlp pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel Bowman, Dipanjan Das, et al. 2019b. What do you learn from context? probing for sentence structure in contextualized word representations. In 7th International Conference on Learning Representations, ICLR 2019.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Jesse Vig and Yonatan Belinkov. 2019. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. Glue: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198.

Ethan Wilcox, Roger Levy, Takashi Morita, and Richard Futrell. 2018. What do RNN language models learn about filler–gap dependencies? In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 211–221.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. Huggingface's transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.

Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019a. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of EMNLP.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019b. Xlnet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pages 5754–5764.

Da Yin, Tao Meng, and Kai-Wei Chang. 2020. Sentibert: A transferable transformer-based architecture for compositional sentiment semantics.

Yuan Zhang, Jason Baldridge, and Luheng He. 2019. PAWS: Paraphrase adversaries from word scrambling. In Proceedings of NAACL.