
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 678–687, Vancouver, Canada, July 30 – August 4, 2017. ©2017 Association for Computational Linguistics

https://doi.org/10.18653/v1/P17-1063


Generating Contrastive Referring Expressions

Martín Villalba and Christoph Teichmann and Alexander Koller
Department of Language Science and Technology

Saarland University, Germany
{villalba|cteichmann|koller}@coli.uni-saarland.de

Abstract

The referring expressions (REs) produced by a natural language generation (NLG) system can be misunderstood by the hearer, even when they are semantically correct. In an interactive setting, the NLG system can try to recognize such misunderstandings and correct them. We present an algorithm for generating corrective REs that use contrastive focus (“no, the BLUE button”) to emphasize the information the hearer most likely misunderstood. We show empirically that these contrastive REs are preferred over REs without contrast marking.

1 Introduction

Interactive natural language generation (NLG) systems face the task of detecting when they have been misunderstood, and reacting appropriately to fix the problem. For instance, even when the system generated a semantically correct referring expression (RE), the user may still misunderstand it, i.e. resolve it to a different object from the one the system intended. In an interactive setting, such as a dialogue system or a pedestrian navigation system, the system can try to detect such misunderstandings – e.g. by predicting what the hearer understood from their behavior (Engonopoulos et al., 2013) – and to produce further utterances which resolve the misunderstanding and get the hearer to identify the intended object after all.

When humans correct their own REs, they routinely employ contrastive focus (Rooth, 1992; Krifka, 2008) to clarify the relationship to the original RE. Say that we originally described an object b as “the blue button”, but the hearer approaches a button b′ which is green, thus providing evidence that they misunderstood the RE to mean b′. In this case, we would like to say “no, the BLUE button”, with the contrastive focus realized by an appropriate pitch accent on “BLUE”. This utterance alerts the hearer to the fact that they misunderstood the original RE; it reiterates the information from the original RE; and it marks the attribute “blue” as a salient difference between b′ and the object the original RE was intended to describe.

In this paper, we describe an algorithm for generating REs with contrastive focus. We start from the modeling assumption that misunderstandings arise because the RE r_s the system uttered was corrupted by a noisy channel into an RE r_u which the user “heard” and then resolved correctly; in the example above, we assume the user literally heard “the green button”. We compute this (hypothetical) RE r_u as the RE which refers to b′ and has the lowest edit distance from r_s. Based on this, we mark the contrastive words in r_s, i.e. we transform “the blue button” into “the BLUE button”. We evaluate our system empirically on REs from the GIVE Challenge (Koller et al., 2010) and the TUNA Challenge (van der Sluis et al., 2007), and show that the contrastive REs generated by our system are preferred over a number of baselines.

The paper is structured as follows. We first review related work in Section 2 and define the problem of generating contrastive REs in Section 3. Section 4 sketches the general architecture for RE generation on which our system is based. In Section 5, we present the corruption model and show how to use it to reconstruct r_u. Section 6 describes how we use this information to generate contrastive markup in r_s, and in Section 7 we evaluate our approach.

2 Related Work

The notion of focus has been extensively studied in the literature on theoretical semantics and pragmatics; see e.g. Krifka (2008) and Rooth (1997) for overview papers. Krifka follows Rooth (1992) in taking focus as “indicat(ing) the presence of alternatives that are relevant for the interpretation of linguistic expressions”; focus then establishes a contrast between an object and these alternatives. Bornkessel and Schlesewsky (2006) find that corrective focus can even override syntactic requirements, on the basis of “its extraordinarily high communicative saliency”. This literature is purely theoretical; we offer an algorithm for automatically generating contrastive focus.

In speech, focus is typically marked through intonation and pitch accents (Levelt, 1993; Pierrehumbert and Hirschberg, 1990; Steube, 2001), while concepts that can be taken for granted are deaccented and/or deleted. Developing systems which realize precise pitch contours for focus in text-to-speech settings is an ongoing research effort. We therefore realize focus in written language in this paper, by capitalizing the focused word. We also experiment with deletion of background words.

There is substantial previous work on interactive systems that detect and respond to misunderstandings. Misu et al. (2014) present an error analysis of an in-car dialogue system which shows that more than half the errors can only be resolved through further clarification dialogues, as opposed to better sensors and/or databases; that is, by improved handling of misunderstandings. Engonopoulos et al. (2013) detect misunderstandings of REs in interactive NLG through the use of a statistical model. Their model also predicts the object to which a misunderstood RE was incorrectly resolved. Moving from misunderstanding detection to error correction, Zarrieß and Schlangen (2016) present an interactive NLG algorithm which is capable of referring in installments, in that it can generate multiple REs that are designed to correct misunderstandings of earlier REs to the same object. The interactive NLG system developed by Akkersdijk et al. (2011) generates both reflective and anticipative feedback based on what a user does and sees. Their error detection and correction strategy distinguishes a fixed set of possible situations where feedback is necessary, and defines custom, hard-coded RE generation sub-strategies for each one. None of these systems generate REs marked for focus.

We are aware of two items of previous work that address the generation of contrastive REs directly. Milosavljevic and Dale (1996) outline strategies for generating clarificatory comparisons in encyclopedic descriptions. Their surface realizer can generate contrastive REs, but the attributes that receive contrastive focus have to be specified by hand. Krahmer and Theune (2002) extend the Incremental Algorithm (Dale and Reiter, 1995) so it can mark attributes as contrastive. This is a fully automatic algorithm for contrastive REs, but it inherits all the limitations of the Incremental Algorithm, such as its reliance on a fixed attribute order. Neither of these two approaches evaluates the quality of the contrastive REs it generates.

Finally, some work has addressed the issue of generating texts that realize the discourse relation contrast. For instance, Howcroft et al. (2013) show how to choose contrastive discourse connectives (but, while, . . . ) when generating restaurant descriptions, thus increasing human ratings for naturalness. Unlike their work, the research presented in this paper is not about discourse relations, but about assigning focus in contrastive REs.

3 Interactive NLG

We start by introducing the problem of generating corrective REs in an interactive NLG setting. We use examples from the GIVE Challenge (Koller et al., 2010) throughout the paper; however, the algorithm itself is domain-independent.

GIVE is a shared task in which an NLG system (the instruction giver, IG) must guide a human user (the instruction follower, IF) through a virtual 3D environment. The IF needs to open a safe and steal a trophy by clicking on a number of buttons in the right order without triggering alarms. The job of the NLG system is to generate natural-language instructions which guide the IF to complete this task successfully.

The generation of REs has a central place in the GIVE Challenge because the system frequently needs to identify buttons in the virtual environment to the IF. Figure 1 shows a screenshot of a GIVE game in progress; here b1 and b4 are blue buttons, b2 and b3 are yellow buttons, and w1 is a window. If the next button the IF needs to press is b4 – the intended object, o_s – then one good RE for b4 would be “the blue button below the window”, and the system should utter:

(1) Press the blue button below the window.

Figure 1: Example scene from the GIVE Challenge.

After uttering this sentence, the system can track the IF’s behavior to see whether the IF has understood the RE correctly. If the wrong button is pressed, or if a model of the IF’s behavior suggests that they are about to press the wrong button (Engonopoulos et al., 2013), the original RE has been misunderstood. However, the system still gets a second chance, since it can utter a corrective RE, with the goal of identifying b4 to the IF after all. Examples include simply repeating the original RE, or generating a completely new RE from scratch. The system can also explicitly take into account which part of the original RE the IF misunderstood. If it has reason to believe that the IF resolved the RE to b3, it could say:

(2) No, the BLUE button below the window.

This use of contrastive focus distinguishes the attributes the IF misunderstood (blue) from those that they understood correctly (below the window), and thus makes it easier for the IF to resolve the misunderstanding. In speech, contrastive focus would be realized with a pitch accent; we approximate this accent in written language by capitalizing the focused word. We call an RE that uses contrastive focus to highlight the difference between the misunderstood and the intended object a contrastive RE. The aim of this paper is to present an algorithm for computing contrastive REs.

4 Generating Referring Expressions

While we make no assumptions on how the original RE r_s was generated, our algorithm for reconstructing the corrupted RE r_u requires an RE generation algorithm that can represent all semantically correct REs for a given object compactly in a chart. Here we sketch the RE generation approach of Engonopoulos and Koller (2014), which satisfies this requirement.

Figure 2: Example syntax tree for an RE for b4, deriving the string “the blue button below the window”; each node label, e.g. N_{b4,{b1,b4}}, records a syntactic category, the referent being described, and the set of objects to which the subtree refers.

This algorithm assumes a synchronous grammar which relates strings with the sets of objects they refer to. Strings and their referent sets are constructed in parallel from lexicon entries and grammar rules; each grammar rule specifies how the referent set of the parent is determined from those of the children. For the scene in Figure 1, we assume lexicon entries which express, among other things, that the word “blue” denotes the set {b1, b4} and the word “below” denotes the relation {(w1, b1), (w1, b2), (b3, w1), (b4, w1)}. We combine these lexicon entries using rules such as

N → button() | button | {b1, b2, b3, b4},

which generates the string “button” and associates it with the set of all buttons, or

N → N_1(N, PP) | w_1 • w_2 | R_1 ∩ R_2,

which states that a phrase of type noun can be combined with a prepositional phrase, concatenating their strings (w_1 • w_2) and intersecting their denotations (R_1 ∩ R_2). Using these rules we can determine that “the window” denotes {w1}, that “below the window” can refer to {b3, b4}, and that “blue button below the window” uniquely refers to {b4}. The syntax tree in Fig. 2 represents a complete derivation of an RE for {b4}.
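To make the denotation computation concrete, here is a minimal Python sketch for the Figure 1 scene; it is our own toy re-implementation of the lexicon and the intersection rule described above, not the authors’ system:

    # Lexicon for the scene in Figure 1, as assumed in the text.
    LEX = {
        "button": {"b1", "b2", "b3", "b4"},
        "window": {"w1"},
        "blue":   {"b1", "b4"},
    }
    # The denotation of "below": a pair (x, y) means "x is below y".
    BELOW = {("w1", "b1"), ("w1", "b2"), ("b3", "w1"), ("b4", "w1")}

    def below(referents):
        # "below Y" denotes every x that stands in the BELOW relation
        # to some object y in the denotation of Y.
        return {x for (x, y) in BELOW if y in referents}

    def n_pp(n_set, pp_set):
        # Rule "N -> N_1(N, PP) | w_1 . w_2 | R_1 ∩ R_2":
        # the denotations of the two children are intersected.
        return n_set & pp_set

    # "blue button below the window" uniquely refers to b4:
    print(n_pp(LEX["blue"] & LEX["button"], below(LEX["window"])))  # {'b4'}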

Figure 3: The corruption model. The IG utters the RE r_s = “the blue button below the window” for the intended object o_s = b4; the corruption turns it into the heard RE r_u = “the yellow button above the window”, which the IF resolves to the object o_u = b2.

The algorithm of Engonopoulos and Koller computes a chart which represents the set of all possible REs for a given set of input objects, such as {b4}, according to the grammar. This is done by building a chart containing all derivations of the grammar which correspond to the desired set. They represent this chart as a finite tree automaton (Comon et al., 2007). Here we simply write the chart as a context-free grammar; the strings produced by this context-free grammar are then exactly the REs for the intended object. For example, the syntax tree in Fig. 2 is generated by the parse chart for the set {b4}. Its nonterminal symbols consist of three parts: a syntactic category

(given by the synchronous grammar), the referent for which an RE is currently being constructed, and the set of objects to which the entire subtree refers. The grammar may include recursion and therefore allow for an infinite set of possible REs. If it is weighted, one can use the Viterbi algorithm to compute the best RE from the chart.
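To illustrate how a best RE can be read off such a chart, the following sketch writes a small chart as a weighted context-free grammar and extracts the minimal-weight string with a Viterbi-style recursion. The chart contents, nonterminal names and weights are invented for illustration, and we assume an acyclic chart (the real construction handles recursion via weighted tree automata):

    from functools import lru_cache

    # A toy RE chart written as a weighted CFG: nonterminal -> [(weight, rhs)].
    # In the real chart, each nonterminal encodes a (category, referent,
    # denotation) triple, abbreviated here to short names.
    CHART = {
        "NP_b4":     [(0.0, ["the", "N_b4"])],
        "N_b4":      [(0.5, ["N_blue_b4", "PP_b4"])],
        "N_blue_b4": [(0.25, ["blue", "N_buttons"]), (1.0, ["blue", "button"])],
        "N_buttons": [(0.0, ["button"])],
        "PP_b4":     [(0.0, ["below", "the", "window"])],
    }

    @lru_cache(maxsize=None)
    def viterbi(sym):
        """Minimal-weight string derivable from sym (assumes an acyclic chart)."""
        if sym not in CHART:              # terminal symbol: yield the word itself
            return (0.0, sym)
        best = None
        for weight, rhs in CHART[sym]:
            total, words = weight, []
            for child in rhs:
                w, s = viterbi(child)
                total += w
                words.append(s)
            cand = (total, " ".join(words))
            if best is None or cand < best:
                best = cand
        return best

    print(viterbi("NP_b4"))   # (0.75, 'the blue button below the window')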

5 Listener Hypotheses and Edit Distance

5.1 Corruption model

Now let us say that the system has generated and uttered an RE r_s with the intention of referring to the object o_s, but it has then found that the IF has misunderstood the RE and resolved it to another object, o_u (see Fig. 3). We assume for the purposes of this paper that such a misunderstanding arises because r_s was corrupted by a noisy channel when it was transmitted to the IF, and the IF “heard” a different RE, r_u. We further assume that the IF then resolved r_u correctly, i.e. the corruption in the transmission is the only source of misunderstandings.

In reality, there are of course many other reasons why the IF might misunderstand r_s, such as lack of attention, discrepancies in the lexicon or the world model of the IG and IF, and so on. We make a simplifying assumption in order to make the misunderstanding explicit at the level of the RE strings, while still permitting meaningful corrections for a large class of misunderstandings.

An NLG system that builds upon this idea in order to generate a corrective RE has access to the values of o_s, r_s and o_u; but it needs to infer the most likely corrupted RE r_u. To do this, we model the corruption using the edit operations used for the familiar Levenshtein edit distance (Mohri, 2003) over the alphabet Σ: S_a, substitution of a word with a symbol a ∈ Σ; D, deletion of a word; I_a, insertion of the symbol a ∈ Σ; or K, keeping the word. The noisy channel passes over each word in r_s and applies either D, K or one of the S operations to it. It may also apply I operations before or after a word. We call any sequence s of edit operations that could apply to r_s an edit sequence for r_s.

An example of an edit sequence which corrupts r_s = “the blue button below the window” into r_u = “the yellow button above the window” is shown in Figure 4. The same r_u could also have been generated by the edit operation sequence K S_yellow K S_above K K, and there is generally a large number of edit sequences that could transform between any two REs. If an edit sequence s maps x to y, we write apply(s, x) = y.
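These operations are straightforward to implement. Here is a sketch of apply(s, x), using our own encoding in which K and D are bare symbols and S and I carry the new word:

    C = 1.0   # uniform cost of each S, I and D operation; K is free

    def apply_edits(seq, words):
        """Apply an edit sequence to a list of words; returns (result, cost)."""
        out, cost, i = [], 0.0, 0
        for op in seq:
            if op[0] == "I":              # insertion consumes no input word
                out.append(op[1])
                cost += C
            elif op[0] == "K":            # keep the next input word
                out.append(words[i])
                i += 1
            elif op[0] == "S":            # substitute the next input word
                out.append(op[1])
                i += 1
                cost += C
            elif op[0] == "D":            # delete the next input word
                i += 1
                cost += C
        assert i == len(words), "an edit sequence must cover every input word"
        return out, cost

    rs = "the blue button below the window".split()
    s = ["K", "D", ("I", "yellow"), "K", ("S", "above"), "K", "K"]
    print(apply_edits(s, rs))
    # (['the', 'yellow', 'button', 'above', 'the', 'window'], 3.0)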

We can now define a probability distribution P(s | r_s) over edit sequences s that the noisy channel might apply to the string r_s, as follows:

    P(s | r_s) = (1/Z) · ∏_{s_i ∈ s} exp(−c(s_i)),

where c(s_i) is the cost of using the edit operation s_i. We set c(K) = 0, and for any a in our alphabet we set c(S_a) = c(I_a) = c(D) = C, for some fixed C > 0. For instance, the edit sequence in Figure 4 contains one D, one I and one S operation, so its probability is exp(−3C)/Z. Z is a normalizing constant which is independent of s and ensures that the probabilities sum to 1. It is finite for sufficiently high values of C, because no edit sequence for r_s can ever contain more K, S and D operations than there are words in r_s, and the total weight of sequences generated by adding more and more I operations will converge.

Finally, let L be the set of referring expressions that the IF would resolve to o_u, i.e. the set of candidates for r_u. Then the most probable edit sequence for r_s which generates an r_u ∈ L is given by

    s* = argmax_{s : apply(s, r_s) ∈ L} P(s | r_s)
       = argmin_{s : apply(s, r_s) ∈ L} Σ_{s_i ∈ s} c(s_i),

i.e. s* is the edit sequence that maps r_s to an RE in L with minimal cost. We will assume that s* is the edit sequence that corrupted r_s, i.e. that r_u = apply(s*, r_s).
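When the candidate set L is finite and small enough to enumerate, the minimization can be done directly with a standard Levenshtein-style dynamic program against each candidate. The sketch below (our own simplification, not the paper’s chart-based method described next in Section 5.2) finds the r_u ∈ L with minimal edit cost from r_s; recovering s* itself would additionally require backtracking through the table:

    def min_edit_cost(src, tgt, C=1.0):
        """Minimal total edit cost mapping word list src to word list tgt
        (cost C for each S, I, D operation; cost 0 for K)."""
        n, m = len(src), len(tgt)
        d = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = i * C                      # delete all remaining words
        for j in range(1, m + 1):
            d[0][j] = j * C                      # insert all remaining words
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                keep_or_sub = d[i - 1][j - 1] + (0.0 if src[i - 1] == tgt[j - 1] else C)
                d[i][j] = min(keep_or_sub, d[i - 1][j] + C, d[i][j - 1] + C)
        return d[n][m]

    rs = "the blue button below the window".split()
    candidates = ["the yellow button above the window".split(),
                  "the yellow button".split()]
    ru = min(candidates, key=lambda cand: min_edit_cost(rs, cand))
    print(" ".join(ru))   # the yellow button above the window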

5.2 Finding the most likely corruption

It remains to compute s*; we will then show in Section 6 how it can be used to generate a corrective RE. Attempting to find s* by enumeration is impractical, as the set of edit sequences for a given r_s and r_u may be large, and the set of possible r_u for a given o_u may be infinite. Instead, we will use the algorithm from Section 4 to compute a chart for all the possible REs for o_u, represented as a context-free grammar G whose language L = L(G) consists of these REs. We will then intersect it with a finite-state automaton which keeps track of the edit costs, obtaining a second context-free grammar G′. These operations can be performed efficiently, and s* can be read off the minimum-cost syntax tree of G′.

Figure 4: Example edit sequence for a given corruption: the edit operation sequence K D I_yellow K S_above K K maps r_s = “the blue button below the window” to r_u = “the yellow button above the window”.

Edit automaton. The possible edit sequences for a given r_s can be represented compactly in the form of a weighted finite-state automaton F(r_s) (Mohri, 2003). Each run of the automaton on a string w corresponds to a specific edit sequence that transforms r_s into w, and the sum of the transition weights of the run is the cost of that edit sequence. We call F(r_s) the edit automaton. It has a state q_i for every position i in r_s; the start state is q_0 and the final state is q_{|r_s|}. For each i, it has a “keep” transition from q_i to q_{i+1} that reads the word at position i with cost 0. In addition, there are transitions from q_i to q_{i+1} with cost C that read any symbol in Σ (for substitution) and ones that read the empty string ε (for deletion). Finally, there is a loop with cost C from each q_i to itself for any symbol in Σ, implementing insertion.
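The construction of F(r_s) is mechanical; here is a sketch that builds it as a plain transition list (our own encoding, with ANY standing for an arbitrary symbol of Σ and EPS for ε):

    ANY, EPS = "<any>", ""
    C = 1.0

    def edit_automaton(rs):
        """Weighted FSA over all edit sequences for rs.
        Each transition is a tuple (state, read, cost, next_state)."""
        trans = []
        for i, word in enumerate(rs):
            trans.append((i, word, 0.0, i + 1))   # K: keep word i
            trans.append((i, ANY, C, i + 1))      # S: substitute word i
            trans.append((i, EPS, C, i + 1))      # D: delete word i
            trans.append((i, ANY, C, i))          # I: insert before word i
        n = len(rs)
        trans.append((n, ANY, C, n))              # I: insert at the end
        return trans, 0, n                        # transitions, start, final

    rs = "the blue button below the window".split()
    fsa, q0, qf = edit_automaton(rs)
    print(len(fsa), q0, qf)   # 25 transitions, start state 0, final state 6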

An example automaton for r_s = “the blue button below the window” is shown in Figure 5. The transitions are written in the form ⟨word in w : associated cost⟩. Note that every path through the edit automaton corresponds to a specific edit sequence s, and the sum of the costs along the path corresponds to −log P(s | r_s) − log Z.

Combining G and F(r_s). Now we can combine G with F(r_s) to obtain G′, by intersecting them using the Bar-Hillel construction (Bar-Hillel et al., 1961; Hopcroft and Ullman, 1979). For the purposes of our presentation we assume that G is in Chomsky Normal Form, i.e. all rules have the form A → a, where a is a word, or A → B C, where both symbols on the right-hand side are nonterminals. The resulting grammar G′ uses nonterminal symbols of the form N_{b,A,⟨q_i,q_k⟩}, where b, A are as in Section 4, and q_i, q_k indicate that the string derived by this nonterminal was generated by editing the substring of r_s from position i to k.

Let N_{b,A} → a be a production rule of G with a word a on the right-hand side; as explained above, b is the object to which the subtree should refer, and A is the set of objects to which the subtree actually might refer. Let t be a transition in F(r_s) from state q_i to state q_k that reads a with edit cost c. From these two, we create a context-free rule N_{b,A,⟨q_i,q_k⟩} → a with weight c and add it to G′. If k = i + 1, these rules represent K and S operations; if k = i, they represent insertions.

Now let N_{b,A} → X_{b1,A1} Y_{b2,A2} be a binary rule in G, and let q_i, q_j, q_k be states of F(r_s) with i ≤ j ≤ k. We then add a rule N_{b,A,⟨q_i,q_k⟩} → X_{b1,A1,⟨q_i,q_j⟩} Y_{b2,A2,⟨q_j,q_k⟩} to G′. These rules are assigned weight 0, as they only combine words according to the grammar structure of G and do not encode any edit operations.

Finally, we deal with deletion. Let N_{b,A} be a nonterminal symbol in G and let q_h, q_i, q_j, q_k be states of F(r_s) with h ≤ i ≤ j ≤ k. We then add a rule N_{b,A,⟨q_h,q_k⟩} → N_{b,A,⟨q_i,q_j⟩} to G′. This rule deletes the substrings from positions h to i and j to k from r_s; thus we assign it the cost ((i − h) + (k − j)) · C, i.e. the cost of the corresponding ε transitions.

If the start symbol of G is S_{b,A}, then the start symbol of G′ is S_{b,A,⟨q_0,q_{|r_s|}⟩}. This construction intersects the languages of G and F(r_s), but because F(r_s) accepts all strings over the alphabet, the languages of G′ and G will be the same (namely, all REs for o_u). However, the weights in G′ are inherited from F(r_s); thus the weight of each RE in L(G′) is its edit cost from r_s.
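The three rule schemata translate directly into code. The following sketch uses our own simplified encoding: a bare symbol N stands for the full label N_{b,A}, a triple (N, i, k) stands for N_{b,A,⟨q_i,q_k⟩}, and rules are enumerated densely rather than efficiently:

    def intersect_rules(lex_rules, bin_rules, nonterminals, rs, C=1.0):
        """Build the weighted rules of G' from a CNF grammar and F(rs).
        lex_rules: pairs (N, a) for N -> a; bin_rules: triples (N, X, Y)."""
        n = len(rs)
        rules = []
        # Lexical rules: keep/substitute span one position (k = i+1),
        # insertions span none (k = i).
        for N, a in lex_rules:
            for i in range(n):
                rules.append(((N, i, i + 1), [a], 0.0 if rs[i] == a else C))
            for i in range(n + 1):
                rules.append(((N, i, i), [a], C))
        # Binary rules: concatenate adjacent spans, weight 0.
        for N, X, Y in bin_rules:
            for i in range(n + 1):
                for j in range(i, n + 1):
                    for k in range(j, n + 1):
                        rules.append(((N, i, k), [(X, i, j), (Y, j, k)], 0.0))
        # Deletion rules: widen a span, paying C per deleted word of rs.
        for N in nonterminals:
            for h in range(n + 1):
                for i in range(h, n + 1):
                    for j in range(i, n + 1):
                        for k in range(j, n + 1):
                            if (h, k) != (i, j):
                                rules.append(((N, h, k), [(N, i, j)],
                                              ((i - h) + (k - j)) * C))
        return rules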

Example. Fig. 6 shows an example tree for the G′ we obtain from the automaton in Fig. 5. We can read the string w = “the yellow button above the window” off of the leaves; by construction, this is an RE for o_u.


Figure 5: Edit automaton F(r_s) for r_s = “the blue button below the window”, with states q_0, . . . , q_6. Transitions are labeled ⟨word in w : associated cost⟩: between q_i and q_{i+1}, the word of r_s can be kept at cost 0, substituted by any symbol in Σ at cost C, or deleted (ε) at cost C; a loop at each state inserts any symbol in Σ at cost C.

Figure 6: A syntax tree described by G′ for w = “the yellow button above the window”, with nonterminals such as NP_{b2,{b2},⟨q_0,q_6⟩}; its associated edit sequence is s = K D I_yellow K S_above K K, and the resulting contrastive RE is “No, press the BLUE button BELOW the window”.

Furthermore, we can reconstruct the edit sequence that maps from r_s to w from the rules of G′ that were used to derive w. We can see that “yellow” was created by an insertion, because the two states of F(r_s) in the preterminal symbol just above it are the same. If the two states are different, then the word was either substituted (“above”, if the rule had weight C) or kept (“the”, if the rule had weight 0). By contrast, unary rules indicate deletions, in that they make “progress” in r_s without adding new words to w.

We can compute the minimal-cost tree of G′ using the Viterbi algorithm. Thus, to summarize, we can calculate s* from the intersection of a context-free grammar G representing the REs for o_u with the automaton F(r_s) representing the edit distance to r_s. From this, we obtain r_u = apply(s*, r_s). This is efficient in practice.

6 Generating Contrastive REs

6.1 Contrastive focus

We are now ready to generate a contrastive RE from r_s and s*. We assign focus to the words in r_s which were changed by the corruption – that is, the ones to which s* applied Substitute or Delete operations. For instance, the edit sequence in Fig. 6 deleted “blue” and substituted “below” with “above”. Thus, we mark these words with focus, and obtain the contrastive RE “the BLUE button BELOW the window”. We call this strategy Emphasis, and write r_s^E for the RE obtained by applying the Emphasis strategy to the RE r_s.
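Given s*, the Emphasis strategy is a single pass over r_s. A sketch, reusing the edit-sequence encoding from Section 5:

    def emphasis(rs, s_star):
        """Capitalize the words of rs to which s_star applied S or D."""
        out, i = [], 0
        for op in s_star:
            if op[0] == "I":          # inserted words are not part of rs
                continue
            word = rs[i]
            i += 1
            out.append(word.upper() if op[0] in ("S", "D") else word)
        return " ".join(out)

    rs = "the blue button below the window".split()
    s_star = ["K", "D", ("I", "yellow"), "K", ("S", "above"), "K", "K"]
    print(emphasis(rs, s_star))   # the BLUE button BELOW the window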

6.2 Shortening

We also investigate a second strategy, which generates more succinct contrastive REs than the Emphasis strategy. Most research on RE generation (e.g. Dale and Reiter (1995)) has assumed that hearers should prefer succinct REs, which in particular do not violate the Maxim of Quantity (Grice, 1975). When we utter a contrastive RE, the user has previously heard the RE r_s, so some of the information in r_s^E is redundant. Thus we might obtain a more succinct, and possibly better, RE by dropping such redundant information from the RE.

For the grammars we consider here, r_s^E often combines an NP and a PP, e.g. “[blue button]_NP [below the window]_PP”. If errors occur only in one of these constituents, then it might be sufficient to generate a contrastive RE using only that constituent. We call this strategy Shortening and define it as follows.

If all the words that are emphasized in r_s^E are in the NP, the Shortening RE is “the” plus the NP, with emphasis as in r_s^E. So if r_s is “the [blue button] [above the window]” and s* = K S_yellow K K K K, corresponding to an r_s^E of “the [BLUE button] [above the window]”, then the RE would be “the [BLUE button]”.

If all the emphasis in r_s^E is in the PP, we use “the one” plus the PP and again capitalize as in r_s^E. So if we have s* = K K K S_below K K, where r_s^E is “the [blue button] [ABOVE the window]”, we obtain “the one [ABOVE the window]”. If there is no PP, or if r_s^E emphasizes words in both the NP and the PP, then we just use r_s^E.
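Once the NP/PP bracketing of r_s^E is known (e.g. from its syntax tree), the decision itself is simple. A sketch with capitalization standing in for focus marking, as in the rest of the paper; the helper is_emphasized and the word-list interface are our own choices:

    def is_emphasized(word):
        return word.isupper()

    def shorten(np_words, pp_words):
        """Shortening strategy over an emphasized RE split into NP and PP."""
        emph_in_np = any(is_emphasized(w) for w in np_words)
        emph_in_pp = any(is_emphasized(w) for w in pp_words)
        if emph_in_np and not emph_in_pp:
            return ["the"] + np_words               # "the BLUE button"
        if emph_in_pp and not emph_in_np and pp_words:
            return ["the", "one"] + pp_words        # "the one ABOVE the window"
        return ["the"] + np_words + pp_words        # fall back to full RE

    print(" ".join(shorten(["BLUE", "button"], ["above", "the", "window"])))
    # the BLUE button
    print(" ".join(shorten(["blue", "button"], ["ABOVE", "the", "window"])))
    # the one ABOVE the window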


Figure 7: A sample scene from Experiment 1. The subject reads: “We wanted our player to select this button: [screenshot]. So we told them: press the red button to the right of the blue button. But they selected this button instead: [screenshot]. Which correction is better for this scene? ◦ No, press the red BUTTON to the right of the BLUE BUTTON ◦ No, press the red button to the RIGHT of the blue button”.


7 Evaluation

To test whether our algorithm for contrastive REs assigns contrastive focus correctly, we evaluated it against several baselines in crowdsourced pairwise comparison overhearer experiments. Like Buß et al. (2010), we opted for an overhearer experiment to focus our evaluation on the effects of contrastive feedback, as opposed to the challenges presented by the navigational and timing aspects of a fully interactive system.

7.1 Domains and stimuli

We created the stimuli for our experiments from two different domains. We performed a first experiment with scenes from the GIVE Challenge, while a second experiment replaced these scenes with stimuli from the “People” domain of the TUNA Reference Corpus (van der Sluis et al., 2007). This corpus consists of photographs of men annotated with nine attributes, such as whether the person has a beard, a tie, or is looking straight. Six of these attributes were included in the corpus to better reflect human RE generation strategies. Many human-generated REs in the corpus are overspecific, in that they contain attributes that are not necessary to make the RE semantically unique.

Figure 8: A sample scene from Experiment 2. The subject reads: “We wanted our player to select the person circled in green: [photo grid]. So we told them: the light haired old man in a suit looking straight. But they selected the person circled in red instead. Which correction is better for this scene? ◦ No, the light haired old man IN A SUIT LOOKING STRAIGHT ◦ No, the LIGHT HAIRED OLD man in a suit looking straight”.

We chose the GIVE environment in order to test REs referring both to attributes of an object, i.e. color, and to its spatial relation to other visible objects in the scene. The TUNA Corpus was chosen as a more challenging domain, due to the greater number of available properties for each object in a scene.

Each experimental subject was presented with screenshots containing a marked object and an RE. Subjects were told that we had previously referred to the marked object with the given RE, but an (imaginary) player misunderstood this RE and selected a different object, shown in a second screenshot. They were then asked to select which one of two corrections they considered better, where “better” was intentionally left unspecific. Figs. 7 and 8 show examples for each domain. The full set of stimuli is available as supplementary material.

To maintain annotation quality in our crowdsourcing setting, we designed test items with a clearly incorrect answer, such as REs referring to the wrong target or a nonexistent one. These test items were randomly interspersed with the real stimuli, and only subjects with a perfect score on the test items were taken into account. Experimental subjects were asked to rate up to 12 comparisons, shown in groups of 3 scenes at a time, and were automatically disqualified if they evaluated any individual scene in less than 10 seconds. The order in which the pairs of strategies were shown was randomized, to avoid effects related to the order in which they were presented on screen.

7.2 Experiment 1

Our first experiment tested four strategies against each other. Each experimental subject was presented with two screenshots of 3D scenes with a marked object and an RE (see Fig. 7 for an example). Each subject was shown a total of 12 scenes, selected at random from 16 test scenes. We collected 10 judgments for each possible combination of GIVE scene and pair of strategies, yielding a total of 943 judgments from 142 subjects after removing fake answers.

We compared the Emphasis and Shortening strategies from Section 6 against two baselines. The Repeat strategy simply presented r_s as a “contrastive” RE, without any capitalization. Comparisons to Repeat test the hypothesis that subjects prefer explicit contrastive focus. The Random strategy randomly capitalized adjectives, adverbs, and/or prepositions that were not capitalized by the Emphasis strategy. Comparisons to Random verify that any preference for Emphasis is not only due to the presence of contrastive focus, but also because our method identifies precisely where that focus should be.

Table 1a shows the results of all pairwise comparisons. For each row strategy Strat_R and each column strategy Strat_C, the table value corresponds to

    ((# Strat_R preferred over Strat_C) − (# Strat_C preferred over Strat_R)) / (# tests between Strat_R and Strat_C).

For example, if one strategy were preferred in 8 of 10 comparisons against another, the corresponding value in its row would be (8 − 2)/10 = 0.6 (these counts are illustrative, not the actual data). Significance levels are taken from a two-tailed binomial test over the counts of preferences for each strategy. We find a significant preference for the Emphasis strategy over all others, providing evidence that our algorithm assigns contrastive focus to the right words in the corrective RE.

While the Shortening strategy is numerically preferred over both baselines, the difference is not significant, and it is significantly worse than the Emphasis strategy. This is surprising, given our initial assumption that listeners prefer succinct REs. It is possible that a different strategy for shortening contrastive REs would work better; this bears further study.

7.3 Experiment 2

In our second experiment, we paired the Emphasis, Repeat, and Random strategies against each other, this time evaluating each strategy in the TUNA People domain. Due to its poor performance in Experiment 1, which was confirmed in pilot experiments for Experiment 2, the Shortening strategy was not included.

The experimental setup for the TUNA domain used 3×4 grids of pictures of people chosen at random from the TUNA Challenge, as shown in Fig. 8. We generated 8 such grids, along with REs ranging from two to five attributes and requiring one or two attributes to establish the correct contrast. The larger visual size of objects in the TUNA scenes allowed us to mark both o_s and o_u in a single picture without excessive clutter.

The REs for Experiment 2 were designed to include only attributes of the referred object, and no information about its position in relation to other objects. The benefit is twofold: we avoid taxing our subjects’ memory with extremely long REs, and we ensure that the overall length of the second set of REs is comparable to those in the previous experiment.

We obtained 240 judgments from 65 subjects (after removing fake answers). Table 1b shows the results of all pairwise comparisons. We find that even in the presence of a larger number of attributes, our algorithm assigns contrastive focus to the correct words of the RE.

7.4 Discussion

Our experiments confirm that the strategy for computing contrastive REs presented in this paper works in practice. This validates the corruption model, which approximates semantic mismatches between what the speaker said and what the listener understood as differences at the level of words in strings. Obviously, this model is still an approximation, and we will test its limits in future work.

We find that users generally prefer REs with an emphasis over simple repetitions.


Table 1: Pairwise comparisons between feedback strategies for Experiments 1 and 2. A positive value shows preference for the row strategy, significant at *** p < 0.001.

(a) Results for Experiment 1

                Repeat      Random      Emphasis    Shortening
    Repeat      –           0.041       -0.570***   -0.141
    Random      -0.041      –           -0.600***   -0.109
    Emphasis    0.570***    0.600***    –           0.376***
    Shortening  0.141       0.109       -0.376***   –

(b) Results for Experiment 2

                Repeat      Random      Emphasis
    Repeat      –           -0.425***   -0.575***
    Random      0.425***    –           -0.425***
    Emphasis    0.575***    0.425***    –

In the more challenging scenes of the TUNA corpus, users even have a significant preference for Random over Repeat, although this makes no semantic sense. This preference may be due to the fact that emphasizing anything at least publicly acknowledges the presence of a misunderstanding that requires correction. It will be interesting to explore whether this preference holds up in an interactive setting, rather than an overhearer experiment, where listeners will have to act upon the corrective REs.

The poor performance of the Shortening strategy is a surprising negative result. We would expect a shorter RE to always be preferred, following the Gricean Maxim of Quantity (Grice, 1975). This may be because our particular Shortening strategy can be improved, or it may be because listeners interpret the shortened REs not with respect to the original instructions, but rather with respect to a “refreshed” context (as observed, for instance, in Gotzner et al. (2016)). In this case the shortened REs would not be unique with respect to the refreshed, wider context.

8 Conclusion

In this paper, we have presented an algorithm for generating contrastive feedback for a hearer who has misunderstood a referring expression. Our technique is based on modeling likely user misunderstandings and then attempting to give feedback that contrasts with the most probable incorrect understanding. Our experiments show that this technique accurately predicts which words to mark as focused in a contrastive RE.

In future work, we will complement the overhearer experiment presented here with an end-to-end evaluation in an interactive NLG setting. This will allow us to further investigate the quality of the correction strategies and refine the Shortening strategy. It will also give us the opportunity to investigate empirically the limits of the corruption model. Furthermore, we could use this data to refine the costs c(D), c(I_a), etc. for the edit operations, possibly assigning different costs to different edit operations.

Finally, it would be interesting to combine our algorithm with a speech synthesis system. In this way, we will be able to express focus with actual pitch accents, in contrast to the typographic approximation we made here.

References

Saskia Akkersdijk, Marin Langenbach, Frieder Loch, and Mariët Theune. 2011. The thumbs up! Twente system for GIVE 2.5. In Proceedings of the 13th European Workshop on Natural Language Generation (ENLG 2011).

Yehoshua Bar-Hillel, Micha Perles, and Eli Shamir. 1961. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung 14:143–172.

Ina Bornkessel and Matthias Schlesewsky. 2006. The role of contrast in the local licensing of scrambling in German: Evidence from online comprehension. Journal of Germanic Linguistics 18(1):1–43.

Okko Buß, Timo Baumann, and David Schlangen. 2010. Collaborating on utterances with a spoken dialogue system using an ISU-based approach to incremental dialogue management. In Proceedings of the Special Interest Group on Discourse and Dialogue Conference (SIGdial 2010).

Hubert Comon, Max Dauchet, Rémi Gilleron, Florent Jacquemard, Denis Lugiez, Sophie Tison, Marc Tommasi, and Christof Löding. 2007. Tree Automata Techniques and Applications. Published online: http://tata.gforge.inria.fr/.

Robert Dale and Ehud Reiter. 1995. Computational interpretations of the Gricean maxims in the generation of referring expressions. Cognitive Science 19(2):233–263.

Nikos Engonopoulos and Alexander Koller. 2014. Generating effective referring expressions using charts. In Proceedings of the INLG and SIGdial 2014 Joint Session.

Nikos Engonopoulos, Martín Villalba, Ivan Titov, and Alexander Koller. 2013. Predicting the resolution of referring expressions from user behavior. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013).

Nicole Gotzner, Isabell Wartenburger, and Katharina Spalek. 2016. The impact of focus particles on the recognition and rejection of contrastive alternatives. Language and Cognition 8(1):59–95.

H. Paul Grice. 1975. Logic and conversation. In P. Cole and J. L. Morgan, editors, Syntax and Semantics, Vol. 3: Speech Acts, Academic Press, pages 41–58.

John Edward Hopcroft and Jeffrey Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley.

David Howcroft, Crystal Nakatsu, and Michael White. 2013. Enhancing the expression of contrast in the SPaRKy restaurant corpus. In Proceedings of the 14th European Workshop on Natural Language Generation (ENLG 2013).

Alexander Koller, Kristina Striegnitz, Andrew Gargett, Donna Byron, Justine Cassell, Robert Dale, Johanna Moore, and Jon Oberlander. 2010. Report on the Second NLG Challenge on Generating Instructions in Virtual Environments (GIVE-2). In Proceedings of the Sixth International Natural Language Generation Conference (Special session on Generation Challenges).

Emiel Krahmer and Mariët Theune. 2002. Efficient context-sensitive generation of referring expressions. In K. van Deemter and R. Kibble, editors, Information Sharing: Reference and Presupposition in Language Generation and Interpretation, CSLI Publications, volume 143, pages 223–263.

Manfred Krifka. 2008. Basic notions of information structure. Acta Linguistica Hungarica 55:243–276.

Willem J. M. Levelt. 1993. Speaking: From Intention to Articulation. MIT Press.

Maria Milosavljevic and Robert Dale. 1996. Strategies for comparison in encyclopædia descriptions. In Proceedings of the 8th International Natural Language Generation Workshop (INLG 1996).

Teruhisa Misu, Antoine Raux, Rakesh Gupta, and Ian Lane. 2014. Situated language understanding at 25 miles per hour. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGdial 2014).

Mehryar Mohri. 2003. Edit-distance of weighted automata: General definitions and algorithms. International Journal of Foundations of Computer Science 14(6):957–982.

Janet B. Pierrehumbert and Julia Hirschberg. 1990. The meaning of intonational contours in the interpretation of discourse. In Philip R. Cohen, Jerry Morgan, and Martha E. Pollack, editors, Intentions in Communication, MIT Press, chapter 14.

Mats Rooth. 1992. A theory of focus interpretation. Natural Language Semantics 1:75–116.

Mats Rooth. 1997. Focus. In Shalom Lappin, editor, The Handbook of Contemporary Semantic Theory, Blackwell Publishing, chapter 10, pages 271–298.

Anita Steube. 2001. Correction by contrastive focus. Theoretical Linguistics 27(2–3):215–250.

Ielka van der Sluis, Albert Gatt, and Kees van Deemter. 2007. Evaluating algorithms for the generation of referring expressions: Going beyond toy domains. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2007).

Sina Zarrieß and David Schlangen. 2016. Easy things first: Installments improve referring expression generation for objects in photographs. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016).
