
Transcript of arXiv:2107.06905v1 [cs.CL] 14 Jul 2021

Transition-based Bubble Parsing: Improvements on Coordination Structure Prediction

Tianze Shi, Cornell University

[email protected]

Lillian Lee, Cornell University

[email protected]

Abstract

We propose a transition-based bubble parser to perform coordination structure identification and dependency-based syntactic analysis simultaneously. Bubble representations were proposed in the formal linguistics literature decades ago; they enhance dependency trees by encoding coordination boundaries and internal relationships within coordination structures explicitly. In this paper, we introduce a transition system and neural models for parsing these bubble-enhanced structures. Experimental results on the English Penn Treebank and the English GENIA corpus show that our parsers beat previous state-of-the-art approaches on the task of coordination structure prediction, especially for the subset of sentences with complex coordination structures.1

1 Introduction

Coordination structures are prevalent in treebank data (Ficler and Goldberg, 2016a), especially in long sentences (Kurohashi and Nagao, 1994), and they are among the most challenging constructions for NLP models. Difficulties in correctly identifying coordination structures have consistently contributed to a significant portion of errors in state-of-the-art parsers (Collins, 2003; Goldberg and Elhadad, 2010; Ficler and Goldberg, 2017). These errors can further propagate to downstream NLP modules and applications, and limit their performance and utility. For example, Saha et al. (2017) report that missing conjuncts account for two-thirds of the errors in recall made by their open information extraction system.

Coordination constructions are particularly challenging for the widely-adopted dependency-based paradigm of syntactic analysis, since the asymmetric definition of head-modifier dependency relations is not directly compatible with the symmetric

1 Code at github.com/tzshi/bubble-parser-acl21.

[Figure 1: Bubble tree and (basic) UD tree for the same example sentence, "I prefer hot coffee or tea and a bun". (For clarity, we omit punctuation and single-word bubble boundaries.) Bubbles explicitly encode the scope of the shared modifier "hot" with respect to the nested coordination, whereas the UD tree gives both "tea" and "bun" identical relationships to "hot".]

nature of the relations among the participating conjuncts and coordinators.2 Existing treebanks usually resort to introducing special relations to represent coordination structures. But there remain theoretical and empirical challenges regarding how to most effectively encode information like modifier-sharing relations while still permitting accurate statistical syntactic analysis.

In this paper, we explore Kahane's (1997) alternative solution: extend the dependency-tree representation by introducing bubble structures to explicitly encode coordination boundaries. The co-heads within a bubble enjoy a symmetric relationship, as befits a model of conjunction. Further, bubble trees support representation of nested coordination, with the scope of shared modifiers identifiable by the attachment sites of bubble arcs. Figure 1 compares a bubble tree against a Universal Dependencies (UD; Nivre et al., 2016, 2020) tree for the same sentence.

Yet, despite these advantages, implementation

2 Rambow (2010) comments on other divergences between syntactic representation and syntactic phenomena.


of the formalism was not broadly pursued, for reasons unknown to us. Given its appealing and intuitive treatment of coordination phenomena, we revisit the bubble tree formalism, introducing and implementing a transition-based solution for parsing bubble trees. Our transition system, Bubble-Hybrid, extends the Arc-Hybrid transition system (Kuhlmann et al., 2011) with three bubble-specific transitions, corresponding respectively to opening, expanding, and closing bubbles. We show that our transition system is both sound and complete with respect to projective bubble trees (defined in §2.2).

Experiments on the English Penn Treebank (PTB; Marcus et al., 1993) extended with coordination annotation (Ficler and Goldberg, 2016a) and the English GENIA treebank (Kim et al., 2003) demonstrate the effectiveness of our proposed transition-based bubble parsing on the task of coordination structure prediction. Our method achieves state-of-the-art performance on both datasets and improves accuracy on the subset of sentences exhibiting complex coordination structures.

2 Dependency Trees and Bubble Trees

2.1 Dependency-based Representations for Coordination Structures

A dependency tree encodes syntactic relations via directed bilexical dependency edges. These are natural for representing argument and adjunct modification, but Popel et al. (2013) point out that "dependency representation is at a loss when it comes to representing paratactic linguistic phenomena such as coordination, whose nature is symmetric (two or more conjuncts play the same role), as opposed to the head-modifier asymmetry of dependencies" (pg. 517).

If one nonetheless persists in using dependency relations to annotate all syntactic structures, as is common practice in most dependency treebanks (Hajic et al., 2001; Nivre et al., 2016, inter alia), then one must introduce special relations to represent coordination structures and promote one element from each coordinated phrase to become the "representational head". One choice is to specify one of the conjuncts as the "head" (Mel'cuk, 1988, 2003; Järvinen and Tapanainen, 1998; Lombardo and Lesmo, 1998) (e.g., in Figure 1, the visually asymmetric "conj" relation between "coffee" and "tea" is overloaded to admit a symmetric relationship), but it is then non-trivial to distinguish shared modifiers from private ones (e.g., in the UD tree at the bottom of Figure 1, it is difficult to tell that "hot" is shared by "coffee" and "tea" but does not modify "bun"). Another choice is to let one of the coordinators dominate the phrase (Hajic et al., 2001, 2020), but the coordinator does not directly capture the syntactic category of the coordinated phrase. Decisions on which of these dependency-based fixes is more workable are further complicated by the interaction between representation styles and their learnability in statistical parsing (Nilsson et al., 2006; Johansson and Nugues, 2007; Rehbein et al., 2017).

Enhanced UD A tactic used by many recent releases of UD treebanks is to introduce certain extra edges and non-lexical nodes (Schuster and Manning, 2016; Nivre et al., 2018; Bouma et al., 2020). While some of the theoretical issues still persist in this approach with respect to capturing the symmetric nature of relations between conjuncts, this solution better represents shared modifiers in coordinations, and so is a promising direction. In work concurrent with our own, Grünewald et al. (2021) manually correct the coordination structure annotations in an English treebank under the enhanced UD representation format. We leave it to future work to explore the feasibility of automatic conversion of coordination structure representations between enhanced UD trees and bubble trees, which we discuss next.

2.2 Bubble Trees

An alternative solution to the coordination-in-dependency-trees dilemma is to permit certain restricted phrase-inspired constructs for such structures. Indeed, Tesnière's (1959) seminal work on dependency grammar does not describe all syntactic relations in terms of dependencies, but rather reserves a primitive relation for connecting coordinated items. Hudson (1984) further extends this idea by introducing explicit markings of coordination boundaries.

In this paper, we revisit bubble trees, a representational device along the same vein introduced by Kahane (1997) for syntactic representation. (Kahane credits Gladkij (1968) with a formal study.) Bubbles are used to denote coordinated phrases; otherwise, asymmetric dependency relations are retained. Conjuncts immediately within the bubble may co-head the bubble, and the bubble itself may establish dependencies with its governor and modifiers. Figure 1 depicts an example bubble tree.

We now formally define bubble trees and their projective subset, which will become the focus of our transition-based parser in §3. The following formal descriptions are adapted from Kahane (1997), tailored to the presentation of our parser.

Formal Definition Given a dependency-relation label set L, we define a bubble tree for a length-n sentence W = w1, . . . , wn to be a quadruple (V, B, φ, A), where V = {RT, w1, . . . , wn} is the ground set of nodes (RT is the dummy root), B is a set of bubbles, the function φ : B → 2^V \ {∅} gives the content of each bubble as a non-empty3 subset of V, and A ⊂ B × L × B defines a labeled directed tree over B. Given the labeled directed tree A, we say α1 → α2 if and only if (α1, l, α2) ∈ A for some l. We denote the reflexive transitive closure of the relation → by →*.

Bubble tree (V, B, φ, A) is well-formed if and only if it satisfies the following conditions:4

• No partial overlap: ∀α1, α2 ∈ B, either φ(α1) ∩ φ(α2) = ∅ or φ(α1) ⊆ φ(α2) or φ(α2) ⊆ φ(α1);
• Non-duplication: there exist no non-identical α1, α2 ∈ B such that φ(α1) = φ(α2);
• Lexical coverage: for any singleton (i.e., one-element) set s in 2^V, ∃α ∈ B such that φ(α) = s;
• Roothood: the root RT appears in exactly one bubble, a singleton that is the root of the tree defined by A;
• Containment: if ∃α1, α2 ∈ B such that φ(α2) ⊂ φ(α1), then α1 →* α2.

Projectivity Our parser focuses on the subclass of projective well-formed bubble trees. Visually, a projective bubble tree only contains bubbles covering a consecutive sequence of words (such that we can draw boxes around the spans of words to represent them) and can be drawn with all arcs arranged spatially above the sentence such that no two arcs or bubble boundaries cross each other. The bubble tree in Figure 1 is projective.

Formally, we define the projection ψ(α) ∈ 2^V of a bubble α ∈ B to be all nodes that the bubble and its subtree cover; that is, v ∈ ψ(α) if and only if α →* α′ and v ∈ φ(α′) for some α′. Then, we define a well-formed bubble tree to be projective if and only if it additionally satisfies the following:

• Continuous coverage: for any bubble α ∈ B, if wi, wj ∈ φ(α) and i < k < j, then wk ∈ φ(α);

3 Our definition does not allow empty nodes; we leave it to future work to support them for gapping constructions.

4 We do not use β for bubbles because we reserve the β symbol for our parser's buffer.

• Continuous projections: for any bubble α ∈ B, if wi, wj ∈ ψ(α) and i < k < j, then wk ∈ ψ(α);
• Contained projections: for α1, α2 ∈ B, if α1 →* α2, then either ψ(α2) ⊂ φ(α1) or ψ(α2) ∩ φ(α1) = ∅.
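To make two of these conditions concrete, here is a minimal sketch in Python (our own illustration, not part of the paper's released code), assuming each bubble's content φ(α) is represented as a set of word indices:

```python
from itertools import combinations

def no_partial_overlap(contents):
    """Any two bubble contents must be disjoint or nested."""
    for a, b in combinations(contents, 2):
        if a & b and not (a <= b or b <= a):
            return False
    return True

def is_contiguous(content):
    """Continuous coverage: the indices in phi(alpha) form one unbroken span."""
    idxs = sorted(content)
    return idxs == list(range(idxs[0], idxs[-1] + 1))

# Example from Figure 1 ("I prefer hot coffee or tea and a bun", 1-indexed):
# the bubble [coffee or tea] (indices 4-6) nests inside
# [coffee or tea and a bun] (indices 4-9) without partial overlap.
contents = [{4, 5, 6}, {4, 5, 6, 7, 8, 9}]
assert no_partial_overlap(contents) and all(map(is_contiguous, contents))
```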

3 Our Transition System for Parsing Bubble Trees

Although, as we have seen, bubble trees have theoretical benefits in representing coordination structures that interface with an overall dependency-based analysis, there has been a lack of parser implementations capable of handling such representations. In this section, we fill this gap by introducing a transition system that can incrementally build projective bubble trees.

Transition-based approaches are popular in dependency parsing (Nivre, 2008; Kübler et al., 2009). We propose to extend the Arc-Hybrid transition system (Kuhlmann et al., 2011) with transitions specific to bubble structures.5

3.1 Bubble-Hybrid Transition System

A transition system consists of a data structure describing the intermediate parser states, called configurations; specifications of the initial and terminal configurations; and an inventory of transitions that advance the parser in configuration space towards reaching a terminal configuration.

Our transition system uses a configuration data structure similar to that of Arc-Hybrid, which consists of a stack, a buffer, and the partially-committed syntactic analysis. Initially, the stack only contains a singleton bubble corresponding to {RT}, and the buffer contains singleton bubbles, each representing a token in the sentence. Then, through taking transitions one at a time, the parser can incrementally move items from the buffer to the stack, or reduce items by attaching them to other bubbles or merging them into larger bubbles. Eventually, the parser should arrive at a terminal configuration where the stack contains the singleton bubble of {RT} again, but the buffer is empty, as all the tokens are now attached to or contained in other bubbles that are now descendants of the

5 Our strategy can be adapted to other transition systems as well; we focus on Arc-Hybrid here because of its comparatively small inventory of transitions, absence of spurious ambiguities (there is a one-to-one mapping between a gold tree and a valid transition sequence), and abundance of existing implementations (e.g., Kiperwasser and Goldberg, 2016).

Transition (Pre-conditions):

  SHIFT            (|β| ≥ 1)
  LEFTARC_lbl      (|σ| ≥ 1; |β| ≥ 1; s1, b1 ∉ O; φ(s1) ≠ {RT})
  RIGHTARC_lbl     (|σ| ≥ 2; s1, s2 ∉ O)
  BUBBLEOPEN_lbl   (|σ| ≥ 2; s1, s2 ∉ O; φ(s2) ≠ {RT})
  BUBBLEATTACH_lbl (|σ| ≥ 2; s1 ∉ O; s2 ∈ O)
  BUBBLECLOSE      (|σ| ≥ 1; s1 ∈ O)

Table 1: Illustration of our Bubble-Hybrid transition system. We give the pre-conditions for each transition; the original table also visualizes the affected stack and buffer items, comparing the configurations before and after taking each transition. O denotes the set of currently open bubbles and RT is the dummy root symbol.

[Figure 2: Step-by-step visualization of the stack and buffer during parsing of the example sentence in Figure 1. The transition sequence is: SHIFT, LEFTARC_nsubj, SHIFT, SHIFT, SHIFT, SHIFT, BUBBLEOPEN_cc, SHIFT, BUBBLEATTACH_conj, BUBBLECLOSE, LEFTARC_amod, SHIFT, SHIFT, BUBBLEOPEN_cc, SHIFT, LEFTARC_det, SHIFT, BUBBLEATTACH_conj, BUBBLECLOSE, SHIFT, RIGHTARC_obj, RIGHTARC_root. For steps following an attachment or BUBBLECLOSE transition, the detailed subtree or internal bubble structure is omitted for visual clarity. For the same reason, we omit drawing the boundaries around singleton bubbles.]

{RT} singleton, and we can retrieve a completed bubble-tree parse.

Table 1 lists the available transitions in our Bubble-Hybrid system. The SHIFT, LEFTARC, and RIGHTARC transitions are as in the Arc-Hybrid system. We introduce three new transitions to handle coordination-related bubbles: BUBBLEOPEN puts the first two items on the stack into an open bubble, with the first item in the bubble, i.e., previously the second topmost item on the stack, labeled as the first conjunct of the resulting bubble; BUBBLEATTACH absorbs the topmost item on the stack into the open bubble at the second topmost position; and finally, BUBBLECLOSE closes the open bubble at the top of the stack and moves it to the buffer, which then allows it to take modifiers from its left through LEFTARC transitions. Figure 2 visualizes the stack and buffer throughout the process of parsing the example sentence in Figure 1. In particular, the last two steps in the left column of Figure 2 show the bubble corresponding to the phrase "coffee or tea" receiving its left modifier "hot" through a LEFTARC transition after it is put back on the buffer by a BUBBLECLOSE transition.

Formal Definition Our transition system is a quadruple (C, T, c_i, C_τ), where C is the set of configurations to be defined shortly, T is the set of transitions, with each element being a partial function t ∈ T : C ⇀ C, c_i maps a sentence to its initial configuration, and C_τ ⊂ C is a set of terminal configurations. Each configuration c ∈ C is a septuple (σ, β, V, B, φ, A, O), where V, B, φ, and A define a partially-recognized bubble tree, σ and β are each an (ordered) list of items in B, and O ⊂ B is a set of open bubbles. For a sentence W = w1, . . . , wn, we let c_i(W) = (σ_0, β_0, V, B_0, φ_0, {}, {}), where V = {RT, w1, . . . , wn}, B_0 contains n + 1 items B^0_0, . . . , B^0_n with φ_0(B^0_0) = {RT} and φ_0(B^0_i) = {w_i} for i from 1 to n, σ_0 = [B^0_0], and β_0 = [B^0_1, . . . , B^0_n]. We write σ|s1 and b1|β to denote a stack and a buffer with their topmost items being s1 and b1 and the remainders being σ and β, respectively. We also omit the constant V in describing c when the context is clear.

For the transitions T, we have:

• SHIFT[(σ, b1|β, B, φ, A, O)] = (σ|b1, β, B, φ, A, O);
• LEFTARC_lbl[(σ|s1, b1|β, B, φ, A, O)] = (σ, b1|β, B, φ, A ∪ {(b1, lbl, s1)}, O);
• RIGHTARC_lbl[(σ|s2|s1, β, B, φ, A, O)] = (σ|s2, β, B, φ, A ∪ {(s2, lbl, s1)}, O);
• BUBBLEOPEN_lbl[(σ|s2|s1, β, B, φ, A, O)] = (σ|α, β, B ∪ {α}, φ′, A ∪ {(α, conj, s2), (α, lbl, s1)}, O ∪ {α}), where α is a new bubble and φ′ = φ ⊕ {α ↦ ψ(s2) ∪ ψ(s1)} (i.e., φ′ is almost the same as φ, but with α added to the function's domain, mapped by the new function to cover the projections of both s2 and s1);
• BUBBLEATTACH_lbl[(σ|s2|s1, β, B, φ, A, O)] = (σ|s2, β, B, φ′, A ∪ {(s2, lbl, s1)}, O), where φ′ = φ ⊕ {s2 ↦ φ(s2) ∪ ψ(s1)};
• BUBBLECLOSE[(σ|s1, β, B, φ, A, O)] = (σ, s1|β, B, φ, A, O \ {s1}).
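As a concrete illustration, the following Python sketch mirrors these definitions. It is our own condensed rendering, not the released implementation: pre-condition checks are omitted, and each item's content φ and projection ψ are tracked as plain sets of word indices.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Item:
    content: set  # phi: word indices directly inside this bubble
    proj: set     # psi: indices covered by the bubble plus its subtree

@dataclass
class Config:
    stack: list                               # sigma (top = stack[-1])
    buffer: list                              # beta (front = buffer[0])
    arcs: list = field(default_factory=list)  # A: (head, label, dependent)
    opened: set = field(default_factory=set)  # O: ids of currently open bubbles

def initial(n):
    """c_i(W): singleton {RT} on the stack, one singleton bubble per word."""
    items = [Item({i}, {i}) for i in range(n + 1)]  # index 0 stands for RT
    return Config(stack=[items[0]], buffer=items[1:])

def shift(c):                        # SHIFT
    c.stack.append(c.buffer.pop(0))

def left_arc(c, lbl):                # LEFTARC: add b1 -> s1, pop s1
    s1, b1 = c.stack.pop(), c.buffer[0]
    c.arcs.append((b1, lbl, s1))
    b1.proj = b1.proj | s1.proj

def right_arc(c, lbl):               # RIGHTARC: add s2 -> s1, pop s1
    s1, s2 = c.stack.pop(), c.stack[-1]
    c.arcs.append((s2, lbl, s1))
    s2.proj = s2.proj | s1.proj

def bubble_open(c, lbl):             # BUBBLEOPEN: new open bubble over s2, s1
    s1, s2 = c.stack.pop(), c.stack.pop()
    alpha = Item(s2.proj | s1.proj, s2.proj | s1.proj)
    c.arcs.append((alpha, "conj", s2))
    c.arcs.append((alpha, lbl, s1))
    c.stack.append(alpha)
    c.opened.add(id(alpha))

def bubble_attach(c, lbl):           # BUBBLEATTACH: absorb s1 into open s2
    s1, s2 = c.stack.pop(), c.stack[-1]
    c.arcs.append((s2, lbl, s1))
    s2.content = s2.content | s1.proj
    s2.proj = s2.proj | s1.proj

def bubble_close(c):                 # BUBBLECLOSE: move s1 back to the buffer
    s1 = c.stack.pop()
    c.opened.discard(id(s1))
    c.buffer.insert(0, s1)
```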

3.2 Soundness and Completeness

In this section, we show that our Bubble-Hybrid transition system is both sound and complete (defined below) with respect to the subclass of projective bubble trees.6

Define a valid transition sequence π = t1, . . . , tm for a given sentence W to be a sequence such that for the corresponding sequence of configurations c_0, . . . , c_m, we have c_0 = c_i(W), c_j = t_j(c_{j−1}) for j from 1 to m, and c_m ∈ C_τ. We can then state soundness and completeness properties, and present high-level proof sketches below, adapted from Nivre's (2008) proof frameworks.

Lemma 1 (Soundness). Every valid transition sequence π produces a projective bubble tree.

Proof Sketch. We examine the requirements for a projective bubble tree one by one. The set of edges satisfies the tree constraints, since every bubble except for the singleton bubble of RT must have an in-degree of one to have been reduced from the stack, and the topological order of reductions implies acyclicity. Lexical coverage is guaranteed by c_i. Roothood is safeguarded by the transition pre-conditions. Non-duplication is ensured because newly-created bubbles are strictly larger. All the other properties can be proved by induction over the lengths of transition sequence prefixes, since each of our transitions preserves the no-partial-overlap, containment, and projectivity constraints.

Lemma 2 (Completeness). For every projective bubble tree over any given sentence W, there exists a corresponding valid transition sequence π.

Proof Sketch. The proof proceeds by strong induction on sentence length. We omit relation labels without loss of generality. The base case of |W| = 1 is trivial. For the inductive step, we enumerate how to decompose the tree's top-level

6 More precisely, our transition system handles the subset where each non-singleton bubble has ≥ 2 internal children.

structure. (1) When the root has multiple children: due to projectivity, each child bubble tree τi covers a consecutive span of words w_{xi}, . . . , w_{yi} that is shorter than |W|. By the induction hypothesis, there exists a valid transition sequence πi to construct the child tree over RT, w_{xi}, . . . , w_{yi}. Here we let πi denote the transition sequence excluding the always-present final RIGHTARC transition that attaches the subtree to RT; this is for explicit illustration of which transitions to take once the subtrees are constructed. The full tree can be constructed by π = π1, RIGHTARC, π2, RIGHTARC, . . . (expanding each πi sequence into its component transitions), where we simply attach each subtree to RT immediately after it is constructed. (2) When the root has a single child bubble α, we cannot directly use the induction hypothesis, since α covers the same number of words as W. Thus we need to further enumerate the top-level structure of α. (2a) If α has children with their projections outside of φ(α), then we can find a sequence π0 for constructing the shorter-length bubble α and placing it on the buffer (this corresponds to an empty transition sequence if α is a singleton; otherwise, π0 ends with a BUBBLECLOSE transition) and πi's for α's outside children; say α has l outside children to the left of its contents. We construct the entire tree via π = π1, . . . , πl, π0, LEFTARC, . . . , LEFTARC, SHIFT, πl+1, RIGHTARC, . . . , RIGHTARC, where we first construct all the left outside children and leave them on the stack, next build the bubble α and use LEFTARC transitions to attach its left children while it is on the buffer, then shift α to the stack before finally continuing on to build its right children's subtrees, each immediately followed by a RIGHTARC transition. (2b) If α is a non-singleton bubble without any outside children, but each of its inside children can be parsed through πi based on the inductive hypothesis, then we can define π = π1, π2, BUBBLEOPEN, π3, BUBBLEATTACH, . . . , BUBBLECLOSE, SHIFT, RIGHTARC, where we use a BUBBLEOPEN transition once the first two bubble-internal children are built, each subsequent child is attached via BUBBLEATTACH immediately after construction, and the final three transitions ensure proper closing of the bubble and its attachment to RT.

4 Models

Our model architecture largely follows that of Kiperwasser and Goldberg's (2016) neural Arc-Hybrid parser, but we additionally introduce feature composition for non-singleton bubbles, and a rescoring module to reduce frequent coordination-boundary prediction errors. Our model has five components: feature extraction, bubble-feature composition, transition scoring, label scoring, and boundary subtree rescoring.

Feature Extraction We first extract contextualized features for each token using a bidirectional LSTM (Graves and Schmidhuber, 2005):

[w_0, w_1, . . . , w_n] = bi-LSTM([RT, w1, . . . , wn]),

where the inputs to the bi-LSTM are concatenations of word embeddings, POS-tag embeddings, and character-level LSTM embeddings. We also report experiments replacing the bi-LSTM with pre-trained BERT features (Devlin et al., 2019).
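A minimal PyTorch sketch of this encoder follows; the dimensions and layer sizes are illustrative placeholders rather than the paper's hyperparameters, and the character-level LSTM is elided.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Bi-LSTM over concatenated word and POS-tag embeddings."""
    def __init__(self, n_words, n_tags, d_word=100, d_tag=32, d_hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, d_word)
        self.tag_emb = nn.Embedding(n_tags, d_tag)
        # A character-level LSTM embedding would be concatenated here as well.
        self.bilstm = nn.LSTM(d_word + d_tag, d_hidden,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids, tag_ids):
        # word_ids, tag_ids: (batch, n + 1), with position 0 holding RT
        x = torch.cat([self.word_emb(word_ids), self.tag_emb(tag_ids)], dim=-1)
        out, _ = self.bilstm(x)  # (batch, n + 1, 2 * d_hidden): w_0 ... w_n
        return out
```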

Bubble-Feature Composition We initialize the features7 for each singleton bubble B_i in the initial configuration to be v_{B_i} = w_i. For a non-singleton bubble α, we use recursively composed features

v_α = g({v_α′ | (α, conj, α′) ∈ A}),

where g is a composition function combining features from the co-heads (conjuncts) immediately inside the bubble.8 For our model, for any V′ = {v_{i1}, . . . , v_{iN}}, we set

g(V′) = tanh(W_g · mean(V′)),

where mean() computes element-wise averages and W_g is a learnable square matrix. We also experiment with a parameter-free version: g = mean. Neither of the feature functions distinguishes between open and closed bubbles, so we append to each v vector an indicator-feature embedding based on whether the bubble is open, closed, or singleton.
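A sketch of g and the indicator feature in PyTorch, again with assumed, illustrative dimensions:

```python
import torch
import torch.nn as nn

class BubbleComposer(nn.Module):
    """g(V') = tanh(W_g . mean(V')), plus an open/closed/singleton indicator."""
    OPEN, CLOSED, SINGLETON = 0, 1, 2

    def __init__(self, d, d_state=16):
        super().__init__()
        self.W_g = nn.Linear(d, d, bias=False)  # the learnable square matrix
        self.state_emb = nn.Embedding(3, d_state)

    def compose(self, conjunct_vecs):
        # conjunct_vecs: list of co-head feature vectors, each of size d;
        # applied recursively for nested bubbles
        v = torch.stack(conjunct_vecs).mean(dim=0)  # element-wise average
        return torch.tanh(self.W_g(v))

    def with_state(self, v, state):
        # Append the indicator embedding so the three bubble states differ.
        return torch.cat([v, self.state_emb(torch.tensor(state))], dim=-1)
```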

Transition Scoring Given the current parser configuration c, the model predicts the best unlabeled transition to take among all valid transitions valid(c) whose pre-conditions are satisfied. We

7 We adopt the convenient abuse of notation of allowing indexing by arbitrary objects.

8 Compared with the subtree-feature composition functions in dependency parsing that are motivated by asymmetric headed constructions (Dyer et al., 2015; de Lhoneux et al., 2019; Basirat and Nivre, 2021), our definition focuses on composing features from an unordered set of vectors representing the conjuncts in a bubble. The composition function is recursively applied when there are nested bubbles.

model the log-linear probability of taking an action with a multi-layer perceptron (MLP):

P(t|c) ∝ exp(MLP^trans_t([v_s3 ◦ v_s2 ◦ v_s1 ◦ v_b1])),

where ◦ denotes vector concatenation, s1 through s3 are the first through third topmost items on the stack, and b1 is the immediately accessible buffer item. We experiment with varying the number of stack items to extract features from.
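For instance, transition scoring can be realized as a masked MLP classifier over the concatenated stack and buffer features; in this sketch the hidden size and MLP depth are assumptions of ours:

```python
import torch
import torch.nn as nn

class TransitionScorer(nn.Module):
    """log P(t | c) from an MLP over [v_s3 . v_s2 . v_s1 . v_b1],
    restricted to transitions whose pre-conditions hold."""
    def __init__(self, d_feat, n_transitions, d_hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * d_feat, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, n_transitions))

    def forward(self, v_s3, v_s2, v_s1, v_b1, valid_mask):
        # valid_mask: boolean tensor, True where a transition is valid in c
        logits = self.mlp(torch.cat([v_s3, v_s2, v_s1, v_b1], dim=-1))
        logits = logits.masked_fill(~valid_mask, float("-inf"))
        return torch.log_softmax(logits, dim=-1)
```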

Label Scoring We separate edge-label prediction from (unlabeled) transition prediction, but the scoring function takes a similar form:

P(l|c, t) ∝ exp(MLP^lbl_l([v_{h(c,t)} ◦ v_{d(c,t)}])),

where (h(c, t), l, d(c, t)) is the edge to be added into the partial bubble tree in t(c).

Boundary Subtree Rescoring In our preliminary error analysis, we find that our models tend to make more mistakes at the boundaries of full coordination phrases than at the internal conjunct boundaries, due to incorrect attachments of children choosing between the phrasal bubble and the first/last conjunct. For example, our initial model predicts "if you owned it and liked it Friday" instead of the annotated "if you owned it and liked it Friday" (the predicted and gold conjuncts are both italicized and underlined in the original typesetting), incorrectly attaching "Friday" to "liked". We attribute this problem to the greedy nature of our first formulation of the parser, and propose to mitigate the issue through rescoring. To rescore the boundary attachments of a non-singleton bubble α, for each of the left dependents d of α and its first conjunct αf, we (re-)decide the attachment via

P(α → d | αf) = logistic(MLP^re([v_d ◦ v_α ◦ v_{αf}])),

and similarly for the last conjunct αl and a potential right dependent.
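The rescoring decision itself is a binary classifier; a sketch (the hidden size is our assumption):

```python
import torch
import torch.nn as nn

class BoundaryRescorer(nn.Module):
    """P(alpha -> d | alpha_f) = logistic(MLP_re([v_d . v_alpha . v_alpha_f]))."""
    def __init__(self, d, d_hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * d, d_hidden), nn.Tanh(),
            nn.Linear(d_hidden, 1))

    def forward(self, v_d, v_bubble, v_conjunct):
        # Probability that d attaches to the whole bubble rather than the conjunct.
        return torch.sigmoid(self.mlp(torch.cat([v_d, v_bubble, v_conjunct], dim=-1)))
```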

Training and Inference Our parser is a locally-trained greedy parser. In training, we optimize the model parameters to maximize the log-likelihoods of predicting the target transitions and labels along the paths generating the gold bubble trees, and the log-likelihoods of the correct attachments in rescoring;9 during inference, the parser greedily commits to the highest-scoring transition and label for each of its current parser configurations, and after reaching a terminal configuration, it rescores and readjusts all boundary subtree attachments.

9 We leave the definition of dynamic oracles (Goldberg and Nivre, 2013) for bubble tree parsing to future work.

               Exact            Inner
           All      NP      All      NP
FG16        –        –     72.70   76.10
TSM17     71.08    75.01   73.74   77.25
TSM19     75.47    77.83   77.74   80.06
Ours      76.48    81.63   78.30   84.03
+BERT     83.74    85.26   84.46   86.22

Table 2: F1 scores on the PTB test set. See Appendix C for precision, recall, and dev set results.

           Exact    Whole
HSOM09       –      61.5
FG16         –      64.14
TSM17      55.22    66.31
TSM19      61.22    61.31
Ours       67.09    68.23
+BERT      79.18    80.41

Table 3: Recall results on the GENIA dataset (we report recall instead of F1 scores, following prior work). See Appendix C for detailed results per constituent type.

5 Empirical Results

Task and Evaluation We validate the utility of our transition-based parser using the task of coordination structure prediction. Given an input sentence, the task is to identify all coordination structures and the spans of all their conjuncts within that sentence. We mainly evaluate based on exact metrics, which count a prediction of a coordination structure as correct if and only if all of its conjunct spans are correct. To facilitate comparison with pre-existing systems that do not attempt to identify all conjunct boundaries, following Teranishi et al. (2017, 2019), we also consider inner (= only consider the correctness of the two conjuncts adjacent to the coordinator) and whole (= only consider the boundary of the whole coordinated phrase) metrics.
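Under our reading of the exact metric, scoring reduces to set intersection over tuples of conjunct spans; a sketch, where the span representation is our own assumption:

```python
def precision_recall_f1(pred, gold):
    """Exact coordination F1: a predicted coordination is correct iff its
    full tuple of (start, end) conjunct spans matches a gold coordination.
    pred, gold: sets of such tuples, aggregated over the corpus."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```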

Data and Experimental Setup We experiment with two English datasets, the Penn Treebank (PTB; Marcus et al., 1993, newswire) with added coordination annotations (Ficler and Goldberg, 2016a) and the GENIA treebank (Kim et al., 2003, research abstracts). We use the conversion tool distributed with the Stanford Parser (Schuster and Manning, 2016) to extract UD trees from the PTB-style phrase-structure annotations, which we then merge with coordination annotations to form bubble trees.

            Bubble-Hybrid (Ours)    Edge-Factored
             Prec.      Rec.        Prec.      Rec.
punct        92.56     92.52        92.92     92.85
case         97.46     98.14        97.71     98.26
compound     94.12     95.18        94.24     95.02
det          98.85     99.13        98.70     99.06
nsubj        97.69     97.72        98.01     97.92
nmod         93.04     93.45        93.20     93.62
amod         94.52     93.43        94.61     93.95
. . .
conj         92.68     93.20        92.52     93.04

UAS          95.81                  95.99
LAS          94.46                  94.56

Table 4: PTB test-set results, comparing our transition-based bubble parser and an edge-factored graph-based parser, both using a BERT-based feature encoder. The relation labels are ordered by decreasing frequency. While our transition-based bubble parser slightly underperforms the graph-based dependency parser generally, perhaps due to the disadvantage of greedy decoding, it gives slightly better precision and recall on the "conj" relation type.

Rescoring                    +        −
Ours (bi-LSTM)             77.10    76.27
• g = mean                 75.51    74.16
• {v_s2, v_s1, v_b1}       76.05    74.87
• {v_s1, v_b1}             76.33    73.85
• {v_b1}                   50.27    35.14
• +BERT                    84.40    83.70

Table 5: Exact F1 scores of different model variations on the PTB dev set, with and without the rescoring module.

We follow prior work in reporting PTB results on its standard splits and GENIA results using 5-fold cross-validation.10 During training (but not testing), we discard all non-projective sentences. See Appendix A for dataset pre-processing and statistics and Appendix B for implementation details.

Baseline Systems We compare our models with several baseline systems. Hara et al. (2009, HSOM09) use edit graphs to explicitly align coordinated conjuncts based on the idea that they are usually similar; Ficler and Goldberg (2016b, FG16) score candidate coordinations extracted from a phrase-structure parser by modeling their

10 We affirm that, as is best practice, only two test-set/cross-validation-suite runs occurred (one with BERT and one without), happening after we fixed everything else; that is, no other models were tried after seeing the first test-set/cross-validation results with and without BERT.

Complexity     All     Simple    Complex
TSM17         66.09    72.90     50.37
TSM19         70.90    78.16     54.16
Ours          72.97    79.97     56.82
+BERT         80.07    83.74     71.59

Table 6: Per-sentence exact match on the PTB test set. Simple includes sentences with only one two-conjunct coordination; complex contains the other cases.

symmetry and replaceability properties; Teranishi et al. (2017, TSM17) directly predict boundaries of coordinated phrases and then split them into conjuncts;11 Teranishi et al. (2019, TSM19) use separate neural models to score the inner and outer boundaries of conjuncts relative to the coordinators, and then use a chart parser to find the globally-optimal coordination structures.

Main Results Table 2 and Table 3 show the main evaluation results on the PTB and GENIA datasets. Our models surpass all prior results on both datasets. While the BERT improvements may not seem surprising, we note that Teranishi et al. (2019) report that their pre-trained language models — specifically, static ELMo embeddings — fail to improve their model performance.

General Parsing Results We also evaluate our models on standard parsing metrics by converting the predicted bubble trees to UD-style dependency trees. On PTB, our parsers reach unlabeled and labeled attachment scores (UAS/LAS) of 95.81/94.46 with BERT and 94.49/92.88 with the bi-LSTM, which are similar to the scores of prior transition-based parsers equipped with similar feature extractors (Kiperwasser and Goldberg, 2016; Mohammadshahi and Henderson, 2020).12 Table 4 compares the general parsing results of our bubble parser and an edge-factored graph-based dependency parser based on Dozat and Manning's (2017) parser architecture, using the same feature encoder as our parser and trained on the same data. Our bubble parser shows a slight improvement in identifying "conj" relations, despite having a lower overall accuracy due to the greedy nature of our transition-based decoder. Additionally, our

11 We report results for the extended model of TSM17 as described by Teranishi et al. (2019).

12 Results are not strictly comparable with previous PTB evaluations, which mostly focus on non-UD dependency conversions. Table 4 makes a self-contained comparison using the same UD-based and coordination-merged data conversions.

bubble parser simultaneously predicts the boundaries of each coordinated phrase and conjunct, while a typical dependency parser cannot produce such structures.

Model Analysis Table 5 shows results for our models with alternative bubble-feature composition functions and varying feature-set sizes. We find that the parameterized form of the composition function g performs better, and the F1 scores mostly degrade as we use fewer features from the stack. Interestingly, the importance of our rescoring module becomes more prominent when we use fewer features. Our results resonate with Shi et al.'s (2017) findings on Arc-Hybrid that we need at least one stack item, but not necessarily two. Table 6 shows that our model performs better than previous methods on complex sentences with multiple coordination structures and/or more than two conjuncts, especially when we use BERT as the feature extractor.

6 Related Work

Coordination Structure Prediction Very early work with heuristic, non-learning-based approaches (Agarwal and Boggess, 1992; Kurohashi and Nagao, 1994) typically reports difficulties in distinguishing shared modifiers from private ones, although such heuristics have recently been incorporated in unsupervised work (Sawada et al., 2020). Generally, researchers have focused on symmetry principles, seeking to align conjuncts (Kurohashi and Nagao, 1994; Shimbo and Hara, 2007; Hara et al., 2009; Hanamoto et al., 2012), since coordinated conjuncts tend to be semantically and syntactically similar (Hogan, 2007), as attested to by psycholinguistic evidence of structural parallelism (Frazier et al., 1984, 2000; Dubey et al., 2005). Ficler and Goldberg (2016a) and Teranishi et al. (2017) additionally leverage the linguistic principle of replaceability — one can typically replace a coordinated phrase with one of its conjuncts without the sentence becoming incoherent; this idea has resulted in improved open information extraction (Saha and Mausam, 2018). Using these principles may further improve our parser.

Coordination in Constituency Grammar While our paper mainly focuses on enhancing dependency-based syntactic analysis with coordination structures, coordination is a well-studied topic in constituency-based syntax (Zhang, 2009), including proposals and treatments under lexical functional grammar (Kaplan and Maxwell III, 1988), tree-adjoining grammar (Sarkar and Joshi, 1996; Han and Sarkar, 2017), and combinatory categorial grammar (Steedman, 1996, 2000).

Tesnière Dependency Structure Sangati and Mazza (2009) propose a representation that is faithful to Tesnière's (1959) original framework. Similar to bubble trees, their structures include special attention to coordination structures respecting conjunct symmetry, but they also include constructs to handle other syntactic notions currently beyond our parser's scope.13 Such representations have been used for re-ranking (Sangati, 2010), but not for (direct) parsing. Perhaps our work can inspire a future Tesnière Dependency Structure parser.

Non-constituent Coordination Seemingly incomplete (non-constituent) conjuncts are particularly challenging (Milward, 1994), and our bubble parser currently has no special mechanism for them. Dependency-based analyses have adapted by extending to a graph structure (Gerdes and Kahane, 2015) or explicitly representing elided elements (Schuster et al., 2017). It may be straightforward to integrate the latter into our parser, à la Kahane's (1997) proposal of phonologically-empty bubbles.

7 Conclusion

We revisit Kahane's (1997) bubble tree representations for explicitly encoding coordination boundaries as a viable alternative to existing mechanisms in dependency-based analysis of coordination structures. We introduce a transition system that is both sound and complete with respect to the subclass of projective bubble trees. Empirically, our bubble parsers achieve state-of-the-art results on the task of coordination structure prediction on two English datasets. Future work may extend the research scope to other languages and to graph-based and non-projective parsing methods.

Acknowledgements We thank the anonymous reviewers for their constructive comments, Yue Guo for discussion, and Hiroki Teranishi for help with the experiment setup. This work was supported in part by a Bloomberg Data Science Ph.D. Fellowship to Tianze Shi and a gift from Bloomberg to Lillian Lee.

13 For example, differentiating content and function words, which has recently been explored by Basirat and Nivre (2021).

References

Rajeev Agarwal and Lois Boggess. 1992. A simple but useful approach to conjunct identification. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pages 15–21, Newark, Delaware, USA. Association for Computational Linguistics.

Ali Basirat and Joakim Nivre. 2021. Syntactic nuclei in dependency parsing – A multilingual exploration. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1376–1387, Online. Association for Computational Linguistics.

Gosse Bouma, Djamé Seddah, and Daniel Zeman. 2020. Overview of the IWPT 2020 shared task on parsing into enhanced Universal Dependencies. In Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies, pages 151–161, Online. Association for Computational Linguistics.

Michael Collins. 2003. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4):589–637.

Miryam de Lhoneux, Miguel Ballesteros, and Joakim Nivre. 2019. Recursive subtree composition in LSTM-based dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1566–1576, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France. OpenReview.net.

Amit Dubey, Patrick Sturt, and Frank Keller. 2005. Parallelism in coordination as an instance of syntactic priming: Evidence from corpus-based modeling. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 827–834, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 334–343, Beijing, China. Association for Computational Linguistics.

Jessica Ficler and Yoav Goldberg. 2016a. Coordination annotation extension in the Penn tree bank. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 834–842, Berlin, Germany. Association for Computational Linguistics.

Jessica Ficler and Yoav Goldberg. 2016b. A neural network for coordination boundary prediction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 23–32, Austin, Texas, USA. Association for Computational Linguistics.

Jessica Ficler and Yoav Goldberg. 2017. Improving a strong neural parser with conjunction-specific features. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 343–348, Valencia, Spain. Association for Computational Linguistics.

Lyn Frazier, Alan Munn, and Charles Clifton. 2000. Processing coordinate structures. Journal of Psycholinguistic Research, 29(4):343–370.

Lyn Frazier, Lori Taft, Tom Roeper, Charles Clifton, and Kate Ehrlich. 1984. Parallel structure: A source of facilitation in sentence comprehension. Memory & Cognition, 12(5):421–430.

Kim Gerdes and Sylvain Kahane. 2015. Non-constituent coordination and other coordinative constructions as dependency graphs. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 101–110, Uppsala, Sweden. Uppsala University.

Aleksej V. Gladkij. 1968. "On describing the syntactic structure of a sentence" (in Russian with English summary). Computational Linguistics, 7:21–44.

Yoav Goldberg and Michael Elhadad. 2010. Inspecting the structural biases of dependency parsing algorithms. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 234–242, Uppsala, Sweden. Association for Computational Linguistics.

Yoav Goldberg and Joakim Nivre. 2013. Training deterministic parsers with non-deterministic oracles. Transactions of the Association for Computational Linguistics, 1:403–414.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610.

Stefan Grünewald, Prisca Piccirilli, and Annemarie Friedrich. 2021. Coordinate constructions in English enhanced Universal Dependencies: Analysis and computational modeling. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 795–809, Online. Association for Computational Linguistics.

Jan Hajic, Eduard Bejcek, Jaroslava Hlavacova, Marie Mikulová, Milan Straka, Jan Štepánek, and Barbora Štepánková. 2020. Prague dependency treebank - consolidated 1.0. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 5208–5218, Marseille, France. European Language Resources Association.

Jan Hajic, Barbora Vidová Hladká, Jarmila Panevová, Eva Hajicová, Petr Sgall, and Petr Pajas. 2001. Prague dependency treebank 1.0 (LDC2001T10).

Chung-hye Han and Anoop Sarkar. 2017. Coordination in TAG without the Conjoin Operation. In Proceedings of the 13th International Workshop on Tree Adjoining Grammars and Related Formalisms, pages 43–52, Umeå, Sweden. Association for Computational Linguistics.

Atsushi Hanamoto, Takuya Matsuzaki, and Jun'ichi Tsujii. 2012. Coordination structure analysis using dual decomposition. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 430–438, Avignon, France. Association for Computational Linguistics.

Kazuo Hara, Masashi Shimbo, Hideharu Okuma, and Yuji Matsumoto. 2009. Coordinate structure analysis with global structural constraints and alignment-based local features. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 967–975, Suntec, Singapore. Association for Computational Linguistics.

Deirdre Hogan. 2007. Coordinate noun phrase disambiguation in a generative parsing model. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 680–687, Prague, Czech Republic. Association for Computational Linguistics.

Richard A. Hudson. 1984. Word Grammar. Blackwell, Oxford.

Timo Järvinen and Pasi Tapanainen. 1998. Towards an implementable dependency grammar. In Processing of Dependency-Based Grammars, pages 1–10, Montreal, Quebec, Canada. Association for Computational Linguistics.

Richard Johansson and Pierre Nugues. 2007. Extended constituent-to-dependency conversion for English. In Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007), pages 105–112, Tartu, Estonia. University of Tartu, Estonia.

Sylvain Kahane. 1997. Bubble trees and syntactic representations. In Proceedings of the 5th Meeting of Mathematics of Language, pages 70–76, Saarbrücken, Germany.

Ronald M. Kaplan and John T. Maxwell III. 1988. Constituent coordination in lexical-functional grammar. In Proceedings of International Conference on Computational Linguistics, pages 303–305, Budapest, Hungary.

Jin Dong Kim, Tomoko Ohta, Yuka Tateisi, and Jun'ichi Tsujii. 2003. GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl_1):i180–i182.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.

Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009. Dependency parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Marco Kuhlmann, Carlos Gómez-Rodríguez, and Giorgio Satta. 2011. Dynamic programming algorithms for transition-based dependency parsers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 673–682, Portland, Oregon, USA. Association for Computational Linguistics.

Sadao Kurohashi and Makoto Nagao. 1994. A syntactic analysis method of long Japanese sentences based on the detection of conjunctive structures. Computational Linguistics, 20(4):507–534.

Vincenzo Lombardo and Leonardo Lesmo. 1998. Unit coordination and gapping in dependency theory. In Processing of Dependency-Based Grammars, pages 11–20, Montreal, Quebec, Canada. Association for Computational Linguistics.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330.

Igor Mel'cuk. 1988. Dependency Syntax: Theory and Practice. State University Press of New York, Albany.

Igor Mel'cuk. 2003. Levels of dependency in linguistic description: Concepts and problems. In Vilmos Ágel, Ludwig M. Eichinger, Hans Werner Eroms, Peter Hellwig, Hans Jürgen Heringer, and Henning Lobin, editors, Dependency and Valency: An International Handbook of Contemporary Research, volume 1, pages 188–229. Walter de Gruyter, Berlin.

David Milward. 1994. Non-constituent coordination: Theory and practice. In Proceedings of the 15th International Conference on Computational Linguistics, volume 2, pages 935–941, Kyoto, Japan.

Alireza Mohammadshahi and James Henderson. 2020. Graph-to-graph Transformer for transition-based dependency parsing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3278–3289, Online. Association for Computational Linguistics.

Jens Nilsson, Joakim Nivre, and Johan Hall. 2006. Graph transformations in data-driven dependency parsing. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 257–264, Sydney, Australia. Association for Computational Linguistics.

Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666, Portorož, Slovenia. European Language Resources Association.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Jan Hajic, Christopher D. Manning, Sampo Pyysalo, Sebastian Schuster, Francis Tyers, and Daniel Zeman. 2020. Universal Dependencies v2: An evergrowing multilingual treebank collection. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4034–4043, Marseille, France. European Language Resources Association.

Joakim Nivre, Paola Marongiu, Filip Ginter, Jenna Kanerva, Simonetta Montemagni, Sebastian Schuster, and Maria Simi. 2018. Enhancing Universal Dependency treebanks: A case study. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), pages 102–107, Brussels, Belgium. Association for Computational Linguistics.

Martin Popel, David Marecek, Jan Štepánek, Daniel Zeman, and Zdenek Žabokrtský. 2013. Coordination structures in dependency treebanks. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 517–527, Sofia, Bulgaria. Association for Computational Linguistics.

Owen Rambow. 2010. The simple truth about dependency and phrase structure representations: An opinion piece. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 337–340, Los Angeles, California, USA. Association for Computational Linguistics.

Ines Rehbein, Julius Steen, Bich-Ngoc Do, and Anette Frank. 2017. Universal Dependencies are hard to parse — or are they? In Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), pages 218–228, Pisa, Italy. Linköping University Electronic Press.

Swarnadeep Saha and Mausam. 2018. Open information extraction from conjunctive sentences. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2288–2299, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Swarnadeep Saha, Harinder Pal, and Mausam. 2017. Bootstrapping for numerical Open IE. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 317–323, Vancouver, Canada. Association for Computational Linguistics.

Federico Sangati. 2010. A probabilistic generative model for an intermediate constituency-dependency representation. In Proceedings of the ACL 2010 Student Research Workshop, pages 19–24, Uppsala, Sweden. Association for Computational Linguistics.

Federico Sangati and Chiara Mazza. 2009. An English dependency treebank à la Tesnière. In Proceedings of the 8th International Workshop on Treebanks and Linguistic Theories, pages 173–184, Milan, Italy.

Anoop Sarkar and Aravind Joshi. 1996. Coordination in tree adjoining grammars: Formalization and implementation. In Proceedings of the 16th International Conference on Computational Linguistics, pages 610–615, Copenhagen, Denmark.

Yuya Sawada, Takashi Wada, Takayoshi Shibahara, Hiroki Teranishi, Shuhei Kondo, Hiroyuki Shindo, Taro Watanabe, and Yuji Matsumoto. 2020. Coordination boundary identification without labeled data for compound terms disambiguation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 3043–3049, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Sebastian Schuster, Matthew Lamm, and Christopher D. Manning. 2017. Gapping constructions in Universal Dependencies v2. In Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 123–132, Gothenburg, Sweden. Association for Computational Linguistics.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English Universal Dependencies: An improved representation for natural language understanding tasks. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 2371–2378, Portorož, Slovenia. European Language Resources Association.

Tianze Shi, Liang Huang, and Lillian Lee. 2017. Fast(er) exact decoding and global training for transition-based dependency parsing via a minimal feature set. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 12–23, Copenhagen, Denmark. Association for Computational Linguistics.

Masashi Shimbo and Kazuo Hara. 2007. A discriminative learning model for coordinate conjunctions. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 610–619, Prague, Czech Republic. Association for Computational Linguistics.

Mark Steedman. 1996. Surface Structure and Interpretation. Number 30 in Linguistic Inquiry Monographs. The MIT Press, Cambridge, Massachusetts.

Mark Steedman. 2000. The Syntactic Process. MIT Press, Cambridge, MA, USA.

Hiroki Teranishi, Hiroyuki Shindo, and Yuji Matsumoto. 2017. Coordination boundary identification with similarity and replaceability. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 264–272, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Hiroki Teranishi, Hiroyuki Shindo, and Yuji Matsumoto. 2019. Decomposed local models for coordinate structure parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3394–3403, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Lucien Tesnière. 1959. Eléments de Syntaxe Structurale. Librairie C. Klincksieck, Paris.

Niina Ning Zhang. 2009. Coordination in Syntax. Cambridge Studies in Linguistics. Cambridge University Press, New York.

Appendix A Dataset Processing and Statistics

We follow Teranishi et al. (2019) and use the same dataset splits and pre-processing steps. For the Penn Treebank (PTB; Marcus et al., 1993) data with added coordination annotations (Ficler and Goldberg, 2016a), we use WSJ sections 02–21 for training, section 22 for development, and section 23 for testing. We also apply Teranishi et al.'s (2019) pre-processing step of stripping quotation marks surrounding PTB coordinated phrases to normalize irregular coordinations. This yields 39,832/1,700/2,416 sentences and 19,890/848/1,099 coordination structures in the train/dev/test splits, respectively. For the GENIA treebank (Kim et al., 2003), we use the beta version of the corpus and follow the same 5-fold cross-validation splits as Teranishi et al. (2019). In total, GENIA contains 2,508 sentences and 3,598 coordination structures.

To derive bubble tree representations, we first convert the PTB-style phrase-structure trees in both treebanks into Universal Dependencies (UD; Nivre et al., 2016) style with the conversion tool (Schuster and Manning, 2016) provided by the Stanford CoreNLP toolkit, version 4.2.0. We then merge the UD trees with the bubbles formed by the coordination boundaries. We define each boundary to extend from the beginning of the first conjunct to the end of the last conjunct of the coordinated phrase. We attach all conjuncts to their corresponding bubbles with a "conj" label, and map any "conj"-labeled dependencies outside an annotated coordination to "dep". We resolve modifier scope ambiguities according to the conjunct annotations: if a modifier is within the span of a conjunct, it is a private modifier; otherwise, it is a modifier shared by the entire coordinated phrase, and we attach it to the phrasal bubble. Since our transition system targets projective bubble trees, we filter out any non-projective trees during training (but still evaluate on them during testing). We retain 39,678 sentences, or 99.6% of the PTB training set, and 2,429 sentences, or 96.9% of the GENIA dataset.
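To make these merging rules concrete, here is a minimal sketch of the boundary and modifier-scope logic, assuming coordination annotations arrive as inclusive (start, end) conjunct spans; the Bubble class and helper names are hypothetical illustrations, not interfaces from our released code.

```python
# A minimal sketch of the bubble-derivation rules described above; the data
# structures are hypothetical and simplified for illustration.
from dataclasses import dataclass, field
from typing import List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices

@dataclass
class Bubble:
    left: int                  # boundary: start of the first conjunct ...
    right: int                 # ... to the end of the last conjunct
    conjuncts: List[Span] = field(default_factory=list)

def make_bubble(conjunct_spans: List[Span]) -> Bubble:
    spans = sorted(conjunct_spans)
    return Bubble(left=spans[0][0], right=spans[-1][1], conjuncts=spans)

def classify_modifier(bubble: Bubble, mod_index: int):
    """A modifier inside some conjunct's span is private to that conjunct;
    otherwise it is shared and attaches to the phrasal bubble itself."""
    for l, r in bubble.conjuncts:
        if l <= mod_index <= r:
            return ("private", (l, r))
    return ("shared", (bubble.left, bubble.right))

# "hot coffee or tea": hot=0, coffee=1, or=2, tea=3; the conjuncts are the
# single-word spans for "coffee" and "tea".
b = make_bubble([(1, 1), (3, 3)])
print(classify_modifier(b, 0))  # ('shared', (1, 3)): "hot" scopes over the bubble
```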

Appendix B Implementation Details

Our implementation (https://www.github.com/tzshi/bubble-parser-acl21) is based on PyTorch.

We train our models using the Adam optimizer (Kingma and Ba, 2015). After a fixed number of optimization steps (3,200 steps for PTB and 800 steps for GENIA, based on their training set sizes), we perform an evaluation on the dev set. If the dev set performance fails to improve within 5 consecutive evaluation rounds, we multiply the learning rate by 0.1. We terminate model training when the learning rate has dropped three times, and select the best model checkpoint based on dev set F1 scores according to the "exact" metrics.14

Adam optimizer:
  Initial learning rate for bi-LSTM     10^-3
  Initial learning rate for BERT        10^-5
  β1                                    0.9
  β2                                    0.999
  ε                                     10^-8
  Minibatch size                        8
  Linear warmup steps                   800
  Gradient clipping L2 norm             5.0

Inputs to bi-LSTM:
  Word-embedding dimensionality         100
  POS tag-embedding dimensionality      32
  Character bi-LSTM layers              1
  Character bi-LSTM dimensionality      128

Bi-LSTM:
  Number of layers                      3
  Dimensionality                        800
  Dropout                               0.3

MLPs (same for all MLPs):
  Number of hidden layers               1
  Hidden layer dimensionality           400
  Activation function                   ReLU
  Dropout                               0.3

Table A1: Hyperparameters of our models.
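For concreteness, the following is a minimal sketch of this training schedule, assuming a generic PyTorch loop; train_for_fixed_steps and evaluate_dev are hypothetical stand-ins for the actual training and evaluation routines.

```python
# A minimal sketch of the learning-rate schedule described above (warmup
# omitted). The stubs below are placeholders, not the released code.
import torch

model = torch.nn.Linear(4, 2)  # placeholder for the actual parser
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)

def train_for_fixed_steps(model, optimizer, steps):
    pass  # 3,200 steps for PTB, 800 for GENIA

def evaluate_dev(model):
    return 0.0  # dev-set "exact" F1 in the real pipeline

best_f1, stale_rounds, lr_drops = -1.0, 0, 0
while lr_drops < 3:                        # stop after the third LR reduction
    train_for_fixed_steps(model, optimizer, steps=3200)
    f1 = evaluate_dev(model)
    if f1 > best_f1:                       # new best checkpoint
        best_f1, stale_rounds = f1, 0
        torch.save(model.state_dict(), "best.pt")
    else:
        stale_rounds += 1
        if stale_rounds >= 5:              # 5 rounds without improvement
            for group in optimizer.param_groups:
                group["lr"] *= 0.1         # multiply learning rate by 0.1
            lr_drops += 1
            stale_rounds = 0
```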

For the BERT feature extractor, we fine-tune the pretrained case-sensitive BERT-base model through the transformers package.15 For the non-BERT model, we use pre-trained GloVe embeddings (Pennington et al., 2014).
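As a minimal sketch, the encoder can be loaded through the standard transformers APIs as below; the surrounding parser wiring is omitted.

```python
from transformers import AutoModel, AutoTokenizer

# Case-sensitive BERT-base; fine-tuned with a 1e-5 learning rate (Table A1).
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert = AutoModel.from_pretrained("bert-base-cased")
```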

Following prior practice, we embed gold POS tags as input features to the bi-LSTM for the models trained on the GENIA dataset, but we omit the POS tag embeddings for the PTB dataset.

The training process for each model takes roughly 10 hours using an RTX 2080 Ti GPU; model inference speed is 41.9 sentences per second.16

We select our hyperparameters by hand. Due to computational constraints, our hand-tuning has been limited to setting the dropout rates: from the candidate set {0.0, 0.1, 0.3, 0.5}, we chose 0.3 based on dev-set performance. Our hyperparameters are listed in Table A1.

14 Even though we report recall on GENIA, model selection is still performed using F1.
15 github.com/huggingface/transformers
16 We have not yet done extensive optimization regarding GPU batching for greedy transition-based parsers.

Appendix C Extended Results

Table A2 and Table A3 include detailed evaluation results on the PTB and GENIA datasets.

References

Jessica Ficler and Yoav Goldberg. 2016a. Coordination annotation extension in the Penn tree bank. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 834–842, Berlin, Germany. Association for Computational Linguistics.

Jessica Ficler and Yoav Goldberg. 2016b. A neural network for coordination boundary prediction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 23–32, Austin, Texas, USA. Association for Computational Linguistics.

Kazuo Hara, Masashi Shimbo, Hideharu Okuma, and Yuji Matsumoto. 2009. Coordinate structure analysis with global structural constraints and alignment-based local features. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 967–975, Suntec, Singapore. Association for Computational Linguistics.

Jin Dong Kim, Tomoko Ohta, Yuka Tateisi, and Jun'ichi Tsujii. 2003. GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl_1):i180–i182.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, California, USA.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn treebank. Computational Linguistics, 19(2):313–330.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666, Portorož, Slovenia. European Language Resources Association.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Sebastian Schuster and Christopher D. Manning. 2016. Enhanced English Universal Dependencies: An improved representation for natural language understanding tasks. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 2371–2378, Portorož, Slovenia. European Language Resources Association.

Hiroki Teranishi, Hiroyuki Shindo, and Yuji Matsumoto. 2017. Coordination boundary identification with similarity and replaceability. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 264–272, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Hiroki Teranishi, Hiroyuki Shindo, and Yuji Matsumoto. 2019. Decomposed local models for coordinate structure parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3394–3403, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

                          Dev                                       Test
               All                  NP                   All                  NP
           P     R     F        P     R     F        P     R     F        P     R     F

Exact
  TSM17  74.13 73.34 73.74   76.21 75.51 75.86   71.48 70.70 71.08   75.20 74.84 75.01
  TSM19  76.95 76.76 76.85   78.11 77.57 77.84   75.33 75.61 75.47   77.95 77.70 77.83
  Ours   77.19 77.00 77.10   80.00 79.63 79.82   76.62 76.34 76.48   81.89 81.37 81.63
  +BERT  84.25 84.55 84.40   88.53 88.33 88.43   83.85 83.62 83.74   85.33 85.19 85.26

Inner
  FG16   72.34 72.25 72.29   75.17 74.82 74.99   72.81 72.61 72.70   76.91 75.31 76.10
  TSM17  76.04 75.23 75.63   77.82 77.11 77.47   74.14 73.33 73.74   77.44 77.07 77.25
  TSM19  79.19 79.00 79.10   80.64 80.09 80.36   77.60 77.88 77.74   80.19 79.93 80.06
  Ours   78.61 78.42 78.51   82.07 81.69 81.88   78.45 78.16 78.30   84.29 83.76 84.03
  +BERT  85.19 85.50 85.34   89.45 89.24 89.35   84.58 84.35 84.46   86.28 86.15 86.22

Table A2: Precision, recall, and F1 scores on the PTB dev and test sets.

            NP     VP   ADJP      S     PP    UCP   SBAR   ADVP  Others    All
Count    2,317    465    321    188    167     60     56     21       3  3,598

Exact
  TSM17  57.14  54.83  72.27   8.51  55.68  28.33  57.14  85.71    0.00  55.22
  TSM19  59.21  64.94  78.19  53.19  55.68  48.33  66.07  90.47    0.00  61.22
  Ours   68.28  58.71  86.29  56.38  55.09  51.67  58.93  95.24    0.00  67.09
  +BERT  79.41  76.34  88.79  77.13  73.05  61.67  76.79  100.0   33.33  79.18

Whole
 HSOM09  64.2   54.2   80.4   22.9   59.9   36.7   51.8   85.7    66.7   61.5
  FG16   65.08  71.82  74.76  17.02  56.28  51.67  91.07  80.95   33.33  64.14
  TSM17  67.19  63.65  76.63  53.19  61.67  35.00  78.57  85.71   33.33  66.31
  TSM19  59.30  65.16  78.19  53.19  55.68  48.33  66.07  90.47    0.00  61.31
  Ours   69.40  59.35  87.85  57.45  56.89  51.67  62.50  95.24    0.00  68.23
  +BERT  80.58  76.77  90.03  78.19  76.65  61.67  82.14  100.0   33.33  80.41

Table A3: Recall on the GENIA dataset.