Exact Inference for Generative Probabilistic Non-Projective Dependency Parsing

Shay B. Cohen
School of Computer Science
Carnegie Mellon University, USA
[email protected]

Carlos Gómez-Rodríguez
Departamento de Computación
Universidade da Coruña, Spain
[email protected]

Giorgio Satta
Dept. of Information Engineering
University of Padua, Italy
[email protected]

Abstract

We describe a generative model for non-projective dependency parsing based on a simplified version of a transition system that has recently appeared in the literature. We then develop a dynamic programming parsing algorithm for our model, and derive an inside-outside algorithm that can be used for unsupervised learning of non-projective dependency trees.

1 Introduction

Dependency grammars have received considerable attention in the statistical parsing community in recent years. These grammatical formalisms offer a good balance between structural expressivity and processing efficiency. Most notably, when non-projectivity is supported, these formalisms can model crossing syntactic relations that are typical in languages with relatively free word order.

Recent work has reduced non-projective parsing to the identification of a maximum spanning tree in a graph (McDonald et al., 2005; Koo et al., 2007; McDonald and Satta, 2007; Smith and Smith, 2007). An alternative to this approach is to use transition-based parsing (Yamada and Matsumoto, 2003; Nivre and Nilsson, 2005; Attardi, 2006; Nivre, 2009; Gómez-Rodríguez and Nivre, 2010), where a string is processed incrementally with a model that scores transitions between parser states, conditioned on the parse history. This paper focuses on the latter approach.

The above work on transition-based parsing has focused on greedy algorithms set in a statistical framework (Nivre, 2008). More recently, dynamic programming has been successfully used for projective parsing (Huang and Sagae, 2010; Kuhlmann et al., 2011). Dynamic programming algorithms for parsing (also known as chart-based algorithms) allow polynomial space representations of all parse trees for a given input string, even in cases where the size of this set is exponential in the length of the string itself. In combination with appropriate semirings, these packed representations can be exploited to compute many values of interest for machine learning, such as best parses and feature expectations (Goodman, 1999; Li and Eisner, 2009).

In this paper we move one step forward with respect to Huang and Sagae (2010) and Kuhlmann et al. (2011) and present a polynomial dynamic programming algorithm for non-projective transition-based parsing. Our algorithm is coupled with a simplified version of the transition system from Attardi (2006), which has high coverage for the type of non-projective structures that appear in various treebanks. Instead of an additional transition operation which permits swapping of two elements in the stack (Titov et al., 2009; Nivre, 2009), Attardi's system allows reduction of elements at non-adjacent positions in the stack. We also present a generative probabilistic model for transition-based parsing. One implication, for example, is that one can now approach the problem of unsupervised learning of non-projective dependency structures within the transition-based framework.

Dynamic programming algorithms for non-projective parsing have been proposed by Kahane et al. (1998), Gómez-Rodríguez et al. (2009) and Kuhlmann and Satta (2009), but they all run in time exponential in the 'gap degree' of the parsed structures. To the best of our knowledge, this paper is the first to introduce a dynamic programming algorithm for inference with non-projective structures of unbounded gap degree.

The rest of this paper is organized as follows. In §2 and §3 we outline the transition-based model we use, together with a probabilistic generative interpretation. In §4 we give the tabular algorithm for parsing, and in §5 we discuss statistical inference using expectation maximization. We then discuss some other aspects of the work in §6 and conclude in §7.

2 Transition-based Dependency Parsing

In this section we briefly introduce the basic definitions for transition-based dependency parsing. For a more detailed presentation of this subject, we refer the reader to Nivre (2008). We then define a specific transition-based model for non-projective dependency parsing that we investigate in this paper.

2.1 General Transition Systems

Assume an input alphabet Σ with a special symbol $ ∈ Σ, which we use as the root of our parse structures. Throughout this paper we denote the input string as w = a_0 ⋯ a_{n−1}, n ≥ 1, where a_0 = $ and a_i ∈ Σ \ {$} for each i with 1 ≤ i ≤ n − 1.

A dependency tree for w is a directed tree G_w = (V_w, A_w), where V_w = {0, . . . , n − 1} is the set of nodes, and A_w ⊆ V_w × V_w is the set of arcs. The root of G_w is the node 0. The intended meaning is that each node in V_w encodes the position of a token in w. Furthermore, each arc in A_w encodes a dependency relation between two tokens. We write i → j to denote a directed arc (i, j) ∈ A_w, where node i is the head and node j is the dependent.

A transition system (for dependency parsing) is a tuple S = (C, T, I, C_t), where C is a set of configurations, defined below, T is a finite set of transitions, which are partial functions t : C ⇀ C, I is a total initialization function mapping each input string to a unique initial configuration, and C_t ⊆ C is a set of terminal configurations.

A configuration is defined relative to some input string w, and is a triple (σ, β, A), where σ and β are disjoint lists called stack and buffer, respectively, and A ⊆ V_w × V_w is a set of arcs. Elements of σ and β are nodes from V_w and, in the case of the stack, a special symbol ¢ that we will use as the initial stack symbol. If t is a transition and c_1, c_2 are configurations such that t(c_1) = c_2, we write c_1 ⊢_t c_2, or simply c_1 ⊢ c_2 if t is understood from the context.

Given an input string w, a parser based on S incrementally processes w from left to right, starting in the initial configuration I(w). At each step, the parser nondeterministically applies one transition, or else it stops if it has reached some terminal configuration. The dependency graph defined by the arc set associated with a terminal configuration is then returned as one possible analysis for w.

Formally, a computation of S is a sequence γ = c_0, . . . , c_m, m ≥ 1, of configurations such that, for every i with 1 ≤ i ≤ m, c_{i−1} ⊢_{t_i} c_i for some t_i ∈ T. In other words, each configuration in a computation is obtained as the value of the preceding configuration under some transition. A computation is called complete whenever c_0 = I(w) for some input string w, and c_m ∈ C_t.

We can view a transition-based dependency parser as a device mapping strings into graphs (dependency trees). Without any restriction on the transition functions in T, these functions might have an infinite domain, and could thus encode even non-recursively enumerable languages. However, in standard practice for natural language parsing, transitions are always specified by some finite means. In particular, the definition of each transition depends on some finite window at the top of the stack and some finite window at the beginning of the buffer in each configuration. In this case, we can view a transition-based dependency parser as a notational variant of a push-down transducer (Hopcroft et al., 2000), whose computations output sequences that directly encode dependency trees. These transducers are nondeterministic, meaning that several transitions can be applied to some configurations. The transition systems we investigate in this paper follow these principles.

We close this subsection with some additional notation. We denote the stack with its topmost element to the right and the buffer with its first element to the left. We indicate concatenation in the stack and buffer by a vertical bar. For example, for k ∈ V_w, σ|k denotes some stack with topmost element k and k|β denotes some buffer with first element k. For 0 ≤ i ≤ n − 1, β_i denotes the buffer [i, i + 1, . . . , n − 1]; for i ≥ n, β_i denotes [] (the empty buffer).

2.2 A Non-projective Transition System

We now turn to give a description of our transition system for non-projective parsing. While a projective dependency tree satisfies the requirement that, for every arc in the tree, there is a directed path between its headword and each of the words between the two endpoints of the arc, a non-projective dependency tree may violate this condition. Even though some natural languages exhibit syntactic phenomena which require non-projective expressive power, most often such a resource is used in a limited way.

This idea is demonstrated by Attardi (2006), who proposes a transition system whose individual transitions can deal with non-projective dependencies only to a limited extent, depending on the distance in the stack of the nodes involved in the newly constructed dependency. The author defines this distance as the degree of the transition, with transitions of degree one being able to handle only projective dependencies. This formulation permits parsing a subset of the non-projective trees, where this subset depends on the degree of the transitions. The reported coverage in Attardi (2006) is already very high when the system is restricted to transitions of degree two or three. For instance, on training data for Czech containing 28,934 non-projective relations, 27,181 can be handled by degree two transitions, and 1,668 additional dependencies can be handled by degree three transitions. Table 1 gives additional statistics for treebanks from the CoNLL-X shared task (Buchholz and Marsi, 2006).

We now turn to describe our variant of the transition system of Attardi (2006), which is equivalent to the original system restricted to transitions of degree two. Our results are based on such a restriction. It is not difficult to extend our algorithms (§4) to higher degree transitions, but this comes at the expense of higher complexity. See §6 for more discussion of this issue.

Let w = a_0 ⋯ a_{n−1} be an input string over Σ defined as in §2.1, with a_0 = $. Our transition system for non-projective dependency parsing is

$$S^{(np)} = (C, T^{(np)}, I^{(np)}, C_t^{(np)}),$$

Language     Deg. 2   Deg. 3   Deg. 4
Arabic          180       21        7
Bulgarian       961       41       10
Czech         27181     1668       85
Danish          876      136       53
Dutch          9072     2119      171
German        15827     2274      466
Japanese       1484      143        9
Portuguese     3104      424       37
Slovene         601       48       13
Spanish          66        7        0
Swedish        1566      226       79
Turkish         579      185        8

Table 1: The number of non-projective relations of various degrees for several treebanks (training sets), as reported by the parser of Attardi (2006). Deg. stands for 'degree.' The parser did not detect non-projective relations of degree higher than 4.

where C is the same set of configurations defined in §2.1. The initialization function I^{(np)} maps each string w to the initial configuration ([¢], β_0, ∅). The set of terminal configurations C_t^{(np)} contains all configurations of the form ([¢, 0], [], A), for any set of arcs A.

The set of transition functions is defined as

$$T^{(np)} = \{sh_b \mid b \in \Sigma\} \cup \{la_1, ra_1, la_2, ra_2\},$$

where each transition is specified below. We let variables i, j, k, l range over V_w, and variable σ is a list of stack elements from V_w ∪ {¢}:

sh_b : (σ, k|β, A) ⊢ (σ|k, β, A), if a_k = b;
la_1 : (σ|i|j, β, A) ⊢ (σ|j, β, A ∪ {j → i});
ra_1 : (σ|i|j, β, A) ⊢ (σ|i, β, A ∪ {i → j});
la_2 : (σ|i|j|k, β, A) ⊢ (σ|j|k, β, A ∪ {k → i});
ra_2 : (σ|i|j|k, β, A) ⊢ (σ|i|j, β, A ∪ {i → k}).

Each of the above transitions is undefined on configurations that do not match the forms specified above. As an example, transition la_2 is not defined for a configuration (σ, β, A) with |σ| ≤ 2, and transition sh_b is not defined for a configuration (σ, k|β, A) with b ≠ a_k, or for a configuration (σ, [], A).

Transition sh_b removes the first node from the buffer, in case this node represents symbol b ∈ Σ, and pushes it onto the stack. These transitions are called shift transitions. The remaining four transitions are called reduce transitions, i.e., transitions that consume nodes from the stack. Notice that in the transition system at hand all the reduce transitions decrease the size of the stack by one element. Transition la_1 creates a new arc with the topmost node on the stack as the head and the second-topmost node as the dependent, and removes the latter from the stack. Transition ra_1 is symmetric with respect to la_1. Transitions la_1 and ra_1 have degree one, as already explained. When restricted to these three transitions, the system is equivalent to the so-called stack-based arc-standard model of Nivre (2004). Transition la_2 and transition ra_2 are very similar to la_1 and ra_1, respectively, but with the difference that they create a new arc between the topmost node in the stack and a node which is two positions below the topmost node. Hence, these transitions have degree two, and are the key components in parsing of non-projective dependencies.
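To make the definitions concrete, here is a minimal Python sketch (our illustration, not the authors' implementation) of S^{(np)}: a configuration is a (stack, buffer, arcs) triple, and each transition is a partial function that returns None where it is undefined.

```python
# Minimal sketch (ours) of the transition system S^(np). A configuration is
# (stack, buffer, arcs): stack and buffer are tuples of node indices (the
# stack may also hold the initial symbol '¢'), and arcs is a frozenset of
# (head, dependent) pairs.

def initial(w):
    """I^(np): initial configuration for w = a_0 ... a_{n-1}, a_0 = '$'."""
    return (('¢',), tuple(range(len(w))), frozenset())

def is_terminal(config):
    """Terminal configurations have stack [¢, 0] and an empty buffer."""
    stack, buffer, _ = config
    return stack == ('¢', 0) and not buffer

def sh(config, b, w):
    """sh_b: move the first buffer node onto the stack if its form is b."""
    stack, buffer, arcs = config
    if not buffer or w[buffer[0]] != b:
        return None  # transition undefined on this configuration
    return (stack + (buffer[0],), buffer[1:], arcs)

def la1(config):
    """la1: arc j -> i between the two topmost nodes; pop i (second-topmost)."""
    stack, buffer, arcs = config
    if len(stack) < 2 or '¢' in stack[-2:]:
        return None
    i, j = stack[-2], stack[-1]
    return (stack[:-2] + (j,), buffer, arcs | {(j, i)})

def ra1(config):
    """ra1: arc i -> j between the two topmost nodes; pop j (topmost)."""
    stack, buffer, arcs = config
    if len(stack) < 2 or '¢' in stack[-2:]:
        return None
    i, j = stack[-2], stack[-1]
    return (stack[:-2] + (i,), buffer, arcs | {(i, j)})

def la2(config):
    """la2 (degree two): arc k -> i, with i two positions below k; pop i."""
    stack, buffer, arcs = config
    if len(stack) < 3 or '¢' in stack[-3:]:
        return None
    i, j, k = stack[-3], stack[-2], stack[-1]
    return (stack[:-3] + (j, k), buffer, arcs | {(k, i)})

def ra2(config):
    """ra2 (degree two): arc i -> k, with i two positions below k; pop k."""
    stack, buffer, arcs = config
    if len(stack) < 3 or '¢' in stack[-3:]:
        return None
    i, j, k = stack[-3], stack[-2], stack[-1]
    return (stack[:-3] + (i, j), buffer, arcs | {(i, k)})
```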

We turn next to describe the equivalence between our system and the system in Attardi (2006). The transition-based parser presented by Attardi pushes back into the buffer elements that are in the top position of the stack. However, a careful analysis shows that only the first position in the buffer can be affected by this operation, in the sense that elements that are pushed back from the stack are never found in buffer positions other than the first. This means that we can consider the first element of the buffer as an additional stack element, always sitting on top of the topmost stack symbol.

More formally, we can define a function m_c : C → C that maps configurations in the original algorithm to those in our variant as follows:

$$m_c((\sigma, k|\beta, A)) = (\sigma|k, \beta, A).$$

By applying this mapping to the source and target configuration of each transition in the original system, it is easy to check that c_1 ⊢ c_2 in that parser if and only if m_c(c_1) ⊢ m_c(c_2) in our variant. We extend this and define an isomorphism between computations in both systems, such that a computation c_0, . . . , c_m in the original parser is mapped to a computation m_c(c_0), . . . , m_c(c_m) in the variant, with both generating the same dependency graph A. This proves that our notational variant is in fact equivalent to Attardi's parser.

[Figure 1: A dependency structure of arbitrary gap degree that can be parsed with Attardi's parser.]

A relevant property of the set of dependency structures that can be processed by Attardi's parser, even when restricted to transitions of degree two, is that the number of discontinuities present in each of their subtrees, defined as the gap degree by Bodirsky et al. (2005), is not bounded. For example, the dependency graph in Figure 1 has gap degree n − 1, and it can be parsed by the algorithm for any arbitrary n ≥ 1 by applying 2n sh_b transitions to push all the nodes into the stack, followed by (2n − 2) ra_2 transitions to create the crossing arcs, and finally one ra_1 transition to create the dependency 1 → 2.
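Reusing the sketch above, the hypothetical driver below replays this transition sequence for n = 3 (nodes 1–6 plus the root 0). The final extra ra1 that attaches node 1 to the root, needed to reach a terminal configuration, is our addition and not part of the sequence described in the text.

```python
# Hypothetical driver for the Figure 1 computation with n = 3.
n = 3
w = ['$'] + ['a%d' % k for k in range(1, 2 * n + 1)]  # dummy word forms

c = initial(w)
for k in range(len(w)):      # shift transitions push all nodes (root included)
    c = sh(c, w[k], w)
for _ in range(2 * n - 2):   # (2n - 2) ra2 transitions create the crossing arcs
    c = ra2(c)
c = ra1(c)                   # one ra1 creates the dependency 1 -> 2
c = ra1(c)                   # attach node 1 to the root (our addition)

assert is_terminal(c)
print(sorted(c[2]))          # [(0, 1), (1, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
```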

As mentioned in §1, the computational complexity of the dynamic programming algorithm that will be described in later sections does not depend on the gap degree, contrary to the non-projective dependency chart parsers presented by Gómez-Rodríguez et al. (2009) and by Kuhlmann and Satta (2009), whose running time is exponential in the maximum gap degree allowed by the grammar.

3 A Generative Probabilistic Model

In this section we introduce a generative probabilistic model based on the transition system of §2.2. In formal language theory, there is a standard way of giving a probabilistic interpretation to a nondeterministic parser whose computations are based on sequences of elementary operations such as transitions. The idea is to define conditional probability distributions over instances of the transition functions, and to 'combine' these probabilities to assign probabilities to computations and strings.

One difficulty we have to face when dealing with transition systems is that the notion of computation, defined in §2.1, depends on the input string, because of the buffer component appearing in each configuration. This is a pitfall for generative modeling, where we are interested in a system whose computations lead to the generation of any string. To overcome this problem, we observe that each computation, defined as a sequence of stacks and buffers (the configurations), can equivalently be expressed as a sequence of stacks and transitions.

More precisely, consider a computation γ = c_0, . . . , c_m, m ≥ 1. Let σ_i be the stack associated with c_i, for each i with 0 ≤ i ≤ m. Let also C_σ be the set of all stacks associated with configurations in C. We can make explicit the transitions that have been used in the computation by rewriting γ in the form σ_0 ⊢_{t_1} σ_1 ⋯ σ_{m−1} ⊢_{t_m} σ_m. In this way, γ generates the string composed of all symbols that are pushed onto the stack by transitions sh_b, in left-to-right order.

We can now associate a probability to (our representation of) sequence γ by setting

$$p(\gamma) = \prod_{i=1}^{m} p(t_i \mid \sigma_{i-1}). \qquad (1)$$

To assign probabilities to complete computations we should further multiply p(γ) by factors p_s(σ_0) and p_e(σ_m), where p_s and p_e are start and end probability distributions, respectively, both defined over C_σ. Note however that, as defined in §2.2, all initial configurations are associated with stack [¢] and all final configurations are associated with stack [¢, 0], thus p_s and p_e are deterministic. Note that the Markov chain represented in Eq. 1 is homogeneous, i.e., the probabilities of the transition operations do not depend on the time step.

As a second step we observe that, according to the definition of transition system, each t ∈ T has an infinite domain. A commonly adopted solution is to introduce a special function, called the history function and denoted by H, defined over the set C_σ and taking values in some finite set. For each t ∈ T and σ, σ′ ∈ C_σ, we then impose the condition

$$p(t \mid \sigma) = p(t \mid \sigma')$$

whenever H(σ) = H(σ′). Since H is finitely valued, and since T is a finite set, the above condition guarantees that there will only be a finite number of parameters p(t | σ) in our model.
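For instance, a history function that keeps only the word forms of the k topmost stack symbols takes values in the finite set (Σ ∪ {¢})^k. A minimal sketch (ours), where the padding convention for short stacks is our choice:

```python
def history(stack, w, k=3):
    """Map a stack to the word forms of its k topmost symbols; positions
    below the stack bottom are padded with '¢' (the padding is our choice)."""
    forms = ['¢' if s == '¢' else w[s] for s in stack[-k:]]
    return tuple(['¢'] * (k - len(forms)) + forms)
```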

So far we have presented a general discussion of how to turn a transition-based parser into a generative probabilistic model, and have avoided further specification of the history function. We now turn our attention to the non-projective transition system of §2.2. To actually transform that system into a parametrized probabilistic model, and to develop an associated efficient inference procedure as well, we need to balance the amount of information we put into the history function against the computational complexity which is required for inference.

We start the discussion with a naïve model using a history function defined by a fixed-size window over the topmost portion of the stack. More precisely, each transition is conditioned on the lexical forms of the three symbols at the top of the stack σ, indicated as b_3, b_2, b_1 ∈ Σ below, with b_1 referring to the topmost symbol. The parameters of the model are defined as follows.

$$p(sh_b \mid b_3, b_2, b_1) = \theta^{sh_b}_{b_3, b_2, b_1}, \quad \forall b \in \Sigma,$$
$$p(la_1 \mid b_3, b_2, b_1) = \theta^{la_1}_{b_3, b_2, b_1},$$
$$p(ra_1 \mid b_3, b_2, b_1) = \theta^{ra_1}_{b_3, b_2, b_1},$$
$$p(la_2 \mid b_3, b_2, b_1) = \theta^{la_2}_{b_3, b_2, b_1},$$
$$p(ra_2 \mid b_3, b_2, b_1) = \theta^{ra_2}_{b_3, b_2, b_1}.$$

The parameters above are subject to the following normalization conditions, for every choice of b_3, b_2, b_1 ∈ Σ:

$$\theta^{la_1}_{b_3, b_2, b_1} + \theta^{ra_1}_{b_3, b_2, b_1} + \theta^{la_2}_{b_3, b_2, b_1} + \theta^{ra_2}_{b_3, b_2, b_1} + \sum_{b \in \Sigma} \theta^{sh_b}_{b_3, b_2, b_1} = 1.$$

This naïve model presents two practical problems. The first problem relates to the efficiency of inference: the algorithm has quite a high computational complexity, as will be discussed in §5. A second problem arises in the probabilistic setting. Using this model would require estimating many parameters which are based on trigrams. This leads to higher sample complexity: we would need more samples to accurately estimate the model and avoid sparse counts.

We therefore consider a more elaborate model, which tackles both of the above problems. Again, let b_3, b_2, b_1 ∈ Σ indicate the lexical forms of the three symbols at the top of the stack. We define the distributions p(t | σ) as follows:

$$p(sh_b \mid b_1) = \theta^{sh_b}_{b_1}, \quad \forall b \in \Sigma,$$
$$p(la_1 \mid b_2, b_1) = \theta^{rd}_{b_1} \cdot \theta^{la_1}_{b_2, b_1},$$
$$p(ra_1 \mid b_2, b_1) = \theta^{rd}_{b_1} \cdot \theta^{ra_1}_{b_2, b_1},$$
$$p(la_2 \mid b_3, b_2, b_1) = \theta^{rd}_{b_1} \cdot \theta^{rd_2}_{b_2, b_1} \cdot \theta^{la_2}_{b_3, b_2, b_1},$$
$$p(ra_2 \mid b_3, b_2, b_1) = \theta^{rd}_{b_1} \cdot \theta^{rd_2}_{b_2, b_1} \cdot \theta^{ra_2}_{b_3, b_2, b_1}.$$

The parameters above are subject to the following normalization conditions, for every b_3, b_2, b_1 ∈ Σ:

$$\sum_{b \in \Sigma} \theta^{sh_b}_{b_1} + \theta^{rd}_{b_1} = 1, \qquad (2)$$
$$\theta^{la_1}_{b_2, b_1} + \theta^{ra_1}_{b_2, b_1} + \theta^{rd_2}_{b_2, b_1} = 1, \qquad (3)$$
$$\theta^{la_2}_{b_3, b_2, b_1} + \theta^{ra_2}_{b_3, b_2, b_1} = 1. \qquad (4)$$

Intuitively, parameter θ^{rd}_{b_1} denotes the probability that we perform a reduce transition instead of a shift transition, given that we have seen lexical form b_1 at the top of the stack. Similarly, parameter θ^{rd_2}_{b_2, b_1} denotes the probability that we perform a reduce transition of degree 2 (see §2.2) instead of a reduce transition of degree 1, given that we have seen lexical forms b_1 and b_2 at the top of the stack.
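The sketch below (ours) assembles p(t | σ) from the factored parameters exactly as in the equations above; the theta tables are hypothetical stand-ins for estimated parameters and are assumed to satisfy the normalization conditions (2)–(4).

```python
# Hypothetical parameter tables standing in for estimated values:
#   theta_sh[b1][b]           = theta^{sh_b}_{b1}
#   theta_rd[b1]              = theta^{rd}_{b1}
#   theta_r1[(b2, b1)][t]     for t in {'la1', 'ra1', 'rd2'}
#   theta_r2[(b3, b2, b1)][t] for t in {'la2', 'ra2'}

def transition_prob(t, b3, b2, b1, theta_sh, theta_rd, theta_r1, theta_r2):
    """p(t | sigma) under the factored model, given the word forms
    b3, b2, b1 of the three topmost stack symbols (b1 on top)."""
    if t.startswith('sh_'):
        b = t[len('sh_'):]
        return theta_sh[b1][b]                        # p(sh_b | b1)
    if t in ('la1', 'ra1'):                           # degree-1 reduce
        return theta_rd[b1] * theta_r1[(b2, b1)][t]
    if t in ('la2', 'ra2'):                           # degree-2 reduce
        return (theta_rd[b1] * theta_r1[(b2, b1)]['rd2']
                * theta_r2[(b3, b2, b1)][t])
    raise ValueError('unknown transition: %s' % t)
```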

We observe that the above model has |Σ| + 4·|Σ|² + 2·|Σ|³ parameters (not all independent). This should be contrasted with the naïve model, which has 4·|Σ|³ + |Σ|⁴ parameters.
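As a quick arithmetic check (ours), for a vocabulary of size |Σ| = 100 the two counts work out as follows:

```python
# Parameter counts above, for an example vocabulary size |Sigma| = 100:
V = 100
factored = V + 4 * V**2 + 2 * V**3   # ->   2,040,100 parameters
naive    = 4 * V**3 + V**4           # -> 104,000,000 parameters
```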

4 Tabular parsing

We present here a dynamic programming algorithm for simulating the computations of the system from §2–3. Given an input string w, our algorithm produces a compact representation of the set Γ(w), defined as the set of all possible computations of the model when processing w. In combination with the appropriate semirings, this method can provide for instance the highest probability computation in Γ(w), or else the probability of w, defined as the sum of the probabilities of all computations in Γ(w).

We follow a standard approach in the literature on dynamic programming simulation of stack-based automata (Lang, 1974; Tomita, 1986; Billot and Lang, 1989). More recently, this approach has also been applied by Huang and Sagae (2010) and by Kuhlmann et al. (2011) to the simulation of projective transition-based parsers. The basic idea in this approach is to decompose computations of the parser into smaller parts, group them into equivalence classes, and recombine them to obtain larger parts of computations.

[Figure 2: Schematic representation of the computations γ associated with item [h_1, i, h_2 h_3, j].]

Let w = a_0 ⋯ a_{n−1}, V_w and S^{(np)} be defined as in §2. We use a structure called an item, defined as

$$[h_1, i, h_2 h_3, j],$$

where 0 ≤ i < j ≤ n and h_1, h_2, h_3 ∈ V_w must satisfy h_1 < i and i ≤ h_2 < h_3 < j. The intended interpretation of an item can be stated as follows; see also Figure 2.

• There exists a computation γ of S^{(np)} on w having the form c_0, . . . , c_m, m ≥ 1, with c_0 = (σ|h_1, β_i, A) and c_m = (σ|h_2|h_3, β_j, A′) for some stack σ and some arc sets A and A′;
• For each i with 1 ≤ i < m, the stack σ_i associated with configuration c_i has the list σ at the bottom and satisfies |σ_i| ≥ |σ| + 2.

Some comments on the above conditions are in order here. Let t_1, . . . , t_m be the sequence of transitions in T^{(np)} associated with computation γ. Then we have t_1 = sh_{a_i}, since |σ_1| ≥ |σ| + 2. Thus we conclude that |σ_1| = |σ| + 2.

The most important consequence of the definition of an item is that each transition t_i with 2 ≤ i ≤ m does not depend on the content of the σ portion of the stack σ_i. To see this, consider transition c_{i−1} ⊢_{t_i} c_i. If t_i = sh_{a_i}, the content of σ is irrelevant at this step, since in our model sh_{a_i} is conditioned only on the topmost stack symbol of σ_{i−1}, and we have |σ_{i−1}| ≥ |σ| + 2.

Consider now the case of t_i = la_2. From |σ_i| ≥ |σ| + 2 we have that |σ_{i−1}| ≥ |σ| + 3. Again, the content of σ is irrelevant at this step, since in our model la_2 is conditioned only on the three topmost stack symbols of σ_{i−1}. A similar argument applies to the cases of t_i ∈ {ra_2, la_1, ra_1}.

From the above, we conclude that if we apply the transitions t_1, . . . , t_m to stacks of the form σ|h_1, the resulting computations all have identical probabilities, independently of the choice of σ.

Each computation satisfying the two conditions above will be called an I-computation associated with item [h_1, i, h_2 h_3, j]. Notice that an I-computation has the overall effect of replacing node h_1 sitting above a stack σ with nodes h_2 and h_3. This is the key property in the development of our algorithm below.

We specify our dynamic programming algorithm as a deduction system (Shieber et al., 1995). The deduction system starts with the axiom [¢, 0, ¢0, 1], corresponding to an initial stack [¢] and to the shift of a_0 = $ from the buffer onto the stack. The set Γ(w) is non-empty if and only if item [¢, 0, ¢0, n] can be derived using the inference rules specified below. Each inference rule is annotated with the type of transition it simulates, along with the arc constructed by the transition itself, if any.

$$\frac{[h_1, i, h_2 h_3, j]}{[h_3, j, h_3 j, j+1]} \; (sh_{a_j})$$

$$\frac{[h_1, i, h_2 h_3, k] \quad [h_3, k, h_4 h_5, j]}{[h_1, i, h_2 h_5, j]} \; (la_1;\ h_5 \to h_4)$$

$$\frac{[h_1, i, h_2 h_3, k] \quad [h_3, k, h_4 h_5, j]}{[h_1, i, h_2 h_4, j]} \; (ra_1;\ h_4 \to h_5)$$

$$\frac{[h_1, i, h_2 h_3, k] \quad [h_3, k, h_4 h_5, j]}{[h_1, i, h_4 h_5, j]} \; (la_2;\ h_5 \to h_2)$$

$$\frac{[h_1, i, h_2 h_3, k] \quad [h_3, k, h_4 h_5, j]}{[h_1, i, h_2 h_4, j]} \; (ra_2;\ h_2 \to h_5)$$

The above deduction system infers items in a bottom-up fashion. This means that longer computations over substrings of w are built by combining shorter ones. In particular, the inference rule sh_{a_j} asserts the existence of I-computations consisting of a single sh_{a_j} transition. Such computations are represented by the consequent item [h_3, j, h_3 j, j + 1], indicating that the index of the shifted word a_j is added to the stack by pushing it on top of h_3.

The remaining four rules implement the reduce transitions of the model. We have already observed in §2.2 that all available reduce transitions shorten the size of the stack by one unit. This allows us to combine pairs of I-computations with a reduce transition, resulting in a computation that is again an I-computation. More precisely, if we concatenate an I-computation asserted by an item [h_1, i, h_2 h_3, k] with an I-computation asserted by an item [h_3, k, h_4 h_5, j], we obtain a computation that has the overall effect of increasing the size of the stack by 2, replacing the topmost stack element h_1 with stack elements h_2, h_4 and h_5. If we now apply any of the reduce transitions from the inventory of the model, we will remove one of these three nodes from the stack, and the overall result will again be an I-computation, which can then be asserted by a certain item. For example, if we apply the reduce transition la_1, the consequent item is [h_1, i, h_2 h_5, j], since an la_1 transition removes the second-topmost element from the stack (h_4). The other reduce transitions remove a different element, and thus their rules produce different consequent items.
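As an aside, the following sketch (ours) runs the deduction system as an agenda-driven recognizer, with one consequent per reduce rule; the word-form conditioning and semiring weights are omitted so that the item flow stays visible. Note that ra1 and ra2 share the consequent item and differ only in the arc they record.

```python
def recognize(n):
    """Agenda-driven sketch of the deduction system for a string of length n.
    Items are (h1, i, (h2, h3), j); only derivability is tracked (boolean
    recognition), leaving out word forms, probabilities and semiring values."""
    axiom = ('¢', 0, ('¢', 0), 1)
    chart, agenda = {axiom}, [axiom]
    by_start, by_end = {}, {}  # index items for the two-antecedent rules

    def derive(item):
        if item not in chart:
            chart.add(item)
            agenda.append(item)

    def reduce_consequents(left, right):
        # left = [h1, i, h2h3, k] and right = [h3, k, h4h5, j]: concatenating
        # the two I-computations leaves h2, h4, h5 on top of the stack, and
        # each reduce transition removes one of these three nodes.
        h1, i, (h2, _h3), _k = left
        _, _, (h4, h5), j = right
        yield (h1, i, (h2, h5), j)      # la1; arc h5 -> h4
        yield (h1, i, (h2, h4), j)      # ra1; arc h4 -> h5
        if h2 != '¢':                   # degree-2 transitions involve h2
            yield (h1, i, (h4, h5), j)  # la2; arc h5 -> h2
            yield (h1, i, (h2, h4), j)  # ra2; arc h2 -> h5

    while agenda:
        item = agenda.pop()
        h1, i, (h2, h3), j = item
        if j < n:  # shift rule sh_{a_j}: push j on top of h3
            derive((h3, j, (h3, j), j + 1))
        by_end.setdefault((h3, j), []).append(item)    # usable as left antecedent
        by_start.setdefault((h1, i), []).append(item)  # usable as right antecedent
        for right in by_start.get((h3, j), []):        # item as left antecedent
            for c in reduce_consequents(item, right):
                derive(c)
        for left in by_end.get((h1, i), []):           # item as right antecedent
            for c in reduce_consequents(left, item):
                derive(c)

    return ('¢', 0, ('¢', 0), n) in chart  # goal item [¢, 0, ¢0, n]
```

Without word forms every shift is licensed, so recognize(n) is trivially True for any n ≥ 1; the point of the sketch is the flow of items through the rules.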

The above argument shows the soundness of the deduction system, i.e., an item I = [h_1, i, h_2 h_3, j] is only generated if there exists an I-computation γ = c_0, . . . , c_m with c_0 = (σ|h_1, β_i, A) and c_m = (σ|h_2|h_3, β_j, A′). To prove completeness, we must show the converse result, i.e., that the existence of an I-computation γ implies that item I is inferred. We first do this under the assumption that the inference rule for the shift transitions does not have an antecedent, i.e., items [h_1, j, h_1 j, j + 1] are considered as axioms. We proceed by strong induction on the length m of the computation γ.

For m = 1, γ consists of a single transition sh_{a_j}, and the corresponding item I = [h_1, j, h_1 j, j + 1] is constructed as an axiom. For m > 1, let γ be as specified above. The transition that produced c_m must have been a reduce transition, otherwise γ would not be an I-computation. Let c_k be the rightmost configuration in c_0, . . . , c_{m−1} whose stack size is |σ| + 2. Then it can be shown that the computations γ_1 = c_0, . . . , c_k and γ_2 = c_k, . . . , c_{m−1} are again I-computations. Since γ_1 and γ_2 have strictly fewer transitions than γ, by the induction hypothesis the system constructs items [h_1, i, h_2 h_3, k] and [h_3, k, h_4 h_5, j], where h_2 and h_3 are the stack elements at the top of c_k. Applying to these items the inference rule corresponding to the reduce transition at hand, we can construct item I.

When the inference rule for the shift transition has an antecedent [h_1, i, h_2 h_3, j], as indicated above, we have the overall effect that I-computations consisting of a single transition shifting a_j on top of h_3 are simulated only in case there exists a computation starting with configuration ([¢], β_0) and reaching a configuration of the form (σ|h_2|h_3, β_j). This acts as a filter on the search space of the algorithm, but does not invalidate the completeness property. However, in this case the proof is considerably more involved, and we do not report it here.

An important property of the deduction system above, which will be used in the next section, is that the system is unambiguous, that is, each I-computation is constructed by the system in a unique way. This can be seen by observing that, in the sketch of the completeness proof reported above, there is always a unique choice of c_k that decomposes I-computation γ into I-computations γ_1 and γ_2. In fact, if we choose a configuration c_{k′} other than c_k with stack size |σ| + 2, the computation γ′_2 = c_{k′}, . . . , c_{m−1} will contain c_k as an intermediate configuration, which violates the definition of I-computation because of an intervening stack having size not larger than the size of the stack associated with the initial configuration.

As a final remark, we observe that we can keep track of all inference rules that have been applied in the computation of each item by the above algorithm, by encoding each application of a rule as a reference to the pair of items that were taken as antecedents in the inference. In this way, we obtain a parse forest structure that can be viewed as a hypergraph or as a non-recursive context-free grammar, similar to the case of parsing based on context-free grammars; see for instance Klein and Manning (2001) or Nederhof (2003). Such a parse forest encodes all valid computations in Γ(w), as desired.

The algorithm runs in O(n⁸) time. Using methods similar to those specified in Eisner and Satta (1999), we can reduce the running time to O(n⁷). However, we do not further pursue this idea here, and proceed with the discussion of exact inference, found in the next section.

5 Inference

We turn next to specify exact inference with our model, for computing feature expectations. Such inference enables, for example, the derivation of an expectation-maximization algorithm for unsupervised parsing.

Here, a feature is a function over computations, providing the count of a pattern related to a parameter. For instance, we denote by f^{la_2}_{b_3, b_2, b_1}(γ) the number of occurrences of transition la_2 within γ with topmost stack symbols having word forms b_3, b_2, b_1 ∈ Σ, with b_1 associated with the topmost stack symbol.
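A sketch of such a feature function (ours), over a hypothetical encoding of a computation as a list of (transition, word forms) pairs:

```python
def f_la2(gamma, b3, b2, b1):
    """f^{la2}_{b3,b2,b1}(gamma): number of la2 transitions in gamma taken
    with word forms b3, b2, b1 on top of the stack (b1 topmost). Here gamma
    is a hypothetical encoding of a computation as (transition, forms) pairs."""
    return sum(1 for t, forms in gamma
               if t == 'la2' and forms == (b3, b2, b1))
```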

Feature expectations are computed by using an inside-outside algorithm for the items in the tabular algorithm. More specifically, given a string w, we associate each item [h_1, i, h_2 h_3, j] defined as in §4 with two quantities:

$$I([h_1, i, h_2 h_3, j]) = \sum_{\gamma = ([h_1], \beta_i), \ldots, ([h_2, h_3], \beta_j)} p(\gamma); \qquad (5)$$

$$O([h_1, i, h_2 h_3, j]) = \sum_{\substack{\sigma,\ \gamma = ([¢], \beta_0), \ldots, (\sigma|h_1, \beta_i) \\ \gamma' = (\sigma|h_2|h_3, \beta_j), \ldots, ([¢, 0], \beta_n)}} p(\gamma) \cdot p(\gamma'). \qquad (6)$$

I([h_1, i, h_2 h_3, j]) and O([h_1, i, h_2 h_3, j]) are called the inside and the outside probabilities, respectively, of item [h_1, i, h_2 h_3, j]. The tabular algorithm of §4 can be used to compute the inside probabilities. Using the gradient transformation (Eisner et al., 2005), a technique for deriving outside probabilities from a set of inference rules, we can also compute O([h_1, i, h_2 h_3, j]). The use of the gradient transformation is valid in our case because the tabular algorithm is unambiguous (see §4).

Using the inside and outside probabilities, we can now efficiently compute feature expectations for our model. Figure 3 shows how to express the expectation of feature f^{la_2}_{b_3, b_2, b_1}(γ) by means of a finite summation.

$$E_{p(\gamma|w)}[f^{la_2}_{b_3,b_2,b_1}(\gamma)] = \sum_{\gamma \in \Gamma(w)} p(\gamma \mid w) \cdot f^{la_2}_{b_3,b_2,b_1}(\gamma) = \frac{1}{p(w)} \cdot \sum_{\gamma \in \Gamma(w)} p(\gamma) \cdot f^{la_2}_{b_3,b_2,b_1}(\gamma)$$

$$= \frac{1}{p(w)} \cdot \sum_{\substack{\sigma, i, k, j, h_1, h_2, h_3, h_4, h_5 \\ \text{s.t. } a_{h_2} = b_3,\ a_{h_4} = b_2,\ a_{h_5} = b_1}} \; \sum_{\substack{\gamma_0 = ([¢], \beta_0), \ldots, (\sigma|h_1, \beta_i) \\ \gamma_1 = (\sigma|h_1, \beta_i), \ldots, (\sigma|h_2|h_3, \beta_k) \\ \gamma_2 = (\sigma|h_2|h_3, \beta_k), \ldots, (\sigma|h_2|h_4|h_5, \beta_j) \\ \gamma_3 = (\sigma|h_4|h_5, \beta_j), \ldots, ([¢, 0], \beta_n)}} p(\gamma_0) \cdot p(\gamma_1) \cdot p(\gamma_2) \cdot p(la_2 \mid b_3, b_2, b_1) \cdot p(\gamma_3)$$

$$= \frac{\theta^{rd}_{b_1} \cdot \theta^{rd_2}_{b_2,b_1} \cdot \theta^{la_2}_{b_3,b_2,b_1}}{p(w)} \cdot \sum_{\substack{\sigma, i, j, h_1, h_4, h_5 \\ \text{s.t. } a_{h_4} = b_2,\ a_{h_5} = b_1}} \; \sum_{\substack{\gamma_0 = ([¢], \beta_0), \ldots, (\sigma|h_1, \beta_i) \\ \gamma_3 = (\sigma|h_4|h_5, \beta_j), \ldots, ([¢, 0], \beta_n)}} p(\gamma_0) \cdot p(\gamma_3) \cdot \sum_{\substack{k, h_2, h_3 \\ \text{s.t. } a_{h_2} = b_3}} \; \sum_{\gamma_1 = (\sigma|h_1, \beta_i), \ldots, (\sigma|h_2|h_3, \beta_k)} p(\gamma_1) \cdot \sum_{\gamma_2 = (\sigma|h_2|h_3, \beta_k), \ldots, (\sigma|h_2|h_4|h_5, \beta_j)} p(\gamma_2)$$

Figure 3: Decomposition of the feature expectation E_{p(γ|w)}[f^{la_2}_{b_3, b_2, b_1}(γ)] into a finite summation. Quantity p(w) above is the sum of the probabilities of all computations in Γ(w).

Using Eq. 5 and 6 and the relation p(w) = I([¢, 0, ¢0, n]) we can then write:

$$E_{p(\gamma|w)}[f^{la_2}_{b_3,b_2,b_1}(\gamma)] = \frac{\theta^{rd}_{b_1} \cdot \theta^{rd_2}_{b_2,b_1} \cdot \theta^{la_2}_{b_3,b_2,b_1}}{I([¢, 0, ¢0, n])} \cdot \sum_{\substack{i, j, h_1, h_4, h_5 \\ \text{s.t. } a_{h_4} = b_2,\ a_{h_5} = b_1}} O([h_1, i, h_4 h_5, j]) \cdot \sum_{\substack{k, h_2, h_3 \\ \text{s.t. } a_{h_2} = b_3}} I([h_1, i, h_2 h_3, k]) \cdot I([h_3, k, h_4 h_5, j]).$$

Very similar expressions can be derived for the expectations of features f^{ra_2}_{b_3, b_2, b_1}(γ), f^{la_1}_{b_2, b_1}(γ), and f^{ra_1}_{b_2, b_1}(γ). As for feature f^{sh_b}_{b_1}(γ), b ∈ Σ, the above approach leads to

$$E_{p(\gamma|w)}[f^{sh_b}_{b_1}(\gamma)] = \frac{\theta^{sh_b}_{b_1}}{I([¢, 0, ¢0, n])} \cdot \sum_{\substack{i, h \\ \text{s.t. } a_h = b_1,\ a_i = b}} O([h, i, h i, i + 1]).$$

As mentioned above, these expectations can be used, for example, to derive an EM algorithm for our model. The EM algorithm in our case is not completely straightforward because of the way we parametrize the model. We now give the re-estimation steps for such an EM algorithm. We assume that all expectations below are taken with respect to the set of parameters θ from iteration s − 1 of the algorithm, and we are required to update these θ. To simplify notation, let us assume that there is only one string w in the training corpus. For each b_1 ∈ Σ, we define:

$$Z_{b_1} = \sum_{b_2 \in \Sigma} E_{p(\gamma|w)}\!\left[f^{la_1}_{b_2,b_1}(\gamma) + f^{ra_1}_{b_2,b_1}(\gamma)\right] + \sum_{b_3, b_2 \in \Sigma} E_{p(\gamma|w)}\!\left[f^{la_2}_{b_3,b_2,b_1}(\gamma) + f^{ra_2}_{b_3,b_2,b_1}(\gamma)\right];$$

$$Z_{b_2,b_1} = \sum_{b_3 \in \Sigma} E_{p(\gamma|w)}\!\left[f^{la_2}_{b_3,b_2,b_1}(\gamma) + f^{ra_2}_{b_3,b_2,b_1}(\gamma)\right].$$

We then have, for every b ∈ Σ:

$$\theta^{sh_b}_{b_1}(s) \leftarrow \frac{E_{p(\gamma|w)}[f^{sh_b}_{b_1}(\gamma)]}{Z_{b_1} + \sum_{b' \in \Sigma} E_{p(\gamma|w)}[f^{sh_{b'}}_{b_1}(\gamma)]}.$$

Furthermore, we have:

$$\theta^{la_1}_{b_2,b_1}(s) \leftarrow \frac{E_{p(\gamma|w)}[f^{la_1}_{b_2,b_1}(\gamma)]}{Z_{b_2,b_1} + E_{p(\gamma|w)}\!\left[f^{la_1}_{b_2,b_1}(\gamma) + f^{ra_1}_{b_2,b_1}(\gamma)\right]},$$

and:

$$\theta^{la_2}_{b_3,b_2,b_1}(s) \leftarrow \frac{E_{p(\gamma|w)}[f^{la_2}_{b_3,b_2,b_1}(\gamma)]}{E_{p(\gamma|w)}\!\left[f^{la_2}_{b_3,b_2,b_1}(\gamma) + f^{ra_2}_{b_3,b_2,b_1}(\gamma)\right]}.$$

The rest of the parameter updates can easily be derived from the above updates by using the sum-to-1 constraints in Eq. 2–4.
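The sketch below (ours) packages these re-estimation steps, deriving θ^{rd} and θ^{rd_2} from the sum-to-1 constraints; E is a hypothetical table of expected feature counts computed with the inside-outside quantities above.

```python
def em_update(E, sigma):
    """One M-step for the factored model (a sketch under the updates above).
    E is a hypothetical table of expected feature counts, with keys such as
    ('sh', b, b1), ('la1', b2, b1), ('la2', b3, b2, b1); sigma is the
    alphabet. Zero denominators are simply skipped (no smoothing)."""
    theta_sh, theta_rd, theta_r1, theta_r2 = {}, {}, {}, {}
    e = lambda *key: E.get(key, 0.0)
    for b1 in sigma:
        # Z_{b1}: expected number of reduce transitions with b1 on top
        z1 = sum(e('la1', b2, b1) + e('ra1', b2, b1) for b2 in sigma)
        z1 += sum(e('la2', b3, b2, b1) + e('ra2', b3, b2, b1)
                  for b3 in sigma for b2 in sigma)
        denom = z1 + sum(e('sh', b, b1) for b in sigma)
        if denom > 0:
            theta_sh[b1] = {b: e('sh', b, b1) / denom for b in sigma}
            theta_rd[b1] = z1 / denom          # from Eq. 2 (sum-to-1)
        for b2 in sigma:
            # Z_{b2,b1}: expected degree-2 reduces with b2, b1 on top
            z2 = sum(e('la2', b3, b2, b1) + e('ra2', b3, b2, b1)
                     for b3 in sigma)
            denom = z2 + e('la1', b2, b1) + e('ra1', b2, b1)
            if denom > 0:
                theta_r1[(b2, b1)] = {'la1': e('la1', b2, b1) / denom,
                                      'ra1': e('ra1', b2, b1) / denom,
                                      'rd2': z2 / denom}  # from Eq. 3
            for b3 in sigma:
                d2 = e('la2', b3, b2, b1) + e('ra2', b3, b2, b1)
                if d2 > 0:
                    theta_r2[(b3, b2, b1)] = {
                        'la2': e('la2', b3, b2, b1) / d2,
                        'ra2': e('ra2', b3, b2, b1) / d2}  # from Eq. 4
    return theta_sh, theta_rd, theta_r1, theta_r2
```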

6 Discussion

We note that our model inherits spurious ambiguity from Attardi's model. More specifically, we can have different derivations, corresponding to different system computations, that result in identical dependency graphs and strings. While running our tabular algorithm with the Viterbi semiring efficiently computes the highest probability computation in Γ(w), spurious ambiguity means that finding the highest probability dependency tree is NP-hard. This latter result can be shown using proof techniques similar to those developed by Sima'an (1996). We leave the elimination of spurious ambiguity from the model for future work.

While in the previous sections we have described a tabular method for the transition system of Attardi (2006) restricted to transitions of degree up to two, it is possible to generalize the model to include higher-degree transitions. In the general formulation of the Attardi parser, transitions of degree d create links involving nodes located d positions beneath the topmost position in the stack:

la_d : (σ|i_1|i_2| … |i_{d+1}, β, A) ⊢ (σ|i_2| … |i_{d+1}, β, A ∪ {i_{d+1} → i_1});
ra_d : (σ|i_1|i_2| … |i_{d+1}, β, A) ⊢ (σ|i_1|i_2| … |i_d, β, A ∪ {i_1 → i_{d+1}}).

To define a transition system that supports transitions up to degree D, we use a set of items of the form [s_1 … s_{D−1}, i, e_1 … e_D, j], corresponding (in the sense of §4) to computations of the form c_0, . . . , c_m, m ≥ 1, with c_0 = (σ|s_1| … |s_{D−1}, β_i, A) and c_m = (σ|e_1| … |e_D, β_j, A′). The deduction steps corresponding to reduce transitions in this general system have the general form

$$\frac{[s_1 \ldots s_{D-1}, i, e_1 m_1 \ldots m_{D-1}, j] \quad [m_1 \ldots m_{D-1}, j, e_2 \ldots e_{D+1}, w]}{[s_1 \ldots s_{D-1}, i, e_1 \ldots e_{c-1} e_{c+1} \ldots e_{D+1}, w]} \; (e_p \to e_c)$$

where the values of p and c differ for each transition: to obtain the inference rule corresponding to a la_d transition, we set p = D + 1 and c = D + 1 − d; and to obtain the rule for a ra_d transition, we set p = D + 1 − d and c = D + 1. Note that the parser runs in time O(n^{3D+2}), where D stands for the maximum transition degree, so each unit increase in the transition degree adds a cubic factor to the parser's polynomial time complexity. This is in contrast to a previous tabular formulation of the Attardi parser by Gómez-Rodríguez et al. (2011), which ran in exponential time.
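Extending the earlier transition sketch, the degree-d transitions can be written generically (again our illustration of the definitions above; for d = 1 and d = 2 they coincide with la1/ra1 and la2/ra2):

```python
def la_d(config, d):
    """la_d: arc i_{d+1} -> i_1, where i_1 sits d positions below the
    topmost node i_{d+1}; pop the lower node i_1."""
    stack, buffer, arcs = config
    if len(stack) < d + 1 or '¢' in stack[-(d + 1):]:
        return None
    i1, top = stack[-(d + 1)], stack[-1]
    return (stack[:-(d + 1)] + stack[-d:], buffer, arcs | {(top, i1)})

def ra_d(config, d):
    """ra_d: arc i_1 -> i_{d+1}; pop the topmost node i_{d+1}."""
    stack, buffer, arcs = config
    if len(stack) < d + 1 or '¢' in stack[-(d + 1):]:
        return None
    i1, top = stack[-(d + 1)], stack[-1]
    return (stack[:-1], buffer, arcs | {(i1, top)})
```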

The model for the transition system we give in this paper is generative. It is not hard to naturally extend this model to the discriminative setting. In this case, we would condition the model on the input string to get a conditional distribution over derivations. It is perhaps more natural in this setting to use arbitrary weights for the parameter values, since the computation of a normalization constant (the probability of a string) is required in any case. Arbitrary weights in the generative setting could be more problematic, because they would require computing a normalization constant corresponding to a sum over all strings and derivations.

7 Conclusion

We presented in this paper a generative probabilistic model for non-projective parsing, together with the description of an efficient tabular algorithm for parsing and doing statistical inference with the model.

Acknowledgments

The authors thank Marco Kuhlmann for helpful comments on an early draft of the paper. The authors also thank Giuseppe Attardi for help with extracting the parsing statistics. The second author has been partially supported by Ministerio de Ciencia e Innovación and FEDER (TIN2010-18552-C03-02).

References

Giuseppe Attardi. 2006. Experiments with a multilanguage non-projective dependency parser. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL), pages 166–170, New York, USA.

Sylvie Billot and Bernard Lang. 1989. The structure of shared forests in ambiguous parsing. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL), pages 143–151, Vancouver, Canada.

Manuel Bodirsky, Marco Kuhlmann, and Mathias Möhl. 2005. Well-nested drawings as models of syntactic structure. In Tenth Conference on Formal Grammar and Ninth Meeting on Mathematics of Language, pages 195–203, Edinburgh, UK.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL), pages 149–164, New York, USA.

Jason Eisner and Giorgio Satta. 1999. Efficient parsing for bilexical context-free grammars and Head Automaton Grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL), pages 457–464, College Park, MD, USA.

Jason Eisner, Eric Goldlust, and Noah A. Smith. 2005. Compiling Comp Ling: Practical weighted dynamic programming and the Dyna language. In Proceedings of HLT-EMNLP, pages 281–290.

Carlos Gómez-Rodríguez and Joakim Nivre. 2010. A transition-based parser for 2-planar dependency structures. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1492–1501, Uppsala, Sweden.

Carlos Gómez-Rodríguez, David J. Weir, and John Carroll. 2009. Parsing mildly non-projective dependency structures. In Twelfth Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 291–299, Athens, Greece.

Carlos Gómez-Rodríguez, John Carroll, and David Weir. 2011. Dependency parsing schemata and mildly non-projective dependency parsing. Computational Linguistics (in press), 37(3).

Joshua Goodman. 1999. Semiring parsing. Computational Linguistics, 25(4):573–605.

John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. 2000. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, 2nd edition.

Liang Huang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1077–1086, Uppsala, Sweden.

Sylvain Kahane, Alexis Nasr, and Owen Rambow. 1998. Pseudo-projectivity: A polynomially parsable non-projective dependency grammar. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING-ACL), pages 646–652, Montreal, Canada.

Dan Klein and Christopher D. Manning. 2001. Parsing and hypergraphs. In Proceedings of IWPT, pages 123–134.

Terry Koo, Amir Globerson, Xavier Carreras, and Michael Collins. 2007. Structured prediction models via the matrix-tree theorem. In Proceedings of EMNLP-CoNLL, pages 141–150.

Marco Kuhlmann and Giorgio Satta. 2009. Treebank grammar techniques for non-projective dependency parsing. In Twelfth Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 478–486, Athens, Greece.

Marco Kuhlmann, Carlos Gómez-Rodríguez, and Giorgio Satta. 2011. Dynamic programming algorithms for transition-based dependency parsers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), Portland, Oregon, USA.

Bernard Lang. 1974. Deterministic techniques for efficient non-deterministic parsers. In Jacques Loeckx, editor, Automata, Languages and Programming, 2nd Colloquium, University of Saarbrücken, July 29–August 2, 1974, number 14 in Lecture Notes in Computer Science, pages 255–269. Springer.

Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 40–51, Singapore.

Ryan McDonald and Giorgio Satta. 2007. On the complexity of non-projective data-driven dependency parsing. In Tenth International Conference on Parsing Technologies (IWPT), pages 121–132, Prague, Czech Republic.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. In Human Language Technology Conference (HLT) and Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–530, Vancouver, Canada.

Mark-Jan Nederhof. 2003. Weighted deductive parsing and Knuth's algorithm. Computational Linguistics, 29(1):135–143.

Joakim Nivre and Jens Nilsson. 2005. Pseudo-projective dependency parsing. In 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 99–106, Ann Arbor, USA.

Joakim Nivre. 2004. Incrementality in deterministic dependency parsing. In Workshop on Incremental Parsing: Bringing Engineering and Cognition Together, pages 50–57, Barcelona, Spain.

Joakim Nivre. 2008. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553.

Joakim Nivre. 2009. Non-projective dependency parsing in expected linear time. In Proceedings of the 47th Annual Meeting of the ACL and the Fourth International Joint Conference on Natural Language Processing of the AFNLP, pages 351–359, Singapore.

Stuart M. Shieber, Yves Schabes, and Fernando Pereira. 1995. Principles and implementation of deductive parsing. Journal of Logic Programming, 24(1–2):3–36.

Khalil Sima'an. 1996. Computational complexity of probabilistic disambiguation by means of tree-grammars. In Proceedings of COLING, pages 1175–1180.

David A. Smith and Noah A. Smith. 2007. Probabilistic models of nonprojective dependency trees. In Proceedings of EMNLP-CoNLL, pages 132–140.

Ivan Titov, James Henderson, Paola Merlo, and Gabriele Musillo. 2009. Online graph planarisation for synchronous parsing of semantic and syntactic dependencies. In Proceedings of IJCAI, pages 281–290.

Masaru Tomita. 1986. Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems. Springer.

Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the Eighth International Workshop on Parsing Technologies (IWPT), pages 195–206, Nancy, France.