Transition-based Dependency Parsing with Selectional Branching


Description

We present a novel approach, called selectional branching, which uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to those of greedy transition-based dependency parsing. Selectional branching is guaranteed to perform fewer transitions than beam search while parsing as accurately. We also present a new transition-based dependency parsing algorithm that gives a complexity of O(n) for projective parsing and an expected linear-time speed for non-projective parsing. With the standard setup, our parser shows an unlabeled attachment score of 92.96% and a parsing speed of 9 milliseconds per sentence, which is faster and more accurate than the current state-of-the-art transition-based parser that uses beam search.

Transcript of Transition-based Dependency Parsing with Selectional Branching

Transition-based Dependency Parsing With Selectional Branching

Jinho D. Choi and Andrew McCallum
Department of Computer Science, University of Massachusetts Amherst

Greedy vs. non-greedy dependency parsing

• Speed vs. accuracy
  : Greedy parsing is faster (about 10-35 times faster).
  : Non-greedy parsing is more accurate (about 1-2% more accurate).
• Non-greedy parsing approaches
  : Transition-based parsing using beam search.
  : Other approaches: graph-based parsing, linear programming, dual decomposition.

How many beams do we need during parsing?

• Intuition: simpler sentences need fewer beams than more complex sentences to generate the most accurate parse output.
• Rule of thumb: a greedy parser performs as accurately as a non-greedy parser using beam search (beam size = 80) about 64% of the time.
• Motivation: the average parsing speed can be improved without compromising the overall parsing accuracy by allocating different beam sizes to different sentences.
• Challenge: how can we determine the appropriate beam size for a given sentence?

Introduction

Branching strategy

• s_ij represents a parsing state, where i is the index of the current transition sequence and j is the index of the current parsing state; p_kj represents the k'th best prediction given s_1j.

1. The one-best sequence T_1 = [s_11, …, s_1t] is generated by a greedy parser.

2. While generating T_1, the parser adds tuples (s_1j, p_2j), …, (s_1j, p_kj) to a list λ for each low-confidence prediction p_1j given s_1j (in our case, k = 2).

3. New transition sequences are then generated using the b highest-scoring predictions in λ, where b is the beam size.

4. The same greedy parser is used to generate these new sequences, although it now starts with s_1j instead of an initial parsing state, applies p_kj to s_1j, and performs further transitions.

5. Finally, a parse tree is built from the sequence with the highest score, where the score of T_i is the average score of all predictions that lead to generating the sequence.
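The five steps above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' ClearNLP implementation: `predict` and the tuple-based states are our own simplifications, and for brevity each branch's average score covers only the branch's own predictions rather than the full shared prefix of T_1.

```python
import heapq

def selectional_branching(predict, b, margin):
    """Toy sketch of selectional branching. A state is a tuple of actions;
    predict(state) returns scored candidates, best first ([] if terminal)."""
    lam = []                                   # λ: branch points found on T_1
    seq, total, n = (), 0.0, 0
    while True:
        cands = predict(seq)
        if not cands:
            break
        s1, a1 = cands[0]
        # Low confidence: the runner-up is within `margin` of the best (k = 2).
        if len(cands) > 1 and s1 - cands[1][0] < margin:
            lam.append((cands[1][0], seq, cands[1][1]))   # (score, s_1j, p_2j)
        total += s1
        n += 1
        seq += (a1,)
    best = (total / n, seq)                    # the one-best sequence T_1
    # Branch from the b - 1 highest-scoring entries in λ (so d <= b sequences),
    # then let the same greedy parser finish each branch.
    for s0, state, alt in heapq.nlargest(b - 1, lam):
        st, tot, m = state + (alt,), s0, 1
        while True:
            cands = predict(st)
            if not cands:
                break
            s, a = cands[0]
            tot += s
            m += 1
            st += (a,)
        best = max(best, (tot / m, st))        # keep the highest average score
    return best[1]

# Usage on a tiny hand-built model: the greedy path A->x averages 0.70, but
# branching at the low-confidence first step finds B->y (average 0.92).
def toy_predict(state):
    table = {(): [(0.90, "A"), (0.85, "B")],
             ("A",): [(0.50, "x")],
             ("B",): [(0.99, "y")]}
    return table.get(state, [])
```

With b = 1 the branch loop never runs, so the function degenerates to plain greedy parsing, which is exactly the intended behavior of selectional branching when no beam is allocated.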

Comparison to beam search

• t is the maximum number of parsing states generated by any transition sequence.

Finding low confidence predictions

• The best prediction is of low confidence if there exists another prediction whose margin (score difference) to the best prediction is less than a threshold m, i.e., |C_k(x, m)| > 1.

• The optimal beam size and margin threshold are found during development using grid search.
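The membership test |C_k(x, m)| > 1 is easy to state in code. The small helper below (the function name and the example scores are ours, for illustration) returns C_k(x, m) for a dict of prediction scores:

```python
def low_confidence_candidates(scores, k, m):
    """C_k(x, m): the up-to-k highest-scoring predictions whose margin to
    the best prediction is less than m. The best prediction is always a
    member, so the state is low-confidence iff the set has > 1 element."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_score = ranked[0][1]
    return [label for label, s in ranked[:k] if best_score - s < m]

# With a tight margin only SHIFT survives (confident -> no branching);
# with a loose margin LEFT-ARC also qualifies (low confidence -> branch).
scores = {"SHIFT": 0.91, "LEFT-ARC": 0.80, "RIGHT-ARC": 0.40}
```

Since both k and m are fixed hyperparameters tuned on the development set, this check adds only a constant amount of work per parsing state.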

Selectional Branching

Projective parsing experiments

• The standard setup on WSJ (Yamada and Matsumoto's head rules, Nivre's labeling rules).
• Our speeds (seconds per sentence) are measured on an Intel Xeon 2.57GHz machine.
• POS tags are generated by the ClearNLP POS tagger (97.5% accuracy on WSJ-23).
• b_t: beam size used during training; b_d: beam size used during decoding.

Non-projective parsing experiments

• Danish, Dutch, Slovene, and Swedish data from the CoNLL-X shared task.
• Nivre’06: pseudo-projective parsing; McDonald’06: 2nd-order graph-based parsing.
• Nivre’09: swap transition; N&M’08: ensemble of Nivre’06 and McDonald’06.
• Fernández-González & Gómez-Rodríguez’12: buffer transition; Martins’10: linear programming.

Experiments

Approach            Danish         Dutch          Slovene        Swedish
                    LAS    UAS     LAS    UAS     LAS    UAS     LAS    UAS
Nivre’06            84.77  89.80   78.59  81.35   70.30  78.72   84.58  89.50
McDonald’06         84.79  90.58   79.19  83.57   73.44  83.17   82.55  88.93
Nivre’09            84.20  –       –      –       75.20  –       –      –
F&G’12              85.17  90.10   –      –       –      –       83.55  89.30
N&M’08              86.67  –       81.63  –       75.94  –       84.66  –
Martins’10          –      91.50   –      84.91   –      85.53   –      89.80
b_t = 80, b_d = 80  87.27  91.36   82.45  85.33   77.46  84.65   86.80  91.36
b_t = 80, b_d = 1   86.75  91.04   80.75  83.59   75.66  83.29   86.32  91.12

• Our non-projective parsing algorithm shows an expected linear-time parsing speed and gives state-of-the-art parsing accuracy compared to other non-projective parsing approaches.
• Our selectional branching uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to greedy parsing.
• Our parser is publicly available under the open source project ClearNLP (clearnlp.com).

Conclusion

• We gratefully acknowledge a grant from the Defense Advanced Research Projects Agency under the DEFT project, solicitation #: DARPA-BAA-12-47.

Acknowledgments

Algorithm

• A hybrid of Nivre's arc-eager and list-based algorithms (Nivre, 2003; Nivre, 2008).
• When the training data contains only projective trees, it learns only projective transitions → gives a parsing complexity of O(n) for projective parsing.
• When the training data contains both projective and non-projective trees, it learns both kinds of transitions → gives an expected linear-time parsing speed.
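For intuition on the O(n) bound, here is a generic arc-eager-style loop (a textbook sketch, not the authors' hybrid system; the scripted `oracle` stands in for the real classifier). Each word is shifted exactly once and popped at most once, so the number of transitions is linear in sentence length:

```python
def parse_arc_eager(n_words, oracle):
    """Minimal arc-eager-style transition loop over word indices 1..n_words,
    with 0 as the artificial root. Returns (head, dependent) arcs."""
    stack, buffer, arcs = [0], list(range(1, n_words + 1)), []
    while buffer:
        action = oracle(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":      # front of buffer governs top of stack
            arcs.append((buffer[0], stack.pop()))
        elif action == "RIGHT-ARC":     # top of stack governs front of buffer
            arcs.append((stack[-1], buffer[0]))
            stack.append(buffer.pop(0))
        elif action == "REDUCE":
            stack.pop()
    return arcs

# Usage: replay a gold transition sequence for a 2-word sentence
# ("news hit": root -> hit, hit -> news).
acts = iter(["SHIFT", "LEFT-ARC", "RIGHT-ARC"])
arcs = parse_arc_eager(2, lambda stack, buf: next(acts))
```

Handling non-projective trees requires additional list-based transitions, which is why the hybrid algorithm's non-projective speed is linear only in expectation rather than in the worst case.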

Transitions

Hybrid Dependency Parsing

IESL

Approach               UAS    LAS    Speed  Note
Zhang & Clark (2008)   92.10  –      –      Beam search
Huang & Sagae (2010)   92.10  –      0.04   + Dynamic programming
Zhang & Nivre (2011)   92.90  91.80  0.03   + Rich non-local features
Bohnet & Nivre (2012)  93.38  92.44  0.40   + Joint inference
b_t = 80, b_d = 80     92.96  91.93  0.009  Using beam sizes of 16 or above during decoding gave almost the same results.
b_t = 80, b_d = 64     92.96  91.93  0.009
b_t = 80, b_d = 32     92.96  91.94  0.009
b_t = 80, b_d = 16     92.96  91.94  0.008
b_t = 80, b_d = 1      92.26  91.25  0.002  Training with a higher beam size improved greedy parsing.
b_t = 1, b_d = 1       92.06  91.05  0.002

• The number of transitions performed during decoding with respect to sentence length for Dutch.

• Dutch has the highest proportion of non-projective trees among the CoNLL-X languages (5.4% of arcs, 36.4% of trees).

[Figure: # of transitions performed vs. sentence length for Dutch; legend: Proj., Non-Proj.]

[Branching diagram annotations: the 2nd parsing state in the 1st transition sequence; the 2nd-best prediction given s_11]

                                Beam search   Selectional branching
Max. # of transition sequences  b             d = min(b, |λ| + 1)
Max. # of parsing states        b · t         d · t − d(d − 1)/2
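The two bounds can be compared directly. The helper below (function names ours) evaluates both formulas, using d = min(b, |λ| + 1):

```python
def beam_search_states(b, t):
    # Beam search keeps b sequences alive for all t transitions.
    return b * t

def selectional_branching_states(b, t, lam_size):
    # Only d = min(b, |λ| + 1) sequences are generated, and branch i reuses
    # the first i states of T_1, saving d(d - 1)/2 states in total.
    d = min(b, lam_size + 1)
    return d * t - d * (d - 1) // 2
```

With b = 80 and t = 100, beam search generates up to 8,000 states, while selectional branching generates only 100 when every prediction is confident (|λ| = 0), which is why its average speed approaches greedy parsing.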