Transition-based Dependency Parsing with Selectional Branching


Description

We present a novel approach, called selectional branching, which uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to those of greedy transition-based dependency parsing. Selectional branching is guaranteed to perform fewer transitions than beam search while parsing as accurately. We also present a new transition-based dependency parsing algorithm that gives a complexity of O(n) for projective parsing and an expected linear-time speed for non-projective parsing. With the standard setup, our parser shows an unlabeled attachment score of 92.96% and a parsing speed of 9 milliseconds per sentence, which is faster and more accurate than the current state-of-the-art transition-based parser that uses beam search.

Transcript of Transition-based Dependency Parsing with Selectional Branching

Transition-based Dependency Parsing With Selectional Branching

Jinho D. Choi and Andrew McCallum
Department of Computer Science, University of Massachusetts Amherst

Greedy vs. non-greedy dependency parsing

• Speed vs. accuracy
  : Greedy parsing is faster (about 10-35 times faster).
  : Non-greedy parsing is more accurate (about 1-2% more accurate).
• Non-greedy parsing approaches
  : Transition-based parsing using beam search.
  : Other approaches: graph-based parsing, linear programming, dual decomposition.

How many beams do we need during parsing?

• Intuition: simpler sentences need fewer beams than more complex sentences to generate the most accurate parse output.
• Rule of thumb: a greedy parser performs as accurately as a non-greedy parser using beam search (beam size = 80) about 64% of the time.
• Motivation: the average parsing speed can be improved without compromising the overall parsing accuracy by allocating different beam sizes to different sentences.
• Challenge: how can we determine the appropriate beam size for a given sentence?

Introduction

Branching strategy

• s_ij represents a parsing state, where i is the index of the current transition sequence and j is the index of the current parsing state; p_kj represents the k'th best prediction given s_1j.

1. The one-best sequence T_1 = [s_11, …, s_1t] is generated by a greedy parser.

2. While generating T_1, the parser adds tuples (s_1j, p_2j), …, (s_1j, p_kj) to a list λ for each low-confidence prediction p_1j given s_1j (in our case, k = 2).

3. New transition sequences are then generated using the b highest-scoring predictions in λ, where b is the beam size.

4. The same greedy parser is used to generate these new sequences, although it now starts with s_1j instead of an initial parsing state, applies p_kj to s_1j, and performs further transitions.

5. Finally, a parse tree is built from the sequence with the highest score, where the score of T_i is the average score of all predictions that lead to generating the sequence.
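The five steps above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' ClearNLP implementation: `predict` and the tuple-based states are our own simplifications, and for brevity each branch's average score covers only the branch's own predictions rather than the full shared prefix of T_1.

```python
import heapq

def selectional_branching(predict, b, margin):
    """Toy sketch of selectional branching. A state is a tuple of actions;
    predict(state) returns scored candidates, best first ([] if terminal)."""
    lam = []                                   # λ: branch points found on T_1
    seq, total, n = (), 0.0, 0
    while True:
        cands = predict(seq)
        if not cands:
            break
        s1, a1 = cands[0]
        # Low confidence: the runner-up is within `margin` of the best (k = 2).
        if len(cands) > 1 and s1 - cands[1][0] < margin:
            lam.append((cands[1][0], seq, cands[1][1]))   # (score, s_1j, p_2j)
        total += s1
        n += 1
        seq += (a1,)
    best = (total / n, seq)                    # the one-best sequence T_1
    # Branch from the b - 1 highest-scoring entries in λ (so d <= b sequences),
    # then let the same greedy parser finish each branch.
    for s0, state, alt in heapq.nlargest(b - 1, lam):
        st, tot, m = state + (alt,), s0, 1
        while True:
            cands = predict(st)
            if not cands:
                break
            s, a = cands[0]
            tot += s
            m += 1
            st += (a,)
        best = max(best, (tot / m, st))        # keep the highest average score
    return best[1]

# Usage on a tiny hand-built model: the greedy path A->x averages 0.70, but
# branching at the low-confidence first step finds B->y (average 0.92).
def toy_predict(state):
    table = {(): [(0.90, "A"), (0.85, "B")],
             ("A",): [(0.50, "x")],
             ("B",): [(0.99, "y")]}
    return table.get(state, [])
```

With b = 1 the branch loop never runs, so the function degenerates to plain greedy parsing, which is exactly the intended behavior of selectional branching when no beam is allocated.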

Comparison to beam search

• t is the maximum number of parsing states generated by any transition sequence.

Finding low confidence predictions

• The best prediction is of low confidence if there exists another prediction whose margin (score difference) to the best prediction is less than a threshold m, i.e., |C_k(x, m)| > 1.

• The optimal beam size and margin threshold are found during development using grid search.
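The membership test |C_k(x, m)| > 1 is easy to state in code. The small helper below (the function name and the example scores are ours, for illustration) returns C_k(x, m) for a dict of prediction scores:

```python
def low_confidence_candidates(scores, k, m):
    """C_k(x, m): the up-to-k highest-scoring predictions whose margin to
    the best prediction is less than m. The best prediction is always a
    member, so the state is low-confidence iff the set has > 1 element."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_score = ranked[0][1]
    return [label for label, s in ranked[:k] if best_score - s < m]

# With a tight margin only SHIFT survives (confident -> no branching);
# with a loose margin LEFT-ARC also qualifies (low confidence -> branch).
scores = {"SHIFT": 0.91, "LEFT-ARC": 0.80, "RIGHT-ARC": 0.40}
```

Since both k and m are fixed hyperparameters tuned on the development set, this check adds only a constant amount of work per parsing state.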

Selectional Branching

Projective parsing experiments

• The standard setup on WSJ (Yamada and Matsumoto's head rules, Nivre's labeling rules).
• Our speeds (seconds per sentence) are measured on an Intel Xeon 2.57GHz machine.
• POS tags are generated by the ClearNLP POS tagger (97.5% accuracy on WSJ-23).
• b_t: beam size used during training; b_d: beam size used during decoding.

Non-projective parsing experiments

• Danish, Dutch, Slovene, and Swedish data from the CoNLL-X shared task.
• Nivre’06: pseudo-projective parsing; McDonald’06: 2nd-order graph-based parsing.
• Nivre’09: swap transition; N&M’08: ensemble of Nivre’06 and McDonald’06.
• Fernández-González & Gómez-Rodríguez’12: buffer transition; Martins’10: linear programming.

Experiments

Approach            Danish         Dutch          Slovene        Swedish
                    LAS    UAS     LAS    UAS     LAS    UAS     LAS    UAS
Nivre’06            84.77  89.80   78.59  81.35   70.30  78.72   84.58  89.50
McDonald’06         84.79  90.58   79.19  83.57   73.44  83.17   82.55  88.93
Nivre’09            84.20  –       –      –       75.20  –       –      –
F&G’12              85.17  90.10   –      –       –      –       83.55  89.30
N&M’08              86.67  –       81.63  –       75.94  –       84.66  –
Martins’10          –      91.50   –      84.91   –      85.53   –      89.80
b_t = 80, b_d = 80  87.27  91.36   82.45  85.33   77.46  84.65   86.80  91.36
b_t = 80, b_d = 1   86.75  91.04   80.75  83.59   75.66  83.29   86.32  91.12

• Our non-projective parsing algorithm shows an expected linear-time parsing speed and gives state-of-the-art parsing accuracy compared to other non-projective parsing approaches.
• Our selectional branching uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to greedy parsing.
• Our parser is publicly available under the open source project ClearNLP (clearnlp.com).

Conclusion

• We gratefully acknowledge a grant from the Defense Advanced Research Projects Agency under the DEFT project, solicitation #: DARPA-BAA-12-47.

Acknowledgments

Algorithm

• A hybrid of Nivre's arc-eager and list-based algorithms (Nivre, 2003; Nivre, 2008).
• When the training data contains only projective trees, it learns only projective transitions → gives a parsing complexity of O(n) for projective parsing.
• When the training data contains both projective and non-projective trees, it learns both kinds of transitions → gives an expected linear-time parsing speed.
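For intuition on the O(n) bound, here is a generic arc-eager-style loop (a textbook sketch, not the authors' hybrid system; the scripted `oracle` stands in for the real classifier). Each word is shifted exactly once and popped at most once, so the number of transitions is linear in sentence length:

```python
def parse_arc_eager(n_words, oracle):
    """Minimal arc-eager-style transition loop over word indices 1..n_words,
    with 0 as the artificial root. Returns (head, dependent) arcs."""
    stack, buffer, arcs = [0], list(range(1, n_words + 1)), []
    while buffer:
        action = oracle(stack, buffer)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":      # front of buffer governs top of stack
            arcs.append((buffer[0], stack.pop()))
        elif action == "RIGHT-ARC":     # top of stack governs front of buffer
            arcs.append((stack[-1], buffer[0]))
            stack.append(buffer.pop(0))
        elif action == "REDUCE":
            stack.pop()
    return arcs

# Usage: replay a gold transition sequence for a 2-word sentence
# ("news hit": root -> hit, hit -> news).
acts = iter(["SHIFT", "LEFT-ARC", "RIGHT-ARC"])
arcs = parse_arc_eager(2, lambda stack, buf: next(acts))
```

Handling non-projective trees requires additional list-based transitions, which is why the hybrid algorithm's non-projective speed is linear only in expectation rather than in the worst case.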

Transitions

Hybrid Dependency Parsing

IESL

Approach               UAS    LAS    Speed  Note
Zhang & Clark (2008)   92.10  –      –      Beam search
Huang & Sagae (2010)   92.10  –      0.04   + Dynamic programming
Zhang & Nivre (2011)   92.90  91.80  0.03   + Rich non-local features
Bohnet & Nivre (2012)  93.38  92.44  0.40   + Joint inference
b_t = 80, b_d = 80     92.96  91.93  0.009  Using beam sizes of 16 or above during decoding gave almost the same results.
b_t = 80, b_d = 64     92.96  91.93  0.009
b_t = 80, b_d = 32     92.96  91.94  0.009
b_t = 80, b_d = 16     92.96  91.94  0.008
b_t = 80, b_d = 1      92.26  91.25  0.002  Training with a higher beam size improved greedy parsing.
b_t = 1, b_d = 1       92.06  91.05  0.002

• The number of transitions performed during decoding with respect to sentence length for Dutch.

• Dutch has the highest proportion of non-projective trees among the CoNLL-X languages (5.4% of arcs, 36.4% of trees).

[Figure: # of transitions performed vs. sentence length for Dutch; legend: Proj., Non-Proj.]

[Branching diagram annotations: the 2nd parsing state in the 1st transition sequence; the 2nd-best prediction given s_11]

                                Beam search   Selectional branching
Max. # of transition sequences  b             d = min(b, |λ| + 1)
Max. # of parsing states        b · t         d · t − d(d − 1)/2
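The two bounds can be compared directly. The helper below (function names ours) evaluates both formulas, using d = min(b, |λ| + 1):

```python
def beam_search_states(b, t):
    # Beam search keeps b sequences alive for all t transitions.
    return b * t

def selectional_branching_states(b, t, lam_size):
    # Only d = min(b, |λ| + 1) sequences are generated, and branch i reuses
    # the first i states of T_1, saving d(d - 1)/2 states in total.
    d = min(b, lam_size + 1)
    return d * t - d * (d - 1) // 2
```

With b = 80 and t = 100, beam search generates up to 8,000 states, while selectional branching generates only 100 when every prediction is confident (|λ| = 0), which is why its average speed approaches greedy parsing.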