Transition-based Dependency Parsing with Selectional Branching
Jinho D. Choi and Andrew McCallum
Department of Computer Science, University of Massachusetts Amherst
Greedy vs. non-greedy dependency parsing
• Speed vs. accuracy: greedy parsing is faster (about 10-35 times faster); non-greedy parsing is more accurate (about 1-2% more accurate).
• Non-greedy parsing approaches: transition-based parsing using beam search; other approaches include graph-based parsing, linear programming, and dual decomposition.
How many beams do we need during parsing?
• Intuition: simpler sentences need fewer beams than more complex sentences to generate the most accurate parse output.
• Rule of thumb: a greedy parser performs as accurately as a non-greedy parser using beam search (beam size = 80) about 64% of the time.
• Motivation: the average parsing speed can be improved without compromising the overall parsing accuracy by allocating different beam sizes to different sentences.
• Challenge: how can we determine the appropriate beam size for a given sentence?
Introduction
Branching strategy
• sij represents a parsing state, where i is the index of the current transition sequence and j is the index of the current parsing state; pkj represents the k'th-best prediction given s1j.
1. The one-best sequence T1 = [s11, … , s1t] is generated by a greedy parser.
2. While generating T1, the parser adds tuples (s1j, p2j), … , (s1j, pkj) to a list λ for each low-confidence prediction p1j given s1j (in our case, k = 2).
3. New transition sequences are then generated using the b highest-scoring predictions in λ, where b is the beam size.
4. The same greedy parser is used to generate these new sequences, except that it now starts from s1j instead of an initial parsing state, applies pkj to s1j, and performs further transitions.
5. Finally, a parse tree is built from the sequence with the highest score, where the score of Ti is the average score of all predictions that led to generating the sequence.
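The five steps above can be sketched as follows. This is a minimal illustration, not the actual parser: `predict`, `apply_t`, and `is_final` are hypothetical stand-ins for the parser's scoring model and transition system.

```python
def selectional_branching(init_state, predict, apply_t, is_final, margin, beam_size):
    """Sketch of selectional branching (k = 2).

    predict(state) -> [(score, transition), ...], sorted best first
    apply_t(state, transition) -> next state
    (hypothetical helper signatures, for illustration only)
    """
    # Steps 1-2: generate the one-best greedy sequence, recording in λ the
    # 2nd-best prediction at every low-confidence branch point.
    lam, scores, state = [], [], init_state
    while not is_final(state):
        preds = predict(state)
        if len(preds) > 1 and preds[0][0] - preds[1][0] < margin:
            lam.append((state, preds[1], sum(scores), len(scores)))
        scores.append(preds[0][0])
        state = apply_t(state, preds[0][1])
    candidates = [(sum(scores) / len(scores), state)]

    # Steps 3-4: branch from the highest-scoring entries in λ (at most b - 1,
    # so d = min(b, |λ| + 1) sequences in total) and let the same greedy
    # parser finish each branch from s1j onward.
    lam.sort(key=lambda e: e[1][0], reverse=True)
    for branch_state, (s, t), prefix_sum, prefix_len in lam[:beam_size - 1]:
        st, br_scores = apply_t(branch_state, t), [s]
        while not is_final(st):
            preds = predict(st)
            br_scores.append(preds[0][0])
            st = apply_t(st, preds[0][1])
        # Score of a sequence = average score of all predictions that led to it,
        # including the shared prefix of T1.
        avg = (prefix_sum + sum(br_scores)) / (prefix_len + len(br_scores))
        candidates.append((avg, st))

    # Step 5: build the tree from the highest-scoring sequence.
    return max(candidates, key=lambda e: e[0])[1]
```

With beam_size = 1 the λ branches are never expanded, so the function degrades gracefully to plain greedy parsing.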
Comparison to beam search
• t is the maximum number of parsing states generated by any transition sequence.
Finding low confidence predictions
• The best prediction has low confidence if there exists any other prediction whose margin (score difference) to the best prediction is less than a threshold m (i.e., |Ck(x, m)| > 1).
• The optimal beam size and margin threshold are found during development using grid search.
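The margin test can be sketched in a few lines of Python; the input shape, a best-first list of prediction scores, is an assumption for illustration.

```python
def low_confidence(scores, margin):
    """Return True if the best prediction is a low-confidence one, i.e.
    some other prediction scores within `margin` of it (|C_k(x, m)| > 1).

    scores: prediction scores sorted best first (assumed input shape).
    """
    best = scores[0]
    return any(best - s < margin for s in scores[1:])
```

During parsing, each state whose best prediction fails this test contributes its runner-up predictions to the branching list λ.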
Selectional Branching
Projective parsing experiments
• The standard setup on WSJ (Yamada and Matsumoto's head rules, Nivre's labeling rules).
• Our speeds (seconds per sentence) are measured on an Intel Xeon 2.57GHz machine.
• POS tags are generated by the ClearNLP POS tagger (97.5% accuracy on WSJ-23).
• bt: beam size used during training; bd: beam size used during decoding.
Non-projective parsing experiments
• Danish, Dutch, Slovene, and Swedish data from the CoNLL-X shared task.
• Nivre'06: pseudo-projective parsing; McDonald'06: 2nd-order graph-based parsing.
• Nivre'09: swap transition; N&M'08: ensemble of Nivre'06 and McDonald'06.
• Fernandez-Gonzalez & Gomez-Rodriguez'12 (F&G'12): buffer transition; Martins'10: linear programming.
Experiments
Approach        Danish         Dutch          Slovene        Swedish
                LAS    UAS     LAS    UAS     LAS    UAS     LAS    UAS
Nivre'06        84.77  89.80   78.59  81.35   70.30  78.72   84.58  89.50
McDonald'06     84.79  90.58   79.19  83.57   73.44  83.17   82.55  88.93
Nivre'09        84.20  –       –      –       75.20  –       –      –
F&G'12          85.17  90.10   –      –       –      –       83.55  89.30
N&M'08          86.67  –       81.63  –       75.94  –       84.66  –
Martins'10      –      91.50   –      84.91   –      85.53   –      89.80
bt=80, bd=80    87.27  91.36   82.45  85.33   77.46  84.65   86.80  91.36
bt=80, bd=1     86.75  91.04   80.75  83.59   75.66  83.29   86.32  91.12
• Our non-projective parsing algorithm shows the expected linear-time parsing speed and gives state-of-the-art parsing accuracy compared to other non-projective parsing approaches.
• Our selectional branching uses confidence estimates to decide when to employ a beam, providing the accuracy of beam search at speeds close to greedy parsing.
• Our parser is publicly available under the open-source project ClearNLP (clearnlp.com).
Conclusion
• We gratefully acknowledge a grant from the Defense Advanced Research Projects Agency under the DEFT project, solicitation #: DARPA-BAA-12-47.
Acknowledgments
Algorithm
• A hybrid between Nivre's arc-eager and list-based algorithms (Nivre, 2003; Nivre, 2008).
• When the training data contains only projective trees, it learns only projective transitions, giving a parsing complexity of O(n) for projective parsing.
• When the training data contains both projective and non-projective trees, it learns both kinds of transitions, giving an expected linear-time parsing speed.
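The projective/non-projective distinction the algorithm keys on can be illustrated with a small check: a tree is projective iff no two of its dependency arcs cross. This is a sketch; the `heads` encoding below is an assumption for illustration, not the parser's actual data structure.

```python
def is_projective(heads):
    """Check whether a dependency tree is projective.

    heads[i] is the head index of token i + 1, with 0 denoting the root
    (assumed encoding). Two arcs cross iff exactly one endpoint of one
    arc lies strictly inside the other arc's span.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for (l1, r1) in arcs:
        for (l2, r2) in arcs:
            if l1 < l2 < r1 < r2:   # arcs (l1, r1) and (l2, r2) cross
                return False
    return True
```

A parser trained only on projective trees never needs the extra transitions that handle crossing arcs, which is where the O(n) projective complexity comes from.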
Transitions
Hybrid Dependency Parsing
Approach               UAS    LAS    Speed  Note
Zhang & Clark (2008)   92.10  –      –      Beam search
Huang & Sagae (2010)   92.10  –      0.04   + Dynamic programming
Zhang & Nivre (2011)   92.90  91.80  0.03   + Rich non-local features
Bohnet & Nivre (2012)  93.38  92.44  0.40   + Joint inference
bt=80, bd=80           92.96  91.93  0.009  Using beam sizes of 16 or above during
bt=80, bd=64           92.96  91.93  0.009  decoding gave almost the same results.
bt=80, bd=32           92.96  91.94  0.009
bt=80, bd=16           92.96  91.94  0.008
bt=80, bd=1            92.26  91.25  0.002  Training with a higher beam size
bt=1, bd=1             92.06  91.05  0.002  improved greedy parsing.
• The # of transitions performed during decoding with respect to sentence lengths for Dutch.
• Dutch has the highest proportion of non-projective trees among the CoNLL-X languages (5.4% of arcs, 36.4% of trees).
[Figure: branching diagram — s12: 2nd parsing state in the 1st transition sequence; p21: 2nd-best prediction given s11]
                                Beam search   Selectional branching
Max. # of transition sequences  b             d = min(b, |λ| + 1)
Max. # of parsing states        b ∙ t         d ∙ t − d(d − 1)/2
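Plugging numbers into these two formulas shows the gap; the particular values of b, t, and |λ| below are illustrative, not taken from the experiments.

```python
def max_states_beam(b, t):
    """Upper bound on parsing states explored by beam search."""
    return b * t

def max_states_branching(b, t, num_low_conf):
    """Upper bound for selectional branching, where |λ| is the number
    of low-confidence predictions found along the one-best sequence."""
    d = min(b, num_low_conf + 1)        # d = min(b, |λ| + 1)
    return d * t - d * (d - 1) // 2

# Example: b = 80, t = 20, |λ| = 3 (a fairly confident sentence).
print(max_states_beam(80, 20))          # 1600
print(max_states_branching(80, 20, 3))  # d = 4 -> 4*20 - 6 = 74
```

When the greedy parser is confident (small |λ|), selectional branching explores a small fraction of the states that full beam search would, which is the source of the reported speedup.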