Exponential Decay Pruning for Bottom-Up Beam-Search Parsing
Nathan Bodenstab, Brian Roark, Aaron Dunlop, and Keith Hall
April 2010
Talk Outline
• Intro to Syntactic Parsing
  – Why Parse?
• Parsing Algorithms
  – CYK
  – Best-First
  – Beam-Search
• Exponential Decay Pruning
• Results
Intro to Syntactic Parsing
• Hierarchically cluster and label syntactic word groups (constituents)
• Provides structure and meaning
Intro to Syntactic Parsing
• Why Parse?
  – Machine Translation
    • Synchronous Grammars
  – Language Understanding
    • Semantic Role Labeling
    • Word Sense Disambiguation
    • Question-Answering
    • Document Summarization
  – Language Modeling
    • Long-distance dependencies
  – Because it’s fun
Intro to Syntactic Parsing
• What you (usually) need to parse
  – Supervised data: a treebank of sentences with annotated parse structure
    • WSJ treebank: 50k sentences
  – A binarized Probabilistic Context-Free Grammar induced from the treebank
  – A parsing algorithm
• Example grammar rules (stored as sketched below):
  – S → NP VP (prob=0.2)
  – NP → NP NN (prob=0.1)
  – NP → JJ NN (prob=0.06)
  – Binarize VP → PP VB NN into:
    • VP → PP @VP (prob=0.2)
    • @VP → VB NN (prob=0.5)
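A minimal storage sketch (not from the talk): binarized rules like these can be kept as a map from right-hand-side pairs to (parent, log-probability) entries, which is exactly the lookup the bottom-up algorithms later in the talk need. All names are illustrative:

    import math

    # Illustrative fragment of a binarized PCFG, using the rules above.
    # @VP is the intermediate symbol introduced by binarizing VP -> PP VB NN.
    binary_rules = {
        ("NP", "VP"): [("S", math.log(0.2))],
        ("NP", "NN"): [("NP", math.log(0.1))],
        ("JJ", "NN"): [("NP", math.log(0.06))],
        ("PP", "@VP"): [("VP", math.log(0.2))],
        ("VB", "NN"): [("@VP", math.log(0.5))],
    }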
Parsing Accuracy
Grammar                       Non-terminals  Grammar Size  Sec/Sent  F-Score
Baseline                      2,500          64,000        0.1       74%
Parent Annotation (Johnson)   6,000          75,000        1.0       78%
Manual Refinement (Klein)     15,000                                 86%
Latent Variable (Petrov)      1,100          4,000,000     100.0     89%
Lexical (Collins, Charniak)   Lots           Implicit                89%
• Accuracy improvements from grammar refinement
  – Split original non-terminal categories (Subject-NP vs. Object-NP)
  – Accuracy at the cost of speed
    • Solution space becomes impractical to exhaustively search
Berkeley Grammar & Parser
• Petrov et al. automatically split non-terminals using latent variables
• Example grammar rules:
  – S_3 → NP_12 VP_6 (prob=0.2)
  – NP_12 → NP_9 NN_7 (prob=0.1)
  – NN_7 → house (prob=0.06)
• Berkeley Coarse-to-Fine parser uses six latent variable grammars
  – Parse input sentence once with each grammar
  – Posterior probabilities from pass n used to prune pass n+1
  – Must know the mapping between non-terminals from different grammars (see the sketch below)
    • Grammar(2) { NP_1, NP_6 } → Grammar(3) { NP_2, NP_9, NP_14 }
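As a purely illustrative sketch (the actual mapping ships with the Berkeley grammars), the coarse-to-fine constraint can be read as: a fine-grammar non-terminal is a candidate only if one of its coarse projections survived pruning in the previous pass:

    # Illustrative split hierarchy: coarse symbol -> refinements in the next pass.
    coarse_to_fine = {
        "NP_1": ["NP_2", "NP_9"],
        "NP_6": ["NP_14"],
    }

    def allowed_fine_symbols(surviving_coarse):
        """Fine-pass candidates are the refinements of coarse symbols whose
        posterior probability survived pruning in the previous pass."""
        return {fine for c in surviving_coarse
                     for fine in coarse_to_fine.get(c, [])}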
Research Goals
• Our Research Goals– Find good solutions very quickly in this LARGE grammar space (not ML)– Algorithms should be grammar agnostic– Consider practical implications (speed, memory)
• This talk: Exponential Decay Pruning– Beam-Search parsing for efficient search– Searches the final grammar space directly– Balance overhead of targeted exploration (best-first) vs. memory and
cache benefits of local exploration (CYK)
Parsing Algorithms: CYK
• Exhaustive population of all parse trees permitted by the grammar
• The dynamic programming algorithm gives the Maximum Likelihood solution
Parsing Algorithms: CYK
• Fill in cells for SPAN=1,2,3,4,…
Grammar:
  S → NP VP (p=0.7)
  NP → NP NP (p=0.2)
  NP → NP VP (p=0.1)
  NN → court (p=0.4)
  VB → court (p=0.1)
  …
• N iterations through the grammar at each chart cell to consider all possible midpoints
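A minimal CYK sketch in Python, assuming a binary_rules map as above and a lexical_rules map from words to (POS, log-prob) pairs; this illustrates the exhaustive loop structure, not the authors' implementation:

    import math
    from collections import defaultdict

    def cyk(words, lexical_rules, binary_rules):
        n = len(words)
        # chart[(start, end)] maps a non-terminal to its best inside log-prob.
        chart = defaultdict(dict)
        for i, w in enumerate(words):                   # span = 1: POS entries
            for pos, logp in lexical_rules.get(w, []):
                chart[(i, i + 1)][pos] = logp
        for span in range(2, n + 1):                    # fill spans 2, 3, ..., n
            for start in range(n - span + 1):
                end = start + span
                cell = chart[(start, end)]
                for mid in range(start + 1, end):       # all possible midpoints
                    for left, lp in chart[(start, mid)].items():
                        for right, rp in chart[(mid, end)].items():
                            for parent, rule_lp in binary_rules.get((left, right), []):
                                score = lp + rp + rule_lp
                                if score > cell.get(parent, -math.inf):
                                    cell[parent] = score  # keep the max (Viterbi) score
        return chart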
Parsing Algorithms: Best-First
Grammar:
  S → NP VP (p=0.7)
  VB → court (p=0.1)
  …
Frontier PQ:
  [try][shooting, defendant]  VP → VB NP   fom=28.1
  [try, shooting][defendant]  VP → VB NP   fom=14.7
  [Juvenile][court]           NP → ADJ NN  fom=13
• Frontier is a Priority Queue of all potentially buildable entries
• Add best entry from Frontier; expand Frontier with all possible chart + grammar extensions
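A hedged sketch of one agenda step, assuming a user-supplied fom() scoring function; Python's heapq is a min-heap, so FOM scores are negated. Illustrative only, not the authors' code:

    import heapq

    def best_first_step(frontier, chart, binary_rules, fom):
        """Pop the best frontier entry, add it to the chart, and push
        every new edge it enables back onto the frontier."""
        neg_fom, start, end, symbol, score = heapq.heappop(frontier)
        chart.setdefault((start, end), {})[symbol] = score
        for (s, e), entries in chart.items():
            if s == end:  # popped entry is the left child of a new edge
                for right, rp in entries.items():
                    for parent, rule_lp in binary_rules.get((symbol, right), []):
                        new = score + rp + rule_lp
                        heapq.heappush(frontier, (-fom(start, e, parent, new),
                                                  start, e, parent, new))
            if e == start:  # popped entry is the right child of a new edge
                for left, lp in entries.items():
                    for parent, rule_lp in binary_rules.get((left, symbol), []):
                        new = lp + score + rule_lp
                        heapq.heappush(frontier, (-fom(s, end, parent, new),
                                                  s, end, parent, new))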
Parsing Algorithms: Best-First
• How do we rank Frontier entries?
  – Figure-of-Merit (FOM)
  – FOM = Inside (grammar) × Outside (heuristic)
  – Caraballo and Charniak, 1997 (C&C)
  – Problem: scores for entries with different spans are hard to compare
Parsing Algorithms: Beam-Search
• Beam-Search: best of both worlds
• CYK exhaustive traversal (bottom-up)
• At each chart cell:
  – Compute FOM for all possible cell entries
  – Rank entries in a (temporary) local priority queue
  – Only populate the cell with the n-best entries (beam-width; see the sketch below)
• Less memory
  – Stores neither all cell entries (CYK) nor bad frontier entries (Best-First)
• Runs faster
  – Search space is pruned (unlike CYK), and no global priority queue to maintain (unlike Best-First)
• Eliminates the problem of comparing cell entries globally
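A minimal sketch of filling one cell under beam-search, assuming a fom() function and a beam-width n_best; ranking is local to the cell, so no global queue is needed. Names are illustrative:

    import heapq

    def fill_cell(chart, start, end, binary_rules, fom, n_best):
        """Score every candidate entry for this cell, then keep only the
        n_best by figure-of-merit (the beam)."""
        candidates = {}  # symbol -> (fom score, inside log-prob)
        for mid in range(start + 1, end):
            for left, lp in chart.get((start, mid), {}).items():
                for right, rp in chart.get((mid, end), {}).items():
                    for parent, rule_lp in binary_rules.get((left, right), []):
                        inside = lp + rp + rule_lp
                        f = fom(start, end, parent, inside)
                        if parent not in candidates or f > candidates[parent][0]:
                            candidates[parent] = (f, inside)
        # Local priority queue: rank candidates and keep the top n_best.
        kept = heapq.nlargest(n_best, candidates.items(), key=lambda kv: kv[1][0])
        chart[(start, end)] = {sym: inside for sym, (f, inside) in kept}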
Exponential Decay Pruning
• What is the optimal beam-width per chart cell?
  – Common solutions:
    • Relative score difference from the highest-ranking entry
    • Global maximum number of candidates
• Exponential Decay Pruning
  – Adaptive beam-width conditioned on chart cell information
  – How reliable is our Figure-of-Merit per chart cell?
  – Plotted rank of the Gold entry against span and sentence size
    • FOM is more reliable for larger spans
      – Less dependent on the outside estimate
    • FOM is less reliable for short sentences
      – Atypical grammatical structure (in WSJ?)
Exponential Decay Pruning
• Confidence in the FOM can be modeled with the Exponential Decay function
  – N_0 = global beam-width maximum
  – n = sentence length
  – s = span length (number of words covered)
  – λ = tuning parameter
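The function itself appeared as a figure on the original slide; as a hedged reconstruction from the parameters above (an assumption, the exact exponent in the paper may differ), one plausible form is a beam-width that decays exponentially with relative span length:

    \[ N(n, s) = N_0 \, e^{-\lambda s / n} \]

so small spans keep close to the full N_0 candidates, and the beam narrows as the span grows toward the whole sentence.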
[Figure: percent of baseline constituents added to the chart vs. SpanLength / SentenceLength, with decay curves for sentence lengths n = 5, 10, 20, and 40 plotted against the exhaustive baseline.]
Results
• Wall Street Journal treebank
  – Train: Sections 2-21 (40k sentences)
  – Dev: Section 24 (1.3k sentences)
  – Test: Section 23 (2.4k sentences)
• Berkeley SM6 latent variable grammar
• Figure-of-Merit from Caraballo and Charniak, 1997 (C&C)
• Also applied Cell Closing Constraints (Roark and Hollingshead, 2008)
• External comparison with the Berkeley Coarse-to-Fine parser using the same grammar
Results: Dev
Algorithm    FOM     Beam-Width  Cell Closing  Sec/Sent  Chart Entries  F-Score
CYK                                            94.1      163537         87.2
Best-First   Inside                            138.0     152472         87.2
Best-First   C&C                               1.43      349            85.2
Beam-Search  Inside  Constant                  5.68      35501          87.2
Beam-Search  Inside  Decay                     3.01      20002          87.0
Beam-Search  C&C     Constant                  0.62      7548           87.0
Beam-Search  C&C     Decay                     0.37      5145           87.1
Beam-Search  C&C     Constant    Yes           0.31      5333           87.4
Beam-Search  C&C     Decay       Yes           0.20      3839           87.5
• Figure-of-Merit makes a big difference
• Fast solution, but significant accuracy degradation
• Using the Inside probability for the FOM:
  – 95% speed reduction with Beam-Search over Best-First
  – Exponential Decay adds an additional 47% speed reduction
• Using the C&C FOM:
  – Beam-Search is faster (57%) and more accurate than Best-First
  – Exponential Decay adds an additional 40% speed reduction
Results: Test
Algorithm     FOM  Beam-Width  Cell Closing  Sec/Sent  F-Score
CYK                                          76.63     88.0
Beam-Search   C&C  Constant                  0.45      87.9
Beam-Search   C&C  Decay                     0.28      88.0
Beam-Search   C&C  Decay       Yes           0.16      88.3
Berkeley C2F                                 0.21      88.3
• 38% relative speed-up (Decay vs. Constant beam-width)
• Decay pruning and Cell Closing Constraints are complementary
• Same ballpark as Coarse-to-Fine (perhaps a bit faster)
• Requires no knowledge of the grammar
Thanks
FOM Details
• C&C FOM Details (combined in the sketch below)
  – FOM(NT) = Outside_left × Inside × Outside_right
  – Inside = constituent grammar score for NT
  – Outside_left = max { POS forward prob × POS-to-NT transition prob }
  – Outside_right = max { NT-to-POS transition prob × POS backward prob }
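A hedged sketch of how these terms might combine in log space, assuming precomputed POS forward/backward log-probabilities and POS-to-NT transition log-probabilities; illustrative only, not C&C's exact formulation:

    import math

    def cc_fom(start, end, nt, inside_logp, fwd, bkwd, pos_to_nt, nt_to_pos):
        """FOM(NT) = Outside_left + Inside + Outside_right in log space.
        fwd[i] / bkwd[i]: dicts mapping POS tags to forward/backward
        log-probs at word position i (illustrative data structures)."""
        out_left = 0.0  # log 1 at the sentence edge
        if start > 0:
            out_left = max(fwd[start - 1][p] + pos_to_nt.get((p, nt), -math.inf)
                           for p in fwd[start - 1])
        out_right = 0.0
        if end < len(bkwd):
            out_right = max(nt_to_pos.get((nt, p), -math.inf) + bkwd[end][p]
                            for p in bkwd[end])
        return out_left + inside_logp + out_right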
Research Goals
• Research Goals
  – Find good solutions very quickly in this LARGE grammar space (not ML)
  – Algorithms should be grammar agnostic
  – Consider practical implications (speed, memory)
• Current projects towards these goals
  – Better FOM function
    • Inside estimate (grammar refinement)
    • Outside estimate (participation in a complete parse tree)
  – Optimal chart traversal strategy
    • Which areas of the search space are most promising?
    • Cell Closing Constraints (Roark and Hollingshead, 2008)
  – Balance between targeted and exhaustive exploration
    • How much “work” should be done exploring the search space around these promising areas?
    • Overhead of targeted exploration (best-first) vs. memory and cache benefits of local exploration (CYK)