Yunzhe Yuan · Yong Jiang · Kewei Tu (faculty.sist.shanghaitech.edu.cn/faculty/tukw/aaai19-poster.pdf)


Transcript


Bidirectional Transition-Based Dependency Parsing

Yunzhe Yuan · Yong Jiang · Kewei Tu

School of Information Science and Technology, ShanghaiTech University

Motivation

In transition-based dependency parsing, an early prediction mistake may negatively impact many future decisions.

Proposed Solution

During training, we learn a left-to-right parser and a right-to-left parser separately. To parse a sentence, we perform joint decoding with the two parsers.

Empirical Results

Our methods lead to competitive parsing accuracy, and our method based on dynamic oracles consistently achieves the best performance.

Dependency Parsing

Dependency parsing pipeline:

[Figure: parsing pipeline — the example sentences "I swam yesterday", "Honolulu is beautiful" and "Useful Method in NLP" are POS-tagged (PRP VB NN, NNP VB JJ, JJ NN IN NN) and then parsed into dependency trees]


Transition-Based Dependency Parsing

Regards the parse tree as a sequence of state-action pairs.


The arc-hybrid transition system:

– Shift: move the first buffer word b0 onto the stack.
– Reduce-Left: add the arc b0 → s0 and pop s0 off the stack.
– Reduce-Right: add the arc s1 → s0 and pop s0 off the stack.
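The three arc-hybrid transitions can be sketched as a toy implementation (illustrative only, not the authors' code; the configuration is a triple of stack, buffer, and an arcs map from each child to its head):

```python
# Toy sketch of the arc-hybrid transition system. A configuration is
# (stack, buffer, arcs); arcs maps a child index to its head index.

def shift(stack, buffer, arcs):
    # Shift: move the first buffer word b0 onto the stack.
    stack.append(buffer.pop(0))

def reduce_left(stack, buffer, arcs):
    # Reduce-Left: add arc b0 -> s0 and pop s0.
    arcs[stack.pop()] = buffer[0]

def reduce_right(stack, buffer, arcs):
    # Reduce-Right: add arc s1 -> s0 and pop s0.
    s0 = stack.pop()
    arcs[s0] = stack[-1]

# Parse "I swam yesterday" (word indices 1..3, 0 = root):
stack, buffer, arcs = [0], [1, 2, 3], {}
shift(stack, buffer, arcs)         # stack [0, 1], buffer [2, 3]
reduce_left(stack, buffer, arcs)   # "I" attached to "swam"
shift(stack, buffer, arcs)         # stack [0, 2], buffer [3]
shift(stack, buffer, arcs)         # stack [0, 2, 3]
reduce_right(stack, buffer, arcs)  # "yesterday" attached to "swam"
reduce_right(stack, buffer, arcs)  # "swam" attached to the root
print(arcs)  # {1: 2, 3: 2, 2: 0}
```

Every word is pushed exactly once and popped exactly once, so a sentence of n words is parsed in exactly 2n transitions.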

Decoding

– In one line: while(!done) {do best stuff}
– Exposure bias problem

– Idea: learn a left-to-right parser and a right-to-left parser separately; perform joint decoding with them. ⇐ How?

First Attempt: Vanilla Joint Scoring


Each parser proposes its own best tree under its scoring function (F and G for the two parsers, over the set of trees T):

y = arg max_{t ∈ T} F(t),   z = arg max_{t ∈ T} G(t)

The exact joint objective is approximated by restricting the search to {y, z}:

t∗ = arg max_{t ∈ T} (F(t) + G(t))  ↦  t∗ = arg max_{t ∈ {y, z}} (F(t) + G(t))
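The restricted argmax over {y, z} can be computed directly, assuming each parser exposes its best tree and can score an arbitrary tree (a sketch, not the authors' interface):

```python
# Vanilla joint scoring (sketch): the intractable argmax over all trees T is
# replaced by rescoring only the two parsers' own best trees y and z under
# the combined objective F(t) + G(t).

def vanilla_joint(y, z, F, G):
    # y: best tree of one parser (under F), z: best tree of the other (under G).
    # F, G: functions that score any candidate tree.
    return max([y, z], key=lambda t: F(t) + G(t))
```

The restriction to {y, z} is what makes this "vanilla": neither parser ever explores trees influenced by the other during decoding.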

Second Attempt: Dual Decomposition (DD)

– Optimize an upper-bounding objective function with Lagrange multipliers u(i, j)

L(u) = max_{y ∈ T} (F(y) + Σ_{i,j} u(i, j) y(i, j)) + max_{z ∈ T} (G(z) − Σ_{i,j} u(i, j) z(i, j))

– Approximately optimize it with dual decomposition

– Better than vanilla with a small gain

– One possible reason: u only contains information about reduce actions
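A sketch of the standard subgradient loop for minimizing L(u): each parser decodes with its arc scores shifted by ±u, and u is updated on the arcs where the two trees disagree. The decoders, step size, and iteration budget here are stand-ins, not the authors' exact settings:

```python
# Dual decomposition (sketch): subgradient descent on the Lagrangian L(u).
# decode_f / decode_g stand in for the two parsers' decoders; each returns a
# tree as a set of arcs (i, j) given the current per-arc adjustments u.
from collections import defaultdict

def dual_decompose(decode_f, decode_g, iterations=100, step=0.01):
    u = defaultdict(float)       # one Lagrange multiplier per arc (i, j)
    for _ in range(iterations):
        y = decode_f(u)          # maximizes F(y) + sum u(i,j) * y(i,j)
        z = decode_g(u)          # maximizes G(z) - sum u(i,j) * z(i,j)
        if y == z:
            return y             # agreement certifies an exact solution
        for arc in y - z:        # arc in y but not in z
            u[arc] -= step
        for arc in z - y:        # arc in z but not in y
            u[arc] += step
    return y                     # no agreement: fall back to one parse
```

In practice agreement is not guaranteed within the iteration budget, which is why the method is described as an approximate optimization.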

Static Oracle (SO) vs Dynamic Oracle (DO)

Third Attempt: Dynamic Oracle (DO)

Iterative decoding:

– In each iteration, use the parse of one parser to construct a dynamic oracle that guides the other parser


[Diagram: iterative decoding alternates between M_l2r and M_r2l, producing the parses x(0), y(0), x(1), y(1), …, x(n), y(n), with each parse guiding the other parser's next pass]

– Modify the scoring function of each parser according to the dynamic oracle
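The alternation can be sketched as a short loop. The guided decoders parse_l2r and parse_r2l are stand-ins (in the actual method each one decodes with action scores modified by a dynamic oracle built from the guide tree), and the final selection follows the joint scoring described in the text:

```python
# Iterative joint decoding with dynamic oracles (illustrative sketch).
# parse_l2r / parse_r2l stand in for the two parsers, each decoding while
# guided by a dynamic oracle derived from the other parser's latest tree.

def iterative_do_decode(parse_l2r, parse_r2l, F, G, n_iters=4):
    trees, guide = [], None
    for _ in range(n_iters):
        x = parse_l2r(guide)   # l2r pass, guided by the last r2l tree
        y = parse_r2l(x)       # r2l pass, guided by the new l2r tree
        trees += [x, y]
        guide = y
    # After K iterations, select the best of the 2K collected trees
    # by joint scoring F(t) + G(t).
    return max(trees, key=lambda t: F(t) + G(t))
```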

Comparisons with DD:

– In DO, at least one action in each valid configuration would have its score modified.

– DO leads to significantly more change to the scoring function

– DO achieves fast convergence, within at most 4 iterations on PTB

Experiments and Analysis

Dataset: PTB, CTB and 8 UD Treebanks
Code: https://github.com/yuanyunzhe/bi-trans-parser

Method          PTB UAS       PTB LAS       CTB UAS       CTB LAS
L2R             93.54 ± 0.12  92.22 ± 0.17  86.21 ± 0.14  85.02 ± 0.13
R2L             93.56 ± 0.18  93.27 ± 0.25  86.44 ± 0.07  85.22 ± 0.07
Vanilla         94.35 ± 0.05  92.91 ± 0.11  87.36 ± 0.07  86.07 ± 0.06
DD              94.35 ± 0.05  93.01 ± 0.09  87.41 ± 0.09  86.18 ± 0.09
DO              94.60 ± 0.04  94.02 ± 0.13  88.07 ± 0.07  87.54 ± 0.14
DD + DO         94.60 ± 0.04  94.02 ± 0.13  88.09 ± 0.08  87.52 ± 0.10
C&M14           91.80         89.60         83.90         82.40
Dyer15          93.10         90.90         87.20         85.70
Weiss15         93.99         92.05         -             -
Andor16         94.61         92.79         -             -
Ballesteros16   93.56         91.42         87.65         86.21
K&G16           93.90         91.90         87.60         86.10
Zhang16         94.10         91.90         87.84         86.15
Shi17           94.53 ± 0.05  -             88.62 ± 0.09  -

Table 1: Results on PTB and CTB

Method    DE UAS  DE LAS  EN UAS  EN LAS  ES UAS  ES LAS  FR UAS  FR LAS
L2R       81.62   76.14   88.87   86.79   86.52   82.90   87.33   83.17
R2L       81.54   76.03   89.13   87.10   86.78   83.05   87.63   83.57
Vanilla   82.62   76.90   90.20   88.02   87.49   83.60   88.25   84.04
DD        82.64   77.12   90.23   88.24   87.52   83.78   88.30   84.77
DO        83.02   79.58   90.56   89.48   87.83   85.69   88.81   87.82

Method    IT UAS  IT LAS  NL UAS  NL LAS  PL UAS  PL LAS  ZH UAS  ZH LAS
L2R       91.41   89.25   87.07   83.43   94.77   92.98   85.16   82.64
R2L       91.46   89.33   87.74   84.44   95.39   93.81   86.01   83.26
Vanilla   92.19   89.90   88.56   84.72   95.94   93.96   87.04   84.24
DD        92.22   90.61   88.58   85.04   95.96   94.47   87.06   84.38
DO        92.31   91.58   89.41   87.41   96.10   94.62   87.75   86.46

Table 2: Results on UD

Hyperparameter             Value
Word embedding dimension   100
POS tag dimension          25
BiLSTM layers              2
LSTM dimensions            200/200
MLP units                  100

Table 3: Hyperparameters used in the experiments


Analysis: We analyze the distribution of erroneous actions with respect to the relative position of the action in the complete action sequence produced by the parser.

[Plot: number of erroneous actions (0-300) vs. normalized position (0.0-1.0) for the l2r and r2l parsers]

Figure 1: Comparisons between results from the two unidirectional models

[Plot: number of erroneous actions (0-300) vs. normalized position (0.0-1.0) for the l2r parser and the DO method]

Figure 2: Comparisons between results from the left-to-right model and the DO method

[Plot: number of erroneous actions (0-300) vs. normalized position (0.0-1.0) for the r2l parser and the DO method]

Figure 3: Comparisons between results from the right-to-left model and the DO method

After K iterations, we collect all the 2K parse trees produced during decoding and select the best parse following joint scoring.

We compare six methods: the left-to-right parser (L2R), the right-to-left parser (R2L), vanilla joint scoring (Vanilla), joint decoding with dual decomposition (DD), joint decoding guided by dynamic oracles (DO), and the combination of DD and DO (DD+DO).

For each dataset, we train each unidirectional parser for 20 epochs, collect all the models after each of the 20 epochs, and choose the best model based on the UAS metric on the development set. We then fix the unidirectional parsers and tune the hyperparameters of our joint decoding algorithms (cdd ∈ {0, 0.005, 0.01} and cdo ∈ {0, 1, 2, 3, 4}) based on the UAS on the development set. We repeat the process five times with different random seeds and report the average accuracy and the standard deviation on the test set.

Table 3 shows all the other hyperparameters used in our experiments.

Figure 4: Accuracy vs. iteration for DD (corresponding to cdo = 0), DO (corresponding to cdd = 0), and DD+DO

Results

We compare our methods with the following neural transition-based parsers on PTB and CTB: C&M14 (Chen and Manning 2014), Dyer15 (Dyer et al. 2015), Weiss15 (Weiss et al. 2015), Ballesteros16 (Ballesteros et al. 2016), Andor16 (Andor et al. 2016), K&G16 (Kiperwasser and Goldberg 2016), Zhang16 (Zhang, Cheng, and Lapata 2017), and Shi17 (Shi, Huang, and Lee 2017). As can be seen from Table 1, our methods are very competitive compared with previous methods and our DO-based algorithm produces the highest accuracies on PTB.

It can also be seen that DD only slightly outperforms Vanilla, while DO achieves significant improvement over both Vanilla and DD. Combining DD with DO does not lead to any improvement over DO.

We also run our methods without hyperparameter tuning on the eight UD datasets. As can be seen from Table 2, our three joint decoding algorithms consistently outperform the unidirectional baselines and the DO-based algorithm consistently achieves the highest accuracy.

Figure 4 plots the change of accuracy during our iterative joint decoding algorithms (DD, DO and DD+DO) on the WSJ development set. It can be seen that the DO-based method not only leads to higher accuracy, but also converges very fast (within 4 iterations). In comparison, DD does not converge within 100 iterations. It can also be seen that when cdo is small, combining DD into the DO-based method is helpful; but when cdo is large, adding DD to the DO-based method has little impact on the accuracy.

Analysis

Statistics: We count the number of action errors that appear at each relative position in the parses from the left-to-right parser, the right-to-left parser, and the DO method, and plot the results in Figures 1, 2 and 3. The figures reveal some interesting findings.

For the two unidirectional parsers, we also analyze the distribution of erroneous actions with respect to the relative position (normalized to the range between 0 and 1) of the action in the complete action sequence produced by the parser (Figure 1). For the right-to-left parser, we reverse the