
A Neural Reordering Model for Phrase-based Translation

Peng Li Tsinghua University

pengli09@gmail.com

joint work with Yang Liu, Maosong Sun, Tatsuya Izuha, Dakun Zhang

Phrase-based Translation

2

(Koehn et al., 2003; Och and Ney, 2004)

布什 与 沙龙 举行 了 会谈

Bush held a talk with Sharon

segmentation → reordering → translation

Reordering is Hard

3

[Word cloud, in scrambled order: Chinese, President, Xi Jinping, and, his, US, counterpart, Barack, Obama, open, two, days, of, talks, in, California, on, a, number, of, high-stakes, issues]

Q: Can you figure out a sentence using these words?

Reordering is Hard

4

Chinese President Xi Jinping and his US counterpart Barack Obama open two days of talks in California on a number of high-stakes issues

Reordering is Hard

5

• An NP-complete problem (Knight, 1999; Zaslavskiy et al., 2009)

• Reordering modeling has attracted intensive attention, e.g.

• Distance-based model (Koehn et al., 2003)

• Word-based lexicalized model (Koehn et al., 2007)

• Phrase-based lexicalized model (Tillman, 2004)

• Hierarchical phrase-based lexicalized model (Galley and Manning, 2008)

Distance-based Model

6

(Koehn et al., 2003)

布什 与 沙龙 举行 了 会谈

Bush held a talk with Sharon

[Distortions along the derivation: d=0 for "Bush", d=2 for "held a talk" (jump +2), d=5 for "with Sharon" (jump −5)]
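The distance-based penalty on the slide above can be sketched in a few lines; the sign convention (jump size relative to the word right after the previous phrase, with the model penalizing the magnitude) is the standard one, and the word positions are taken from the example sentence.

```python
# Sketch of the distance-based distortion cost (Koehn et al., 2003):
# d = start(current source phrase) - end(previous source phrase) - 1,
# with the model penalizing |d|.  Word positions are 1-based, and
# prev_end = 0 at the sentence start.

def distortion(prev_end: int, cur_start: int) -> int:
    """Signed distortion between consecutively translated source phrases."""
    return cur_start - (prev_end + 1)

# Source: 布什(1) 与(2) 沙龙(3) 举行(4) 了(5) 会谈(6)
print(distortion(prev_end=0, cur_start=1))  # 0: "Bush" starts the sentence
print(distortion(prev_end=1, cur_start=4))  # 2: jump forward to "held a talk"
print(distortion(prev_end=6, cur_start=2))  # -5: jump back to "with Sharon"
```

The signed value distinguishes forward from backward jumps; the model itself only scores the absolute distance.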

Lexicalized Models

7

(Koehn et al., 2007; Tillman, 2004; Galley and Manning, 2008)

布什 与 沙龙 举行 了 会谈

Bush held a talk with Sharon

[Orientations marked on the alignment: M (monotone), S (swap), D (discontinuous)]

Challenge #1: Sparsity

10

Source Phrase    Target Phrase    M    S    D
布什             Bush             0.7  0.2  0.1
举行 了 会谈     held a talk      0.1  0.1  0.8
与 沙龙          with Sharon      0.7  0.1  0.2
举行 了          held a           0.6  0.1  0.3
会谈             talk             0.4  0.3  0.3

Challenge #1: Sparsity

11

• Probability distributions are estimated by MLE

P(D | 举行 了 会谈 ↔ held a talk) = #(D, 举行 了 会谈 ↔ held a talk) / #(•, 举行 了 会谈 ↔ held a talk)
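The MLE estimate above is just relative frequency over orientation counts; a minimal sketch, with an invented phrase pair (romanized for readability) and invented counts:

```python
from collections import Counter

# MLE sketch for the lexicalized model: P(o | phrase pair) is the count
# of orientation o with that pair divided by the total count of the pair.
# The phrase pair and the counts below are invented for illustration.
counts = Counter({
    (("juxing le huitan", "held a talk"), "M"): 1,
    (("juxing le huitan", "held a talk"), "S"): 1,
    (("juxing le huitan", "held a talk"), "D"): 8,
})

def p_orientation(pair, o):
    total = sum(counts[(pair, x)] for x in ("M", "S", "D"))
    return counts[(pair, o)] / total if total else 0.0

pair = ("juxing le huitan", "held a talk")
print(p_orientation(pair, "D"))  # 0.8
```

Any pair unseen in training gets no usable distribution, which is exactly the sparsity problem the slide points at.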

Challenge #2: Ambiguity

12

沙龙 与 布什

Challenge #3: Context Insensitivity

13

沙龙 与 布什 举行 了 会谈

Bush held a talk with Sharon

[The same phrase pair 举行 了 会谈 ↔ held a talk takes orientation S in this context]

How to resolve the three challenges?

Including More Contexts

14

与 沙龙 举行 了 会谈

[Predict the orientation S of 举行 了 会谈 ↔ held a talk conditioned on the previous pair 与 沙龙 ↔ with Sharon]

Sparsity ?

Ambiguity ✔

Context Insensitivity ✔

Sparsity

15

• Including more contexts leads to more severe sparsity

P(D | 举行 了 会谈 ↔ held a talk, 与 沙龙 ↔ with Sharon) = #(D, 举行 了 会谈 ↔ held a talk, 与 沙龙 ↔ with Sharon) / #(•, 举行 了 会谈 ↔ held a talk, 与 沙龙 ↔ with Sharon)

Reordering as Classification

Neural Reordering Model

16

• A neural classifier for predicting reordering orientations

• Conditioned on both the current and previous phrase pairs

• Improves context sensitivity

• Reduces reordering ambiguity

• A single classifier for all phrase pairs

• Uses vector space representations

• Alleviates the data sparsity problem

Recursive Autoencoder (RAE)

17

(Pollack, 1990; Socher et al., 2011)

"held a talk", with word vectors x_1, x_2, x_3

y_1 = f^{(1)}(W^{(1)}[x_1; x_2] + b^{(1)})

[x'_1; x'_2] = f^{(2)}(W^{(2)} y_1 + b^{(2)})

Reconstruction errors: ||x_1 − x'_1||^2, ||x_2 − x'_2||^2

Recursive Autoencoder (RAE)

18

y_2 = f^{(1)}(W^{(1)}[y_1; x_3] + b^{(1)})

[y'_1; x'_3] = f^{(2)}(W^{(2)} y_2 + b^{(2)})

Reconstruction errors: ||y_1 − y'_1||^2, ||x_3 − x'_3||^2
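The RAE composition and reconstruction steps can be sketched with toy numpy parameters; the dimension n = 4, the tanh activations, and the random initialization are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # embedding dimension (assumption)
W1, b1 = rng.standard_normal((n, 2 * n)) * 0.1, np.zeros(n)      # encoder
W2, b2 = rng.standard_normal((2 * n, n)) * 0.1, np.zeros(2 * n)  # decoder

def encode(c1, c2):
    # y = f1(W1 [c1; c2] + b1)
    return np.tanh(W1 @ np.concatenate([c1, c2]) + b1)

def reconstruct(y):
    # [c1'; c2'] = f2(W2 y + b2)
    out = np.tanh(W2 @ y + b2)
    return out[:n], out[n:]

def rec_error(c1, c2):
    # Squared reconstruction error of one composition step
    y = encode(c1, c2)
    r1, r2 = reconstruct(y)
    return np.sum((c1 - r1) ** 2) + np.sum((c2 - r2) ** 2)

# Left-to-right composition of "held a talk":
# y1 = encode(x_held, x_a), then y2 = encode(y1, x_talk).
xs = [rng.standard_normal(n) for _ in range(3)]
y1 = encode(xs[0], xs[1])
y2 = encode(y1, xs[2])
print(y2.shape)  # (4,)
```

The phrase representation y_2 has the same dimensionality as a word vector, which is what lets the encoder be applied recursively.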

Neural Classifier

19

orientations

softmax

RAE representations of the current phrase pair and the previous phrase pair
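A minimal sketch of the classifier layer, assuming the four RAE representations (source and target sides of the current and previous phrase pairs) are concatenated and fed through an affine layer plus softmax; the sizes and random weights are invented for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
n = 4                      # per-phrase RAE representation size (assumption)
labels = ["M", "S", "D"]
W = rng.standard_normal((len(labels), 4 * n)) * 0.1
b = np.zeros(len(labels))

# cur source, cur target, prev source, prev target representations
reps = [rng.standard_normal(n) for _ in range(4)]
p = softmax(W @ np.concatenate(reps) + b)
print(p.shape)  # (3,)
```

The output is a proper distribution over orientations, so the same network scores every phrase pair rather than keeping a separate table per pair.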

Training

20

Reordering error on predicting orientations

Reconstruction error on recovering training examples

Combined by linear interpolation

Reconstruction Error

21

• Reconstruction error

E_rec([c_1; c_2]; θ) = (1/2) ||[c_1; c_2] − [c'_1; c'_2]||^2

• Source side average reconstruction error

E_{rec,s}(S; θ) = (1/N_s) Σ_i Σ_{p ∈ T_θ(t_{i,s})} E_rec([p.c_1; p.c_2]; θ)

where T_θ(t_{i,s}) is the set of internal nodes of the RAE tree over the source side of training example t_i

• Total reconstruction error

E_rec(S; θ) = E_{rec,s}(S; θ) + E_{rec,t}(S; θ)

Reordering Error

22

• Average cross-entropy error

E_reo(S; θ) = (1/|S|) Σ_i ( −Σ_o d_{t_i}(o) · log P_θ(o | t_i) )

• Joint training objective

J = α E_rec(S; θ) + (1 − α) E_reo(S; θ) + R(θ)

R(θ) = (λ_L/2) ||θ^L − θ^L_0||^2 + (λ_rec/2) ||θ_rec||^2 + (λ_reo/2) ||θ_reo||^2
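The joint objective J = α E_rec + (1 − α) E_reo + R(θ) is a straightforward interpolation; a toy sketch with placeholder values (the reconstruction error, the prediction, and α below are invented, and the regularizer is set to zero):

```python
import numpy as np

def cross_entropy(d, p):
    # d: gold orientation distribution (one-hot), p: predicted distribution
    return -np.sum(d * np.log(p))

def joint_objective(e_rec, gold, preds, alpha, reg):
    # J = alpha * E_rec + (1 - alpha) * E_reo + R(theta)
    e_reo = np.mean([cross_entropy(d, p) for d, p in zip(gold, preds)])
    return alpha * e_rec + (1 - alpha) * e_reo + reg

gold = [np.array([0.0, 0.0, 1.0])]      # gold orientation D
preds = [np.array([0.1, 0.1, 0.8])]     # classifier output
J = joint_objective(e_rec=0.5, gold=gold, preds=preds, alpha=0.3, reg=0.0)
print(round(J, 4))  # 0.3062
```

α trades off how much the learned representations serve reconstruction versus orientation prediction.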

Optimization

23

• Hyper-parameter optimization: α, λ_L, λ_rec, λ_reo

• Optimized by random search (Bergstra and Bengio, 2012)

• Training objective optimization: L-BFGS

• Using backpropagation through structures to compute the gradients (Goller and Kuchler, 1996)
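Random search over the four hyper-parameters amounts to sampling configurations and keeping the best by dev score; the ranges and the stand-in scoring function below are illustrative assumptions, not the actual settings.

```python
import random

# Toy random search over hyper-parameters (Bergstra and Bengio, 2012).
random.seed(0)

def dev_score(alpha, lam_L, lam_rec, lam_reo):
    # Placeholder for "train the model, evaluate BLEU on the dev set".
    return -(alpha - 0.3) ** 2 - lam_L - lam_rec - lam_reo

best, best_cfg = float("-inf"), None
for _ in range(50):
    cfg = dict(
        alpha=random.uniform(0, 1),
        lam_L=10 ** random.uniform(-6, -2),   # log-uniform ranges (assumption)
        lam_rec=10 ** random.uniform(-6, -2),
        lam_reo=10 ** random.uniform(-6, -2),
    )
    s = dev_score(**cfg)
    if s > best:
        best, best_cfg = s, cfg

print(0 <= best_cfg["alpha"] <= 1)  # True
```

Unlike grid search, each trial samples every dimension independently, which covers the important dimensions faster when only a few hyper-parameters matter.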

Experiments

24

• Chinese-English translation

• Training: 1.2M sentence pairs

• LM: 4-gram, 397.6M words

• Dev. set: NIST 06

• Test set: NIST 02-05, 08

• Case-insensitive BLEU

• Baselines: distance-based model; lexicalized models — {word-based, phrase-based, hier. phrase-based} model types × {M/S/D, left/right} orientations

M/S/D Orientations

25

• Care about relative position and adjacency

布什 与 沙龙 举行 了 会谈 / Bush held a talk with Sharon

布什 ↔ Bush, then 举行 了 会谈 ↔ held a talk: D

举行 了 会谈 ↔ held a talk, then 与 沙龙 ↔ with Sharon: S

Left/Right Orientations

26

• Only care about relative position

布什 ↔ Bush, then 举行 了 会谈 ↔ held a talk: right

举行 了 会谈 ↔ held a talk, then 与 沙龙 ↔ with Sharon: left
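The two orientation schemes above reduce to simple span arithmetic on the source side; a sketch using (start, end) word indices, inclusive, with the usual adjacency-based definitions:

```python
# Orientation extraction from the source-side spans of two target-adjacent
# phrase pairs.  Spans are (start, end) 1-based word indices, inclusive.

def msd_orientation(prev_span, cur_span):
    # M: current phrase immediately follows the previous one on the source
    # S: current phrase immediately precedes it (swap)
    # D: everything else (discontinuous)
    if cur_span[0] == prev_span[1] + 1:
        return "M"
    if cur_span[1] == prev_span[0] - 1:
        return "S"
    return "D"

def left_right_orientation(prev_span, cur_span):
    # Only relative position matters; adjacency is ignored.
    return "right" if cur_span[0] > prev_span[1] else "left"

# 布什 (1,1) then 举行 了 会谈 (4,6): gap of 与 沙龙, so D but right
print(msd_orientation((1, 1), (4, 6)))         # D
print(left_right_orientation((1, 1), (4, 6)))  # right
# 举行 了 会谈 (4,6) then 与 沙龙 (2,3): adjacent swap, so S and left
print(msd_orientation((4, 6), (2, 3)))         # S
print(left_right_orientation((4, 6), (2, 3)))  # left
```

This makes the slides' contrast concrete: M/S/D distinguishes adjacent from discontinuous jumps, while left/right collapses that distinction.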

Translation

27

[Bar chart: BLEU (24-33) on MT06 (dev), MT02, MT03, MT04, MT05, and MT08 for baseline, neural M/S/D, and neural left/right]

Non-Separability

28

• The unaligned Chinese word "de" makes a big difference in determining M/S/D orientations

金门 有 六 万 的 常住 人口

jinmen you liu wan de changzhu renkou

Kinmen has 60000 resident population

M

Non-Separability

29

金门 有 六 万 的 常住 人口

jinmen you liu wan de changzhu renkou

Kinmen has 60000 resident population

D

Non-Separability

30

[Chart: M/S/D orientations for 六 万 的 ↔ 60000 followed by 常住 人口 ↔ resident population, versus 六 万 ↔ 60000 followed by 常住 人口 ↔ resident population]

Non-Separability

31

• Left/right orientations are not so sensitive to unaligned words

金门 有 六 万 的 常住 人口

jinmen you liu wan de changzhu renkou

Kinmen has 60000 resident population

right

Non-Separability

33

[Chart: left/right orientations for 六 万 的 ↔ 60000 and 六 万 ↔ 60000, each followed by 常住 人口 ↔ resident population]

Non-Separability

34

[Side-by-side comparison: left/right vs. M/S/D orientations]

Distortion Limit

35

[Line chart: BLEU (22-25) vs. distortion limit (2, 4, 6, 8) for the neural and lexicalized models]

Word Vectors

36

[Bar chart: BLEU (20-34) on MT06 (dev), MT02, MT03, MT04, MT05, and MT08 for task-oriented vectors vs. word2vec vectors]

Vector Space Representations

37

[2-D visualization of learned phrase vectors; nearby phrases cluster together, e.g. "but is willing to", "economy is required to", "range of services to", "said his visit is to", "is making use of"; "june 18, 2001", "late 2011", "by june 1", "and complete by end 1998"; "as detention center", "group all together", "take care of old"; "or other economic", "and for other", "within the agencies"]

Conclusion

38

• We propose a neural reordering model for phrase-based translation

• It improves context sensitivity, reduces ambiguity, and alleviates the data sparsity problem

• Future work

• Train the MT system and the neural classifier jointly

• Develop more efficient models to leverage larger contexts

• Extend our work to syntax-based and n-gram-based models

Thanks!

39