
A Neural Reordering Model for Phrase-based Translation

Peng Li Tsinghua University

pengli09@gmail.com

joint work with Yang Liu, Maosong Sun, Tatsuya Izuha, Dakun Zhang

Phrase-based Translation

2

(Koehn et al., 2003; Och and Ney, 2004)

布什 与 沙龙 举行 了 会谈

Bush held a talk with Sharon

segmentation → reordering → translation

Reordering is Hard

3

[Word cloud, in scrambled order: Chinese, President, Xi Jinping, and, his, US, counterpart, Barack, Obama, open, two, days, of, talks, in, California, on, a, number, of, high-stakes, issues]

Q: Can you figure out a sentence using these words?

Reordering is Hard

4

Chinese President Xi Jinping and his US counterpart Barack Obama open two days of talks in California on a number of high-stakes issues

Reordering is Hard

5

• An NP-complete problem (Knight, 1999; Zaslavskiy et al., 2009)

• Reordering modeling has attracted intensive attention, e.g.

• Distance-based model (Koehn et al., 2003)

• Word-based lexicalized model (Koehn et al., 2007)

• Phrase-based lexicalized model (Tillman, 2004)

• Hierarchical phrase-based lexicalized model (Galley and Manning, 2008)

Distance-based Model

6

(Koehn et al., 2003)

布什 与 沙龙 举行 了 会谈

Bush held a talk with Sharon

[Distortions along the derivation: d=0 for "Bush", d=2 for "held a talk" (jump +2), d=5 for "with Sharon" (jump −5)]
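The distance-based penalty on the slide above can be sketched in a few lines; the sign convention (jump size relative to the word right after the previous phrase, with the model penalizing the magnitude) is the standard one, and the word positions are taken from the example sentence.

```python
# Sketch of the distance-based distortion cost (Koehn et al., 2003):
# d = start(current source phrase) - end(previous source phrase) - 1,
# with the model penalizing |d|.  Word positions are 1-based, and
# prev_end = 0 at the sentence start.

def distortion(prev_end: int, cur_start: int) -> int:
    """Signed distortion between consecutively translated source phrases."""
    return cur_start - (prev_end + 1)

# Source: 布什(1) 与(2) 沙龙(3) 举行(4) 了(5) 会谈(6)
print(distortion(prev_end=0, cur_start=1))  # 0: "Bush" starts the sentence
print(distortion(prev_end=1, cur_start=4))  # 2: jump forward to "held a talk"
print(distortion(prev_end=6, cur_start=2))  # -5: jump back to "with Sharon"
```

The signed value distinguishes forward from backward jumps; the model itself only scores the absolute distance.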

Lexicalized Models

7

(Koehn et al., 2007; Tillman, 2004; Galley and Manning, 2008)

布什 与 沙龙 举行 了 会谈

Bush held a talk with Sharon

[Orientations marked on the alignment: M (monotone), S (swap), D (discontinuous)]

Challenge #1: Sparsity

10

Source Phrase    Target Phrase    M    S    D
布什             Bush             0.7  0.2  0.1
举行 了 会谈     held a talk      0.1  0.1  0.8
与 沙龙          with Sharon      0.7  0.1  0.2
举行 了          held a           0.6  0.1  0.3
会谈             talk             0.4  0.3  0.3

Challenge #1: Sparsity

11

• Probability distributions are estimated by MLE

P(D | 举行 了 会谈 ↔ held a talk) = #(D, 举行 了 会谈 ↔ held a talk) / #(•, 举行 了 会谈 ↔ held a talk)
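The MLE estimate above is just relative frequency over orientation counts; a minimal sketch, with an invented phrase pair (romanized for readability) and invented counts:

```python
from collections import Counter

# MLE sketch for the lexicalized model: P(o | phrase pair) is the count
# of orientation o with that pair divided by the total count of the pair.
# The phrase pair and the counts below are invented for illustration.
counts = Counter({
    (("juxing le huitan", "held a talk"), "M"): 1,
    (("juxing le huitan", "held a talk"), "S"): 1,
    (("juxing le huitan", "held a talk"), "D"): 8,
})

def p_orientation(pair, o):
    total = sum(counts[(pair, x)] for x in ("M", "S", "D"))
    return counts[(pair, o)] / total if total else 0.0

pair = ("juxing le huitan", "held a talk")
print(p_orientation(pair, "D"))  # 0.8
```

Any pair unseen in training gets no usable distribution, which is exactly the sparsity problem the slide points at.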

Challenge #2: Ambiguity

12

沙龙 与 布什

Challenge #3: Context Insensitivity

13

沙龙 与 布什 举行 了 会谈

Bush held a talk with Sharon

[The same phrase pair 举行 了 会谈 ↔ held a talk takes orientation S in this context]

How to resolve the three challenges?

Including More Contexts

14

与 沙龙 举行 了 会谈

[Predict the orientation S of 举行 了 会谈 ↔ held a talk conditioned on the previous pair 与 沙龙 ↔ with Sharon]

Sparsity ?

Ambiguity ✔

Context Insensitivity ✔

Sparsity

15

• Including more contexts leads to more severe sparsity

P(D | 举行 了 会谈 ↔ held a talk, 与 沙龙 ↔ with Sharon) = #(D, 举行 了 会谈 ↔ held a talk, 与 沙龙 ↔ with Sharon) / #(•, 举行 了 会谈 ↔ held a talk, 与 沙龙 ↔ with Sharon)

Reordering as Classification

Neural Reordering Model

16

• A neural classifier for predicting reordering orientations

• Conditioned on both the current and previous phrase pairs

• Improves context sensitivity

• Reduces reordering ambiguity

• A single classifier for all phrase pairs

• Uses vector space representations

• Alleviates the data sparsity problem

Recursive Autoencoder (RAE)

17

(Pollack, 1990; Socher et al., 2011)

"held a talk", with word vectors x_1, x_2, x_3

y_1 = f^{(1)}(W^{(1)}[x_1; x_2] + b^{(1)})

[x'_1; x'_2] = f^{(2)}(W^{(2)} y_1 + b^{(2)})

Reconstruction errors: ||x_1 − x'_1||^2, ||x_2 − x'_2||^2

Recursive Autoencoder (RAE)

18

y_2 = f^{(1)}(W^{(1)}[y_1; x_3] + b^{(1)})

[y'_1; x'_3] = f^{(2)}(W^{(2)} y_2 + b^{(2)})

Reconstruction errors: ||y_1 − y'_1||^2, ||x_3 − x'_3||^2
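The RAE composition and reconstruction steps can be sketched with toy numpy parameters; the dimension n = 4, the tanh activations, and the random initialization are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # embedding dimension (assumption)
W1, b1 = rng.standard_normal((n, 2 * n)) * 0.1, np.zeros(n)      # encoder
W2, b2 = rng.standard_normal((2 * n, n)) * 0.1, np.zeros(2 * n)  # decoder

def encode(c1, c2):
    # y = f1(W1 [c1; c2] + b1)
    return np.tanh(W1 @ np.concatenate([c1, c2]) + b1)

def reconstruct(y):
    # [c1'; c2'] = f2(W2 y + b2)
    out = np.tanh(W2 @ y + b2)
    return out[:n], out[n:]

def rec_error(c1, c2):
    # Squared reconstruction error of one composition step
    y = encode(c1, c2)
    r1, r2 = reconstruct(y)
    return np.sum((c1 - r1) ** 2) + np.sum((c2 - r2) ** 2)

# Left-to-right composition of "held a talk":
# y1 = encode(x_held, x_a), then y2 = encode(y1, x_talk).
xs = [rng.standard_normal(n) for _ in range(3)]
y1 = encode(xs[0], xs[1])
y2 = encode(y1, xs[2])
print(y2.shape)  # (4,)
```

The phrase representation y_2 has the same dimensionality as a word vector, which is what lets the encoder be applied recursively.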

Neural Classifier

19

orientations

softmax

RAE representations of the current phrase pair and the previous phrase pair
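A minimal sketch of the classifier layer, assuming the four RAE representations (source and target sides of the current and previous phrase pairs) are concatenated and fed through an affine layer plus softmax; the sizes and random weights are invented for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
n = 4                      # per-phrase RAE representation size (assumption)
labels = ["M", "S", "D"]
W = rng.standard_normal((len(labels), 4 * n)) * 0.1
b = np.zeros(len(labels))

# cur source, cur target, prev source, prev target representations
reps = [rng.standard_normal(n) for _ in range(4)]
p = softmax(W @ np.concatenate(reps) + b)
print(p.shape)  # (3,)
```

The output is a proper distribution over orientations, so the same network scores every phrase pair rather than keeping a separate table per pair.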

Training

20

Reordering error on predicting orientations

Reconstruction error on recovering training examples

Combined by linear interpolation

Reconstruction Error

21

• Reconstruction error

E_rec([c_1; c_2]; θ) = (1/2) ||[c_1; c_2] − [c'_1; c'_2]||^2

• Source side average reconstruction error

E_{rec,s}(S; θ) = (1/N_s) Σ_i Σ_{p ∈ T_θ(t_{i,s})} E_rec([p.c_1; p.c_2]; θ)

where T_θ(t_{i,s}) is the set of internal nodes of the RAE tree over the source side of training example t_i

• Total reconstruction error

E_rec(S; θ) = E_{rec,s}(S; θ) + E_{rec,t}(S; θ)

Reordering Error

22

• Average cross-entropy error

E_reo(S; θ) = (1/|S|) Σ_i ( −Σ_o d_{t_i}(o) · log P_θ(o | t_i) )

• Joint training objective

J = α E_rec(S; θ) + (1 − α) E_reo(S; θ) + R(θ)

R(θ) = (λ_L/2) ||θ^L − θ^L_0||^2 + (λ_rec/2) ||θ_rec||^2 + (λ_reo/2) ||θ_reo||^2
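The joint objective J = α E_rec + (1 − α) E_reo + R(θ) is a straightforward interpolation; a toy sketch with placeholder values (the reconstruction error, the prediction, and α below are invented, and the regularizer is set to zero):

```python
import numpy as np

def cross_entropy(d, p):
    # d: gold orientation distribution (one-hot), p: predicted distribution
    return -np.sum(d * np.log(p))

def joint_objective(e_rec, gold, preds, alpha, reg):
    # J = alpha * E_rec + (1 - alpha) * E_reo + R(theta)
    e_reo = np.mean([cross_entropy(d, p) for d, p in zip(gold, preds)])
    return alpha * e_rec + (1 - alpha) * e_reo + reg

gold = [np.array([0.0, 0.0, 1.0])]      # gold orientation D
preds = [np.array([0.1, 0.1, 0.8])]     # classifier output
J = joint_objective(e_rec=0.5, gold=gold, preds=preds, alpha=0.3, reg=0.0)
print(round(J, 4))  # 0.3062
```

α trades off how much the learned representations serve reconstruction versus orientation prediction.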

Optimization

23

• Hyper-parameter optimization: α, λ_L, λ_rec, λ_reo

• Optimized by random search (Bergstra and Bengio, 2012)

• Training objective optimization: L-BFGS

• Using backpropagation through structures to compute the gradients (Goller and Kuchler, 1996)
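Random search over the four hyper-parameters amounts to sampling configurations and keeping the best by dev score; the ranges and the stand-in scoring function below are illustrative assumptions, not the actual settings.

```python
import random

# Toy random search over hyper-parameters (Bergstra and Bengio, 2012).
random.seed(0)

def dev_score(alpha, lam_L, lam_rec, lam_reo):
    # Placeholder for "train the model, evaluate BLEU on the dev set".
    return -(alpha - 0.3) ** 2 - lam_L - lam_rec - lam_reo

best, best_cfg = float("-inf"), None
for _ in range(50):
    cfg = dict(
        alpha=random.uniform(0, 1),
        lam_L=10 ** random.uniform(-6, -2),   # log-uniform ranges (assumption)
        lam_rec=10 ** random.uniform(-6, -2),
        lam_reo=10 ** random.uniform(-6, -2),
    )
    s = dev_score(**cfg)
    if s > best:
        best, best_cfg = s, cfg

print(0 <= best_cfg["alpha"] <= 1)  # True
```

Unlike grid search, each trial samples every dimension independently, which covers the important dimensions faster when only a few hyper-parameters matter.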

Experiments

24

• Chinese-English translation

• Training: 1.2M sentence pairs

• LM: 4-gram, 397.6M words

• Dev. set: NIST 06

• Test set: NIST 02-05, 08

• Case-insensitive BLEU

• Baselines: distance-based model; lexicalized models — {word-based, phrase-based, hier. phrase-based} model types × {M/S/D, left/right} orientations

M/S/D Orientations

25

• Care about relative position and adjacency

布什 与 沙龙 举行 了 会谈 / Bush held a talk with Sharon

布什 ↔ Bush, then 举行 了 会谈 ↔ held a talk: D

举行 了 会谈 ↔ held a talk, then 与 沙龙 ↔ with Sharon: S

Left/Right Orientations

26

• Only care about relative position

布什 ↔ Bush, then 举行 了 会谈 ↔ held a talk: right

举行 了 会谈 ↔ held a talk, then 与 沙龙 ↔ with Sharon: left
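The two orientation schemes above reduce to simple span arithmetic on the source side; a sketch using (start, end) word indices, inclusive, with the usual adjacency-based definitions:

```python
# Orientation extraction from the source-side spans of two target-adjacent
# phrase pairs.  Spans are (start, end) 1-based word indices, inclusive.

def msd_orientation(prev_span, cur_span):
    # M: current phrase immediately follows the previous one on the source
    # S: current phrase immediately precedes it (swap)
    # D: everything else (discontinuous)
    if cur_span[0] == prev_span[1] + 1:
        return "M"
    if cur_span[1] == prev_span[0] - 1:
        return "S"
    return "D"

def left_right_orientation(prev_span, cur_span):
    # Only relative position matters; adjacency is ignored.
    return "right" if cur_span[0] > prev_span[1] else "left"

# 布什 (1,1) then 举行 了 会谈 (4,6): gap of 与 沙龙, so D but right
print(msd_orientation((1, 1), (4, 6)))         # D
print(left_right_orientation((1, 1), (4, 6)))  # right
# 举行 了 会谈 (4,6) then 与 沙龙 (2,3): adjacent swap, so S and left
print(msd_orientation((4, 6), (2, 3)))         # S
print(left_right_orientation((4, 6), (2, 3)))  # left
```

This makes the slides' contrast concrete: M/S/D distinguishes adjacent from discontinuous jumps, while left/right collapses that distinction.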

Translation

27

[Bar chart: BLEU (24-33) on MT06 (dev), MT02, MT03, MT04, MT05, and MT08 for baseline, neural M/S/D, and neural left/right]

Non-Separability

28

• The unaligned Chinese word "de" makes a big difference in determining M/S/D orientations

金门 有 六 万 的 常住 人口

jinmen you liu wan de changzhu renkou

Kinmen has 60000 resident population

M

Non-Separability

29

金门 有 六 万 的 常住 人口

jinmen you liu wan de changzhu renkou

Kinmen has 60000 resident population

D

Non-Separability

30

[Chart: M/S/D orientations for 六 万 的 ↔ 60000 followed by 常住 人口 ↔ resident population, versus 六 万 ↔ 60000 followed by 常住 人口 ↔ resident population]

Non-Separability

31

• Left/right orientations are not so sensitive to unaligned words

金门 有 六 万 的 常住 人口

jinmen you liu wan de changzhu renkou

Kinmen has 60000 resident population

right

Non-Separability

33

[Chart: left/right orientations for 六 万 的 ↔ 60000 and 六 万 ↔ 60000, each followed by 常住 人口 ↔ resident population]

Non-Separability

34

[Side-by-side comparison: left/right vs. M/S/D orientations]

Distortion Limit

35

[Line chart: BLEU (22-25) vs. distortion limit (2, 4, 6, 8) for the neural and lexicalized models]

Word Vectors

36

[Bar chart: BLEU (20-34) on MT06 (dev), MT02, MT03, MT04, MT05, and MT08 for task-oriented vectors vs. word2vec vectors]

Vector Space Representations

37

[2-D visualization of learned phrase vectors; nearby phrases cluster together, e.g. "but is willing to", "economy is required to", "range of services to", "said his visit is to", "is making use of"; "june 18, 2001", "late 2011", "by june 1", "and complete by end 1998"; "as detention center", "group all together", "take care of old"; "or other economic", "and for other", "within the agencies"]

Conclusion

38

• We propose a neural reordering model for phrase-based translation

• It improves context sensitivity, reduces ambiguity, and alleviates the data sparsity problem

• Future work

• Train the MT system and the neural classifier jointly

• Develop more efficient models to leverage larger contexts

• Extend our work to syntax-based and n-gram-based models

Thanks!

39