
Transcript of Chunk-based Decoder for Neural Machine Translation


Chunk-based Decoder for Neural Machine Translation

Shonosuke Ishiwatari¹, Jingtao Yao², Shujie Liu³, Mu Li³, Ming Zhou³, Naoki Yoshinaga⁴, Masaru Kitsuregawa⁴⁵, Weijia Jia²

¹The University of Tokyo ²Shanghai Jiao Tong University ³Microsoft Research Asia

⁴Institute of Industrial Science, the University of Tokyo ⁵National Institute of Informatics

Overview

Proposal: Chunk-based Decoder for Neural Machine Translation

Translation examples

Idea: Use a chunk rather than a word as the basic translation unit

SOTA performance on a distant language pair (En -> Ja)

Translation between distant language pairs is difficult

Word-based Encoder-Decoder

[Figure: word-based encoder-decoder with attention. The encoder reads the English words "someone was bitten by a dog", the decoder emits the Japanese words だれ / か / が / 犬 / に one word at a time, and attention weights such as α(4, dog) link each decoding step to source words. Labels recovered from the garbled figure: Encoder, Decoder, Attention.]
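To make the α( ) notation concrete, here is a minimal sketch of how such attention weights can be computed; the dot-product scorer, the shapes, and the random states are illustrative assumptions of this sketch (Bahdanau-style attention proper scores with a small feed-forward network).

import numpy as np

# Minimal sketch: softmax attention weights alpha(t, x) over source words.
def attention_weights(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state          # one score per source word
    scores -= scores.max()                           # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax
    return weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 4))  # 6 source words ("someone", ..., "dog"), hidden size 4
decoder_state = rng.normal(size=4)        # decoder state at step t = 4
print(attention_weights(decoder_state, encoder_states))  # alpha(4, dog) would be the entry for "dog"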

Difficulties of En -> Ja translation

Experiments

Our decoder for NMT

Quality of translation
Quality of generated chunks

※Scores inside ( ) are reported results, not from our implementations

Data - ASPEC [Nakazawa+ 16], 1.6M En/Ja pairs

Preprocessing - Bunsetsu chunking with J.DepP [Yoshinaga & Kitsuregawa 09]
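As an illustration of this preprocessing step, here is a minimal sketch that groups chunker output into bunsetsu chunks; the CaboCha-style lattice format (chunk headers starting with "*", one token per line, "EOS" ending a sentence) is an assumption of this sketch, and the feature fields are dummies.

def parse_chunked(lines):
    # Group token lines into bunsetsu chunks: a line starting with "*"
    # opens a new chunk, other lines carry one token each, "EOS" ends
    # the sentence.
    chunks, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line == "EOS":
            if current:
                chunks.append(current)
            break
        if line.startswith("*"):
            if current:
                chunks.append(current)
            current = []
        else:
            current.append(line.split("\t")[0])  # keep the surface form only
    return chunks

example = [
    "* 0 1D 0/1 1.0", "だれ\t(features)", "か\t(features)", "が\t(features)",
    "* 1 -1D 0/0 1.0", "犬\t(features)", "に\t(features)",
    "EOS",
]
print(parse_chunked(example))  # [['だれ', 'か', 'が'], ['犬', 'に']]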

Baseline systems
1. Word-based encoder-decoder [Bahdanau+ 15]
2. Tree-based encoder [Eriguchi+ 16] (SOTA)

Results

En: I heard that someone was bitten by a dog, weren’t you injured?
Ja: だれ か が / 犬 に / 噛ま れ た そう だ けれど、/ 君 は / 怪我 し なかっ た ?

A Japanese sentence has
- a longer word sequence (En: 25 vs. Ja: 30 [words/sentence])
- free chunk order (e.g., 「だれかが / 犬に」 = 「犬に / だれかが」)

Enc-Dec with Attention [Bahdanau+ 15] - encodes / decodes “word-by-word”

Two-step decoding - first generate a chunk, then the words inside the chunk

Two additional connections (see the sketch below) 1. to capture the interaction between chunks 2. to memorize previous outputs well
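The following is a minimal runnable sketch of this two-step decoding with GRU cells, with the two extra connections marked in comments; all names, layer sizes, and the greedy loop are assumptions of this sketch, not the authors' implementation.

import torch
import torch.nn as nn

class ChunkBasedDecoder(nn.Module):
    def __init__(self, vocab, d):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.chunk_rnn = nn.GRUCell(d, d)  # step 1: one state per chunk
        self.word_rnn = nn.GRUCell(d, d)   # step 2: words inside the chunk
        self.out = nn.Linear(d, vocab)

    def forward(self, context, bos, eoc, eos, max_chunks=10, max_words=8):
        # Greedy decoding; `context` stands in for the encoder summary.
        d = context.size(-1)
        chunk_state = context.clone()
        words, last_word_state = [], torch.zeros(1, d)
        for _ in range(max_chunks):
            # Connection 2 (word-to-chunk feedback): the final word state
            # of the previous chunk updates the chunk-level state, so
            # previous outputs are memorized.
            chunk_state = self.chunk_rnn(last_word_state, chunk_state)
            # Connection 1 (inter-chunk): the word-level RNN starts from
            # the current chunk state, so chunks interact through it.
            word_state = chunk_state
            tok = torch.tensor([bos])
            for _ in range(max_words):
                word_state = self.word_rnn(self.embed(tok), word_state)
                tok = self.out(word_state).argmax(dim=-1)
                if tok.item() == eoc:      # end-of-chunk: back to step 1
                    break
                words.append(tok.item())
            last_word_state = word_state
            if tok.item() == eos:          # end-of-sentence: stop
                break
        return words

dec = ChunkBasedDecoder(vocab=100, d=32)
print(dec(torch.zeros(1, 32), bos=1, eoc=2, eos=3))

The division of labor is the point of the sketch: the word-level RNN only models the fixed word order inside a chunk, while the chunk-level RNN models the much shorter chunk sequence.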

!"#$%&

!"#$%!&'()"#*+,$-.'/$0+(,"%*-#$+( %$"'&'-#1#'2($#.#,!*+,$0+(,"%*-#$'3!%+1#0$34$%)).4+1-$%.!#*1%!+1-$5'.!%-#$%6!#*$+1!*'02,+1-$%!&'()"#*+,$7#$-%($+1$%$!4)+,%.$0+#.#,!*+,$3%**+#*$0+(,"%*-#$*#%,!'*$8

3#,%2(#$2(#*$')#*%!+'1$+($+&)'*!%1!$6'*$!"#$+0#%$(2))'*! +1$&%!#*+%.$0#5#.')&#1!$9$%1$+1!#*6%,#$6'*$%$(23(!%1,#$')#*%!+'1$%!$%!'&+,$.#5#.$/%($0#5#.')#0$8$

'&(&$&)%&

*"$+,-./&+01.2+.).#34567

82#)9,-./&+0:$":"/&+7

:$:$:$ :$:$ :$ :$

:$ :$ :$

:$ :$$$$$:$

:$:$ :$

:$ :$:$

*$");4%2#)94<"%.=>") ?/#::"$=@4>/4A>//>);

!"#$%"&'()*" +,-. /0+-1

!"#$%&'()*+,-.+*/012 2345 67327!"#$%&'()*+,-.+*/032 4546 78596!"#$%&'()*+,-.+*/082 457: 7;594

<.=+&'()*+ 457> 7;541

8$%"9 +,-. /0+-1

<.=+&'()*+0*$?.+*=@0:;<=>?@AB"%'%"#$%"&'CD&$*$B"%E

8$%"9'F FG374 2737F

-.+*/ 3 17568 6853:

-.+*/08 1954; 685;8

A=**&'()*+0*$?.+*=0BC=DE#?"D@08>F0@0<.=+&'()*+0+*?.+*= ,195:82 ,685>>2

<.=+&'()*+0*$?.+*=&+*?.+*=0BG("+($(#@087F 1>511,195>92

68533,685>;2

1. Some languages use many words to represent one thing while others use fewer

2. Some languages have free word order while others do not

Our decoder decodes sentences in a "chunk-by-chunk" manner to overcome these differences in length and word order

- The sequence to decode becomes much shorter
- Fixed word order (inside chunks) and free chunk order can be modeled independently

[Figure: the proposed chunk-based decoder, decoding だれかが / 犬に chunk by chunk.
Model 1: a chunk-level RNN handles the chunk sequence (label: "Chunk-level RNN (shorter sequences)") and a word-level RNN generates the words inside each chunk (だれ / か / が, end-of-chunk, 犬 / に, ...), updating the chunk state as it goes; fixed word order and free chunk order are modeled independently.
Model 2: adds the inter-chunk connection.
Model 3: adds the word-to-chunk feedback.]