[Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for...


Transcript of [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for...

Page 1: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

MT Study Group

Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

Tomer Levinboim and David Chiang

Proc. of EMNLP 2015, Lisbon, Portugal

Introduced by Akiva Miura, AHC-Lab

15/10/15    ©2015 Akiva Miura, AHC-Lab, IS, NAIST

Page 2: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

Contents  


1. Introduction
2. Preliminaries
3. Supervised Word Translations
4. Experiments
5. Conclusion
6. Impression

Page 3: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

1. Introduction


Problem: Scarceness of Bilingual Data

- PBMT systems require considerable amounts of source-target parallel data to produce good-quality translation
  → A triangulated source-target phrase table can be composed from a source-pivot and a pivot-target phrase table, but it is still noisy

- This paper presents a supervised learning technique that improves noisy phrase translation scores by extracting word translation distributions from small amounts of bilingual data
  → The method gained improvements on Malagasy-to-French and Spanish-to-French translation tasks via English


Page 4: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

2. Preliminaries


Denotation:

- s, p, t: words in the source, pivot, and target languages, respectively
- s, p, t (boldface): phrases in the source, pivot, and target languages, respectively
- T: a phrase table estimated over a parallel corpus
- T̃: a triangulated phrase table
- φ: phrase translation features
- lex: lexical-weighting features
- w: word translation probabilities

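To make the notation concrete, here is a minimal Python sketch of how a phrase-table entry and the associated distributions could be represented; the class and field names are illustrative assumptions, not taken from the paper or from any particular toolkit.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

Phrase = Tuple[str, ...]  # a phrase s or t is a tuple of words

@dataclass
class PhrasePairEntry:
    """One entry (s, t) of a phrase table T (hypothetical layout)."""
    phi: float                    # phrase translation feature phi(t | s)
    lex: float                    # lexical-weighting feature lex(t | s, a)
    align: Set[Tuple[int, int]]   # word alignment a as (source idx, target idx) pairs

# A phrase table T: source phrase -> candidate target phrase -> entry.
PhraseTable = Dict[Phrase, Dict[Phrase, PhrasePairEntry]]

# Word translation probabilities w(t | s): source word -> target word -> probability.
WordTransProbs = Dict[str, Dict[str, float]]
```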

Page 5: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

2.1 Triangulation (weak baseline)


- A source-target phrase table T_st is constructed by combining a source-pivot and a pivot-target phrase table T_sp, T_pt, each estimated on its respective parallel data

- Combining alignments: for each triangulated phrase pair (s, t), an alignment a is computed as the most frequent alignment obtained by combining the source-pivot and pivot-target alignments a_sp and a_pt across all pivot phrases p:
  $\{(s, t) \mid \exists p : (s, p) \in a_{sp} \wedge (p, t) \in a_{pt}\}$

- Word translation scores are approximated by marginalizing over the pivot words:
  $w_{st}(t \mid s) = \sum_{p} w_{sp}(p \mid s) \cdot w_{pt}(t \mid p) \qquad (1)$


- Lexical weighting probability estimation (Koehn et al., 2003): given a triangulated phrase pair (s, t) with alignment a, let $a_{s,:} = \{t \mid (s, t) \in a\}$; then
  $\widetilde{lex}_{st}(\mathbf{t} \mid \mathbf{s}, a) = \prod_{s \in \mathbf{s}} \frac{1}{|a_{s,:}|} \sum_{t \in a_{s,:}} w_{st}(t \mid s) \qquad (2)$

- The triangulated phrase translation scores, denoted $\tilde{\phi}_{st}$, are computed by analogy with Eq. 1

- These scores are also computed in the reverse direction by swapping the source and target languages
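As a concrete illustration of this triangulation step, a minimal Python sketch follows: Eq. 1 marginalizes word translation probabilities over the pivot, alignments are composed through the pivot, and Eq. 2 applies the lexical weighting of Koehn et al. (2003) to a triangulated phrase pair. The function names, the nested-dict layout, and the skipping of unaligned words are assumptions, not the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

def triangulate_word_probs(w_sp: Dict[str, Dict[str, float]],
                           w_pt: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Eq. 1: w_st(t | s) = sum over p of w_sp(p | s) * w_pt(t | p)."""
    w_st: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for s, pivots in w_sp.items():
        for p, prob_sp in pivots.items():
            for t, prob_pt in w_pt.get(p, {}).items():
                w_st[s][t] += prob_sp * prob_pt
    return w_st

def compose_alignment(a_sp: Set[Tuple[int, int]],
                      a_pt: Set[Tuple[int, int]]) -> Set[Tuple[int, int]]:
    """Alignment composition for one phrase pair:
    {(s, t) | there exists p with (s, p) in a_sp and (p, t) in a_pt}."""
    pivot_to_tgt = defaultdict(set)
    for p, t in a_pt:
        pivot_to_tgt[p].add(t)
    return {(s, t) for s, p in a_sp for t in pivot_to_tgt[p]}

def lex_weight(src: Tuple[str, ...], tgt: Tuple[str, ...],
               align: Set[Tuple[int, int]],
               w_st: Dict[str, Dict[str, float]]) -> float:
    """Eq. 2: product over source words of the average w_st over their aligned targets."""
    score = 1.0
    for i, s in enumerate(src):
        aligned = [tgt[j] for (i2, j) in align if i2 == i]
        if not aligned:
            continue  # assumption: unaligned source words skipped (often scored against NULL instead)
        score *= sum(w_st.get(s, {}).get(t, 0.0) for t in aligned) / len(aligned)
    return score
```

In the same spirit, the triangulated phrase translation scores are obtained by marginalizing the phrase-level scores over pivot phrases (the analogue of Eq. 1), and all of the above is repeated with the source and target languages swapped.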

Page 6: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

2.2 Interpolation (strong baseline)


- Given access to source-target data, an ordinary source-target phrase table T_st can be estimated directly


- Wu and Wang (2007) suggest interpolating the entries of phrase pairs that occur in both tables:
  $T_{interp} = \alpha\, T_{st} + (1 - \alpha)\, \tilde{T}_{st} \qquad (3)$

- Phrase pairs appearing in only one phrase table are added as-is
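A small sketch of this interpolation baseline (Eq. 3), assuming each table is a dict from a (source phrase, target phrase) pair to a dict of feature values; the feature-wise mixing and all names are illustrative, and α is whatever weight the experimenter chooses.

```python
from typing import Dict, Tuple

Pair = Tuple[Tuple[str, ...], Tuple[str, ...]]    # (source phrase, target phrase)
Features = Dict[str, float]                       # feature name -> value

def interpolate_tables(t_direct: Dict[Pair, Features],
                       t_triang: Dict[Pair, Features],
                       alpha: float) -> Dict[Pair, Features]:
    """Eq. 3: T_interp = alpha * T_st + (1 - alpha) * T~_st, mixed feature by feature."""
    interp: Dict[Pair, Features] = {}
    for pair in set(t_direct) | set(t_triang):
        if pair in t_direct and pair in t_triang:
            d, tr = t_direct[pair], t_triang[pair]
            interp[pair] = {f: alpha * d.get(f, 0.0) + (1.0 - alpha) * tr.get(f, 0.0)
                            for f in set(d) | set(tr)}
        else:
            # Phrase pairs appearing in only one phrase table are added as-is.
            source_table = t_direct if pair in t_direct else t_triang
            interp[pair] = dict(source_table[pair])
    return interp
```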

Page 7: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3. Supervised Word Translation


- The effect of interpolation (Eq. 3) is limited to phrase pairs appearing in both phrase tables

- The idea of this paper is to regard word translation distributions derived from source-target bilingual data (through word alignments or dictionary entries) as the correct translation distributions, and to use them to learn discriminatively:
  • correct target words should become likely translations
  • incorrect ones should be down-weighted

→ To generalize beyond the vocabulary of the source-target data, the authors appeal to word embeddings


Page 8: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3.1  Model  


Defining:  



- $c^{\text{sup}}_{st}$: the number of times source word $s$ was aligned to target word $t$ (in word alignment, or in the dictionary)


- $w^{\text{sup}}(t \mid s) = c^{\text{sup}}_{st} / c^{\text{sup}}_{s}$: the word translation distribution, where $c^{\text{sup}}_{s} = \sum_t c^{\text{sup}}_{st}$



- $q(t \mid s)$: the word translation probabilities we wish to learn

l We consider maximizing the log-likelihood function:
$$\arg\max_q L(q) = \arg\max_q \sum_{(s,t)} c^{\text{sup}}_{st} \log q(t \mid s)$$


l Clearly, the solution $q(\cdot \mid s) := w^{\text{sup}}(\cdot \mid s)$ maximizes $L$


Ø However, we would like a solution that generalizes to source words $s$ beyond those observed in the source-target corpus
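A small sketch of how the supervision could be assembled (the data layout and toy counts are assumptions for illustration): derive $w^{\text{sup}}(t \mid s)$ from alignment counts and evaluate the log-likelihood $L(q)$, which is maximized when $q = w^{\text{sup}}$.

```python
import math
from collections import defaultdict

def make_wsup(counts):
    """counts[(s, t)] = c_sup_st; returns w_sup[s][t] = c_sup_st / c_sup_s."""
    totals = defaultdict(float)
    for (s, _t), c in counts.items():
        totals[s] += c
    wsup = defaultdict(dict)
    for (s, t), c in counts.items():
        wsup[s][t] = c / totals[s]
    return wsup

def log_likelihood(counts, q):
    """L(q) = sum over observed (s, t) of c_sup_st * log q(t | s)."""
    return sum(c * math.log(q[s][t]) for (s, t), c in counts.items())

# toy supervision extracted from word alignments or a dictionary
counts = {("maison", "house"): 3.0, ("maison", "home"): 1.0, ("chat", "cat"): 2.0}
wsup = make_wsup(counts)
print(wsup["maison"])                 # {'house': 0.75, 'home': 0.25}
print(log_likelihood(counts, wsup))   # no other q scores higher on these counts
```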

Page 9: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3.1  Model  (cont’d)  


l In order to generalize, we abstract from words to vector representations of words
Ø We constrain $q$ to the following parameterization:
$$q(t \mid s) = \frac{1}{Z_s} \exp\!\left(v_s^{T} A\, v_t + f_{st}^{T} h\right), \qquad Z_s = \sum_{t \in T(s)} \exp\!\left(v_s^{T} A\, v_t + f_{st}^{T} h\right)$$




- $v_s$, $v_t$: vectors of monolingual features (word embeddings)


- $f_{st}$: a vector of bilingual features (triangulated scores)


- $A$, $h$: parameters to be learned
l For normalization: $T(s) = \{t \mid w^{\text{sup}}(t \mid s) > 0 \ \vee\ \tilde{w}(t \mid s) > 0\}$

Therefore, the matrix $A$ is a linear transformation between the source and target embedding spaces, and $h$ (now a scalar) quantifies how the triangulated scores $\tilde{w}$ are to be trusted.

In the normalization factor $Z_s$, we let $t$ range only over possible translations of $s$ suggested by either $w^{\text{sup}}$ or the triangulated word probabilities. That is:

$$T(s) = \{t \mid w^{\text{sup}}(t \mid s) > 0 \ \vee\ \tilde{w}(t \mid s) > 0\}.$$

This restriction makes efficient computation possible, as otherwise the normalization term would have to be computed over the entire target vocabulary.

Under this parameterization, our goal is to solve the following maximization problem:

$$\max_{A,h} L(A, h) = \max_{A,h} \sum_{s,t} c^{\text{sup}}_{st} \log q(t \mid s). \qquad (4)$$

3.2 Optimization

The objective function in Eq. 4 is concave in both $A$ and $h$. This is because after taking the log, we are left with a weighted sum of linear and concave (negative log-sum-exp) terms in $A$ and $h$. We can therefore reach the global solution of the problem using gradient descent.

Taking derivatives, the gradient is

$$\frac{\partial L}{\partial A} = \sum_{s,t} m_{st}\, v_s v_t^{T} \qquad \frac{\partial L}{\partial h} = \sum_{s,t} m_{st}\, f_{st}$$

where the scalar $m_{st} = c^{\text{sup}}_{st} - c^{\text{sup}}_{s}\, q(t \mid s)$ for the current value of $q$.

For quick results, we limited the number of gradient steps to 200 and selected the iteration that minimized the total variation distance to $w^{\text{sup}}$ over a held-out dev set:

$$\sum_{s} \left\| q(\cdot \mid s) - w^{\text{sup}}(\cdot \mid s) \right\|_1. \qquad (5)$$

We obtained a better convergence rate by using a batch version of the effective and easy-to-implement Adagrad technique (Duchi et al., 2011). See Figure 1.

3.3 Re-estimating lexical weights

Having learned the model ($A$ and $h$), we can now use $q(t \mid s)$ to estimate the lexical weights (Eq. 2) of any aligned phrase pair $(s, t, a)$, assuming it is composed of embeddable words.

Figure 1: The (target-to-source) objective function per iteration. Applying batch Adagrad (blue) significantly accelerates convergence.

However, we found the supervised word translation scores $q$ to be too sharp, sometimes assigning all probability mass to a single target word. We therefore interpolated $q$ with the triangulated word translation scores $\tilde{w}$:

$$q_\lambda = \lambda q + (1 - \lambda)\, \tilde{w}. \qquad (6)$$

To integrate the lexical weights induced by $q_\lambda$ (Eq. 2), we simply appended them as new features in the phrase table in addition to the existing lexical weights. Following this, we can search for a $\lambda$ value that maximizes BLEU on a tuning set.

3.4 Summary of method

In summary, to improve upon a triangulated or interpolated phrase table, we:

1. Learn word translation distributions $q$ by supervision against distributions $w^{\text{sup}}$ derived from the source-target bilingual data (§3.1).

2. Smooth the learned distributions $q$ by interpolating with triangulated word translation scores $\tilde{w}$ (§3.3).

3. Compute new lexical weights and append them to the phrase table (§3.3).

4 Experiments

To test our method, we conducted two low-resource translation experiments using the phrase-based MT system Moses (Koehn et al., 2007).
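The following numpy sketch (assumed data layout, not the released implementation) runs batch gradient ascent on Eq. 4 with an Adagrad-style step size, using the gradients above with $m_{st} = c^{\text{sup}}_{st} - c^{\text{sup}}_{s}\, q(t \mid s)$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train_adagrad(data, d_s, d_t, steps=200, eta=0.1, eps=1e-8):
    """Batch gradient ascent on Eq. 4 with Adagrad step sizes.

    data: list of per-source-word items, each a dict with
        'v_s' : (d_s,)   source embedding
        'V_t' : (n, d_t) embeddings of the candidate set T(s)
        'f'   : (n,)     triangulated scores (the bilingual feature f_st)
        'c'   : (n,)     supervised counts c_sup_st for those candidates
    """
    A, h = np.zeros((d_s, d_t)), 0.0
    G_A, G_h = np.zeros_like(A), 0.0              # accumulated squared gradients
    for _ in range(steps):
        grad_A, grad_h = np.zeros_like(A), 0.0
        for item in data:
            v_s, V_t, f, c = item['v_s'], item['V_t'], item['f'], item['c']
            q = softmax(V_t @ (A.T @ v_s) + h * f)
            m = c - c.sum() * q                   # m_st = c_sup_st - c_sup_s q(t|s)
            grad_A += np.outer(v_s, m @ V_t)      # sum_t m_st v_s v_t^T
            grad_h += m @ f                       # sum_t m_st f_st
        G_A += grad_A ** 2
        G_h += grad_h ** 2
        A += eta * grad_A / (np.sqrt(G_A) + eps)  # ascent: maximize L(A, h)
        h += eta * grad_h / (np.sqrt(G_h) + eps)
    return A, h
```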

Ø Under this parameterization, our goal is to solve the following maximization problem:
$$\max_{A,h} L(A, h) = \max_{A,h} \sum_{s,t} c^{\text{sup}}_{st} \log q(t \mid s) \qquad (4)$$


Page 10: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3.2 Optimization


l The objective function in Eq. 4 is concave in both $A$ and $h$
Ø We can reach the global solution of the problem using gradient descent

l Taking derivatives, the gradient is:
$$\frac{\partial L}{\partial A} = \sum_{s,t} m_{st}\, v_s v_t^{T}, \qquad \frac{\partial L}{\partial h} = \sum_{s,t} m_{st}\, f_{st}, \qquad \text{where } m_{st} = c^{\text{sup}}_{st} - c^{\text{sup}}_{s}\, q(t \mid s)$$




l  For quick results, the authors limited the number of gradient steps to 200 and selected the iteration that minimized the total variation distance to wsup over a held-out dev set: Σs ||q(· | s) − wsup(· | s)||1  (Eq. 5)
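Below is a minimal sketch (not the authors' implementation) of this training loop, assuming the parameterization from §3.1, q(t | s) ∝ exp(vsᵀ A vt + h · w(t | s)) normalized over the candidate set T(s), and the closed-form gradients from §3.2 (∂L/∂A = Σ mst vs vtᵀ, ∂L/∂h = Σ mst fst with fst := wst). The data containers, learning rate, and dev-set handling are invented for the example.

import numpy as np

def q_row(s, A, h, V_s, V_t, w_tri, cand):
    # q(t | s) proportional to exp(v_s^T A v_t + h * w(t | s)), normalized over candidates T(s)
    ts = cand[s]
    scores = np.array([V_s[s] @ A @ V_t[t] + h * w_tri.get((s, t), 0.0) for t in ts])
    p = np.exp(scores - scores.max())
    return ts, p / p.sum()

def train(V_s, V_t, w_tri, c_sup, cand, w_sup_dev, steps=200, eta=0.5):
    # Batch gradient ascent with Adagrad on (A, h) for the concave objective
    # L(A, h) = sum_{s,t} c_sup[s,t] * log q(t | s)   (Eq. 4),
    # keeping the iterate closest to w_sup on the held-out dev words (Eq. 5).
    # Assumes every dev word in w_sup_dev also has a candidate set in cand.
    d = next(iter(V_s.values())).shape[0]
    A, h = np.zeros((d, d)), 0.0
    GA, Gh = np.zeros((d, d)), 0.0                      # Adagrad accumulators
    best = (float("inf"), A.copy(), h)
    for _ in range(steps):
        gA, gh = np.zeros((d, d)), 0.0
        for s in cand:
            ts, p = q_row(s, A, h, V_s, V_t, w_tri, cand)
            c_s = sum(c_sup.get((s, t), 0.0) for t in ts)
            for t, qt in zip(ts, p):
                m = c_sup.get((s, t), 0.0) - c_s * qt   # m_st
                gA += m * np.outer(V_s[s], V_t[t])
                gh += m * w_tri.get((s, t), 0.0)        # f_st := w_st
        GA += gA ** 2
        Gh += gh ** 2
        A = A + eta * gA / (np.sqrt(GA) + 1e-8)         # per-coordinate Adagrad step
        h = h + eta * gh / (np.sqrt(Gh) + 1e-8)
        tv = 0.0                                        # Eq. 5, restricted to candidates for brevity
        for s, ref in w_sup_dev.items():
            ts, p = q_row(s, A, h, V_s, V_t, w_tri, cand)
            tv += sum(abs(qt - ref.get(t, 0.0)) for t, qt in zip(ts, p))
        if tv < best[0]:
            best = (tv, A.copy(), h)
    return best[1], best[2]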


Page 11: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3.2 Optimization (cont'd)

15/10/15 11 2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST


Page 12: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3.3 Re-estimating lexical weights

15/10/15 12

l  Having learned the model (A and h), we can now use q(t | s) to estimate the lexical weights (Eq. 2) of any aligned phrase pair (s, t, a), assuming it is composed of embeddable words

l  However, the authors found the supervised word translation scores q to be too sharp, sometimes assigning all probability mass to a single target word

Ø  They therefore interpolated q with the triangulated word translation scores w (Eq. 6; a small sketch follows below):  qβ = βq + (1 − β)w

•  To  integrate  the  lexical  weights  induced  by  qβ  (Eq.  2),  they  simply  appended  them  as  new  features  in  the  phrase  table  
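A small sketch of these two steps, assuming word translation scores stored as Python dicts keyed by (source, target) pairs and a Moses-style phrase-table line with '|||'-separated fields; the field layout and score formatting are assumptions, not the authors' exact pipeline.

def interpolate(q, w_tri, beta=0.95):
    # q_beta(t | s) = beta * q(t | s) + (1 - beta) * w(t | s)   (Eq. 6)
    keys = set(q) | set(w_tri)
    return {k: beta * q.get(k, 0.0) + (1 - beta) * w_tri.get(k, 0.0) for k in keys}

def append_lex_features(line, lex_st, lex_ts):
    # Append two new lexical-weight features to one phrase-table entry, assuming
    # the common Moses layout: src ||| tgt ||| scores ||| alignment ||| counts
    fields = line.rstrip("\n").split(" ||| ")
    fields[2] = f"{fields[2]} {lex_st:.6g} {lex_ts:.6g}"
    return " ||| ".join(fields) + "\n"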

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST


Page 13: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3.4  Summary  of  method  

15/10/15 13

In summary, to improve upon a triangulated or interpolated phrase table, the authors:

1.  Learn word translation distributions q by supervision against distributions wsup derived from the source-target bilingual data (§3.1)

2.  Smooth the learned distributions q by interpolating with triangulated word translation scores w (§3.3)

3.  Compute new lexical weights and append them to the phrase table (§3.3)

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST


Page 14: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

4.  Experiments  

15/10/15 14

l  To test the proposed method, the authors conducted two low-resource translation experiments using Moses (Koehn et al., 2007)

Translation Tasks:
l  Fixing the pivot language to English, they applied their method on two data scenarios:

1.  Spanish-to-French: two related languages used to simulate a low-resource setting. The baseline is phrase table interpolation (Eq. 3)

2.  Malagasy-to-French: two unrelated languages for which they have a small dictionary, but no parallel corpus. The baseline is triangulation alone.

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST

Page 15: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

4.1  Data  

15/10/15 15

Datasets:
•  European-language bitexts were extracted from Europarl
•  For Malagasy-English, the Global Voices parallel data available online
•  The Malagasy-French dictionary was taken from online resources, and small Malagasy-French tune/test sets from Global Voices

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST

4.1 Data

Fixing the pivot language to English, we applied our method on two data scenarios:

1. Spanish-to-French: two related languages used to simulate a low-resource setting. The baseline is phrase table interpolation (Eq. 3).

2. Malagasy-to-French: two unrelated languages for which we have a small dictionary, but no parallel corpus (aside from tuning and testing data). The baseline is triangulation alone (there is no source-target model to interpolate with).

Table 1 lists some statistics of the bilingual data we used. European-language bitexts were extracted from Europarl (Koehn, 2005). For Malagasy-English, we used the Global Voices parallel data available online [1]. The Malagasy-French dictionary was extracted from online resources [2] and the small Malagasy-French tune/test sets were extracted [3] from Global Voices.

language pair   train   tune   test
sp-fr           4k      1.5k   1.5k
mg-fr           1.1k    1.2k   1.2k
sp-en           50k     –      –
mg-en           100k    –      –
en-fr           50k     –      –

Table 1: Bilingual datasets (lines of data). Legend: sp=Spanish, fr=French, en=English, mg=Malagasy.

Table 2 lists token statistics of the monolingual data used. We used word2vec [4] to generate French, Spanish and Malagasy word embeddings. The French and Spanish embeddings were (independently) estimated over their combined tokenized and lowercased Gigaword [5] and Leipzig news corpora [6]. The Malagasy embeddings were similarly estimated over data from Global Voices [7], the Malagasy Wikipedia and the Malagasy Common Crawl [8]. In addition, we estimated a 5-gram French language model over the French monolingual data.

[1] http://www.ark.cs.cmu.edu/global-voices
[2] http://motmalgache.org/bins/homePage
[3] https://github.com/vchahun/gv-crawl
[4] https://radimrehurek.com/gensim/models/word2vec.html
[5] http://catalog.ldc.upenn.edu
[6] http://corpora.uni-leipzig.de/download.html
[7] http://www.isi.edu/~qdou/downloads.html
[8] https://commoncrawl.org/the-data/

language   words
French     1.5G
Spanish    1.4G
Malagasy   58M

Table 2: Size of monolingual corpus per language as measured in number of tokens.

4.2 Spanish-French Results

To produce wsup, we aligned the small Spanish-French parallel corpus in both directions, and symmetrized using the intersection heuristic. This was done to obtain high precision alignments (the often-used grow-diag-final-and heuristic is optimized for phrase extraction, not precision).

We used the skip-gram model to estimate the Spanish and French word embeddings and set the dimension to d = 200 and context window to w = 5 (default). Subsequently, to run our method, we filtered out source and target words that either did not appear in the triangulation, or did not have an embedding. We took words that appeared more than 10 times in the parallel corpus for the training set (~690 words), and between 5-9 times for the held-out dev set (~530 words). This was done in both source-target and target-source directions.
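As an aside, a small sketch of this frequency-based split (thresholds as reported above; tokenization and corpus handling are simplified):

from collections import Counter

def split_train_dev_words(tokenized_sentences):
    # words seen more than 10 times -> training words for the supervision;
    # words seen 5-9 times -> held-out dev words used for iterate selection
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    train_words = {w for w, c in counts.items() if c > 10}
    dev_words = {w for w, c in counts.items() if 5 <= c <= 9}
    return train_words, dev_words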

In Table 3 we show that the distributions learned by our method are much better approximations of wsup compared to those obtained by triangulation.

Method          source→target   target→source
triangulation   71.6%           72.0%
our scores      30.2%           33.8%

Table 3: Average total variation distance (Eq. 5) to the dev set portion of wsup (computed only over words whose translations in wsup appear in the triangulation). Using word embeddings, our method is able to better generalize on the dev set.

We then examined the effect of appending our supervised lexical weights. We fixed the word-level interpolation β := 0.95 (effectively assigning very little mass to triangulated word translations w) and searched for α ∈ {0.9, 0.8, 0.7, 0.6} in Eq. 3 to maximize BLEU on the tuning set.

Our MT results are reported in Table 4. While interpolation improves over triangulation alone by +0.8 BLEU, our method adds another +0.7 BLEU on top of interpolation, a statistically significant gain (p < 0.01) according to a bootstrap resampling significance test (Koehn, 2004).



Page 16: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

4.2  Spanish-­‐French  Results  

15/10/15 16

l  To produce wsup, the authors aligned the small Spanish-French parallel corpus in both directions, and symmetrized using the intersection heuristic to obtain high-precision alignments (not grow-diag-final-and)

l  To train the skip-gram model, they set the dimension to d = 200 and the context window to w = 5 (a training sketch follows below)

l  They took words that appeared more than 10 times in the parallel corpus for the training set (~690 words), and 5-9 times for the held-out dev set (~530 words)

l  They fixed β := 0.95 (assigning very little mass to the triangulated word translations) to examine the effect of their supervised method
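The slide above reports the word2vec settings; here is a sketch of how such skip-gram embeddings could be trained with gensim (the corpus path and preprocessing are hypothetical, and the argument names follow gensim 4.x):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# skip-gram embeddings with the reported Spanish/French settings:
# dimension d = 200, context window w = 5 (for Malagasy, d = 100 and w = 3)
sentences = LineSentence("monolingual.tok.lc.txt")  # hypothetical tokenized, lowercased corpus
model = Word2Vec(sentences=sentences, sg=1, vector_size=200, window=5, min_count=5, workers=4)
model.wv.save_word2vec_format("embeddings.txt")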

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST


Method                     α     tune   test
source-target              –     26.8   25.3
triangulation              –     29.2   28.4
interpolation              0.7   30.2   29.2
interpolation+our scores   0.6   30.8   29.9

Table 4: Spanish-French BLEU scores. Appending lexical weights obtained by supervision over a small source-target corpus significantly outperforms phrase table interpolation (Eq. 3) by +0.7 BLEU.

4.3 Malagasy-French Results

For Malagasy-French, the wsup distributions used for supervision were taken to be uniform distributions over the dictionary translations. For each training direction, we used a 70%/30% split of the dictionary to form the train and dev sets.

Having significantly less Malagasy monolingual data, we used d = 100 dimensional embeddings and a w = 3 context window to estimate both the Malagasy and French word embeddings.

As before, we added our supervised lexical weights as new features in the phrase table. However, instead of fixing β = 0.95 as above, we searched for β ∈ {0.9, 0.8, 0.7, 0.6} in Eq. 6 to maximize BLEU on a small tune set. We report our results in Table 5. Using only a dictionary, we are able to improve over triangulation by +0.5 BLEU, a statistically significant difference (p < 0.01).

Method                     β     tune   test
triangulation              –     12.2   11.1
triangulation+our scores   0.6   12.4   11.6

Table 5: Malagasy-French BLEU. Supervision with a dictionary significantly improves upon simple triangulation by +0.5 BLEU.

5 Conclusion

In this paper, we argued that constructing a triangulated phrase table independently from even very limited source-target data (a small dictionary or parallel corpus) underutilizes that parallel data.

Following this argument, we designed a supervised learning algorithm that relies on word translation distributions derived from the parallel data as well as a distributed representation of words (embeddings). The latter enables our algorithm to assign translation probabilities to word pairs that do not appear in the source-target bilingual data.

We then used our model to generate new lexical weights for phrase pairs appearing in a triangulated or interpolated phrase table and demonstrated improvements in MT quality on two tasks. This is despite the fact that the distributions (wsup) we fit our model to were estimated automatically, or even naïvely as uniform distributions.

Acknowledgements

The authors would like to thank Daniel Marcu and Kevin Knight for initial discussions and a supportive research environment at ISI, as well as the anonymous reviewers for their helpful comments. This research was supported in part by a Google Faculty Research Award to Chiang.

References

Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proc. ACL, pages 728–735.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Machine Learning Research, 12:2121–2159, July.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. NAACL HLT, pages 48–54.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. ACL, Interactive Poster and Demonstration Sessions, pages 177–180.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proc. EMNLP, pages 388–395.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proc. MT Summit, pages 79–86.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proc. ICLR, Workshop Track.

Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In Proc. HLT-NAACL, pages 484–491.

Hua Wu and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. In Proc. ACL, pages 856–863.


Page 17: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

4.3  Malagasy-­‐French  Results  

15/10/15 17

l  The wsup distributions used for supervision were taken to be uniform distributions over the dictionary translations (a small sketch follows below)
•  For each training direction, they used a 70%/30% split of the dictionary to form the train and dev sets

l  To train the skip-gram model, they used d = 100 and w = 3 (having much less Malagasy monolingual data)
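A minimal sketch of building the uniform wsup from a bilingual dictionary and splitting its entries 70%/30% into train and dev portions; the dictionary format and the random split are illustrative assumptions.

import random

def uniform_wsup(dictionary):
    # dictionary maps a source word to its list of dictionary translations;
    # each listed translation receives equal probability mass
    return {s: {t: 1.0 / len(ts) for t in ts} for s, ts in dictionary.items() if ts}

def split_dictionary(dictionary, train_frac=0.7, seed=0):
    # 70%/30% split of the dictionary entries into train and dev sets
    words = sorted(dictionary)
    random.Random(seed).shuffle(words)
    cut = int(train_frac * len(words))
    return ({w: dictionary[w] for w in words[:cut]},
            {w: dictionary[w] for w in words[cut:]})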

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST


Page 18: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

5.  Conclusion  

15/10/15 18

In this paper:
l  The authors argued that constructing a triangulated phrase table independently from even very limited source-target data underutilizes that parallel data

Ø  They designed a supervised learning algorithm that relies on word translation distributions derived from the parallel data as well as a distributed representation of words (embeddings)

Ø  The latter enables their algorithm to assign translation probabilities to word pairs that do not appear in the source-target bilingual data

l  The model with the newly generated lexical weights demonstrates improvements in MT quality on two tasks, despite the fact that wsup was estimated automatically or even naïvely as uniform distributions

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST

Page 19: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

6.  Impression  

15/10/15 19 2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST

Page 20: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

End  Slide  

15/10/15 20 2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST