[Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for...


Transcript of [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for...

Page 1: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

MT Study Group

Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

Tomer Levinboim and David Chiang

Proc. of EMNLP 2015, Lisbon, Portugal

Introduced by Akiva Miura, AHC-Lab

15/10/15    ©2015 Akiva Miura, AHC-Lab, IS, NAIST

Page 2: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

Contents  


1. Introduction
2. Preliminaries
3. Supervised Word Translations
4. Experiments
5. Conclusion
6. Impression

Page 3: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

1. Introduction


Problem: Scarceness of Bilingual Data

- PBMT systems require considerable amounts of source-target parallel data to produce good-quality translation
  → A triangulated source-target phrase table can be composed from a source-pivot and a pivot-target phrase table, but it is still noisy

- This paper presents a supervised learning technique that improves noisy phrase translation scores by extracting word translation distributions from small amounts of bilingual data
  → The method gained improvements on Malagasy-to-French and Spanish-to-French translation tasks via English


Page 4: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

2. Preliminaries


Denotation:

- s, p, t: words in the source, pivot, and target languages, respectively
- s, p, t (boldface): phrases in the source, pivot, and target languages, respectively
- T: a phrase table estimated over a parallel corpus
- T̃: a triangulated phrase table
- φ: phrase translation features
- lex: lexical-weighting features
- w: word translation probabilities

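To make the notation concrete, here is a minimal Python sketch of how a phrase-table entry and the associated distributions could be represented; the class and field names are illustrative assumptions, not taken from the paper or from any particular toolkit.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

Phrase = Tuple[str, ...]  # a phrase s or t is a tuple of words

@dataclass
class PhrasePairEntry:
    """One entry (s, t) of a phrase table T (hypothetical layout)."""
    phi: float                    # phrase translation feature phi(t | s)
    lex: float                    # lexical-weighting feature lex(t | s, a)
    align: Set[Tuple[int, int]]   # word alignment a as (source idx, target idx) pairs

# A phrase table T: source phrase -> candidate target phrase -> entry.
PhraseTable = Dict[Phrase, Dict[Phrase, PhrasePairEntry]]

# Word translation probabilities w(t | s): source word -> target word -> probability.
WordTransProbs = Dict[str, Dict[str, float]]
```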

Page 5: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

2.1 Triangulation (weak baseline)


- A source-target phrase table T_st is constructed by combining a source-pivot and a pivot-target phrase table T_sp, T_pt, each estimated on its respective parallel data

- Combining alignments: for each triangulated phrase pair (s, t), an alignment a is computed as the most frequent alignment obtained by combining the source-pivot and pivot-target alignments a_sp and a_pt across all pivot phrases p:
  $\{(s, t) \mid \exists p : (s, p) \in a_{sp} \wedge (p, t) \in a_{pt}\}$

- Word translation scores are approximated by marginalizing over the pivot words:
  $w_{st}(t \mid s) = \sum_{p} w_{sp}(p \mid s) \cdot w_{pt}(t \mid p) \qquad (1)$


- Lexical weighting probability estimation (Koehn et al., 2003): given a triangulated phrase pair (s, t) with alignment a, let $a_{s,:} = \{t \mid (s, t) \in a\}$; then
  $\widetilde{lex}_{st}(\mathbf{t} \mid \mathbf{s}, a) = \prod_{s \in \mathbf{s}} \frac{1}{|a_{s,:}|} \sum_{t \in a_{s,:}} w_{st}(t \mid s) \qquad (2)$

- The triangulated phrase translation scores, denoted $\tilde{\phi}_{st}$, are computed by analogy with Eq. 1

- These scores are also computed in the reverse direction by swapping the source and target languages
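As a concrete illustration of this triangulation step, a minimal Python sketch follows: Eq. 1 marginalizes word translation probabilities over the pivot, alignments are composed through the pivot, and Eq. 2 applies the lexical weighting of Koehn et al. (2003) to a triangulated phrase pair. The function names, the nested-dict layout, and the skipping of unaligned words are assumptions, not the authors' implementation.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

def triangulate_word_probs(w_sp: Dict[str, Dict[str, float]],
                           w_pt: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Eq. 1: w_st(t | s) = sum over p of w_sp(p | s) * w_pt(t | p)."""
    w_st: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for s, pivots in w_sp.items():
        for p, prob_sp in pivots.items():
            for t, prob_pt in w_pt.get(p, {}).items():
                w_st[s][t] += prob_sp * prob_pt
    return w_st

def compose_alignment(a_sp: Set[Tuple[int, int]],
                      a_pt: Set[Tuple[int, int]]) -> Set[Tuple[int, int]]:
    """Alignment composition for one phrase pair:
    {(s, t) | there exists p with (s, p) in a_sp and (p, t) in a_pt}."""
    pivot_to_tgt = defaultdict(set)
    for p, t in a_pt:
        pivot_to_tgt[p].add(t)
    return {(s, t) for s, p in a_sp for t in pivot_to_tgt[p]}

def lex_weight(src: Tuple[str, ...], tgt: Tuple[str, ...],
               align: Set[Tuple[int, int]],
               w_st: Dict[str, Dict[str, float]]) -> float:
    """Eq. 2: product over source words of the average w_st over their aligned targets."""
    score = 1.0
    for i, s in enumerate(src):
        aligned = [tgt[j] for (i2, j) in align if i2 == i]
        if not aligned:
            continue  # assumption: unaligned source words skipped (often scored against NULL instead)
        score *= sum(w_st.get(s, {}).get(t, 0.0) for t in aligned) / len(aligned)
    return score
```

In the same spirit, the triangulated phrase translation scores are obtained by marginalizing the phrase-level scores over pivot phrases (the analogue of Eq. 1), and all of the above is repeated with the source and target languages swapped.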

Page 6: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

2.2 Interpolation (strong baseline)


- Given access to source-target data, an ordinary source-target phrase table T_st can be estimated directly


- Wu and Wang (2007) suggest interpolating the entries of phrase pairs that occur in both tables:
  $T_{interp} = \alpha\, T_{st} + (1 - \alpha)\, \tilde{T}_{st} \qquad (3)$

- Phrase pairs appearing in only one phrase table are added as-is
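A small sketch of this interpolation baseline (Eq. 3), assuming each table is a dict from a (source phrase, target phrase) pair to a dict of feature values; the feature-wise mixing and all names are illustrative, and α is whatever weight the experimenter chooses.

```python
from typing import Dict, Tuple

Pair = Tuple[Tuple[str, ...], Tuple[str, ...]]    # (source phrase, target phrase)
Features = Dict[str, float]                       # feature name -> value

def interpolate_tables(t_direct: Dict[Pair, Features],
                       t_triang: Dict[Pair, Features],
                       alpha: float) -> Dict[Pair, Features]:
    """Eq. 3: T_interp = alpha * T_st + (1 - alpha) * T~_st, mixed feature by feature."""
    interp: Dict[Pair, Features] = {}
    for pair in set(t_direct) | set(t_triang):
        if pair in t_direct and pair in t_triang:
            d, tr = t_direct[pair], t_triang[pair]
            interp[pair] = {f: alpha * d.get(f, 0.0) + (1.0 - alpha) * tr.get(f, 0.0)
                            for f in set(d) | set(tr)}
        else:
            # Phrase pairs appearing in only one phrase table are added as-is.
            source_table = t_direct if pair in t_direct else t_triang
            interp[pair] = dict(source_table[pair])
    return interp
```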

Page 7: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3. Supervised Word Translation


- The effect of interpolation (Eq. 3) is limited to phrase pairs appearing in both phrase tables

- The idea of this paper is to regard word translation distributions derived from source-target bilingual data (through word alignments or dictionary entries) as the correct translation distributions, and to use them to learn discriminatively:
  • correct target words should become likely translations
  • incorrect ones should be down-weighted

→ To generalize beyond the vocabulary of the source-target data, the authors appeal to word embeddings


Page 8: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3.1  Model  


Defining:  



- $c^{\text{sup}}_{st}$: the number of times source word $s$ was aligned to target word $t$ (in word alignment, or in the dictionary)


- $w^{\text{sup}}(t \mid s) = c^{\text{sup}}_{st} / c^{\text{sup}}_{s}$: the word translation distribution, where $c^{\text{sup}}_{s} = \sum_t c^{\text{sup}}_{st}$



- $q(t \mid s)$: the word translation probabilities we wish to learn

l We consider maximizing the log-likelihood function:
$$\arg\max_q L(q) = \arg\max_q \sum_{(s,t)} c^{\text{sup}}_{st} \log q(t \mid s)$$


l Clearly, the solution $q(\cdot \mid s) := w^{\text{sup}}(\cdot \mid s)$ maximizes $L$


Ø However, we would like a solution that generalizes to source words $s$ beyond those observed in the source-target corpus
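A small sketch of how the supervision could be assembled (the data layout and toy counts are assumptions for illustration): derive $w^{\text{sup}}(t \mid s)$ from alignment counts and evaluate the log-likelihood $L(q)$, which is maximized when $q = w^{\text{sup}}$.

```python
import math
from collections import defaultdict

def make_wsup(counts):
    """counts[(s, t)] = c_sup_st; returns w_sup[s][t] = c_sup_st / c_sup_s."""
    totals = defaultdict(float)
    for (s, _t), c in counts.items():
        totals[s] += c
    wsup = defaultdict(dict)
    for (s, t), c in counts.items():
        wsup[s][t] = c / totals[s]
    return wsup

def log_likelihood(counts, q):
    """L(q) = sum over observed (s, t) of c_sup_st * log q(t | s)."""
    return sum(c * math.log(q[s][t]) for (s, t), c in counts.items())

# toy supervision extracted from word alignments or a dictionary
counts = {("maison", "house"): 3.0, ("maison", "home"): 1.0, ("chat", "cat"): 2.0}
wsup = make_wsup(counts)
print(wsup["maison"])                 # {'house': 0.75, 'home': 0.25}
print(log_likelihood(counts, wsup))   # no other q scores higher on these counts
```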

Page 9: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3.1  Model  (cont’d)  


l In order to generalize, we abstract from words to vector representations of words
Ø We constrain $q$ to the following parameterization:
$$q(t \mid s) = \frac{1}{Z_s} \exp\!\left(v_s^{T} A\, v_t + f_{st}^{T} h\right), \qquad Z_s = \sum_{t \in T(s)} \exp\!\left(v_s^{T} A\, v_t + f_{st}^{T} h\right)$$




- $v_s$, $v_t$: vectors of monolingual features (word embeddings)


- $f_{st}$: a vector of bilingual features (triangulated scores)


- $A$, $h$: parameters to be learned
l For normalization: $T(s) = \{t \mid w^{\text{sup}}(t \mid s) > 0 \ \vee\ \tilde{w}(t \mid s) > 0\}$

Therefore, the matrix $A$ is a linear transformation between the source and target embedding spaces, and $h$ (now a scalar) quantifies how the triangulated scores $\tilde{w}$ are to be trusted.

In the normalization factor $Z_s$, we let $t$ range only over possible translations of $s$ suggested by either $w^{\text{sup}}$ or the triangulated word probabilities. That is:

$$T(s) = \{t \mid w^{\text{sup}}(t \mid s) > 0 \ \vee\ \tilde{w}(t \mid s) > 0\}.$$

This restriction makes efficient computation possible, as otherwise the normalization term would have to be computed over the entire target vocabulary.

Under this parameterization, our goal is to solve the following maximization problem:

$$\max_{A,h} L(A, h) = \max_{A,h} \sum_{s,t} c^{\text{sup}}_{st} \log q(t \mid s). \qquad (4)$$

3.2 Optimization

The objective function in Eq. 4 is concave in both $A$ and $h$. This is because after taking the log, we are left with a weighted sum of linear and concave (negative log-sum-exp) terms in $A$ and $h$. We can therefore reach the global solution of the problem using gradient descent.

Taking derivatives, the gradient is

$$\frac{\partial L}{\partial A} = \sum_{s,t} m_{st}\, v_s v_t^{T} \qquad \frac{\partial L}{\partial h} = \sum_{s,t} m_{st}\, f_{st}$$

where the scalar $m_{st} = c^{\text{sup}}_{st} - c^{\text{sup}}_{s}\, q(t \mid s)$ for the current value of $q$.

For quick results, we limited the number of gradient steps to 200 and selected the iteration that minimized the total variation distance to $w^{\text{sup}}$ over a held-out dev set:

$$\sum_{s} \left\| q(\cdot \mid s) - w^{\text{sup}}(\cdot \mid s) \right\|_1. \qquad (5)$$

We obtained a better convergence rate by using a batch version of the effective and easy-to-implement Adagrad technique (Duchi et al., 2011). See Figure 1.

3.3 Re-estimating lexical weights

Having learned the model ($A$ and $h$), we can now use $q(t \mid s)$ to estimate the lexical weights (Eq. 2) of any aligned phrase pair $(s, t, a)$, assuming it is composed of embeddable words.

Figure 1: The (target-to-source) objective function per iteration. Applying batch Adagrad (blue) significantly accelerates convergence.

However, we found the supervised word translation scores $q$ to be too sharp, sometimes assigning all probability mass to a single target word. We therefore interpolated $q$ with the triangulated word translation scores $\tilde{w}$:

$$q_\lambda = \lambda q + (1 - \lambda)\, \tilde{w}. \qquad (6)$$

To integrate the lexical weights induced by $q_\lambda$ (Eq. 2), we simply appended them as new features in the phrase table in addition to the existing lexical weights. Following this, we can search for a $\lambda$ value that maximizes BLEU on a tuning set.

3.4 Summary of method

In summary, to improve upon a triangulated or interpolated phrase table, we:

1. Learn word translation distributions $q$ by supervision against distributions $w^{\text{sup}}$ derived from the source-target bilingual data (§3.1).

2. Smooth the learned distributions $q$ by interpolating with triangulated word translation scores $\tilde{w}$ (§3.3).

3. Compute new lexical weights and append them to the phrase table (§3.3).

4 Experiments

To test our method, we conducted two low-resource translation experiments using the phrase-based MT system Moses (Koehn et al., 2007).
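The following numpy sketch (assumed data layout, not the released implementation) runs batch gradient ascent on Eq. 4 with an Adagrad-style step size, using the gradients above with $m_{st} = c^{\text{sup}}_{st} - c^{\text{sup}}_{s}\, q(t \mid s)$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train_adagrad(data, d_s, d_t, steps=200, eta=0.1, eps=1e-8):
    """Batch gradient ascent on Eq. 4 with Adagrad step sizes.

    data: list of per-source-word items, each a dict with
        'v_s' : (d_s,)   source embedding
        'V_t' : (n, d_t) embeddings of the candidate set T(s)
        'f'   : (n,)     triangulated scores (the bilingual feature f_st)
        'c'   : (n,)     supervised counts c_sup_st for those candidates
    """
    A, h = np.zeros((d_s, d_t)), 0.0
    G_A, G_h = np.zeros_like(A), 0.0              # accumulated squared gradients
    for _ in range(steps):
        grad_A, grad_h = np.zeros_like(A), 0.0
        for item in data:
            v_s, V_t, f, c = item['v_s'], item['V_t'], item['f'], item['c']
            q = softmax(V_t @ (A.T @ v_s) + h * f)
            m = c - c.sum() * q                   # m_st = c_sup_st - c_sup_s q(t|s)
            grad_A += np.outer(v_s, m @ V_t)      # sum_t m_st v_s v_t^T
            grad_h += m @ f                       # sum_t m_st f_st
        G_A += grad_A ** 2
        G_h += grad_h ** 2
        A += eta * grad_A / (np.sqrt(G_A) + eps)  # ascent: maximize L(A, h)
        h += eta * grad_h / (np.sqrt(G_h) + eps)
    return A, h
```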

Ø Under this parameterization, our goal is to solve the following maximization problem:
$$\max_{A,h} L(A, h) = \max_{A,h} \sum_{s,t} c^{\text{sup}}_{st} \log q(t \mid s) \qquad (4)$$


Page 10: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3.2 Optimization


l The objective function in Eq. 4 is concave in both $A$ and $h$
Ø We can reach the global solution of the problem using gradient descent

l Taking derivatives, the gradient is:
$$\frac{\partial L}{\partial A} = \sum_{s,t} m_{st}\, v_s v_t^{T}, \qquad \frac{\partial L}{\partial h} = \sum_{s,t} m_{st}\, f_{st}, \qquad \text{where } m_{st} = c^{\text{sup}}_{st} - c^{\text{sup}}_{s}\, q(t \mid s)$$




l  For quick results, the authors limited the number of gradient steps to 200 and selected the iteration that minimized the total variation distance to wsup over a held-out dev set: Σs ||q(· | s) − wsup(· | s)||1  (Eq. 5)
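Below is a minimal sketch (not the authors' implementation) of this training loop, assuming the parameterization from §3.1, q(t | s) ∝ exp(vsᵀ A vt + h · w(t | s)) normalized over the candidate set T(s), and the closed-form gradients from §3.2 (∂L/∂A = Σ mst vs vtᵀ, ∂L/∂h = Σ mst fst with fst := wst). The data containers, learning rate, and dev-set handling are invented for the example.

import numpy as np

def q_row(s, A, h, V_s, V_t, w_tri, cand):
    # q(t | s) proportional to exp(v_s^T A v_t + h * w(t | s)), normalized over candidates T(s)
    ts = cand[s]
    scores = np.array([V_s[s] @ A @ V_t[t] + h * w_tri.get((s, t), 0.0) for t in ts])
    p = np.exp(scores - scores.max())
    return ts, p / p.sum()

def train(V_s, V_t, w_tri, c_sup, cand, w_sup_dev, steps=200, eta=0.5):
    # Batch gradient ascent with Adagrad on (A, h) for the concave objective
    # L(A, h) = sum_{s,t} c_sup[s,t] * log q(t | s)   (Eq. 4),
    # keeping the iterate closest to w_sup on the held-out dev words (Eq. 5).
    # Assumes every dev word in w_sup_dev also has a candidate set in cand.
    d = next(iter(V_s.values())).shape[0]
    A, h = np.zeros((d, d)), 0.0
    GA, Gh = np.zeros((d, d)), 0.0                      # Adagrad accumulators
    best = (float("inf"), A.copy(), h)
    for _ in range(steps):
        gA, gh = np.zeros((d, d)), 0.0
        for s in cand:
            ts, p = q_row(s, A, h, V_s, V_t, w_tri, cand)
            c_s = sum(c_sup.get((s, t), 0.0) for t in ts)
            for t, qt in zip(ts, p):
                m = c_sup.get((s, t), 0.0) - c_s * qt   # m_st
                gA += m * np.outer(V_s[s], V_t[t])
                gh += m * w_tri.get((s, t), 0.0)        # f_st := w_st
        GA += gA ** 2
        Gh += gh ** 2
        A = A + eta * gA / (np.sqrt(GA) + 1e-8)         # per-coordinate Adagrad step
        h = h + eta * gh / (np.sqrt(Gh) + 1e-8)
        tv = 0.0                                        # Eq. 5, restricted to candidates for brevity
        for s, ref in w_sup_dev.items():
            ts, p = q_row(s, A, h, V_s, V_t, w_tri, cand)
            tv += sum(abs(qt - ref.get(t, 0.0)) for t, qt in zip(ts, p))
        if tv < best[0]:
            best = (tv, A.copy(), h)
    return best[1], best[2]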


Page 11: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3.2 Optimization (cont'd)

15/10/15 11 2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST


Page 12: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3.3 Re-estimating lexical weights

15/10/15 12

l  Having learned the model (A and h), we can now use q(t | s) to estimate the lexical weights (Eq. 2) of any aligned phrase pair (s, t, a), assuming it is composed of embeddable words

l  However, the authors found the supervised word translation scores q to be too sharp, sometimes assigning all probability mass to a single target word

Ø  They therefore interpolated q with the triangulated word translation scores w (Eq. 6; a small sketch follows below):  qβ = βq + (1 − β)w

•  To  integrate  the  lexical  weights  induced  by  qβ  (Eq.  2),  they  simply  appended  them  as  new  features  in  the  phrase  table  
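A small sketch of these two steps, assuming word translation scores stored as Python dicts keyed by (source, target) pairs and a Moses-style phrase-table line with '|||'-separated fields; the field layout and score formatting are assumptions, not the authors' exact pipeline.

def interpolate(q, w_tri, beta=0.95):
    # q_beta(t | s) = beta * q(t | s) + (1 - beta) * w(t | s)   (Eq. 6)
    keys = set(q) | set(w_tri)
    return {k: beta * q.get(k, 0.0) + (1 - beta) * w_tri.get(k, 0.0) for k in keys}

def append_lex_features(line, lex_st, lex_ts):
    # Append two new lexical-weight features to one phrase-table entry, assuming
    # the common Moses layout: src ||| tgt ||| scores ||| alignment ||| counts
    fields = line.rstrip("\n").split(" ||| ")
    fields[2] = f"{fields[2]} {lex_st:.6g} {lex_ts:.6g}"
    return " ||| ".join(fields) + "\n"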

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST


Page 13: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

3.4  Summary  of  method  

15/10/15 13

In summary, to improve upon a triangulated or interpolated phrase table, the authors:

1.  Learn word translation distributions q by supervision against distributions wsup derived from the source-target bilingual data (§3.1)

2.  Smooth the learned distributions q by interpolating with triangulated word translation scores w (§3.3)

3.  Compute new lexical weights and append them to the phrase table (§3.3)

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST


Page 14: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

4.  Experiments  

15/10/15 14

l  To test the proposed method, the authors conducted two low-resource translation experiments using Moses (Koehn et al., 2007)

Translation Tasks:
l  Fixing the pivot language to English, they applied their method on two data scenarios:

1.  Spanish-to-French: two related languages used to simulate a low-resource setting. The baseline is phrase table interpolation (Eq. 3)

2.  Malagasy-to-French: two unrelated languages for which they have a small dictionary, but no parallel corpus. The baseline is triangulation alone.

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST

Page 15: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

4.1  Data  

15/10/15 15

Datasets:
•  European-language bitexts were extracted from Europarl
•  For Malagasy-English, the Global Voices parallel data available online
•  The Malagasy-French dictionary was taken from online resources, and small Malagasy-French tune/test sets from Global Voices

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST

4.1 Data

Fixing the pivot language to English, we applied our method on two data scenarios:

1. Spanish-to-French: two related languages used to simulate a low-resource setting. The baseline is phrase table interpolation (Eq. 3).

2. Malagasy-to-French: two unrelated languages for which we have a small dictionary, but no parallel corpus (aside from tuning and testing data). The baseline is triangulation alone (there is no source-target model to interpolate with).

Table 1 lists some statistics of the bilingual data we used. European-language bitexts were extracted from Europarl (Koehn, 2005). For Malagasy-English, we used the Global Voices parallel data available online [1]. The Malagasy-French dictionary was extracted from online resources [2] and the small Malagasy-French tune/test sets were extracted [3] from Global Voices.

language pair   train   tune   test
sp-fr           4k      1.5k   1.5k
mg-fr           1.1k    1.2k   1.2k
sp-en           50k     –      –
mg-en           100k    –      –
en-fr           50k     –      –

Table 1: Bilingual datasets (lines of data). Legend: sp=Spanish, fr=French, en=English, mg=Malagasy.

Table 2 lists token statistics of the monolingual data used. We used word2vec [4] to generate French, Spanish and Malagasy word embeddings. The French and Spanish embeddings were (independently) estimated over their combined tokenized and lowercased Gigaword [5] and Leipzig news corpora [6]. The Malagasy embeddings were similarly estimated over data from Global Voices [7], the Malagasy Wikipedia and the Malagasy Common Crawl [8]. In addition, we estimated a 5-gram French language model over the French monolingual data.

[1] http://www.ark.cs.cmu.edu/global-voices
[2] http://motmalgache.org/bins/homePage
[3] https://github.com/vchahun/gv-crawl
[4] https://radimrehurek.com/gensim/models/word2vec.html
[5] http://catalog.ldc.upenn.edu
[6] http://corpora.uni-leipzig.de/download.html
[7] http://www.isi.edu/~qdou/downloads.html
[8] https://commoncrawl.org/the-data/

language   words
French     1.5G
Spanish    1.4G
Malagasy   58M

Table 2: Size of monolingual corpus per language as measured in number of tokens.

4.2 Spanish-French Results

To produce wsup, we aligned the small Spanish-French parallel corpus in both directions, and symmetrized using the intersection heuristic. This was done to obtain high precision alignments (the often-used grow-diag-final-and heuristic is optimized for phrase extraction, not precision).

We used the skip-gram model to estimate the Spanish and French word embeddings and set the dimension to d = 200 and context window to w = 5 (default). Subsequently, to run our method, we filtered out source and target words that either did not appear in the triangulation, or did not have an embedding. We took words that appeared more than 10 times in the parallel corpus for the training set (~690 words), and between 5-9 times for the held-out dev set (~530 words). This was done in both source-target and target-source directions.
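As an aside, a small sketch of this frequency-based split (thresholds as reported above; tokenization and corpus handling are simplified):

from collections import Counter

def split_train_dev_words(tokenized_sentences):
    # words seen more than 10 times -> training words for the supervision;
    # words seen 5-9 times -> held-out dev words used for iterate selection
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    train_words = {w for w, c in counts.items() if c > 10}
    dev_words = {w for w, c in counts.items() if 5 <= c <= 9}
    return train_words, dev_words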

In Table 3 we show that the distributions learned by our method are much better approximations of wsup compared to those obtained by triangulation.

Method          source→target   target→source
triangulation   71.6%           72.0%
our scores      30.2%           33.8%

Table 3: Average total variation distance (Eq. 5) to the dev set portion of wsup (computed only over words whose translations in wsup appear in the triangulation). Using word embeddings, our method is able to better generalize on the dev set.

We then examined the effect of appending our supervised lexical weights. We fixed the word-level interpolation β := 0.95 (effectively assigning very little mass to triangulated word translations w) and searched for α ∈ {0.9, 0.8, 0.7, 0.6} in Eq. 3 to maximize BLEU on the tuning set.

Our MT results are reported in Table 4. While interpolation improves over triangulation alone by +0.8 BLEU, our method adds another +0.7 BLEU on top of interpolation, a statistically significant gain (p < 0.01) according to a bootstrap resampling significance test (Koehn, 2004).



Page 16: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

4.2  Spanish-­‐French  Results  

15/10/15 16

l  To produce wsup, the authors aligned the small Spanish-French parallel corpus in both directions, and symmetrized using the intersection heuristic to obtain high-precision alignments (not grow-diag-final-and)

l  To train the skip-gram model, they set the dimension to d = 200 and the context window to w = 5 (a training sketch follows below)

l  They took words that appeared more than 10 times in the parallel corpus for the training set (~690 words), and 5-9 times for the held-out dev set (~530 words)

l  They fixed β := 0.95 (assigning very little mass to the triangulated word translations) to examine the effect of their supervised method
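The slide above reports the word2vec settings; here is a sketch of how such skip-gram embeddings could be trained with gensim (the corpus path and preprocessing are hypothetical, and the argument names follow gensim 4.x):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# skip-gram embeddings with the reported Spanish/French settings:
# dimension d = 200, context window w = 5 (for Malagasy, d = 100 and w = 3)
sentences = LineSentence("monolingual.tok.lc.txt")  # hypothetical tokenized, lowercased corpus
model = Word2Vec(sentences=sentences, sg=1, vector_size=200, window=5, min_count=5, workers=4)
model.wv.save_word2vec_format("embeddings.txt")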

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST


Method                     α     tune   test
source-target              –     26.8   25.3
triangulation              –     29.2   28.4
interpolation              0.7   30.2   29.2
interpolation+our scores   0.6   30.8   29.9

Table 4: Spanish-French BLEU scores. Appending lexical weights obtained by supervision over a small source-target corpus significantly outperforms phrase table interpolation (Eq. 3) by +0.7 BLEU.

4.3 Malagasy-French Results

For Malagasy-French, the wsup distributions used for supervision were taken to be uniform distributions over the dictionary translations. For each training direction, we used a 70%/30% split of the dictionary to form the train and dev sets.

Having significantly less Malagasy monolingual data, we used d = 100 dimensional embeddings and a w = 3 context window to estimate both the Malagasy and French word embeddings.

As before, we added our supervised lexical weights as new features in the phrase table. However, instead of fixing β = 0.95 as above, we searched for β ∈ {0.9, 0.8, 0.7, 0.6} in Eq. 6 to maximize BLEU on a small tune set. We report our results in Table 5. Using only a dictionary, we are able to improve over triangulation by +0.5 BLEU, a statistically significant difference (p < 0.01).

Method                     β     tune   test
triangulation              –     12.2   11.1
triangulation+our scores   0.6   12.4   11.6

Table 5: Malagasy-French BLEU. Supervision with a dictionary significantly improves upon simple triangulation by +0.5 BLEU.

5 Conclusion

In this paper, we argued that constructing a triangulated phrase table independently from even very limited source-target data (a small dictionary or parallel corpus) underutilizes that parallel data.

Following this argument, we designed a supervised learning algorithm that relies on word translation distributions derived from the parallel data as well as a distributed representation of words (embeddings). The latter enables our algorithm to assign translation probabilities to word pairs that do not appear in the source-target bilingual data.

We then used our model to generate new lexical weights for phrase pairs appearing in a triangulated or interpolated phrase table and demonstrated improvements in MT quality on two tasks. This is despite the fact that the distributions (wsup) we fit our model to were estimated automatically, or even naïvely as uniform distributions.

Acknowledgements

The authors would like to thank Daniel Marcu and Kevin Knight for initial discussions and a supportive research environment at ISI, as well as the anonymous reviewers for their helpful comments. This research was supported in part by a Google Faculty Research Award to Chiang.

References

Trevor Cohn and Mirella Lapata. 2007. Machine translation by triangulation: Making effective use of multi-parallel corpora. In Proc. ACL, pages 728–735.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Machine Learning Research, 12:2121–2159, July.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. NAACL HLT, pages 48–54.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proc. ACL, Interactive Poster and Demonstration Sessions, pages 177–180.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proc. EMNLP, pages 388–395.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proc. MT Summit, pages 79–86.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proc. ICLR, Workshop Track.

Masao Utiyama and Hitoshi Isahara. 2007. A comparison of pivot methods for phrase-based statistical machine translation. In Proc. HLT-NAACL, pages 484–491.

Hua Wu and Haifeng Wang. 2007. Pivot language approach for phrase-based statistical machine translation. In Proc. ACL, pages 856–863.


Page 17: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

4.3  Malagasy-­‐French  Results  

15/10/15 17

l  The wsup distributions used for supervision were taken to be uniform distributions over the dictionary translations (a small sketch follows below)
•  For each training direction, they used a 70%/30% split of the dictionary to form the train and dev sets

l  To train the skip-gram model, they used d = 100 and w = 3 (having much less Malagasy monolingual data)
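A minimal sketch of building the uniform wsup from a bilingual dictionary and splitting its entries 70%/30% into train and dev portions; the dictionary format and the random split are illustrative assumptions.

import random

def uniform_wsup(dictionary):
    # dictionary maps a source word to its list of dictionary translations;
    # each listed translation receives equal probability mass
    return {s: {t: 1.0 / len(ts) for t in ts} for s, ts in dictionary.items() if ts}

def split_dictionary(dictionary, train_frac=0.7, seed=0):
    # 70%/30% split of the dictionary entries into train and dev sets
    words = sorted(dictionary)
    random.Random(seed).shuffle(words)
    cut = int(train_frac * len(words))
    return ({w: dictionary[w] for w in words[:cut]},
            {w: dictionary[w] for w in words[cut:]})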

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST


Page 18: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

5.  Conclusion  

15/10/15 18

In this paper:
l  The authors argued that constructing a triangulated phrase table independently from even very limited source-target data underutilizes that parallel data

Ø  They designed a supervised learning algorithm that relies on word translation distributions derived from the parallel data as well as a distributed representation of words (embeddings)

Ø  The latter enables their algorithm to assign translation probabilities to word pairs that do not appear in the source-target bilingual data

l  The model with the newly generated lexical weights demonstrates improvements in MT quality on two tasks, despite the fact that wsup was estimated automatically or even naïvely as uniform distributions

2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST

Page 19: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

6.  Impression  

15/10/15 19 2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST

Page 20: [Paper Introduction] Supervised Phrase Table Triangulation with Neural Word Embeddings for Low-Resource Languages

End  Slide  

15/10/15 20 2015©Akiva  Miura      AHC-­‐Lab,  IS,  NAIST