1
Gholamreza Haffari
Simon Fraser University
MT Summit, August 2009
Machine Learning approaches for dealing with Limited Bilingual Data in SMT
2
Acknowledgments
Special thanks to: Anoop Sarkar
Some slides are adapted or used from Chris Callison-Burch, Trevor Cohn, and Dragos Stefan Munteanu
3
Statistical Machine Translation
Translate from a source language to a target language by computer using a statistical model
M_FE is a standard log-linear model
[Diagram: the model M_FE translates Source Lang. F into Target Lang. E]
4
Log-Linear Models
The model combines feature functions f_i with weights w_i
At test time, the best output t* for a given s is chosen by
t* = argmax_t Σ_i w_i · f_i(t, s)
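As a toy illustration of this decision rule (the two feature functions and weights below are invented stand-ins, not the sub-models of a real SMT system), a minimal sketch in Python:

```python
def length_feature(t, s):
    # Toy feature: penalize length mismatch between translation and source.
    return -abs(len(t.split()) - len(s.split()))

def brevity_feature(t, s):
    # Toy stand-in for a real sub-model score (e.g. a language model log-prob).
    return -len(t)

FEATURES = [length_feature, brevity_feature]
WEIGHTS = [1.0, 0.1]   # in a real system these are tuned, e.g. by MERT

def decode(candidates, s):
    """Pick t* = argmax_t sum_i w_i * f_i(t, s) over candidate translations."""
    return max(candidates,
               key=lambda t: sum(w * f(t, s) for w, f in zip(WEIGHTS, FEATURES)))
```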
5
Phrase-based SMT
M_FE is composed of two main components:
The language model f_lm: takes care of the fluency of the generated translation
The phrase table f_pt: takes care of the content of the source sentence in the generated translation
A huge bitext is needed to learn a high-quality phrase dictionary
6
Bilingual Parallel Data
[Figure: aligned source-text and target-text sentences]
7
This Talk
What if we don't have a large bilingual text to learn a good phrase table?
8
Motivations
Low-density language pairs: the population speaking the language is small, or online resources are limited
Adapting to a new style/domain/topic
Overcoming the mismatch between training and testing
9
Available Resources
Small bilingual parallel corpora
Large amounts of monolingual data
Comparable corpora
Small translation dictionary
Multilingual parallel corpora which include multiple source languages but not the target language
10
The Map
[Diagram: the available resources and the techniques that exploit them to improve the MT system]
small source-target bitext → MT system
large comparable source-target bitext → parallel sentence extraction
large source monotext → bilingual dictionary induction; semi-supervised/active learning
source-another-language bitext → paraphrasing
source-another, another-target, and source-target bitexts → triangulation / co-training
11
Learning Problems (I)
Supervised learning: given a sample of object-label pairs (x_i, y_i), find the predictive relationship between objects and labels
Unsupervised learning: given a sample consisting of only objects, look for interesting structures in the data, and group similar objects
12
Learning Problems (II)
Now consider training data consisting of:
Labeled data: object-label pairs (x_i, y_i)
Unlabeled data: objects x_j
This leads to the following learning scenarios:
Semi-supervised learning: find the best mapping from objects to labels, benefiting from the unlabeled data
Transductive learning: find the labels of the unlabeled data
Active learning: find the mapping while actively querying the oracle for the labels of unlabeled data
13
The Big Picture
[Diagram: self-training loop. A model M is trained on the labeled data {(x_i, y_i)} (bitext) and used to label the unlabeled data {x_j} (monotext); a selected subset of the self-labeled data is added back to the training data]
14
Mining More Bilingual Parallel Data
Comparable corpora are texts which are not parallel in the strict sense but convey overlapping information, e.g. Wikipedia pages, or news agencies such as BBC and CNN
From comparable corpora, we can extract sentence pairs which are (approximately) translations of each other
15
Extracting Parallel Sentences
(Munteanu & Marcu, 2005)
[Diagram: unmatched documents go in, parallel sentences come out]
16
Article Selection
(Munteanu & Marcu, 2005)
Select the n-most relevant target-language docs to a source-language document using an information retrieval (IR) system:
Translate each source-lang article into a target-lang query using the bilingual dictionary
17
Candidate Sentence Pair Selection
(Munteanu & Marcu, 2005)
Consider all sentence pairs from the source-lang article and the relevant target-lang articles. Filter out a sentence pair if:
The ratio of the lengths is more than 2
At least half of the words in either sentence do not have a translation in the other sentence
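A rough sketch of this filter, assuming toy word-level dictionaries src2tgt and tgt2src (hypothetical names) that map each word to a set of possible translations:

```python
def half_covered(words, other_words, translations):
    """True if at least half of `words` have a translation among `other_words`."""
    other = set(other_words)
    covered = sum(1 for w in words if translations.get(w, set()) & other)
    return covered >= len(words) / 2

def keep_candidate(src_sent, tgt_sent, src2tgt, tgt2src):
    s, t = src_sent.split(), tgt_sent.split()
    # Discard pairs whose sentence-length ratio exceeds 2.
    if max(len(s), len(t)) > 2 * min(len(s), len(t)):
        return False
    # Discard pairs where at least half the words on either side are untranslated.
    return half_covered(s, t, src2tgt) and half_covered(t, s, tgt2src)
```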
18
Parallel Sentence Selection
(Munteanu & Marcu, 2005)
Each candidate sentence pair (s, t) is classified into c0 = 'parallel' or c1 = 'not parallel' according to the following log-linear model:
p(c | s, t) ∝ exp(Σ_i w_i · f_i(c, s, t))
The weights are learned during the training phase using the training data
19
Model Features & Training Data
(Munteanu & Marcu, 2005)
The features of the log-linear classifier include:
Lengths of the sentences, as well as their ratio
Percentage of words on one side that have no translation on the other side / are not connected by alignment links
Training data can be prepared from a parallel corpus containing K sentence pairs
This gives K positive and K² − K negative examples (which can be filtered further using the previous heuristics)
20
Improvement in SMT (Arabic to English)
(Munteanu & Marcu, 2005)
[Figure: BLEU as a function of the amount of training data, comparing the initial out-of-domain parallel corpus, initial + extracted corpus, and initial + human-translated data]
21
Outline
Introduction
Semi-supervised Learning for SMT: background (EM, self-training, co-training); SSL for alignments / phrases / sentences
Active Learning for SMT: single language pair; multiple language pairs
22
Inductive vs. Transductive
Transductive: produce labels only for the available unlabeled data; the output of the method is not a classifier. It's like writing answers for a take-home exam!
Inductive: not only produce labels for the unlabeled data, but also produce a classifier. It's like preparing to write answers for an in-class exam!
23
Self-Training
[Diagram: Iteration 0, a model is trained by supervised learning on the labeled (+/−) data; instances labeled with high confidence are chosen and added to the pool of current labeled training data; iterations 1, 2, … repeat the cycle]
(Yarowsky 1995)
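A minimal, generic self-training loop in the spirit of this diagram; the train and predict_with_confidence callables and the 0.9 threshold are placeholders of this sketch, not part of the original work:

```python
def self_train(labeled, unlabeled, train, predict_with_confidence,
               threshold=0.9, iterations=5):
    labeled = list(labeled)               # (x, y) pairs
    pool = list(unlabeled)                # unlabeled instances x
    model = train(labeled)
    for _ in range(iterations):
        confident, rest = [], []
        for x in pool:
            y, conf = predict_with_confidence(model, x)
            (confident if conf >= threshold else rest).append((x, y))
        if not confident:
            break
        labeled.extend(confident)         # add high-confidence self-labels
        pool = [x for x, _ in rest]
        model = train(labeled)            # re-train on the enlarged data
    return model
```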
24
EM
Use EM to maximize the joint log-likelihood of the labeled and unlabeled data:
L(θ) = Σ_i log p(x_i, y_i | θ) + Σ_j log Σ_y p(x_j, y | θ)
The first sum is the log-likelihood of the labeled data; the second is the log-likelihood of the unlabeled data, with the unseen labels marginalized out
(Dempster et al 1977)
25
EM
[Diagram: Iteration 0, a (probabilistic) model is trained by supervised learning; the unlabeled instances are cloned into new weighted labeled instances (weights w+_i, w−_i) using the model; iterations 1, 2, … repeat the cycle]
(Yarowsky 1995)
26
Co-Training
Instances contain two sufficient sets of features, i.e. an instance is x = (x1, x2)
Each set of features is called a view
The two views are independent given the label: P(x1, x2 | y) = P(x1 | y) · P(x2 | y)
The two views are consistent: the true labeling functions on the two views agree, f1(x1) = f2(x2) = y
[Diagram: instance x split into views x1 and x2]
(Blum & Mitchell 1998)
27
Co-Training
[Diagram: at iteration t, classifier C1 (trained on view 1) and classifier C2 (trained on view 2) are each allowed to label some instances; the self-labeled instances are added to the pool of training data, and the cycle repeats at iteration t+1]
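A sketch of this loop under the assumptions above; train and predict_with_confidence are again placeholder callables, and the per-round quota is an arbitrary choice of this sketch:

```python
def co_train(labeled, unlabeled, train, predict_with_confidence,
             per_round=5, iterations=10):
    labeled = list(labeled)   # items: ((view1, view2), label)
    pool = list(unlabeled)    # items: (view1, view2)
    c1 = c2 = None
    for _ in range(iterations):
        c1 = train([(x1, y) for (x1, x2), y in labeled])  # view-1 classifier
        c2 = train([(x2, y) for (x1, x2), y in labeled])  # view-2 classifier
        if not pool:
            break
        picked = set()
        for clf, view in ((c1, 0), (c2, 1)):
            # Let each classifier label the instances it is most confident on.
            scored = sorted(((predict_with_confidence(clf, x[view]), i)
                             for i, x in enumerate(pool) if i not in picked),
                            key=lambda item: item[0][1], reverse=True)
            for (y, _conf), i in scored[:per_round]:
                labeled.append((pool[i], y))
                picked.add(i)
        pool = [x for i, x in enumerate(pool) if i not in picked]
    return c1, c2
```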
28
Outline
Introduction
Semi-supervised Learning for SMT: background (EM, self-training, co-training); SSL for alignments / phrases / sentences
Active Learning for SMT: single language pair; multiple language pairs
29
Word Alignment & Translation Quality
(Fraser & Marcu 2006a) presented an SSL method for learning a better word alignment
They used a small set of sentence pairs annotated with word alignments (~100) and a big unannotated set (~2-3 million)
They showed that improving the word alignment improved BLEU
The same conclusion was reached later in (Ganchev et al 2008) for other translation tasks
30
Word Alignment Model
Consider the following log-linear model for word alignment:
p(a, f | e) ∝ exp(Σ_i λ_i · h_i(a, f, e))
The feature functions are sub-models used in IBM Model 4, such as:
Translation probabilities t(f | e)
Fertility probabilities n(φ | e): the number of words generated by e
…
31
SS-Word Alignment
(Fraser & Marcu 2006a) tuned the word alignment model parameters λ on the small labeled data in a discriminative fashion:
With the current λ, generate the n-best list
Manipulate λ so that the best alignment stands out, i.e. the one which maximizes a modified F-measure (a MERT-style algorithm)
Use λ to find the word alignments of the big unlabeled data, and estimate the feature functions' parameters based on these best (Viterbi) alignments: one iteration of the EM algorithm
Repeat the above two steps
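A high-level sketch of the alternating procedure just described; all four callables are placeholders standing in for the machinery of (Fraser & Marcu 2006a), not a real API:

```python
def semi_supervised_alignment(labeled, unlabeled, submodels, weights,
                              tune_weights, viterbi_align, reestimate, rounds=3):
    for _ in range(rounds):
        # (1) Discriminative, MERT-style step: adjust the log-linear weights
        # so that the alignment maximizing the modified F-measure ranks first
        # on the small hand-aligned data.
        weights = tune_weights(weights, submodels, labeled)
        # (2) One EM iteration: Viterbi-align the big unlabeled bitext with
        # the current weights, then re-estimate the sub-model parameters
        # (t(f|e), fertilities, ...) from those alignments.
        alignments = [viterbi_align(pair, submodels, weights) for pair in unlabeled]
        submodels = reestimate(alignments, unlabeled)
    return submodels, weights
```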
32
Outline
Introduction
Semi-supervised Learning for SMT: background (EM, self-training, co-training); SSL for alignments / phrases / sentences
Active Learning for SMT: single language pair; multiple language pairs
33
Paraphrasing
If a word is unseen, then SMT will not be able to translate it: keep/omit/transliterate the source word, or use a regular expression to translate it (dates, …)
If a phrase is unseen, but its individual words are seen, then SMT will be less likely to produce a correct translation
The idea: use paraphrases in the source language to replace unknown words/phrases. Paraphrases are alternative ways of conveying the same information
(Callison-Burch, 2007)
34
Coverage Problem in SMT
[Figure: percentage of test item types covered vs. corpus size]
(Callison-Burch, 2007)
35
Behavior on Unseen Data
A system trained on 10,000 sentences (~200,000 words) may translate:
Es positivo llegar a un acuerdo sobre los procedimientos, pero debemos encargarnos de que este sistema no sea susceptible de ser usado como arma política.
as
It is good reach an agreement on procedures, but we must encargarnos that this system is not susceptible to be usado as political weapon.
Since the translations of encargarnos and usado were not learned, they are either reproduced in the translation or omitted entirely
(Callison-Burch, 2007)
36
Substituting Paraphrases then Translating
It is good reach an agreement on procedures, but we must encargarnos that this system is not susceptible to be usado as political weapon.
encargarnos → ?
usado → ?
(Callison-Burch, 2007)
37
Substituting Paraphrases then Translating
It is good reach an agreement on procedures, but we must encargarnos that this system is not susceptible to be usado as political weapon.
encargarnos → garantizar, velar, procurar, asegurarnos
usado → utilizado, empleado, uso, utiliza
(Callison-Burch, 2007)
38
Substituting Paraphrases then Translating
It is good reach an agreement on procedures, but we must guarantee that this system is not susceptible to be used as political weapon.
encargarnos → garantizar (guarantee, ensure, guaranteed, assure, provided), velar (ensure, ensuring, safeguard, making sure), procurar (ensure that, try to, ensure, endeavour to), asegurarnos (ensure, secure, make certain)
usado → utilizado (used, use, spent, utilized), empleado (used, spent, employee), uso (use, used, usage), utiliza (used, uses, used, being used)
(Callison-Burch, 2007)
39
Learning paraphrases (I)
From monolingual parallel corpora: multiple source sentences convey the same information
Extract paraphrases seen in the same context in the aligned source sentences
Emma burst into tears and he tried to comfort her, saying things to make her smile.
Emma cried, and he tried to console her, adorning his words with puns.
(Callison-Burch, 2007)
40
Learning paraphrases (I)
From monolingual parallel corpora: multiple source sentences convey the same information
Extract paraphrases seen in the same context in the aligned source sentences
burst into tears = cried; comfort = console
Emma burst into tears and he tried to comfort her, saying things to make her smile.
Emma cried, and he tried to console her, adorning his words with puns.
(Callison-Burch, 2007)
41
Learning paraphrases (I)
From monolingual parallel corpora: multiple source sentences convey the same information
Extract paraphrases seen in the same context in the aligned source sentences
Problems with this approach: monolingual parallel corpora are relatively uncommon, which limits what we can generate, e.g. a limited number of paraphrases
(Callison-Burch, 2007)
42
Learning paraphrases (I)
From monolingual source corpora
For each unknown phrase x, build a distributional profile DP_x which records the co-occurrences of the surrounding words with x
Select the top-k phrases whose distributional profiles are most similar to DP_x
Open questions: Is position important when building the profile? Should we simply count words, or use TF/IDF, or …? Which vector similarity measure should be used?
Needs smart tricks to make it scalable (Marton et al 2009)
43
Learning paraphrases (II)
From bilingual parallel corpora. However, we no longer have access to identical contexts
Adopt techniques from phrase-based SMT: use aligned foreign-language phrases as the pivot
(Callison-Burch, 2007)
44
Paraphrase Probability
Generate multiple paraphrases for a given phrase
We give them probabilities so they can be ranked
Define the paraphrase probability in terms of translation model probabilities, pivoting through foreign phrases t:
p(s2 | s1) = Σ_t p(s2 | t) · p(t | s1)
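A small sketch of this pivot computation, assuming toy phrase-table dictionaries (p_f_given_e maps a source-language phrase to foreign pivot phrases with probabilities, and p_e_given_f the reverse; both names are assumptions of this sketch):

```python
from collections import defaultdict

def paraphrase_probs(s1, p_f_given_e, p_e_given_f):
    """Return {s2: p(s2|s1)} by marginalizing over foreign pivot phrases."""
    scores = defaultdict(float)
    for pivot, p_pivot in p_f_given_e.get(s1, {}).items():     # p(t | s1)
        for s2, p_back in p_e_given_f.get(pivot, {}).items():  # p(s2 | t)
            if s2 != s1:
                scores[s2] += p_back * p_pivot
    return dict(scores)
```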
45
Refined Paraphrase Probability
Using multiple bilingual corpora, e.g. English-Spanish, English-German, …
C is the set of bilingual corpora and w_c is the weight of corpus c (e.g. we may put more weight on larger corpora):
p(s2 | s1) ∝ Σ_{c ∈ C} w_c · Σ_t p_c(s2 | t) · p_c(t | s1)
Taking word sense into account: in a paraphrase, replace each word with its word_sense item
46
Plugging Paraphrases into SMT Model
For each paraphrase s2 of an unseen phrase s1, where s2 has a translation t, we expand the phrase table by adding the new entry (t, s1)
[Diagram: s1 is paraphrased as s2, which translates to t]
Add a new feature function to the SMT log-linear model to take the paraphrase probabilities into account:
f(t, s1) = p(s2 | s1) if phrase table entry (t, s1) is generated from (t, s2); 1 otherwise
47
Results of Paraphrasing
(Callison-Burch, 2007)
48
Improvement in Coverage
(Callison-Burch, 2007)
49
Triangulation
We can find additional data by focusing on: multi-parallel corpora, or a collection of bitexts with some common language(s)
(Cohn & Lapata, 2007)
52
Phrase-Level Triangulation
Triangulation (Kay, 1997): translate the source phrase into an intermediate-language phrase, then translate this intermediate phrase into the target phrase
Example: translating "a hot potato" into French
(Cohn & Lapata, 2007)
53
A Generative Model for Triangulation
Marginalize out the intermediate phrases i:
p(s | t) = Σ_i p(s | i) · p(i | t)
The generative model for p(s | t): t generates an intermediate phrase i, which generates s
(Cohn & Lapata, 2007)
54
A Generative Model for Triangulation
Marginalize out the intermediate phrases: p(s | t) = Σ_i p(s | i) · p(i | t)
Conditional independence assumption: i fully represents the information in t needed to translate s
Extends trivially to many intermediate languages
p(s | i) and p(i | t) are estimated using phrase frequencies
(Cohn & Lapata, 2007)
55
A Generative Model for Triangulation
Marginalize out the intermediate phrases: p(s | t) = Σ_i p(s | i) · p(i | t)
Caveats: the conditional independence assumption may be violated
The translation model is estimated from noisy alignments
Contexts i may be missing in p(s | i)
Fewer large or rare phrases can be translated
(Cohn & Lapata, 2007)
56
Plugging Triangulated Phrases into Model
A mixture model of the phrase pair probabilities from the training set (standard) and the phrase pairs newly learned by triangulation:
p(s | t) = λ · p_standard(s | t) + (1 − λ) · p_triang(s | t)
Alternatively, the triangulated table can be added as a new feature in the log-linear model
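A minimal sketch of this interpolation, with each phrase table represented as a dictionary from (source phrase, target phrase) pairs to probabilities; the weight 0.9 is an arbitrary example value:

```python
def interpolate_tables(standard, triangulated, lam=0.9):
    """p(s|t) = lam * p_standard(s|t) + (1 - lam) * p_triang(s|t)."""
    merged = {}
    for key in set(standard) | set(triangulated):
        merged[key] = (lam * standard.get(key, 0.0)
                       + (1.0 - lam) * triangulated.get(key, 0.0))
    return merged
```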
57
Coverage Benefit
58
For any Language Pair?
10k bilingual sentences, interpolated with 3 intermediate languages
(Cohn & Lapata, 2007)
59
Larger Corpora
For French to English with Spanish as the intermediate language, using different sizes for the bitext(s)
triang: only triangulated phrases
interp: mixture model of the two phrase tables
(Cohn & Lapata, 2007)
60
What Languages are best for triangulation?
10K bilingual sentences, translating from French to English
(Cohn & Lapata, 2007)
61
How many languages are required?
10K bilingual sentences, translating from French to English, ordered by language family
(Cohn & Lapata, 2007)
62
Paraphrasing vs Triangulation
Paraphrasing: uses bilingual projection to translate to and from a source phrase; it is employed to improve source-side coverage
Triangulation: generalizes the paraphrasing method to any translation pathway linking the source and target; it improves both source and target coverage
(Cohn & Lapata, 2007)
63
Bilingual Lexicon Induction
The goal is to induce a larger bilingual dictionary. It can be used, for example, to augment the phrase table/parallel text
Suppose we have access to a small bilingual dictionary plus large monolingual texts
Build a distributional profile using the monolingual source text
Map the profile into the target-language vocabulary space using seed rules (the initial bilingual dictionary)
Select the top-k target-language words with the most similar distributional profiles
(Rapp, 1999)
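A rough sketch of these steps, assuming window-based counting with raw counts and cosine similarity (one simple choice among the alternatives questioned on the next slides):

```python
import math
from collections import Counter

def context_profile(word, sentences, window=2):
    """Count co-occurrences of surrounding words within a fixed-size window."""
    profile = Counter()
    for sent in sentences:
        toks = sent.split()
        for i, w in enumerate(toks):
            if w == word:
                for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                    if j != i:
                        profile[toks[j]] += 1
    return profile

def cosine(p, q):
    dot = sum(p[k] * q.get(k, 0) for k in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def induce_translations(src_word, src_sents, tgt_sents, seed_dict, tgt_vocab, k=5):
    # Build the source-side profile and project it into the target vocabulary
    # space using the seed rules (the initial bilingual dictionary).
    projected = Counter()
    for w, count in context_profile(src_word, src_sents).items():
        for t in seed_dict.get(w, []):
            projected[t] += count
    # Rank candidate target words by the similarity of their profiles.
    scored = sorted(((cosine(projected, context_profile(t, tgt_sents)), t)
                     for t in tgt_vocab), reverse=True)
    return [t for _, t in scored[:k]]
```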
64
Context-based Rapp Model
(Garera et al 2009)
66
Dependency Context
Usually, words in a fixed-size window are used to represent the context
(Garera et al 2009) uses the latent structure in the dependency parse tree to represent the context instead:
Dynamic context size
Accounts for reordering
(Garera et al 2009)
67
Bilingual Lexicon Induction (more references)
(Koehn & Knight 2002) take orthographic features into account in addition to the context
(Haghighi et al 2008) devise a generative model which generates the (feature vectors of) related words in the source and target languages; each word is represented by a feature vector containing both contextual and orthographic features
(Mann & Yarowsky 2001) and (Schafer & Yarowsky 2002) use a bridge language to induce a bilingual lexicon
68
Bilingual Phrase Induction (non-comparable corpora)
Non-comparable corpora contain “... disparate, very nonparallel bilingual documents that could either be on the same topic (on-topic) or not” (Fung & Cheung 2004)
The goal is to extract parallel sub-sentential fragments, as opposed to extracting parallel sentences
Assume we have a lexical dictionary P(t | s): the probability that source word s translates into target word t
Using some heuristics, specify the candidate sentence pairs
(Munteanu & Marcu 2006)
69
The Signal Processing Approach
[Figure, built up across several slides: for each word in the target sentence, a “signal” is computed from the lexical probabilities P(t | s) against the source sentence; the signal values are then smoothed by averaging the “signals” from neighboring words]
77
Bilingual Phrase Induction (non-comparable corpora)
Retain “positive fragments”, i.e. those fragments for which the corresponding filtered signal values are positive
Repeat the procedure in the other direction (target to source) to obtain the fragments for source, and consider the resulting two text chunks as parallel
The signal filtering function is simple; more advanced filters might work better
(Munteanu & Marcu 2006)
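A simple sketch of such a filter; the ±1 raw signal and the window size are illustrative simplifications of this sketch, not the exact functions used in the paper:

```python
def extract_fragments(src_words, tgt_words, p_t_given_s, window=2):
    # Raw signal: +1 for a target word with a plausible source translation,
    # -1 otherwise (a simplification of the real signal).
    raw = [1.0 if any(p_t_given_s.get((s, t), 0.0) > 0 for s in src_words) else -1.0
           for t in tgt_words]
    # Smooth by averaging the "signals" from neighboring words.
    smooth = []
    for i in range(len(raw)):
        lo, hi = max(0, i - window), min(len(raw), i + window + 1)
        smooth.append(sum(raw[lo:hi]) / (hi - lo))
    # Retain maximal runs of positive filtered values as candidate fragments.
    fragments, run = [], []
    for word, value in zip(tgt_words, smooth):
        if value > 0:
            run.append(word)
        elif run:
            fragments.append(" ".join(run))
            run = []
    if run:
        fragments.append(" ".join(run))
    return fragments
```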
78
The Effect of Parallel Fragments for SMT
(Munteanu & Marcu 2006)
[Figure: BLEU comparison of the parallel-sentence extraction method explained at the beginning of the talk vs. the fragment-extraction method just explained]
79
Outline
Introduction
Semi-supervised Learning for SMT: background (EM, self-training, co-training); SSL for alignments / phrases / sentences
Active Learning for SMT: single language pair; multiple language pairs
80
Self-Training for SMT
[Diagram: train the model M_FE on the bilingual text (F, E); decode the monolingual text F into translated text (F, E); select high-quality sentence pairs; re-train the SMT model on the augmented data and repeat]
(Ueffing et al 2007a)
82
Scoring & Selecting Sentence Pairs
Scoring: use the normalized decoder score, or a confidence estimation method (Ueffing & Ney 2007)
Selecting: importance sampling; keeping those whose score is above a threshold; or keeping all sentence pairs
83
Confidence Estimation
A log-linear combination of:
Word posterior probabilities: the chance of seeing a word in a particular position in the translations
Phrase posterior probabilities
Language model score
The weights are tuned to minimize the classification error rate; translations having a WER above a threshold are considered incorrect
85
Re-Training the SMT Model (I)
Simply add the newly selected sentence pairs to the initial bitext, and fully re-train the phrase table
Alternatively, use a mixture model of the phrase pair probabilities from the training set combined with those from the newly selected sentence pairs:
p(s | t) = λ · p_initial(s | t) + (1 − λ) · p_new(s | t)
(Ueffing et al 2007a)
86
Re-training the SMT Model (II)
Use the new sentence pairs to train an additional phrase table and use it as a new feature function in the SMT log-linear model:
Phrase Table 1: trained on sentences for which we have the true translations
Phrase Table 2: trained on sentences with their generated translations
87
Results (Chinese to English, Transductive)
Selection / Scoring → BLEU[%], WER[%], PER[%]:
Baseline → 27.9±0.7, 67.2±0.6, 44.0±0.5
Keep all → 28.1, 66.5, 44.2
Importance sampling / norm. score → 28.7, 66.1, 43.6
Importance sampling / confidence → 28.4, 65.8, 43.2
Threshold / norm. score → 28.3, 66.1, 43.5
Threshold / confidence → 29.3, 65.6, 43.2
WER: lower is better (word error rate); PER: lower is better (position-independent WER); BLEU: higher is better
Bold: best result, italic: significantly better
Using additional phrase table
88
Results (Chinese to English, Inductive)
System → BLEU[%], WER[%], PER[%] (Eval-04, 4 refs.)
Baseline → 31.8±0.7, 66.8±0.7, 41.5±0.5
Add Chinese data, Iter 1 → 32.8, 65.7, 40.9
Add Chinese data, Iter 4 → 32.6, 65.8, 40.9
Add Chinese data, Iter 10 → 32.5, 66.1, 41.2
WER: lower is better (word error rate); PER: lower is better (position-independent WER); BLEU: higher is better
Bold: best result, italic: significantly better
Using importance sampling and additional phrase table
89
Why does it work (I)
Self-training reinforces the parts of the phrase translation model which are relevant for the test corpus, giving a more focused probability distribution
Before: the phrase table has A B → “a b e” (prob .5) and A B → “c d” (prob .5)
Decode the monotext: in a sentence “---- A B -----” translated as “---- c d -----”, “c d” is chosen since the LM picks it according to signals from the context
After re-training: A B → “a b e” (prob .2) and A B → “c d” (prob .8)
Use this to resolve the ambiguity of translating “A B” in other parts of the text
(Ueffing et al 2008)
90
Why does it work (II)
Composes new phrases, for example:
Original parallel corpus phrases: ‘A B’, ‘C D E’; additional source data: ‘A B C D E’; possible new phrases: ‘A B C’, ‘B C D E’, …
[Diagram: the source ‘----- A B C D E -----’ is translated as ‘----- a b c d e -----’, and the new phrase pairs are extracted from the alignment]
(Ueffing et al 2008)
91
Analysis
New phrases are used rarely, hence most of the benefit comes from focused probability distributions
92
Co-training for SMT
The source sentence is a view onto the translation
Existing translations of a source sentence (in other languages) can be used as additional views on the translation
(Callison-Burch, 2003)
93
Co-Training for SMT
(Callison-Burch, 2003)
94
Co-Training for SMT
(Callison-Burch, 2003)
Having initial bitexts, train SMT models from source languages to the target language
95
Co-Training for SMT
(Callison-Burch, 2003)
Translate a multilingual parallel sentence in the source languages using the trained SMT models
96
Co-Training for SMT
(Callison-Burch, 2003)
Choose the best generated translation
97
Co-Training for SMT
(Callison-Burch, 2003)
Add the new sentence pairs to the bitexts and re-train the SMT models
98
Results of Co-Training
20k initial labeled sentences, 60k unlabeled parallel sentences in 5 languages, select 10k pseudo-labeled sentences in each iteration
(Callison-Burch, 2003)
99
Coaching
Suppose we have no German-English bitext, but there is a French-English bitext and a French-German bitext
Train a French to English translation model
Translate the French side to English and align the generated translations with the German side
100
Results of Coaching
Coaching of German to English by a French to English translation model
(Callison-Burch, 2003)
101
Results of Coaching
Coaching of German to English by multiple translation models
(Callison-Burch, 2003)
102
Outline
Introduction
Semi-supervised Learning for SMT: background (EM, self-training, co-training); SSL for alignments / phrases / sentences
Active Learning for SMT: single language pair; multiple language pairs
103
Shortage of Bilingual Data: A Solution
Suppose we are given a large monolingual text in the source language F
Pay a human expert to translate these sentences into the target language E; this way, we will have a bigger bilingual text
But our budget is limited! We cannot afford to translate all of the monolingual sentences
104
A Better Solution
Choose the subset of the monolingual sentences for which, if we had the translations, the SMT performance would increase the most
Only ask the human expert for the translation of these highly informative sentences
This is the goal of Active Learning
105
Active Learning for SMT
[Diagram: train M_FE on the bilingual text (F, E); decode the monolingual text F into translated text; select informative sentences and have them translated by the human expert; add the new sentence pairs and re-train the SMT models, then repeat]
(Haffari et al 2009)
107
Sentence Selection Strategies
Baselines: randomly choose sentences from the pool of monolingual sentences; choose longer sentences from the monolingual corpus
Other methods: the decoder's confidence for the translations (Kato & Barnard, 2007); the reverse model; the utility of the translation units
(Haffari et al 2009)
108
Decoder’s Confidence
Sentences for which the model is not confident about their translations are selected first
Hopefully, highly confident translations are good ones
Normalize the confidence score by the sentence length
(Haffari et al 2009)
109
Reverse Model
Compare the original sentence with the sentence obtained by translating forward and back again; this tells us something about the value of the sentence
[Example: “I will let you know about the issue later” is translated by M_EF to “Je vais vous faire plus tard sur la question”, which the reverse model M_FE translates back as “I will later on the question”]
(Haffari et al 2009)
111
Utility of the Translation Units
Phrases are the basic units of translations in phrase-based SMT
[Figure: the sentence “I will let you know about the issue later” segmented into phrases, with each phrase's count in the monolingual text (6, 6, 18, 3) and in the bilingual text (5, 6, 12, 3, 7)]
The more frequent a phrase is in the monolingual text, the more important it is
The more frequent a phrase is in the bilingual text, the less important it is
112
Generative Models for Phrases
[Figure: the phrase counts are normalized into probability distributions over phrases: monolingual text counts (6, 6, 18, 3) give p_m = (.25, .25, .05, .33, .12), and bilingual text counts (5, 6, 12, 3, 7) give p_b = (.21, .22, .05, .09, .14, .29)]
113
Sentence Selection: Probability Ratio Score
For a monolingual sentence S, consider the bag of its phrases X_S = {x_1, x_2, x_3}
The score of S depends on its probability ratio:
Score(S) ∝ Π_{x ∈ X_S} p_m(x) / p_b(x)
(Haffari et al 2009)
114
Sentence Selection: Probability Ratio Score
For a monolingual sentence S, the score depends on the probability ratios p_m(x) / p_b(x) of the phrases x in its bag X_S
The phrase probability ratio captures our intuition about the utility of the translation units
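A small sketch of this score, with p_mono and p_bi as dictionaries from phrases to smoothed probabilities; the geometric-mean length normalization and the floor value are assumptions of this sketch:

```python
import math

def prob_ratio_score(phrases, p_mono, p_bi, floor=1e-6):
    """Score a sentence by the ratios p_m(x)/p_b(x) of its phrases x."""
    ratios = [math.log(p_mono.get(x, floor) / p_bi.get(x, floor)) for x in phrases]
    return math.exp(sum(ratios) / len(ratios)) if ratios else 0.0
```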
115
Extensions of the Score
Instead of using phrases, we may use n-grams
We may alternatively use the following score: [formula not preserved in the transcript]
(Haffari et al 2009)
116
Sentence Segmentation
How to prepare the bag of phrases for a sentence S?
For the bilingual text, we have the segmentation from the training phase of the SMT model
For the monolingual text, we run the SMT model to produce the top-n translations and segmentations
What about OOV fragments in the sentences of the monolingual text?
(Haffari & Sarkar 2009)
117
OOV Fragments: An Example
[Figure: the sentence “i will go to school on friday” contains the OOV fragment “go to school on friday”, which can be segmented into OOV phrases in several different ways; the resulting OOV phrases can be long]
(Haffari & Sarkar 2009)
118
Counting OOV Phrases
Fix an OOV fragment x
Put a uniform distribution over all possible segmentations of x
Use the expected count of OOV Phrases under this uniform distribution
See (Haffari & Sarkar 2009) for how to compute these expectations efficiently
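A brute-force sketch of these expected counts, enumerating every segmentation explicitly; the paper computes the same expectations efficiently, so this is for illustration only:

```python
from collections import Counter
from itertools import combinations

def expected_oov_phrase_counts(fragment_words):
    """Expected phrase counts under a uniform distribution over segmentations."""
    n = len(fragment_words)
    expected = Counter()
    # A segmentation corresponds to a subset of the n-1 possible split points.
    segmentations = [cuts for r in range(n) for cuts in combinations(range(1, n), r)]
    for cuts in segmentations:
        bounds = [0, *cuts, n]
        for a, b in zip(bounds, bounds[1:]):
            expected[" ".join(fragment_words[a:b])] += 1
    total = len(segmentations)
    return {phrase: count / total for phrase, count in expected.items()}
```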
120
Re-training the SMT Models
We use two phrase tables in each SMT model M_{Fi→E}:
Phrase Table 1: trained on sentences for which we have the true translations
Phrase Table 2: trained on sentences with their generated translations (self-training)
121
Experimental Setup
Dataset sizes (French-English): bitext 5K sentences, monotext 20K, test set 2K
We select 200 sentences from the monolingual sentence set in each of 25 iterations
We use Portage from NRC as the underlying SMT system (Ueffing et al, 2007b)
122
The Simulated AL Setting
[Figure: BLEU learning curves comparing sentence selection by utility of phrases, random selection, and the decoder's confidence; higher is better]
123
The Simulated AL Setting
[Figure: further learning curves for the simulated active-learning setting; higher is better]
124
Outline
Introduction
Semi-supervised Learning for SMT: background (EM, self-training, co-training); SSL for alignments / phrases / sentences
Active Learning for SMT: single language pair; multiple language pairs
125
Multiple Language-Pair AL-SMT
Add a new language, e.g. English, to a multilingual parallel corpus, to build high-quality SMT systems from the existing languages to the new language
[Diagram: source languages F1 (German), F2 (French), F3 (Spanish), … are translated into E (English); active learning is used to maximize translation quality]
126
AL-SMT: Multilingual Setting
[Diagram: train the models M_{Fi→E} on the multilingual bilingual text (F1, F2, … paired with E); decode the monolingual text in F1, F2, … into translations E1, E2, …; select informative sentences and have them translated by the human expert; add the new sentence pairs and re-train the SMT models, then repeat]
127
Selecting Multilingual Sents. (I)
• Alternate Method: choose informative sentences based on a specific Fi in each AL iteration
[Table: each language Fi ranks the candidate sentences; e.g. one sentence is ranked 2 / 3 / 2 in the F1 / F2 / F3 lists, another 35 / 19 / 17, another 1 / 2 / 3]
(Reichart et al, 2008)
128
Selecting Multilingual Sents. (II)
• Combined Rank Method: sort sentences based on the sum of their ranks in all lists
[Table: a sentence ranked 2 / 3 / 2 in the F1 / F2 / F3 lists gets combined rank 7 = 2 + 3 + 2; one ranked 35 / 19 / 17 gets 71; one ranked 1 / 2 / 3 gets 6]
(Reichart et al, 2008)
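A minimal sketch of the combined-rank selection, where each ranked list orders the same candidate sentences for one source language:

```python
def combined_rank(ranked_lists, budget):
    """Select the sentences with the smallest sum of ranks across all lists."""
    totals = {}
    for ranking in ranked_lists:          # one ranked list per source language Fi
        for rank, sent in enumerate(ranking, start=1):
            totals[sent] = totals.get(sent, 0) + rank
    return sorted(totals, key=totals.get)[:budget]
```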
129
Selecting Multilingual Sents. (III)
• Disagreement Method: score sentences by the disagreement among the generated translations, using the pairwise BLEU scores of the translations, or the sum of BLEU scores against a consensus translation
[Diagram: source sentences in F1, F2, F3 are decoded into translations E1, E2, E3, from which a consensus translation is built]
131
Re-training the SMT Models (I)
We use two phrase tables in each SMT model M_{Fi→E}:
Phrase Table 1: trained on sentences for which we have the true translations
Phrase Table 2: trained on sentences with their generated translations (self-training)
132
Re-training the SMT Models (II)
Phrase Table 2: we can instead train it on the consensus translations (co-training)
[Diagram: translations E1, E2, E3 of the Fi sentences are combined into E_consensus, which is paired with Fi to build Phrase Table 2]
133
Experimental Setup
We want to add English to a multilingual parallel corpus containing the Germanic languages of EuroParl: German, Dutch, Danish, Swedish
Sizes of the dataset and selected sentences: initially there are 5k multilingual sentences parallel to English sentences, and 20k parallel sentences in the multilingual corpora
10 AL iterations, selecting 500 sentences in each iteration
We use Portage from NRC as the underlying SMT system (Ueffing et al, 2007b)
134
Self-training vs Co-training
Germanic languages to English
[Figure: co-training mode outperforms self-training mode, BLEU 20.20 vs. 19.75]
135
Germanic Languages to English
Method → Self-Training (WER / PER / BLEU) vs. Co-Training (WER / PER / BLEU):
Combined Rank → 41.0 / 30.2 / 19.9 vs. 40.1 / 30.1 / 20.2
Alternate → 40.2 / 30.0 / 20.0 vs. 40.0 / 29.6 / 20.3
Random → 41.6 / 31.0 / 19.4 vs. 40.5 / 30.7 / 20.2
WER: lower is better (word error rate); PER: lower is better (position-independent WER); BLEU: higher is better
Bold: best result, italic: significantly better
136
Conclusion
[Diagram, repeating the map from the beginning of the talk: small source-target bitext → MT system; large comparable source-target bitext → parallel sentence extraction; large source monotext → bilingual dictionary induction, semi-supervised/active learning; source-another-language bitext → paraphrasing; source-another, another-target, and source-target bitexts → triangulation / co-training]
137
Finish
138
References
(Blum & Mitchell 1998) A. Blum and T. Mitchell, “Combining Labeled and Unlabeled Data with Co-Training”, COLT.
(Callison-Burch 2007) C. Callison-Burch, “Paraphrasing and Translation”, PhD thesis, University of Edinburgh.
(Callison-Burch 2003) C. Callison-Burch, “Co-Training for Statistical Machine Translation”, Master's thesis, University of Edinburgh.
(Cohn & Lapata 2007) T. Cohn and M. Lapata, “Machine Translation by Triangulation: Making Effective Use of Multi-Parallel Corpora”, ACL.
(Dempster et al 1977) A. P. Dempster, N. M. Laird, D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society. Series B.
(Fraser & Marcu 2006a) A. Fraser and D. Marcu, “Semi-Supervised Training for Statistical Word Alignment”, ACL.
139
References
(Fraser & Marcu 2006b) A. Fraser and D. Marcu, “Measuring Word Alignment Quality for Statistical Machine Translation”, Technical Report ISI-TR-616, ISI/University of Southern California.
(Fung & Cheung 2004) P. Fung and P. Cheung, “Mining very non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM”, EMNLP.
(Garera et al 2009) N. Garera, C. Callison-Burch and D. Yarowsky, “Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences”, CoNLL.
(Haffari et al 2009) G. Haffari, M. Roy, A. Sarkar, “Active Learning for Statistical Phrase-based Machine Translation”, NAACL.
(Haffari & Sarkar 2009) G. Haffari and A. Sarkar, “Active Learning for Multilingual Statistical Machine Translation”, ACL-IJCNLP.
(Haghighi et al 2008) A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein, ”Learning bilingual lexicons from monolingual Corpora”, ACL.
140
References
(Ganchev et al 2008) K. Ganchev, J. Graca and B. Taskar, “Better Alignments = Better Translations?”, ACL.
(Koehn & Knight 2002) P. Koehn and K. Knight, ”Learning a translation lexicon from monolingual corpora”, ACL Workshop on Unsupervised Lexical Acquisition.
(Mann & Yarowsky 2001) G. Mann and D. Yarowsky, “Multi-path translation lexicon induction via bridge languages”, NAACL.
(Munteanu & Marcu 2006) D. Munteanu and D. Marcu, “Extracting Parallel Sub-Sentential Fragments from Non-Parallel Corpora”, COLING-ACL.
(Marton et al 2009) Y. Marton, C. Callison-Burch and P. Resnik, “Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases ”, EMNLP.
(Munteanu & Marcu, 2005) D. Munteanu and D. Marcu, “Improving Machine Translation Performance by Exploiting Non-parallel Corpora”, Computational Linguistics, 31(4).
141
References
(Rapp 1999) R. Rapp, “Automatic identification of word translations from unrelated English and German corpora”, ACL.
(Reichart et al 2008) R. Reichart, K. Tomanek, U. Hahn and A. Rappoport, “Multi-Task Active Learning for Linguistic Annotations”, ACL.
(Schafer & Yarowsky 2002) C. Schafer and D. Yarowsky, “Inducing translation lexicons via diverse similarity measures and bridge languages”, COLING.
(Ueffing & Ney 2007) N. Ueffing and H. Ney, “Word-Level Confidence Estimation for Machine Translation”, Computational Linguistics.
(Ueffing et al 2007a) N. Ueffing, G.R. Haffari, A. Sarkar, “Transductive Learning for Statistical Machine Translation”, ACL.
(Ueffing et al 2007b) N. Ueffing, M. Simard, S. Larkin, and J. H. Johnson, “NRC’s Portage system for WMT 2007”, ACL Workshop on SMT.
142
References
(Ueffing et al 2008) N. Ueffing, G.R. Haffari, A. Sarkar, “Semi-supervised model adaptation for statistical machine translation”, Machine Translation journal.
(Yarowsky 1995) D. Yarowsky, “Unsupervised Word Sense Disambiguation Rivaling Supervised Methods”, ACL.