(Statistical) Approaches to Word Alignment
11-734 Advanced Machine Translation Seminar
Sanjika Hewavitharana, Language Technologies Institute, Carnegie Mellon University
02/02/2006
Word Alignment Models
We want to learn how to translate words and phrases
Can learn it from parallel corpora
Typically work with sentence-aligned corpora
  Available from LDC, etc.
  For specific applications, new data collection is required
Model the associations between the different languages
  Word-to-word mapping -> lexicon
  Differences in word order -> distortion model
  'Wordiness', i.e. how many words to express a concept -> fertility
Statistical translation is based on word alignment models
Alignment Example
Observations:
  Often 1-1
  Often monotone
  Some 1-to-many
  Some 1-to-nothing
Word Alignment Models
IBM1 – lexical probabilities only
IBM2 – lexicon plus absolute position
IBM3 – plus fertilities
IBM4 – inverted relative position alignment
IBM5 – non-deficient version of Model 4
HMM – lexicon plus relative position
BiBr – bilingual bracketing: lexical probabilities plus reordering via parallel segmentation
Syntactic alignment models
[Brown et al. 1993, Vogel et al. 1996, Och et al. 1999, Wu 1997, Yamada et al. 2003]
Notation
Source language:
  f : source (French) word
  J : length of the source sentence
  j : position in the source sentence; j = 1, 2, ..., J
  $f_1^J = f_1 \ldots f_j \ldots f_J$ : source sentence
Target language:
  e : target (English) word
  I : length of the target sentence
  i : position in the target sentence; i = 1, 2, ..., I
  $e_1^I = e_1 \ldots e_i \ldots e_I$ : target sentence
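This notation maps directly onto simple data structures; a minimal Python sketch (the sentence pair and all variable names here are purely illustrative, not from the slides):

# Source sentence f_1..f_J and target sentence e_1..e_I.
# Python lists are 0-indexed, so f[j-1] holds f_j and e[i-1] holds e_i.
f = ["la", "maison", "bleue"]   # source sentence, J = 3
e = ["the", "blue", "house"]    # target sentence, I = 3

J, I = len(f), len(e)

# An alignment a_1..a_J maps each source position j to a target position i
# (0 will later denote the empty word e_0).
a = [1, 3, 2]                   # f_1 -> e_1, f_2 -> e_3, f_3 -> e_2

for j, i in enumerate(a, start=1):
    print(f"f_{j} = {f[j-1]!r}  aligned to  e_{i} = {e[i-1]!r}")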
SMT - Principle
Translate a 'French' string $f_1^J = f_1 \ldots f_J$ into an 'English' string $e_1^I = e_1 \ldots e_I$
Based on the noisy channel model; we will call f the source and e the target
Bayes' decision rule for translation:
$\hat{e}_1^I = \arg\max_{e_1^I} \Pr(e_1^I \mid f_1^J) = \arg\max_{e_1^I} \{ \Pr(e_1^I) \cdot \Pr(f_1^J \mid e_1^I) \}$
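A toy illustration of the decision rule; the two probability tables below are invented numbers for the example, not real model output:

# Toy noisy-channel decoding: pick e maximizing Pr(e) * Pr(f|e).
# All probabilities here are made up for illustration.
f = "la maison"

lm = {"the house": 0.6, "house the": 0.1}        # Pr(e), language model
tm = {("la maison", "the house"): 0.5,           # Pr(f|e), translation model
      ("la maison", "house the"): 0.5}

e_hat = max(lm, key=lambda e: lm[e] * tm.get((f, e), 0.0))
print(e_hat)   # "the house": 0.6 * 0.5 beats 0.1 * 0.5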
Alignment as Hidden Variable
‘Hidden alignments’ to capture word-to-word correspondences
Number of possible connections: J · I (each source word with each target word)
An unrestricted alignment is any subset $A \subseteq \{(j, i) \mid j = 1, \ldots, J;\ i = 1, \ldots, I\}$
Number of alignments: $2^{J \cdot I}$
Restricted alignment:
  Each source word has exactly one connection
  Alignment becomes a function: $a_j = i$ gives the position $i$ of the target word $e_i$ connected to $f_j$
  Whole alignment: $a_1^J = a_1 \ldots a_j \ldots a_J$
  Number of alignments is now $I^J$ (e.g. for J = I = 10, this drops from $2^{100}$ to $10^{10}$)
Relationship between translation model and alignment model:
$\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I)$
Empty Position (Null Word)
Sometimes a word has no correspondence
The alignment function aligns each source word to exactly one target word, i.e. it cannot skip a source word
Solution:
  Introduce an empty position 0 with the null word $e_0$
  'Skip' a source word $f_j$ by aligning it to $e_0$
Target sentence is extended to: $e_0^I = e_0 e_1 \ldots e_I$
Alignment is extended to: $a_1^J$ with $a_j \in \{0, 1, \ldots, I\}$
Translation Model
Sum over all possible alignments:
$\Pr(f_1^J \mid e_0^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_0^I)$
Decompose into 3 probability distributions:
$\Pr(f_1^J, a_1^J \mid e_0^I) = \Pr(J \mid e_0^I) \cdot \Pr(a_1^J \mid J, e_0^I) \cdot \Pr(f_1^J \mid a_1^J, J, e_0^I)$
Length: $\Pr(J \mid e_0^I)$
Alignment: $\Pr(a_1^J \mid J, e_0^I) = \prod_{j=1}^{J} \Pr(a_j \mid a_1^{j-1}, J, e_0^I)$
Lexicon: $\Pr(f_1^J \mid a_1^J, J, e_0^I) = \prod_{j=1}^{J} \Pr(f_j \mid f_1^{j-1}, a_1^J, J, e_0^I)$
Model Assumptions
Decompose interaction into pairwise dependencies
Length: source length only dependent on target length (very weak):
  $\Pr(J \mid e_0^I) = p(J \mid I)$
Alignment:
  Zero order model: target position only dependent on source position:
  $\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid j, J, I)$
  First order model: target position only dependent on previous target position:
  $\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid a_{j-1}, J, I)$
Lexicon: source word only dependent on the aligned word:
  $\Pr(f_j \mid f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j \mid e_{a_j})$
IBM Model 1
Length: source length only dependent on target length:
  $\Pr(J \mid e_0^I) = p(J \mid I)$
Alignment: assume a uniform probability for the position alignment:
  $p(i \mid j, J, I) = \frac{1}{I+1}$
Lexicon: source word only dependent on the aligned word:
  $\Pr(f_j \mid f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j \mid e_{a_j})$
Alignment probability:
$\Pr(f_1^J \mid e_1^I) = p(J \mid I) \prod_{j=1}^{J} \sum_{i=0}^{I} p(i \mid j, J, I) \cdot p(f_j \mid e_i) = \frac{p(J \mid I)}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} p(f_j \mid e_i)$
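The product-of-sums form makes Model 1 cheap to evaluate directly; a minimal Python sketch assuming a t-table dict keyed on word pairs (the table format and the 1e-12 floor that avoids log(0) are choices of this illustration):

import math

def model1_logprob(f, e, t, epsilon=1.0):
    """log Pr(f_1^J | e_1^I) under IBM Model 1.

    f, e: source/target word lists; e must include the null word at index 0.
    t: dict mapping (f_word, e_word) -> t(f|e).
    epsilon stands in for the length model p(J|I)."""
    I = len(e) - 1                      # e[0] is the null word
    J = len(f)
    logp = math.log(epsilon) - J * math.log(I + 1)
    for fj in f:
        # Inner sum of the formula: over all target positions i = 0..I.
        logp += math.log(sum(t.get((fj, ei), 1e-12) for ei in e))
    return logp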
IBM Model 1 – Generative Process
To generate a French string $f_1^J$ from an English string $e_1^I$:
Step 1: Pick the length $J$ of $f_1^J$; all lengths are equally probable, so $p(J \mid I)$ is a constant
Step 2: Pick an alignment $a_1^J$ with probability $\frac{1}{(I+1)^J}$
Step 3: Pick the French words with probability $\Pr(f_1^J \mid a_1^J, e_1^I) = \prod_{j=1}^{J} p(f_j \mid e_{a_j})$
Final result:
$\Pr(f_1^J \mid e_1^I) = \frac{p(J \mid I)}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} p(f_j \mid e_i)$
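The generative story can be run directly as a sampler; a sketch assuming each t-table row is a proper distribution (the table format and `max_len` are assumptions of this illustration):

import random

def model1_sample(e, t, max_len=10):
    """Sample a source sentence from IBM Model 1 given target e (null word at e[0]).

    t: dict mapping e_word (including the null word) -> list of (f_word, prob)
       pairs summing to 1."""
    J = random.randint(1, max_len)                         # Step 1: length, uniform
    a = [random.randint(0, len(e) - 1) for _ in range(J)]  # Step 2: uniform alignment
    f = []
    for aj in a:                                           # Step 3: words from t(.|e_aj)
        words, probs = zip(*t[e[aj]])
        f.append(random.choices(words, weights=probs)[0])
    return f, a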
IBM Model 1 – Training
Parameters of the model: $p(f \mid e) = t(f \mid e)$
Training data: parallel sentence pairs $(f_1^J, e_1^I)$
We adjust the parameters to maximize $\sum \log \Pr(f_1^J \mid e_1^I)$ over the corpus, normalized for each $e$: $\sum_f t(f \mid e) = 1$
EM algorithm used for the estimation:
  Initialize the parameters uniformly
  Collect counts for each pair $(f, e)$ in the corpus
  Re-estimate the parameters using the counts
  Repeat for several iterations
Model is simple enough to sum over all alignments
Parameters do not depend on the initial values
IBM Model 1 Training– Pseudo Code
# Accumulation (over corpus)
For each sentence pair
  For each source position j
    Sum = 0.0
    For each target position i
      Sum += p(f_j | e_i)
    For each target position i
      Count(f_j, e_i) += p(f_j | e_i) / Sum

# Re-estimate probabilities (over count table)
For each target word e
  Sum = 0.0
  For each source word f
    Sum += Count(f, e)
  For each source word f
    p(f|e) = Count(f, e) / Sum
# Repeat for several iterations
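A runnable Python version of the same loop, with the uniform initialization described on the previous slide (the toy corpus and the 'NULL' token name are illustrative):

from collections import defaultdict

def train_model1(corpus, iterations=5):
    """EM training for IBM Model 1. corpus: list of (f_words, e_words) pairs;
    each e_words should include the null word, e.g. 'NULL', at position 0."""
    # Initialize t(f|e) uniformly over the source vocabulary.
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))

    for _ in range(iterations):
        count = defaultdict(float)   # Count(f, e)
        total = defaultdict(float)   # sum of counts per e
        # Accumulation (over corpus)
        for fs, es in corpus:
            for fj in fs:
                norm = sum(t[(fj, ei)] for ei in es)
                for ei in es:
                    c = t[(fj, ei)] / norm
                    count[(fj, ei)] += c
                    total[ei] += c
        # Re-estimation (over count table)
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

corpus = [(["la", "maison"], ["NULL", "the", "house"]),
          (["la", "fleur"], ["NULL", "the", "flower"])]
t = train_model1(corpus)
print(round(t[("la", "the")], 3))   # should clearly exceed t[("la", "house")]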
IBM Model 2
Only Difference from Model 1 is in Alignment Probability
Length: source length only dependent on target length:
  $\Pr(J \mid e_0^I) = p(J \mid I)$
Alignment: target position depends on the source position (in addition to the source length and target length):
  $\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid j, J, I)$
Lexicon: source word only dependent on the aligned word:
  $\Pr(f_j \mid f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j \mid e_{a_j})$
Model 1 is a special case of Model 2, where $p(a_j \mid j, J, I) = \frac{1}{I+1}$
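Scoring under Model 2 differs from Model 1 only in the alignment term; a sketch assuming an alignment table keyed on (i, j, J, I) (that keying, and the 1e-12 floor, are choices of this illustration):

import math

def model2_logprob(f, e, t, align, epsilon=1.0):
    """log Pr(f_1^J | e_1^I) under IBM Model 2, summed over all alignments.

    align: dict mapping (i, j, J, I) -> p(i | j, J, I); e[0] is the null word."""
    I, J = len(e) - 1, len(f)
    logp = math.log(epsilon)
    for j, fj in enumerate(f, start=1):
        logp += math.log(sum(align.get((i, j, J, I), 1e-12) * t.get((fj, e[i]), 1e-12)
                             for i in range(I + 1)))
    return logp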
IBM Model 2 – Generative Process
To generate a French string $f_1^J$ from an English string $e_1^I$:
Step 1: Pick the length $J$ of $f_1^J$; all lengths are equally probable, so $p(J \mid I)$ is a constant
Step 2: Pick an alignment $a_1^J$ with probability $\prod_{j=1}^{J} p(a_j \mid j, J, I)$
Step 3: Pick the French words with probability $\Pr(f_1^J \mid a_1^J, e_1^I) = \prod_{j=1}^{J} p(f_j \mid e_{a_j})$
Final result:
$\Pr(f_1^J \mid e_1^I) = p(J \mid I) \prod_{j=1}^{J} \sum_{i=0}^{I} p(i \mid j, J, I) \cdot p(f_j \mid e_i)$
IBM Model 2 – Training
Parameters of the model:
  Translation: $p(f \mid e) = t(f \mid e)$
  Alignment: $p(a_j \mid j, J, I) = a(a_j \mid j, J, I)$
Training data: parallel sentence pairs $(f_1^J, e_1^I)$
We maximize $\sum \log \Pr(f_1^J \mid e_1^I)$ w.r.t. the translation and alignment parameters
EM algorithm used for the estimation:
  Initialize alignment parameters uniformly, and translation probabilities from Model 1
  Accumulate counts, re-estimate parameters
Model is simple enough to sum over all alignments
Fertility-based Alignment Models
Models 3-5 are based on fertility
Fertility: number of source words connected with a target word
  $\phi_i$: fertility value of $e_i$, i.e. $\phi_i = |\{j \mid a_j = i\}|$; whole sequence $\phi_1^I = \phi_1 \ldots \phi_i \ldots \phi_I$ for $e_1^I$
  $p(\phi \mid e)$: probability that $e$ is connected with $\phi$ source words
Alignment: defined in the reverse direction (target to source)
  $p(j \mid i, J, I)$: probability of French position $j$ given English position $i$
IBM Model 3 – Generative Process
To generate a French string $f_1^J$ from an English string $e_1^I$:
Step 1: Choose (I+1) fertilities $\phi_0^I$ with probability
$\Pr(\phi_0^I \mid e_1^I) = p(\phi_0 \mid \phi_1^I) \prod_{i=1}^{I} p(\phi_i \mid e_i)$
where the fertility $\phi_0$ of the null word $e_0$ depends on the other fertilities:
$p(\phi_0 \mid \phi_1^I) = \binom{J - \phi_0}{\phi_0} \, p_0^{J - 2\phi_0} \, p_1^{\phi_0}$
IBM Model 3 – Generative Process
Step 2: For each $i = 1 \ldots I$ and each $k = 1 \ldots \phi_i$, choose a source position $\pi_{i,k} \in \{1, \ldots, J\}$ and a French word $f_{i,k}$ with probability
$\prod_{i=1}^{I} \prod_{k=1}^{\phi_i} p(\pi_{i,k} \mid i, J, I) \cdot p(f_{i,k} \mid e_i)$
For a given alignment, there are $\prod_{i=0}^{I} \phi_i!$ orderings
Final result:
$\Pr(f_1^J, a_1^J \mid e_1^I) = p(\phi_0 \mid \phi_1^I) \prod_{i=1}^{I} \phi_i! \; p(\phi_i \mid e_i) \cdot \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} p(\pi_{i,k} \mid i, J, I) \cdot p(f_{i,k} \mid e_i)$
IBM Model 3 – Example
[e]                          e0 Mary did not slap the green witch
[choose fertility]           1 0 1 3 1 1 1
                             Mary not slap slap slap the green witch
[fertility for e0 = 1]       Mary not slap slap slap NULL the green witch
[choose translation]         Mary no daba una bofetada a la verde bruja
[choose target positions j]  Mary no daba una bofetada a la bruja verde
  j  = 1 2 3 4 5 6 7 8 9
  aj = 1 3 4 4 4 0 5 7 6

[Knight 99]
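The fertilities can be read back off the example alignment; a small bookkeeping snippet over the slide's alignment vector:

from collections import Counter

e = ["NULL", "Mary", "did", "not", "slap", "the", "green", "witch"]  # e_0..e_7
a = [1, 3, 4, 4, 4, 0, 5, 7, 6]      # a_j for j = 1..9

phi = Counter(a)                      # fertility phi_i = number of j with a_j = i
for i, word in enumerate(e):
    print(f"phi_{i}({word}) = {phi.get(i, 0)}")
# Mary:1, did:0, not:1, slap:3, the:1, green:1, witch:1, and 1 for NULL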
IBM Model 3 – Training
Parameters of the model:
  Lexicon: $p(f \mid e) = t(f \mid e)$
  Distortion: $p(j \mid i, J, I) = d(j \mid i, J, I)$
  Fertility: $p(\phi \mid e) = n(\phi \mid e)$
  Null fertility: $p_1$
EM algorithm used for the estimation:
  Not possible to compute exact EM updates
  Initialize n, d, p uniformly, and translation probabilities from Model 2
  Accumulate counts, re-estimate parameters
Cannot efficiently sum over all alignments; only the Viterbi alignment is used
Model 3 is deficient: probability mass is wasted on impossible translations
IBM Model 4
Tries to model the re-ordering of phrases
$p(j \mid i, J, I)$ is replaced with two sets of parameters:
  One for placing the first word (the head) of a group of words
  One for placing the rest of the words relative to the head
Deficient: the alignment can generate source positions outside of the sentence length J
Model 5 removes this deficiency
HMM Alignment Model
Idea: relative position model
[Figure: alignment as a path through the grid of source and target positions]
[Vogel 96]
HMM Alignment
First order model: target position dependent on previous target position (captures movement of entire phrases):
$\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid a_{j-1}, J, I)$
Alignment probability:
$\Pr(f_1^J \mid e_1^I) = p(J \mid I) \sum_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I) \cdot p(f_j \mid e_{a_j})$
Alignment depends only on the relative position (jump width):
$p(i \mid i', I) = \frac{c(i - i')}{\sum_{i''=1}^{I} c(i'' - i')}$
Maximum approximation:
$\Pr(f_1^J \mid e_1^I) \approx p(J \mid I) \max_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I) \cdot p(f_j \mid e_{a_j})$
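The maximum approximation is a standard Viterbi search over target positions; a minimal sketch assuming a lexicon table `trans` and a jump-width table `jump` (both hypothetical; the initial position is taken uniform to keep the sketch short):

import math

def hmm_viterbi_alignment(f, e, trans, jump):
    """Best alignment a_1^J under the HMM alignment model (maximum approximation).

    trans: dict (f_word, e_word) -> p(f|e)
    jump:  dict d -> p(a_j - a_{j-1} = d)"""
    I, J = len(e), len(f)

    def lp(x):                       # safe log for sparse tables
        return math.log(x) if x > 0 else -1e9

    # delta[i-1] = best log score of a prefix a_1..a_j ending with a_j = i
    delta = [lp(trans.get((f[0], ei), 0)) for ei in e]
    back = []
    for j in range(1, J):
        new, ptr = [], []
        for i in range(1, I + 1):
            prev = max(range(1, I + 1),
                       key=lambda ip: delta[ip - 1] + lp(jump.get(i - ip, 0)))
            new.append(delta[prev - 1] + lp(jump.get(i - prev, 0))
                       + lp(trans.get((f[j], e[i - 1]), 0)))
            ptr.append(prev)
        delta, back = new, back + [ptr]

    # Backtrace from the best final position
    a = [max(range(1, I + 1), key=lambda i: delta[i - 1])]
    for ptr in reversed(back):
        a.append(ptr[a[-1] - 1])
    return a[::-1]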
IBM2 vs HMM
[Figure: example alignments under IBM2 (absolute positions) vs. HMM (relative positions)]
[Vogel 96]
Enhancements to HMM & IBM Models
HMM model with empty word:
  Add I empty words to the target side
Model 6:
  IBM 4 predicts the distance between subsequent target positions
  HMM predicts the distance between subsequent source positions
  Model 6 is a log-linear combination of the IBM 4 and HMM models:
  $p_6(f, a \mid e) = \frac{p_4(f, a \mid e) \cdot p_{HMM}(f, a \mid e)}{\sum_{a', f'} p_4(f', a' \mid e) \cdot p_{HMM}(f', a' \mid e)}$
Smoothing:
  Alignment prob. – interpolate with a uniform distribution
  Fertility prob. – depends on the number of letters in a word
Symmetrization:
  Heuristic postprocessing to combine the alignments from both directions (see the sketch below)
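Symmetrization is typically a set operation over the two directed link sets; a minimal sketch of the intersection and union heuristics (the grow-diag refinements used in practice are omitted):

def symmetrize(src2tgt, tgt2src, method="intersection"):
    """Combine two directed alignments into one symmetric alignment.

    src2tgt: set of (j, i) links from the source->target alignment
    tgt2src: set of (j, i) links from the target->source alignment
             (already flipped into (j, i) order)."""
    if method == "intersection":
        return src2tgt & tgt2src      # high precision
    if method == "union":
        return src2tgt | tgt2src      # high recall
    raise ValueError(method)

# Example: two directed alignments that disagree on one link
a_fe = {(1, 1), (2, 3), (3, 2)}
a_ef = {(1, 1), (2, 3)}
print(symmetrize(a_fe, a_ef))             # {(1, 1), (2, 3)}
print(symmetrize(a_fe, a_ef, "union"))    # adds (3, 2)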
Experimental Results [Franz 03]
Refined models perform better:
  Models 4, 5, 6 better than Model 1 or the Dice coefficient model
  HMM better than IBM 2
Alignment quality depends on the training method and the bootstrapping scheme used:
  IBM 1 -> HMM -> IBM 3 better than IBM 1 -> IBM 2 -> IBM 3
Smoothing and symmetrization have a significant effect on alignment quality
More alignments in training yield better results
Using word classes: improvement for large corpora but not for small corpora
References:
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L. Mercer (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2.
Stephan Vogel, Hermann Ney, Christoph Tillmann (1996). HMM-based Word Alignment in Statistical Translation. COLING '96: The 16th International Conference on Computational Linguistics, Copenhagen, Denmark, August, pp. 836-841.
Franz Josef Och, Hermann Ney (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, vol. 29, no. 1, pp. 19-51.
Kevin Knight (1999). A Statistical MT Tutorial Workbook. Available at http://www.isi.edu/natural-language/mt/wkbk.rtf.