(Statistical) Approaches to Word Alignment
11-734 Advanced Machine Translation Seminar
Sanjika Hewavitharana, Language Technologies Institute, Carnegie Mellon University
02/02/2006
Word Alignment Models
We want to learn how to translate words and phrases
Can learn it from parallel corpora
Typically work with sentence-aligned corpora
  Available from LDC, etc.
  For specific applications, new data collection is required
Model the associations between the different languages
  Word-to-word mapping -> lexicon
  Differences in word order -> distortion model
  'Wordiness', i.e. how many words to express a concept -> fertility
Statistical translation is based on word alignment models
Alignment Example
Observations:
  Often 1-1
  Often monotone
  Some 1-to-many
  Some 1-to-nothing
Word Alignment Models
IBM1 – lexical probabilities only
IBM2 – lexicon plus absolute position
IBM3 – plus fertilities
IBM4 – inverted relative position alignment
IBM5 – non-deficient version of Model 4
HMM – lexicon plus relative position
BiBr – bilingual bracketing: lexical probabilities plus reordering via parallel segmentation
Syntactic alignment models
[Brown et al. 1993, Vogel et al. 1996, Och et al. 1999, Wu 1997, Yamada et al. 2003]
Notation
Source language:
  f : source (French) word
  J : length of the source sentence
  j : position in the source sentence; j = 1, 2, ..., J
  $f_1^J = f_1 \ldots f_j \ldots f_J$ : source sentence
Target language:
  e : target (English) word
  I : length of the target sentence
  i : position in the target sentence; i = 1, 2, ..., I
  $e_1^I = e_1 \ldots e_i \ldots e_I$ : target sentence
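This notation maps directly onto simple data structures; a minimal Python sketch (the sentence pair and all variable names here are purely illustrative, not from the slides):

# Source sentence f_1..f_J and target sentence e_1..e_I.
# Python lists are 0-indexed, so f[j-1] holds f_j and e[i-1] holds e_i.
f = ["la", "maison", "bleue"]   # source sentence, J = 3
e = ["the", "blue", "house"]    # target sentence, I = 3

J, I = len(f), len(e)

# An alignment a_1..a_J maps each source position j to a target position i
# (0 will later denote the empty word e_0).
a = [1, 3, 2]                   # f_1 -> e_1, f_2 -> e_3, f_3 -> e_2

for j, i in enumerate(a, start=1):
    print(f"f_{j} = {f[j-1]!r}  aligned to  e_{i} = {e[i-1]!r}")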
SMT - Principle
Translate a 'French' string $f_1^J = f_1 \ldots f_J$ into an 'English' string $e_1^I = e_1 \ldots e_I$
Based on the noisy channel model; we will call f the source and e the target
Bayes' decision rule for translation:
$\hat{e}_1^I = \arg\max_{e_1^I} \Pr(e_1^I \mid f_1^J) = \arg\max_{e_1^I} \{ \Pr(e_1^I) \cdot \Pr(f_1^J \mid e_1^I) \}$
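A toy illustration of the decision rule; the two probability tables below are invented numbers for the example, not real model output:

# Toy noisy-channel decoding: pick e maximizing Pr(e) * Pr(f|e).
# All probabilities here are made up for illustration.
f = "la maison"

lm = {"the house": 0.6, "house the": 0.1}        # Pr(e), language model
tm = {("la maison", "the house"): 0.5,           # Pr(f|e), translation model
      ("la maison", "house the"): 0.5}

e_hat = max(lm, key=lambda e: lm[e] * tm.get((f, e), 0.0))
print(e_hat)   # "the house": 0.6 * 0.5 beats 0.1 * 0.5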
Alignment as Hidden Variable
‘Hidden alignments’ to capture word-to-word correspondences
Number of possible connections: J · I (each source word with each target word)
An unrestricted alignment is any subset $A \subseteq \{(j, i) \mid j = 1, \ldots, J;\ i = 1, \ldots, I\}$
Number of alignments: $2^{J \cdot I}$
Restricted alignment:
  Each source word has exactly one connection
  Alignment becomes a function: $a_j = i$ gives the position $i$ of the target word $e_i$ connected to $f_j$
  Whole alignment: $a_1^J = a_1 \ldots a_j \ldots a_J$
  Number of alignments is now $I^J$ (e.g. for J = I = 10, this drops from $2^{100}$ to $10^{10}$)
Relationship between translation model and alignment model:
$\Pr(f_1^J \mid e_1^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_1^I)$
Empty Position (Null Word)
Sometimes a word has no correspondence
The alignment function aligns each source word to exactly one target word, i.e. it cannot skip a source word
Solution:
  Introduce an empty position 0 with the null word $e_0$
  'Skip' a source word $f_j$ by aligning it to $e_0$
Target sentence is extended to: $e_0^I = e_0 e_1 \ldots e_I$
Alignment is extended to: $a_1^J$ with $a_j \in \{0, 1, \ldots, I\}$
Translation Model
Sum over all possible alignments:
$\Pr(f_1^J \mid e_0^I) = \sum_{a_1^J} \Pr(f_1^J, a_1^J \mid e_0^I)$
Decompose into 3 probability distributions:
$\Pr(f_1^J, a_1^J \mid e_0^I) = \Pr(J \mid e_0^I) \cdot \Pr(a_1^J \mid J, e_0^I) \cdot \Pr(f_1^J \mid a_1^J, J, e_0^I)$
Length: $\Pr(J \mid e_0^I)$
Alignment: $\Pr(a_1^J \mid J, e_0^I) = \prod_{j=1}^{J} \Pr(a_j \mid a_1^{j-1}, J, e_0^I)$
Lexicon: $\Pr(f_1^J \mid a_1^J, J, e_0^I) = \prod_{j=1}^{J} \Pr(f_j \mid f_1^{j-1}, a_1^J, J, e_0^I)$
Model Assumptions
Decompose interaction into pairwise dependencies
Length: source length only dependent on target length (very weak):
  $\Pr(J \mid e_0^I) = p(J \mid I)$
Alignment:
  Zero order model: target position only dependent on source position:
  $\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid j, J, I)$
  First order model: target position only dependent on previous target position:
  $\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid a_{j-1}, J, I)$
Lexicon: source word only dependent on the aligned word:
  $\Pr(f_j \mid f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j \mid e_{a_j})$
IBM Model 1
Length: source length only dependent on target length:
  $\Pr(J \mid e_0^I) = p(J \mid I)$
Alignment: assume a uniform probability for the position alignment:
  $p(i \mid j, J, I) = \frac{1}{I+1}$
Lexicon: source word only dependent on the aligned word:
  $\Pr(f_j \mid f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j \mid e_{a_j})$
Alignment probability:
$\Pr(f_1^J \mid e_1^I) = p(J \mid I) \prod_{j=1}^{J} \sum_{i=0}^{I} p(i \mid j, J, I) \cdot p(f_j \mid e_i) = \frac{p(J \mid I)}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} p(f_j \mid e_i)$
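The product-of-sums form makes Model 1 cheap to evaluate directly; a minimal Python sketch assuming a t-table dict keyed on word pairs (the table format and the 1e-12 floor that avoids log(0) are choices of this illustration):

import math

def model1_logprob(f, e, t, epsilon=1.0):
    """log Pr(f_1^J | e_1^I) under IBM Model 1.

    f, e: source/target word lists; e must include the null word at index 0.
    t: dict mapping (f_word, e_word) -> t(f|e).
    epsilon stands in for the length model p(J|I)."""
    I = len(e) - 1                      # e[0] is the null word
    J = len(f)
    logp = math.log(epsilon) - J * math.log(I + 1)
    for fj in f:
        # Inner sum of the formula: over all target positions i = 0..I.
        logp += math.log(sum(t.get((fj, ei), 1e-12) for ei in e))
    return logp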
IBM Model 1 – Generative Process
To generate a French string $f_1^J$ from an English string $e_1^I$:
Step 1: Pick the length $J$ of $f_1^J$; all lengths are equally probable, so $p(J \mid I)$ is a constant
Step 2: Pick an alignment $a_1^J$ with probability $\frac{1}{(I+1)^J}$
Step 3: Pick the French words with probability $\Pr(f_1^J \mid a_1^J, e_1^I) = \prod_{j=1}^{J} p(f_j \mid e_{a_j})$
Final result:
$\Pr(f_1^J \mid e_1^I) = \frac{p(J \mid I)}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} p(f_j \mid e_i)$
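The generative story can be run directly as a sampler; a sketch assuming each t-table row is a proper distribution (the table format and `max_len` are assumptions of this illustration):

import random

def model1_sample(e, t, max_len=10):
    """Sample a source sentence from IBM Model 1 given target e (null word at e[0]).

    t: dict mapping e_word (including the null word) -> list of (f_word, prob)
       pairs summing to 1."""
    J = random.randint(1, max_len)                         # Step 1: length, uniform
    a = [random.randint(0, len(e) - 1) for _ in range(J)]  # Step 2: uniform alignment
    f = []
    for aj in a:                                           # Step 3: words from t(.|e_aj)
        words, probs = zip(*t[e[aj]])
        f.append(random.choices(words, weights=probs)[0])
    return f, a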
IBM Model 1 – Training
Parameters of the model: $p(f \mid e) = t(f \mid e)$
Training data: parallel sentence pairs $(f_1^J, e_1^I)$
We adjust the parameters to maximize $\sum \log \Pr(f_1^J \mid e_1^I)$ over the corpus, normalized for each $e$: $\sum_f t(f \mid e) = 1$
EM algorithm used for the estimation:
  Initialize the parameters uniformly
  Collect counts for each pair $(f, e)$ in the corpus
  Re-estimate the parameters using the counts
  Repeat for several iterations
Model is simple enough to sum over all alignments
Parameters do not depend on the initial values
IBM Model 1 Training– Pseudo Code
# Accumulation (over corpus)
For each sentence pair
  For each source position j
    Sum = 0.0
    For each target position i
      Sum += p(f_j | e_i)
    For each target position i
      Count(f_j, e_i) += p(f_j | e_i) / Sum

# Re-estimate probabilities (over count table)
For each target word e
  Sum = 0.0
  For each source word f
    Sum += Count(f, e)
  For each source word f
    p(f|e) = Count(f, e) / Sum
# Repeat for several iterations
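A runnable Python version of the same loop, with the uniform initialization described on the previous slide (the toy corpus and the 'NULL' token name are illustrative):

from collections import defaultdict

def train_model1(corpus, iterations=5):
    """EM training for IBM Model 1. corpus: list of (f_words, e_words) pairs;
    each e_words should include the null word, e.g. 'NULL', at position 0."""
    # Initialize t(f|e) uniformly over the source vocabulary.
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))

    for _ in range(iterations):
        count = defaultdict(float)   # Count(f, e)
        total = defaultdict(float)   # sum of counts per e
        # Accumulation (over corpus)
        for fs, es in corpus:
            for fj in fs:
                norm = sum(t[(fj, ei)] for ei in es)
                for ei in es:
                    c = t[(fj, ei)] / norm
                    count[(fj, ei)] += c
                    total[ei] += c
        # Re-estimation (over count table)
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

corpus = [(["la", "maison"], ["NULL", "the", "house"]),
          (["la", "fleur"], ["NULL", "the", "flower"])]
t = train_model1(corpus)
print(round(t[("la", "the")], 3))   # should clearly exceed t[("la", "house")]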
IBM Model 2
Only Difference from Model 1 is in Alignment Probability
Length: source length only dependent on target length:
  $\Pr(J \mid e_0^I) = p(J \mid I)$
Alignment: target position depends on the source position (in addition to the source length and target length):
  $\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid j, J, I)$
Lexicon: source word only dependent on the aligned word:
  $\Pr(f_j \mid f_1^{j-1}, a_1^J, J, e_0^I) = p(f_j \mid e_{a_j})$
Model 1 is a special case of Model 2, where $p(a_j \mid j, J, I) = \frac{1}{I+1}$
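Scoring under Model 2 differs from Model 1 only in the alignment term; a sketch assuming an alignment table keyed on (i, j, J, I) (that keying, and the 1e-12 floor, are choices of this illustration):

import math

def model2_logprob(f, e, t, align, epsilon=1.0):
    """log Pr(f_1^J | e_1^I) under IBM Model 2, summed over all alignments.

    align: dict mapping (i, j, J, I) -> p(i | j, J, I); e[0] is the null word."""
    I, J = len(e) - 1, len(f)
    logp = math.log(epsilon)
    for j, fj in enumerate(f, start=1):
        logp += math.log(sum(align.get((i, j, J, I), 1e-12) * t.get((fj, e[i]), 1e-12)
                             for i in range(I + 1)))
    return logp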
IBM Model 2 – Generative Process
To generate a French string $f_1^J$ from an English string $e_1^I$:
Step 1: Pick the length $J$ of $f_1^J$; all lengths are equally probable, so $p(J \mid I)$ is a constant
Step 2: Pick an alignment $a_1^J$ with probability $\prod_{j=1}^{J} p(a_j \mid j, J, I)$
Step 3: Pick the French words with probability $\Pr(f_1^J \mid a_1^J, e_1^I) = \prod_{j=1}^{J} p(f_j \mid e_{a_j})$
Final result:
$\Pr(f_1^J \mid e_1^I) = p(J \mid I) \prod_{j=1}^{J} \sum_{i=0}^{I} p(i \mid j, J, I) \cdot p(f_j \mid e_i)$
IBM Model 2 – Training
Parameters of the model:
  Translation: $p(f \mid e) = t(f \mid e)$
  Alignment: $p(a_j \mid j, J, I) = a(a_j \mid j, J, I)$
Training data: parallel sentence pairs $(f_1^J, e_1^I)$
We maximize $\sum \log \Pr(f_1^J \mid e_1^I)$ w.r.t. the translation and alignment parameters
EM algorithm used for the estimation:
  Initialize alignment parameters uniformly, and translation probabilities from Model 1
  Accumulate counts, re-estimate parameters
Model is simple enough to sum over all alignments
Fertility-based Alignment Models
Models 3-5 are based on fertility
Fertility: number of source words connected with a target word
  $\phi_i$: fertility value of $e_i$, i.e. $\phi_i = |\{j \mid a_j = i\}|$; whole sequence $\phi_1^I = \phi_1 \ldots \phi_i \ldots \phi_I$ for $e_1^I$
  $p(\phi \mid e)$: probability that $e$ is connected with $\phi$ source words
Alignment: defined in the reverse direction (target to source)
  $p(j \mid i, J, I)$: probability of French position $j$ given English position $i$
IBM Model 3 – Generative Process
To generate a French string $f_1^J$ from an English string $e_1^I$:
Step 1: Choose (I+1) fertilities $\phi_0^I$ with probability
$\Pr(\phi_0^I \mid e_1^I) = p(\phi_0 \mid \phi_1^I) \prod_{i=1}^{I} p(\phi_i \mid e_i)$
where the fertility $\phi_0$ of the null word $e_0$ depends on the other fertilities:
$p(\phi_0 \mid \phi_1^I) = \binom{J - \phi_0}{\phi_0} \, p_0^{J - 2\phi_0} \, p_1^{\phi_0}$
IBM Model 3 – Generative Process
Step 2: For each $i = 1 \ldots I$ and each $k = 1 \ldots \phi_i$, choose a source position $\pi_{i,k} \in \{1, \ldots, J\}$ and a French word $f_{i,k}$ with probability
$\prod_{i=1}^{I} \prod_{k=1}^{\phi_i} p(\pi_{i,k} \mid i, J, I) \cdot p(f_{i,k} \mid e_i)$
For a given alignment, there are $\prod_{i=0}^{I} \phi_i!$ orderings
Final result:
$\Pr(f_1^J, a_1^J \mid e_1^I) = p(\phi_0 \mid \phi_1^I) \prod_{i=1}^{I} \phi_i! \; p(\phi_i \mid e_i) \cdot \prod_{i=1}^{I} \prod_{k=1}^{\phi_i} p(\pi_{i,k} \mid i, J, I) \cdot p(f_{i,k} \mid e_i)$
IBM Model 3 – Example
[e]                          e0 Mary did not slap the green witch
[choose fertility]           1 0 1 3 1 1 1
                             Mary not slap slap slap the green witch
[fertility for e0 = 1]       Mary not slap slap slap NULL the green witch
[choose translation]         Mary no daba una bofetada a la verde bruja
[choose target positions j]  Mary no daba una bofetada a la bruja verde
  j  = 1 2 3 4 5 6 7 8 9
  aj = 1 3 4 4 4 0 5 7 6

[Knight 99]
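The fertilities can be read back off the example alignment; a small bookkeeping snippet over the slide's alignment vector:

from collections import Counter

e = ["NULL", "Mary", "did", "not", "slap", "the", "green", "witch"]  # e_0..e_7
a = [1, 3, 4, 4, 4, 0, 5, 7, 6]      # a_j for j = 1..9

phi = Counter(a)                      # fertility phi_i = number of j with a_j = i
for i, word in enumerate(e):
    print(f"phi_{i}({word}) = {phi.get(i, 0)}")
# Mary:1, did:0, not:1, slap:3, the:1, green:1, witch:1, and 1 for NULL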
IBM Model 3 – Training
Parameters of the model:
  Lexicon: $p(f \mid e) = t(f \mid e)$
  Distortion: $p(j \mid i, J, I) = d(j \mid i, J, I)$
  Fertility: $p(\phi \mid e) = n(\phi \mid e)$
  Null fertility: $p_1$
EM algorithm used for the estimation:
  Not possible to compute exact EM updates
  Initialize n, d, p uniformly, and translation probabilities from Model 2
  Accumulate counts, re-estimate parameters
Cannot efficiently sum over all alignments; only the Viterbi alignment is used
Model 3 is deficient: probability mass is wasted on impossible translations
IBM Model 4
Tries to model the re-ordering of phrases
$p(j \mid i, J, I)$ is replaced with two sets of parameters:
  One for placing the first word (the head) of a group of words
  One for placing the rest of the words relative to the head
Deficient: the alignment can generate source positions outside of the sentence length J
Model 5 removes this deficiency
HMM Alignment Model
Idea: relative position model
[Figure: alignment as a path through the grid of source and target positions]
[Vogel 96]
HMM Alignment
First order model: target position dependent on previous target position (captures movement of entire phrases):
$\Pr(a_j \mid a_1^{j-1}, J, e_0^I) = p(a_j \mid a_{j-1}, J, I)$
Alignment probability:
$\Pr(f_1^J \mid e_1^I) = p(J \mid I) \sum_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I) \cdot p(f_j \mid e_{a_j})$
Alignment depends only on the relative position (jump width):
$p(i \mid i', I) = \frac{c(i - i')}{\sum_{i''=1}^{I} c(i'' - i')}$
Maximum approximation:
$\Pr(f_1^J \mid e_1^I) \approx p(J \mid I) \max_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1}, I) \cdot p(f_j \mid e_{a_j})$
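The maximum approximation is a standard Viterbi search over target positions; a minimal sketch assuming a lexicon table `trans` and a jump-width table `jump` (both hypothetical; the initial position is taken uniform to keep the sketch short):

import math

def hmm_viterbi_alignment(f, e, trans, jump):
    """Best alignment a_1^J under the HMM alignment model (maximum approximation).

    trans: dict (f_word, e_word) -> p(f|e)
    jump:  dict d -> p(a_j - a_{j-1} = d)"""
    I, J = len(e), len(f)

    def lp(x):                       # safe log for sparse tables
        return math.log(x) if x > 0 else -1e9

    # delta[i-1] = best log score of a prefix a_1..a_j ending with a_j = i
    delta = [lp(trans.get((f[0], ei), 0)) for ei in e]
    back = []
    for j in range(1, J):
        new, ptr = [], []
        for i in range(1, I + 1):
            prev = max(range(1, I + 1),
                       key=lambda ip: delta[ip - 1] + lp(jump.get(i - ip, 0)))
            new.append(delta[prev - 1] + lp(jump.get(i - prev, 0))
                       + lp(trans.get((f[j], e[i - 1]), 0)))
            ptr.append(prev)
        delta, back = new, back + [ptr]

    # Backtrace from the best final position
    a = [max(range(1, I + 1), key=lambda i: delta[i - 1])]
    for ptr in reversed(back):
        a.append(ptr[a[-1] - 1])
    return a[::-1]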
IBM2 vs HMM
[Figure: example alignments under IBM2 (absolute positions) vs. HMM (relative positions)]
[Vogel 96]
Enhancements to HMM & IBM Models
HMM model with empty word:
  Add I empty words to the target side
Model 6:
  IBM 4 predicts the distance between subsequent target positions
  HMM predicts the distance between subsequent source positions
  Model 6 is a log-linear combination of the IBM 4 and HMM models:
  $p_6(f, a \mid e) = \frac{p_4(f, a \mid e) \cdot p_{HMM}(f, a \mid e)}{\sum_{a', f'} p_4(f', a' \mid e) \cdot p_{HMM}(f', a' \mid e)}$
Smoothing:
  Alignment prob. – interpolate with a uniform distribution
  Fertility prob. – depends on the number of letters in a word
Symmetrization:
  Heuristic postprocessing to combine the alignments from both directions (see the sketch below)
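Symmetrization is typically a set operation over the two directed link sets; a minimal sketch of the intersection and union heuristics (the grow-diag refinements used in practice are omitted):

def symmetrize(src2tgt, tgt2src, method="intersection"):
    """Combine two directed alignments into one symmetric alignment.

    src2tgt: set of (j, i) links from the source->target alignment
    tgt2src: set of (j, i) links from the target->source alignment
             (already flipped into (j, i) order)."""
    if method == "intersection":
        return src2tgt & tgt2src      # high precision
    if method == "union":
        return src2tgt | tgt2src      # high recall
    raise ValueError(method)

# Example: two directed alignments that disagree on one link
a_fe = {(1, 1), (2, 3), (3, 2)}
a_ef = {(1, 1), (2, 3)}
print(symmetrize(a_fe, a_ef))             # {(1, 1), (2, 3)}
print(symmetrize(a_fe, a_ef, "union"))    # adds (3, 2)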
Experimental Results [Franz 03]
Refined models perform better:
  Models 4, 5, 6 better than Model 1 or the Dice coefficient model
  HMM better than IBM 2
Alignment quality depends on the training method and the bootstrapping scheme used:
  IBM 1 -> HMM -> IBM 3 better than IBM 1 -> IBM 2 -> IBM 3
Smoothing and symmetrization have a significant effect on alignment quality
More alignments in training yield better results
Using word classes: improvement for large corpora but not for small corpora
References:
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L. Mercer (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, vol. 19, no. 2.
Stephan Vogel, Hermann Ney, Christoph Tillmann (1996). HMM-based Word Alignment in Statistical Translation. COLING '96: The 16th International Conference on Computational Linguistics, Copenhagen, Denmark, August, pp. 836-841.
Franz Josef Och, Hermann Ney (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, vol. 29, no. 1, pp. 19-51.
Kevin Knight (1999). A Statistical MT Tutorial Workbook. Available at http://www.isi.edu/natural-language/mt/wkbk.rtf.