A Systematic Comparison of Various
Statistical Alignment Models
Och and Ney, ACL 2003
Presented by:
Ankit Ramteke, Ankur Aher,
Piyush Dungarwal, Mandar Joshi
PART I
Ankit Ramteke, Piyush Dungarwal
Introduction
Noisy channel model
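For reference, the standard noisy-channel decomposition (also shown in Part II) is:

$$\hat{e} = \operatorname*{argmax}_{e} \Pr(e \mid f) = \operatorname*{argmax}_{e} \Pr(e) \cdot \Pr(f \mid e)$$

Here Pr(e) is the language model and Pr(f|e) the translation model.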
Calculating Probability
● Alignment probability: Pr(a | f, e) = Pr(f, a | e) / Pr(f | e)
● Two birds, one stone:
  • compute Pr(f, a | e) using a generative story
  • compute Pr(f | e) = Σ_a Pr(f, a | e)
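For a toy sentence pair both quantities can be checked by brute force. A minimal Python sketch (the function name and translation table t are illustrative; real systems exploit Model 1's factorization rather than enumerating all (l+1)^m alignments):

```python
from itertools import product

def model1_marginal(f_words, e_words, t, eps=1.0):
    """Brute-force Pr(f|e) = sum over all alignments a of Pr(f, a|e),
    under IBM Model 1: uniform alignment over l+1 positions (NULL = 0)."""
    e_null = ["NULL"] + e_words          # position 0 is the NULL word
    l, m = len(e_words), len(f_words)
    total = 0.0
    for a in product(range(l + 1), repeat=m):   # all (l+1)^m alignments
        p = eps / (l + 1) ** m                  # uniform alignment prior
        for j, aj in enumerate(a):
            p *= t.get((f_words[j], e_null[aj]), 0.0)  # t(f_j | e_{a_j})
        total += p                              # marginalize over a
    return total

# Pr(a | f, e) for one alignment is then Pr(f, a|e) / Pr(f|e).
```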
Motivation
[Diagram: the possible alignments between a source pair (b, c) and a target pair (x, y); several alignments are possible for the same sentence pair]
Outline of the Paper
● Review of various statistical alignment models and heuristic models, and training of the alignment models
● IBM Models 1-5, HMM model, Model 6 (Model 4 + HMM)
● Heuristic-based models
● Some heuristic methods for improving alignment quality
● Evaluation methodology for word alignment methods
● Systematic comparison of the various statistical alignment models
Topics for Today
● Introduction
● IBM Model 1
● IBM Model 2
● IBM Model 3
IBM Model 1
Description
Derivation
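For reference, Model 1's joint and marginal likelihoods (Brown et al. 1993) are:

$$\Pr(f, a \mid e) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} t(f_j \mid e_{a_j}), \qquad \Pr(f \mid e) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)$$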
Learning Parameters
The EM algorithm consists of two steps:
• Expectation step: apply the model to the data
  • parts of the model are hidden (here: the alignments)
  • using the model, assign probabilities to possible values
• Maximization step: estimate the model from the data
  • take the assigned values as fact
  • collect counts (weighted by the probabilities)
  • estimate the model from the counts
• Iterate these steps until convergence
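As a companion to the worked example on the next slides, here is a minimal Python sketch of this EM loop for Model 1 (the toy corpus and function name are illustrative; the NULL word is omitted for brevity):

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """EM for IBM Model 1. corpus: list of (foreign_words, english_words)."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # t(f|e), uniform init
    for _ in range(iterations):
        count = defaultdict(float)                # E-step: expected counts
        total = defaultdict(float)
        for fs, es in corpus:
            for f in fs:
                z = sum(t[(f, e)] for e in es)    # normalization over e
                for e in es:
                    delta = t[(f, e)] / z         # posterior of link (f, e)
                    count[(f, e)] += delta
                    total[e] += delta
        for f, e in count:                        # M-step: re-estimate t(f|e)
            t[(f, e)] = count[(f, e)] / total[e]
    return t

corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]
t = train_ibm1(corpus)
print(round(t[("haus", "house")], 3))
```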
Learning Parameters
Example taken from SMT by Koehn
EM Algorithm
IBM Model 2
Motivation
● Why Model 2 when we have Model 1?
● Model 1 treats all alignments as equally likely, so it cannot prefer
राम पाठशाला गया → Ram went to school
over the scrambled
राम पाठशाला गया → school Ram to went
IBM Model 2
● Focuses on absolute alignment positions
● Assumptions:
1. Pr(m|e) is independent of e and m
   • new parameter ε = Pr(m|e)
2. Model 1's uniform distribution over the l+1 positions (NULL included) is replaced by an alignment probability Pr(aj | j, m, l)
3. Pr(fj | a1^j, f1^(j-1), m, e) depends only on fj and eaj
   • translation probability t(fj | eaj) = Pr(fj | a1^j, f1^(j-1), m, e)
● Number of new parameters: m (aj for j = 1 to m)
● Training: by EM, as for Model 1
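Combining the three assumptions, the standard Model 2 likelihood is:

$$\Pr(f, a \mid e) = \epsilon \prod_{j=1}^{m} a(a_j \mid j, m, l)\; t(f_j \mid e_{a_j})$$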
IBM Model 3
Why Model 3?
[Diagram: the possible alignments between (b, c) and (x, y) again; Models 1 and 2 have no way to model how many target words each source word generates]
Terminologies
● Adds a fertility model
● Fertility probability
  – e.g., n(2 | house) = probability of generating 2 words for the word 'house'
● Translation probability: same as Model 1
  – e.g., t(maison | house) = probability of 'maison' being the translation of 'house'
● Distortion probability
  – e.g., d(5 | 2) = probability that the word at position 2 moves to position 5
Derivation from Noisy Channel
Example
This city is famous for its flora.
→ This city is famous for its flora flora        (fertility step)
→ This city is famous NULL for its flora flora   (NULL insertion step)
→ यह शहर है मशहूर के लिए अपने पेड़ पौधे           (lexical translation)
→ यह शहर अपने पेड़ पौधे के लिए मशहूर है           (distortion step)
Modifications Required
Final Probability
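In simplified form (dropping the NULL-insertion and fertility-factorial bookkeeping terms of Brown et al. 1993), Model 3 scores a pair (f, a) as:

$$\Pr(f, a \mid e) \propto \prod_{i=1}^{l} n(\phi_i \mid e_i)\; \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \prod_{j:\, a_j \neq 0} d(j \mid a_j, m, l)$$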
Generative Process
Training
[Training flow diagram from Kevin Knight's SMT tutorial]
Deficiency
• Distortion probabilities do not depend on where earlier words were placed.
• As a result, Model 3 wastes some of its probability on "generalized strings":
• strings in which some positions receive several words and others receive none.
• When a model does not concentrate all of its probability on events of interest, it is said to be deficient.
Example
[Diagram: राम पाठशाला गया rendered as "Ram went school" with "to" stacked on an already-occupied position, leaving another position empty — a generalized string]
Comparison of Statistical Models

Model     Alignment    Fertility   E-Step        Deficient
Model 1   Uniform      No          Exact         No
Model 2   Zero-order   No          Exact         No
Model 3   Zero-order   Yes         Approximate   Yes
Heuristic Models
PART II
Ankur Aher, Mandar Joshi
Outline
Recap: IBM Models 1, 2 & 3
HMM-based alignment approach
IBM Model 4
MT Phenomena
Word-to-word translation
Fertility
Re-ordering

राम पाठशाला गया
→ Ram school went      (word-to-word translation)
→ Ram school went to   (fertility)
→ Ram went to school   (re-ordering)
SMT
Noisy channel model:
argmax_e Pr(e|f) = argmax_e Pr(e) · Pr(f|e)
(language model · translation model)
Pr(f|e) = Σ_a Pr(f, a|e)   (alignment model)
Alignment Example: राम पाठशाला गया ↔ Ram went to school
Positional vector: a = {1-1, 2-3, 3-0, 4-2}
Translation model
Model 1
Model 2
● Why Model 2 when we have Model 1?
● Model 1 treats all alignments as equally likely, so it cannot prefer
राम पाठशाला गया → Ram went to school
over the scrambled
राम पाठशाला गया → school Ram to went
Model 2
Model 3
[Diagram: the possible alignments between (b, c) and (x, y), as in Part I]
Terminologies
Example
Example taken from SMT by Koehn
Model 3
Deficiency
[Diagram: राम पाठशाला गया rendered as "Ram went school" with "to" stacked on an occupied position — the deficiency example from Part I]
Comparison of Statistical Models

Model     Alignment    Fertility   E-Step        Deficient
Model 1   Uniform      No          Exact         No
Model 2   Zero-order   No          Exact         No
Model 3   Zero-order   Yes         Approximate   Yes
Hidden Markov Alignment Model
Motivation
In the translation process, large phrases tend to move together.
Words that are adjacent in the source language tend to be next to each other in the target language.
A strong localization effect is observed in alignments.
Hindi-English Alignment Example
[Alignment matrix: भारतीय टीम क्रिकेट विश्व कप तीन बार जीती ↔ "Indian team won cricket world cup three times"; the aligned word pairs cluster near the diagonal]
What is Hidden?
The alignments aj are the hidden variables; the words of the sentence pair are observed.
Capturing Locality
The HMM captures the locality of the English sentence.
Homogeneous HMM
To make the alignment parameters independent of absolute word positions, we assume that the alignment probabilities p(i | i', m) depend only on the jump width (i − i').
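For reference (following Vogel et al. 1996; the sentence-length term is omitted), the HMM alignment likelihood and the homogeneous jump-width parameterization are:

$$\Pr(f, a \mid e) = \prod_{j=1}^{m} p(a_j \mid a_{j-1})\; t(f_j \mid e_{a_j}), \qquad p(i \mid i') = \frac{c(i - i')}{\sum_{i''} c(i'' - i')}$$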
Comparison of Statistical Models

Model     Alignment     Fertility   E-Step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes
HMM       First-order   No          Exact         No
IBM Model 4
Motivation
Large phrases tend to move together.
This intuition is captured through relative distortion.
The placement of the translation of an input word is typically based on the placement of the translation of the preceding input word.
Example
The Indian team played well for 90 minutes.
भारतीय टीम ९० मिनट अच्छी खेली.
Terminology
Cept
– Each input word fj that is aligned to at least one output word forms a cept.
हिंदी की किताब अलमारी में थी.
The Hindi book was in the cupboard.
The cepts here are हिंदी, किताब, अलमारी, में, and थी (each is aligned to at least one English word; की is not a cept).
Terminology
We define the operator [i] to map the cept with index i back to its corresponding foreign input word position.
Center of a cept
Center of a cept is defined as the ceiling of the average of the output word positions for that cept.
We use ⊙i to indicate the center of cept i.
Table 1

cept πi                π1      π2      π3        π4       π5
foreign position [i]   1       3       4         5        6
foreign word f[i]      हिंदी    किताब    अलमारी     में       थी
English words {ej}     Hindi   book    cupboard  in, the  was
English positions {j}  2       3       7         5, 6     4
center of cept ⊙i      2       3       7         6        4
Relative Distortion
Three cases:
1. words generated by the NULL token — these are uniformly distributed
2. the first word in a cept
3. subsequent words in a cept
The first word in a cept
For the likelihood of the placement of the first word of a cept, we use the distribution d1(j − ⊙i−1).
Example: see Table 2.
Subsequent words in a cept
For subsequent words within a cept, we use the distribution d>1(j − πi,k−1).
Example: see Table 2.
Table 2

j              1      2       3       4       5       6        7
ej             The    Hindi   book    was     in      the      cupboard
cept πi,k      π0,0   π1,0    π2,0    π5,0    π4,0    π4,1     π3,0
⊙i−1           -      0       2       6       7       -        3
j − ⊙i−1       -      +2      +1      −2      −2      -        +4
distortion     1      d1(+2)  d1(+1)  d1(−2)  d1(−2)  d>1(+1)  d1(+4)
Model 4
Comparison of Statistical Models

Model     Alignment     Fertility   E-Step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes
HMM       First-order   No          Exact         No
Model 4   First-order   Yes         Approximate   Yes
Word Classes
A smarter way to look at re-ordering:
Some words tend to get reordered while others do not.
Examples:
– good man → भला आदमी
– external affairs → affaires extérieures
Word Classes
Conditioning on full words will not work due to data sparsity.
Instead, create word classes by partitioning the vocabulary, e.g., by POS tags.
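Following Och and Ney (2003), the word classes enter the two distortion distributions from Part II as conditioning context, with A(e) a class of the English word and B(f) a class of the foreign word:

$$d_1\big(j - \odot_{i-1} \mid A(e_{[i-1]}),\, B(f_j)\big), \qquad d_{>1}\big(j - \pi_{i,k-1} \mid B(f_j)\big)$$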
PART III
Ankit Ramteke
Ankur Aher
Outline
Topics covered
– IBM Models 1, 2, 3, 4
– HMM
Today's topics
– Heuristic models
– IBM Model 5
– Model 6
– Comparison of various methods based on:
  • size of the training corpus
  • EM training methodology
Heuristic Models
Dice
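The score is computed from co-occurrence counts over sentence pairs, as in the worked example below (note the example divides by the product of the counts, whereas the classical Dice coefficient divides by their sum):

$$\operatorname{dice}(e, f) = \frac{2 \cdot C(e, f)}{C(e) \cdot C(f)}$$

where C(e, f) is the number of sentence pairs in which e and f co-occur.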
Example
Parallel corpus
Those boys play cricket.        वह लड़के क्रिकेट खेलते हैं.
That boy plays cricket.         वह लड़का क्रिकेट खेलता है.
I saw that house.               मैंने वह घर देखा.
I slept in that house.          मैं उस घर में सोया.
I played in that house.         मैं उस घर में खेला.
I liked that house.             मुझे वह घर भाया.
That boy went in the house.     वह लड़का घर में गया.
Example: That is my house. वह मेरा घर है.
dice(that, वह) = (2·4)/(6·5) = 0.266
dice(that, घर) = (2·5)/(6·5) = 0.333   ← indirect association
dice(house, घर) = (2·5)/(5·5) = 0.4
Competitive linking algorithm
• Iteratively align the highest-ranking word pair (i, j); a linked word cannot be linked again.
• Advantage: indirect associations occur less often.
• The Dice coefficient nevertheless results in worse alignment quality than the statistical models.

         वह     मेरा   घर     है
That     0.26   0.0   0.33   0.16
is       0.0    0.0   0.0    0.0
my       0.0    0.0   0.0    0.0
house    0.24   0.0   0.4    0.0

Step 1: house ↔ घर (0.4, the highest score)
Step 2: That ↔ वह (0.26; That–घर scores higher, but घर is already linked)
Step 3: my / is ↔ मेरा / है (only zero-score pairs remain)
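A minimal Python sketch of this greedy loop (function name and score layout are illustrative); run on the matrix above, it reproduces steps 1 and 2:

```python
def competitive_linking(scores):
    """Greedy competitive linking, as described above: repeatedly link the
    highest-scoring (e, f) pair whose words are both still unlinked.
    scores: dict mapping (e_word, f_word) to a Dice score."""
    links, used_e, used_f = [], set(), set()
    for (e, f), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s <= 0.0:
            break                            # zero-score pairs stay unlinked
        if e not in used_e and f not in used_f:
            links.append((e, f, s))
            used_e.add(e)
            used_f.add(f)
    return links

scores = {("That", "वह"): 0.26, ("That", "घर"): 0.33, ("That", "है"): 0.16,
          ("house", "वह"): 0.24, ("house", "घर"): 0.4}
print(competitive_linking(scores))
# [('house', 'घर', 0.4), ('That', 'वह', 0.26)]
```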
Distortion Probabilities
where vj is the number of vacancies in the English output interval [1, j]
NULL हिंदी की किताब अलमारी में थी.
The Hindi book was in the cupboard.
Cept                Vacancies                          Parameters of d1
                    (The Hindi book was in the cupboard)
f[i]     πi,k       v1  v2  v3  v4  v5  v6  v7    j   vj  vmax  v⊙i−1
NULL     π0,1       1   2   3   4   5   6   7     1   -   -     -
हिंदी     π1,1       -   1   2   3   4   5   6     2   1   6     0
किताब    π2,1       -   -   1   2   3   4   5     3   1   5     0
अलमारी   π3,1       -   -   -   1   2   3   4     7   4   4     0
में       π4,1       -   -   -   1   2   -   -     5   2   2     3
         π4,2       -   -   -   1   -   2   -     6   2   -     -
थी       π5,1       -   -   -   1   -   -   -     4   1   1     0
Model 5
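Model 5 removes the deficiency by placing words only into vacant positions; using the quantities tabulated above, the first word of a cept is placed with a distribution of the form (following Och & Ney 2003):

$$d_1\big(v_j \mid B(f_j),\, v_{\odot_{i-1}},\, v_{\max}\big)$$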
Model 6
Motivation
• HMM makes use of locality in the source language
• Model 4 makes use of locality in the target language
• How can we use both?
Formulation
• Combine the HMM and Model 4 in a log-linear way (Och & Ney 2003)
• α is the interpolation parameter
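Following Och and Ney (2003), with α weighting the HMM component:

$$p_6(f, a \mid e) = \frac{p_4(f, a \mid e) \cdot p_{\mathrm{HMM}}(f, a \mid e)^{\alpha}}{\sum_{a', f'} p_4(f', a' \mid e) \cdot p_{\mathrm{HMM}}(f', a' \mid e)^{\alpha}}$$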
General Formulation
• A log-linear combination of several models pk(f, a | e), k = 1, ..., K, is given by:
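$$p(f, a \mid e) = \frac{\prod_{k=1}^{K} p_k(f, a \mid e)^{\alpha_k}}{\sum_{a', f'}\, \prod_{k=1}^{K} p_k(f', a' \mid e)^{\alpha_k}}$$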
Comparison of Statistical Models

Model     Alignment     Fertility   E-Step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
HMM       First-order   No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes
Model 4   First-order   Yes         Approximate   Yes
Model 5   First-order   Yes         Approximate   No
Model 6   First-order   Yes         Approximate   Yes
Summary

Model     Alignment     Description
Model 1   Uniform       uniform distribution over the l+1 positions
Model 2   Zero-order    alignment probability a(aj | j, m, l)
HMM       First-order   jump-width dependence on the previous alignment position
Model 3   Zero-order    distortion probability d(j | aj, m, l), plus fertility
Model 4   First-order   relative distortion d1, d>1 over cepts, plus fertility
Model 5   First-order   non-deficient variant of Model 4 based on vacancies
Model 6   First-order   log-linear combination of HMM and Model 4
Corpus

Verbmobil Task (German-English speech translation)

                               German    English
Training corpus   Sentences    34,446 (~34K)
                  Words        329,625   343,076
                  Vocabulary   5,936     3,505
                  Singletons   2,600     1,305
Bilingual dict.   Entries      4,404
                  Words        4,758     5,543
Test corpus       Sentences    354
                  Words        3,233     3,109
Hansards Task (French-English: debates of the Canadian parliament)

                               French    English
Training corpus   Sentences    1,470K
                  Words        24.33M    22.16M
                  Vocabulary   100,269   78,332
                  Singletons   40,199    31,319
Bilingual dict.   Entries      28,701
                  Words        28,702    30,186
Test corpus       Sentences    500
                  Words        8,749     7,946
Comparison based on various metrics
Based on training schemes (Verbmobil)
Based on training schemes (Hansards)
Observations - 1
Refined models (4, 5, 6) perform better at all corpus sizes.
As the corpus size increases, the improvement becomes much more significant.
Statistical models perform better than heuristic models, thanks to the well-founded mathematical theory underlying their parameter estimation.
Observations - 2
The Hidden Markov alignment model achieves significantly better results than Model 2
– it is a homogeneous first-order alignment model
– it better represents the locality and monotonicity properties of natural languages
Alignment quality depends heavily on the method used to bootstrap the model
– bootstrapping with the HMM performs better than bootstrapping with Model 2
EM training
The E-step is exact for Models 1, 2 and the HMM
For Models 3, 4, 5, 6 it is approximate
– a smaller subset of alignments is selected
Three methods to select the subset:
Viterbi alignment (Brown et al. 1993)
Viterbi + neighborhood (Al-Onaizan et al. 1999)
Pegged alignments (Brown et al. 1993)
Effect of more alignments (Verbmobil)
Effect of more alignments (Hansards)
Observations
The effect of pegging strongly depends on the quality of the starting point used.
If Model 2 is used as the starting point:
– the neighborhood alignments and the pegged alignments give a significant improvement
– using only the Viterbi alignment gives significantly worse results than additionally using the neighborhood of the Viterbi alignment
With the HMM as the starting point, the effect of pegging is much smaller.
Using more alignments in training is a way to avoid a poor local optimum.
Computational times
• Using the pegged alignments yields only a moderate improvement in performance, at a considerable increase in computational time.
PART IV
Piyush Dungarwal
Mandar Joshi
Effect of Smoothing
Using linear interpolation
For Alignment probability:
For Fertility probability:
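A schematic form of the interpolation (the weight names α, β and the exact backoff distributions are assumptions; the idea is to back off the HMM alignment probabilities toward a uniform distribution and the fertility probabilities toward a word-independent one):

$$p'(a_j \mid a_{j-1}, l) = (1-\alpha)\, p(a_j \mid a_{j-1}, l) + \alpha \cdot \frac{1}{l}$$
$$n'(\phi \mid e) = (1-\beta)\, n(\phi \mid e) + \beta \cdot n(\phi)$$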
Results (Verbmobil task)
Results (Hansards task)
Direction
Symmetrization
Intersection
Union
Refined method
– In a first step, the intersection A = A1 ∩ A2 is determined.
– Then A is extended by adding alignments (i, j) occurring only in A1 or only in A2 if:
  • neither fj nor ei has an alignment in A, or
  • both of the following conditions hold:
    – (i, j) has a horizontal or vertical neighbour in A
    – A ∪ {(i, j)} does not contain alignments with both horizontal and vertical neighbours
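A minimal Python sketch of the refined method just described (function names are illustrative, and the iteration order, left open above, is simplified to a fixed-point loop), illustrated by the grids below:

```python
def refined_symmetrization(a_f2e, a_e2f):
    """Refined symmetrization heuristic, as described above.
    Links are (i, j) pairs: English position i, foreign position j."""
    A = set(a_f2e) & set(a_e2f)                  # step 1: intersection
    candidates = (set(a_f2e) | set(a_e2f)) - A   # links in only one direction

    def h_neighbour(link, s):                    # horizontal neighbour in s
        i, j = link
        return (i, j - 1) in s or (i, j + 1) in s

    def v_neighbour(link, s):                    # vertical neighbour in s
        i, j = link
        return (i - 1, j) in s or (i + 1, j) in s

    changed = True
    while changed:                               # repeat until nothing qualifies
        changed = False
        for i, j in sorted(candidates - A):
            e_unaligned = all(ii != i for ii, _ in A)
            f_unaligned = all(jj != j for _, jj in A)
            extended = A | {(i, j)}
            no_bad_shape = not any(
                h_neighbour(l, extended) and v_neighbour(l, extended)
                for l in extended)
            has_neighbour = h_neighbour((i, j), A) or v_neighbour((i, j), A)
            if (e_unaligned and f_unaligned) or (has_neighbour and no_bad_shape):
                A.add((i, j))
                changed = True
    return A
```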
[Five alignment grids over e1-e4 × f1-f4: the F→E alignments, the E→F alignments, and their intersection, union, and refined-method combination]
Symmetrisation
Conclusion
This paper compares heuristic and statistical alignment methods (IBM Models 1-5, HMM, Model 6)
Statistical alignment models outperform the simple Dice coefficient
Model 6 outperforms all other models
Smoothing and symmetrisation have a significant effect on the alignment quality achieved by a particular model.
Conclusion
The bilingual dictionary and word classes have only a minor effect on alignment quality
In general, important ingredients of a good model seem to be a first-order dependence between word positions and a fertility model
References
Franz Josef Och and Hermann Ney. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51, March 2003.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), 1993.
Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2010. Chapter 4 (Word-Based Models).
Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-Based Word Alignment in Statistical Translation. COLING 1996, pages 836-841.