A Systematic Comparison of Various
Statistical Alignment Models
Och and Ney, ACL 2003
Presented by:
Ankit Ramteke, Ankur Aher,
Piyush Dungarwal, Mandar Joshi
PART I
Ankit Ramteke, Piyush Dungarwal
Introduction
Noisy channel model
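For reference, the standard noisy-channel decomposition (also shown in Part II) is:

$$\hat{e} = \operatorname*{argmax}_{e} \Pr(e \mid f) = \operatorname*{argmax}_{e} \Pr(e) \cdot \Pr(f \mid e)$$

Here Pr(e) is the language model and Pr(f|e) the translation model.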
Calculating Probability
● Alignment probability: Pr(a | f, e) = Pr(f, a | e) / Pr(f | e)
● Two birds, one stone:
  • compute Pr(f, a | e) using a generative story
  • compute Pr(f | e) = Σ_a Pr(f, a | e)
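For a toy sentence pair both quantities can be checked by brute force. A minimal Python sketch (the function name and translation table t are illustrative; real systems exploit Model 1's factorization rather than enumerating all (l+1)^m alignments):

```python
from itertools import product

def model1_marginal(f_words, e_words, t, eps=1.0):
    """Brute-force Pr(f|e) = sum over all alignments a of Pr(f, a|e),
    under IBM Model 1: uniform alignment over l+1 positions (NULL = 0)."""
    e_null = ["NULL"] + e_words          # position 0 is the NULL word
    l, m = len(e_words), len(f_words)
    total = 0.0
    for a in product(range(l + 1), repeat=m):   # all (l+1)^m alignments
        p = eps / (l + 1) ** m                  # uniform alignment prior
        for j, aj in enumerate(a):
            p *= t.get((f_words[j], e_null[aj]), 0.0)  # t(f_j | e_{a_j})
        total += p                              # marginalize over a
    return total

# Pr(a | f, e) for one alignment is then Pr(f, a|e) / Pr(f|e).
```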
Motivation
[Diagram: the possible alignments between a source pair (b, c) and a target pair (x, y); several alignments are possible for the same sentence pair]
Outline of the Paper
● Review of various statistical alignment models and heuristic models, and training of the alignment models
● IBM Models 1-5, HMM model, Model 6 (Model 4 + HMM)
● Heuristic-based models
● Some heuristic methods for improving alignment quality
● Evaluation methodology for word alignment methods
● Systematic comparison of the various statistical alignment models
Topics for Today
● Introduction
● IBM Model 1
● IBM Model 2
● IBM Model 3
IBM Model 1
Description
Derivation
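For reference, Model 1's joint and marginal likelihoods (Brown et al. 1993) are:

$$\Pr(f, a \mid e) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} t(f_j \mid e_{a_j}), \qquad \Pr(f \mid e) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)$$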
Learning Parameters
The EM algorithm consists of two steps:
• Expectation step: apply the model to the data
  • parts of the model are hidden (here: the alignments)
  • using the model, assign probabilities to possible values
• Maximization step: estimate the model from the data
  • take the assigned values as fact
  • collect counts (weighted by the probabilities)
  • estimate the model from the counts
• Iterate these steps until convergence
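As a companion to the worked example on the next slides, here is a minimal Python sketch of this EM loop for Model 1 (the toy corpus and function name are illustrative; the NULL word is omitted for brevity):

```python
from collections import defaultdict

def train_ibm1(corpus, iterations=10):
    """EM for IBM Model 1. corpus: list of (foreign_words, english_words)."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # t(f|e), uniform init
    for _ in range(iterations):
        count = defaultdict(float)                # E-step: expected counts
        total = defaultdict(float)
        for fs, es in corpus:
            for f in fs:
                z = sum(t[(f, e)] for e in es)    # normalization over e
                for e in es:
                    delta = t[(f, e)] / z         # posterior of link (f, e)
                    count[(f, e)] += delta
                    total[e] += delta
        for f, e in count:                        # M-step: re-estimate t(f|e)
            t[(f, e)] = count[(f, e)] / total[e]
    return t

corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"])]
t = train_ibm1(corpus)
print(round(t[("haus", "house")], 3))
```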
Learning Parameters
Example taken from SMT by Koehn
EM Algorithm
IBM Model 2
Motivation
● Why Model 2 when we have Model 1?
● Model 1 treats all alignments as equally likely, so it cannot prefer
राम पाठशाला गया → Ram went to school
over the scrambled
राम पाठशाला गया → school Ram to went
IBM Model 2
● Focuses on absolute alignment positions
● Assumptions:
1. Pr(m|e) is independent of e and m
   • new parameter ε = Pr(m|e)
2. Model 1's uniform distribution over the l+1 positions (NULL included) is replaced by an alignment probability Pr(aj | j, m, l)
3. Pr(fj | a1^j, f1^(j-1), m, e) depends only on fj and eaj
   • translation probability t(fj | eaj) = Pr(fj | a1^j, f1^(j-1), m, e)
● Number of new parameters: m (aj for j = 1 to m)
● Training: by EM, as for Model 1
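Combining the three assumptions, the standard Model 2 likelihood is:

$$\Pr(f, a \mid e) = \epsilon \prod_{j=1}^{m} a(a_j \mid j, m, l)\; t(f_j \mid e_{a_j})$$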
IBM Model 3
Why Model 3?
[Diagram: the possible alignments between (b, c) and (x, y) again; Models 1 and 2 have no way to model how many target words each source word generates]
Terminologies
● Adds a fertility model
● Fertility probability
  – e.g., n(2 | house) = probability of generating 2 words for the word 'house'
● Translation probability: same as Model 1
  – e.g., t(maison | house) = probability of 'maison' being the translation of 'house'
● Distortion probability
  – e.g., d(5 | 2) = probability that the word at position 2 moves to position 5
Derivation from Noisy Channel
Example
This city is famous for its flora.
→ This city is famous for its flora flora        (fertility step)
→ This city is famous NULL for its flora flora   (NULL insertion step)
→ यह शहर है मशहूर के लिए अपने पेड़ पौधे           (lexical translation)
→ यह शहर अपने पेड़ पौधे के लिए मशहूर है           (distortion step)
Modifications Required
Final Probability
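In simplified form (dropping the NULL-insertion and fertility-factorial bookkeeping terms of Brown et al. 1993), Model 3 scores a pair (f, a) as:

$$\Pr(f, a \mid e) \propto \prod_{i=1}^{l} n(\phi_i \mid e_i)\; \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \prod_{j:\, a_j \neq 0} d(j \mid a_j, m, l)$$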
Generative Process
Training
[Training flow diagram from Kevin Knight's SMT tutorial]
Deficiency
• Distortion probabilities do not depend on where earlier words were placed.
• As a result, Model 3 wastes some of its probability on "generalized strings":
• strings in which some positions receive several words and others receive none.
• When a model does not concentrate all of its probability on events of interest, it is said to be deficient.
Example
[Diagram: राम पाठशाला गया rendered as "Ram went school" with "to" stacked on an already-occupied position, leaving another position empty — a generalized string]
Comparison of Statistical Models

Model     Alignment    Fertility   E-Step        Deficient
Model 1   Uniform      No          Exact         No
Model 2   Zero-order   No          Exact         No
Model 3   Zero-order   Yes         Approximate   Yes
Heuristic Models
PART II
Ankur Aher, Mandar Joshi
Outline
Recap: IBM Models 1, 2 & 3
HMM-based alignment approach
IBM Model 4
MT Phenomena
Word-to-word translation
Fertility
Re-ordering

राम पाठशाला गया
→ Ram school went      (word-to-word translation)
→ Ram school went to   (fertility)
→ Ram went to school   (re-ordering)
SMT
Noisy channel model:
argmax_e Pr(e|f) = argmax_e Pr(e) · Pr(f|e)
(language model · translation model)
Pr(f|e) = Σ_a Pr(f, a|e)   (alignment model)
Alignment Example: राम पाठशाला गया ↔ Ram went to school
Positional vector: a = {1-1, 2-3, 3-0, 4-2}
Translation model
Model 1
Model 2
● Why Model 2 when we have Model 1?
● Model 1 treats all alignments as equally likely, so it cannot prefer
राम पाठशाला गया → Ram went to school
over the scrambled
राम पाठशाला गया → school Ram to went
Model 2
Model 3
[Diagram: the possible alignments between (b, c) and (x, y), as in Part I]
Terminologies
Example
Example taken from SMT by Koehn
Model 3
Deficiency
[Diagram: राम पाठशाला गया rendered as "Ram went school" with "to" stacked on an occupied position — the deficiency example from Part I]
Comparison of Statistical Models

Model     Alignment    Fertility   E-Step        Deficient
Model 1   Uniform      No          Exact         No
Model 2   Zero-order   No          Exact         No
Model 3   Zero-order   Yes         Approximate   Yes
Hidden Markov Alignment Model
Motivation
In the translation process, large phrases tend to move together.
Words that are adjacent in the source language tend to be next to each other in the target language.
A strong localization effect is observed in alignments.
Hindi-English Alignment Example
[Alignment matrix: भारतीय टीम क्रिकेट विश्व कप तीन बार जीती ↔ "Indian team won cricket world cup three times"; the aligned word pairs cluster near the diagonal]
What is Hidden?
The alignments aj are the hidden variables; the words of the sentence pair are observed.
Capturing Locality
The HMM captures the locality of the English sentence.
Homogeneous HMM
To make the alignment parameters independent of absolute word positions, we assume that the alignment probabilities p(i | i', m) depend only on the jump width (i − i').
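For reference (following Vogel et al. 1996; the sentence-length term is omitted), the HMM alignment likelihood and the homogeneous jump-width parameterization are:

$$\Pr(f, a \mid e) = \prod_{j=1}^{m} p(a_j \mid a_{j-1})\; t(f_j \mid e_{a_j}), \qquad p(i \mid i') = \frac{c(i - i')}{\sum_{i''} c(i'' - i')}$$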
Comparison of Statistical Models

Model     Alignment     Fertility   E-Step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes
HMM       First-order   No          Exact         No
IBM Model 4
Motivation
Large phrases tend to move together.
This intuition is captured through relative distortion.
The placement of the translation of an input word is typically based on the placement of the translation of the preceding input word.
Example
The Indian team played well for 90 minutes.
भारतीय टीम ९० मिनट अच्छी खेली.
Terminology
Cept
– Each input word fj that is aligned to at least one output word forms a cept.
हिंदी की किताब अलमारी में थी.
The Hindi book was in the cupboard.
The cepts here are हिंदी, किताब, अलमारी, में, and थी (each is aligned to at least one English word; की is not a cept).
Terminology
We define the operator [i] to map the cept with index i back to its corresponding foreign input word position.
Center of a cept
Center of a cept is defined as the ceiling of the average of the output word positions for that cept.
We use ⊙i to indicate the center of cept i.
Table 1

cept πi                π1      π2      π3        π4       π5
foreign position [i]   1       3       4         5        6
foreign word f[i]      हिंदी    किताब    अलमारी     में       थी
English words {ej}     Hindi   book    cupboard  in, the  was
English positions {j}  2       3       7         5, 6     4
center of cept ⊙i      2       3       7         6        4
Relative Distortion
Three cases:
1. words generated by the NULL token — these are uniformly distributed
2. the first word in a cept
3. subsequent words in a cept
The first word in a cept
For the likelihood of the placement of the first word of a cept, we use the distribution d1(j − ⊙i−1).
Example: see Table 2.
Subsequent words in a cept
For subsequent words within a cept, we use the distribution d>1(j − πi,k−1).
Example: see Table 2.
Table 2

j              1      2       3       4       5       6        7
ej             The    Hindi   book    was     in      the      cupboard
cept πi,k      π0,0   π1,0    π2,0    π5,0    π4,0    π4,1     π3,0
⊙i−1           -      0       2       6       7       -        3
j − ⊙i−1       -      +2      +1      −2      −2      -        +4
distortion     1      d1(+2)  d1(+1)  d1(−2)  d1(−2)  d>1(+1)  d1(+4)
Model 4
Comparison of Statistical Models

Model     Alignment     Fertility   E-Step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes
HMM       First-order   No          Exact         No
Model 4   First-order   Yes         Approximate   Yes
Word Classes
A smarter way to look at re-ordering:
Some words tend to get reordered while others do not.
Examples:
– good man → भला आदमी
– external affairs → affaires extérieures
Word Classes
Conditioning on full words will not work due to data sparsity.
Instead, create word classes by partitioning the vocabulary, e.g., by POS tags.
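Following Och and Ney (2003), the word classes enter the two distortion distributions from Part II as conditioning context, with A(e) a class of the English word and B(f) a class of the foreign word:

$$d_1\big(j - \odot_{i-1} \mid A(e_{[i-1]}),\, B(f_j)\big), \qquad d_{>1}\big(j - \pi_{i,k-1} \mid B(f_j)\big)$$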
PART III
Ankit Ramteke
Ankur Aher
Outline
Topics covered
– IBM Models 1, 2, 3, 4
– HMM
Today's topics
– Heuristic models
– IBM Model 5
– Model 6
– Comparison of various methods based on:
  • size of the training corpus
  • EM training methodology
Heuristic Models
Dice
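The score is computed from co-occurrence counts over sentence pairs, as in the worked example below (note the example divides by the product of the counts, whereas the classical Dice coefficient divides by their sum):

$$\operatorname{dice}(e, f) = \frac{2 \cdot C(e, f)}{C(e) \cdot C(f)}$$

where C(e, f) is the number of sentence pairs in which e and f co-occur.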
Example
Parallel corpus
Those boys play cricket.        वह लड़के क्रिकेट खेलते हैं.
That boy plays cricket.         वह लड़का क्रिकेट खेलता है.
I saw that house.               मैंने वह घर देखा.
I slept in that house.          मैं उस घर में सोया.
I played in that house.         मैं उस घर में खेला.
I liked that house.             मुझे वह घर भाया.
That boy went in the house.     वह लड़का घर में गया.
Example: That is my house. वह मेरा घर है.
dice(that, वह) = (2·4)/(6·5) = 0.266
dice(that, घर) = (2·5)/(6·5) = 0.333   ← indirect association
dice(house, घर) = (2·5)/(5·5) = 0.4
Competitive linking algorithm
• Iteratively align the highest-ranking word pair (i, j); a linked word cannot be linked again.
• Advantage: indirect associations occur less often.
• The Dice coefficient nevertheless results in worse alignment quality than the statistical models.

         वह     मेरा   घर     है
That     0.26   0.0   0.33   0.16
is       0.0    0.0   0.0    0.0
my       0.0    0.0   0.0    0.0
house    0.24   0.0   0.4    0.0

Step 1: house ↔ घर (0.4, the highest score)
Step 2: That ↔ वह (0.26; That–घर scores higher, but घर is already linked)
Step 3: my / is ↔ मेरा / है (only zero-score pairs remain)
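A minimal Python sketch of this greedy loop (function name and score layout are illustrative); run on the matrix above, it reproduces steps 1 and 2:

```python
def competitive_linking(scores):
    """Greedy competitive linking, as described above: repeatedly link the
    highest-scoring (e, f) pair whose words are both still unlinked.
    scores: dict mapping (e_word, f_word) to a Dice score."""
    links, used_e, used_f = [], set(), set()
    for (e, f), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s <= 0.0:
            break                            # zero-score pairs stay unlinked
        if e not in used_e and f not in used_f:
            links.append((e, f, s))
            used_e.add(e)
            used_f.add(f)
    return links

scores = {("That", "वह"): 0.26, ("That", "घर"): 0.33, ("That", "है"): 0.16,
          ("house", "वह"): 0.24, ("house", "घर"): 0.4}
print(competitive_linking(scores))
# [('house', 'घर', 0.4), ('That', 'वह', 0.26)]
```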
Distortion Probabilities
where vj is the number of vacancies in the English output interval [1, j]
NULL हिंदी की किताब अलमारी में थी.
The Hindi book was in the cupboard.
Cept                Vacancies                          Parameters of d1
                    (The Hindi book was in the cupboard)
f[i]     πi,k       v1  v2  v3  v4  v5  v6  v7    j   vj  vmax  v⊙i−1
NULL     π0,1       1   2   3   4   5   6   7     1   -   -     -
हिंदी     π1,1       -   1   2   3   4   5   6     2   1   6     0
किताब    π2,1       -   -   1   2   3   4   5     3   1   5     0
अलमारी   π3,1       -   -   -   1   2   3   4     7   4   4     0
में       π4,1       -   -   -   1   2   -   -     5   2   2     3
         π4,2       -   -   -   1   -   2   -     6   2   -     -
थी       π5,1       -   -   -   1   -   -   -     4   1   1     0
Model 5
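Model 5 removes the deficiency by placing words only into vacant positions; using the quantities tabulated above, the first word of a cept is placed with a distribution of the form (following Och & Ney 2003):

$$d_1\big(v_j \mid B(f_j),\, v_{\odot_{i-1}},\, v_{\max}\big)$$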
Model 6
Motivation
• HMM makes use of locality in the source language
• Model 4 makes use of locality in the target language
• How can we use both?
Formulation
• Combine the HMM and Model 4 in a log-linear way (Och & Ney 2003)
• α is the interpolation parameter
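Following Och and Ney (2003), with α weighting the HMM component:

$$p_6(f, a \mid e) = \frac{p_4(f, a \mid e) \cdot p_{\mathrm{HMM}}(f, a \mid e)^{\alpha}}{\sum_{a', f'} p_4(f', a' \mid e) \cdot p_{\mathrm{HMM}}(f', a' \mid e)^{\alpha}}$$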
General Formulation
• A log-linear combination of several models pk(f, a | e), k = 1, ..., K, is given by:
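$$p(f, a \mid e) = \frac{\prod_{k=1}^{K} p_k(f, a \mid e)^{\alpha_k}}{\sum_{a', f'}\, \prod_{k=1}^{K} p_k(f', a' \mid e)^{\alpha_k}}$$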
Comparison of Statistical Models

Model     Alignment     Fertility   E-Step        Deficient
Model 1   Uniform       No          Exact         No
Model 2   Zero-order    No          Exact         No
HMM       First-order   No          Exact         No
Model 3   Zero-order    Yes         Approximate   Yes
Model 4   First-order   Yes         Approximate   Yes
Model 5   First-order   Yes         Approximate   No
Model 6   First-order   Yes         Approximate   Yes
Summary

Model     Alignment     Description
Model 1   Uniform       uniform distribution over the l+1 positions
Model 2   Zero-order    alignment probability a(aj | j, m, l)
HMM       First-order   jump-width dependence on the previous alignment position
Model 3   Zero-order    distortion probability d(j | aj, m, l), plus fertility
Model 4   First-order   relative distortion d1, d>1 over cepts, plus fertility
Model 5   First-order   non-deficient variant of Model 4 based on vacancies
Model 6   First-order   log-linear combination of HMM and Model 4
Corpus

Verbmobil Task (German-English speech translation)

                               German    English
Training corpus   Sentences    34,446 (~34K)
                  Words        329,625   343,076
                  Vocabulary   5,936     3,505
                  Singletons   2,600     1,305
Bilingual dict.   Entries      4,404
                  Words        4,758     5,543
Test corpus       Sentences    354
                  Words        3,233     3,109
Hansards Task (French-English: debates of the Canadian parliament)

                               French    English
Training corpus   Sentences    1,470K
                  Words        24.33M    22.16M
                  Vocabulary   100,269   78,332
                  Singletons   40,199    31,319
Bilingual dict.   Entries      28,701
                  Words        28,702    30,186
Test corpus       Sentences    500
                  Words        8,749     7,946
Comparison based on various metrics
Based on training schemes (Verbmobil)
Based on training schemes (Hansards)
Observations - 1
Refined models (4, 5, 6) perform better at all corpus sizes.
As the corpus size increases, the improvement becomes much more significant.
Statistical models perform better than heuristic models, thanks to the well-founded mathematical theory underlying their parameter estimation.
Observations - 2
The Hidden Markov alignment model achieves significantly better results than Model 2
– it is a homogeneous first-order alignment model
– it better represents the locality and monotonicity properties of natural languages
Alignment quality depends heavily on the method used to bootstrap the model
– bootstrapping with the HMM performs better than bootstrapping with Model 2
EM training
The E-step is exact for Models 1, 2 and the HMM
For Models 3, 4, 5, 6 it is approximate
– a smaller subset of alignments is selected
Three methods to select the subset:
Viterbi alignment (Brown et al. 1993)
Viterbi + neighborhood (Al-Onaizan et al. 1999)
Pegged alignments (Brown et al. 1993)
Effect of more alignments (Verbmobil)
Effect of more alignments (Hansards)
Observations
The effect of pegging strongly depends on the quality of the starting point used.
If Model 2 is used as the starting point:
– the neighborhood alignments and the pegged alignments give a significant improvement
– using only the Viterbi alignment gives significantly worse results than additionally using the neighborhood of the Viterbi alignment
With the HMM as the starting point, the effect of pegging is much smaller.
Using more alignments in training is a way to avoid a poor local optimum.
Computational times
• Using the pegged alignments yields only a moderate improvement in performance, at a considerable increase in computational time.
PART IV
Piyush Dungarwal
Mandar Joshi
Effect of Smoothing
Using linear interpolation
For Alignment probability:
For Fertility probability:
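A schematic form of the interpolation (the weight names α, β and the exact backoff distributions are assumptions; the idea is to back off the HMM alignment probabilities toward a uniform distribution and the fertility probabilities toward a word-independent one):

$$p'(a_j \mid a_{j-1}, l) = (1-\alpha)\, p(a_j \mid a_{j-1}, l) + \alpha \cdot \frac{1}{l}$$
$$n'(\phi \mid e) = (1-\beta)\, n(\phi \mid e) + \beta \cdot n(\phi)$$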
Results (Verbmobil task)
Results (Hansards task)
Direction
Symmetrization
Intersection
Union
Refined method
– In a first step, the intersection A = A1 ∩ A2 is determined.
– Then A is extended by adding alignments (i, j) occurring only in A1 or only in A2 if:
  • neither fj nor ei has an alignment in A, or
  • both of the following conditions hold:
    – (i, j) has a horizontal or vertical neighbour in A
    – A ∪ {(i, j)} does not contain alignments with both horizontal and vertical neighbours
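A minimal Python sketch of the refined method just described (function names are illustrative, and the iteration order, left open above, is simplified to a fixed-point loop), illustrated by the grids below:

```python
def refined_symmetrization(a_f2e, a_e2f):
    """Refined symmetrization heuristic, as described above.
    Links are (i, j) pairs: English position i, foreign position j."""
    A = set(a_f2e) & set(a_e2f)                  # step 1: intersection
    candidates = (set(a_f2e) | set(a_e2f)) - A   # links in only one direction

    def h_neighbour(link, s):                    # horizontal neighbour in s
        i, j = link
        return (i, j - 1) in s or (i, j + 1) in s

    def v_neighbour(link, s):                    # vertical neighbour in s
        i, j = link
        return (i - 1, j) in s or (i + 1, j) in s

    changed = True
    while changed:                               # repeat until nothing qualifies
        changed = False
        for i, j in sorted(candidates - A):
            e_unaligned = all(ii != i for ii, _ in A)
            f_unaligned = all(jj != j for _, jj in A)
            extended = A | {(i, j)}
            no_bad_shape = not any(
                h_neighbour(l, extended) and v_neighbour(l, extended)
                for l in extended)
            has_neighbour = h_neighbour((i, j), A) or v_neighbour((i, j), A)
            if (e_unaligned and f_unaligned) or (has_neighbour and no_bad_shape):
                A.add((i, j))
                changed = True
    return A
```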
[Five alignment grids over e1-e4 × f1-f4: the F→E alignments, the E→F alignments, and their intersection, union, and refined-method combination]
Symmetrisation
Conclusion
This paper compares heuristic and statistical alignment methods (IBM Models 1-5, HMM, Model 6)
Statistical alignment models outperform the simple Dice coefficient
Model 6 outperforms all other models
Smoothing and symmetrisation have a significant effect on the alignment quality achieved by a particular model.
Conclusion
The bilingual dictionary and word classes have only a minor effect on alignment quality
In general, important ingredients of a good model seem to be a first-order dependence between word positions and a fertility model
References
Franz Josef Och and Hermann Ney. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19-51, March 2003.
Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), 1993.
Philipp Koehn. Statistical Machine Translation. Cambridge University Press, 2010. Chapter 4 (Word-Based Models).
Stephan Vogel, Hermann Ney, and Christoph Tillmann. HMM-Based Word Alignment in Statistical Translation. COLING 1996, pages 836-841.