1 Lecture 8: Statistical Alignment & Machine Translation
(Chapter 13 of Manning & Schütze) Wen-Hsiang Lu, Department
of Computer Science and Information Engineering, National Cheng
Kung University, 2014/05/12 (Slides from Berlin Chen, National
Taiwan Normal University, http://140.122.185.120/PastCourses/2004S-
NaturalLanguageProcessing/NLP_main_2004S.htm)
References:
1. Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., Mercer, R. L. The
Mathematics of Statistical Machine Translation, Computational Linguistics, 1993.
2. Knight, K. Automating Knowledge Acquisition for Machine Translation, AI Magazine, 1997.
3. Gale, W. A. and Church, K. W. A Program for Aligning Sentences in Bilingual Corpora, Computational Linguistics, 1993.
Slide 2
2 Machine Translation (MT)
Definition: automatic translation of text or speech from one language to another.
Goal: produce close to error-free output that reads fluently in the target language. We are still far from it.
Current status: existing systems are still limited in quality.
Babel Fish Translation (before 2004); Google Translation: a mix of
probabilistic and non-probabilistic components.
Slide 3
3 Issues
Build high-quality semantics-based MT systems in circumscribed domains.
Abandon fully automatic MT; build software to assist human translators instead.
Post-edit the output of an imperfect translation system.
Develop automatic knowledge acquisition techniques for improving
general-purpose MT: supervised or unsupervised learning.
Slide 4
4 Different Strategies for MT
Word-for-word: English text (word string) <-> French text (word string)
Syntactic transfer: English (syntactic parse) <-> French (syntactic parse)
Semantic transfer: English (semantic representation) <-> French (semantic representation)
Knowledge-based translation: via an interlingua (knowledge representation)
Slide 5
5 Word-for-Word MT
Translate words one by one from one language to another.
Problems:
1. No one-to-one correspondence between words in different languages
(lexical ambiguity); need to look at context larger than the individual
word (phrase or clause). E.g., English "suit" has the meanings lawsuit
and set of garments, which translate to different French words.
2. Languages have different word orders.
Slide 6
6 Syntactic Transfer MT
Parse the source text, transfer the parse tree of the source text into a
syntactic tree in the target language, and then generate the translation
from this syntactic tree.
Solves the problem of word ordering.
Problems:
Syntactic ambiguity.
The target syntax will likely mirror that of the source text. E.g.,
German "Ich esse gern" (N V Adv) literally becomes English "I eat
readily/gladly" rather than "I like to eat".
Slide 7
7 Text Alignment: Definition
Align paragraphs, sentences, or words in one language to paragraphs,
sentences, or words in another language.
From alignments we can learn which words tend to be translated by which
words in the other language.
Text alignment is not part of the MT process per se, but it is the
obligatory first step for making use of multilingual text corpora.
Applications: bilingual lexicography (bilingual dictionaries), machine
translation, multilingual information retrieval, parallel grammars.
Slide 8
8 Text Alignment: Sources and Granularities
Sources of parallel texts (bitexts): parliamentary proceedings (e.g., the
Hansards), newspapers and magazines, religious and literary works.
Two levels of alignment:
Gross (large-scale) alignment: learn which paragraphs or sentences
correspond to which paragraphs or sentences in the other language.
Word alignment: learn which words tend to be translated by which words
in the other language; the necessary step for acquiring a bilingual
dictionary. With less literal translations, the order of words or
sentences might not be preserved.
Slide 9
9 Text Alignment: Example 1 — a 2:2 alignment
Slide 10
10 Text Alignment: Example 2 — 2:2, 1:1, and 2:1 alignments. An aligned
group of sentences is called a bead. Studies show that around 90% of
alignments are 1:1 sentence alignments.
Slide 11
11 Sentence Alignment
Crossing dependencies are not allowed here: word ordering is preserved.
Related work:
13 Sentence Alignment — Length-based method
Rationale: short sentences will be translated as short sentences and long
sentences as long sentences. Length is defined as the number of words or
the number of characters.
Approach 1 (Gale & Church 1993). Assumptions:
The paragraph structure is clearly marked in the corpus; confusions are
checked by hand.
Lengths of sentences are measured in characters.
Crossing dependencies are not handled.
The order of sentences is not changed in the translation.
The method ignores the richer information available in the text.
Corpus: Union Bank of Switzerland (UBS) corpus in English, French, and German.
Slide 14
14 Sentence Alignment — Length-based method
[Figure: source sentences s1…sI and target sentences t1…tJ grouped into
beads B1, B2, …, Bk.]
Possible bead types: {1:1, 1:0, 0:1, 2:1, 1:2, 2:2}. The probability of
an alignment factors over its beads (independence between beads).
Slide 15
15 Sentence Alignment — Length-based method
Dynamic programming with a cost function (distance measure); the sentence
is the unit of alignment.
Statistical modeling of character lengths: the scaled difference between
the lengths of two aligned text portions (using the ratio of text lengths
in the two languages) is modeled as a standard normal distribution.
Bayes' law then turns the probability of a length difference given a
match into the cost of that match, via the distribution function of the
standard normal.
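The cost function above can be sketched in code. This is a minimal illustration of the Gale & Church distance measure, not their exact implementation; the parameters C and S2 and the bead priors are approximations of the figures reported in their paper.

```python
import math

# Approximate parameters from Gale & Church (1993):
# C: expected target characters per source character; S2: variance.
C, S2 = 1.06, 6.8

# Illustrative prior probabilities of bead types (match counts).
PRIOR = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
         (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def norm_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bead_cost(src_len, tgt_len, bead):
    """-log probability of aligning src_len source characters with
    tgt_len target characters in a bead of the given type."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    mean = (src_len + tgt_len / C) / 2.0
    delta = (tgt_len - src_len * C) / math.sqrt(max(mean, 1.0) * S2)
    # Two-tailed probability that |delta| is at least this large.
    p_delta = max(2.0 * (1.0 - norm_cdf(abs(delta))), 1e-300)
    return -math.log(PRIOR[bead]) - math.log(p_delta)
```

Dynamic programming then sums these bead costs over a candidate segmentation and keeps the cheapest path.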
Slide 16
16 Sentence Alignment — Length-based method
The prior probability P(align) of each bead type.
[Figure: alignment of source sentences s_{i-2}, s_{i-1}, s_i with target
sentences t_{j-2}, t_{j-1}, t_j.]
Slide 17
17 Sentence Alignment — Length-based method: a simple example
Source sentences s1, s2, s3, s4; target sentences t1, t2, t3.
Alignment 1: cost(align(s1, t1)) + cost(align(s2, t2)) +
cost(align(s3, ∅)) + cost(align(s4, t3))
Alignment 2: cost(align(s1, s2, t1)) + cost(align(s3, t2)) +
cost(align(s4, t3))
Slide 18
18 Sentence Alignment — Length-based method
A 4% error rate was achieved.
Problems:
Cannot handle noisy and imperfect input, e.g., OCR output or files
containing unknown markup conventions, where finding paragraph or
sentence boundaries is difficult. Solution: just align text offsets
(positions) in the two parallel texts (Church 1993).
Questionable for language pairs with few cognates or different writing
systems, e.g., English–Chinese, or Eastern European–Asian language pairs.
Slide 19
19 Sentence Alignment — Lexical method
Approach 1 (Kay and Röscheisen 1993):
Assume the first and last sentences of the texts are aligned; these are
the initial anchors.
Form an envelope of possible alignments: alignments are excluded when
their sentences cross anchors or when their respective distances from an
anchor differ greatly.
Choose word pairs whose distributions are similar over most of the
sentences.
Find pairs of source and target sentences which contain many possible
lexical correspondences; the most reliable pairs are used to induce a
set of partial alignments (added to the list of anchors).
Iterate.
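The envelope step above can be sketched as follows; the fixed width and the linear interpolation between anchors are simplifications assumed here, not the paper's exact formulation.

```python
def envelope(anchors, width=3):
    """Candidate sentence pairs (i, j) lying near the line segments
    that connect consecutive anchors; pairs far from every segment
    are excluded from consideration."""
    pairs = set()
    for (i0, j0), (i1, j1) in zip(anchors, anchors[1:]):
        for i in range(i0, i1 + 1):
            # Interpolate the expected target position for source sentence i.
            frac = (i - i0) / max(i1 - i0, 1)
            j_mid = j0 + frac * (j1 - j0)
            lo = max(j0, int(j_mid) - width)
            hi = min(j1, int(j_mid) + width)
            for j in range(lo, hi + 1):
                pairs.add((i, j))
    return pairs
```

As reliable sentence pairs are promoted to anchors, the envelope tightens and the next iteration considers fewer candidates.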
Slide 20
20 Word Alignment
The sentence/offset alignment can be extended to a word alignment.
Criteria are then used to select aligned word pairs for inclusion in the
bilingual dictionary: frequency of word correspondences, association
measures, etc.
Slide 21
21 Statistical Machine Translation
The noisy channel model: a language model over English e, a translation
model from e to French f, and a decoder (|e| = l, |f| = m).
Assumptions:
An English word can be aligned with multiple French words, while each
French word is aligned with at most one English word.
The individual word-to-word translations are independent.
Slide 22
22 Statistical Machine Translation
Three important components:
Language model: gives the probability P(e).
Translation model: sums the translation probabilities over all possible
alignments (an alignment specifies the English word that each French
word f_j is aligned with), with a normalization constant.
Decoder: searches for the best English sentence.
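As a sketch of how the three components fit together, assuming candidate translations and log-probability scoring functions are already available (both hypothetical here; real decoders search a huge implicit space):

```python
def decode(f, candidates, lm_logprob, tm_logprob):
    """Noisy-channel decoding over an explicit candidate list:
    pick the English sentence e maximizing log P(e) + log P(f|e)."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))
```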
Slide 23
23 An Introduction to Statistical Machine Translation Dept. of
CSIE, NCKU Yao-Sheng Chang Date: 2011.04.12
Slide 24
24 Introduction
Translation involves many cultural aspects; we only consider the
translation of individual sentences, and just acceptable sentences.
Every sentence in one language is a possible translation of any sentence
in the other.
Assign to each pair (S, T) a probability Pr(T|S): the probability that a
translator will produce T in the target language when presented with S
in the source language.
Slide 25
25 Statistical Machine Translation (SMT): the noisy channel problem
Slide 26
26 Fundamentals of SMT
Given a French string f, the job of our translation system is to find the
English string e that the speaker had in mind when producing f (Bayes'
theorem). Since the denominator Pr(f) is a constant, the best e is the
one with the greatest probability Pr(e) Pr(f|e).
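The slide's equation, reconstructed: by Bayes' theorem,

```latex
\hat{e} \;=\; \operatorname*{argmax}_{e} \Pr(e \mid f)
        \;=\; \operatorname*{argmax}_{e} \frac{\Pr(e)\,\Pr(f \mid e)}{\Pr(f)}
        \;=\; \operatorname*{argmax}_{e} \Pr(e)\,\Pr(f \mid e).
```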
Slide 27
27 Practical Challenges
Computation of the translation model Pr(f|e).
Computation of the language model Pr(e).
Decoding (i.e., searching for the e that maximizes Pr(f|e) Pr(e)).
Slide 28
28 Alignment of case 1 (one-to-many)
Slide 29
29 Alignment of case 2 (many-to-one)
Slide 30
30 Alignment of case 3 (many-to-many)
Slide 31
31 Formulation of Alignment (1)
Let e = e_1^l = e_1 e_2 … e_l and f = f_1^m = f_1 f_2 … f_m.
An alignment between a pair of strings e and f maps every word f_j to
some word e_i; that is, an alignment a between e and f says that the
word f_j, 1 ≤ j ≤ m, is generated by the word e_{a_j}, with
a_j ∈ {0, 1, …, l}.
There are (l+1)^m different alignments between e and f (including the
null word e_0, for no mapping).
If a_j = i, then f_j is generated by e_{a_j} = e_i.
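A tiny sketch of this alignment space: each French position independently picks one of the l English positions or the null position 0, giving (l+1)^m alignments.

```python
from itertools import product

def alignments(l, m):
    """Enumerate every alignment a_1..a_m of a length-m French
    string to a length-l English string; index 0 is the null word."""
    return list(product(range(l + 1), repeat=m))
```

For l = 2 and m = 3 this yields 27 alignments, so enumerating them explicitly is feasible only for toy examples.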
Slide 32
32 Translation Model
The alignment a can be represented by a series a_1^m = a_1 a_2 … a_m of
m values, each between 0 and l, such that if the word in position j of
the French string is connected to the word in position i of the English
string, then a_j = i, and if it is not connected to any English word,
then a_j = 0 (null).
Slide 33
33 IBM Model I (1)
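The slide's formula is not legible in this copy; in Brown et al. (1993), IBM Model 1 takes the form

```latex
\Pr(f \mid e) \;=\; \frac{\varepsilon}{(l+1)^m}
  \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l}
  \; \prod_{j=1}^{m} t(f_j \mid e_{a_j}),
```

where ε is a small fixed probability for the French length m and t(f|e) are the word-translation probabilities.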
Slide 34
34 IBM Model I (2)
The alignment is determined by specifying the values of a_j for j from 1
to m, each of which can take any value from 0 to l.
Slide 35
35 Constrained Maximization
We wish to adjust the translation probabilities t(f|e) so as to maximize
Pr(f|e) subject to the constraint that for each e, Σ_f t(f|e) = 1.
Slide 36
36 Lagrange Multipliers (1)
Method of Lagrange multipliers with one constraint: if there is a maximum
or minimum subject to the constraint g(x, y) = 0, then it will occur at
one of the critical numbers of the function F defined by
F(x, y, λ) = f(x, y) + λ g(x, y).
f(x, y) is called the objective function; g(x, y) = 0 is called the
constraint equation; F(x, y, λ) is called the Lagrange function; λ is
called the Lagrange multiplier.
Slide 37
37 Lagrange Multipliers (2)
Following standard practice for constrained maximization, we introduce
Lagrange multipliers λ_e, one for each English word, and seek an
unconstrained extremum of the auxiliary function h(t, λ).
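The auxiliary function, as given in Brown et al. (1993):

```latex
h(t, \lambda) \;=\; \Pr(f \mid e)
  \;-\; \sum_{e} \lambda_{e} \Bigl( \sum_{f} t(f \mid e) - 1 \Bigr).
```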
Slide 38
38 Derivation (1)
Take the partial derivative of h with respect to t(f|e) and set it to
zero. The result involves the Kronecker delta function δ, equal to one
when both of its arguments are the same and equal to zero otherwise.
Slide 39
39 Derivation (2)
We call the expected number of times that e connects to f in the
translation (f|e) the count of f given e, and denote it by
c(f|e; f, e). By definition,
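written out (following Brown et al. 1993), the count is the posterior-weighted number of links between e and f:

```latex
c(f \mid e;\, \mathbf{f}, \mathbf{e}) \;=\;
  \sum_{\mathbf{a}} \Pr(\mathbf{a} \mid \mathbf{f}, \mathbf{e})
  \sum_{j=1}^{m} \delta(f, f_j)\,\delta(e, e_{a_j}).
```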
Slide 40
40 Derivation (3)
Replacing λ_e by λ_e Pr(f|e), Equation (11) can be written very compactly
as t(f|e) = λ_e^{-1} c(f|e; f, e).
In practice, our training data consists of a set of translations
(f^(1)|e^(1)), (f^(2)|e^(2)), …, (f^(S)|e^(S)), so this equation becomes
t(f|e) = λ_e^{-1} Σ_{s=1}^{S} c(f|e; f^(s), e^(s)).
Slide 41
41 Derivation (4)
Pr(f|e) can be rearranged into an expression that can be evaluated
efficiently.
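The key rearrangement is that the sum over all (l+1)^m alignments of a product factorizes into a product of sums:

```latex
\sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l}
  \prod_{j=1}^{m} t(f_j \mid e_{a_j})
\;=\;
  \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i).
```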
Slide 42
42 Derivation (5)
Thus, the number of operations necessary to calculate a count is
proportional to l + m rather than to (l + 1)^m, as Equation (12) might
suggest.
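The derivation above amounts to the familiar EM procedure for Model 1, sketched below on a toy bitext. The uniform initialization and the 'NULL' token are conventional choices assumed here, not mandated by the derivation.

```python
from collections import defaultdict

def train_model1(bitext, iterations=10):
    """EM for IBM Model 1. bitext is a list of (french, english)
    sentence pairs, each a list of words. Returns t[(f, e)], the
    probability that English word e translates as French word f."""
    t = defaultdict(lambda: 1.0)              # uniform-ish start
    for _ in range(iterations):
        count = defaultdict(float)            # E-step: expected counts c(f|e)
        total = defaultdict(float)
        for fs, es in bitext:
            es = ['NULL'] + list(es)          # position 0 is the empty word
            for f in fs:
                z = sum(t[(f, e)] for e in es)        # normalization
                for e in es:
                    p = t[(f, e)] / z                 # P(f links to e)
                    count[(f, e)] += p
                    total[e] += p
        for (f, e), c in count.items():       # M-step: renormalize
            t[(f, e)] = c / total[e]
    return t
```

On a two-sentence bitext such as {la maison / the house, la fleur / the flower}, the counts concentrate after a few iterations so that "la" prefers "the" and "maison" prefers "house".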
Slide 43
Other References about MT
Slide 44
44
Slide 45
45 Chinese-English Sentence Alignment (ROCLING XVI, 2004): two stages,
iterative DP, stop list, partial match.
Slide 46
46 Bilingual Collocation Extraction Based on Syntactic and
Statistical Analyses (Chien-Cheng Wu & Jason S. Chang,
ROCLING03)
Slide 47
47 Bilingual Collocation Extraction Based on Syntactic and
Statistical Analyses (Chien-Cheng Wu & Jason S. Chang,
ROCLING03)
Preprocessing steps to calculate the following information:
Lists of preferred POS patterns of collocations in both languages.
Collocation candidates matching the preferred POS patterns.
N-gram statistics for both languages, N = 1, 2.
Log-likelihood ratio statistics for two consecutive words in each language.
Log-likelihood ratio statistics for pairs of bilingual collocation
candidates across the two languages.
Content word alignment based on the Competitive Linking Algorithm
(Melamed 1997).
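A minimal sketch of the greedy competitive linking step; the one-to-one linking discipline is the essence of Melamed's method, while the association scores are assumed to be precomputed (e.g., log-likelihood ratios).

```python
def competitive_linking(scores):
    """Greedy competitive linking: repeatedly take the
    highest-scoring (source, target) pair whose words are both
    still unlinked. scores maps (source, target) to association."""
    links, used_s, used_t = [], set(), set()
    for (s, t), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s not in used_s and t not in used_t:
            links.append((s, t))
            used_s.add(s)
            used_t.add(t)
    return links
```

Because each word can participate in at most one link, a strong direct association blocks the weaker indirect associations that plague plain co-occurrence counting.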
Slide 48
48 Bilingual Collocation Extraction Based on Syntactic and
Statistical Analyses (Chien-Cheng Wu & Jason S. Chang,
ROCLING03)
Slide 49
49 Bilingual Collocation Extraction Based on Syntactic and
Statistical Analyses (Chien-Cheng Wu & Jason S. Chang,
ROCLING03)
Slide 50
A Probability Model to Improve Word Alignment Colin Cherry
& Dekang Lin, University of Alberta
Slide 51
Outline: Introduction; Probabilistic Word-Alignment Model;
Word-Alignment Algorithm; Constraints; Features
Slide 52
Introduction
Word-aligned corpora are an excellent source of translation-related
knowledge in statistical machine translation, e.g., translation lexicons
and transfer rules.
Word-alignment problem: conventional approaches usually use
co-occurrence models, e.g., χ² (Gale & Church 1991) or log-likelihood
ratio (Dunning 1993).
Indirect association problem: Melamed (2000) proposed competitive
linking along with an explicit noise model to solve it.
This work proposes a probabilistic word-alignment model which allows
easy integration of context-specific features.
Slide 53
[Figure: noise]
Slide 54
Probabilistic Word-Alignment Model
Given E = e_1, e_2, …, e_m and F = f_1, f_2, …, f_n:
If e_i and f_j are a translation pair, then the link l(e_i, f_j) exists.
If e_i has no corresponding translation, then the null link l(e_i, f_0) exists.
If f_j has no corresponding translation, then the null link l(e_0, f_j) exists.
An alignment A is a set of links such that every word in E and F
participates in at least one link.
The alignment problem is to find the alignment A that maximizes
P(A|E, F); IBM's translation models instead maximize P(A, F|E).
Slide 55
Probabilistic Word-Alignment Model (Cont.)
Given A = {l_1, l_2, …, l_t}, where l_k = l(e_{i_k}, f_{j_k}), define
consecutive subsets of A as l_i^j = {l_i, l_{i+1}, …, l_j}.
Let C_k = {E, F, l_1^{k-1}} represent the context of l_k.