1 Lecture 8: Statistical Alignment & Machine Translation
(Chapter 13 of Manning & Schütze) Wen-Hsiang Lu, Department
of Computer Science and Information Engineering, National Cheng
Kung University, 2014/05/12 (Slides from Berlin Chen, National
Taiwan Normal University, http://140.122.185.120/PastCourses/2004S-
NaturalLanguageProcessing/NLP_main_2004S.htm)
References:
1. Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., Mercer, R. L. The
Mathematics of Statistical Machine Translation, Computational Linguistics, 1993.
2. Knight, K. Automating Knowledge Acquisition for Machine Translation, AI Magazine, 1997.
3. Gale, W. A. and Church, K. W. A Program for Aligning Sentences in Bilingual Corpora, Computational Linguistics, 1993.
Slide 2
2 Machine Translation (MT)
Definition: automatic translation of text or speech from one language to another.
Goal: produce close to error-free output that reads fluently in the target language. We are still far from it.
Current status: existing systems are still limited in quality.
Babel Fish Translation (before 2004); Google Translation: a mix of
probabilistic and non-probabilistic components.
Slide 3
3 Issues
Build high-quality semantics-based MT systems in circumscribed domains.
Abandon fully automatic MT; build software to assist human translators instead.
Post-edit the output of an imperfect translation system.
Develop automatic knowledge acquisition techniques for improving
general-purpose MT: supervised or unsupervised learning.
Slide 4
4 Different Strategies for MT
Word-for-word: English text (word string) <-> French text (word string)
Syntactic transfer: English (syntactic parse) <-> French (syntactic parse)
Semantic transfer: English (semantic representation) <-> French (semantic representation)
Knowledge-based translation: via an interlingua (knowledge representation)
Slide 5
5 Word-for-Word MT
Translate words one by one from one language to another.
Problems:
1. No one-to-one correspondence between words in different languages
(lexical ambiguity); need to look at context larger than the individual
word (phrase or clause). E.g., English "suit" has the meanings lawsuit
and set of garments, which translate to different French words.
2. Languages have different word orders.
Slide 6
6 Syntactic Transfer MT
Parse the source text, transfer the parse tree of the source text into a
syntactic tree in the target language, and then generate the translation
from this syntactic tree.
Solves the problem of word ordering.
Problems:
Syntactic ambiguity.
The target syntax will likely mirror that of the source text. E.g.,
German "Ich esse gern" (N V Adv) literally becomes English "I eat
readily/gladly" rather than "I like to eat".
Slide 7
7 Text Alignment: Definition
Align paragraphs, sentences, or words in one language to paragraphs,
sentences, or words in another language.
From alignments we can learn which words tend to be translated by which
words in the other language.
Text alignment is not part of the MT process per se, but it is the
obligatory first step for making use of multilingual text corpora.
Applications: bilingual lexicography (bilingual dictionaries), machine
translation, multilingual information retrieval, parallel grammars.
Slide 8
8 Text Alignment: Sources and Granularities
Sources of parallel texts (bitexts): parliamentary proceedings (e.g., the
Hansards), newspapers and magazines, religious and literary works.
Two levels of alignment:
Gross (large-scale) alignment: learn which paragraphs or sentences
correspond to which paragraphs or sentences in the other language.
Word alignment: learn which words tend to be translated by which words
in the other language; the necessary step for acquiring a bilingual
dictionary. With less literal translations, the order of words or
sentences might not be preserved.
Slide 9
9 Text Alignment: Example 1 — a 2:2 alignment
Slide 10
10 Text Alignment: Example 2 — 2:2, 1:1, and 2:1 alignments. An aligned
group of sentences is called a bead. Studies show that around 90% of
alignments are 1:1 sentence alignments.
Slide 11
11 Sentence Alignment
Crossing dependencies are not allowed here: word ordering is preserved.
Related work:
13 Sentence Alignment — Length-based method
Rationale: short sentences will be translated as short sentences and long
sentences as long sentences. Length is defined as the number of words or
the number of characters.
Approach 1 (Gale & Church 1993). Assumptions:
The paragraph structure is clearly marked in the corpus; confusions are
checked by hand.
Lengths of sentences are measured in characters.
Crossing dependencies are not handled.
The order of sentences is not changed in the translation.
The method ignores the richer information available in the text.
Corpus: Union Bank of Switzerland (UBS) corpus in English, French, and German.
Slide 14
14 Sentence Alignment — Length-based method
[Figure: source sentences s1…sI and target sentences t1…tJ grouped into
beads B1, B2, …, Bk.]
Possible bead types: {1:1, 1:0, 0:1, 2:1, 1:2, 2:2}. The probability of
an alignment factors over its beads (independence between beads).
Slide 15
15 Sentence Alignment — Length-based method
Dynamic programming with a cost function (distance measure); the sentence
is the unit of alignment.
Statistical modeling of character lengths: the scaled difference between
the lengths of two aligned text portions (using the ratio of text lengths
in the two languages) is modeled as a standard normal distribution.
Bayes' law then turns the probability of a length difference given a
match into the cost of that match, via the distribution function of the
standard normal.
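The cost function above can be sketched in code. This is a minimal illustration of the Gale & Church distance measure, not their exact implementation; the parameters C and S2 and the bead priors are approximations of the figures reported in their paper.

```python
import math

# Approximate parameters from Gale & Church (1993):
# C: expected target characters per source character; S2: variance.
C, S2 = 1.06, 6.8

# Illustrative prior probabilities of bead types (match counts).
PRIOR = {(1, 1): 0.89, (1, 0): 0.0099, (0, 1): 0.0099,
         (2, 1): 0.089, (1, 2): 0.089, (2, 2): 0.011}

def norm_cdf(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bead_cost(src_len, tgt_len, bead):
    """-log probability of aligning src_len source characters with
    tgt_len target characters in a bead of the given type."""
    if src_len == 0 and tgt_len == 0:
        return 0.0
    mean = (src_len + tgt_len / C) / 2.0
    delta = (tgt_len - src_len * C) / math.sqrt(max(mean, 1.0) * S2)
    # Two-tailed probability that |delta| is at least this large.
    p_delta = max(2.0 * (1.0 - norm_cdf(abs(delta))), 1e-300)
    return -math.log(PRIOR[bead]) - math.log(p_delta)
```

Dynamic programming then sums these bead costs over a candidate segmentation and keeps the cheapest path.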
Slide 16
16 Sentence Alignment — Length-based method
The prior probability P(align) of each bead type.
[Figure: alignment of source sentences s_{i-2}, s_{i-1}, s_i with target
sentences t_{j-2}, t_{j-1}, t_j.]
Slide 17
17 Sentence Alignment — Length-based method: a simple example
Source sentences s1, s2, s3, s4; target sentences t1, t2, t3.
Alignment 1: cost(align(s1, t1)) + cost(align(s2, t2)) +
cost(align(s3, ∅)) + cost(align(s4, t3))
Alignment 2: cost(align(s1, s2, t1)) + cost(align(s3, t2)) +
cost(align(s4, t3))
Slide 18
18 Sentence Alignment — Length-based method
A 4% error rate was achieved.
Problems:
Cannot handle noisy and imperfect input, e.g., OCR output or files
containing unknown markup conventions, where finding paragraph or
sentence boundaries is difficult. Solution: just align text offsets
(positions) in the two parallel texts (Church 1993).
Questionable for language pairs with few cognates or different writing
systems, e.g., English–Chinese, or Eastern European–Asian language pairs.
Slide 19
19 Sentence Alignment — Lexical method
Approach 1 (Kay and Röscheisen 1993):
Assume the first and last sentences of the texts are aligned; these are
the initial anchors.
Form an envelope of possible alignments: alignments are excluded when
their sentences cross anchors or when their respective distances from an
anchor differ greatly.
Choose word pairs whose distributions are similar over most of the
sentences.
Find pairs of source and target sentences which contain many possible
lexical correspondences; the most reliable pairs are used to induce a
set of partial alignments (added to the list of anchors).
Iterate.
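The envelope step above can be sketched as follows; the fixed width and the linear interpolation between anchors are simplifications assumed here, not the paper's exact formulation.

```python
def envelope(anchors, width=3):
    """Candidate sentence pairs (i, j) lying near the line segments
    that connect consecutive anchors; pairs far from every segment
    are excluded from consideration."""
    pairs = set()
    for (i0, j0), (i1, j1) in zip(anchors, anchors[1:]):
        for i in range(i0, i1 + 1):
            # Interpolate the expected target position for source sentence i.
            frac = (i - i0) / max(i1 - i0, 1)
            j_mid = j0 + frac * (j1 - j0)
            lo = max(j0, int(j_mid) - width)
            hi = min(j1, int(j_mid) + width)
            for j in range(lo, hi + 1):
                pairs.add((i, j))
    return pairs
```

As reliable sentence pairs are promoted to anchors, the envelope tightens and the next iteration considers fewer candidates.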
Slide 20
20 Word Alignment
The sentence/offset alignment can be extended to a word alignment.
Criteria are then used to select aligned word pairs for inclusion in the
bilingual dictionary: frequency of word correspondences, association
measures, etc.
Slide 21
21 Statistical Machine Translation
The noisy channel model: a language model over English e, a translation
model from e to French f, and a decoder (|e| = l, |f| = m).
Assumptions:
An English word can be aligned with multiple French words, while each
French word is aligned with at most one English word.
The individual word-to-word translations are independent.
Slide 22
22 Statistical Machine Translation
Three important components:
Language model: gives the probability P(e).
Translation model: sums the translation probabilities over all possible
alignments (an alignment specifies the English word that each French
word f_j is aligned with), with a normalization constant.
Decoder: searches for the best English sentence.
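As a sketch of how the three components fit together, assuming candidate translations and log-probability scoring functions are already available (both hypothetical here; real decoders search a huge implicit space):

```python
def decode(f, candidates, lm_logprob, tm_logprob):
    """Noisy-channel decoding over an explicit candidate list:
    pick the English sentence e maximizing log P(e) + log P(f|e)."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))
```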
Slide 23
23 An Introduction to Statistical Machine Translation Dept. of
CSIE, NCKU Yao-Sheng Chang Date: 2011.04.12
Slide 24
24 Introduction
Translation involves many cultural aspects; we only consider the
translation of individual sentences, and just acceptable sentences.
Every sentence in one language is a possible translation of any sentence
in the other.
Assign to each pair (S, T) a probability Pr(T|S): the probability that a
translator will produce T in the target language when presented with S
in the source language.
Slide 25
25 Statistical Machine Translation (SMT): the noisy channel problem
Slide 26
26 Fundamentals of SMT
Given a French string f, the job of our translation system is to find the
English string e that the speaker had in mind when producing f (Bayes'
theorem). Since the denominator Pr(f) is a constant, the best e is the
one with the greatest probability Pr(e) Pr(f|e).
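The slide's equation, reconstructed: by Bayes' theorem,

```latex
\hat{e} \;=\; \operatorname*{argmax}_{e} \Pr(e \mid f)
        \;=\; \operatorname*{argmax}_{e} \frac{\Pr(e)\,\Pr(f \mid e)}{\Pr(f)}
        \;=\; \operatorname*{argmax}_{e} \Pr(e)\,\Pr(f \mid e).
```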
Slide 27
27 Practical Challenges
Computation of the translation model Pr(f|e).
Computation of the language model Pr(e).
Decoding (i.e., searching for the e that maximizes Pr(f|e) Pr(e)).
Slide 28
28 Alignment of case 1 (one-to-many)
Slide 29
29 Alignment of case 2 (many-to-one)
Slide 30
30 Alignment of case 3 (many-to-many)
Slide 31
31 Formulation of Alignment (1)
Let e = e_1^l = e_1 e_2 … e_l and f = f_1^m = f_1 f_2 … f_m.
An alignment between a pair of strings e and f maps every word f_j to
some word e_i; that is, an alignment a between e and f says that the
word f_j, 1 ≤ j ≤ m, is generated by the word e_{a_j}, with
a_j ∈ {0, 1, …, l}.
There are (l+1)^m different alignments between e and f (including the
null word e_0, for no mapping).
If a_j = i, then f_j is generated by e_{a_j} = e_i.
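A tiny sketch of this alignment space: each French position independently picks one of the l English positions or the null position 0, giving (l+1)^m alignments.

```python
from itertools import product

def alignments(l, m):
    """Enumerate every alignment a_1..a_m of a length-m French
    string to a length-l English string; index 0 is the null word."""
    return list(product(range(l + 1), repeat=m))
```

For l = 2 and m = 3 this yields 27 alignments, so enumerating them explicitly is feasible only for toy examples.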
Slide 32
32 Translation Model
The alignment a can be represented by a series a_1^m = a_1 a_2 … a_m of
m values, each between 0 and l, such that if the word in position j of
the French string is connected to the word in position i of the English
string, then a_j = i, and if it is not connected to any English word,
then a_j = 0 (null).
Slide 33
33 IBM Model I (1)
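The slide's formula is not legible in this copy; in Brown et al. (1993), IBM Model 1 takes the form

```latex
\Pr(f \mid e) \;=\; \frac{\varepsilon}{(l+1)^m}
  \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l}
  \; \prod_{j=1}^{m} t(f_j \mid e_{a_j}),
```

where ε is a small fixed probability for the French length m and t(f|e) are the word-translation probabilities.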
Slide 34
34 IBM Model I (2)
The alignment is determined by specifying the values of a_j for j from 1
to m, each of which can take any value from 0 to l.
Slide 35
35 Constrained Maximization
We wish to adjust the translation probabilities t(f|e) so as to maximize
Pr(f|e) subject to the constraint that for each e, Σ_f t(f|e) = 1.
Slide 36
36 Lagrange Multipliers (1)
Method of Lagrange multipliers with one constraint: if there is a maximum
or minimum subject to the constraint g(x, y) = 0, then it will occur at
one of the critical numbers of the function F defined by
F(x, y, λ) = f(x, y) + λ g(x, y).
f(x, y) is called the objective function; g(x, y) = 0 is called the
constraint equation; F(x, y, λ) is called the Lagrange function; λ is
called the Lagrange multiplier.
Slide 37
37 Lagrange Multipliers (2)
Following standard practice for constrained maximization, we introduce
Lagrange multipliers λ_e, one for each English word, and seek an
unconstrained extremum of the auxiliary function h(t, λ).
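The auxiliary function, as given in Brown et al. (1993):

```latex
h(t, \lambda) \;=\; \Pr(f \mid e)
  \;-\; \sum_{e} \lambda_{e} \Bigl( \sum_{f} t(f \mid e) - 1 \Bigr).
```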
Slide 38
38 Derivation (1)
Take the partial derivative of h with respect to t(f|e) and set it to
zero. The result involves the Kronecker delta function δ, equal to one
when both of its arguments are the same and equal to zero otherwise.
Slide 39
39 Derivation (2)
We call the expected number of times that e connects to f in the
translation (f|e) the count of f given e, and denote it by
c(f|e; f, e). By definition,
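written out (following Brown et al. 1993), the count is the posterior-weighted number of links between e and f:

```latex
c(f \mid e;\, \mathbf{f}, \mathbf{e}) \;=\;
  \sum_{\mathbf{a}} \Pr(\mathbf{a} \mid \mathbf{f}, \mathbf{e})
  \sum_{j=1}^{m} \delta(f, f_j)\,\delta(e, e_{a_j}).
```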
Slide 40
40 Derivation (3)
Replacing λ_e by λ_e Pr(f|e), Equation (11) can be written very compactly
as t(f|e) = λ_e^{-1} c(f|e; f, e).
In practice, our training data consists of a set of translations
(f^(1)|e^(1)), (f^(2)|e^(2)), …, (f^(S)|e^(S)), so this equation becomes
t(f|e) = λ_e^{-1} Σ_{s=1}^{S} c(f|e; f^(s), e^(s)).
Slide 41
41 Derivation (4)
Pr(f|e) can be rearranged into an expression that can be evaluated
efficiently.
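The key rearrangement is that the sum over all (l+1)^m alignments of a product factorizes into a product of sums:

```latex
\sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l}
  \prod_{j=1}^{m} t(f_j \mid e_{a_j})
\;=\;
  \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i).
```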
Slide 42
42 Derivation (5)
Thus, the number of operations necessary to calculate a count is
proportional to l + m rather than to (l + 1)^m, as Equation (12) might
suggest.
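The derivation above amounts to the familiar EM procedure for Model 1, sketched below on a toy bitext. The uniform initialization and the 'NULL' token are conventional choices assumed here, not mandated by the derivation.

```python
from collections import defaultdict

def train_model1(bitext, iterations=10):
    """EM for IBM Model 1. bitext is a list of (french, english)
    sentence pairs, each a list of words. Returns t[(f, e)], the
    probability that English word e translates as French word f."""
    t = defaultdict(lambda: 1.0)              # uniform-ish start
    for _ in range(iterations):
        count = defaultdict(float)            # E-step: expected counts c(f|e)
        total = defaultdict(float)
        for fs, es in bitext:
            es = ['NULL'] + list(es)          # position 0 is the empty word
            for f in fs:
                z = sum(t[(f, e)] for e in es)        # normalization
                for e in es:
                    p = t[(f, e)] / z                 # P(f links to e)
                    count[(f, e)] += p
                    total[e] += p
        for (f, e), c in count.items():       # M-step: renormalize
            t[(f, e)] = c / total[e]
    return t
```

On a two-sentence bitext such as {la maison / the house, la fleur / the flower}, the counts concentrate after a few iterations so that "la" prefers "the" and "maison" prefers "house".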
Slide 43
Other References about MT
Slide 44
44
Slide 45
45 Chinese-English Sentence Alignment (ROCLING XVI, 2004): two stages,
iterative DP, stop list, partial match.
Slide 46
46 Bilingual Collocation Extraction Based on Syntactic and
Statistical Analyses (Chien-Cheng Wu & Jason S. Chang,
ROCLING03)
Slide 47
47 Bilingual Collocation Extraction Based on Syntactic and
Statistical Analyses (Chien-Cheng Wu & Jason S. Chang,
ROCLING03)
Preprocessing steps to calculate the following information:
Lists of preferred POS patterns of collocations in both languages.
Collocation candidates matching the preferred POS patterns.
N-gram statistics for both languages, N = 1, 2.
Log-likelihood ratio statistics for two consecutive words in each language.
Log-likelihood ratio statistics for pairs of bilingual collocation
candidates across the two languages.
Content word alignment based on the Competitive Linking Algorithm
(Melamed 1997).
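A minimal sketch of the greedy competitive linking step; the one-to-one linking discipline is the essence of Melamed's method, while the association scores are assumed to be precomputed (e.g., log-likelihood ratios).

```python
def competitive_linking(scores):
    """Greedy competitive linking: repeatedly take the
    highest-scoring (source, target) pair whose words are both
    still unlinked. scores maps (source, target) to association."""
    links, used_s, used_t = [], set(), set()
    for (s, t), _ in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s not in used_s and t not in used_t:
            links.append((s, t))
            used_s.add(s)
            used_t.add(t)
    return links
```

Because each word can participate in at most one link, a strong direct association blocks the weaker indirect associations that plague plain co-occurrence counting.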
Slide 48
48 Bilingual Collocation Extraction Based on Syntactic and
Statistical Analyses (Chien-Cheng Wu & Jason S. Chang,
ROCLING03)
Slide 49
49 Bilingual Collocation Extraction Based on Syntactic and
Statistical Analyses (Chien-Cheng Wu & Jason S. Chang,
ROCLING03)
Slide 50
A Probability Model to Improve Word Alignment Colin Cherry
& Dekang Lin, University of Alberta
Slide 51
Outline: Introduction; Probabilistic Word-Alignment Model;
Word-Alignment Algorithm; Constraints; Features
Slide 52
Introduction
Word-aligned corpora are an excellent source of translation-related
knowledge in statistical machine translation, e.g., translation lexicons
and transfer rules.
Word-alignment problem: conventional approaches usually use
co-occurrence models, e.g., χ² (Gale & Church 1991) or log-likelihood
ratio (Dunning 1993).
Indirect association problem: Melamed (2000) proposed competitive
linking along with an explicit noise model to solve it.
This work proposes a probabilistic word-alignment model which allows
easy integration of context-specific features.
Slide 53
[Figure: noise]
Slide 54
Probabilistic Word-Alignment Model
Given E = e_1, e_2, …, e_m and F = f_1, f_2, …, f_n:
If e_i and f_j are a translation pair, then the link l(e_i, f_j) exists.
If e_i has no corresponding translation, then the null link l(e_i, f_0) exists.
If f_j has no corresponding translation, then the null link l(e_0, f_j) exists.
An alignment A is a set of links such that every word in E and F
participates in at least one link.
The alignment problem is to find the alignment A that maximizes
P(A|E, F); IBM's translation models instead maximize P(A, F|E).
Slide 55
Probabilistic Word-Alignment Model (Cont.)
Given A = {l_1, l_2, …, l_t}, where l_k = l(e_{i_k}, f_{j_k}), define
consecutive subsets of A as l_i^j = {l_i, l_{i+1}, …, l_j}.
Let C_k = {E, F, l_1^{k-1}} represent the context of l_k.