Transcript: Phrase-Based Statistical Machine Translation Using Finite State Machines
with some links to ASR
Bill Byrne
Cambridge University Engineering Department
The Johns Hopkins University Center for Language and Speech Processing
27 May 2005
Work done with Shankar Kumar and Yonggang Deng
Outline: Introduction / Phrase Translation via WFSTs / Embedded Model Estimation / Conclusion
Can We Pretend MT is ASR?

Of course. We just need:
- Alignment models and estimation algorithms
- Training data: bitext, monolingual text
- Search algorithms for translation
- Some way to measure translation quality
Cambridge University Engineering Department - Current Research in SMT
Five Easy Problems in Statistical Machine Translation
What’s currently needed to build a basic phrase-based SMT system:
I Bitext for SMT Training: Document and Sentence AlignmentI Models of Word and Phrase Alignment Within SentencesI Language ModelsI Translation – Model-Based Search AlgorithmsI Automatic Measurement of Translation Quality
Component Transducers / Alignment and Translation Operations
Translation
Once the models are defined, translation is 'trivial'.
It's just search: ê = argmax_e P(e|f)

One approach is to construct specialized search algorithms:
- Depending on the underlying models, search can be by Viterbi, A*, or other specialized search procedures

But decoder design and implementation is complex:
- Small model changes might require large changes to a decoder
- Approximations in search imply inexact implementation of the models
  - Can only implement what can be implemented
  - May as well develop models which lead to exact realizations
- Decoder implementation takes effort away from 'modeling'
Translation via Weighted Finite State Transducers
Translation with Finite State Devices, Knight & Al-Onaizan, AMTA'98:
- Implements the IBM models as WFSTs
- word-to-word translation, word fertility, and permutation (reordering)

If the component models can be implemented as WFSTs which can be composed, building a decoder is trivial:
- Can be limiting, but avoids special-purpose decoders
- The value of this modeling approach has been shown in ASR by the systems developed at AT&T
- Translation is performed using libraries of standard FSM operations
- Clear formulation of underlying model components
- Easy to work on modules in isolation
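If the component models are WFSTs, the whole decoder is just composition plus a least-cost-path search. A minimal pure-Python sketch of that idea (a toy, not the AT&T FSM library or the authors' code; the French-to-English arcs and weights below are invented for illustration, and epsilon transitions are omitted):

```python
import heapq

def compose(t1, t2):
    """Compose two epsilon-free WFSTs. Each transducer is a list of arcs
    (src, insym, outsym, weight, dst); weights are -log probabilities."""
    return [((p, r), a, c, w1 + w2, (q, s))
            for (p, a, b, w1, q) in t1
            for (r, b2, c, w2, s) in t2
            if b == b2]

def best_path(arcs, start, final):
    """Least-cost (Viterbi) path; returns (cost, output symbol sequence)."""
    adj = {}
    for (p, _, c, w, q) in arcs:
        adj.setdefault(p, []).append((w, c, q))
    heap, seen = [(0.0, start, [])], set()
    while heap:
        cost, state, out = heapq.heappop(heap)
        if state == final:
            return cost, out
        if state in seen:
            continue
        seen.add(state)
        for (w, c, q) in adj.get(state, []):
            heapq.heappush(heap, (cost + w, q, out + [c]))
    return None

# F: acceptor for a French sentence "le chat" (an identity transducer).
F = [(0, "le", "le", 0.0, 1), (1, "chat", "chat", 0.0, 2)]
# T: hypothetical one-state word translation transducer, weights = -log P(e|f).
T = [(9, "le", "the", 0.2, 9), (9, "chat", "cat", 0.1, 9),
     (9, "chat", "chair", 1.5, 9)]

cost, words = best_path(compose(F, T), (0, 9), (2, 9))
```

Real toolkits do exactly this, with epsilon handling, lazy expansion, and determinization on top.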
Translation Template Model - TTM
Generative source-channel model of machine translation. Takes the best of:
- Och & Ney's phrase-based translation models
- Knight & Al-Onaizan's WFST description of translation via IBM models

- Bitext word alignment and translation under the model can be performed using standard WFST operations
- Modular implementation
- No need for a specialized decoder - "the model is the decoder"
- Can easily generate translation lattices and N-best lists

Overall goals:
- Relatively good performance with models that are really quite simple
- Should be easy to use in translating ASR lattices
- Easy to teach
TTM (2004) – Translation with Monotone Phrase Order
[Figure: the monotone TTM as a cascade of transformations - Source Phrase Segmentation, Source Phrase Reordering, Placement of Target Phrase Insertion Markers, Target Phrase Insertion, Phrase Transduction, Target Phrase Segmentation. Example: the source language sentence "grain exports are projected to fall by 25 %" is segmented into the source phrases "grain exports are_projected_to fall by_25_%", transduced to the target phrases "les exportations de grains doivent fléchir de_25_%", and segmented into the target language sentence "les exportations de grains doivent fléchir de 25 %".]

- Transformations via stochastic models implemented as WFSTs
- Implementation is direct using standard WFST operations
TTM (2005) – Translation with Moving Target Phrase Order
[Figure: the 2005 TTM pipeline with local reordering. Steps: Source Phrase Segmentation, Target Phrase Segmentation, Translation, Reordering, Insertion. The source language sentence "grain exports are projected to fall by 25 %" is segmented into source phrases "grain exports are_projected_to fall by_25_%"; translation yields target phrases in source phrase order; reordering yields target phrases in target phrase order; after placement of phrase insertion markers, the output is the target language sentence "les exportations de grains doivent fléchir de 25 %".]
Phrase-to-Phrase Translation Models
Alignment Template Models (Och et al. 1999):
- derived from 'good' word-level alignments, typically from IBM-4

[Figure: word alignment grids for the sentence pair "this bill places salve on a sore wound" / "ce bill met de le baume sur une blessure" under IBM-4 E, IBM-4 F, and their union (IBM-4 F ∪ E); phrase pairs (u1, v1), (u2, v2) are extracted over the aligned regions.]

- Phrase pairs are extracted to cover word alignment patterns
- Probability distributions are defined over phrase pair sequences
The Phrase Pair Inventory
English Phrase u        French Phrase v    Phrase Transduction Probability P(v|u)
hear hear               bravo              0.80
hear hear               bravo bravo        0.15
hear hear               ordre              0.05
terms of reference      mandat             0.80
terms of reference      de son mandat      0.20

- The Phrase Pair Inventory affects the performance of the TTM
- Word alignment quality of underlying models
- Coverage of phrases on the test set
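As a data structure the inventory above is just a conditional distribution P(v | u) keyed on English phrases. A small sketch (the grouping of table rows under each English phrase is our reading of the flattened table):

```python
# Phrase-pair inventory from the table above, read as P(v | u); the
# assignment of rows to English phrases is an assumption.
ppi = {
    "hear hear": {"bravo": 0.80, "bravo bravo": 0.15, "ordre": 0.05},
    "terms of reference": {"mandat": 0.80, "de son mandat": 0.20},
}

def phrase_prob(u, v):
    """Phrase transduction probability P(v | u); zero outside the PPI."""
    return ppi.get(u, {}).get(v, 0.0)

# Each conditional distribution sums to one over its French phrases.
for dist in ppi.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```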
Target Phrase Segmentation Transducer Ω
[Figure: the transducer Ω, whose arcs carry French phrases from the PPI (voilà, le, problème, fondamental, voilà_le, problème_fondamental, le_problème_fondamental) mapped to their word sequences, applied to the sentence F = "voilà le problème fondamental".]

- Ω is based on the French phrases in the PPI
- Assume F is the sentence to be translated
- Ω ∘ F gives the phrase sequences that could have generated F, i.e. the possible phrase segmentations of "voilà le problème fondamental"
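What Ω ∘ F computes can be imitated with a short recursion (a sketch, using only the phrase labels visible on the transducer arcs in the figure): enumerate every way the inventory's French phrases tile the sentence F.

```python
def segmentations(words, french_phrases):
    """Yield every segmentation of `words` into phrases from the inventory
    (phrases are written with underscores, as on the Omega arcs)."""
    if not words:
        yield []
        return
    for k in range(1, len(words) + 1):
        phrase = "_".join(words[:k])
        if phrase in french_phrases:
            for rest in segmentations(words[k:], french_phrases):
                yield [phrase] + rest

# French phrases appearing on the arcs of Omega in the figure.
phrases = {"voilà", "le", "problème", "fondamental",
           "voilà_le", "problème_fondamental", "le_problème_fondamental"}
segs = list(segmentations("voilà le problème fondamental".split(), phrases))
```

With this inventory the sentence has five distinct segmentations, each corresponding to one path through Ω ∘ F.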
Phrase Translation Transducer Y
[Figure: arcs of the phrase translation transducer Y, e.g.
  fundamental_problem:problème_fondamental/1
  basic_problem:problème_fondamental/0.3
  the_basic_problem:le_problème_fondamental/0.4
  the_fundamental_problem_we_face:le_problème_fondamental/1
  that_is:voilà/0.11   those_are:voilà/0.25   these_are:voilà/0.12
  that_is_the:voilà_le/0.10   those_are_the:voilà_le/0.11   that_these_are_the:voilà_le/1]

- Y maps English phrases into French phrases
- Realizes the Phrase-Pair Inventory
Target Language Phrase Insertion
[Figure: source phrases carrying target phrase insertion markers are expanded into the target phrases "les exportations de grains", "doivent", "fléchir", "de_25_%", giving the target language sentence "les exportations de grains doivent fléchir de 25 %".]

Different phrase segmentations lead to different translations:
- Sequences c_0^K that could have generated F (from Y ∘ Ω ∘ F):
  H1:  that these are the fundamental problem : voilà le problème fondamental
  ...
  H16: that is the basic problem : voilà le problème fondamental
Procedures for Alignment and Translation (Monotone Phrase Order)
Given a French sentence f_1^J to be translated into English, we build the following transducers (in this order):
- F represents the French sentence
- Ω maps French phrases in our Phrase-Pair Inventory (PPI) to words in F
- Y maps English phrases to French phrases in Ω with probabilities (PPI)
- Φ inserts French phrase insertion markers
- W maps English words to English phrases seen in Y (restricted phrase segmentation transducer)

If f_1^J is to be aligned with an English sentence e_1^I:
- Build an FSM E to represent e_1^I

If f_1^J is to be translated:
- Build an n-gram backoff LM and compile it as a weighted acceptor G
Bitext Word Alignment and Translation Via WFSTs
TTM Generative Model: P(f_1^J, v_1^R, d_0^K, c_0^K, a_1^K, u_1^K, K, e_1^I)

- MAP alignment of a sentence pair (f_1^J, e_1^I): choose (K, u_1^K, a_1^K, c_0^K, d_0^K, v_1^R) to maximize
    P(K, u_1^K, a_1^K, c_0^K, d_0^K, v_1^R | e_1^I, f_1^J)

- MAP translation of a French sentence f_1^J: choose (e_1^I, K, u_1^K, a_1^K, c_0^K, d_0^K, v_1^R) to maximize
    P(e_1^I, K, u_1^K, a_1^K, c_0^K, d_0^K, v_1^R | f_1^J)

WFST operations with monotone search:
- Alignment
  1. Generate the alignment lattice: B = E ∘ W ∘ Φ ∘ Y ∘ Ω ∘ F
  2. MAP alignment: least-cost path in B
- Translation
  1. Generate the translation lattice: T = G ∘ U ∘ Φ ∘ Y ∘ Ω ∘ F
  2. MAP translation: least-cost path in T
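Under monotone phrase order, the least-cost path through the translation lattice has a simple dynamic-programming reading. A toy sketch of that search (phrase probabilities taken from the Y transducer figure; the "le"→"the" entry and the unigram stand-in for the n-gram acceptor G are assumptions, and the insertion transducer Φ is omitted):

```python
import math

def decode_monotone(f_words, ppi_inv, lm):
    """Viterbi decoding with monotone phrase order. best[j] holds the best
    (score, english phrases) covering the first j French words. ppi_inv
    maps a French phrase to (english_phrase, P(v|u)) pairs; lm gives
    unigram log-probabilities (a stand-in for the n-gram acceptor G)."""
    best = {0: (0.0, [])}
    for j in range(1, len(f_words) + 1):
        for i in range(j):
            if i not in best:
                continue
            v = "_".join(f_words[i:j])
            for u, p in ppi_inv.get(v, []):
                lm_score = sum(lm.get(w, -5.0) for w in u.split("_"))
                score = best[i][0] + math.log(p) + lm_score
                if j not in best or score > best[j][0]:
                    best[j] = (score, best[i][1] + [u])
    return best.get(len(f_words))

# Entries from the Y figure; "le" -> "the" is an added assumption.
ppi_inv = {
    "voilà": [("that_is", 0.11), ("those_are", 0.25), ("these_are", 0.12)],
    "voilà_le": [("that_is_the", 0.10), ("those_are_the", 0.11)],
    "le": [("the", 0.9)],
    "problème_fondamental": [("fundamental_problem", 1.0),
                             ("basic_problem", 0.3)],
    "le_problème_fondamental": [("the_basic_problem", 0.4),
                                ("the_fundamental_problem_we_face", 1.0)],
}
score, english = decode_monotone("voilà le problème fondamental".split(),
                                 ppi_inv, lm={})
```

The same computation is what ShortestPath over G ∘ Y ∘ Ω ∘ F performs, state by state.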
Translation Performance and Analysis
Problems with Bitext Word Alignment under the TTM
Consider extraction of phrase pairs from word alignments
il demeure donc important dans un pays comme le nôtre
transportation is important to this country
- Extracted PPI is not rich enough to cover all the sentence pairs
- If we discard the word alignments, this pair has probability zero under the model
- In fact, most sentences from the training bitext have probability zero
- Solution:
  - TTM already allows insertions of target phrases
  - In addition, we allow deletions of source phrases
Source Phrase Deletion in Bitext Word Alignment

[Figure: alignment of "transportation is important to this country" to "il demeure donc important dans un pays comme le nôtre"; the source phrases il_demeure_donc and comme_le_nôtre are deleted (mapped to ε), and important_dans_un_pays aligns to important_to_this_country.]

- Novel use of phrase-based translation models for alignment
- TTM word alignments are very accurate: high precision, lower recall
- Supports parameter estimation procedures
- Must be able to segment and align the bitext

EM requires P(K, u_1^K, a_1^K, c_0^K, d_0^K, v_1^R | e_1^I, f_1^J) - the posterior of the hidden variables given the training data.
Phrase Swapping by WFSTs

[Figure: a phrase sequence x_1 ... x_5 is reordered into y_1 ... y_5 with jump values b_1 = +1, b_2 = -1, b_3 = 0, b_4 = 0, b_5 = 0, swapping the first two phrases, e.g. producing "exportations grains doivent fléchir de_25_%".]

Associate a jump sequence b_1^K with each sequence y_1^K:

  P(y_1^K | x_1^K, u_1^K, K, e_1^I) = P(b_1^K | x_1^K, u_1^K, K, e_1^I) = Π_{k=1}^{K} P(b_k | x_k, u_k)

[Figure: two-state MJ-1 swapping transducer. Input x_1 x_2 with b ∈ {0, +1, -1}: output y_1 y_2 with probability (1 - p)^2, or y_2 y_1 with probability p.]

- Inspired by work of Tillmann
- MJ-1: maximum jump of 1
- Properly parameterized; not degenerate
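The MJ-1 model is easy to simulate (a toy sketch; the phrase strings and the value of p are illustrative): a jump sequence b_1^K permutes x_1^K into y_1^K, and its probability multiplies per-jump terms, with the -1 that repairs a +1 being deterministic.

```python
def apply_jumps(x, b):
    """Place x[k] at position k + b[k]; return None if b is not a
    valid permutation of the positions."""
    y = [None] * len(x)
    for k, jump in enumerate(b):
        pos = k + jump
        if not (0 <= pos < len(x)) or y[pos] is not None:
            return None
        y[pos] = x[k]
    return y

def mj1_prob(b, p=0.2):
    """P(b_1^K) under MJ-1: b_k = 0 with probability 1 - p; a swap
    (+1 followed by the forced -1) with probability p."""
    prob, k = 1.0, 0
    while k < len(b):
        if b[k] == 0:
            prob *= 1.0 - p
            k += 1
        elif b[k] == 1 and k + 1 < len(b) and b[k + 1] == -1:
            prob *= p
            k += 2
        else:
            return 0.0  # jump pattern not reachable under MJ-1
    return prob

x = ["grains", "exportations", "doivent", "fléchir", "de_25_%"]
y = apply_jumps(x, [1, -1, 0, 0, 0])  # swap the first two phrases
```

Because every valid MJ-1 jump sequence gets positive probability and they sum to one, the model is properly parameterized rather than degenerate.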
Phrase Swapping by WFSTs (continued)

[Figure: the same jump-sequence model as the previous slide, now with the MJ-2 swapping transducer: six states, jumps b ∈ {0, +1, -1, +2, -2}.]

- Inspired by work of Tillmann
- MJ-2: maximum jump of 2
- Properly parameterized; not degenerate
Incorporating Reordering in Translation
- Reordering prior to translation:
  - English phrases in French phrase order
  - Difficult to realize with WFSTs
- Tried the opposite approach:
  - First generate a French phrase sequence in English phrase order
  - Next reorder this sequence into French phrase order under the Local Phrase Reordering Model
  - Possible to realize with WFSTs both in alignment and translation
- No English phrase reordering process
- Reordering is done prior to insertion of target phrases
  - word alignments within phrases can span fairly long distances

Can perform embedded re-estimation of reordering model parameters:
- Phrase-pair dependent reordering probability: P(b_k | x_k, u_k)
- Estimated via Viterbi approximation to EM
- Exact estimation: alignments are done under the translation model
Training and Translation Procedures
1. Bitext chunking
   1.1 Monotone alignment into coarse chunks of documents
   1.2 Divisive clustering into subsentence chunks
2. MTTK word alignment (Model-4 replacement)
3. Construct component WFSTs for the TTM
   3.1 Extract phrases from the alignments using the 'usual' heuristics
   3.2 Use phrase-pair induction under MTTK to augment the PPI
4. Translation lattice generation with a pruned trigram LM
5. Translation lattice rescoring with an unpruned 4-gram LM
6. Minimum Error Training
   6.1 Transducer weights optimized for BLEU on held-out data
TTM and Phrase Reordering
Good gains from reordering in both Arabic→English and Chinese→English:
- Jump-1 might be as good as Jump-2 (in this formulation)
- Little gain in Chinese→English from parameter estimation

Translation under MJ-1 and MJ-2 reordering with a 4-gram LM, BLEU (%):

Reordering     Arabic-English       Chinese-English
Model          02    03    04       02    03    04
None           37.5  40.3  36.8     24.2  23.7  26.0
MJ-1 flat      40.4  43.9  39.4     25.7  24.5  27.4
MJ-1 VT        41.3  44.8  40.3     25.8  24.5  27.8
MJ-2 flat      41.0  44.4  39.7     26.2  24.8  27.8
MJ-2 VT        41.4  45.0  40.2     26.4  24.8  27.8
Influence of Language Model Order on Reordering in Translation
BLEU over merged eval sets (Eval02, Eval03, Eval04), BLEU (%):

               Arabic-English       Chinese-English
Reordering     2g    3g    4g       2g    3g    4g
None           21.0  36.8  37.8     16.1  24.8  25.0
MJ-1           23.4  40.4  41.6     16.2  25.9  26.5
MJ-2           23.3  40.4  41.6     16.0  26.0  26.8
These Simple Reordering Models Benefit from Better Language Models
Translation Performance with Reordering across Eval 04 Test Genres
BLEU (%):

               Arabic-English          Chinese-English
Model          News  Eds   Spchs      News  Eds   Spchs
None           41.1  30.8  33.3       23.6  25.9  30.8
MJ-1 flat      44.7  31.6  35.3       24.3  27.6  32.9
MJ-1 VT        45.6  32.6  35.7       24.8  27.8  33.3
MJ-2 flat      45.2  31.9  35.2       24.8  27.8  33.4
MJ-2 VT        45.8  32.4  35.2       24.8  27.7  33.6

Possible causes:
- Training / test mismatch?
- Language model mismatch, in particular?
- Less movement in Speeches and Editorials?
- Different operating points?
Unsolicited Opinions
Summary: Working Towards Integrated Modeling and Decoding

Modeling: Given an English sentence e and a French sentence f, construct a joint distribution over their alignments:

  P(e, a, f) = P(f | a, e) [Translation Model] × P(a | e) [Alignment Model] × P(e) [Language Model]

Decoding (ideal): Given f, find a translation ê and an alignment â as

  (ê, â) = argmax_{e,a} P(f | a, e) P(a | e) P(e)

Decoding (really): lacks integrated modeling and decoding
- Models are trained and alignments are generated over the training set
- The models are discarded, and the alignments are kept
- PPI, etc. are extracted from the alignments and used in translation

Goal: same models in alignment and translation - 'what works' for ASR
- needed for: MMI, clustering, context dependence, ...
IntroductionPhrase Translation via WFSTs
Embedded Model EstimationConclusion
Unsolicited Opinions
New! – TTM Tutorial is Available
German→English translation based on:
- Europarl corpus
- Giza++ alignments
- AT&T FSM Toolkit (any FSM toolkit should work)

Tutorial steps through building and using all the transducers.

Very much an alpha version, but available to anybody who's interested.
The Machine Translation Pyramid

Casts the problem in familiar terms: strengths and weaknesses of the formulation are obvious.

[Figure: the classic MT pyramid. English Words and Foreign Words at the base, English/Foreign Syntax above, English/Foreign Semantics above that, and Interlingua at the apex.]
The Statistical Machine Translation Pyramid
Building good systems requires balancing competing concerns.

[Figure: an 'SMT pyramid' balancing Simple Models & Fast Algorithms, Machine Learning, Linguistic Knowledge, Evidence, Explication, and Elegance.]
Thanks!