Evaluation of Context-Dependent Phrasal Translation Lexicons for Statistical Machine Translation
Marine CARPUAT and Dekai WU
Human Language Technology Center, Department of Computer Science and Engineering
HKUST
New resources for SMT: context-dependent phrasal translation lexicons
A key new resource for Phrase Sense Disambiguation (PSD) for SMT [Carpuat & Wu 2007]
  Entirely automatically acquired
  Consistently improves 8 translation quality metrics [EMNLP 2007]
  Fully phrasal, just like conventional SMT lexicons [TMI 2007]
But… much larger than conventional lexicons!
  Why is this extremely large resource necessary?
  Is its contribution observably useful?
  Is it used by the SMT system differently than conventional SMT lexicons?
HKUST Human Language Technology Center Carpuat & Wu, LREC 2008
Our finding: context-dependent lexicons directly improve lexical choice in SMT
Exploit the available vocabulary better for phrasal segmentation
  more and longer phrases are used in decoding
  consistent with other findings [TMI 2007]: fully phrasal context-dependent lexicons yield more reliable improvements than single-word lexicons
Select better translation candidates, even after compensating for differences in phrasal segmentation
  improvements in BLEU, TER, METEOR, etc. really reflect improved lexical choice
Problems with current SMT systems
Input 张 教 授 给 一 群 人 就 “ 中 国 和 印 度 ” 上 课 。
Ref. Prof. Zhang gave a lecture on “China and India” to a packed audience.
SMT1 Prof. Zhang to a group of people on `China and India` class.
SMT2 Prof. Zhang and a group of people go into class on “China and India”.
Correct translation
SMT2 Prof. Zhang and a group of people go into class on “China and India”.
Ref. Prof. Zhang gave a lecture on “China and India” to a packed audience.
Candidate translations: up, go into, climb, …, attend, gave, …
Translation lexicons in SMT are independent of context!
张 教 授 给 一 群 人 就 “ 中 国 和 印 度 ” 上 课 。
欢 迎 大 家 明 天 来 上 课 , 题 目 是 “ 中 国 和 印 度 ” 。
Prof. Zhang gave a lecture on “China and India” to a packed audience.
Everyone is welcome to attend class tomorrow, on the topic “China and India”.
Candidate translations, with identical static probabilities in both sentences:
up .25, go into .25, climb .20, …, attend .10, gave .05, …
Phrasal lexicons in SMT are independent of context too!
张 教 授 给 一 群 人 就 “ 中 国 和 印 度 ” 上 课 。
欢 迎 大 家 明 天 来 上 课 , 题 目 是 “ 中 国 和 印 度 ” 。
Prof. Zhang gave a lecture on “China and India” to a packed audience.
Everyone is welcome to attend class tomorrow, on the topic “China and India”.
Candidate phrasal translations, with identical static probabilities in both sentences:
attend class .45, gave a lecture .15, …
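To make the problem concrete, here is a minimal sketch (reusing the slide's illustrative scores; the lookup function is hypothetical) of why a static phrasal lexicon cannot disambiguate: the same distribution is returned for 上课 no matter which sentence it occurs in.

```python
# A conventional phrase table maps a source phrase to one fixed
# distribution over translations, so both example sentences receive
# identical scores for 上课 -- the lookup is context-independent.
phrase_table = {
    "上课": {"attend class": 0.45, "gave a lecture": 0.15},
}

def translate_scores(source_phrase, sentence):
    # `sentence` is accepted but ignored: context plays no role.
    return phrase_table[source_phrase]

s1 = "张教授 给 一群人 就 中国和印度 上课"   # needs "gave a lecture"
s2 = "欢迎 大家 明天 来 上课"               # needs "attend class"
assert translate_scores("上课", s1) == translate_scores("上课", s2)
```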
Current SMT systems are hurt by very weak models of context
Translation disambiguation models are too simplistic:
  Phrasal lexicon translation probabilities are static, so not sensitive to context
  Context in the input language is only modeled weakly by phrase segments
  Context in the output language is only modeled weakly by n-grams
Error analysis reveals many lexical choice errors
Yet, few attempts at directly modeling context
Today’s SMT systems ignore the contextual features that would help lexical choice
  No full sentential context: only local n-gram context
  No POS information: only the surface form of words
  No structural information: only word n-gram identities
Correct translation disambiguation requires rich context features
张 教 授 给 一 群 人 就 “ 中 国 和 印 度 ” 上 课 。
Prof. Zhang gave a lecture on “China and India” to a packed audience.
欢 迎 大 家 明 天 来 上 课 , 题 目 是 “ 中 国 和 印 度 ” 。
Everyone is welcome to attend class tomorrow, on the topic “China and India”.
[Figure: with rich context features such as POS tag sequences and syntactic dependencies (e.g. SUBJ), the scores for “attend class” vs. “gave a lecture” become sentence-specific (attend class .15, gave a lecture .80 for the first sentence; attend class .70, gave a lecture .20 for the second) instead of the static .45/.15.]
Today’s SMT systems ignore context in their phrasal translation lexicons
Conventional phrase-based SMT:
  e* = argmax_e [ log P(e) + log P(f | e, a) + … ]
With a context-dependent phrasal translation lexicon, a dynamic feature is added:
  e* = argmax_e [ log P(e) + Σ_{j=1..m} log P(t_j | f_j, c_j(f)) + … ]
where c_j(f) is the entire input sentence context of the j-th phrase.
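The log-linear combination can be sketched numerically. All probabilities and feature weights below are hypothetical; the point is only that the dynamic context-dependent lexicon probability enters as one more log-linear feature alongside the static phrase-table and language-model scores, and can override them.

```python
import math

# Log-linear model score: a weighted sum of log feature probabilities.
def score(features, weights):
    return sum(weights[name] * math.log(p) for name, p in features.items())

# Hypothetical feature values for two candidate translations of 上课
# in the "gave a lecture" context sentence:
gave_lecture = {"lm": 0.02, "static_lex": 0.15, "context_lex": 0.80}
attend_class = {"lm": 0.03, "static_lex": 0.45, "context_lex": 0.15}

baseline = {"lm": 0.3, "static_lex": 0.2}                    # no context feature
with_ctx = {"lm": 0.3, "static_lex": 0.2, "context_lex": 0.5}

# Without the context feature, the static lexicon preference wins...
b_gave = score({k: gave_lecture[k] for k in baseline}, baseline)
b_attend = score({k: attend_class[k] for k in baseline}, baseline)
assert b_attend > b_gave
# ...but the dynamic context-dependent feature can override it.
assert score(gave_lecture, with_ctx) > score(attend_class, with_ctx)
```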
But context-dependent lexical choice does not necessarily improve translation quality
Early pilot study [Brown et al. 1991]
  used the single most discriminative feature to disambiguate between 2 English translations of a French word
  WSD improved French-English translation quality, but not on a significant vocabulary, and allowed only 2 senses
Context-dependent lexical choice helps word alignment, but not really translation quality [Garcia Varea et al. 2001, 2002]
  a maximum-entropy trained bilexicon replaces IBM-4/5 translation probabilities
  improves AER on the Canadian Hansards and Verbmobil tasks
  small improvement on WER and PER by rescoring n-best lists, but not statistically significant [Garcia Varea & Casacuberta 2005]
Context-dependent modeling improves the quality of statistical MT [Carpuat & Wu 2007]
Introduced context-dependent phrasal lexicons for SMT
  leverage WSD techniques for SMT lexical choice
  generalize conventional WSD to Phrase Sense Disambiguation
Context-dependent modeling always improves SMT accuracy
  on all tasks: 3 different IWSLT06 datasets, NIST04
  on all 8 common automatic metrics: BLEU, NIST, METEOR, METEOR+synsets, TER, WER, PER, CDER
No other WSD-for-SMT approach improves translation quality as consistently
Until recently, using WSD to improve SMT quality has met with mixed or disappointing results
  Carpuat & Wu [ACL-2005], Cabezas & Resnik [unpub]
Last year, for the first time, different approaches showed that WSD can help translation quality
  WSD improved BLEU (but how about other metrics?) on 3 Chinese-English tasks [Carpuat et al. IWSLT-2006]
  WSD improved BLEU (but how about other metrics?) on the Chinese-English NIST task [Chan et al. ACL-2007]
  WSD improved METEOR (but not BLEU!) on the Spanish-English Europarl task [Giménez & Màrquez WMT-2007]
  Phrasal WSD improved BLEU, NIST, METEOR (but how about error rates?) on Italian-English and Chinese-English IWSLT tasks [Stroppa et al. TMI-2007]
But no other approach improves on 8 metrics on 4 different tasks
But how useful are the context-dependent lexicons as resources?
Improving translation quality is great, but…
  metrics aggregate the impact of many different factors
  metrics ignore how translation hypotheses are generated
Context-dependent lexicons are more expensive to train, so…
  are their contributions observably useful?
Direct analysis needed: how do SMT systems use context-dependent vs. conventional lexicons?
Learning context-dependent vs. conventional lexicons for SMT
Both are learned from the same word-aligned parallel data:
  cover the same phrasal input vocabulary
  know the same phrasal translation candidates
Only difference: an additional context-dependent parameter
  dynamically computed, vs. static conventional scores
  uses WSD modeling, vs. MLE in conventional lexicons
Word Sense Disambiguation provides appropriate models of context
WSD has long targeted the questions of
  how to design context features
  how to combine contextual evidence into a sense prediction
Senseval/SemEval have extensively evaluated WSD systems
  with different feature sets
  with different machine learning classifiers
Senseval multilingual lexical sample tasks use observable lexical translations as senses, just like lexical choice in SMT
  e.g. Senseval-2003 English-Hindi, SemEval-2007 Chinese-English
Leveraging a Senseval WSD system
Top Senseval-3 Chinese Lexical Sample system [Carpuat et al. 2004]
  standard classification models: maximum entropy, SVM, boosted decision stumps, naïve Bayes
  rich lexical and syntactic features:
    bag-of-words sentence context
    position-sensitive co-occurring words and POS tags
    basic syntactic dependency features
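The feature types above can be illustrated with a small sketch. The feature-name format and window size here are hypothetical, not the system's actual representation; the sketch only shows bag-of-words context plus position-sensitive word and POS features around a target token.

```python
# Extract WSD-style context features for the token at position i:
# a bag-of-words feature per other token, plus position-sensitive
# word and POS-tag features in a small window (here +/- 2).
def extract_features(tokens, pos_tags, i):
    feats = {f"bow={w}" for j, w in enumerate(tokens) if j != i}
    for off in (-2, -1, 1, 2):
        j = i + off
        if 0 <= j < len(tokens):
            feats.add(f"w[{off}]={tokens[j]}")
            feats.add(f"pos[{off}]={pos_tags[j]}")
    return feats

toks = ["欢迎", "大家", "明天", "来", "上课"]
tags = ["V", "N", "NT", "V", "V"]
f = extract_features(toks, tags, 4)   # features for 上课
assert "w[-1]=来" in f and "pos[-1]=V" in f and "bow=明天" in f
```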
Generalizing WSD to PSD for context-dependent phrasal translation lexicons
One PSD model per input-language phrase, regardless of POS, length, etc.
  a generalization of standard WSD models
Sense candidates are the phrase translation candidates seen in training
  extracted just like the conventional SMT phrasal lexicon: typically, output-language phrases consistent with the intersection of bidirectional IBM alignments
Extracting PSD senses and training examples from word-aligned parallel text
is there a new - age music concert within the next few days ?
在 最近 一段 时间 里 有 流行音乐 会 吗 ?
Extracted PSD training instances:
在 最近 一段 时间 <t sense=“within”> 里 </t> 有 流行音乐 会 吗 ?
在 最近 一段 时间 里 有 <t sense=“new - age music”> 流行音乐 </t> 会 吗 ?
在 <t sense=“within the next few days”> 最近 一段 时间 里 </t> 有 流行音乐 会 吗 ?
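The harvesting step above can be sketched with a simplified version of the alignment-consistency criterion used in standard phrase extraction: every source span whose alignments project onto a closed target span yields one PSD training instance (source phrase, target-phrase "sense"). This is a toy reimplementation under that assumption, not the paper's exact extractor.

```python
# Extract (source phrase, target phrase) training instances from one
# word-aligned sentence pair. `align` is a list of (src_idx, tgt_idx) links.
def extract_instances(src, tgt, align, max_len=4):
    out = []
    for i in range(len(src)):
        for j in range(i + 1, min(i + max_len, len(src)) + 1):
            tgt_idx = [t for s, t in align if i <= s < j]
            if not tgt_idx:
                continue
            lo, hi = min(tgt_idx), max(tgt_idx)
            # Consistency: no target word in [lo, hi] aligns outside [i, j).
            if all(i <= s < j for s, t in align if lo <= t <= hi):
                out.append((" ".join(src[i:j]), " ".join(tgt[lo:hi + 1])))
    return out

src = ["最近", "一段", "时间", "里"]
tgt = ["within", "the", "next", "few", "days"]
align = [(0, 2), (1, 3), (2, 4), (3, 0)]
pairs = extract_instances(src, tgt, align)
assert ("里", "within") in pairs
assert ("最近 一段 时间 里", "within the next few days") in pairs
```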
Integrating the context-dependent lexicon into phrase-based SMT architectures
The context-dependent phrasal lexicon probabilities
  are conditional translation probabilities, so they can naturally be added as a feature in log-linear translation models
  unlike conventional translation probabilities, they are dynamically computed, dependent on full-sentence context
Decoding can make full use of context-dependent phrasal lexicon predictions at all stages of decoding
  unlike in n-best reranking
Evaluating context-dependent phrasal translation lexicons
  lexical choice only, vs. translation quality [Carpuat & Wu EMNLP 2007]
  integrated evaluation in SMT, vs. stand-alone as in Senseval [Carpuat et al. 2004]
  fully phrasal lexicons only, vs. single-word context-dependent lexicons [Carpuat & Wu TMI 2007]
Translation task
  Test set: NIST-04 Chinese-English text translation
    1788 sentences, 4 reference translations
  Standard phrase-based SMT decoder (Moses)
Experimental setup
Learning the lexicons
Standard conventional lexicon learning
  Newswire Chinese-English corpus, ~2M sentences
  Standard word-alignment methodology: GIZA++, intersection with “grow-diag” heuristics [Koehn et al. 2003]
  Standard Pharaoh/Moses phrase table: maximum phrase length = 10; translation probabilities in both directions, lexical weights
Context-dependent lexicons
  use the exact same word-aligned parallel data
  train a WSD model for each known phrase
Step 1: Evaluating phrasal segmentation with context-dependent vs. conventional lexicons
Goal: compare the phrasal segmentations of the input sentence used to produce the top hypothesis
Method:
  We do not evaluate accuracy: there is no gold-standard phrasal segmentation!
  Instead, we analyze how the input phrases available in the lexicons are used
SMT uses longer input phrases with context-dependent lexicons
  Context-dependent lexicons help use longer, less ambiguous phrases
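The segmentation analysis can be sketched as follows. The trace format (a list of (start, end) source spans per decoded sentence) is a hypothetical representation of the decoder's best-hypothesis segmentation, and the numbers are illustrative.

```python
# Compare average source-phrase length across decoder segmentation traces.
# Each sentence is a list of (start, end) source spans used for the
# top hypothesis; span length = end - start (in source tokens).
def avg_phrase_len(segmentations):
    spans = [end - start for sent in segmentations for (start, end) in sent]
    return sum(spans) / len(spans)

conventional = [[(0, 1), (1, 2), (2, 4)], [(0, 2), (2, 3)]]
context_dep = [[(0, 2), (2, 4)], [(0, 3)]]

# Longer phrases on average under the context-dependent lexicon.
assert avg_phrase_len(context_dep) > avg_phrase_len(conventional)
```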
SMT uses more input phrase types with context-dependent lexicons
  26% of the phrase types used with the context-dependent lexicon are not used with the conventional lexicon
  96% of those lexicon entries are truly phrasal (not single words)
  Context-dependent lexicons make better use of the available input-language vocabulary
SMT uses more rare phrases with context-dependent lexicons
  With context modeling, less training data is needed for phrases to be used
Step 2: Comparing translation selection
Goal: compare translation selection only
Method:
  We compare the accuracy of translation selection for identical segments only, because different lexicons yield different phrasal segmentations
  A translation is considered accurate if it matches any of the reference translations, because the input sentence and references are not word-aligned
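One plausible reading of this matching criterion can be sketched as a substring check against each reference (the exact matching procedure is not spelled out in the transcript, so treat this as an assumption):

```python
# A selected phrase translation counts as accurate if it occurs
# in any of the reference translations of the sentence.
def matches_any_reference(translation, references):
    return any(translation in ref for ref in references)

refs = ["Prof. Zhang gave a lecture on China and India to a packed audience ."]
assert matches_any_reference("gave a lecture", refs)
assert not matches_any_reference("attend class", refs)
```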
Context-dependent lexicon predictions match the references better
Context-dependent lexicons yield more matches than conventional lexicons:
48% of the errors made with conventional lexicons are corrected with context-dependent lexicons

                               Conventional: Match   Conventional: No match
Context-dependent: Match              1435                   2139
Context-dependent: No match            683                   2272
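The 48% figure follows directly from the contingency table: of the segments the conventional lexicon got wrong, the fraction the context-dependent lexicon matches is 2139 / (2139 + 2272).

```python
# Counts from the 2x2 contingency table of reference matches.
table = {
    ("ctx_match", "conv_match"): 1435,
    ("ctx_match", "conv_nomatch"): 2139,
    ("ctx_nomatch", "conv_match"): 683,
    ("ctx_nomatch", "conv_nomatch"): 2272,
}
conv_errors = (table[("ctx_match", "conv_nomatch")]
               + table[("ctx_nomatch", "conv_nomatch")])
corrected = table[("ctx_match", "conv_nomatch")] / conv_errors
assert round(corrected, 2) == 0.48   # 2139 / 4411
```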
Conclusion: context-dependent phrasal translation lexicons are useful resources for SMT
A key new resource for Phrase Sense Disambiguation (PSD) for SMT [Carpuat & Wu 2007]
  Entirely automatically acquired
  Consistently improves 8 translation quality metrics [EMNLP 2007]
  Fully phrasal, just like conventional SMT lexicons [TMI 2007]
But… much larger than conventional lexicons!
  Why is this extremely large resource necessary?
  Is its contribution observably useful?
  Is it used by the SMT system differently than conventional SMT lexicons?
Conclusion: context-dependent phrasal translation lexicons are useful resources for SMT
Improve phrasal segmentation
  exploit the available input vocabulary better
  more phrases, longer phrases, and more rare phrases are used in decoding
  consistent with other findings: fully phrasal context-dependent lexicons yield more reliable improvements than single-word lexicons [Carpuat & Wu TMI 2007]
Improve translation candidate selection
  even after compensating for differences in phrasal segmentation
Genuinely improve lexical choice
  not just BLEU and other metrics!
Evaluation of Context-Dependent Phrasal Translation Lexicons for Statistical Machine Translation
Marine CARPUAT and Dekai WU
Human Language Technology Center, Department of Computer Science and Engineering
HKUST
Translation quality evaluation: not just BLEU, but 8 automatic metrics
N-gram matching metrics: BLEU4, NIST, METEOR, METEOR+synsets (augmented with WordNet synonym matching)
Edit distances: TER, WER, PER, CDER
Context-dependent modeling consistently improves translation quality
Test set  Experiment  BLEU   NIST   METEOR  METEOR(no syn)  TER    WER    PER    CDER
IWSLT 1   SMT         42.21  7.888  65.40   63.24           40.45  45.58  37.80  40.09
          SMT+WSD     42.38  7.902  65.73   63.64           39.98  45.30  37.60  39.91
IWSLT 2   SMT         41.49  8.167  66.25   63.85           40.95  46.42  37.52  40.35
          SMT+WSD     41.97  8.244  66.35   63.86           40.63  46.14  37.25  40.10
IWSLT 3   SMT         49.91  9.016  73.36   70.70           35.60  40.60  32.30  35.46
          SMT+WSD     51.05  9.142  74.13   71.44           34.68  39.75  31.71  34.58
NIST      SMT         20.41  7.155  60.21   56.15           76.76  88.26  61.71  70.32
          SMT+WSD     20.92  7.468  60.30   56.79           71.34  83.37  57.29  67.38
Results are statistically significant
NIST results are statistically significant at the 95% level
Tested using paired bootstrap resampling
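Paired bootstrap resampling can be sketched as follows: repeatedly resample test sentences with replacement and count how often one system outscores the other on the resampled set. The per-sentence scores and the reduction to a sum are a toy simplification (real BLEU aggregates n-gram statistics, not per-sentence scores); all data here is hypothetical.

```python
import random

# Paired bootstrap: resample sentence indices with replacement, score
# both systems on the SAME resampled indices, and report the fraction
# of samples where system B beats system A.
def paired_bootstrap(scores_a, scores_b, trials=1000, seed=0):
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / trials

a = [0.40, 0.42, 0.38, 0.41, 0.39] * 20   # baseline per-sentence scores
b = [0.45, 0.44, 0.43, 0.46, 0.42] * 20   # improved system
assert paired_bootstrap(a, b) > 0.95      # B better at the 95% level
```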
Translations with context-dependent phrasal lexicons often differ from SMT translations
Test set Translations changed by context modeling
IWSLT 1 25.49%
IWSLT 2 30.40%
IWSLT 3 29.25%
NIST 95.74%
Context-dependent modeling helps even for small and single-domain IWSLT
IWSLT is a single-domain task with very short sentences
Even in these conditions, context-dependent phrasal lexicons are helpful:
  there are genuine sense ambiguities, e.g. “turn” vs. “transfer”
  context features are available: 19 observed features per occurrence of a Chinese phrase
The most useful context features are not available in standard SMT
The 3 most useful context feature types are:
  POS tag of the word preceding the target phrase
  POS tag of the word following the target phrase
  bag-of-words context
We use the weights learned by the maximum entropy classifier to determine the most useful features: we normalized the feature weights for each WSD model, then computed the average weight of each feature type
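The normalize-then-average procedure can be sketched as follows. The feature names, weight values, and the use of absolute weights for normalization are all hypothetical; the sketch only shows the shape of the analysis: per-model weight normalization followed by averaging per feature type.

```python
from collections import defaultdict

# Average normalized feature weight per feature type across PSD models.
def avg_weight_by_type(models):
    totals, counts = defaultdict(float), defaultdict(int)
    for weights in models:
        norm = sum(abs(w) for w in weights.values())
        for feat, w in weights.items():
            ftype = feat.split("=")[0]       # e.g. "pos[-1]=V" -> "pos[-1]"
            totals[ftype] += abs(w) / norm
            counts[ftype] += 1
    return {t: totals[t] / counts[t] for t in totals}

models = [
    {"pos[-1]=V": 0.9, "bow=明天": 0.3, "w[1]=课": 0.2},
    {"pos[-1]=N": 0.7, "bow=题目": 0.4, "w[1]=吗": 0.1},
]
avg = avg_weight_by_type(models)
assert avg["pos[-1]"] > avg["bow"] > avg["w[1]"]
```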
Dynamic context-dependent sense predictions are better than static predictions
Context-dependent modeling often helps rank the correct translation first
Even when context-dependent modeling picks the same translation candidate, the WSD scores are more discriminative than the baseline translation probabilities:
  better at overriding incorrect LM predictions
  higher confidence to translate longer input phrases when appropriate
Context-dependent modeling improves phrasal lexical choice examples
Context-dependent modeling prefers longer phrases
Input
Ref. No parliament members voted against him .
SMT Without any congressmen voted against him .
SMT+WSD No congressmen voted against him .
Context-dependent modeling prefers longer phrases
The average length of Chinese phrases used is higher with the context-dependent phrasal lexicon
This confirms that
  context-dependent predictions for all phrases are useful
  context-dependent predictions should be available at all stages of decoding
This explains why using WSD for single words only has a less reliable impact on translation quality, as in Cabezas & Resnik [2005], Carpuat et al. [2006]
Context-dependent lexicons should be phrasal to always help translation
Test set  Experiment     BLEU   NIST   METEOR  METEOR(no syn)  TER    WER    PER    CDER
# 1       SMT            42.21  7.888  65.40   63.24           40.45  45.58  37.80  40.09
          +word lex.     41.94  7.911  65.55   63.52           40.59  45.61  37.75  40.09
          +phrasal lex.  42.38  7.902  65.73   63.64           39.98  45.30  37.60  39.91
# 2       SMT            41.49  8.167  66.25   63.85           40.95  46.42  37.52  40.35
          +word lex.     41.31  8.161  66.23   63.72           41.34  46.82  37.98  40.69
          +phrasal lex.  41.97  8.244  66.35   63.86           40.63  46.14  37.25  40.10
# 3       SMT            49.91  9.016  73.36   70.70           35.60  40.60  32.30  35.46
          +word lex.     49.73  9.017  73.32   70.82           35.72  40.61  32.10  35.30
          +phrasal lex.  51.05  9.142  74.13   71.44           34.68  39.75  31.71  34.58
Context-dependent modeling improves the quality of statistical MT
Presented context-dependent phrasal lexicons for SMT
  leverage WSD techniques for SMT lexical choice
Context-dependent modeling always improves SMT accuracy
  on all tasks: 3 different IWSLT06 datasets, NIST04
  on all 8 common automatic metrics: BLEU, NIST, METEOR, METEOR+synsets, TER, WER, PER, CDER
Why?
  The most useful context features are unavailable to current SMT systems
  Better phrasal segmentation
  Better phrasal lexical choice: more accurate rankings, more discriminative scores
Maxent-based sense disambiguation in Candide [Berger 1996]
No evaluation of the impact on translation quality
  only 2 example sentences; no contrastive evaluation by human judgment nor any automatic metric
  the extension by Garcia Varea et al. does not significantly improve translation quality
Still does not model input-language context; overly simplified context model:
  does not use full sentential context: only 3 words to the left, 3 words to the right
  does not generalize over word identities: only words, no POS tags
  does not generalize to phrasal disambiguation targets: only words
Does not augment the existing SMT model: only replaces the context-independent translation probability