Quality of Machine Translation | Quality Estimation | Open issues | Conclusions
Estimativa da qualidade da tradução automática
(Quality Estimation of Machine Translation)
Lucia Specia
University of Sheffield - l.specia@sheffield.ac.uk
Faculdade de Letras da Universidade do Porto, 13 May 2013
Estimativa da qualidade da traducao automatica 1 / 31
Outline
1 Quality of Machine Translation
2 Quality Estimation
3 Open issues
4 Conclusions
Introduction
Machine Translation:
  Around since the early 1950s
  Increasingly popular since the 1990s: statistical approaches
  Software tools and data available to build translation systems: Moses and others
  Increasing demand for cheaper and faster translations
How do we measure quality and progress over time?
So far... mostly automatic evaluation metrics
MT evaluation metrics
N-gram matching between system output and one or more reference translations: BLEU and many others
Issue 1: too many possible good-quality translations; thousands of references would be needed to capture valid variations
Solution: HyTER (Language Weaver) annotation tool to generate all possible correct translations [DM12]
  Translations built bottom-up from word/phrase translation equivalents using FSAs
  2-2.5 hours of expert annotation per sentence
  One annotator: 5.2 × 10^6 paths
  A group of annotators: 8.5 × 10^11 paths
MT evaluation metrics
Issue 2: difficult to quantify the severity of mismatching n-grams
  ref: Do not buy this product, it's their craziest invention!
  sys: Do buy this product, it's their craziest invention!
Some attempts to weight mismatches differently: sparse, lexicalised approaches
However, the same error is more or less important depending on the user or purpose:
  Severe if the end-user does not speak the source language
  Trivial to post-edit by translators
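Issue 2 can be made concrete with a small sketch (ours, not from the talk): BLEU-style clipped n-gram precision barely notices the dropped "not" in the example above.

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped (modified) n-gram precision of a candidate translation
    against a single reference -- the quantity BLEU is built from."""
    cand, ref_toks = candidate.split(), reference.split()
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref_toks[i:i + n]) for i in range(len(ref_toks) - n + 1))
    matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return matched / total if total else 0.0

# The slide's example: dropping "not" reverses the meaning...
ref = "Do not buy this product, it's their craziest invention!"
hyp = "Do buy this product, it's their craziest invention!"
# ...yet unigram precision is a perfect 1.0 and bigram precision 6/7.
```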
MT evaluation metrics
Conversely:
  ref: The battery lasts 6 hours and it can be fully recharged in 30 minutes.
  sys: Six-hours battery, 30 minutes to full charge last.
OK for gisting - meaning preserved
Very costly for post-editing if style is to be preserved
Task-based evaluation
Measure translation quality within a task. E.g. Autodesk: productivity test through post-editing [Aut11]
  2-day translation and post-editing, 37 participants
  In-house Moses (Autodesk data: software)
  Time spent on each segment
Task-based evaluation
E.g. Intel: user satisfaction with unedited MT
Translation is good if the customer can solve their problem
MT for customer support websites [Int10]:
  Overall customer satisfaction: 75% for English→Chinese
  95% reduction in cost
  Project cycle from 10 days to 1 day
  From 300 to 60,000 words translated/hour
  Customers in China using MT texts were more satisfied with the support than natives using the original texts (68%)!
MT for chat and community forums [Int12]:
  ~60% "understandable and actionable" (→English/Spanish)
  Max ~10% "not understandable" (→Chinese)
Overview
Metrics either depend on references or on the post-editing/use of translations (task-based)
Our proposal: quality assessment without references, prior to the post-editing/use of translations
Overview
Why don't translators use (more) MT?
  Translations are not good enough!
  What about TMs? Aren't fuzzy matches useful?
Framework
Quality estimation (QE): provide an estimate of quality for new translated text *before* it is post-edited
Quality = post-editing effort
No access to reference translations: machine learning techniques to predict post-editing effort scores
Considers interaction with TM systems: only used for low fuzzy-match cases, or to select between TM and MT
QTLaunchPad project: Multidimensional Quality Metrics for MT and HT, for manual and (semi-)automatic evaluation (QE): http://www.qt21.eu/launchpad/
Framework
Schematically: a source text goes into the MT system, which produces a translation; quality indicators extracted from the source and the translation go into the QE system, which outputs a quality score. The QE system is trained on examples: source texts and translations paired with quality scores.
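The training setup can be sketched in a few lines (our illustration: the feature values and the plain linear model are placeholders for the richer indicator sets and learning algorithms discussed later):

```python
import numpy as np

# Toy training examples: one row of quality indicators per
# (source, translation) pair -- here source length, a language-model
# score and an OOV count (all illustrative values) -- with a
# human-assigned post-editing effort score (1-5) as the target.
X = np.array([
    [20, 0.8, 0],
    [35, 0.4, 3],
    [12, 0.9, 0],
    [40, 0.3, 5],
], dtype=float)
y = np.array([4.5, 2.0, 5.0, 1.5])

# Fit a linear predictor (with bias term) by least squares, standing in
# for the wide range of learning algorithms used in practice.
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_effort(indicators):
    """Estimate a post-editing effort score for a new translation."""
    return float(np.append(indicators, 1.0) @ w)
```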
Examples of positive results
Time to post-edit a subset of sentences predicted as "good" (low effort) vs time to post-edit a random subset of sentences:

Language   no QE            QE
fr-en      0.75 words/sec   1.09 words/sec
en-es      0.32 words/sec   0.57 words/sec

Accuracy in selecting the best translation among 4 MT systems:

Best MT system   Highest QE score
54%              77%
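The system-selection use above reduces to taking the candidate with the highest estimated score; a trivial sketch (names ours):

```python
def pick_best(candidates, qe_scores):
    """Return the candidate translation with the highest QE score,
    as in the 4-system selection experiment."""
    best_score, best = max(zip(qe_scores, candidates))
    return best
```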
State-of-the-art
Quality indicators, by where they come from:
  Complexity indicators - the source text
  Confidence indicators - the MT system
  Fluency indicators - the translation
  Adequacy indicators - the source-translation pair
Learning algorithms: a wide range
Datasets: few with absolute human scores (1-4/5 scores, PE time, edit distance)
State-of-the-art indicators
Shallow indicators:
  (S/T/S-T) Sentence length
  (S/T) Language model
  (S/T) Type-token ratio
  (S) Average number of possible translations per word
  (S) % of n-grams belonging to different frequency quartiles of a source-language corpus
  (T) Untranslated/OOV words
  (T) Mismatching brackets, quotation marks
  (S-T) Preservation of punctuation
  (S-T) Word alignment score, etc.
These do well for estimating post-editing effort...
...but are not enough for other aspects of quality, e.g. adequacy
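A few of these shallow indicators are easy to sketch (feature names ours; not the exact feature set behind the published results):

```python
def shallow_indicators(source, translation):
    """Illustrative shallow QE indicators for one sentence pair."""
    src, tgt = source.split(), translation.split()
    return {
        "src_length": len(src),                                # (S) sentence length
        "tgt_length": len(tgt),                                # (T) sentence length
        "src_type_token_ratio": len(set(src)) / len(src) if src else 0.0,
        # (T) mismatching brackets
        "brackets_balanced": translation.count("(") == translation.count(")"),
        # (S-T) preservation of punctuation
        "punct_preserved": all(source.count(p) == translation.count(p)
                               for p in ",.;:!?"),
    }
```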
State-of-the-art indicators
Linguistic indicators - count-based:
(S/T/S-T) Content/non-content words
(S/T/S-T) Nouns/verbs/... NP/VP/...
(S/T/S-T) Deictics (references)
(S/T/S-T) Discourse markers (references)
(S/T/S-T) Named entities
(S/T/S-T) Zero-subjects
(S/T/S-T) Pronominal subjects
(S/T/S-T) Negation indicators
(T) Subject-verb / adjective-noun agreement
(T) Language Model of POS
(T) Grammar checking (dangling words)
(T) Coherence
State-of-the-art indicators
Linguistic indicators - alignment-based:
(S-T) Correct translation of pronouns
(S-T) Matching of dependency relations
(S-T) Matching of named entities
(S-T) Alignment of parse trees
(S-T) Alignment of predicates & arguments, etc.
Some indicators are language-dependent; others need language-dependent resources but apply to most languages, e.g. an LM of POS tags
State-of-the-art indicators
Fine-grained, lexicalised indicators, e.g. binary features such as:

  f(target-word = "process") = 1 if source-word = "hdhh alamlyt", 0 otherwise
  f(target-word = "process") = 1 if source-pos = "DT DTNN", 0 otherwise
Closer to error detection
Need large amounts of training data [BHAO11], or rule-based approaches
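Templates like these expand into large sparse binary feature vectors; a sketch of the expansion (representation ours):

```python
def lexicalised_features(aligned_pairs):
    """Binary indicator features over word-aligned (source, target)
    pairs: each observed pair switches on one sparse feature."""
    return {f"tgt={tgt}|src={src}": 1 for src, tgt in aligned_pairs}
```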
Do these indicators work?
To some extent... Issues:
Representation of shallow/deep indicators: counts, ratios, (absolute) differences?
  F = S − T,   F = |S − T|,   F = T/S,   F = (S − T)/S, ...
Resources to extract deep indicators: availability and reliability
Data to extract fine-grained indicators: need previously translated and post-edited data, esp. for negative examples
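The representation question can be stated concretely: given a source-side value S and a target-side value T (e.g. sentence lengths), any of these encodings may be fed to the learner (sketch ours):

```python
def representations(s, t):
    """Alternative encodings of a paired source/target indicator."""
    return {
        "difference": s - t,             # F = S - T
        "abs_difference": abs(s - t),    # F = |S - T|
        "ratio": t / s,                  # F = T / S
        "normalised_diff": (s - t) / s,  # F = (S - T) / S
    }
```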
Manual scoring: agreement between translators
Absolute value judgements: difficult to achieve consistency across annotators, even in a highly controlled setup
en-es news WMT12 dataset: 3 professional translators, 1-5 scores
  15% of the initial dataset discarded: annotators disagreed by more than one category
  Remaining annotations had to be scaled (0.33, 0.17, 0.50)
Manual scoring: Agreement between translators
en-pt subtitles of a TV series: 3 non-professional annotators, 1-4 scores
  351 cases (41%): full agreement
  445 cases (52%): partial agreement
  54 cases (7%): null agreement
Agreement by score:

Score   Full agreement
4       59%
3       35%
2       23%
1       50%
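The three agreement categories can be computed per segment as follows (a sketch; category names from the slide):

```python
def agreement_category(scores):
    """Classify one segment's annotator scores: 'full' if all agree,
    'null' if all differ, 'partial' otherwise."""
    distinct = len(set(scores))
    if distinct == 1:
        return "full"
    if distinct == len(scores):
        return "null"
    return "partial"
```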
More objective ways of annotating translations
HTER: edit distance between the MT output and its minimally post-edited version

  HTER = #edits / #words in the post-edited version

Edits: substitute, delete, insert, shift
Analysis by Maarit Koponen (WMT-12) on post-edited translations with HTER and 1-5 scores:
  A number of cases where translations with low HTER (few edits) were assigned low quality scores (high post-editing effort), and vice-versa
  Certain edits seem to require more cognitive effort than others - not captured by HTER
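HTER can be sketched with a plain word-level edit distance (note: the real metric, computed with the TER tool, also allows block shifts; this simplification omits them):

```python
def hter(mt_tokens, pe_tokens):
    """Simplified HTER: Levenshtein distance (substitutions, insertions,
    deletions) between the MT output and its post-edited version,
    divided by the length of the post-edited version."""
    m, n = len(mt_tokens), len(pe_tokens)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if mt_tokens[i - 1] == pe_tokens[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute
    return d[m][n] / n
```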
More objective ways of annotating translations
TIME: varies considerably across translators (expected)
[Charts: post-editing time per segment (segments 1-20) for annotators A1-A8, shown both in raw seconds and in seconds per word]
Can we normalise this variation?
A dedicated QE system for each translator?
More objective ways of annotating translations
Time, HTER, Keystrokes: data from 8 post-editors
PET: http://pers-www.wlv.ac.uk/~in1676/pet/
How to use estimated PE effort scores?
Should (supposedly) bad-quality translations be filtered out, or shown to translators with different scores/colour codes, as in TMs?
  Wasting time reading scores and translations vs wasting "gisting" information
How to define a threshold on the estimated translation quality to decide what should be filtered out?
  Translator dependent
  Task dependent (SDL)
Do translators prefer detailed estimates (sub-sentence level) or an overall estimate for the complete sentence?
  Too much information vs hard-to-interpret scores
Outline
1 Quality of Machine Translation
2 Quality Estimation
3 Open issues
4 Conclusions
Conclusions
It is possible to estimate at least certain aspects of MT quality, especially w.r.t. PE effort: QuEst http://quest.dcs.shef.ac.uk/
PE effort estimates can be used in real applications:
Ranking translations: filter out bad quality translations
Selecting translations from multiple MT systems
Commercial products by SDL (document-level, for gisting) and Multilizer
A number of open issues to be investigated...
Collaboration with "human translators" is essential
My vision
Sub-sentence level QE (error detection), highlighting errors but also giving an overall estimate for the sentence