Estimating machine translation quality · State-of-the-art systems and open issues Lucia Specia...
Quality Estimation Shared Task Open issues Conclusions
Estimating machine translation quality
State-of-the-art systems and open issues
Lucia Specia
University of Sheffield
l.specia@sheffield.ac.uk
6 September 2012
Estimating machine translation quality 1 / 46
Outline
1 Quality Estimation
2 Shared Task
3 Open issues
4 Conclusions
Overview
Quality estimation (QE): metrics that provide an estimate of the quality of unseen translated texts
Quality = Can we publish it as is?
Quality = Can a reader get the gist?
Quality = Is it worth post-editing it?
Quality = How much effort to fix it?
Framework
Source text → MT system → Translation → QE system → Quality score
QE system built from: examples (source & translations, quality scores) and quality indicators
No access to reference translations: supervised machine learning techniques to predict quality scores
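The framework above can be illustrated with a minimal sketch: quality indicators are extracted from a source sentence and its translation, and a supervised learner trained on scored examples predicts a score for an unseen translation. Everything here is an assumption made for illustration: the two toy indicators, the k-nearest-neighbour regressor (the slides do not name a specific learner at this point), the function names, and the three invented training examples.

```python
from math import dist


def extract_indicators(source, translation):
    """Toy quality indicators (hypothetical, for illustration only)."""
    src_tokens = source.split()
    tgt_tokens = translation.split()
    return [
        float(len(src_tokens)),                     # source length in tokens
        len(tgt_tokens) / max(len(src_tokens), 1),  # target/source length ratio
    ]


def predict_quality(train, source, translation, k=3):
    """Average the scores of the k training examples nearest in indicator space.

    `train` is a list of (indicator_vector, quality_score) pairs.
    No reference translation is used at prediction time.
    """
    if not train:
        raise ValueError("training set must not be empty")
    query = extract_indicators(source, translation)
    neighbours = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    return sum(score for _, score in neighbours) / len(neighbours)


# Invented examples: (source, MT output, human quality score).
examples = [
    ("the house is small", "la casa es pequeña", 5.0),
    ("he did not go", "él no fue", 4.0),
    ("this is a long and rather complicated sentence", "esto largo", 1.0),
]
train = [(extract_indicators(s, t), y) for s, t, y in examples]

score = predict_quality(train, "the cat is black", "el gato es negro", k=2)
print(score)
```

Real systems use far richer indicators and learners; the sketch only shows the data flow from examples and indicators to a predicted score.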
Background
Also called confidence estimation, started in 2002/3
Inspired by confidence scores in ASR: word posterior probabilities
JHU Workshop in 2003
  Estimate BLEU/NIST/WER: difficult to interpret
  A “hard to beat” baseline: MT is always bad
  Poor results, no use in applications
New surge in interest from 2008/9
  Better MT systems ✓
  MT used in translation industry ✓
  Estimate more interpretable metrics: post-editing (PE) effort (human scores, time, % edits to fix)
  Some positive results
Some positive results
Time to post-edit subset of sentences predicted as “low PE effort” vs time to post-edit random subset of sentences [Spe11]

| Language | no QE | QE |
|---|---|---|
| fr-en | 0.75 words/sec | 1.09 words/sec |
| en-es | 0.32 words/sec | 0.57 words/sec |

Accuracy in selecting best translation among 4 MT systems [SRT10]

| Best MT system | Highest QE score |
|---|---|
| 54% | 77% |
Current approaches
Quality indicators (diagram over Source text → MT system → Translation):
  Confidence indicators
  Complexity indicators
  Fluency indicators
  Adequacy indicators
Learning algorithms: range of regression, classification, ranking algorithms
Datasets: few with absolute human scores (1-4 scores, PE time, edit distance), WMT data with relative scores
Objectives
WMT-12 – joint work with Radu Soricut (Google)
First common ground for development and comparison of QE systems, focusing on sentence-level estimation of PE effort:
  Identify (new) effective features
  Identify most suitable machine learning techniques
  Test (new) automatic evaluation metrics
  Establish the state of the art performance in the field
  Contrast regression and ranking techniques
Datasets
English → Spanish
English source sentences
Spanish MT outputs (PBSMT Moses)
Post-edited output by 1 professional translator
Effort scores by 3 professional translators, scale 1-5, averaged
Human Spanish translation (original references)
# Instances
Training: 1832
Blind test: 422
Datasets
Annotation guidelines
3 human judges for PE effort assigning 1-5 scores for 〈source, MT output, PE output〉
[1] The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited, needs to be translated from scratch.
[2] About 50-70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.
[3] About 25-50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.
[4] About 10-25% of the MT output needs to be edited. It is generally clear and intelligible.
[5] The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.
Resources provided
SMT resources for training and test sets:
SMT training corpus (Europarl and News-documentaries)
LMs: 5-gram LM; 3-gram LM and 1-3-gram counts
IBM Model 1 table (Giza)
Word-alignment file as produced by grow-diag-final
Phrase table with word alignment information
Moses configuration file used for decoding
Moses run-time log: model component values, word graph, etc.
Resources provided
Two sub-tasks:
Scoring: predict a score in [1-5] for each test instance
Ranking: sort all test instances best-worst
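One simple way to relate the two sub-tasks is to derive a ranking from predicted scores by sorting. This is only an illustrative assumption: a ranking could equally be produced by a dedicated ranking model, and the function name and sample scores below are hypothetical.

```python
def rank_instances(predicted_scores):
    """Return instance indices ordered best to worst by predicted score.

    Ties keep their original relative order because Python's sort is stable.
    """
    return sorted(
        range(len(predicted_scores)),
        key=lambda i: predicted_scores[i],
        reverse=True,
    )


# Invented predicted scores in [1, 5] for four test instances.
ranking = rank_instances([3.2, 4.8, 1.5, 4.1])
print(ranking)
```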
Evaluation metrics
Scoring metrics: standard MAE and RMSE

$$\mathrm{MAE} = \frac{\sum_{i=1}^{N} |H(s_i) - V(s_i)|}{N}$$

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{N} (H(s_i) - V(s_i))^2}{N}}$$

N = |S|
H(s_i) is the predicted score for s_i
V(s_i) is the human score for s_i
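The two scoring metrics can be computed directly from paired lists of predicted and human scores. The following is a minimal sketch; the function names and the four sample scores are illustrative assumptions, not data from the shared task.

```python
from math import sqrt


def mae(predicted, human):
    """Mean absolute error between predicted scores H(s_i) and human scores V(s_i)."""
    if len(predicted) != len(human) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(abs(h - v) for h, v in zip(predicted, human)) / len(predicted)


def rmse(predicted, human):
    """Root mean squared error between predicted and human scores."""
    if len(predicted) != len(human) or not predicted:
        raise ValueError("score lists must be non-empty and of equal length")
    return sqrt(sum((h - v) ** 2 for h, v in zip(predicted, human)) / len(predicted))


# Invented scores on the 1-5 scale.
predicted = [3.0, 4.0, 2.0, 5.0]
human = [2.0, 4.0, 4.0, 5.0]

print(mae(predicted, human))
print(rmse(predicted, human))
```

RMSE penalises large individual errors more heavily than MAE, which is why the two are usually reported together.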
Evaluation metrics
Ranking metrics: Spearman’s rank correlation and a new metric, DeltaAvg
For S_1, S_2, ..., S_n quantiles:
$$\mathrm{DeltaAvg}_V[n] = \frac{\sum_{k=1}^{n-1} V(S_{1,k})}{n - 1} - V(S)$$
S_{1,k}: the union of the top k quantiles, S_1 to S_k
V(S): extrinsic function measuring the “quality” of set S
Here: average human scores (1-5) of set S
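A minimal sketch of DeltaAvg[n] in Python, assuming the human scores are already listed in the order produced by the system's ranking (best first) and that V is the plain average. The function name, the equal-size quantile split with integer division, and the toy data are assumptions made for illustration.

```python
def delta_avg_n(human_scores_ranked, n):
    """DeltaAvg[n]: mean quality of the top-k quantile unions minus overall quality."""
    size = len(human_scores_ranked)
    quantile = size // n  # assumed quantile size; any remainder stays in the tail
    if n < 2 or quantile == 0:
        raise ValueError("need 2 <= n <= number of instances")
    overall = sum(human_scores_ranked) / size  # V(S)
    head_means = []
    for k in range(1, n):
        head = human_scores_ranked[: k * quantile]  # S_{1,k}
        head_means.append(sum(head) / len(head))    # V(S_{1,k})
    return sum(head_means) / (n - 1) - overall

# Hypothetical human scores, in the order a QE system ranked the instances
ranked = [5, 3, 4, 2, 1, 3]
print(delta_avg_n(ranked, 2), delta_avg_n(ranked, 3))
```

Only the order of the instances matters here, so the QE scores used for ranking can be on any scale.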
Evaluation metrics
DeltaAvg
Example 1: n=2, quantiles S1, S2
$$\mathrm{DeltaAvg}[2] = V(S_1) - V(S)$$
“Quality of the top half compared to the overall quality”
Average human scores of top half compared to average human scores of complete set
Figure: instances with human scores 5, 4, 3, 2, 1; average human score: 3
DeltaAvg[2] (N = 2):
Random = [3 - 3] = 0
QE = [3.8 - 3] = 0.8
Oracle = [4.2 - 3] = 1.2
Lowerb = [1.8 - 3] = -1.2
QE: average “human” score of top 50% selected after ranking based on QE score. QE score can be on any scale...
Evaluation metrics
DeltaAvg
Example 2: n=3, quantiles S1, S2, S3
$$\mathrm{DeltaAvg}[3] = \frac{(V(S_1) - V(S)) + (V(S_{1,2}) - V(S))}{2}$$
Average human scores of top third compared to average human scores of complete set; average human scores of top two thirds compared to average human scores of complete set, averaged
Figure: instances with human scores 5, 4, 3, 2, 1; average human score: 3
DeltaAvg[5] (N = 5):
Random = [3 - 3] = 0
Oracle_1 = [5 - 3] = 2
Lowerb_1 = [1 - 3] = -2
...
QE_1 = [4.1 - 3] = 1.1
QE_{1,2} = [3.9 - 3] = 0.9
QE_{1,2,3} = [3.5 - 3] = 0.5
QE_{1,2,3,4} = [3.3 - 3] = 0.3
DeltaAvg[5] = (1.1 + 0.9 + 0.5 + 0.3)/4 = 0.7
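The arithmetic of this example can be checked in a few lines of Python. The four averages are the values shown on the slide; the variable names are illustrative assumptions.

```python
overall = 3.0  # average human score of the complete set

# Average human scores of the top 1, 2, 3 and 4 quantiles after ranking by QE score
qe_head_means = [4.1, 3.9, 3.5, 3.3]

# Each term is V(S_{1,k}) - V(S); DeltaAvg[5] is their mean over k = 1..4
deltas = [head - overall for head in qe_head_means]
delta_avg_5 = sum(deltas) / len(deltas)
print(round(delta_avg_5, 2))
```

The gains shrink as more quantiles are included, since the top-k union approaches the full set and its average approaches 3.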
Evaluation metrics
Final DeltaAvg metric
$$\mathrm{DeltaAvg}_V = \frac{\sum_{n=2}^{N} \mathrm{DeltaAvg}_V[n]}{N - 1}$$
where N = |S|/2
Average DeltaAvg[n] for all n, 2 ≤ n ≤ |S|/2
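A sketch of the final metric in Python, averaging DeltaAvg[n] over all n from 2 to |S|/2. It assumes human scores listed in system-ranked order (best first), V as the plain average, and equal-size quantiles obtained by integer division; the function names and toy data are hypothetical.

```python
def delta_avg_n(human_scores_ranked, n):
    """DeltaAvg[n] for human scores listed in system-ranked order."""
    size = len(human_scores_ranked)
    quantile = size // n  # assumed quantile size
    if n < 2 or quantile == 0:
        raise ValueError("need 2 <= n <= number of instances")
    overall = sum(human_scores_ranked) / size
    head_means = [
        sum(human_scores_ranked[: k * quantile]) / (k * quantile)
        for k in range(1, n)
    ]
    return sum(head_means) / (n - 1) - overall

def delta_avg(human_scores_ranked):
    """Final DeltaAvg: mean of DeltaAvg[n] for n = 2 .. N, with N = |S| / 2."""
    big_n = len(human_scores_ranked) // 2
    if big_n < 2:
        raise ValueError("need at least four instances")
    values = [delta_avg_n(human_scores_ranked, n) for n in range(2, big_n + 1)]
    return sum(values) / (big_n - 1)

# Hypothetical human scores in QE-ranked order, and the same scores in oracle order
ranked = [5, 3, 4, 2, 1, 3]
oracle = sorted(ranked, reverse=True)
print(delta_avg(ranked), delta_avg(oracle))
```

Sorting by the human scores themselves gives the oracle upper bound for the set, so any system ranking scores at or below it.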
Participants
| ID | Participating team |
|---|---|
| PRHLT-UPV | Universitat Politecnica de Valencia, Spain |
| UU | Uppsala University, Sweden |
| SDLLW | SDL Language Weaver, USA |
| Loria | LORIA Institute, France |
| UPC | Universitat Politecnica de Catalunya, Spain |
| DFKI | DFKI, Germany |
| WLV-SHEF | Univ of Wolverhampton & Univ of Sheffield, UK |
| SJTU | Shanghai Jiao Tong University, China |
| DCU-SYMC | Dublin City University, Ireland & Symantec, Ireland |
| UEdin | University of Edinburgh, UK |
| TCD | Trinity College Dublin, Ireland |

One or two systems per team, most teams submitting for ranking and scoring sub-tasks
Baseline system
Feature extraction software – system-independent features:
number of tokens in the source and target sentences
average source token length
average number of occurrences of words in the target
number of punctuation marks in source and target sentences
LM probability of source and target sentences
average number of translations per source word
% of source 1-grams, 2-grams and 3-grams in frequency quartiles 1 and 4
% of seen source unigrams
SVM regression with RBF kernel with the parameters γ, ε and C optimized using a grid-search and 5-fold cross validation on the training set
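To make the surface features concrete, here is a minimal Python sketch of the ones that need no external resources. It is not the baseline feature extraction software: whitespace tokenisation, the feature names, and reading "average number of occurrences of words in the target" as tokens divided by distinct tokens are all assumptions. The LM, translation-table and n-gram frequency features are omitted because they depend on the SMT resources.

```python
import string

PUNCTUATION = set(string.punctuation)

def surface_features(source, target):
    """Resource-free features for one (source, MT output) sentence pair."""
    src = source.split()  # assumed whitespace tokenisation
    tgt = target.split()
    return {
        "src_tokens": len(src),
        "tgt_tokens": len(tgt),
        "avg_src_token_len": sum(len(t) for t in src) / len(src) if src else 0.0,
        # tokens per distinct token: how often each target word occurs on average
        "avg_tgt_word_occurrences": len(tgt) / len(set(tgt)) if tgt else 0.0,
        "src_punct": sum(1 for ch in source if ch in PUNCTUATION),
        "tgt_punct": sum(1 for ch in target if ch in PUNCTUATION),
    }

# Hypothetical sentence pair
features = surface_features("the cat sat , the end .", "le chat , le chat .")
print(features)
```

A feature dictionary like this would then be turned into a vector and passed to the regression model described above.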
Results - ranking sub-task
| System ID | DeltaAvg | Spearman Corr |
|---|---|---|
| • SDLLW M5PbestDeltaAvg | 0.63 | 0.64 |
| • SDLLW SVM | 0.61 | 0.60 |
| UU bltk | 0.58 | 0.61 |
| UU best | 0.56 | 0.62 |
| TCD M5P-resources-only* | 0.56 | 0.56 |
| Baseline (17FFs SVM) | 0.55 | 0.58 |
| PRHLT-UPV | 0.55 | 0.55 |
| UEdin | 0.54 | 0.58 |
| SJTU | 0.53 | 0.53 |
| WLV-SHEF FS | 0.51 | 0.52 |
| WLV-SHEF BL | 0.50 | 0.49 |
| DFKI morphPOSibm1LM | 0.46 | 0.46 |
| DCU-SYMC unconstrained | 0.44 | 0.41 |
| DCU-SYMC constrained | 0.43 | 0.41 |
| TCD M5P-all* | 0.42 | 0.41 |
| UPC 1 | 0.22 | 0.26 |
| UPC 2 | 0.15 | 0.19 |

• = winning submissions
gray area = not different from baseline
* = bug-fix was applied after the submission
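For reference, the Spearman column can be reproduced with a short Python sketch: rank both score lists (average ranks for ties), then take the Pearson correlation of the ranks. The function names and toy inputs are illustrative assumptions; this is not the task's official scoring script.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical predicted vs human scores
rho = spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(rho)
```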
Results - ranking sub-task
Oracle methods: associate various metrics in an oracle manner to the test input:
Oracle Effort: the gold-label Effort
Oracle HTER: the HTER metric against the post-edited translations as reference

| System ID | DeltaAvg | Spearman Corr |
|---|---|---|
| Oracle Effort | 0.95 | 1.00 |
| Oracle HTER | 0.77 | 0.70 |
Results - scoring sub-task
| System ID | MAE | RMSE |
|---|---|---|
| • SDLLW M5PbestDeltaAvg | 0.61 | 0.75 |
| UU best | 0.64 | 0.79 |
| SDLLW SVM | 0.64 | 0.78 |
| UU bltk | 0.64 | 0.79 |
| Loria SVMlinear | 0.68 | 0.82 |
| UEdin | 0.68 | 0.82 |
| TCD M5P-resources-only* | 0.68 | 0.82 |
| Baseline (17FFs SVM) | 0.69 | 0.82 |
| Loria SVMrbf | 0.69 | 0.83 |
| SJTU | 0.69 | 0.83 |
| WLV-SHEF FS | 0.69 | 0.85 |
| PRHLT-UPV | 0.70 | 0.85 |
| WLV-SHEF BL | 0.72 | 0.86 |
| DCU-SYMC unconstrained | 0.75 | 0.97 |
| DFKI grcfs-mars | 0.82 | 0.98 |
| DFKI cfs-plsreg | 0.82 | 0.99 |
| UPC 1 | 0.84 | 1.01 |
| DCU-SYMC constrained | 0.86 | 1.12 |
| UPC 2 | 0.87 | 1.04 |
| TCD M5P-all | 2.09 | 2.32 |
Discussion
New and effective quality indicators (features)
Most participating systems use external resources: parsers, POS taggers, NER, etc. → variety of features
Many tried to exploit linguistically-oriented features
none or modest improvements (e.g. WLV-SHEF)
high performance (e.g. “UU” with parse trees)
Good features:
confidence: model components from SMT decoder
pseudo-reference: agreement between 2 SMT systems
fuzzy-match like: source (and target) similarity with SMT training corpus (LM, etc)
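As an illustration of the pseudo-reference idea, the sketch below measures agreement between the MT output under evaluation and the output of a second system as clipped n-gram precision. The slides do not say how participants computed agreement, so the overlap measure, the function names, and the whitespace tokenisation are all assumptions.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def pseudo_reference_overlap(hypothesis, pseudo_reference, n=1):
    """Share of hypothesis n-grams also found in the second system's output."""
    hyp = ngram_counts(hypothesis.split(), n)
    ref = ngram_counts(pseudo_reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's credit at its count in the pseudo-reference
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total

# Hypothetical outputs of two MT systems for the same source sentence
system_a = "the cat sat on the mat"
system_b = "the cat is on the mat"
print(pseudo_reference_overlap(system_a, system_b, n=1))
print(pseudo_reference_overlap(system_a, system_b, n=2))
```

High agreement between two independent systems is taken as a sign that the sentence was easy to translate, which is why such a score can serve as a quality feature.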
Discussion
Machine Learning techniques
Best performing: Regression Trees (M5P) and SVR
M5P Regression Trees: compact models, less overfitting, “readable”
SVRs: easily overfit with small training data and large feature set
Feature selection crucial in this setup
Structured learning techniques: “UU” submissions (tree kernels)
Discussion
Evaluation metrics
DeltaAvg → suitable for the ranking task
automatic and deterministic (and therefore consistent)
Extrinsic interpretability
Versatile: valuation function V can change, N can change
High correlation with Spearman, but less strict
MAE, RMSE → difficult task, values stubbornly high
Regression vs ranking
Most submissions: regression results to infer ranking
Ranking approach is simpler, directly useful in many applications
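Inferring a ranking from regression output amounts to sorting the test instances by predicted score. A minimal Python sketch, assuming higher scores mean better translations and that ties keep their original order (both assumptions, as is the function name):

```python
def ranking_from_scores(predicted):
    """Return the rank of each instance (1 = best) given predicted scores."""
    # sorted() is stable, so tied scores keep their original relative order
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    ranks = [0] * len(predicted)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks

# Hypothetical regression outputs on the 1-5 scale
predicted = [3.2, 4.7, 1.5, 4.1]
print(ranking_from_scores(predicted))
```

Because only the order is used, a regressor with a large MAE can still produce a useful ranking as long as its errors preserve the relative order of the instances.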
Discussion
Establish state-of-the-art performance
“Baseline” - hard to beat, previous state-of-the-art
Metrics, data sets, and performance points available
Known values for oracle-based upperbounds
Good resource to further investigate: best features & best algorithms
Follow up
Feature sets available
11 systems, 1515 features (some overlap) of various types, from 6 to 497 features per system
http://www.dcs.shef.ac.uk/~lucia/resources/feature_sets_all_participants.tar.gz
3 Open issues
Agreement between translators
Absolute value judgements: difficult to achieve consistency across annotators even in highly controlled setup
30% of initial dataset discarded: annotators disagreed by more than one category
Too subjective?
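The discard rule can be sketched in Python as a filter on the spread of each item's annotations. The data layout, the function name, and the use of a plain mean as the final label for kept items are assumptions made for illustration; the slides state only the "more than one category" criterion.

```python
def filter_by_agreement(annotations, max_gap=1):
    """Keep items whose annotator scores differ by at most max_gap categories."""
    kept = {}
    for item_id, scores in annotations.items():
        if max(scores) - min(scores) <= max_gap:
            # Assumed aggregation: plain mean of the annotators' scores
            kept[item_id] = sum(scores) / len(scores)
    return kept

# Hypothetical 1-5 scores from three annotators per sentence
annotations = {
    "s1": [3, 4, 3],
    "s2": [1, 4, 2],  # spread of 3 categories: discarded
    "s3": [5, 5, 4],
}
kept = filter_by_agreement(annotations)
print(sorted(kept))
```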
More objective ways of generating absolute scores
TIME: varies considerably across translators (expected). E.g.: seconds per word
Can we normalise this variation?
A dedicated QE system for each translator?
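One possible way to normalise the variation, sketched under the assumption that seconds-per-word values are grouped by translator: z-normalise within each translator, so that values express "slow or fast for this person". This is an illustration of the question raised above, not a method proposed in the talk; all names and numbers are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def normalise_per_translator(times: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Z-normalise seconds-per-word values within each translator."""
    normalised: Dict[str, List[float]] = {}
    for translator, values in times.items():
        mu = mean(values)
        sigma = pstdev(values)  # population standard deviation
        if sigma == 0:
            # A translator with constant times carries no relative information.
            normalised[translator] = [0.0 for _ in values]
        else:
            normalised[translator] = [(v - mu) / sigma for v in values]
    return normalised


# Toy data (hypothetical): the second translator is four times slower overall.
seconds_per_word = {
    "fast_translator": [1.0, 2.0, 3.0],
    "slow_translator": [4.0, 8.0, 12.0],
}
z = normalise_per_translator(seconds_per_word)
```

After normalisation both translators have the same relative profile, although their raw times differ by a factor of four.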
HTER: Edit distance between MT output and its minimally post-edited version
$$\mathrm{HTER} = \frac{\#\,\text{edits}}{\#\,\text{words in post-edited version}}$$
Edits: substitute, delete, insert, shift
Analysis by Maarit Koponen (WMT-12) on post-edited translations with HTER and 1-5 scores
Translations with low HTER (few edits) & low quality scores (high post-editing effort), and vice-versa
Certain edits seem to require more cognitive effort than others - not captured by HTER
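A simplified sketch of the HTER ratio: word-level edit distance divided by the length of the post-edited version. It counts substitutions, deletions and insertions only; shifts, which full (H)TER also handles, are deliberately left out, so this is an approximation. Function names and the example sentences are hypothetical.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Levenshtein distance over words (substitute, delete, insert; no shifts)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a word from the MT output
                            curr[j - 1] + 1,      # insert a word from the post-edit
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_hter(mt_output: str, post_edited: str) -> float:
    """Approximate HTER = #edits / #words in the post-edited version."""
    hyp, ref = mt_output.split(), post_edited.split()
    if not ref:
        raise ValueError("post-edited version must contain at least one word")
    return word_edit_distance(hyp, ref) / len(ref)


# One missing word, six words in the post-edited version.
score = approx_hter("the cat sat on mat", "the cat sat on the mat")
```

As the slide notes, a low ratio like this says nothing about how hard the single edit was to find, which is the "cognitive effort" gap.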
Keystrokes: different PE strategies - data from 8 translators (joint work with Maarit Koponen and Wilker Aziz):
PET: http://pers-www.wlv.ac.uk/~in1676/pet/
Use of relative scores
Ranking of translations: Suitable if the final application is to compare alternative translations of same source sentence
N-best list re-ranking
System combination
MT system evaluation
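The n-best re-ranking use case can be sketched as follows, assuming some function that returns a predicted quality score per candidate (higher is better). The lookup table below is a hypothetical stand-in for a trained QE model.

```python
from typing import Callable, List


def rerank_nbest(
    candidates: List[str], predict_quality: Callable[[str], float]
) -> List[str]:
    """Return candidates sorted from highest to lowest predicted quality."""
    return sorted(candidates, key=predict_quality, reverse=True)


# Hypothetical predicted scores standing in for a trained QE model.
toy_scores = {
    "the house is small": 0.9,
    "the house is little": 0.7,
    "house the small is": 0.2,
}
nbest = ["house the small is", "the house is little", "the house is small"]
ranked = rerank_nbest(nbest, lambda text: toy_scores[text])
best = ranked[0]
```

Only the relative order of the scores matters here, which is why relative scores suit this application.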
Source text fuzzy match score
Why do translators use (and trust) TMs?
Why can’t we do the same for MT?
E.g. Xplanation Group
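For illustration, a source-side fuzzy match score can be approximated with a word-level similarity ratio from Python's standard `difflib`. Commercial TM tools use their own formulas, so this is only an assumed stand-in; the function names and the tiny translation memory are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def fuzzy_match(source: str, tm_source: str) -> float:
    """Word-level similarity in [0, 1] between a new source and a TM source segment."""
    return SequenceMatcher(None, source.split(), tm_source.split()).ratio()


def best_fuzzy_match(source: str, tm_sources: List[str]) -> float:
    """Highest fuzzy match score of `source` against any TM source segment."""
    return max((fuzzy_match(source, tm) for tm in tm_sources), default=0.0)


# Hypothetical translation memory with two source segments.
tm = ["press the green button", "open the front cover"]
score = best_fuzzy_match("press the red button", tm)
```

With three of four words matching in order, the ratio for the first TM segment is 2 * 3 / (4 + 4) = 0.75.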
What is the best metric to estimate PE effort?
Effort scores are subjective
Effort/HTER seem to lack “cognitive load”
Time varies too much across post-editors
Keystrokes seem to capture PE strategies, but do not correlate well with PE effort
Source fuzzy match score: as reliable as with TMs?
How to use estimated PE effort scores?
Should (supposedly) bad quality translations be filtered out or shown to translators (different scores/colour codes as in TMs)?
Wasting time to read scores and translations vs wasting “gisting” information
How to define a threshold on the estimated translation quality to decide what should be filtered out?
Translator dependent
Task dependent
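A sketch of translator-dependent filtering, assuming each segment comes with an estimated quality score where higher means better. The threshold values, translator names and segments are hypothetical; choosing the thresholds is exactly the open question raised above.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, float]  # (MT output, estimated quality; higher = better)


def split_by_threshold(
    segments: List[Segment], threshold: float
) -> Tuple[List[str], List[str]]:
    """Split segments into (shown to translator, filtered out) by estimated quality."""
    shown = [text for text, score in segments if score >= threshold]
    filtered = [text for text, score in segments if score < threshold]
    return shown, filtered


# Hypothetical per-translator thresholds: translator_b tolerates less MT noise.
thresholds: Dict[str, float] = {"translator_a": 0.5, "translator_b": 0.8}
segments = [("seg 1", 0.9), ("seg 2", 0.6), ("seg 3", 0.3)]

shown_a, filtered_a = split_by_threshold(segments, thresholds["translator_a"])
shown_b, filtered_b = split_by_threshold(segments, thresholds["translator_b"])
```

The same estimates yield different work queues per translator; a task-dependent setup would key the thresholds by task instead.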
Do translators prefer detailed estimates (sub-sentence level) or an overall estimate for the complete sentence?
Too much information vs hard-to-interpret scores
Quality estimation vs error detection
IBM’s Goodness metric: classifier with sparse binary features (word/phrase pairs, etc.)
Do we really need QE?
Can’t we simply add some good features to SMT models?
Yes, especially if doing sub-sentence QE/error detection
But not all:
Some linguistically-motivated features can be difficult/expensive: matching of semantic roles
Global features are difficult/impossible, e.g.: coherence given previous n sentences
Conclusions
It is possible to estimate at least certain aspects of translation quality in terms of PE effort
PE effort estimates can be used in real applications
Ranking translations: filter out bad quality translations
Selecting translations from multiple MT systems
Commercial interest
SDL LW: TrustScore
Multilizer: MT-Qualifier
A number of open issues to be investigated...
What we need
Simple, cheap metric like BLEU/fuzzy match level in TMs
Journal of MT - Special issue
15-06-12 - 1st CFP
15-08-12 - 2nd CFP
5-10-12 - extended submission deadline
20-11-12 - reviews due
January 2013 - camera-ready due (tentative)
WMT-12 QE Shared Task
All feature sets available