11. Manuel Leiva & Juanjo Arevalillo (Hermes): Evaluation of machine translation
Machine translation evaluation
Hermes Traducciones y Servicios Lingüísticos
MT at Hermes
Pure RBMT engines with pre- and post-processing macros.
Texts from technical domains.
The applied-technology department has been working on MT engines for over a year.
Over 250,000 words post-edited with internal engines in the last year.
Average new-word count for projects post-edited with internal engines: 9,000 words.
Our purpose with MT evals
Automated metrics might help us:
predict PE time and productivity gains;
negotiate reasonable discounts;
evaluate quality of engines;
measure performance of applied-technology department;
not depend on human-reported data.
What we hoped to find
We hoped some metric would correlate with productivity gain data provided by post-editors.
We gathered BLEU, F-Measure, METEOR and TER values.
Ideally, we would end up relying on automated metrics rather than time and productivity measurements reported by post-editors.
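One way to check whether a metric tracks productivity gain is a correlation coefficient over per-project pairs. The following is a minimal sketch, assuming Pearson's r is an acceptable measure; the slides do not say which statistic was used, and the function name and the sample figures are made up for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-project data (NOT from the slides):
# one BLEU score and one reported productivity gain (%) per project.
bleu_scores = [52.0, 61.5, 48.3, 70.2, 66.0]
productivity_gains = [35.0, 20.0, 80.0, 45.0, 110.0]

r = pearson(bleu_scores, productivity_gains)
print(f"Pearson r between BLEU and productivity gain: {r:.2f}")
```

A value of r near 1 or -1 would suggest the metric could stand in for reported productivity; a value near 0 would not.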
[Figure: What we hoped to find. Axis label: Productivity gain %.]
What we actually found: No correlation
[Figure: BLEU, F-Measure, TER and METEOR values charted against productivity gain (%).]
Reasons for the variability
Different CAT environments (Trados Studio, memoQ, Idiom, TagEditor, etc.).
Different engines (per domain, per client, etc.).
Different clients, different needs.
Different post-editors.
Or, for the same post-editor, different post-editing skills over time.
Different word volumes.
Specific productivity or consistency-enhancement processing can affect metrics negatively.
Productivity-enhancement example
Source: Add events as described in Adding Events to a Model.
PE: Agregue los eventos como se describe en Adición de eventos a un modelo.
Raw 1: Agregue los eventos como se describe en la adición de los eventos a un modelo.
Raw 2: Agregue los eventos como se describe en Adding Events to a Model.
Scores:
                               Raw 1   Raw 2
BLEU (higher is better)        68.59   53.33
TER (lower is better)          17.65   29.41
Metrics for Raw 1 are significantly better, but Raw 2 is faster to post-edit thanks to automatic terminology insertion tools (such as Xbench).
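TER is based on the number of word-level edits needed to turn the raw output into the post-edited reference. The following is a minimal sketch of that idea, assuming whitespace tokenization and plain word-level Levenshtein distance with no block shifts. It is a simplification of TER, so it is not expected to reproduce the scores above, and the function names are illustrative.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Minimum number of word insertions, deletions and substitutions
    needed to turn hyp_tokens into ref_tokens (Levenshtein distance)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, hyp_word in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref_tokens, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(prev[j] + 1,          # delete hyp_word
                            curr[j - 1] + 1,      # insert ref_word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def edit_rate(hypothesis, reference):
    """Simplified, shift-free edit rate in percent (lower is better)."""
    hyp_tokens = hypothesis.split()
    ref_tokens = reference.split()
    return 100.0 * word_edit_distance(hyp_tokens, ref_tokens) / len(ref_tokens)


# Sentences taken from the example above.
pe = "Agregue los eventos como se describe en Adición de eventos a un modelo."
raw1 = "Agregue los eventos como se describe en la adición de los eventos a un modelo."
raw2 = "Agregue los eventos como se describe en Adding Events to a Model."

for name, raw in (("Raw 1", raw1), ("Raw 2", raw2)):
    edits = word_edit_distance(raw.split(), pe.split())
    print(f"{name}: {edits} edits, edit rate {edit_rate(raw, pe):.2f}")
```

Even in this simplified form, Raw 1 needs fewer edits than Raw 2. The count cannot reflect that the untranslated title in Raw 2 can be replaced automatically by a terminology tool.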
Human evaluation
Adequacy: How much of the meaning expressed in the gold-standard translation or the source is also expressed in the target translation?
4. Everything
3. Most
2. Little
1. None
Fluency: To what extent is a target-side translation grammatically well formed, free of spelling errors, and experienced by a native speaker as using natural/intuitive language?
4. Flawless
3. Good
2. Disfluent
1. Incomprehensible
Source: TAUS MT evaluation guidelines https://evaluation.taus.net/resources/adequacy-fluency-guidelines
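The two four-point scales above can be recorded as simple enumerations. The following is a minimal sketch, assuming that ratings are collected per segment and averaged. Averaging is a common convention and is not something the slides prescribe, and the class and function names are illustrative.

```python
from enum import IntEnum


class Adequacy(IntEnum):
    NONE = 1
    LITTLE = 2
    MOST = 3
    EVERYTHING = 4


class Fluency(IntEnum):
    INCOMPREHENSIBLE = 1
    DISFLUENT = 2
    GOOD = 3
    FLAWLESS = 4


def average_scores(ratings):
    """Return (mean adequacy, mean fluency) for a list of
    (Adequacy, Fluency) pairs, one pair per rated segment."""
    if not ratings:
        raise ValueError("no ratings given")
    n = len(ratings)
    mean_adequacy = sum(int(adequacy) for adequacy, _ in ratings) / n
    mean_fluency = sum(int(fluency) for _, fluency in ratings) / n
    return mean_adequacy, mean_fluency


# Hypothetical ratings for three segments (not from the slides).
sample = [
    (Adequacy.EVERYTHING, Fluency.GOOD),
    (Adequacy.MOST, Fluency.GOOD),
    (Adequacy.MOST, Fluency.DISFLUENT),
]
print(average_scores(sample))
```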
Conclusions
We combine automated metrics with time/productivity data reported by post-editors for the final evaluation of internal MT performance.
Poor post-editing skills or any project-specific contingency can be counterbalanced by good automated metrics.
We look for qualitative information in automated metrics, not quantitative.
BLEU values of 65 and 70 for two different engines tell us that both are good engines, not that one will render 5% better results than the other.