This document is part of the Project “Machine Translation Enhanced Computer Assisted Translation (MateCat)”, funded by the 7th Framework Programme of the European Commission through grant agreement no.: 287688.

Machine Translation Enhanced Computer Assisted Translation

D5.5 – Third Report on Lab and Field Tests

Authors: Marcello Federico, Nicola Bertoldi, Mauro Cettolo, Matteo Negri, Marco Turchi, Luisa Bentivogli, Holger Schwenk, Frédéric Blain

Dissemination Level: Public

Date: 16 November 2014

Grant agreement no.: 287688
Project acronym: MateCat
Project full title: Machine Translation Enhanced Computer Assisted Translation
Funding scheme: Collaborative project
Coordinator: Marcello Federico (FBK)
Start date, duration: November 1st 2011, 36 months
Dissemination level: Public
Contractual date of delivery: October 31st, 2014
Actual date of delivery: November 16th, 2014
Deliverable number: 5.5
Deliverable title: Third Report on Lab and Field Tests
Type: Report
Status and version: Final, V1.0
Number of pages: 44
Contributing partners: FBK, LEMANS
WP leader: Translated
Task leader: FBK
Authors: Marcello Federico, Nicola Bertoldi, Mauro Cettolo, Matteo Negri, Marco Turchi, Luisa Bentivogli, Holger Schwenk, Frédéric Blain
Reviewer: Ulrich Germann
EC project officer: Aleksandra Wesolowska

The partners in MateCat are:

Fondazione Bruno Kessler (FBK), Italy
Université Le Mans (LE MANS), France
The University of Edinburgh (UEDIN), United Kingdom
Translated S.r.l. (TRANSLATED), Italy

For copies of reports, updates on project activities and other MateCat-related information, contact:

FBK MateCat
Marcello Federico
[email protected]
Povo - Via Sommarive 18
I-38123 Trento, Italy
Phone: +39 0461 314 521
Fax: +39 0461 314 591

Copies of reports and other material can also be accessed via http://www.matecat.com

© 2014, Marcello Federico, Nicola Bertoldi, Mauro Cettolo, Matteo Negri, Marco Turchi, Luisa Bentivogli, Holger Schwenk, Frédéric Blain No part of this document may be reproduced or transmitted in any form, or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission from the copyright owner.


Executive Summary

This deliverable reports the results of the final field and lab tests of the MateCat project, which were conducted during Summer and Fall 2014. The field test aimed at evaluating the impact of self-tuning, user-adaptive and informative MT components on the users’ productivity. The field test used the MateCat Tool Version 3 developed by the industrial partner, as well as the MT engines and other core components developed by the research partners, such as modules for quality estimation and terminology extraction.

This year we employed professional translators for our tests, who worked under realistic conditions in the two translation directions English-to-Italian and English-to-French, and in three linguistic domains: Legal Text, Information Technology and TED Talks. The main evaluation was carried out by comparing a state-of-the-art, domain-adapted “pre-MateCat” baseline system against a full-fledged “post-MateCat” system that offered self-tuning as well as user-adaptive and informative MT functionality. Additional focused field tests were carried out to measure the effectiveness and robustness of self-tuning and informative MT (quality estimate visualisation and bilingual term extraction). The lab tests were performed on the data collected during the field tests of the full-fledged MT systems. The goal was to assess the quality of the suggestions proposed by the full-fledged engine, and to determine if differences from the “pre-MateCat” system have an impact on the quality of the final results. With respect to the ambitious objectives set for the last year of the project (i.e. to achieve a 15% improvement in MT quality and productivity and 60% user acceptance of informative MT), the reported results indicate that all project objectives have been met.


Contents

1 Introduction
2 Field Test of Fully Fledged MT
   2.1 Motivation and Goals
   2.2 Evaluation Protocol
   2.3 Experimental Set-up
   2.4 Baseline systems
   2.5 Adaptive system
   2.6 Results
   2.7 Discussion
3 Lab Test of Fully Fledged MT
   3.1 Motivation and Goals
   3.2 Experimental Set-up and Evaluation Protocol
   3.3 Results
   3.4 Discussion
4 Field-Test of Quality Estimate Visualisation
   4.1 Motivation and Goals
   4.2 Experimental Set-up
   4.3 Evaluation
      4.3.1 The Impact of QE Labels on Post-editing Times
      4.3.2 Post-study Questionnaire
   4.4 Discussion
5 Field-Test of Bilingual Term Extraction
   5.1 Motivation and Goals
   5.2 Evaluation Protocol
   5.3 Results
   5.4 Discussion
6 Field Test of Self-Tuning MT
   6.1 Motivation and Goals
   6.2 Evaluation Protocol
   6.3 Experimental Set-up
      6.3.1 Domain adapted system
      6.3.2 Project adapted system
      6.3.3 Continuous space language model
   6.4 Results
      6.4.1 Results for the Lab tests
      6.4.2 Results for the field test
      6.4.3 Impact on user productivity
   6.5 Discussion
7 Conclusion


1 Introduction

In the last three months of the MateCat project, final field and lab tests were conducted to assess the overall technical progress of the project. In accordance with the project work plan, the final field tests were performed with the final version of the MateCat Tool. The goal was to measure the usability and utility of the tool and its integrated machine translation (MT) technology. The lab test aimed at evaluating in isolation the quality and accuracy of specific MT components developed by the partners.

Recall that the following three research problems were the focus of the research efforts by the academic partners of MateCat during the project:

• Self-tuning MT, i.e. methods to automatically tune MT engines to specific domains or translation projects;

• User-adaptive MT, i.e. methods to quickly adapt MT in response to user corrections and feedback;

• Informative MT, i.e. MT that supplies additional information beyond translation suggestions to enhance users’ productivity and work experience.

The results of these efforts have converged into a new generation of Computer-aided Translation (CAT) software, which was mainly developed by the industrial partner. The MateCat Tool is not only a stable, enterprise-level translation workbench (currently used by a few thousand professional translators), but also an advanced research platform for integrating new MT functions, running post-editing experiments and measuring user productivity. The software is distributed under the LGPL open source license, and combines features of the most advanced systems (commercial tools such as the popular SDL Trados Workbench [1], as well as free tools such as OmegaT [2]) with new functionalities. These include: (i) an advanced API for the Moses Toolkit [3], customizable to different languages and domains; (ii) ease of use through a clean and intuitive web interface that enables the collaboration of multiple users on the same project; (iii) translation memory (TM), concordance, and terminology support; and (iv) advanced logging functionalities.

The MateCat Tool runs as a web server accessible through Chrome, Firefox and Safari. The CAT web server connects with other services via open APIs: the TM server MyMemory [4], the commercial Google Translate (GT) MT server, and a list of Moses-based servers specified in a configuration file.

[1] http://www.translationzone.com/
[2] http://www.omegat.org/
[3] http://www.statmt.org/moses/
[4] http://mymemory.translated.net


Figure 1: The MateCat Tool editing page.

While MyMemory and GT are stable and established web services, customized Moses servers have to be installed and set up first. The Moses translation API extends the GT API in order to support self-tuning, user-adaptive and informative MT. During each post-editing session, the CAT server records for each segment all suggestions received, the final post-edit and the post-editing time. All statistics can be summarized in an editing log page and downloaded as a standard CSV file for further processing. The document format supported natively by the MateCat Tool is XLIFF [5], but the tool can also be configured to use external file converters. Unicode (UTF-8) encoding is supported, including non-Latin alphabets and right-to-left languages. Mark-up tags in input texts can also be handled.
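The editing log lends itself to straightforward offline analysis. As a minimal sketch (the CSV column names below are hypothetical, not the tool's actual export schema), overall throughput could be derived from such a log as follows:

```python
import csv

def summarise_editing_log(path):
    """Summarise a MateCat-style editing log exported as CSV.

    Assumed (hypothetical) columns: segment_id, source_words,
    suggestion, post_edit, editing_time_ms.
    """
    total_words = 0
    total_ms = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total_words += int(row["source_words"])
            total_ms += int(row["editing_time_ms"])
    hours = total_ms / 3_600_000.0
    return {"source_words": total_words,
            "words_per_hour": total_words / hours if hours > 0 else 0.0}

# Example: print(summarise_editing_log("editing_log.csv"))
```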

The following sections address the individual field and lab tests that were conducted in the final round of evaluations.

2 Field Test of Fully Fledged MT

2.1 Motivation and Goals

This field test aimed at evaluating the impact of self-tuning MT, user-adaptive MT and informative MT on user productivity. In particular, the goal was to compare user productivity with the MateCat Tool under two different and contrastive working conditions.

[5] http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html


In the first condition, translators received MT suggestions from a state-of-the-art, domain-adapted but static MT engine whose setup and knowledge bases do not change while the translation project is in progress. In the second condition, translators received suggestions from a domain-adapted MT engine that also has the aforementioned capabilities:

• self-tuning MT: the engine adapts at the end of each working day from all the available translation-project-specific data, i.e. the source document and the segments translated so far (project adaptation);

• user-adaptive MT: the engine continuously adapts from each single segment post-edit produced by the translator (online adaptation);

• informative MT: specifically, (i) the engine also supplies the user with an adaptive quality estimation score along with each MT suggestion (quality estimation); (ii) terminology is automatically extracted from the task document and added to the MT engine (terminology help).

Productivity gains were assessed for different translators, in two translation directions, English–Italian and English–French, and in two distinct domains, namely the “official” legal and information technology domains. TED talks were added as a third “non-official” domain for the translation direction English–French, in response to one of the recommendations from an earlier project review. We used two key performance indicators to measure user productivity: (i) the post-editing effort (PEE), i.e., the amount of corrections made by the translators on the MT suggestions, as measured by the HTER metric, and (ii) the time to edit (T2E), expressed as the average number of words translated per hour.
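To make the two indicators concrete, the sketch below computes them from a single post-editing record. The HTER used in the project is the standard shift-aware TER between the MT suggestion and its post-edit (as computed, e.g., by tercom); the function here approximates it with a plain word-level edit distance and is meant only as an illustration:

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (h != r)))  # substitution / match
        prev = cur
    return prev[-1]

def approx_hter(mt_suggestion, post_edit):
    """Rough PEE: edits needed to turn the suggestion into its post-edit,
    normalised by the post-edit length (true HTER also counts block shifts)."""
    hyp, ref = mt_suggestion.split(), post_edit.split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)

def time_to_edit(source_words, editing_seconds):
    """T2E expressed as throughput: source words translated per hour."""
    return source_words / (editing_seconds / 3600.0)
```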

2.2 Evaluation Protocol

Adaptive MT aims at providing better translation suggestions and thus reducing the required post-editing effort. Testing this hypothesis empirically means comparing post-editing effort with and without adaptation. Testing it under realistic working conditions means having users translate entire documents rather than isolated sentences, so that the adaptive system can actually learn from previous corrections.

Two important design decisions were involved in our experiments: first, whether to perform the comparison within-subject (same translator) or between-subject (different translators), and second, whether to run the contrastive tests on the same document or on different texts. Our past experience has shown that human translation evaluation calls for within-subject designs, because of the considerable natural variation among translators in terms of post-editing effort and time-to-edit.


These variations are likely due to subjective preferences and/or different professional skills with respect to the document to be translated. The observed inter-subject variability is in general much larger than the effects we want to measure in our experiments.

A within-subject design means repeating measurements under different conditions. Hence, the second design decision is whether to perform the measurements on different documents, on different portions of the same document, or on the same document at different times. Previous post-editing experiments, in which MT suggestions for subsequent sentences were randomly picked from different MT systems, showed that the results can be highly influenced by the high variability of translation difficulty among different sentences, both for humans and machine translation. [6] Random sampling of sentences also works against adaptive MT systems, because working on a sub-sample of sentences significantly reduces the amount of text repetition that an adaptive system can leverage, and because it can lead to inconsistent behaviour, as the system will sometimes learn from the user corrections and sometimes not.

Hence, for the final test we decided to adopt the following experimental design, which we assessed in previous exploratory experiments. Within-subject experiments are conducted by letting our subjects post-edit the same project [7] twice, with a month-long interval between the two sessions, a lapse of time that we conjecture is sufficient for the test subjects to forget very detailed information about the project. In order to limit other “carry-over effects”, in particular learning effects due to experience with the CAT tool and with post-editing, each post-editing exercise was arranged in two sessions: (i) a warm-up (or calibration) session lasting one full day, during which the translator post-edits the first portion of the project with MT suggestions from the static baseline; and (ii) the actual test session lasting one or more days, during which the translator post-edits the rest of the project, the first time with MT suggestions from the static baseline and the second time (one month later) with MT suggestions from the adaptive system.

2.3 Experimental Set-up

We ran our post-editing experiments between August and September 2014 on two translation directions and three tasks. For each direction/domain combination under consideration, we ran contrastive experiments with four professional translators. Table 1 shows the number of segments and words to be translated by each translator in the warm-up and test sessions, respectively.

[6] Note that this issue is just one example of the well-known “language-as-a-fixed-effect” fallacy spotted by Clark [1973], which suggests that statistical tests involving random samples of language should always try to model the random effects introduced by the chosen test sample.

[7] Each project consists of one or more documents.


Table 1: Summary statistics of the translation tasks run during the field tests: number of sentences and tokens of the source documents translated during each of the two sessions (warm-up and test).

Domain                   Warm-up                Test
                         segments   tokens      segments   tokens
Information Technology   236        3,284       300        3,874
Legal                    132        3,125       152        3,641
TED Talks                200        3,388       165        2,967

2.4 Baseline systems

For each task and language pair, the baseline MT system was a state-of-the-art, phrase-based Moses system [Koehn et al., 2007]. It features a statistical log-linear model interpolating a phrase-based translation model (TM), a lexicalized phrase-based reordering model (RM), one language model (LM), and distortion, word and phrase penalties.

For three of the five translation direction/domain combinations investigated in this trial, namely Legal Text in English–French and English–Italian and Information Technology in English–Italian, the amount of training data was sufficiently large to achieve a translation quality good enough for post-editing; hence, a standard training regime was applied. Translation and reordering models were estimated following the Moses protocol with default setup, using MGIZA++ [Gao and Vogel, 2008] for word alignment. For language modeling, we used the IRSTLM toolkit for standard n-gram modeling with an n-gram length of 5 and Modified Shift Beta smoothing [Federico et al., 2008].

For the other two tasks, namely English–French Information Technology and TED talks, the amount of available in-domain training data was comparatively small, as shown in Table 2. The training data was therefore augmented by means of the following automatic and unsupervised procedure. Data selection (cf. Deliverable 2.1) was performed to collect additional in-domain data from larger general-domain parallel corpora, by exploiting the bilingual cross-entropy difference [Axelrod et al., 2011] with mode=3, using the original in-domain parallel texts as seed data. Different amounts of text were selected from each generic corpus, and then concatenated to build one large parallel corpus.
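As an illustration of the selection criterion, the sketch below scores candidate sentence pairs by the bilingual cross-entropy difference of Axelrod et al. [2011]. It uses the kenlm Python module purely as a convenient stand-in for the project's IRSTLM models; the model paths and the selection size are assumptions:

```python
import kenlm  # illustrative stand-in; the project used IRSTLM models

def per_word_xent(lm, sentence):
    """Per-word cross-entropy (negative log10-probability per token)."""
    n_events = len(sentence.split()) + 1          # +1 for the end-of-sentence event
    return -lm.score(sentence, bos=True, eos=True) / n_events

def xent_diff(sentence, in_domain_lm, generic_lm):
    return per_word_xent(in_domain_lm, sentence) - per_word_xent(generic_lm, sentence)

def select_pairs(pairs, lm_src_in, lm_src_gen, lm_tgt_in, lm_tgt_gen, top_n):
    """Keep the top_n sentence pairs that look most in-domain-like according to
    the bilingual cross-entropy difference (lower score = more in-domain)."""
    scored = sorted(pairs, key=lambda p: xent_diff(p[0], lm_src_in, lm_src_gen)
                                         + xent_diff(p[1], lm_tgt_in, lm_tgt_gen))
    return scored[:top_n]

# lms = [kenlm.Model(p) for p in ("in.src.arpa", "gen.src.arpa", "in.tgt.arpa", "gen.tgt.arpa")]
# selected = select_pairs(generic_corpus_pairs, *lms, top_n=500000)
```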

Two TMs and two RMs were first estimated separately on the original in-domain training corpus and on the selected data using the standard Moses protocol, and then combined with the back-off technique [8], a simplified version of the fill-up method of Bisazza et al. [2011] without the provenance feature, resulting in one single TM and one single RM.

[8] The use of back-off (instead of fill-up) permits a more straightforward application of project adaptation for the full-fledged system.
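The back-off idea can be sketched as follows for the translation model alone (the real procedure also covers the reordering model and keeps the Moses feature scores intact); this is a simplified illustration, not the project's implementation:

```python
def load_phrase_table(path):
    """Read a Moses text-format phrase table: 'src ||| tgt ||| scores ...'."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            src, tgt, rest = line.rstrip("\n").split(" ||| ", 2)
            table.setdefault(src, []).append((tgt, rest))
    return table

def backoff_combine(in_domain, generic):
    """Back-off combination: keep every in-domain entry; fall back to generic
    entries only for source phrases completely unknown to the in-domain table."""
    combined = dict(in_domain)
    for src, entries in generic.items():
        combined.setdefault(src, entries)
    return combined
```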


Table 2: Available training data for all tasks and language pairs (number of running words in the target language).

             English-French                English-Italian
             IT       Legal    TED         IT       Legal
in-domain    18M      71M      3.5M        60M      63M
generic      1.1G     -        1.4G        -        -
selected     22M      -        63M         -        -

To reduce model size and decoding time, the TM was pruned according to the Relative Entropy approach proposed by Ling et al. [2012], and stored with the compact phrase table implementation of Junczys-Dowmunt [2012]. The language model was estimated as a mixture of two components estimated on the target side of the original in-domain training corpus and the selected data, respectively; this LM also used an n-gram length of 5.

For each task, the baseline system was tuned on a project-specific document, optimizing BLEU with Minimum Error Rate Training (MERT) [Och, 2003]. The optimal values of three independent runs of MERT were averaged to improve the robustness of the tuning procedure [Cettolo et al., 2011]. Table 2 shows statistics of the training data used to build the baseline systems.
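Averaging the weights of independent tuning runs is itself a small operation once the weights are read from the respective moses.ini files; the sketch below assumes the weights have already been parsed into name-to-values dictionaries and is only meant to make the idea concrete:

```python
def average_weights(runs):
    """Average the feature weights obtained from independent MERT runs.

    Each run is a dict mapping a feature name (as it appears in the moses.ini
    [weight] section, e.g. 'LM0') to its list of weight values.
    """
    averaged = {}
    for name in runs[0]:
        columns = zip(*(run[name] for run in runs))
        averaged[name] = [sum(col) / len(runs) for col in columns]
    return averaged

# average_weights([{"LM0": [0.52]}, {"LM0": [0.47]}, {"LM0": [0.55]}])  # -> {'LM0': [0.513...]}
```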

2.5 Adaptive system

For each task and language pair, the adaptive MT system was derived from its corresponding baseline system (see Section 2.4), and supported self-tuning, user-adaptive, and informative MT, and also offered terminology help, as explained in the following sections. As user-adaptive and informative MT inherently depend on the user’s post-edits, different instances of the adaptive system were deployed, one for each translator. [9]

Each system contained more feature functions than the baseline; the process of tuning the weights was similar to that of the baseline systems, but it simulated the post-editing task using previously collected post-edits. The optimal weights were shared among the translator-specific instances.

Self-tuning MT: Project adaptation was performed as described in Deliverable 1.2 and Cettolo et al. [2014].

[9] Hence, during the second session of the test, 20 independent full-fledged systems were handled. Although not strictly needed, an independent baseline system was also instantiated for each translator during the first session of the test.


The post-edits collected in the warm-up session, as well as the source of the full project text, were used as seed data [10], and the data used for building the baseline system served as generic data. The project-specific TM and RM were trained on the selected data and combined with the baseline models using the back-off technique. The LM was estimated as a mixture of the project-specific and baseline LMs. Where the LM of the baseline system was already a mixture, the LM of the adaptive system had three components, otherwise two.

User-adaptive MT: Adaptive systems adapt their models, and consequently their behavior, in response to the feedback provided by the translator (see Deliverable 2.2 and Bertoldi et al., 2014, to appear). To accomplish this, the systems are equipped with dynamic cache-based models [Bertoldi, 2014]. For each post-edited segment, the source text and the post-edited and approved translation are first aligned at word level by means of an enhanced online version of MGIZA++ [Farajian et al., 2014]. Phrase pairs and target n-grams are then extracted and inserted into the dynamic cache-based models. As the entries of the dynamic models are approved and preferred by the translator, they receive a score bonus over alternatives already in the static models, so that they are more likely to be chosen during decoding. The value of the reward decreases over time.
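A schematic version of the cache-based reward mechanism is sketched below; the exponential decay and its rate are illustrative assumptions, not the exact scoring used in the deployed models:

```python
import math

class PhraseCache:
    """Toy cache-based model: recently approved phrase pairs get a score bonus
    that decays with the number of segments processed since insertion."""

    def __init__(self, decay=0.1):
        self.decay = decay
        self.clock = 0
        self.entries = {}   # (src_phrase, tgt_phrase) -> insertion time

    def tick(self):
        """Call once per post-edited segment."""
        self.clock += 1

    def add(self, src_phrase, tgt_phrase):
        """Insert a phrase pair extracted from the latest approved post-edit."""
        self.entries[(src_phrase, tgt_phrase)] = self.clock

    def bonus(self, src_phrase, tgt_phrase):
        """Reward added to the static model score; 0 if the pair is not cached."""
        t = self.entries.get((src_phrase, tgt_phrase))
        if t is None:
            return 0.0
        return math.exp(-self.decay * (self.clock - t))  # decays as the cache ages
```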

Terminology Help: The post-edits collected in the warm-up session as well as the source of the full project text were also used to generate a bilingual terminology vocabulary to embed in the adaptive system, following the procedure described by Arcan et al. [2014b] (see also Deliverable 3.2). The set of bilingual entries was included in additional translation and language cache-based models, and rewarded during decoding. [11] Terminology assistance was not available in the English–French TED system.

Informative MT: The adaptive system also returns an estimate of the translation quality of the proposed suggestion (cf. Deliverable 3.2), which is shown to the translator in the MateCat Tool GUI (cf. Deliverable 4.3). The quality confidence value is computed by means of AQET, an Adaptive Quality Estimation Tool [Turchi et al., 2014a]. The quality score visualization can be seen in Figure 1, close to the yellow MT suggestion label. AQET was trained on the warm-up data (source, target and post-edited sentences) and its models and behaviour were adapted on the fly according to the translator’s post-edits. In the English–French TED task, AQET learned directly from the test set, because no training data was available. In all the field tests, Sparse Online Gaussian Processes were used as the online learning algorithm. All baseline and adaptive systems were run on Amazon Elastic Compute Cloud virtual machines. [12]

[10] For the English–French TED task, the source and post-edits of the warm-up were not included in the seed data, because a preliminary analysis revealed no correlation between the warm-up and test talks.

[11] The reward assigned to terminology entries was kept constant over time.
[12] aws.amazon.com/ec2/
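The online-adaptation loop of AQET can be pictured as follows. The sketch substitutes a stochastic-gradient regressor for the Sparse Online Gaussian Processes actually used, and the feature extractor is a placeholder for the real QE features:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def features(source, mt_output):
    """Placeholder feature extractor (the real system used richer QE features)."""
    s, t = source.split(), mt_output.split()
    return np.array([[len(s), len(t), len(t) / max(len(s), 1)]])

# Stand-in online learner; AQET itself used Sparse Online Gaussian Processes.
model = SGDRegressor()
model.partial_fit(features("a first source", "une premiere source"), [0.3])

def on_post_edit(source, mt_output, observed_hter):
    """After each post-edit, update the model with the observed HTER."""
    model.partial_fit(features(source, mt_output), [observed_hter])

def predict_quality(source, mt_output):
    """Quality estimate shown next to the MT suggestion."""
    return float(model.predict(features(source, mt_output))[0])
```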


2.6 Results

To recapitulate: our within-subject post-editing experiments in this study aimed at studying the differences in user productivity when post-editing suggestions from a domain-adapted static MT engine vs. post-editing suggestions produced by an adaptive MT engine featuring project adaptation, online user adaptation, adaptive quality estimation, and automatic terminology extraction.

As both conditions implicitly depend on the impact of MT on post-editing and translation, it is of course important to first measure how relevant this activity actually was in the field test we carried out. Having also been developed as a professional tool for a production environment, the MateCat Tool tries to supply the user with at least three suggestions, one from the MT engine and two from translation memories (TM). Suggestions are ranked according to their match score: the actual match score in the case of TM matches, and a fixed score of 86% in the case of MT output. Note that in the adaptive condition, the MT quality score shown to the user was not used for ranking the suggestions. To gauge the usefulness of MT per se in this scenario, we also recorded for all tasks and across the two contrastive conditions: (i) the frequency of MT suggestions that were shown as top suggestions, and (ii) the frequency at which users post-edited the top suggestion. Results of the analysis are summarised in Figures 2 and 3. Figure 2 shows that the number of MT suggestions appearing in the top position varies significantly from task to task, and that for each given task it is quite similar across the two tested conditions. The cross-task differences are due to differences in the coverage provided by the translation memory, which is clearly higher for the Information Technology domain than for the TED Talks domain. The small cross-condition differences are due to two external factors: (i) the natural growth of the translation memory during the one-month interval between the two experiments (which, as a real-world resource beyond the control of the experimental setup, could not be “frozen” for our experiments), and (ii) differences in MT server time-outs in the two sets of experiments (in order to avoid a sluggish user experience, time-outs are imposed on MT server requests).

As the primary goal of our study is to evaluate the impact of different MT settings, we decided to focus our analysis only on segments for which the top suggestion was an MT suggestion and whose source length was at least 10 words. The latter criterion still captures the vast majority of segments, but excludes segments that are typically far from the document-level averages for post-editing effort and time-to-edit and thus tend to blur the picture. Time-to-edit statistics were computed on the same segments after removing outliers, namely segments for which the translation time was below 0.5 s per word or above 30 s per word. The rationale is to exclude from the analysis segments for which no post-editing, and likely not even any reading, was performed, and segments during the translation of which the translator probably interrupted her/his work.
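The selection and outlier criteria translate directly into a filter such as the one below (the field names are hypothetical; per the text, the time-per-word bounds were applied only when computing time-to-edit statistics):

```python
def keep_for_analysis(seg, min_words=10, min_s_per_word=0.5, max_s_per_word=30.0):
    """Apply the selection criteria used in the analysis.

    `seg` is assumed to be a dict with keys 'top_suggestion_is_mt' (bool),
    'source_words' (int) and 'editing_seconds' (float).
    """
    if not seg["top_suggestion_is_mt"] or seg["source_words"] < min_words:
        return False
    s_per_word = seg["editing_seconds"] / seg["source_words"]
    return min_s_per_word <= s_per_word <= max_s_per_word
```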


Figure 2: Percentages of MT matches proposed as top suggestions.

Figure 3: Percentage of top suggestions actually post-edited by the translators.

Results of the field test are summarized in Table 3. For each official translation direction, task, and evaluated condition we report the average post-editing effort (PEE) and time-to-edit (T2E) for four translators. Relative gains (Δ%) and their statistical significance are reported for each metric. Statistical significance was determined via mixed-effects regression models [Baayen et al., 2008], including post-editors and sentence identifiers as random effects. The full-fledged adaptive systems lead to significant improvements with respect to both key performance indicators for all official tasks. Improvements are in the range of 12%–18% for PEE, and 10%–37% for T2E.


Table 3: Post-edit effort (HTER) and time-to-edit (words/hour, WH) results of the field test. Statistical significance (Sign): *** p < 0.001; ** p < 0.01; * p < 0.05. Note: the TED task was run without any warm-up session.

                      PEE-HTER                              T2E-WH
Lang    Task    Static    Adaptive   Δ%       Sign    Static   Adaptive   Δ%       Sign
En-Fr   IT      33.34     28.94      13.20    ***     1804     2465       36.74    ***
En-Fr   LE      23.38     20.48      12.40    ***     1827     2159       18.17    ***
En-It   IT      26.17     23.02      12.04    ***     2234     2651       18.67    ***
En-It   LE      30.64     25.15      17.92    ***     1602     1761        9.93    **
En-Fr   TED     54.65     56.46      -3.29    **      1510     1747       15.70    ***

All gains in PEE and T2E are significant at level p < 0.001, except for one gain in T2E, which is significant at level p < 0.01. Averaged over all official tasks and languages, the PEE of the static condition is 28.63%, whereas it drops to 24.66% in the adaptive condition. In other words, the use of adaptive systems as opposed to static systems results in a statistically significant global reduction in PEE of 13.87% (p < 0.001; note that for this statistical test, task id and language id were also considered as random effects). The corresponding figures for T2E are 1891 words per hour in the static condition and 2298 words per hour in the adaptive condition, resulting in an overall T2E gain of 21.52% (p < 0.001).
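One possible encoding of such a mixed-effects model, using the Python statsmodels package with translators as grouping factor and segments as a variance component, is sketched below; the exact model specification used in the deliverable may differ:

```python
import statsmodels.formula.api as smf

def fit_mixed_model(df):
    """HTER as a function of condition (static vs. adaptive), with a random
    intercept per translator and a per-segment variance component.

    `df` is assumed to hold one row per post-edited segment with columns
    'hter', 'condition', 'translator' and 'segment'.
    """
    model = smf.mixedlm("hter ~ condition",
                        data=df,
                        groups=df["translator"],
                        vc_formula={"segment": "0 + C(segment)"})
    return model.fit()

# result = fit_mixed_model(df)
# print(result.summary())   # the 'condition' coefficient and its p-value
```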

Finally, the additional experiment on the non-official TED task resulted in a slight increase of 3% in PEE (p < 0.01), but a 16% gain in T2E (p < 0.001). These results show that for the TED talk translation task, even though the suggestions obtained by adaptation were no better in terms of the percentage of words to fix, they could be post-edited significantly faster than those provided by the static MT system.

2.7 Discussion

With respect to the original goal of the MateCat project to achieve a 15% improvement in productivity over a baseline system, we were able to get very close to the target in terms of post-editing effort (13.87% improvement in terms of HTER), and to exceed expectations with a 21.52% gain in translation speed.

Let us now turn our attention to an analysis of the dynamic behaviour of the systems and users, which casts some light on the issues that arise when experiments are performed under real-world working conditions. Recall that our field tests employed professional translators working remotely, who were supposed to follow minimal instructions in order to ensure that the experiments were carried out in a consistent way.


Figure 4: Cumulative post-edit effort along the segments of the test document with the static setting (blue line) and the adaptive setting (red line), for each official task (row-wise arrangement) and for each translator (column-wise arrangement). Translators in each row post-edited with the same static MT engine and with a personalised adaptive MT engine. (Panels: English–French IT T01–T04, English–French LEGAL T01–T04, English–Italian IT T01–T04, English–Italian LEGAL T01–T04; y-axis: cumulative HTER, x-axis: segment index.)

In particular, the translators were asked to provide a draft translation of the supplied document without accessing any source of information other than the suggestions supplied by the MateCat Tool. Even though the trial participants were working under (apparently) the same experimental conditions (same documents, same tool, same translation memory, same machine translation engine), the results they produced show, at first glance, an impressive variability. However, these outcomes did not surprise us at all. Similar results were already observed in the previous field tests, which suggests that in order to get meaningful results, experiments should be run with large documents and with many translators. On the other hand, this requirement calls for robust and powerful significance testing methods, capable of handling repeated observations and heterogeneous data. A major technical advancement of the project was the development of suitable experimental protocols for the field tests, and the development and adoption of powerful statistical testing methods based on mixed-effects models.

To get a glimpse of the high variability in the collected data, we show in Figure 4 the trend of the cumulative post-edit effort (HTER) along the test document, under both tested conditions, for all the translators.


By comparing the plots along each row, which correspond to translators working on the same document and translation direction, it is clearly evident that translators may produce very different post-edits. In particular, the blue plots correspond to incremental HTER against suggestions from the static MT engine, which is shared among all users. The similar trends of the curves correspond to the performance of the MT engines on different parts of the document (documents are very far from being independent and identically distributed samples of segments). The different levels of the blue curves in each row show how differently users post-edit the very same suggestions: some translators post-edit more, others much less. Finally, the distance between the blue and red curves shows how much translators gain in post-edit effort when they receive suggestions from the adaptive system.

3 Lab Test of Fully Fledged MT

3.1 Motivation and Goals

The raw and post-edited translations produced in the official tasks of the field tests for the full-fledged MT systems were subsequently subject to human quality evaluation. The goal of this lab test was to determine (i) whether the quality of the adaptive MT engine is inherently better than that of the static baseline; and (ii) whether the quality of the resulting post-edits is affected by whether the underlying MT suggestion was produced by the static or the adaptive system.

3.2 Experimental Set-up and Evaluation Protocol

We arranged the data from the 16 experiments in the Legal and IT domain field tests (4 post-editors × 2 translation directions × 2 domains) into 32 human evaluation tasks contrasting, for each post-editor and source segment, either

• the two MT suggestions received under the static and adaptive settings; or

• the two post-edits (PE) produced under the static and adaptive settings.

We thus obtained 32 human evaluation tasks: 16 where the quality of the two MT suggestions had to be judged, and 16 where the quality of the two resulting post-edits was to be assessed. To avoid priming effects and possible biases towards the translations to be assessed, the tasks were given to 32 different domain-expert translators as judges, so that each judge saw each source segment exactly once. The evaluation items were randomly assigned to each task, so that each judge had to assess the same amount of data (MT suggestions or post-edits depending on the task) from all post-editors.


To ensure consistency in the comparison, we selected from the “test session” data used in the field tests (see Table 1) all and only those segments for which (i) the MT output was presented as the first suggestion; (ii) this suggestion was chosen for post-editing by all the translators; and (iii) the corresponding source item consisted of at least 10 words. In total, this selection process produced, per language pair, 152 evaluation items in the Information Technology domain and 112 evaluation items in the Legal domain.

Quality judgments were collected with the MT-EQuAl tool [Girardi et al., 2014] [13], which was developed within the MateCat project. For each source sentence, two translations (MT suggestions or post-edits, depending on the task) had to be scored on a 5-level Likert scale corresponding to increasing levels of quality, ranging from “useless” to “human quality” translation. The numerical scores assigned by the judges to each of the two translations in each evaluation item were then transformed into rankings [14]: a higher quality score corresponds to a “win”, a lower quality score is a “loss” and the same quality score constitutes a “tie”. Wins and ties were collapsed into one class; ranking statistics were then computed at the language/task level by estimating for each contrastive condition (static vs. adaptive) the expected win-or-tie probabilities on the judgements available for all post-editors.

3.3 Results

The win-or-tie probabilities on MT and PE quality thus computed are reported in Table 4, together with the relative change (Δ%) and an assessment of the differences’ level of statistical significance. Significance tests were computed by applying a pairwise random permutation test.
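The conversion of paired 1-5 judgements into win-or-tie statistics, together with one standard instantiation of a paired randomisation test (the exact test statistic is not spelled out in the text), can be sketched as follows:

```python
import random

def win_or_tie_rate(static_scores, adaptive_scores):
    """Fraction of items where the adaptive output scores at least as high."""
    pairs = list(zip(static_scores, adaptive_scores))
    return sum(a >= s for s, a in pairs) / len(pairs)

def paired_permutation_test(static_scores, adaptive_scores, trials=10000, seed=0):
    """Two-sided paired randomisation test on the mean score difference."""
    rng = random.Random(seed)
    diffs = [a - s for s, a in zip(static_scores, adaptive_scores)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(trials):
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted)) / len(diffs) >= observed:
            hits += 1
    return hits / trials   # estimated p-value
```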

In terms of raw MT quality, the adaptive MT system produces statistically significantly better results only on the IT task. In the Legal domain we actually observe a small degradation in quality for French and merely a very small increase in quality for Italian.

In terms of PE quality, however, we observe that the post-edits based on the adaptive MT improve over those based on the static MT in both domains and both language pairs. These improvements in quality, which we actually did not expect, range between 2% and 10% and are statistically significant in three tasks out of four. The overall relative improvement over all languages and domains is 5% with p < 0.05.

[13] www.mt4cat.org/software/mt-equal
[14] For the specific purpose of this experiment, we could have asked judges to directly rank the two translations relative to each other. However, we preferred to ask for independent quality ratings on a 1-5 scale as, being more informative, they open the possibility of further experiments, such as the comparison of the 4 adaptive MT systems, or the analysis of the quality distance between the MT systems or the post-edits.


Table 4: Machine translation (MT) and post-editing (PE) quality evaluated through human judgements. Figures correspond to the percentage of MT outputs/post-edits that rank top under the given condition (including ties). Statistical significance (Sign): *** p < 0.001; ** p < 0.01; * p < 0.05.

                      MT Quality                            PE Quality
Lang    Task    Static    Adaptive   Δ%       Sign    Static   Adaptive   Δ%       Sign
En-Fr   IT      63.40     70.19      10.71    ***     64.76    68.17       5.27    *
En-Fr   LE      68.36     66.35      -2.94            66.29    67.80       2.28
En-It   IT      67.31     72.04       7.03    ***     64.23    70.85      10.31    ***
En-It   LE      68.47     68.80       0.48            66.35    69.64       4.96    *

Table 5: Machine translation (MT) and post-editing (PE) quality evaluated through human judgements. With respect to the previous table, the analysis is carried out after mapping the original judgements from a 1-5 scale into a 2-5 scale. Figures correspond to the percentage of MT outputs/post-edits that rank top under the given condition (including ties).

                      MT Quality                            PE Quality
Lang    Task    Static    Adaptive   Δ%       Sign    Static   Adaptive   Δ%       Sign
En-Fr   IT      66.41     73.52      10.71    ***     66.45    69.16       4.08
En-Fr   LE      73.77     73.88       0.15            66.85    68.64       2.68
En-It   IT      68.83     73.81       7.24    ***     64.27    70.89      10.30    ***
En-It   LE      71.65     73.32       2.33            66.35    69.64       4.96    *

3.4 Discussion

At first glance, the results obtained in this evaluation clearly seem to contradict the conclusions we drew in Section 2 (cf. Table 3), namely that full-fledged adaptive MT systems result in significant improvements for all tasks. The PEE/HTER scores in the Legal domain in particular do not match the corresponding figures related to MT quality: even though judges seem to have observed little or no improvement in the quality of the raw MT output, adaptive MT clearly outperforms static MT in terms of PEE/HTER.

A simple explanation for the results obtained in the Legal domain could be that its MT outputs are much more difficult to evaluate. Sentences in this domain are typically long, and quality differences in several places in pairs of long sentences are probably difficult to tease apart and transform into a reliable quality judgement, especially when the overall translation quality is low. To eliminate this source of noise in the data, we repeated the evaluation after mapping all 1-5 grades into a 4-value scale. In practice, we eliminated the 1-2 score differences by upgrading all scores of 1 to 2. The new MT and PE quality results are reported in Table 5.


The new results are more in line with those obtained in the field tests: MT quality for the adaptive condition is always better for all tasks, with very small improvements on the Legal task and large improvements, up to 11% (statistically significant), in the IT tasks. As for statistical significance, it must be noted that we collected only one judgement for each MT output. For MT outputs with the complexity of the Legal task, a single judgment may not be enough to measure significant differences using a 1-5 Likert scale. The overall relative improvement in MT quality over all languages and tasks is 6% with p < 0.001, and the overall relative improvement in PE quality is 4% with p < 0.001.

To conclude: besides confirming the results obtained in the field tests, a noteworthy outcome of this evaluation is the unexpected improvement in the quality of the final product. The results show that the adaptive MT system not only makes post-editors more productive, it also improves the quality of their work.

4 Field-Test of Quality Estimate Visualisation

4.1 Motivation and Goals

One of the key questions in utilising Quality Estimation (QE) [Mehdad et al., 2012, Turchi et al., 2014a] in a CAT scenario is how to relay QE information to the user. In this targeted field test, we evaluated a way of visualising MT quality estimates for the user that is based on a color-coded binary classification (‘good’ vs. ‘bad’ or ‘useless’), as an alternative to real-valued quality scores (which are less immediate than the two-color scheme, as they require some interpretation by the user). In this context, ‘good’ means that post-editing the translation is expected to be faster than translating from scratch; ‘useless’ means that post-editing the translation is expected to take longer than translating from scratch. The effectiveness of this approach was evaluated in terms of its effect on post-editing time (cf. Section 4.3).

4.2 Experimental Set-up

We modified the MateCat GUI [Federico et al., 2014] to show the following behaviour.

1. The tool provides only one single translation suggestion (MT output) per segment, instead of the usual three (1 MT suggestion plus 2 TM matches).

2. The edit area is not automatically pre-filled with the suggestion. Instead, a fetching mechanism allows the user to either explicitly select the MT suggestion to pre-fill the edit area, or to ignore it and translate from scratch by directly writing into the edit area. Users’ fetching actions can be logged, in order to obtain a first basic insight about the usefulness of the MT suggestion.


The intuition behind this setup is that if the suggestion is deemed to be useful by the post-editor, he or she will explicitly select it by a mouse click.

3. Each translation suggestion is presented with a coloured flag (green for good, red for useless), which indicates its expected quality/usefulness to the post-editor. In the contrastive condition (no binary QE visualization), grey is used as the neutral and uniform flag color.

The experiment was set up for intra-annotator comparison on a single long document as follows. First, the document is split in half. The first half serves as the training portion for a binary quality estimator; the second half is reserved for evaluation. The training portion is machine-translated and post-edited under standard conditions with the static MT system. Based on their post-edits, the raw MT output samples are then labeled as good or bad depending on the HTER between the raw MT output and its post-edited version. Based on the empirical findings of prior work [Turchi et al., 2013, 2014b], we used an HTER threshold of 0.4 (‘good’ if HTER ≤ 0.4; ‘bad’ otherwise).

For each individual human translator, we then train a separate support vector machine (SVM) as a translator-specific binary classifier on the labeled samples, using the 17 baseline features proposed by Specia et al. [2009]. The features are extracted from the data available at prediction time, that is, the source text and the raw MT output. The SVM parameters are optimized by cross-validation on the training set.
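The training recipe for the translator-specific classifiers can be sketched with scikit-learn as below; the hyperparameter grid and the RBF kernel are assumptions, while the HTER ≤ 0.4 labelling rule and the 17 pre-extracted baseline features follow the text:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_binary_qe(features, hter, threshold=0.4):
    """Train a translator-specific 'good'/'bad' classifier.

    `features`: array of shape (n_samples, 17) with the pre-extracted baseline
    QE features; a suggestion is labelled good (1) if its HTER <= threshold.
    """
    labels = (np.asarray(hter) <= threshold).astype(int)
    pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = GridSearchCV(pipeline,
                        {"svc__C": [0.1, 1, 10],
                         "svc__gamma": ["scale", 0.1, 0.01]},
                        cv=5)
    grid.fit(features, labels)
    return grid.best_estimator_

# classifier = train_binary_qe(train_features, train_hter)
# flags = ["green" if y == 1 else "red" for y in classifier.predict(test_features)]
```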

With these classifiers, we then assign quality flags to the raw segment translations in the test portion of the respective document. For this portion, each translator is given an even distribution of segments flagged according to the test condition (green/red indicating MT quality), and segments flagged according to the baseline condition (uniform grey flags). After post-editing, the post-editing times are analysed with respect to the impact that the binary coloring scheme has on overall post-edit times.

4.3 Evaluation

4.3.1 The Impact of QE Labels on Post-editing Times

We applied this procedure to one English–Italian document from the IT domain, which was post-edited independently by four professional translators. The training and test portions contained 542 and 847 segments, respectively. Half of the 847 test segments were presented with coloured quality flags, with a ratio of green to red labels of about 325:98 (i.e., ca. 70% ‘good’, 30% ‘useless’).

To analyse the impact of the coloured quality labels on translators’ productivity, we first compared the average post-editing time (milliseconds per word) under the test condition (MT suggestions presented with coloured flags) against the average post-editing times under the baseline condition (uninformative grey flags).


Figure 5: Incremental average post-editing time (milliseconds per word) for different maximum sentence lengths. Dashed lines refer to the segments for which the suggestion was presented with a coloured quality flag (Red/Green); solid lines refer to the segments for which the suggestion was presented with the uninformative grey flag. (y-axis: PET in msec/word; x-axis: maximum sentence length.)

Figure 5 shows the incremental average post-editing time (in msec/word) in the two conditions for sentence lengths ranging from 1 to 49 words, i.e. the maximum length in our dataset. [15] As shown by the figure, post-editing times with the coloured quality flags are below those in the baseline condition for all sentence length cut-offs. Except for very short segments, however, the gains are rather small, with values of around 150 msec per word for segment lengths up to 15-20 words. The gain in post-editing speed observed here is not statistically significant (p > 0.05) according to the Wilcoxon signed-rank test.

Figure 6 shows the results of the same analysis on selected sub-samples of the data set. Here, the post-editing data were filtered based on the amount of post-editing that was performed on the respective segments. Different HTER thresholds of 0.2, 0.4, 0.6 and 0.8 were applied for filtering, removing instances with HTER values above the given threshold. The results of this analysis suggest that the colour coding of QE estimates is more useful for segments that receive little post-editing, whereas the benefit disappears as the underlying translations require more and more editing. The significance tests for this experiment confirm this observation, with p < 0.001 for HTER ≤ 0.2, and p < 0.05 for HTER ≤ 0.4 and ≤ 0.6. In accordance with the results for the whole dataset, the difference in post-editing time with and without coloured flags is no longer significant when the overall quality of the translations is poor (HTER ≤ 0.8).

[15] We discarded as outliers segments with post-editing times of less than 500 msec or more than 30 sec per word. Segments with unrealistically short post-editing times may not even have been read completely, and very long post-editing times suggest that the post-editor interrupted his or her work or got distracted.


Figure 6: Incremental average post-editing time (milliseconds per word) for different sentence lengths and for different levels of quality of the MT suggestions (panels: HTER ≤ 0.2, ≤ 0.4, ≤ 0.6, ≤ 0.8). Dashed lines refer to the segments for which the suggestion was presented with a coloured quality flag; solid lines refer to the segments for which the suggestion was presented with the uninformative grey flag.

There are several possible explanations for these results. One is that the binary classifier makes more errors on poorer raw translations. Despite the rather high classification accuracy (78%), our error analysis indicated that most of the misclassified instances fall in the “bad” class (false positives, i.e. green labels wrongly assigned to poor suggestions). This, in turn, can be explained by the rather unbalanced distribution of the training examples: the average proportion of negative training instances obtained for the four post-editors is indeed 30.31% ± 4.12 of the total. Moreover, the negative training instances are very few for HTER values larger than 0.6, which makes the learning task particularly difficult.

Another explanation is that, in the case of translations that require a lot of editing, the time savings that can be obtained by deciding more quickly to discard a poor translation suggestion, based on its label, are far outweighed by the overall translation or post-editing effort required for the respective segment.

4.3.2 Post-study Questionnaire

In addition to the objective evaluation of post-editing times, we also administered a short post-study questionnaire. The questionnaire contained nine statements about the usefulness of the assigned quality labels. For each statement, the post-editors were asked to indicate their agreement on a 5-point Likert scale ranging from “strongly disagree” (score = 1) to “strongly agree” (score = 5).


Table 6: Questionnaire for the evaluation of Binary Quality Estimation. Column 3 provides the scores assigned by each translator and column 4 provides the average score assigned to each statement.

   # | Statement                                                                                                                                      | Scores  | Avg. score
   1 | Used as a translation quality indicator, the GREEN and RED flags assigned to each MT suggestion helped me to speed-up the translation work.   | 3-5-4-4 | 4
   2 | Overall, the suggestions marked with GREEN flags were of higher quality compared to those marked with RED flags.                              | 2-4-5-5 | 4
   3 | More often I selected (mouse click) and corrected the suggestions marked with GREEN flags rather than those marked with RED flags.            | 3-2-2-5 | 3
   4 | I often made small corrections to the suggestions marked with GREEN flags, thus producing a final translation similar to the suggested segment. | 4-4-4-5 | 4.25
   5 | By post-editing the suggestions marked with RED flags I often produced a final translation very different from the suggested segment.          | 3-4-4-5 | 4
   6 | During my work I have developed a tendency to always read the suggestions marked with GREEN flags.                                            | 5-4-4-5 | 4.5
   7 | During my work I have developed a tendency to ignore the suggestions marked with RED flags.                                                   | 2-2-3-2 | 2.25
   8 | During my work, my attention to the assigned flags decreased (independently from the colour).                                                 | 3-1-2-2 | 2
   9 | As a quality indicator, coloured flags are more informative and effective than numeric values (e.g. scores in the 1%-100% range).             | 3-4-4-3 | 3.5

The statements included in the questionnaire and the average feedback scores collected from the post-editors are shown in Table 6.

The participants’ responses confirm our overall positive evaluation. On average, the participants agree that binary QE labels are a useful feature. The answers to Statement 7 reveal, however, that the red flag did not fully fulfil its purpose: in spite of its intended function (telling the user that the suggestion can be ignored as it needs complete rewriting), the translators kept reading the suggestions labelled as not useful. This could simply be due to the low accuracy of our classifiers on the “bad” class. On the other hand, we cannot expect translators to change fundamental habits in the course of translating or post-editing a few hundred sentences.


4.4 Discussion

The MateCat objective in terms of Informative MT was to reach 60% user acceptance in field tests by the end of the third year. For Task 3.2, the progress towards this goal has been assessed by means of a quantitative analysis of translators’ productivity and by means of a post-study questionnaire. In the quantitative analysis, post-editing time variations with and without binary QE labels show that QE information brings significant post-editing time reductions when the quality of the MT suggestions is in the medium-high range (0 ≤ HTER ≤ 0.6). Despite the small number of participants involved in this study, their responses to the questionnaire suggest that the 60% user acceptance goal was reached. Indeed, three out of four participants agreed with the statement: “Used as a translation quality indicator, the GREEN and RED flags assigned to each MT suggestion helped me to speed-up the translation work.”

5 Field-Test of Bilingual Term Extraction

5.1 Motivation and Goals

With respect to bilingual terminology extraction [Arcan et al., 2014a], we conducted experiments to determine if, and to what extent, automatically extracted bilingual terms can be useful to a terminologist when creating a glossary or a term base. In this context, usefulness was defined as the ability to identify, in a given source document, term-translation pairs that are both relevant to the domain of the document and useful to a terminologist, i.e., pairs in which the term itself, or its translation, is not obvious to or already known by inexperienced terminologists.

5.2 Evaluation Protocol

The field test on bilingual terminology extraction was carried out as follows. Following the bilingual term extraction method described in [Arcan et al., 2014a], we first extracted a list of terms from a document in the source language based on domain relevance indicators. These terms were then associated with corresponding translations using Wikipedia cross-lingual links.

In order to organize the list by taking into account also the usefulness of the extracted terms to a terminologist, we applied a ranking mechanism based on counting the number of Wikipedia cross-lingual links associated with each term. Our assumptions are that: (i) the number of links to the description of a term in other languages is a good indicator of how widely the term itself is known, and (ii) the most widely known terms are the least useful ones from the terminologist’s perspective. For instance, although the English terms “operating system” and “IPL” (“Information Processing Language”) are both relevant to the Information Technology domain, they have different degrees of potential usefulness, the first being probably known to any terminologist and the latter probably known only to the most expert ones in the field. The number of cross-lingual links associated with each of these terms in Wikipedia, 180 and 6, respectively, is a potentially good indicator of such differences. In accordance with these assumptions, the extracted terms were ranked by considering the number of cross-lingual links, so that the terms more likely to be useful (i.e., those with a lower number of equivalent pages in other languages) were promoted, while the less useful ones (those with a higher number of equivalent pages in other languages) were demoted.
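The ranking step amounts to a simple sort on the cross-lingual link counts. The snippet below is a minimal sketch of this criterion; the data structure and the example counts are assumptions made for illustration.

```python
# Minimal sketch of the link-count ranking criterion (illustrative data only).
def rank_terms(term_pairs):
    """Sort (term_src, term_tgt, n_crosslingual_links) triples so that terms with
    fewer cross-lingual Wikipedia links (assumed less widely known, hence more
    useful to a terminologist) come first."""
    return sorted(term_pairs, key=lambda t: t[2])

candidates = [
    ("operating system", "sistema operativo", 180),   # widely known -> demoted
    ("IPL", "IPL", 6),                                # little known -> promoted
]
for src, tgt, links in rank_terms(candidates):
    print(f"{src} -> {tgt} ({links} cross-lingual links)")
```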

For the evaluation, a group of terminologists was presented with the original source-language document, a list of ⟨term_src, term_tgt, snippet⟩ triples, and a questionnaire. In each triple, term_src, term_tgt, and snippet are the extracted term in the source language, its translation in the target language, and a short text snippet that provides additional information to understand the meaning of term_src. The text snippet was extracted from the first sentence of the corresponding Wikipedia page. The questionnaire (cf. Figure 7) was designed to collect human judgements about each triple with respect to the following dimensions: correctness of the translation, term completeness and relevance for the domain, and usefulness of the triple for the terminologist.


Figure 7: Questionnaire for the evaluation of bilingual terminology extraction.


Table 7: Bilingual Term Extraction evaluation. Translation quality of the target terms (correct vs. wrong), domain relevance (correct vs. wrong) and usefulness (useful vs. not useful) of the bilingual terms evaluated through the questionnaire.

                          EN-FR                               EN-IT
                          Positive         Negative           Positive         Negative
   Translation Quality    84.26 ± 6.51     15.74 ± 6.51       84.71 ± 3.96     15.29 ± 3.96
   Domain Relevance       78.89 ± 13.80    21.11 ± 13.80      82.75 ± 9.92     17.25 ± 9.92
   Usefulness             67.04 ± 10.01    32.96 ± 10.01      79.80 ± 17.78    20.20 ± 17.78

The experiments on bilingual terminology extraction were conducted for two language combinations: English–Italian and English–French. For each language pair, a list of ⟨term_src, term_tgt, snippet⟩ triples was compiled by extracting the relevant terms from the same English document (7,158 tokens in total), which came from the Information Technology domain. The resulting lists consisted of 102 triples for English–Italian and 108 triples for English–French, the small difference in size being due to the slightly lower coverage of the Italian Wikipedia, which does not provide an Italian equivalent for every English term identified in the source document. For each language combination, five terminologists with different degrees of expertise (ranging from one to thirteen years) evaluated the bilingual term lists by filling in the questionnaire shown in Figure 7.

5.3 Results

The results of the evaluation are summarized in Table 7. The second row of the table reports the average proportion of ⟨term_src, term_tgt, snippet⟩ triples for which term_tgt (i.e. the translation of an identified term term_src) is correct/wrong. As can be seen from the table, around 84% of the translated terms are correct for both language pairs. These numbers are rather consistent across the five terminologists involved in each experiment, with a variation of ±6.51 for EN-FR and ±3.96 for EN-IT.

The third row of Table 7 reports, for both language pairs, the average values for the domain relevance dimension. The analysis was carried out by counting the proportion of ⟨term_src, term_tgt, snippet⟩ triples in which term_src and term_tgt are both domain-specific terms and are correctly identified (i.e., in the case of multiwords, the whole term and its translation have been correctly recognized). If either of the two conditions does not hold, the whole triple was added to the count of the non-relevant ones. Also in this case the results are rather positive: for EN-FR the average proportion of relevant triples is 78.89% (±13.80), while for EN-IT it reaches 82.75% (±9.92).

The fourth row of Table 7 reports, for both language pairs, the average values for the usefulness dimension. The analysis was carried out by counting the proportion of ⟨term_src, term_tgt, snippet⟩ triples identified as useful for a terminologist in the task of compiling a term base. Each terminologist was asked to consider the triple as a whole, that is, to mark as relevant only the triples in which all the elements (including the text snippet) refer to a concept that is relevant to the domain of the source document and not too obvious or widely known. As shown in the table, the proportion of useful terms identified is rather high for both language pairs. Average values are 67.04% (±10.01) for EN-FR and 79.80% (±17.78) for EN-IT, showing the overall effectiveness of our method.

We also analysed the distribution of the useful terms in the two lists (108 triples for EN-FR and 102 triples for EN-IT). For each triple in the list, we considered the feedback provided by expert terminologists (those having at least six years of experience) and junior terminologists (those having at most two years of experience). For each language pair, the usefulness judgements of two expert and two junior terminologists were hence considered independently. This was done by starting from the top of the list (i.e. from the terms with fewer cross-lingual links in Wikipedia, hence more useful according to our sorting criterion) and counting the average number of positive judgements for each item. Figure 8 shows the cumulative average usefulness of the items in the two lists, according to the two terminologists’ profiles. For both language pairs we observe similar trends, which indicate: (i) the high usefulness of top-ranked terms; and (ii) the higher usefulness of the retrieved terms for the less experienced terminologists. For instance, around 80% of the English-French triples and 90% of the English-Italian triples in the top 50 positions were marked as useful by the junior terminologists. These proportions drop slightly if we look at the feedback of expert terminologists: in this case, the probability of finding a term that is useful to an expert terminologist in the top 50 terms is around 0.77 for EN-FR and 0.74 for EN-IT.
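The cumulative curves in Figure 8 can be computed as running averages over the ranked list. The sketch below is an illustrative reconstruction under assumed data structures (one binary usefulness judgement per annotator and triple), not the code actually used to produce the figure.

```python
# Illustrative sketch: cumulative average usefulness over a ranked term list.
def cumulative_usefulness(judgements):
    """judgements: list (in ranked order) of lists of 0/1 votes, one vote per
    annotator of the considered profile (e.g. the two junior terminologists).
    Returns, for each list prefix, the average share of positive votes."""
    curve, positives, total = [], 0, 0
    for votes in judgements:
        positives += sum(votes)
        total += len(votes)
        curve.append(positives / total)
    return curve

# Toy example: the first triples are judged useful by both juniors, later ones less so.
junior_votes = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 0]]
print(cumulative_usefulness(junior_votes))   # [1.0, 1.0, 0.83..., 0.75, 0.6]
```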

In general, if we consider the entire lists, for both language pairs and terminologist profiles the usefulness of the identified terms is always above 60%. With respect to the MateCat objectives for the third year (“60% user acceptance of informative MT in field tests”), these results indicate the overall success of WP3 activities also on Task 3.1 (Terminology Help).

5.4 Discussion

The progress on Task 3.1 led to a reliable method for extracting relevant and useful bilingual terms from a monolingual document. The results of an ad-hoc field test on this task revealed that the extracted terms are substantially above all possible “60%” thresholds in terms of correctness, relevance, and usefulness.


[Figure 8 — two panels (“En-Fr Bilingual Term List”, “En-It Bilingual Term List”); x-axis: bilingual terms (ranked position), y-axis: percentage of usefulness; one curve per profile (expert terminologist, junior terminologist).]

Figure 8: Distribution of useful bilingual terms over the entire list (EN-FR top, EN-IT bottom).


6 Field Test of Self-Tuning MT

6.1 Motivation and Goals

This field test aimed at validating both the effectiveness and the robustness of the self-tuning MT functionality in the framework of a translation project covering several days. Through a well-defined adaptation protocol, which embeds the technology of continuous space language modelling, the test should demonstrate the gain in user productivity for a translation task run over five days. The test is performed in two parts: during the first part of the experiment, translators receive MT suggestions from a state-of-the-art domain-adapted Moses engine, without any adaptation step between the days of work. In the second part, the MT suggestions are provided by an MT system that has been previously adapted to the current project using the human translations of the prior working days.

In addition to the two key performance indicators, namely the post-editing effort and the time to edit, this field test also aimed at evaluating translation quality through both the BLEU and TER metrics. Beyond the user productivity gain, the test has to make sure that no regression in translation quality is observed after several days of work due to over-fitting of the project adaptation, since the previous working days are used to adapt the system.

6.2 Evaluation Protocol

This test aims at highlighting the contribution of adaptive MT to user productivity. The evaluation protocol is hence similar to the one used in the fully-fledged field test with regard to the adaptive functionality: we asked the same translators to post-edit the same document with and without adaptive MT, with a latency period between the two parts of the experiment.

Moreover, in order to respect realistic working conditions, we decided to set up a unique user-specific Moses engine per translator. By these means, any inter-user side effects due to personal choices or stylistic edits are avoided. In addition, we obtain multiple references for assessing the results of the test.

Consequently, it is desirable for the assessment of the adaptation schemes that the human translators work in a synchronized manner, i.e. that the same amount of data is translated every day by each translator. The systems are then adapted, individually for each translator, using the previous days of work, and the translators use their own adapted systems for the next day, and so on.

To summarize, the test was organized as follows: (i) translators post-edit the same amount of data each day, over 5 days, using a static state-of-the-art domain-adapted Moses engine; (ii) translators post-edit the same amount of data, over 5 days of work, but this time using an individualized and daily project-adapted Moses engine.


Table 8: Summary statistics of the translation task run during the field test (Legal domain).

   Day    Segments   Tokens
   day1   144        3.3k
   day2   129        3.4k
   day3   136        3.3k
   day4   130        3.2k
   day5   132        3.3k
   all    671        16.6k

6.3 Experimental Set-up

We ran experiments using a post-editing workflow in September and October 2014, translating from English into French. For these experiments, we asked four professional translators to post-edit translations in the Legal domain. Statistics of the document to be translated are shown in Table 8.

The MT systems used for our test are described in the following section.

6.3.1 Domain adapted system

In this section we first describe our domain-adapted (DA) system, which was built before the human translators started working. We then describe our project-adapted (PA) system, which results from the integration of the daily translations made by the human translators into the DA system. In both systems (DA and PA), data selection is performed with the XenC tool [Rousseau, 2013], which is based on the algorithms proposed by Moore and Lewis [2010] and Axelrod et al. [2011]. To adapt a generic system to a specific domain (Legal in our case), an in-domain corpus is required. For that purpose, we combined the dgtna and the OPUS-ECB corpora, which amount to about 22M words. Data selection is performed on all other available corpora, i.e. the parallel texts from Europarl, JRC-Acquis, the news commentary and software manuals of the OPUS corpus, translation memories and the United Nations corpus. Monolingual data selection was also performed on large amounts of newspaper texts (more than 700M words from the WMT evaluations). A domain-specific development set of about 32K words was also available.
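The selection criterion behind XenC is the cross-entropy difference of Moore and Lewis [2010]: each candidate sentence is scored by the difference between its cross-entropy under an in-domain language model and under a general-domain language model, and the lowest-scoring sentences are kept. The sketch below illustrates the idea with simple add-one-smoothed unigram models; XenC itself uses proper n-gram language models, so this is only a didactic approximation.

```python
# Didactic sketch of Moore-Lewis cross-entropy-difference data selection.
# Unigram LMs with add-one smoothing stand in for the n-gram LMs used by XenC.
import math
from collections import Counter

def unigram_lm(corpus):
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1                        # +1 for unseen words
    return lambda w: (counts[w] + 1) / (total + vocab)

def cross_entropy(lm, sentence):
    words = sentence.split()
    return -sum(math.log2(lm(w)) for w in words) / max(len(words), 1)

def moore_lewis_select(candidates, in_domain, general, keep_ratio=0.1):
    """Keep the candidates that look most in-domain relative to the general corpus."""
    lm_in, lm_gen = unigram_lm(in_domain), unigram_lm(general)
    scored = sorted(candidates,
                    key=lambda s: cross_entropy(lm_in, s) - cross_entropy(lm_gen, s))
    return scored[: max(1, int(len(scored) * keep_ratio))]
```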

6.3.2 Project adapted system

Project adaptation is performed iteratively during the lifetime of the translation project (5 days in our case). After the first day of work by the human translators (based on the DA SMT system), knowledge about the newly translated text is injected into the SMT system in order to improve the translations of the next day. Basically, we add the newly translated text of the current day to the development set; we then perform a new monolingual and bilingual data selection based on this new development set. The project-adapted system is built on this selected data and optimized on the new development set. This procedure is identical to the one used for our English/German system [Cettolo et al., 2014]. Data selection significantly reduces the amount of data used for building the translation and language models, i.e. from 512M to 26M words for the bitexts and from 1.3B to 178M words for the LM training data. As a positive side effect, the models are much smaller and easier to deploy on the standard computers usually found in an LSP environment.

When performing project adaptation of an SMT system, we assume that the documents of a large project are similar, so that adapting the SMT system to the already processed days will improve the translation quality on the following days. However, we need to be careful not to “over-adapt” the system to a particular day of the project; this is particularly risky since the daily amount of new project data is relatively small (about 3k words). Therefore, we add three times the daily data to our existing domain-specific development set. The factor of three was empirically chosen during our lab tests to account for the different sizes. Also, all the preceding days are kept, i.e. when we adapt after three days, we use all the data from the first three days.
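The daily procedure can be summarized as the loop below. It is a schematic sketch only: the functions for data selection, tuning and post-editing are placeholders standing in for the actual XenC/Moses pipeline, and the factor of 3 is the empirically chosen weight described above.

```python
# Schematic sketch of the daily project-adaptation loop (function arguments are
# placeholders for the real XenC/Moses pipeline, not actual APIs).
def run_project(days, base_dev, select_data, build_and_tune, post_edit):
    project_data = []                       # accumulated post-edited segments
    system = build_and_tune(select_data(base_dev), base_dev)    # DA system for day 1
    for day in days:
        post_edits = post_edit(system, day)         # translators work on today's data
        project_data.extend(post_edits)
        # Development set = domain dev set + 3x all post-edits produced so far,
        # to avoid over-adapting to the small amount of daily project data.
        dev = base_dev + 3 * project_data
        system = build_and_tune(select_data(dev), dev)           # PA system for next day
    return system
```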

6.3.3 Continuous space language model

In recent years, there has been significantly increasing interest in using neural networks in SMT, and more generally in NLP. LIUM has long experience with this technology. We trained a neural network language model, also called a continuous space language model (CSLM), on exactly the same data as the classical n-gram back-off LM. The CSLM was integrated into the Moses decoder and was used by all the translators in this field test. As far as we know, this is the first time that a neural-network-based SMT system has been used in a professional environment.

Another important advantage of the CSLM is the fact that it can be adapted very efficiently. After each day of work, we have a very small amount of project-specific data (about three thousand words). These data are used to perform a new, project-specific data selection to adapt the back-off LM. The CSLM, on the other hand, can be adapted directly by performing incremental training of the DA CSLM. This adaptation can be performed in a couple of minutes. The effectiveness of this new adaptation technique was observed in our initial lab test for an English/German SMT system (see deliverable D1.2). This field test confirmed our findings in a realistic environment.
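Incremental training of this kind amounts to continuing gradient descent on the new project data from the domain-adapted weights, with a small learning rate and few epochs. The sketch below illustrates the principle with a tiny feed-forward language model in PyTorch; the architecture, sizes and data are illustrative assumptions, not the CSLM toolkit actually used in MateCat.

```python
# Minimal sketch (not the project's CSLM toolkit): incremental adaptation of a
# feed-forward neural LM on a small amount of project-specific data.
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size=1000, context=4, emb=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.ff = nn.Sequential(nn.Linear(context * emb, hidden), nn.Tanh(),
                                nn.Linear(hidden, vocab_size))

    def forward(self, ctx):                       # ctx: (batch, context) word ids
        e = self.emb(ctx).flatten(start_dim=1)    # concatenate context embeddings
        return self.ff(e)                         # unnormalised next-word scores

model = FeedForwardLM()
# In the real workflow the domain-adapted CSLM weights would be loaded here, e.g.
# model.load_state_dict(torch.load("da_cslm.pt"))   # hypothetical checkpoint file

# Tiny synthetic "project data"; in practice a few thousand context/target pairs.
ctx = torch.randint(0, 1000, (512, 4))
tgt = torch.randint(0, 1000, (512,))

# Incremental training: a few epochs with a small learning rate, so the model
# moves towards the project data without forgetting the domain-adapted weights.
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(3):
    opt.zero_grad()
    loss = loss_fn(model(ctx), tgt)
    loss.backward()
    opt.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```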


Table 9: BLEU and TER scores over 5 days for the English-French lab test, Legal domain. Evaluation of the DA system (baseline), the PA system, and their CSLM versions. The number of segments per day is given in parentheses.

   Day          Method           BLEU    TER
   Day1 (104)   DA               52.52   33.10
                DA+CSLM          53.74   32.58
   Day2 (122)   DA               48.09   35.83
                DA+CSLM          49.19   35.38
                PA+CSLM-adapt    50.26   34.77
   Day3 (116)   DA               60.75   25.06
                DA+CSLM          60.24   26.03
                PA+CSLM-adapt    58.41   28.91
   Day4 (116)   DA               49.48   35.57
                DA+CSLM          50.40   35.62
                PA+CSLM-adapt    51.40   35.62
   Day5 (106)   DA               52.71   34.01
                DA+CSLM          53.40   34.04
                PA+CSLM-adapt    55.89   32.10

6.4 Results

6.4.1 Results for the Lab tests

Before detailing the results of the field test itself, we present in Table 9 the BLEU and TER scores of our internal lab test, which aimed to simulate project adaptation over five days. For each day, about 100 segments were used, which corresponds to the size of the field test data (about three thousand words).

The BLEU and TER scores improve with project adaptation for all days with the exception of day 3, despite the fact that the performance of the DA system is already very good (a BLEU score of around 50 with one reference translation). The data of the third day seem to be very specific: the BLEU score of the DA system is almost 10 points higher than for the other days, and the same difference applies to the TER score. It is likely that there is a large overlap with the training data of the DA system; therefore, project adaptation is unlikely to improve the system.

6.4.2 Results for the field test

In this section we present the field-test results on the English-French data of the Legal domain over 5 days. Note that only the results of three of the four translators are presented. The translators were asked to respect some rules for the test, and the fourth post-editor did not comply with those rules; we therefore had to discard his work to avoid biased results.


Table 10: BLEU scores for 5 days on the English-French data of the Legal domain for the three translators, for the DA system (baseline), the PA system, and their CSLM versions. The score in parentheses is calculated using the references of all three translators; the score in brackets is calculated using a generic reference provided by the European Commission.

   Day  Method          Translator 1            Translator 2            Translator 3
   1    DA              49.72 (63.69) [24.89]   48.84 (63.69) [24.89]   30.23 (61.23) [24.89]
        DA+CSLM         53.98 (67.04) [25.65]   52.25 (67.04) [25.65]   31.02 (67.04) [25.65]
   2    DA              48.35 (62.13) [22.78]   44.07 (62.13) [22.78]   30.68 (62.13) [22.78]
        DA+CSLM         51.93 (65.64) [23.77]   46.61 (65.64) [23.77]   31.87 (65.64) [23.77]
        PA+CSLM-adapt   50.13 (65.21) [24.13]   54.61 (67.97) [24.78]   35.20 (65.12) [23.84]
   3    DA              53.75 (67.14) [24.16]   46.88 (67.14) [24.16]   44.16 (67.14) [24.16]
        DA+CSLM         58.83 (70.70) [25.15]   49.73 (70.70) [25.15]   46.17 (70.70) [25.15]
        PA+CSLM-adapt   63.48 (76.74) [26.38]   60.23 (75.90) [27.14]   59.70 (78.21) [27.53]
   4    DA              51.48 (64.74) [25.01]   43.22 (64.74) [25.01]   39.77 (64.74) [25.01]
        DA+CSLM         56.84 (68.61) [25.04]   45.68 (68.61) [25.04]   42.33 (68.61) [25.04]
        PA+CSLM-adapt   55.81 (71.59) [26.74]   57.19 (72.05) [26.55]   54.03 (73.69) [26.91]
   5    DA              53.30 (67.07) [24.78]   47.77 (67.07) [24.78]   42.18 (67.07) [24.78]
        DA+CSLM         57.09 (69.70) [25.57]   50.06 (69.70) [25.57]   44.10 (69.70) [25.57]
        PA+CSLM-adapt   56.15 (72.90) [28.56]   61.83 (75.21) [27.98]   56.97 (74.60) [27.72]

Tables 10 and 11 show the BLEU and TER scores, respectively, for each translator and for several SMT systems. The first column indicates the day of the field test. The second column identifies the three SMT systems, namely: the domain-adapted system, denoted DA (baseline); the domain-adapted system including a CSLM, denoted DA+CSLM; and the project-adapted system (all models updated, including the CSLM), denoted PA+CSLM-adapt. The third, fourth and fifth columns report the BLEU scores (Table 10) and TER scores (Table 11) for the three translators. The first score is calculated using the reference produced by the translator himself; the second score (in parentheses) is calculated using the references of all three translators; the third score (in brackets) is calculated against a generic reference provided by the European Commission. With these additional results, we aim to assess whether the systems tend to adapt strongly to the particular style of one translator, or whether they still perform well with respect to “independent” references. For day 1, only the DA and DA+CSLM systems are presented, as project adaptation can only start after the first working day.
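Scoring a system output against one or several references can be done, for example, with the sacrebleu package; the snippet below is an illustrative sketch with toy sentences (not the project's evaluation scripts), showing a hypothesis scored against the post-editor's own reference and against all three post-edits.

```python
# Illustrative sketch: corpus BLEU/TER against single and multiple references,
# using the sacrebleu package (toy sentences, not the project data).
from sacrebleu.metrics import BLEU, TER

hyps = ["the contract shall enter into force on the day of signature"]
own_ref = ["the contract shall enter into force on the date of signature"]
all_refs = [                      # one reference stream per post-editor
    ["the contract shall enter into force on the date of signature"],
    ["the contract enters into force on the day it is signed"],
    ["this contract shall take effect on the day of signature"],
]

bleu, ter = BLEU(), TER()
print("own reference:   ", bleu.corpus_score(hyps, [own_ref]))
print("three references:", bleu.corpus_score(hyps, all_refs))
print("TER (own ref):   ", ter.corpus_score(hyps, [own_ref]))
```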

First of all, we can notice that the CSLM significantly improves the BLEU scores of the PA systems for all translators. We can also notice that the BLEU score of translator 3 (calculated on his own reference) is much lower than the BLEU scores of the two other translators.


Table 11: TER scores for 5 days on the English-French data of the Legal domain for the three translators, for the DA system (baseline), the PA system, and their CSLM versions. The score in parentheses is calculated using the references of all three translators; the score in brackets is calculated using a generic reference provided by the European Commission.

   Day  Method          Translator 1            Translator 2            Translator 3
   1    DA              33.34 (28.10) [54.59]   32.99 (28.10) [54.59]   48.62 (28.10) [54.59]
        DA+CSLM         31.13 (25.73) [54.94]   31.43 (25.73) [54.94]   48.50 (25.73) [54.94]
   2    DA              35.33 (30.73) [56.63]   37.44 (30.73) [56.63]   49.03 (30.73) [56.63]
        DA+CSLM         33.06 (28.86) [56.30]   36.24 (28.86) [56.30]   49.12 (28.86) [56.30]
        PA+CSLM-adapt   34.31 (29.07) [56.18]   30.48 (27.21) [56.30]   47.29 (29.62) [56.53]
   3    DA              30.76 (26.68) [55.49]   35.09 (26.68) [55.49]   38.05 (26.68) [55.49]
        DA+CSLM         27.87 (24.70) [55.09]   33.86 (24.70) [55.09]   36.72 (24.70) [55.09]
        PA+CSLM-adapt   25.24 (20.04) [54.13]   27.48 (20.40) [54.16]   27.42 (20.99) [53.77]
   4    DA              33.01 (29.07) [55.90]   38.31 (29.07) [55.90]   41.96 (29.07) [55.90]
        DA+CSLM         29.79 (27.12) [56.78]   37.92 (27.12) [56.78]   41.03 (27.12) [56.78]
        PA+CSLM-adapt   30.47 (25.87) [55.21]   30.15 (25.53) [56.12]   32.70 (24.03) [55.86]
   5    DA              31.34 (26.31) [54.78]   34.38 (26.31) [54.78]   39.41 (26.31) [54.78]
        DA+CSLM         29.52 (24.88) [52.59]   33.94 (24.88) [54.74]   38.85 (24.88) [54.74]
        PA+CSLM-adapt   31.52 (24.43) [53.08]   26.19 (22.34) [53.16]   30.46 (23.71) [54.31]

However, after the third day, he reaches the same level. This could indicate that the adaptation process has learned his particular style. Overall, it can be clearly seen that the adaptation scheme is very effective; in most cases, the difference between the baseline system (DA+CSLM) and the fully adapted system (PA+CSLM-adapt) is more than 10 BLEU points. The only exceptions are observed for the first translator on days 2, 4 and 5; however, the project-adapted system is always better or identical when multiple independent references are used. It is interesting to note that our adaptation procedure always improves the translation quality with respect to the independent reference translation of the European Commission.

A quite similar tendency can be observed when analyzing TER (see Table 11). First, translator 3 has a much higher TER than the two other translators during the first two days, but the system seems to learn his style and the TER reaches a comparable level on day 3. Second, project adaptation always lowers the TER with respect to the individual reference, with the same exceptions as in Table 10. Finally, the TER with respect to the generic reference provided by the European Commission is lower in nine out of twelve cases.


Table 12: Overall time-to-edit for the field test on the Legal domain. Measurements are taken on post-edits performed with the domain-adapted static MT system (DA+CSLM) and the project-adapted MT system (PA+CSLM-adapt).

   Task    User ID   Time to edit (words/hour)
                     DA+CSLM   PA+CSLM-adapt   Δ
   En-Fr   T1        928       1283            38.3%
           T2        1533      1816            18.5%
           T3        308       704             128.5%

Table 13: Time-to-edit according to time ranges. Measurements are taken on post-edits performed with the domain-adapted static MT system (DA+CSLM) and the project-adapted MT system (PA+CSLM-adapt).

   User ID   TTE (ms)   Time to edit (words/hour)
                        DA+CSLM   PA+CSLM-adapt   Δ
   T1        >30k       732       1023            39.8%
             >60k       634       1023            61.5%
             >300k      21        335             1508.7%
             >600k      53        -               -
   T2        >30k       1195      1523            27.4%
             >60k       1081      1432            32.6%
             >300k      206       -               -
             >600k      -         -               -
   T3        >30k       254       586             130.7%
             >60k       223       539             141.5%
             >300k      103       264             157.0%
             >600k      38        132             250.6%
             >900k      21        112             439.3%

6.4.3 Impact on user productivity

In the following sections, we provide the results on user productivity, which is the main goal of our test.

Table 12 reports, for each translator, the time-to-edit for the two conditions of our field test, along with the percentage of relative improvement. We observe a very high productivity gain for all translators between the two parts of our test, from 18.5% to 38.3%. We will see in the following that the gain for T3 (128.5%) could be biased by the working speed of the translator rather than by the improvement in translation quality alone. In line with our previous results, although our baseline system was already well domain-adapted, our project adaptation procedure incrementally reduces the HTER and thus the post-editing effort; post-editor productivity improves as a consequence.


In order to strengthen this observation, Table 13 summarizes the time-to-edit rate (in words per hour) according to a time threshold (30, 60, 300 and 600 seconds). In this way, we wanted to observe whether the most time-consuming segments (logically the longest ones) become more reasonable to post-edit once our adaptation procedure is applied. For the first translator (T1), we observe that, after adaptation, no segments remain that require more than 600k ms (i.e. more than 10 minutes) to be post-edited. As regards the second translator, T2, he never spent more than 10 minutes post-editing any segment of the test; after adaptation, his time-to-edit never exceeds 5 minutes, which is an interesting point. For the third translator, a striking point is his “low” productivity in comparison to the two other translators: he is about three times slower than T1, and about four to five times slower than T2. Even though it was confirmed that all translators are experienced with the post-editing process, we assume that either T3 had some difficulties with the legal domain, or he simply took his time to perform the test, or both. The fact that he took his time could explain why we observed editing times of over 900k ms (i.e. 15 minutes) for some segments, and this could partially explain the productivity improvements. Whatever the reason for his speed, productivity improvements can nonetheless be observed for this translator: our adaptation scheme roughly doubled his productivity, and the gain is most visible for the long segments.
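The figures in Tables 12 and 13 follow directly from the logged editing times. The sketch below shows, under assumed log fields, how the words-per-hour rate and the relative improvement can be computed, both overall and restricted to segments above a given time-to-edit threshold.

```python
# Illustrative sketch: words/hour and relative improvement from post-editing logs.
# The log format (word count and editing time per segment) is an assumption.
def words_per_hour(segments, min_tte_ms=0):
    """Throughput over the segments whose time-to-edit exceeds min_tte_ms."""
    kept = [s for s in segments if s["tte_ms"] > min_tte_ms]
    words = sum(s["words"] for s in kept)
    hours = sum(s["tte_ms"] for s in kept) / 3_600_000
    return words / hours if hours else 0.0

def relative_gain(baseline, adapted):
    return 100.0 * (adapted - baseline) / baseline

# Example: overall rates and rates restricted to segments slower than 30 seconds.
da_logs = [{"words": 20, "tte_ms": 45_000}, {"words": 8, "tte_ms": 12_000}]
pa_logs = [{"words": 20, "tte_ms": 31_000}, {"words": 8, "tte_ms": 10_000}]
for thr in (0, 30_000):
    da, pa = words_per_hour(da_logs, thr), words_per_hour(pa_logs, thr)
    print(f">{thr} ms: DA {da:.0f} w/h, PA {pa:.0f} w/h, gain {relative_gain(da, pa):.1f}%")
```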

To summarize the results on user productivity, we observed good gains using project adaptation, which confirms our findings from the lab test.

6.5 Discussion

This field test had two objectives: to validate that project adaptation is effective for large projects covering several days, and to confirm that neural network language models can be integrated into the MateCat workflow. Both objectives were achieved. The adaptation technique of the CSLM is in fact very efficient, both with respect to translation quality and to the time needed to perform the adaptation.

7 Conclusion

The third field and lab tests were run as planned (during Summer and Fall 2014) using the third version of the MateCat Tool, the MT engines and the other core components developed by the consortium. Building on the experience gained in the two previous field tests, the overall preparation and execution of this final round were smooth and efficient. This resulted in a timely conclusion of the evaluation, which gave the involved partners enough time to analyse the results and draw informative and reusable conclusions.


According to the work plan, the goals for the third year were to achieve a 15% improvement in MT quality and productivity and 60% user acceptance of informative MT by integrating self-tuning, user-adaptive and informative MT. The progress towards achieving these objectives has been measured on two language pairs (English-Italian and English-French) and three distinct domains (legal, information technology and TED talks).16

16 The TED talks domain was added to address a recommendation by one of the project reviewers, advocating the inclusion of a more challenging domain characterized by higher language diversity and lower repetitiveness.

The comparison between the final full-fledged system and the baseline “pre-MateCat” system was very positive. Considering the final (and ambitious) success criterion, that is, a 15% improvement in productivity over the baseline, we came very close to the target in terms of post-editing effort (a 13.87% gain) and exceeded expectations in terms of translation speed, with a 21.52% gain. Since translation speed is the main indicator of productivity, this latter result gives a very precise idea of the success of the project in pushing the state of the art in computer-assisted translation.

Additional, more focused field tests were also carried out to measure the effectiveness and robustness of self-tuning and informative MT. On the self-tuning MT front, the twofold goal was to validate the effectiveness of project adaptation for large projects covering several days, and to confirm that neural network language models can be integrated into the MateCat workflow. Both objectives were achieved, with improvements both in terms of BLEU/TER scores (even more than 10 points) and in terms of user productivity (with percentages varying with translators’ speed, and ranging between 18.5% for the fastest user and 128.5% for the slowest one).

Regarding Informative MT, we separately evaluated the progress made on the terminology help and sentence/word level confidence subtasks. In terms of user acceptance, the results on bilingual terminology extraction indicate the suitability of our bilingual term lists to support the work of terminologists. The analysis of the users’ feedback collected by means of a questionnaire yielded correctness, relevance and usefulness values that largely exceed the 60% user acceptance success criterion.

With respect to sentence/word level confidence, our latest binary QE components were evaluated with translators working with an adapted version of the MateCat tool. Also in this case, the analysis of their feedback revealed a good level of satisfaction, especially in terms of productivity increase (75% positive judgements). This is also reflected by the quantitative analysis of post-editing time variations, which indicates that QE information brings significant post-editing time reductions when the quality of the MT suggestions is in the medium-high range.

Lab tests were run to manually evaluate the quality of the MT suggestions received by the translators as well as their post-edits based on those suggestions. The goal was to assess the quality of the suggestions proposed by the full-fledged engine, and to verify whether the differences with respect to the “pre-MateCat” system also have an impact on the quality of the final results. The overall results are in line with those obtained in the field tests: the MT quality of the full-fledged engine is higher (with a statistically significant 7% improvement) and its output leads both to larger improvements in post-editors’ productivity and, surprisingly, to higher quality translations (with a statistically significant 5% improvement).


References

Mihael Arcan, Claudio Giuliano, Marco Turchi, and Paul Buitelaar. Identification of Bilingual Terms from Monolingual Documents for Statistical Machine Translation. In Proceedings of the 4th International Workshop on Computational Terminology (Computerm), pages 22–31, Dublin, Ireland, August 2014a. Association for Computational Linguistics and Dublin City University. URL http://www.aclweb.org/anthology/W14-4803.

Mihael Arcan, Marco Turchi, Sara Tonelli, and Paul Buitelaar. Enhancing Statistical Machine Translation with Bilingual Terminology in a CAT Environment. In Proceedings of the 11th Biennial Conference of the Association for Machine Translation in the Americas (AMTA 2014), Vancouver, BC, Canada, 2014b.

Amittai Axelrod, Xiaodong He, and Jianfeng Gao. Domain Adaptation via Pseudo In-Domain Data Selection. In Conference on Empirical Methods in Natural Language Processing, pages 355–362, Edinburgh, United Kingdom, 2011. ISBN 978-1-937284-11-4.

R. Harald Baayen, Douglas J. Davidson, and Douglas M. Bates. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4):390–412, 2008.

Nicola Bertoldi. Dynamic models in Moses for online adaptation. The Prague Bulletin of Mathematical Linguistics, 101:7–28, 2014.

Nicola Bertoldi, Patrick Simianer, Mauro Cettolo, Katharina Wäschle, Marcello Federico, and Stefan Riezler. Online adaptation to post-edits for phrase-based statistical machine translation. Machine Translation, Special Issue on Post-editing, 2014. To appear.

Arianna Bisazza, Nick Ruiz, and Marcello Federico. Fill-up versus Interpolation Methods for Phrase-based SMT Adaptation. In International Workshop on Spoken Language Translation (IWSLT), pages 136–143, San Francisco, CA, 2011.

Mauro Cettolo, Nicola Bertoldi, and Marcello Federico. Methods for smoothing the optimizer instability in SMT. In MT Summit XIII: the Thirteenth Machine Translation Summit, pages 32–39, Xiamen, China, 2011.

Mauro Cettolo, Nicola Bertoldi, Marcello Federico, Holger Schwenk, Loïc Barrault, and Christophe Servan. Translation project adaptation for MT-enhanced computer assisted translation. Machine Translation, 28(2):127–150, 2014.


Herbert H. Clark. The language-as-fixed-effect fallacy: A critique of language statistics in psychological research. Journal of Verbal Learning and Verbal Behavior, 12(4):335–359, 1973.

Mohammad Amin Farajian, Nicola Bertoldi, and Marcello Federico. Online word alignment for online adaptive machine translation. In EACL 2014 Workshop on Humans and Computer-assisted Translation (HaCaT), pages 84–92, Gothenburg, Sweden, April 2014.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. IRSTLM: an Open Source Toolkit for Handling Large Scale Language Models. In Proceedings of Interspeech, pages 1618–1621, Brisbane, Australia, 2008.

Marcello Federico, Nicola Bertoldi, Mauro Cettolo, Matteo Negri, Marco Turchi, Marco Trombetti, Alessandro Cattelan, Antonio Farina, Domenico Lupinetti, Andrea Martines, Alberto Massidda, Holger Schwenk, Loïc Barrault, Frédéric Blain, Philipp Koehn, Christian Buck, and Ulrich Germann. The MateCat Tool. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pages 129–132, Dublin, Ireland, August 2014. Dublin City University and Association for Computational Linguistics. URL http://www.aclweb.org/anthology/C14-2028.

Qin Gao and Stephan Vogel. Parallel implementations of word alignment tool. In Software Engineering, Testing, and Quality Assurance for Natural Language Processing, SETQA-NLP ’08, pages 49–57, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics. ISBN 978-1-932432-10-7. URL http://dl.acm.org/citation.cfm?id=1622110.1622119.

Christian Girardi, Luisa Bentivogli, Mohammad Amin Farajian, and Marcello Federico. MT-EQuAl: a toolkit for human assessment of machine translation output. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pages 120–123, Dublin, Ireland, August 2014. Dublin City University and Association for Computational Linguistics. URL http://www.aclweb.org/anthology/C14-2026.

Marcin Junczys-Dowmunt. Phrasal rank-encoding: Exploiting phrase redundancy and translational relations for phrase table compression. The Prague Bulletin of Mathematical Linguistics, 98:63–74, 2012. URL http://ufal.mff.cuni.cz/pbml/98/art-junczys-dowmunt.pdf.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, 2007. URL http://aclweb.org/anthology-new/P/P07/P07-2045.pdf.

Wang Ling, João Graça, Isabel Trancoso, and Alan W. Black. Entropy-based pruning for phrase-based machine translation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 962–971, Jeju Island, Korea, July 2012. URL http://www.aclweb.org/anthology/D12-1088.

Yashar Mehdad, Matteo Negri, and Marcello Federico. Match without a Referee: Evaluating MT Adequacy without Reference Translations. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 171–180, Montreal, Canada, 2012.

Robert C. Moore and William Lewis. Intelligent selection of language model training data. In ACL (Short Papers), pages 220–224, 2010.

Franz Josef Och. Minimum Error Rate Training in Statistical Machine Translation. In Erhard Hinrichs and Dan Roth, editors, Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, 2003. URL http://www.aclweb.org/anthology/P03-1021.pdf.

Anthony Rousseau. XenC: An open-source tool for data selection in natural language processing. The Prague Bulletin of Mathematical Linguistics, 100:73–82, 2013.

Lucia Specia, Nicola Cancedda, Marc Dymetman, Marco Turchi, and Nello Cristianini. Estimating the Sentence-Level Quality of Machine Translation Systems. In Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT’09), pages 28–35, Barcelona, Spain, 2009.

Marco Turchi, Matteo Negri, and Marcello Federico. Coping with the subjectivity of human judgements in MT quality estimation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 240–251, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/W13-2231.

Marco Turchi, Antonios Anastasopoulos, José G. C. de Souza, and Matteo Negri. Adaptive Quality Estimation for Machine Translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL ’14). Association for Computational Linguistics, 2014a.


Marco Turchi, Matteo Negri, and Marcello Federico. Data-driven Annotation of Binary MT Quality Estimation Corpora Based on Human Post-edition. Machine Translation, Special Issue on Post-editing, 2014b. To appear.
