Training Guide PE Certification
Uploaded by: eric-dufour
8/17/2019 Training Guide PE Certification
Training_Guide_PE_Certification
Revision Date: 30/10/2013
SDL Certification
Post-editing Certification
Table of contents
1 Introduction
1.1 About this training workbook
2 A brief history of post-editing and MT
2.1 What is MT?
2.2 MT development in the last century
2.3 A short history of MT at SDL
3 Post-editing versus Translation
3.1 Global developments and the localisation industry
3.2 Why post-edit?
3.3 Why translate?
4 MT Technologies
4.1 The challenges of MT
4.2 Rules-based Machine Translation (RBMT)
4.3 Statistical Machine Translation (SMT)
4.4 Hybrid Systems
5 How the MT output is created
5.1 Baselines
5.2 Verticals
5.3 Customisations
5.4 Engine training process
6 From the MT output onwards: the basics of post-editing
6.1 Introduction to post-editing
6.2 Degrees of post-editing
6.3 The quality check process
7 How to get the most out of MT
7.1 What makes an effective post-editor?
7.2 Post-editing quality expectations
7.3 Under-editing
7.4 Over-editing
7.5 Help improve MT for the future
8 Expected Statistical MT behavior
8.1 Common patterns to watch for when post-editing
8.2 How to provide feedback to improve the MT output
9 Using BeGlobal baselines in SDL Trados Studio
9.1 BeGlobal baselines
9.2 How to add SDL BeGlobal Community as a translation provider in SDL Trados Studio
10 Summary
10.1 Conclusion to training workbook
11 Further references
11.1 More information on MT and post-editing
12 Appendix
12.1 Post-editing examples
1 Introduction
1.1 About this training workbook
The scope of this training workbook is to introduce the reader to the techniques and
skills involved in post-editing machine translation (MT) output. It provides practical
examples of best-practice post-editing and recurrent issues such as over-editing and
under-editing. Moreover, it aims to familiarise translators with MT technology in order to
enable their involvement in the entire process from training engines to post-editing
content to publishable quality.
The document covers the following areas:
• The history and development of MT
• The various MT technologies currently used and the effects they have on the quality and post-editability of the MT output
• The post-editing and quality check processes and their relation to conventional human translation
• A guide to effectively post-editing MT output to understandable and publishable quality
• Common patterns to watch for when post-editing MT output
• Using BeGlobal baselines in Studio
• Where to find further information on MT and post-editing processes
In addition, the document aims to address some of the common misconceptions about
MT:
• MT is taking away my job
• MT output is always low quality
• MT material is only useful when it can be easily edited
• MT does not leave any room for creativity
• MT does not fit with my translation style
• MT technology is too complicated
• Post-editing is less skilled than translation
2 A brief history of post-editing and MT
2.1 What is MT?
Machine Translation (MT) is automated translation that uses software to translate text
from one natural language to another. It is one of the oldest applications of Artificial
Intelligence and both facilitates and accelerates the creation of high quality translations.
Post-editing MT output can increase productivity in comparison with conventional
translation. It allows companies to deliver a high quality translation at greater speed,
and consequently at lower cost, and as such can be considered a new industry “trend”.
However, it is important to remember that MT does not replace human translators. MT
is a tool rather than an end solution and a stage of human correction will always be
necessary when post-editing to a publishable quality. Nonetheless, it is an effective tool
when understood and used correctly.
Uses of Machine Translation
Fully Automated Useful Translation (FAUT)
• MT is generated by baseline engines or customised engines and the output is used directly, with no human intervention.
• This solution is used mostly for content such as emails, support content or instant messages, where the user wants to have an idea of the content, without the need for high quality.
Post-editing
• MT output from customised MT engines is post-edited by linguists to a quality level equivalent to conventional translation.
• Post-editing MT content is the preferred solution for publishable documents. It is used as part of a high quality translation process.
2.2 MT development in the last century
Following on from the efforts of war-time cryptography, MT is generally considered to
have started in the 1950s. In 1954, the successful execution of the Georgetown
Experiment - the fully automated translation of approximately sixty Russian sentences
into English - ushered in an era of significant funding for MT research in the USA.
Researchers believed they could produce a fully automated MT system within three to
five years. This endeavour proved more difficult than expected, however, and ten years
later funding was cut when it became clear that the development of MT had not
progressed as far as originally hoped.
Early attempts at MT typically failed because of a lack of coverage. The models
functioned by encoding a limited selection of transformational rules that simply did not
provide for the diversity of natural language translation. Consequently, the first attempts
in the 1970s and 1980s to commercialise MT operated by drastically increasing the
number of encoded transformational rules. This produced Rules-Based Machine
Translation (RBMT), which functioned relatively successfully with targeted human
feedback over a particular domain. However, this led to the further problem of how to
make the abundance of transformational rules needed to encode language pairs co-
operate with one another. The answer was a statistical approach to MT.
In the late 1980s, computational power increased and became less expensive and as a
result interest picked up in Statistical Machine Translation (SMT). From the 1990s,
statistical learning approaches came to the fore, led by cutting-edge work from the IBM
research team. SMT systems no longer required the same human effort to encode
transformational rules and update lexicons and terminology lists, but rather exploited
the wealth of existing translations, covering numerous language pairs, to extract rules
based on statistical probability.
Since the 1990s, SMT has been pushed forward through intensive research and
training as well as support from Industry, Defence Advanced Research Projects Agency
(DARPA) and EC FP7. Statistical MT has been deployed in real-world, commercial
contexts by Language Weaver (now part of SDL), Google, Microsoft and IBM, and
there is on-going research into hybrid phrase-based and syntax-based MT. In 2011,
SMT was boosted with Google's announcement that it would charge for access to the
Google Translate API. Shortly afterwards, Microsoft also announced that it would start
charging for use of the Microsoft Translator API. These two events can be viewed as key
milestones for the Machine Translation Industry and the Localisation Industry as a
whole. The progression to a paid API model for machine translation is a clear sign that
both the use and the quality of MT have matured to a level where enterprises and
developers see sufficient value in MT to invest in it.
After many decades, it appears that the models used in MT are more in line with our
understanding of how human language cognition and processing operates. This does
not mean that MT output is of an equal standard to output produced by the human
brain. However, we now understand more about what MT can contribute to the
Localisation Industry and have an invaluable tool for translation that is becoming ever
more prominent in the field.
MT accuracy is improving every year and many new techniques are being developed
and deployed as the field becomes more and more interdisciplinary, drawing from
computer science, linguistics, probability theory, algorithm design, automata theory and
engineering.
Some facts about MT today
• Of the top 50 global companies, 53% publicly acknowledge that they use an MT solution
• 54% of non-Anglophones use MT when visiting English-language websites
• 75% of people use free MT tools
• It is estimated that at least three-quarters of web users take advantage of free translation tools due to the greater accessibility and integration of MT solutions.
2.3 A short history of MT at SDL
SDL first adopted MT into Translation Services in the year 2000 after acquiring a Rules-
Based Machine Translation (RBMT) engine from Transparent Language, which
became SDL Enterprise Translation Server (ETS). In 2004, the Knowledge-based
Translation System (KbTS) Group was set up to use ETS in a high quality translation
process.
In 2009, Statistical Machine Translation (SMT) was beginning to establish itself firmly in
the localisation industry following rapid development. SDL forged a strategic
partnership with leading SMT developer, Language Weaver, allowing SDL to extend
the languages supported by MT.
In 2010, SDL acquired Language Weaver and continues to invest heavily in the
development of SMT technology. SDL rolled out this capability to their Production
Offices which resulted in a huge increase in scalability and allowed the process to grow
rapidly. KbTS was re-branded in 2011 to iMT (intelligent Machine Translation) and the
first post-editing projects were rolled out using SDL Language Weaver SMT.
Today the SDL iMT department consists of an in-house team of language specialists,
MT scientists and project managers, supplemented by trained teams in the Production
Offices plus a large fully-trained freelance post-editing team. The iMT team are
responsible for the maintenance of MT engines and for all MT evaluations and
customisations within SDL Global Solutions. The Project Management team manages
the set-up of projects and plans and schedules the customisations. The linguists are
responsible for evaluating the project data for MT suitability based on the content to be
translated. Once the project is approved for MT, the linguists prepare the data, test the
results and organise training for the linguistic team in the Production Office as well as
the freelancers who will work on the project. This approach of preparation, testing and
training helps guarantee a high quality MT engine and therefore a high quality final
translation.
And as for the future, developments within MT are made through improved models and
algorithms as well as by adding more high quality training data. SDL is constantly
working on improvements to the machine translation technology so that even better MT
engines can be created going forwards. The future for MT at SDL is full of possibilities
and iMT will be on-hand to offer its many years of expertise as the range of MT
solutions increases.
Brief Timeline of MT at SDL
2000  Acquisition of Rules-Based Machine Translation (RBMT) engine from Transparent Language: SDL Enterprise Translation Server (ETS)
2004  Knowledge-based Translation System (KbTS) Group set up to use ETS in a high quality translation process
2009  Partnership with Language Weaver (LW)
2009  Training from LW on how to customise SMT engines
2010  Rollout of post-editing process to Production Offices
2010  Due diligence and acquisition of Language Weaver
2011  Re-branding of KbTS to iMT
2011  First iMT projects using SMT
2013  Continued development of SMT within SDL
3 Post-editing versus Translation
3.1 Global developments and the localisation industry
An increasing number of companies are entering the international market and are
publishing localised materials in a bid to reach more customers and realise greater
sales opportunities. This is based on the finding that 85% of consumers feel that having
pre-purchase information in their own language is a critical factor in buying services.
IBM estimates that 2.5 quintillion (10^18) bytes of data are created every day and that
90% of corporate data originated in just the last three years. On average, companies
translate this content into 11 languages. At the same time, strong competition and the
need for faster turnaround times mean that there is an immediate need to lower costs
and achieve savings through efficient and streamlined technology processes.
Key trends impacting on global businesses
• Business globalisation
• Internet use of multiple devices
• Explosive growth in digital content
• Effective targeting and revenue capture
• Growth of translated content
• Multimedia and video
• Extreme brand management across all channels
• Social media and community
Many of the recent trends affecting global business and information management will
have important consequences for the field of translation in the coming years. By the
end of 2014 there will be 2 billion users of computers and most of the growth forecast is
in the upcoming markets. This means that there will be more customers for software
and appliances and consequently a larger need for translations of user interfaces and
manuals.
In addition, by 2014, there will also be 2.5 billion users of the internet, which is 36% of
the world's population, compared with 22% in 2010. Information equivalent to 10 billion
DVDs will be sent over the internet each month. Not everyone will be able to access the
information in the language of origin and consequently there will be a larger demand for
translations in order to make information as widely accessible as possible.
Furthermore, Cloud Computing has also begun to make an impact in the technology
industry. The use of the cloud is growing, and more and more users will need
translations of the materials and content. The user interfaces will also require
translation as the number of end-users with different language requirements grows.
Thus, the demand for translation of both content and the interface itself is steadily
increasing.
Finally, social networking tools are rapidly increasing in popularity. The content lacks
specific structure and often involves interaction between users in various languages.
Companies are increasingly adopting social networking and professional use will
ultimately mean that more translations are needed and in a shorter time – in fact, often
in real-time, as and when content is created. Again, this will result in a greater need for
translation.
In all of the above, the importance of English as a global lingua franca is slowly
decreasing. Between 2000 and 2010, the two languages with the greatest growth on
the internet were Arabic and Mandarin Chinese – both of which grew twentyfold. In
contrast, content in English 'only' tripled. Proportionally, then, English is declining in
importance relatively quickly. It is estimated that by 2020 English will have lost its status
as a lingua franca altogether. However, rather than being replaced with another natural
language, linguistic diversity will be the new status quo and translation will be key to
communication. In summary, then, there will be an increasing demand for more content
at greater speed and in an increasing number of languages.
So the question is, how can MT and post-editing help respond to these trends?
3.2 Why post-edit?
In the last few years, there have been significant developments in MT technology. SDL
has always been up to date with this development, and uses MT mainly to increase
efficiency whilst still delivering quality. This is achieved through integration of the MT
engines with SDL's translation environments – SDL Trados Studio, TMS and
WorldServer – which results in a streamlined process, leading to faster turnarounds
and higher cost-effectiveness.
A growing number of SDL's customers and freelance translators now rely on MT for a
high-quality, integrated translation process. Customised machine translation engines
deliver output of such good quality that post-editing is faster than translating from
scratch. Indeed, MT solutions can reduce production times by as much as 50% in some
cases. As such, many clients consider MT the only viable way to process the enormous
volume of content they need to localise. Moreover, in certain cases, it allows the client
to consider translating content that they would not otherwise have tackled as the cost
would have been prohibitive.
However, post-editing is not only of value to the client but also has many advantages
for the translator. SDL's intelligent Machine Translation will help freelance translators to
remain competitive and save time. We combine our SMT technology with project-
specific Translation Memories to produce translations of post-editable quality that can
help to increase productivity. Post-editing is not inferior to conventional translation but
requires all the usual translation skills – such as domain knowledge, excellent
command of the source and target language, proficiency with CAT tools – plus a
willingness to embrace new technological advances.
The demand for MT solutions is growing quickly and post-editing is rapidly becoming a
basic skill for translators. Learning how to post-edit will give linguists a foothold in an
evolving market and open up new freelance possibilities. We have seen a real swing in
attitudes in the last few years with many clients looking to MT as the default option to
help deliver translation faster and cheaper – without sacrificing quality.
In summary, the following client and translator benefits apply:
Client benefits
• Lower cost
• Faster time to market
• Publishable quality
• Higher volumes for translation
• Ability to handle digital content explosion
Translator benefits
• A valuable new skill that opens more opportunities
• Competitive edge in an evolving market
• Greater speed and efficiency
• Higher volumes compensate for lower post-editing rates
3.3 Why translate?
Whilst post-editing can provide a number of benefits for clients and translators alike, not
all projects will be suitable for post-editing. Because MT typically reproduces the
material used to train the engine, previously unseen material can present difficulties.
This is particularly common in text types with highly complex sentence structures or
very specific terminology and texts with a high amount of ambiguity which require
translations to move away from the source.
At SDL, all content is evaluated carefully before a project or part of a project is
considered for MT. Machine Translation technology is improving all the time and
content types that were not suitable two years ago are now handled very productively
using Machine Translation. In some cases, however, conventional translation will still
be the recommended solution for the foreseeable future.
4 MT Technologies
4.1 The challenges of MT
MT shares many of the challenges of human language translation. These include the
ambiguity and polysemy of natural human language as well as the high levels of
linguistic diversity between languages. In particular, where there is variation in the
morphological or syntactic characteristics of a language, it becomes much harder for MT
to match the source and target phrases. Given that no linguistic information is encoded
into the statistical model this often presents problems.
Some of the main issues and active research problems for MT (as well as conventional
translation) are summarised below:
The challenges of MT
• Domain and genre: vocabulary, style (including active vs. passive) and sentence length will vary accordingly.
• Ambiguity: human language is ambiguous on both lexical and syntactic levels
  • E.g. "bank" can be the financial institution or the edge of a river
  • E.g. "I saw the man with the telescope" - Is it the man or the speaker who is holding the telescope?
• Variation in morphology and word order
  • E.g. case and definiteness endings in Hungarian and Swedish
  • E.g. Verb - Subject - Object order in Arabic and Hebrew
• No one-to-one translation: a word that covers many social, cultural and linguistic meanings in one language may require finer distinctions in another language and vice versa
  • E.g. politeness levels in Japanese
  • E.g. German "Tasse" = English "mug" or English "cup"
• Idioms: difficult to translate like any other form of formulaic language
  • E.g. French "Avoir les dents longues" = English "To be ambitious" (Lit: "To have long teeth")
• Language specific characteristics
  • E.g. Arabic tokenisation, Chinese word segmentation, etc.
4.2 Rules-based Machine Translation (RBMT)
Chronologically speaking, Rules-Based Machine Translation (RBMT) was the first
approach to automated translation. It involves parsing a source sentence, analysing the
structure, converting this to a machine-readable code and then transforming it into the
target.
The core system is based on a set of grammatical rules for each of the languages,
combined with a dictionary. The dictionary contains source words and phrases, their
translations and detailed grammatical information, such as part of speech and
inflection. It provides the modules with the linguistic knowledge they need.
The rules are the “linguistic processor” of the system, responsible for analysis and
generation. They use linguistic information stored in the dictionary. These rules are
intended to represent the grammatical knowledge of speakers and specify inherent
agreement and relational information.
At the translation stage, the MT engine analyses each source sentence and tags the
words and phrases with their part of speech to identify the grammatical components, for
example, the subject, object and verb. The MT system then looks up the translations of
these grammatically tagged words and phrases in the machine dictionary and
combines them using the coded language rules for the target language. This builds the
translated sentence.
A large core dictionary provides the translations for everyday words and phrases. For
translations that use special terminology, an RBMT system can use custom dictionaries
in conjunction with the baseline to improve translation accuracy.
Example
• Determiner and noun need to agree in number and gender
• Subject and finite verb need to agree in number
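The analysis, dictionary look-up and rule-based generation stages described above can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the four-word English to French dictionary, the two rules (gender agreement and adjective placement) and the function names are hypothetical and are not taken from SDL ETS or any other real RBMT engine.

```python
# Toy rules-based translation, English -> French (illustrative only).
# The dictionary entries and the two rules are hypothetical examples.

# Machine dictionary: source word -> part of speech and grammatical information.
DICTIONARY = {
    "the": {"pos": "DET", "forms": {"m": "le", "f": "la"}},
    "green": {"pos": "ADJ", "forms": {"m": "vert", "f": "verte"}},
    "filter": {"pos": "NOUN", "gender": "m", "form": "filtre"},
    "pump": {"pos": "NOUN", "gender": "f", "form": "pompe"},
}


def analyse(sentence):
    """Tag each source word with its dictionary entry (part of speech etc.)."""
    tagged = []
    for word in sentence.lower().split():
        entry = DICTIONARY.get(word)
        if entry is None:
            # Words missing from the dictionary are passed through untranslated.
            entry = {"pos": "UNK", "form": word}
        tagged.append(entry)
    return tagged


def generate(tagged):
    """Apply the target-language rules: agreement, then adjective placement."""
    # Rule 1: determiner and adjective agree in gender with the noun
    # (masculine is assumed when no noun is found in the dictionary).
    gender = next((e["gender"] for e in tagged if e["pos"] == "NOUN"), "m")
    words = []
    for entry in tagged:
        form = entry["forms"][gender] if "forms" in entry else entry["form"]
        words.append((entry["pos"], form))
    # Rule 2: an adjective directly before a noun moves after it.
    for i in range(len(words) - 1):
        if words[i][0] == "ADJ" and words[i + 1][0] == "NOUN":
            words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(form for _, form in words)


def translate(sentence):
    return generate(analyse(sentence))


print(translate("the green pump"))    # la pompe verte
print(translate("the green filter"))  # le filtre vert
```

The sketch also hints at why RBMT output is systematic: a word missing from the dictionary, or a rule that does not fit the sentence, produces the same error every time until the dictionary or rule is changed.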
How to recognise RBMT output
The RBMT output is based on three factors:
• Rules for the language pair
• General settings that can be customised (such as quotation marks, verb tense, accents, decimal point)
• The project dictionary, where the specific terminology is entered and which is key to improving the MT quality
Some common issues can be identified when post-editing rules-based machine
translation. Here we include some examples from English into French, Italian, Spanish,
Portuguese, Dutch, German, Swedish, and Finnish, which are the most common
languages for RBMT.
In order to recognise MT error patterns, post-editors should look out for the following
potential issues when post-editing.
Use of superfluous articles
Superfluous articles are commonly added in most languages; these can also occur
before proper nouns.
EN Source: Free High Speed Internet Access!
IT MT output: l' Accesso gratuito a internet ad alta velocità!
IT Post-edited: Accesso gratuito a internet ad alta velocità!
EN Source: Oil filter unit: Removal - Refitting
FR MT output: Bloc filtre à huile : La dépose - la Repose
FR Post-edited: Bloc filtre à huile : Dépose - Repose
Use of simple prepositions
When a term has not been entered in the Customised Dictionary, simple prepositions
are used and they should be corrected when needed.
EN Source: Reconnect ECT sensor electrical connector.
FR MT output: Reconnecter le connecteur électrique de capteur ECT
FR Post-edited: Reconnecter le connecteur électrique du capteur ECT
Acronyms automatically translated into terms
When a specific acronym has not been entered in the Customised Dictionary it is
automatically and consistently translated into a common term which exists in the Core
Dictionary.
EN Source: MR
IT MT output: Sig.
DE MT output: Herr
FR MT output: M.
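The behaviour shown in these examples follows from the order in which dictionaries are consulted. The sketch below is a minimal illustration under stated assumptions: the dictionary contents, the decision to keep "MR" untranslated in the project, and the function name look_up are all hypothetical and do not describe the actual look-up logic of any SDL product.

```python
# Hypothetical dictionaries for an English -> Italian project.
CORE_DICTIONARY = {"mr": "Sig."}   # general-language entry: "Mr" as a title
CUSTOM_DICTIONARY = {"mr": "MR"}   # project entry: assumed to stay untranslated


def look_up(term, custom=CUSTOM_DICTIONARY, core=CORE_DICTIONARY):
    """Return the custom translation if one exists, otherwise the core one."""
    key = term.lower()
    if key in custom:
        return custom[key]
    # Fall back to the core dictionary; unknown terms pass through unchanged.
    return core.get(key, term)


print(look_up("MR"))             # MR   (the custom entry wins)
print(look_up("MR", custom={}))  # Sig. (no custom entry, so the core term is used)
```

Until the acronym is added to the custom dictionary, the core translation is applied every time, which is why the error is consistent and easy to spot.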
Proper nouns translated literally
EN Source: Thanks to Peter Ferry for reporting the VBScript/Jscript Buffer Overrun Vulnerability.
IT MT Output: Grazie al Traghetto di peter per segnalare la Vulnerabilità legata al
sovraccarico del buffer di VBScript JScript.
IT Post-edited: Grazie a Peter Ferry per aver segnalato la vulnerabilità legata al
sovraccarico del buffer di VBScript JScript.
EN Source: He lives in Palm Springs.
FR MT output: Il habite à Printemps de Paume.
FR Post-edited: Il habite à Palm Springs.
Capitalisation issues
The MT follows the source capitalisation, unless specific terms have been entered in
the Customised Dictionary with the required capitalisation. This is a problem especially
in IT texts, e.g. for UI options.
EN Source: Click Add Custom Phone Tune.
FR MT output: Cliquez sur Ajoutez l'Air Personnalisé de Téléphone.
FR Post-edited: Cliquez sur Ajouter une mélodie de téléphone personnalisée.
EN Source: Select the appropriate option in the Automatic Synchronization section
PT-BR MT output: Selecione a opção apropriada na seção Sincronização Automática
PT-BR Post-edited: Selecione a opção apropriada na seção Sincronização automática
Disambiguation of homographs
You can encounter what we call “homograph resolution”. This means that the same
source term can be translated as a noun AND a verb (or an adjective, etc.), for example
NETWORK (a network, to network/networking).
When there is a homograph resolution issue, the entire syntax is misanalysed.
In the following examples the nouns are interpreted as verbs:
EN Source: Check box D6 on the blue label
DE MT output: Kasten D6 auf dem blauen Aufkleber prüfen
DE Post-edited: Kontrollkästchen D6 auf dem blauen Aufkleber
EN Source: The water reservoir does not contain enough water.
PT MT output: O reservatório de água não contém suficiente aguar .
PT Post-edited: O reservatório de água não contém água suficiente.
Compound formation and hyphenation issues
For some languages such as German and Finnish compounding rules may work. If they
do not work, the post-editor must amend the output accordingly and the term should be encoded.
RBMT – Pros and Cons
RBMT allows for excellent terminology control. There is no need for pre-existing TMs
as project dictionaries can be created from scratch and the output is systematic, rightly
or wrongly, meaning that experienced post-editors can post-edit quickly and reliably
with time. However, it can take a number of years to develop a new language pair and
the source must be well-written to generate good output. Moreover, project dictionaries
are time-consuming to create and therefore expensive to maintain and output is often
not very fluent and not sensitive to context, providing a single translation per term.
Pros
• A lot of control of rules and terminology
• Once the grammar is established, new projects can be created from scratch relatively quickly
• Once set up, projects are easy to maintain
• Consistent use of terminology
Cons
• The grammar is very time-consuming to develop
• Rather literal translations
• Not sensitive to context
4.3 Statistical Machine Translation (SMT)
A Statistical Machine Translation (SMT) system learns to translate by analysing large
volumes of previously translated content. The starting point for training an engine is an
aligned corpus of source and translated sentences of hundreds of millions of words.
The training process subdivides each of the source sentences into words and series of
words (n-grams) and analyses the associated translated sentences. In this way the
training process determines for each n-gram in the source the most likely set of
-
8/17/2019 Training Guide PE Certification
22/79
19
translations. By analysing just the translated content, the training process learns the
order in which the translated words are most likely to occur. The more training data and
the more consistency there is in this data, the more accurate the process becomes.
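The subdivision into n-grams described above can be illustrated with a minimal sketch. It
is not the training code of any real SMT engine; the function names and the two-sentence
corpus are assumptions made purely for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, max_n=3):
    """Count every n-gram (n = 1..max_n) across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenisation
        for n in range(1, max_n + 1):
            counts.update(ngrams(tokens, n))
    return counts


# Hypothetical miniature corpus; a real engine is trained on millions of words.
corpus = ["the water reservoir is empty", "fill the water reservoir"]
counts = count_ngrams(corpus)
print(counts[("water", "reservoir")])  # 2
```

The more often an n-gram is seen with a consistent translation, the more reliable the
statistics derived from it become, which is why volume and consistency of training data
matter.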
In the next stage of the process, the system compiles all of the learned data into the
runtime MT engine. The runtime MT engine subdivides each sentence into smaller
chunks and looks up the possible translations in the compiled database. For a given
source sentence this process results in many possible translated sentences. The MT
engine uses the statistical data on the probability of a translation and the word order to
determine the best candidate for the MT output.
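As a rough sketch of how the two kinds of statistical data might be combined, the example
below scores each candidate by adding the log-probability of the translation to the
log-probability of the word order and keeps the highest-scoring candidate. The candidate
sentences and all probability values are invented for illustration, and a real decoder
uses many more features than these two.

```python
import math


def score(candidate):
    """Combine translation and word-order probabilities in log space."""
    return math.log(candidate["translation_prob"]) + math.log(candidate["order_prob"])


def best_candidate(candidates):
    """Return the candidate with the highest combined score."""
    return max(candidates, key=score)


# Hypothetical candidates for one source sentence (probabilities are made up).
candidates = [
    {"text": "Kasten D6 prüfen", "translation_prob": 0.30, "order_prob": 0.20},
    {"text": "Kontrollkästchen D6", "translation_prob": 0.25, "order_prob": 0.60},
]
print(best_candidate(candidates)["text"])  # Kontrollkästchen D6
```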
For general purpose translations, the system uses a baseline language engine that is
trained with a large corpus of broad spectrum content – hundreds of millions of words.
To enhance performance for applications that use specific terminology, an SMT system
can be trained with a corpus that contains only or mostly content that is close to the
data that is to be translated. An ideal corpus for this is a large Translation Memory (TM)
that contains the previous translations of a project. The recommended volume of data
is 1 to 5 million words, although it is possible to work with fewer than 1 million.
This is known as customisation or training.
The quality of the MT output depends on both the linguistic and technical quality of the
material included in the training. However, compared to RBMT, SMT provides a more fluent
translation with some context-sensitivity and better reflects the style of the training
material.
SMT – Pros and Cons
Pros
• Customisation times are quicker than with RBMT
• Output reads more fluently and is stylistically better than the output from a rules-based system
• Able to select the correct translation in certain contexts, e.g. “device” in the IT domain
• Generally shorter setup times
Compared with RBMT, Statistical Machine Translation can offer a larger number of
languages for post-editing as engines are lower cost and faster to train, as well as
easier to maintain. Moreover, because SMT is trained with “real” sentences and
phrases the direct output can be more fluent than with RBMT, which is good for raw
output requirements and additionally helps the post-editor. In addition, there is a high
level of research activity surrounding SMT and performance improvement is predicted
for the future. For this reason, SMT is the technology of choice at SDL.
However, it should be noted that SMT requires large amounts of memory space and
processing capacity, though this becomes less of a problem with technological
developments. Moreover, the output is dependent on the quality and volume of data used
for the customisation, and therefore the post-editor must be aware of the range of
common trends in order to post-edit accurately. Similarly, it is harder to implement
changes in terminology made by the client than with RBMT, and a project-specific engine
can only be created if there is sufficient data as a starting point.
Syntax-based SMT – pros and cons
Syntax-based translation is based on the idea of translating syntactic units, rather than
single words or strings of words. A Syntax-based statistical engine can improve
grammatical accuracy and ensure that verbs are realised in the correct position.
Cons (SMT)
• Need for large bilingual corpora (millions of words)
• Difficult to maintain (retraining needs a high amount of content, which takes time to gather)
• Need for processing time: file processing times are higher, with an impact on hardware costs
Pros (syntax-based SMT)
• Better modelling of target language structure
• Ensures there is always a verb present
• Realises the verb in the correct position
• Better handling of function words, such as prepositions
• Has a more powerful decoding algorithm
The following table summarises the key differences between SMT and RBMT:
Attribute SMT RBMT
Does not need a large volume of aligned data for training/customisation +
Number of languages supported +
Setup time for new language +
Terminology control +
Software UI term handling +
Raw fluency +
Raw accuracy +
Level of research activity and performance improvement predicted +
Cons
• Early stages of development
• Sometimes less accurate terminology as no link to baseline
4.4 Hybrid Systems
One thing that is being explored in contemporary research into MT technology is the
possibility of creating a hybrid engine, where dictionaries, rules and statistical features
are combined so as to obtain the best of both worlds. This can be done in many
different ways; examples are the use of a dictionary to enforce certain translations in
SMT and the use of statistical techniques to determine the best translation for a
homograph such as “bank” or “get”, where the translation is different depending on the
context.
However, current solutions are fairly pragmatic and leave room for further development
in future. In some cases, hybrid systems do not back up to a baseline and this can
exacerbate common MT issues, such as terminology inconsistencies and/or content left
untranslated.
5 How the MT output is created
Statistical MT is now the technology of choice at SDL, so this course will now
concentrate on SMT technology.
SDL takes a three-pronged approach to SMT and uses the following different engine
types, matching the solution to the particular use case:
• Customised engines: content trained for specific client corpus
• Verticals: domain-specific engines
• Baselines: generic engines containing diverse data
5.1 Baselines
The core MT engines developed by SDL are known as baselines. These baseline
systems are built from bilingual corpora and serve as general databases for each
language pair. They are based on a large translated corpus of hundreds of millions of
words, taken from reliable sources available in the public domain, such as news, IT
documentation, technical manuals and publicly available government material, and
distributed across various domains, including IT, automotive, news, sports, electronics, etc.
Baselines are under constant development and new releases are launched frequently.
This solution produces good results for clients who require immediate access to MT,
who do not have sufficient volumes of data and/or wish to translate general content
across several domains.
Client-specific customisations and domain-specific verticals normally use baseline
engines as a backup; so if a certain word, phrase, or even grammatical structure is not
present in the training data, the engine may still be able to produce a translation.
Baselines – Pros and Cons
5.2 Verticals
A vertical is a trained statistical engine that exclusively contains data related to a
specific subject area, or domain, such as IT, Automotive, Electronics etc. When a client
does not have enough translated data to be used for a client-specific training, a vertical
solution can be used instead of a customisation on top of the baseline corpus.
These domain-specific engines therefore provide a point of entry for projects that have
small TMs. They also prove useful in those cases where there is not enough time to
create a project-specific engine before the first jobs start to flow in. Because the vertical
is a ready-to-use solution, it does not have the development effort involved in creating
client-specific engines.
Because a Vertical uses a higher volume of data than a customisation, the engine is
less likely to take translations from the baseline and is therefore less likely to
produce a general translation instead of a more specific technical one. However, as the
data for the Vertical comes from different sources within a domain, inconsistencies in
style and terminology are also more likely, and these will need to be checked during
the post-editing and quality-checking stages.
SDL Verticals are available for the following domains in a wide number of languages:
• Automotive Vertical
• Consumer Electronics (CE) Vertical
• HiTech (IT Hardware) Vertical
• Travel Vertical
These engines are always under development and, whenever there is a considerable
amount of new data and/or new technical features that can enhance the overall
performance of the engine, they are retrained to improve the overall quality of the MT
output.
The vertical retraining process is designed to increase productivity when working with
vertical output. However, if a client prefers a specific translation for a certain term,
and that translation was produced correctly by the original vertical, retraining might
change the term to a more widely used translation. This will need to be corrected during
post-editing, and we recommend adding terms like this to your QA check.
Verticals – Pros and Cons
5.3 Customisations
A customisation is a trained statistical engine that only (or mainly) contains
client-specific corpora. It involves preparing client-specific TMs in order to get the
best MT output for production. The recommended requirement for a successful
customisation is an aligned corpus of 1 million words of relevant customer data,
although this may vary per project and language pair, and it is possible to create a
customisation with lower volumes of customer data.
Using this type of material guarantees adherence to client-specific terminology and
style.
As the machine translation output is fully based on the bilingual corpus, with no
syntactical or lexical data added, the quality of the output can only be as good as the
quality of the corpus. If the corpus data has inconsistent terminology and/or style, the
resulting MT may also be inconsistent. That is why it is important that the linguist
responsible for the customisation chooses suitable data to be added to the SMT engine
training.
Customisation – Pros and Cons
5.4 Engine training process
When a project is sent to iMT, all the necessary data is collated – including project
TMs, sample files, project information, etc. The next step in the process is to evaluate
the source text and establish if it is suitable for machine translation. A source evaluation
will also allow the linguist to identify any possible issues with the use of MT on the
project, so that action can be taken during engine creation to try to minimise those
issues. If the data is suitable, then the TMs are prepared for training the engine. SMT
engine training is an iterative process, and involves the following steps:
1. TM cleaning
2. Selection of test sentences
3. Testing
TM cleaning
Data cleaning is a process applied to the training corpus in order to make it compatible
with the platform where the SMT engines are created. This process improves the
quality of the data by removing content which could adversely affect the MT output,
such as tags, entities, misaligned segments, and corruptions. Such content could
otherwise appear in the output and reduce productivity. Some parts are also harmonised
so that the MT output will be faster to post-edit, as fewer changes will be required.
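A minimal sketch of this kind of cleaning is shown below. It is not the procedure used on
any particular platform: the regular expression, the decision to decode entities rather
than delete them, and the length-ratio threshold used to spot possible misalignments are
all assumptions chosen for illustration.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # matches inline tags such as <b> or </b>


def clean_segment(text):
    """Strip inline tags, decode entities and normalise whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return " ".join(text.split())


def clean_tm(pairs, max_length_ratio=3.0):
    """Return cleaned (source, target) pairs, dropping likely misalignments."""
    cleaned = []
    for source, target in pairs:
        source, target = clean_segment(source), clean_segment(target)
        if not source or not target:
            continue  # one side is empty: nothing to learn from
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_length_ratio:
            continue  # very different lengths suggest a misaligned segment
        cleaned.append((source, target))
    return cleaned
```

The length-ratio check is only a crude heuristic; in practice a linguist would still
review suspicious segments rather than rely on a single threshold.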
Creation of training
During a customisation, several trainings with different combinations of data may be
uploaded to the system and then evaluated so the iMT team can select the one that
delivers the best results. A second trial is based on the results of the first one – the
problems found in the output are traced back to the TM data, which is then manipulated
further to try to solve the issues. The training with the best results is then deployed for
production.
Selection of test sentences
For MT testing purposes, the linguist selects a set of sentences which do not appear in
the corpus which will be uploaded to the SMT system. Ideally, the sentences should be
taken from new untranslated project files, as this is the best way to reproduce a real
translation scenario and test the engine thoroughly.
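The requirement that test sentences must not appear in the training corpus can be checked
mechanically. The sketch below is one possible way to do it, under the assumption that
sentences differing only in case or spacing count as the same sentence; the function
names and the fixed random seed are illustrative choices.

```python
import random


def normalise(sentence):
    """Lower-case and collapse whitespace so near-identical sentences compare equal."""
    return " ".join(sentence.lower().split())


def select_test_sentences(candidates, training_sources, sample_size, seed=0):
    """Pick up to sample_size candidate sentences that are absent from the training data."""
    seen = {normalise(s) for s in training_sources}
    unseen = []
    for sentence in candidates:
        key = normalise(sentence)
        if key not in seen:
            seen.add(key)  # also prevents duplicates within the test set
            unseen.append(sentence)
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return rng.sample(unseen, min(sample_size, len(unseen)))
```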
Testing
One of the biggest challenges within the MT industry at present is to find an
automatic measure that can forecast whether a particular MT output will meet a
particular user's goal. Achieving this objective is particularly difficult as there
are no unique solutions in translation: many translations may be right for one sentence,
and even more can be wrong. Since an automatic assessment of MT output quality is
generally based on comparing the MT to reference translations, finding an automatic
procedure to determine the MT output quality is a challenging task on which a lot of
work is currently being concentrated.
Nowadays, many MT providers choose between human and automatic evaluations (or
a combination of both).
Human evaluation is normally centred on Likert-based scales. With this method,
resources are asked to score aspects of the MT output by following a list of parameters
associated with a numerical scale. For example, “score 5 if the output is entirely correct,
score 4 if the output is understandable but has grammatical errors, …”. This kind of
assessment mainly focuses on understandability, although some vendors have started
looking into Likert-based scales that could help assess the post-editing effort. Human
evaluation can also be used to compare two or more MT engines or systems, and is
based on the evaluator stating their preference between two or more MT outputs
generated for the same source sentences.
Some of the disadvantages inherent in human evaluation are:
• Performing this kind of test is relatively expensive and time-consuming, as several
resources are required for assessing each and every engine.
• Human evaluations are prone to subjectivity, and final assessments may not be
consistent.
• Resources need to be familiar with the scales and follow them to the letter in
order to obtain valid results.
However, when done well, a human evaluation is still often considered to be more
reliable than automated measures, and has the added advantage of a human translator
being able to provide useful comments on the issues found in the MT output.
The productivity increase, though, is still difficult to predict for all cases, as
productivity may vary per job and also per resource (it varies with post-editing
experience, for instance). Most productivity tests in the industry are based on a
combination of measuring post-editing speed and post-editing effort, or on comparing
post-editing speed with conventional translation speed.
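The comparison of post-editing speed with conventional translation speed comes down to
simple arithmetic, sketched below. The word counts and timings are invented figures used
only to show the calculation; they are not results from any real productivity test.

```python
def words_per_hour(word_count, minutes):
    """Throughput in words per hour for a job of word_count words."""
    return word_count * 60 / minutes


def productivity_gain(pe_speed, translation_speed):
    """Percentage speed increase of post-editing over conventional translation."""
    return (pe_speed - translation_speed) / translation_speed * 100


# Hypothetical test: the same 500-word sample, translated and post-edited.
translation_speed = words_per_hour(500, 60)  # 500.0 words per hour
pe_speed = words_per_hour(500, 40)           # 750.0 words per hour
print(productivity_gain(pe_speed, translation_speed))  # 50.0
```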
In recent decades, many measures for automated evaluation have been proposed.
Most automated measures assess the quality of the machine translation compared to a
reference translation which is deemed to be high quality. Some of the most widely
used ones are detailed below.
BLEU (Bilingual Evaluation Understudy) score: this algorithm is meant to evaluate the
quality of text which has been machine-translated. The central idea behind BLEU is
“the closer a machine translation is to a professional human translation, the better it is”.
For that, scores are calculated for individual translated segments – generally sentences
– by comparing them with a set of good quality reference translations. Those scores
are then averaged over the whole corpus to reach an estimate of the translation's
overall quality. Intelligibility and grammatical correctness are not taken into account
explicitly; they are assumed to be reflected in the correct reference translations.
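To make the n-gram comparison concrete, here is a deliberately simplified sketch of a
BLEU-style score for one segment against one reference. It assumes whitespace-tokenised
input and only unigrams and bigrams by default, whereas the standard metric is normally
computed with up to 4-grams and statistics gathered over a whole test set; the function
names are illustrative and this is not a replacement for an established implementation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with clipped counts."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of modified precisions, scaled by a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * geo_mean
```

Clipping the counts stops an output such as "the the the the" from being rewarded for
repeating a word that occurs in the reference, and the brevity penalty stops very short
outputs from scoring well on precision alone.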
NIST: the name of this metric comes from the US National Institute of Standards and
Technology. This measure is based on the BLEU score, but it differs from this algorithm
in several points.
Whilst BLEU simply calculates how many n-grams match both in the reference
translation and in the MT output and gives these n-grams the same weight, NIST also
calculates how “informative” a particular n-gram is. When a correct n-gram is found, the
algorithm measures if that combination is a common sequence in the corpus material or
if, on the other hand, that fragment is not that common in the data. Depending on the
result, an n-gram will be given more or less weight. To give an example, if the bigram
"on the" is correctly matched, it will receive lower weight than the correct matching of
the bigram "interesting calculations", as the latter is less likely to occur.
NIST also differs from BLEU in how some penalties are calculated. For example, small
variations in translation length do not impact the overall NIST score as much as in
BLEU.
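The idea of weighting n-grams by how informative they are can be sketched as follows. The
formula used here, the base-2 logarithm of the count of the n-gram's prefix divided by
the count of the full n-gram, is the commonly cited definition of the NIST information
weight; the reference sentences are invented so that the two bigrams from the example
above can be compared, and the function names are assumptions.

```python
import math
from collections import Counter


def count_ngrams(sentences, max_n=2):
    """Count all n-grams (n = 1..max_n) in a list of reference sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def information_weight(ngram, counts, total_words):
    """log2(count of the n-gram's prefix / count of the n-gram itself)."""
    if counts[ngram] == 0:
        raise ValueError("n-gram does not occur in the reference data")
    prefix_count = counts[ngram[:-1]] if len(ngram) > 1 else total_words
    return math.log2(prefix_count / counts[ngram])


# Hypothetical reference data.
references = [
    "on the table",
    "on the chair",
    "on the floor",
    "interesting calculations",
    "interesting results on the wall",
]
counts = count_ngrams(references)
total_words = sum(len(s.split()) for s in references)
print(information_weight(("on", "the"), counts, total_words))  # 0.0
print(information_weight(("interesting", "calculations"), counts, total_words))  # 1.0
```

In this toy data "on" is always followed by "the", so matching "on the" carries no extra
information, while "interesting" is followed by "calculations" only half of the time.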
METEOR (Metric for Evaluation of Translation with Explicit ORdering): this metric was
designed to address some of the problems found in the more popular BLEU metric, and
also produce good correlation with human judgment at the sentence or segment level
(this differs from the BLEU metric in that BLEU seeks correlation at the corpus level).
For that, several features that had not been part of any other metrics at the time were
introduced. Matches in METEOR are made by following the parameters below, among
others:
Exact words: as with other metrics, a match is made if two words are identical in the
machine translation output and the reference translation.
Stem: words are reduced to their stem form. If two words have the same stem, a match
is also made.
Synonymy: words are matched if they are synonyms of one another. Words are
considered synonymous if they share any synonym sets according to an external
database.
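The three matching stages can be illustrated with a small sketch. It is not the METEOR
implementation: the suffix-stripping stemmer and the two synonym sets below are toy
stand-ins (assumptions) for the stemmer and the external synonym database that the real
metric relies on, and the alignment is a simple greedy pass.

```python
# Hypothetical synonym sets standing in for an external database.
SYNONYM_SETS = [{"big", "large"}, {"car", "automobile"}]


def toy_stem(word):
    """Very crude stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def match_type(mt_word, ref_word):
    """Return the first stage at which two words match, or None."""
    if mt_word == ref_word:
        return "exact"
    if toy_stem(mt_word) == toy_stem(ref_word):
        return "stem"
    if any(mt_word in s and ref_word in s for s in SYNONYM_SETS):
        return "synonym"
    return None


def align(mt_tokens, ref_tokens):
    """Greedily pair MT words with unused reference words, stage by stage."""
    matched_mt, used_ref, matches = set(), set(), []
    for stage in ("exact", "stem", "synonym"):
        for i, mt_word in enumerate(mt_tokens):
            if i in matched_mt:
                continue
            for j, ref_word in enumerate(ref_tokens):
                if j not in used_ref and match_type(mt_word, ref_word) == stage:
                    matched_mt.add(i)
                    used_ref.add(j)
                    matches.append((i, mt_word, ref_word, stage))
                    break
    return [(m, r, s) for _, m, r, s in sorted(matches)]
```

For the MT output "the large engine started" and the reference "the big engine starts",
this sketch pairs every word: two exact matches, one synonym match and one stem match.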
TER (Translation Edit Rate): this metric measures the number of edits required to
change a machine translation output into one of the human references.
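A simplified sketch of this calculation is given below: it counts word-level insertions,
deletions and substitutions and divides by the length of the reference. This is an
approximation under stated assumptions, since the published TER metric also treats the
movement of a block of words as a single edit, which this sketch does not model; it also
assumes a single reference and whitespace tokenisation.

```python
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions and substitutions."""
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(previous[j] + 1,          # delete a hypothesis word
                               current[j - 1] + 1,       # insert a reference word
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def simple_ter(mt, reference):
    """Edits needed to turn the MT output into the reference, per reference word."""
    hyp, ref = mt.split(), reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)


print(simple_ter("Tools for compression to measure",
                 "Tool for compression to measure"))  # 0.2
```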
Levenshtein distance: this metric measures the similarity or the dissimilarity (“distance”)
between two text strings by calculating the minimum number of single-character edits
(insertion, deletion, substitution) required to change one string into the other. In the
field of machine translation, this can be done by comparing the raw MT output to the
human translation.
Let's look at a couple of examples:
The Levenshtein distance between "sport" and "short" is 1, because 1 edit is required
to convert one word into the other (replace “p” with “h”).
The Levenshtein distance between “dog” and “frog” is 2, as it is not possible to convert
the first word into the second with fewer edits (replace “d” with “f” and add “r”).
The distance has an upper bound: it can never exceed the length of the longer of the
two input strings. If two words have nothing in common, the minimum number of edits
equals the number of characters in the longer string.
Example: if we have “computer” and “alibi”, the Levenshtein distance will be 8 and no
higher than 8:
replace “c” with “a”
replace “o” with “l”
replace “m” with “i”
replace “p” with “b”
replace “u” with “i”
delete “t”
delete “e”
delete “r”
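The three examples above can be reproduced with a short dynamic-programming sketch. The
function name is an illustrative choice; the algorithm itself is the standard
row-by-row computation of the Levenshtein distance.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


print(levenshtein("sport", "short"))     # 1
print(levenshtein("dog", "frog"))        # 2
print(levenshtein("computer", "alibi"))  # 8
```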
As with other automated measures, the results of the Levenshtein distance are not set
in stone. As mentioned before, there can be many correct translations for a single
source, so the Levenshtein distance cannot measure quality on its own. Results will
vary, for example, if clauses are positioned differently in the MT output and in the
human reference translation.
Example:
MT: “If I go home after 10pm, I will let you know”.
Reference human translation: “I will let you know if I go home after 10 pm”.
In this case, the MT output is correct and no changes would be necessary during a
post-editing stage. However, the Levenshtein distance will be quite high, as many
changes would be required to turn the first sentence into the second one.
That suggests once more the importance of selecting large test beds to run any of
these automated evaluations on, as that will allow us to get more reliable results.
Automatic measures also have their limitations: the reference translation is not always
available, and those measures do not give an indication of post-editing productivity
expected. Therefore, they are useful for engine training development and comparison,
but not necessarily practical for a production scenario.
In January 2011, TAUS began working with a group of its enterprise members with a
clear objective in mind: to tackle the general problem of evaluating translation quality.
Consequently, the idea of the Dynamic Quality Evaluation Framework (DQF) was born.
The framework is still in development, and will allow users to profile their content and
receive guidance on best-fit evaluation techniques. A knowledge base documenting
best practices provides detailed practical information on how to carry out seven specific
types of quality evaluation. By establishing best practices, metrics and benchmarks
within a dynamic framework, the project team sought to apply best-fit evaluation
approaches depending on content type and usage, moving away from the dated, static
– one size fits all – approach used by most vendors.
6 Using the MT Output: the basics of post-editing
6.1 Introduction to post-editing
Post-editing is a new phase that replaces conventional translation for MT projects. It is
a change in the process, but the working environment remains the same. The same
applications and the same reference materials used in a conventional translation
project are also used when post-editing. Machine translation is a new component in the
process that gives human translators more leverage alongside the use of TMs.
Post-editors work in CAT tools, editing fuzzy matches from the TM and
machine-translated segments to a publishable quality.
Post-editing is a skill which translators develop with time. Post-editors will not be fully
productive from day one as they need to learn their trade. Industry research has shown
that experience is the single most important factor in translation productivity and
becomes even more influential in post-editing. Over time, translators can adapt their
working practices to use the MT output to their advantage.
Integrating post-editing into a production environment
On a file for post-editing, the Translation Memory is applied as usual, to create the
100% matches and fuzzy matches. Machine translation is applied to any untranslated
text left after the TM is applied.
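The order in which the TM and the MT are applied can be sketched as follows. This is an
illustration only: the similarity measure from Python's difflib stands in for the fuzzy
match scoring of a real CAT tool, the 0.75 threshold is an arbitrary assumption, and
machine_translate is a hypothetical callable representing the MT engine.

```python
from difflib import SequenceMatcher


def best_tm_match(source, tm):
    """Return (score, target) for the most similar TM source, or (0.0, None)."""
    best = (0.0, None)
    for tm_source, tm_target in tm.items():
        score = SequenceMatcher(None, source, tm_source).ratio()
        if score > best[0]:
            best = (score, tm_target)
    return best


def pretranslate(segments, tm, machine_translate, fuzzy_threshold=0.75):
    """Apply the TM first and send only the remaining segments to MT."""
    results = []
    for source in segments:
        if source in tm:
            target, origin = tm[source], "100% match"
        else:
            score, target = best_tm_match(source, tm)
            if score >= fuzzy_threshold:
                origin = "fuzzy match"
            else:
                target, origin = machine_translate(source), "MT"
        results.append((source, target, origin))
    return results
```

Each segment then reaches the post-editor labelled with its origin, so that fuzzy matches
and machine-translated segments can be edited accordingly.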
The post-editing phase itself involves a number of key stages. Since the post-editor is
attempting to be as efficient and productive as possible, preparation is key. Do not rush
ahead without taking time to consider the source and MT output. Determine the
useable parts and then build around these. Focus on accuracy, without under- or over-
editing, and finally check over the grammar and the terminology. Post-editors are
generally advised that if the text scans well, it will flow well.
6.2 Degrees of post-editing
The market makes a distinction between post-editing to publishable quality and
post-editing to an understandable level. Post-editing to publishable level is the highest
quality standard. This is in line with the expectations of the majority of SDL's clients.
After post-editing, files undergo a quality check to ensure that the translation is correct
and fluent. The final quality should be comparable to conventional translation.
Post-editing to understandable quality, or light post-editing, is normally required for
low visibility text, or for texts that would not otherwise be translated for a client as
it would be too expensive and time-consuming. A client might decide to opt for understandable
quality texts in order to reduce the number of support requests for a product or to
provide an extra service to the user, for example. Typical purposes of understandable
quality texts include offering users a quick answer on how to fix an issue or providing a
translation solution for low visibility content, such as FAQs, blogs, and knowledge
bases.
When post-editing to an understandable level alone, it is less important to correct style
and grammar so long as the meaning of a translation is clear. Most important, however,
is to follow the clear project requirements that should always be provided by the client
in advance.
Examples of light post-editing
LP: IT-EN
Source: Attrezzo di compressione per misurare la sporgenza delle canne dei cilindri (da utilizzare con 380000364 e piastre specifiche)
MT output: Tools for compression to measure cylinder liner protrusion ( use with 380000364 and specific plates)
Post-edited: Tool for compression to measure cylinder liner protrusion ( use with 380000364 and specific plates)
Comments: The plural needs to be edited because "attrezzo" is singular in the Italian source, but there is no need to remove the space after the bracket.

LP: IT-EN
Source: Prima di iniziare qualsiasi lavoro in quest'area, spegnere il motore ed estrarre la chiave di accensione.
MT output: Always stop the engine and remove the Key before working in this area.
Post-edited: Always stop the engine and remove the Key before working in this area.
Comments: There is no need to change the uppercase to lower case.

LP: FR-EN
Source: Si la valeur souhaitée n’est pas obtenue, répéter les instructions 3 à 5.
MT output: If the desired pressure has not been reached, repeat instructions 3 to 5.
Post-edited: If the desired pressure has not been reached, repeat instructions 3 to 5.
Comments: "Required" would be better than "desired", but since this is perfectly understandable there is no need to change it.

LP: EN-DE
Source: To remove the 3D diffuser:
MT output: Zum Entfernen des 3D Refraktionstechnik:
Post-edited: Zum Entfernen des 3D Refraktionstechnik:
Comments: The MT has the wrong article, “des” instead of “der”. But the MT sentence is perfectly understandable as it is.

LP: EN-FR
Source: The pressure is reduced to pilot pressure.
MT output: La pression est réduit à la pression pilote.
Post-edited: La pression est réduit à la pression pilote.
Comments: The gender agreement is wrong, should be “réduite” instead of “réduit”, but the sentence is understandable as it is and that does not need to be corrected.
Publishable quality vs. Understandable level
Publishable Quality
• Most frequent form of post-editing
• Generally used for higher visibility texts
• Comparable to conventional translation
• High quality expectations
• Follows standard client expectations
Understandable Level
• Less frequent form of post-editing
• Generally used for lower visibility texts
• Focus on meaning not on style and grammar
• Expectations based on specific client requirements
• Clear requirements are needed
Post-editing to publishable quality is covered in more detail in the next chapters. When
post-editing to publishable quality, the following rules apply:
1. Read the source segment first and then the MT output.
2. Determine the usable elements (single words and phrases) and make them the basis for
your translation.
3. Build from the MT output and use every part of the MT output that can speed up your
work.
4. Take care not to over-edit (unnecessary rephrasing) or under-edit (wrong prepositions,
inflections, compounds, etc.) the MT output. The adjustment of style (such as “may”
versus “might”) can be optional, but grammatical correctness in the target is not.
5. Correct any grammatical errors and make sure that the terminology of the MT output is
compliant with glossaries and termbases. This will always need to be checked as any
inconsistencies in the training material will be reproduced in the output.
6. Run the compulsory checks (spelling, grammar, terminology check).
7. Finally, after post-editing each segment, reread your translation and make sure that
no details are missing and you have not left any words that are not needed.
6.3 The quality check process
It is recommended that the post-editing process is followed by a quality check, which is
the equivalent of conventional review.
As part of SDL's workflow, the quality check is performed as a separate step by a
reviewer and guarantees that the translation is fully publishable. To achieve this, quality
at source is key: the post-edited file should already be of publishable quality. To
facilitate this, ensure that the post-editor receives clear instructions and has access to
all the most up-to-date reference materials. The required QA checks need to be run and
can be used as an indication of the post-editing quality.
When quality-checking, always bear the MT in mind and understand the initial MT
output. Identify known problems in advance (see section 8) and make sure to include
them in your checks (e.g. wrong prepositions, terminology, known issues with MT). It is
important to learn to distinguish between what needs to be changed and what can
remain untouched. Note that there are some items which always need to be amended
by the post-editor. Examples include date formats, spacing, wrong prepositions or
terminology issues caused by several possible translations of the same word.
When quality-checking machine-translated material, focus on over-editing and under-
editing (depending on style and client requirements). Over-editing will lead to lower
productivity and needs to be avoided during both the PE and the QA check phase.
Under-editing may result in quality issues and will impact negatively on the time needed
for quality check.
Before starting a quality check, make sure that all the content has been translated.
Then check that the post-edited text reads well from a user's point of view. The
post-edited text must match the source. Be careful to look for mistranslations, words left
out of the translation or additional words which are not in the source text. Check that
there are no typos. Scrolling down the file will enable you to spot spelling mistakes and
inconsistencies. Terminology should be consistent with the master glossary, especially
product names. It is vital that terminology is consistent. Sometimes terminology is not
consistent in the TMs and there are additional lists and guidelines for terminology.
Finally, check that the style is consistent overall with the rest of the files and
complies with the style guide from the client.
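Part of the terminology check can be automated. The sketch below flags segments in which
a glossary source term occurs but the approved target term does not. It is a rough
illustration, not the QA check of any particular tool: the function name is an
assumption, and plain substring matching will miss inflected forms and may raise false
alarms.

```python
def find_terminology_issues(segments, glossary):
    """Return (segment index, source term, expected target term) for each mismatch."""
    issues = []
    for index, (source, target) in enumerate(segments):
        for source_term, target_term in glossary.items():
            term_in_source = source_term.lower() in source.lower()
            term_in_target = target_term.lower() in target.lower()
            if term_in_source and not term_in_target:
                issues.append((index, source_term, target_term))
    return issues


# Hypothetical glossary entry and two versions of the same segment.
glossary = {"check box": "Kontrollkästchen"}
segments = [
    ("Check box D6 on the blue label", "Kasten D6 auf dem blauen Aufkleber prüfen"),
    ("Check box D6 on the blue label", "Kontrollkästchen D6 auf dem blauen Aufkleber"),
]
print(find_terminology_issues(segments, glossary))
# [(0, 'check box', 'Kontrollkästchen')]
```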
7 How to get the most out of MT
7.1 What makes an effective post-editor?
In order to post-edit effectively, it is essential to use the machine translation output as
much as possible. Do not ignore the machine translation output and do not translate
segments from scratch. In almost all cases some parts of the automatic translation
output can be used and help to speed up work.
The following guidelines will help you to identify usable parts and achieve the maximum
post-editing productivity. The translator needs to achieve publishable quality at the
post-editing stage without sacrificing translation speed. Once you have learnt to identify
usable parts and to use them, you will find post-editing easier and faster than
translating from scratch. Like any other new skill, however, there is a learning curve
with MT post-editing: the more you practice, the faster and easier it gets.
However, the MT is not only useful when it is easy to edit. You can also use the MT as
a source of inspiration when looking for the correct translation and pick out bits of the
sentence to reuse rather than trying to keep as much of the sentence as possible. This
is particularly relevant for longer sentences. Even sentences that are largely incorrect
can be useful so long as deleting the incorrect material is not time-consuming.
Post-editing tips
• Do not ignore or erase the MT output.
• Maximise the usage of the MT output.
• Use the appropriate style and terminology.
• Follow the project/client style guidelines.
• If the MT meets the project requirements, do not modify it.
• Do not spend time researching terminology unless the MT is clearly wrong.
• Do not replace words with synonyms.
• Do not make alterations for the sake of variation alone.
• If formatting is an issue, restore the original source format and paste the useful MT parts instead.
• An alternative if there are many tags is to delete them, edit the text, then insert the tags again.
• At the end, re-read the segment and compare it to the source for accuracy.
Apart from this, bear in mind that account knowledge matters for post-editing as well.
Whilst this is true of all translation projects, conventional as well as MT, a solid
knowledge of the project requirements with regard to style guidelines, terminology, TM
and client expectations will help you achieve good post-editing productivity.
So what makes a good post-editor?
• Excellent linguistic skills
• Domain and subject knowledge
• Proficiency with CAT tools and automated text-checking
• Positive attitude towards MT
• Practice!
7.2 Post-editing quality expectations
The quality expectations will vary according to the degree of post-editing and the client
requirements. However, certain general principles apply. The aim is to deliver a high
quality translation faster than a conventional translation. Translation speed is a key
factor when post-editing. Therefore, the machine translation needs to be corrected with
a view to maintaining efficiency.
There should be no difference in quality between a human translation and a post-edited
translation when post-editing to publishable quality. However, there may be a slight
shift in style. Style should be correct and appropriate to the project, but may need to be
less refined in order to allow for a more efficient use of the MT output. Where a client
specifically asks for MT to be used on their project, the client needs to be made aware
of this possible shift in style and expectations need to be managed accordingly.
There will of course be a certain amount of variation, but this is a feature of
conventional translation as well. So long as the quality criteria are adhered to, a
post-edited text will be considered to have met the quality expectations.
Post-editing quality criteria
• The translation must be a correct reflection of the source.
• Spelling and punctuation must be correct.
• The translation must be grammatically and syntactically correct and reflect the conventions of the target language.
• The correct terminology must be applied and used consistently (including preferred translations for frequently occurring terms).
• Cultural references (date and time formats, units of measurement, number formats, currency, etc.) must be correctly adapted.
• The style and register of the target must be appropriate for the document type.
• The original formatting must be reproduced.
• Project guidelines must be followed.
• The translation must read well and be suitable for the end user.
There are two main issues that post-editors often face when attempting to fulfil the
highest possible quality criteria in the shortest amount of time. These are under-editing
and over-editing and will be discussed in more detail in the following sections.
7.3 Under-editing
If a post-editor has under-edited the MT output, they may have missed important errors
that needed to be corrected, and these may reflect badly on the quality of the translation.
Under-editing is generally characterised by the following features:
• Errors (spelling, typos)
• Mistranslations (target does not match source)
• Inconsistent terminology
• Inaccuracy
• Inconsistency in figures, units of measurement, etc.
• Incorrect formatting
• Not following project-specific instructions
Below are some examples of under-editing:

LP: EN-ES
Source: On its walls you'll discover the figures of a puma and a snake.
MT: En sus murallas, descubrirá la cifras de un puma y una serpiente.
PE: En sus murallas descubrirá la figuras de un puma y una serpiente.
Reviewer: En sus murallas descubrirá las figuras de un puma y una serpiente.
Comment: The term “cifras” has been correctly post-edited and replaced with “figuras”, but the article “la” has not been changed to the plural form.

LP: EN-ES
Source: Inside you can see a sacrificial altar made of a huge stone.
MT: En su interior se puede ver una altar de sacrificios de una enorme piedra.
PE: En su interior se puede ver una altar de sacrificios hecho con una enorme piedra.
Reviewer: En su interior se puede ver un altar de sacrificios hecho con una enorme piedra.
Comment: The preposition “de” has been correctly post-edited, but the article “una” does not correspond to the gender of the noun “altar” (“una” is feminine whilst “altar” is masculine).

LP: EN-FR
Source: How long will the battery last using interactive features (such as games) on my phone?
MT: Combien de temps dure l'autonomie à partir d'interactive fonctions (comme les jeux) sur mon téléphone ?
PE: Combien de temps dure l'autonomie de la batterie lorsque j'utilise les fonctions interactives (comme les jeux) sur mon telephone?
Reviewer: Quelle est l'autonomie de la batterie lorsque j'utilise les fonctions interactives (comme les jeux) de mon telephone ?
Comment: "Combien de temps dure" should not be combined with the word "autonomie". The literal translation of "How long does XXX last" is not appropriate in this context. The correct version is "Quelle est l'autonomie". The preposition "sur" is not appropriate in this context.

7.4 Over-editing
If a post-editor has over-edited the MT output, they may be taking extra time which may
affect their overall productivity and reduce the benefits of post-editing. Over-editing is
typically characterised by preferential rather than necessary changes.
To avoid over-editing:
• Do not rewrite the translation unless unavoidable.
• Do not change correct and understandable translations, even if they could be phrased more naturally or fluently.
• If the MT output style meets the project requirements, do not change it.
• Reduce changes to a minimum and focus on actual mistakes.
There is always room to allow stylistic changes and creativity with post-editing, and
certainly stylistic features that do not meet with the client style guides should be
amended. The important thing to remember is not to let preferential changes distract
from necessary amendments and not to let these changes have a negative impact on
the overall productivity.
Below are some examples of over-editing:

Language Pair: DE-EN
Source: Die Kühlung erfolgt durch das massive Aluminium-Gehäuse und die seitlich angebrachten Kühlrippen und kommt gänzlich ohne Lüfter aus.
MT: The cooling takes place through the solid aluminum case and the side-mounted cooling fins and comes completely without fans.
PE with Overediting: The cooling fins fitted on the side of the solid aluminium casing ensure that the computer is cooled, as it comes completely without fans.
PE without Overediting: Cooling takes place through the solid aluminium casing and the side-mounted cooling fins - there is no need whatsoever for fans.
Comment on overedited version: Unnecessary re-ordering and re-translating of segments.

Language Pair: DE-EN
Source: Aber nicht nur Äußerlich hat dieses Festplattengehäuse einiges zu bieten.
MT: But not only on the outside, this hard drive enclosure has something to offer.
PE with Overediting: This hard drive casing has more than just a great design.
PE without Overediting: But it's not only on the outside where this hard drive casing has something to offer.
Comment on overedited version: Overedited version is stylistically more pleasing, but requires a major rewrite, while version without overediting is equally correct.

Language Pair: DE-EN
Source: Fotos mit 1,3 Megapixeln
MT: Photos with 1.3 megapixels
PE with Overediting: 1.3 megapixel photos
PE without Overediting: Photos with 1.3 megapixels
Comment on overedited version: Unnecessary re-ordering of segments.

Language Pair: DE-EN
Source: Zudem stehen verschieden SATA-Typen zur Auswahl, wie z.B. Micro SATA oder Slimline-SATA.
MT: In addition there are different SATA-types are available, such as micro SATA or Slimline SATA.
PE with Overediting: There are various types of SATA available for this, such as micro SATA or slimline SATA.
PE without Overediting: In addition, there are different SATA types available, such as micro SATA or slimline SATA.
Comment on overedited version: Unnecessary re-phrasing and change of syntax.

Language Pair: DE-EN
Source: Mit der 1 Meter langen Tischantenne können Sie Ihren WLAN-Empfang deutlich optimieren.
MT: With the 1 meter long Tischantenne you can significantly optimize your WLAN-reception.
PE with Overediting: You can optimise your WLAN reception significantly using the 1-m table-top antenna.
PE without Overediting: With the 1-m table-top antenna you can significantly optimise your WLAN reception.
Comment on overedited version: Unnecessary re-ordering of segments; more of the MT can be left unchanged if syntax is kept as is.

Language Pair: EN-DE
Source: Make sure that the brake pedal is depressed while you perform this procedure.
MT: Sicherstellen, dass das Bremspedal niedergedrückt wird während Sie dieses Verfahren durchführen.
PE with Overediting: Währenddessen muss das Bremspedal weiterhin gedrückt werden!
PE without Overediting: Das Bremspedal muss niedergedrückt sein, während Sie dieses Verfahren durchführen.
Comment on overedited version: Unnecessary re-write; usable parts of the MT were ignored in overedited version.

Language Pair: EN-DE
Source: Install the Bluetooth printer on your computer and set it as the default printer.
MT: Installieren Sie die Bluetooth Drucker auf Ihrem Computer, und richten Sie ihn als Standarddrucker.
PE with Overediting: Installieren Sie den Bluetooth-Drucker auf Ihrem Computer, und legen Sie ihn als Standarddrucker fest.
PE without Overediting: Installieren Sie den Bluetooth-Drucker auf Ihrem Computer, und richten Sie ihn als Standarddrucker ein.
Comment on overedited version: Unnecessary use of synonyms; verb "einrichten" was unnecessarily replaced by "festlegen".

Language Pair: EN-DE
Source: Allow the computer to lock automatically after 10 seconds.
MT: Warten Sie, bis der Computer die Sperre automatisch nach 10 Sekunden.
PE with Overediting: Gestatten Sie, dass der Computer nach 10 Sekunden automatisch gesperrt wird.
PE without Overediting: Warten Sie, bis der Computer die Sperre nach 10 Sekunden automatisch aktiviert.
Comment on overedited version: Unnecessary use of synonyms; verb "warten" was unnecessarily replaced by "gestatten" ("warten" conveyed the same meaning in this context).

Language Pair: EN-DE
Source: When the proximity feature is enabled but inactive, the following message displays in the Bluetooth Device Control window for the phone:
MT: Wenn der Nähe Funktion aktiviert, aber nicht aktiv ist, wird die folgende Meldung in der Bluetooth Device Control Fenster für das Telefon:
PE with Overediting: Wenn die Näherungsfunktion eingeschaltet aber inaktiv ist, wird im Fenster "Bluetooth-Gerätesteuerung" für das Telefon die folgende Meldung angezeigt:
PE without Overediting: Wenn die Näherungsfunktion aktiviert aber nicht aktiv ist, wird die folgende Meldung im Fenster "Bluetooth-Gerätesteuerung" für das Telefon angezeigt:
Comment on overedited version: Unnecessary use of synonyms; "eingeschaltet" is a synonym of "aktiviert" and "inaktiv" is a synonym of "nicht aktiv" in this context.

Language Pair: EN-DE
Source: This feature provides a quick way to transfer files without requiring you to browse the file system on the other device.
MT: Diese Funktion bietet eine schnelle Möglichkeit, Dateien, ohne die Datei zu durchsuchen auf der anderen Gerät zu übertragen.
PE with Overediting: Mithilfe dieser Funktion lassen sich Dateien schnell übertragen, ohne das Dateisystem des anderen Geräts durchsuchen zu müssen.
PE without Overediting: Diese Funktion bietet eine Möglichkeit, Dateien schnell ohne Durchsuchen des Dateisystems des anderen Geräts zu übertragen.
Comment on overedited version: Unnecessary re-ordering of segments; more of the MT can be left unchanged if the syntax is kept as is.

Language Pair: EN-FR
Source: After disconnecting the high voltage terminals, busbars, etc., insulate the parts with insulating tape.
MT: Après avoir débranché les bornes haute tension, jeux, etc., isoler les pièces avec de la bande adhésive isolante.
PE with Overediting: Après le débranchement des bornes, barres collectrices, etc. haute tension, isoler les pièces avec du ruban isolant.
PE without Overediting: Après avoir débranché les bornes, barres collectrices, etc. haute tension, isoler les pièces avec du ruban isolant.
Comment on overedited version: Unnecessary change of syntax.

Language Pair: EN-FR
Source: For further information on the Table View, see the tutorial "Table View Productivity Features"
MT: Pour plus d'informations sur l'affichage en tableau, voir les sections du tutoriel "Fonctions de productivité - Affichage en tableau"
PE with Overediting: Pour obtenir de plus amples renseignements sur l’affichage en tableau, voir le tutoriel « Fonctions de productivité - Affichage en tableau »
PE without Overediting: Pour plus d'informations sur l’affichage en tableau, voir le tutoriel « Fonctions de productivité - Affichage en tableau »
Comment on overedited version: Correct expression in MT; not needing any editing.

Language Pair: EN-FR
Source: Alternator is found to be noisy
MT: L'alternateur est bruyant
PE with Overediting: Le client trouve que l’alternateur est bruyant
PE without Overediting: L'alternateur est bruyant
Comment on overedited version: Correct expression in MT; not needing any editing.

Language Pair: EN-FR
Source: The oil in these passages is trapped and the blade does not move.
MT: L'huile dans ces passages est piégée et la lame ne bouge pas.
PE with Overediting: La lame ne bouge pas car l'huile de ces conduits est piégée.
PE without Overediting: L'huile dans ces passages est piégée et la lame ne bouge pas.
Comment on overedited version: Unnecessary rephrasing.

Language Pair: EN-IT
Source: Be sure that the hydraulic hose is free of abrasion.
MT: Accertarsi che il flessibile idraulico sia privo di abrasioni.
PE with Overediting: Assicurarsi che il flessibile idraulico sia privo di abrasioni.
PE without Overediting: Accertarsi che il flessibile idraulico sia privo di abrasioni.
Comment on overedited version: Unnecessary use of a synonym.

Language Pair: EN-IT
Source: Adjust the angle by raising the rear of the vehicle to ensure water covers the joints.
MT: Regolare l'angolo sollevando la parte posteriore del veicolo per assicurarsi che l'acqua copre i giunti.
PE with Overediting: Sollevando la parte posteriore del veicolo, regolare l'angolo per assicurarsi che l'acqua copra i giunti.
PE without Overediting: Regolare l'angolo sollevando la parte posteriore del veicolo per assicurarsi che l'acqua copra i giunti.
Comment on overedited version: Unnecessary re-ordering of phrases.

Language Pair: EN-IT
Source: The only way to allow the device to validate a self-signed certificate is to install the certificate on the device.
MT: L'unico modo per consentire il dispositivo per convalidare un certificato autofirmato per installare il certificato sul dispositivo.
PE with Overediting: Per permettere al dispositivo di convalidare un certificato autofirmato, l'unico modo è quello di installare il certificato sul dispositivo.
PE without Overediting: L'unico modo per consentire al dispositivo di convalidare un certificato autofirmato è quello di installare il certificato sul dispositivo.
Comment on overedited version: Unnecessary use of synonyms and reordering of phrases.
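The examples above share one pattern: the over-edited version keeps far less of the MT output than the version without over-editing. As a rough illustration only, the share of MT words that survive in the post-edited text can be estimated with Python's standard `difflib` module. This is a sketch under stated assumptions, not a metric prescribed by this workbook: the function name `mt_retention` is hypothetical, words are split on whitespace, and matching is case-sensitive. A low score does not prove over-editing, since the MT itself may have been largely wrong.

```python
from difflib import SequenceMatcher

def mt_retention(mt, post_edited):
    """Return the share (0.0 to 1.0) of MT words that survive, in order,
    in the post-edited text."""
    mt_words = mt.split()
    pe_words = post_edited.split()
    if not mt_words:
        return 0.0
    # Compare word lists rather than characters; autojunk is disabled so that
    # frequent words are not discarded in long segments.
    matcher = SequenceMatcher(None, mt_words, pe_words, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(mt_words)

# Strings taken from the DE-EN "Fotos mit 1,3 Megapixeln" example.
mt = "Photos with 1.3 megapixels"
pe_with_overediting = "1.3 megapixel photos"
pe_without_overediting = "Photos with 1.3 megapixels"

print(mt_retention(mt, pe_with_overediting))     # 0.25
print(mt_retention(mt, pe_without_overediting))  # 1.0
```

Only "1.3" survives in the over-edited version, so one of four MT words is retained, whereas the version without over-editing keeps the MT unchanged.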
7.5 Help improve MT for the future
To make post-editing easier in the future, post-edit and translate in an MT-friendly
way: use simple sentence structures, and do not add information, rephrase the source
or complicate the word order in the target unnecessarily. This will improve the training
material with which engines are retrained.
For some language combinations, the word order is considerably different between
source and target and this will always pose problems for MT. However, keeping closer
to the source is generally the best way forward:
In this instance, the second translation has the advantage that the word order in the
target is closer to the word order in the source. This can help the MT engine to match
up the words “error” (German: “Fehler”) and “dash” (German: “Armaturenbrett”) more
easily with their correct translations.
If the verb is usually found at the beginning of the sentence in the source and at the
end of the sentence in the target, adding a lot of additional information in the middle
can also make it harder for the MT to match up source and target segments correctly.
As a rule, the MT engine can handle shorter phrases better than long convoluted
sentences.
A more MT-friendly style is also achieved by keeping trans