TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Paris, Manuel Herranz, Pangeanic, 4 June 2012
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE
PangeaMT: Putting Open Standards to Work
16:00-16:15, Monday 4 June
Manuel Herranz, Pangeanic
PangeaMT – putting open standards to work… well
Manuel Herranz, #manuelhrrnz #pangeanic, E: [email protected]
MACHINE TRANSLATION
Make my day,
IS NOT
IS
become a post-editor
Chomsky: Imagine that if in the future large enough amounts of data existed, they could be processed by computers with enough computing power
rule-based systems, IBM licenses, many linked to patent EN/RU & Intel
First statistical papers
1st Open source SMT
Translation industry appropriating Moses: http://euromatrixplus.net/moses
DIY SMT
http://t.co/HDTboxQ
PEAK of Cold War and information control. Products & information directed to consumers / users / citizens
BEGINNING of data resources. Internet. Accessibility to information
Content generated BY USERS / CONSUMERS / CITIZENS, multilingual, free information exchange across the world
Types of LSPs (Ben Sargent – TMS Inspiration Days April ‘11 – Krakow)
a) developers of a system, who develop it for their own use and for their clients,
b) buyers of systems (they do not want the headache of starting from scratch and prefer to buy ready-built solutions), and finally
c) those who prefer the mix & match approach (buying some good solutions outside and building interfaces and what they know works best for their business). The trend is towards unification.
2007/08
2009/10
2011/12
• DIY SMT
• Empower Users
• Glossary
• Automated re-training
• Transfer architecture and know-how to users
• Compatibility with commercial formats (ttx, sdlxliff, itd)
2007 and before
• RB tests with commercial software
• Insufficiently good output
• Only internal production
• EU Post-Editing Award
• V1: Small data sets (2-5M words), automotive & electronics (ES), then Fr/It/De in other fields
• Division born
• 00's of engine trials and language combinations
• Open-Source to commercial
• TMX / XLIFF workflows
As of May 2009: 487 billion gigabytes, or 1,000,000,000 × 487,000,000,000 = 4.87 × 10^20 bytes
Estimates:
Up 50% a year (Oracle)
Doubles every 11 hours (IBM)
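The arithmetic behind the May 2009 figure can be checked in a few lines. This is only a sketch: it assumes the decimal (SI) gigabyte of 10^9 bytes, and it reads the "up 50% a year" estimate as simple compound growth, which the slide does not spell out.

```python
import math

GIGABYTE = 10**9  # bytes per gigabyte; decimal (SI) definition is an assumption

# 487 billion gigabytes expressed in bytes
total_bytes = 487 * 10**9 * GIGABYTE
print(f"{total_bytes:.2e} bytes")

# Doubling time implied by 50% compound growth per year (assumed reading)
annual_growth = 0.50
doubling_years = math.log(2) / math.log(1 + annual_growth)
print(f"doubles roughly every {doubling_years:.2f} years")
```

Under that reading, 50% a year corresponds to a doubling time of about 1.7 years, which shows how far apart the two quoted estimates are.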
OBJECTIVES = CHALLENGES 2007 - 2010
Turn academic development (Moses) into a commercial application.
To provide high-quality MT for post-editing and save time and cost. Not Google-type broad translation, but domain-specific and user-centric.
Lower the entry level for MT. Bring affordability and user control / empowerment to MT. Bring it to the user, take it away from the programmer.
How? By fostering open-standard geared translation automation strategies.
To use only community-based open standards (OASIS / ISO: XLIFF, TMX, XML). NO proprietary formats (technology independence), so USERS are not “locked in” to buying and updating expensive software.
DIY SMT June 2011 http://t.co/HDTboxQ
The rush for data
We soon realised that there was a rush to gather data, but that other resources around the data were also necessary
cleaning
More cleaning
cleaning
More cleaning
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>A system for recovering the methane that is emitted from the manure so that
it does not leak into the atmosphere.</seg>
</tuv>
<tuv xml:lang="FR-FR">
<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel
d'origine animale de sorte qu'il ne se dissipe pas dans l'atm sphère.</seg>
</tuv>
<tuv xml:lang="EN-US">
<seg>On 22nd May we decided not to join the group.</seg>
<tuv xml:lang="DE-DE">
<seg>Am 22. </seg>
<tu srclang="en-GB">
<tuv xml:lang="EN-GB">
<seg>The President of the United States visited Costa Rica.</seg>
</tuv>
<tuv xml:lang="ES-ES">
<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora
Michelle, visitaron Costa Rica el pasado sábado.</seg>
</tuv>
cleaning
More cleaning
<tuv xml:lang="JP">
<seg>同書は「通訳・翻訳キャリアガイド」の2011-2012年度版。
英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg>
<tuv xml:lang="EN-US">
<seg>It is a journalistic point of view and strengths of the English-
language newspaper Japan Times. It includes a description of the exciting and
rewarding work of translation and interpretation, as well as the introduction of
consciousness and how to acquire the required professional skills. The road to
becoming a translator and interpreter also down to the actual work site, a
comprehensive guide to interpreting the reality of today'stranslation industry.
</seg>
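The kinds of problems shown in the cleaning examples (damaged characters, a truncated target, mismatched content) can be flagged with simple heuristics. The sketch below is an illustration only and not PangeaMT's actual cleaning pipeline: the function name `check_pair`, the threshold and the debris pattern are all hypothetical choices.

```python
import re

# Hypothetical threshold, chosen for illustration only.
MAX_LENGTH_RATIO = 3.0  # longer side / shorter side, in characters

# A letter immediately followed by a euro sign or the Unicode replacement
# character often signals encoding damage (an assumed heuristic).
DEBRIS = re.compile(r"[^\W\d_][€\ufffd]")


def check_pair(source: str, target: str) -> list[str]:
    """Return the reasons why a source/target segment pair looks suspicious."""
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return ["empty segment"]

    problems = []
    # Very different lengths suggest a truncated or padded segment.
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    if ratio > MAX_LENGTH_RATIO:
        problems.append("length mismatch")
    # Look for encoding debris on either side.
    if DEBRIS.search(src) or DEBRIS.search(tgt):
        problems.append("encoding debris")
    return problems


print(check_pair("On 22nd May we decided not to join the group.", "Am 22."))
print(check_pair("recovering the methane", "r€ pérer le méthane"))
```

With this threshold the Costa Rica pair would pass, because the Spanish side is only a little over twice as long as the English. Added or missing content of that kind needs checks beyond a length ratio.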
Domain       Translation   MT+PE
Automotive   400 wph       900 wph
Marketing    250 wph       450 wph
Software     350 wph       1,000 wph
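The productivity gain implied by these figures can be worked out directly. This sketch assumes "wph" means words per hour and that both columns were measured on comparable content, which the slide does not state.

```python
# (translation wph, MT + post-editing wph), copied from the figures above
rates = {
    "Automotive": (400, 900),
    "Marketing": (250, 450),
    "Software": (350, 1000),
}


def speedup(translation_wph: float, mtpe_wph: float) -> float:
    """Ratio of MT + post-editing throughput to plain translation throughput."""
    return mtpe_wph / translation_wph


for domain, (translation_wph, mtpe_wph) in rates.items():
    print(f"{domain}: {speedup(translation_wph, mtpe_wph):.2f}x")
```

On these numbers the gain ranges from 1.8x for marketing to roughly 2.9x for software.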
Domains are managed at TM and at engine level
I created this engine with medical and pharma TMX files and added environmental TMs to boost coverage. The client deals with plant-based natural drugs / ayurveda.
Tag-based TM selection
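One way to picture tag-based TM selection is a set of translation memories, each labelled with domain tags, from which an engine's training data is picked. The sketch below is a hypothetical illustration and not PangeaMT's interface: the class `TranslationMemory`, the function `select_tms` and the file names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslationMemory:
    name: str             # hypothetical TMX file name
    tags: frozenset[str]  # domain tags attached to this TM


def select_tms(tms, wanted_tags):
    """Return the TMs that carry at least one of the wanted domain tags."""
    wanted = set(wanted_tags)
    return [tm for tm in tms if tm.tags & wanted]


# Hypothetical TM collection
tms = [
    TranslationMemory("medical.tmx", frozenset({"medical"})),
    TranslationMemory("pharma.tmx", frozenset({"pharma"})),
    TranslationMemory("environment.tmx", frozenset({"environmental"})),
    TranslationMemory("automotive.tmx", frozenset({"automotive"})),
]

# Medical and pharma data, plus environmental TMs to boost coverage
selected = select_tms(tms, {"medical", "pharma", "environmental"})
print([tm.name for tm in selected])
```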
[Chart: User empowerment, 2010 to 2018]
• MT acceptance growth (still)
• Translator engagement challenge (being solved particularly with in-house translators & economic climate)
• Need for data is being addressed – still more work to be done.
• The difference will be made by data handling and MT techniques (hybrid, combination, syntax, re-ordering, etc.)
• Users and practitioners can now build their own systems, A TREND BEING FOLLOWED BY OTHER PLAYERS.
Until 2011/12
YEAR 2016
000's of customized MT systems
In 5 years... after 2017… where?
Tech. not the realm of a few providers
Ubiquitous MT
2009