TAUS MT SHOWCASE, MT & Terminology Better Together, Andrejs Vasiljevs, 10 October 2013

Andrejs Vasiļjevs, chairman of the board. Localization World, Santa Clara, October 9, 2013. MT & Terminology: better together

description

This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme. For the latest updates go to http://www.statmt.org/mosescore/ or follow us on Twitter - #MosesCore

Transcript of TAUS MT SHOWCASE, MT & Terminology Better Together, Andrejs Vasiljevs, 10 October 2013

Page 1: TAUS MT SHOWCASE, MT & Terminology Better Together, Andrejs Vasiljevs, 10 October 2013

Andrejs Vasiļjevs
chairman of the board

Localization World, Santa Clara
October 9, 2013

MT & Terminology: better together

Page 2:

• Language technology developer
• Localization service provider
• Leadership in smaller languages
• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)
• 130 employees
• Strong R&D team
• 5 PhDs, 80+ research papers
• Trusted partner of the EU for significant R&D projects

Page 3:

data challenge

Page 4:

platform challenge

[ttable-file]
0 0 5 /.../unfactored/model/phrase-table.0-0.gz

% ls steps/1/LM_toy_tokenize.1* | cat
steps/1/LM_toy_tokenize.1
steps/1/LM_toy_tokenize.1.DONE
steps/1/LM_toy_tokenize.1.INFO
steps/1/LM_toy_tokenize.1.STDERR
steps/1/LM_toy_tokenize.1.STDERR.digest
steps/1/LM_toy_tokenize.1.STDOUT

% train-model.perl \
    --corpus factored-corpus/proj-syndicate \
    --root-dir unfactored \
    --f de --e en \
    --lm 0:3:factored-corpus/surface.lm:0

% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz"

use-berkeley = true
alignment-symmetrization-method = berkeley
berkeley-train = $moses-script-dir/ems/support/berkeley-train.sh
berkeley-process = $moses-script-dir/ems/support/berkeley-process.sh
berkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jar
berkeley-java-options = "-server -mx30000m -ea"
berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"
berkeley-process-options = "-EMWordAligner.numThreads 8"
berkeley-posterior = 0.5

tokenize
    in: raw-stem
    out: tokenized-stem
    default-name: corpus/tok
    pass-unless: input-tokenizer output-tokenizer
    template-if: input-tokenizer IN.$input-extension OUT.$input-extension
    template-if: output-tokenizer IN.$output-extension OUT.$output-extension
    parallelizable: yes

working-dir = /home/pkoehn/experiment
wmt10-data = $working-dir/data

Page 5:

customization challenge

Page 6:

do-it-yourself MT factory on the cloud

Page 7:

• Automated training of SMT systems from specified collections of data
• Repository of parallel and monolingual corpora
• Based on the open-source MT tools GIZA++ and Moses
• Services for data collection, MT generation, customization and running of a variety of user-tailored MT systems

Page 10:

Training Data Provided

Page 11:

Platform Architecture

Diagram labels recovered from the slide:
• Training
• Using
• Sharing of training data
• Giza++
• Moses SMT toolkit
• Moses decoder
• SMT Resource Repository
• SMT Multi-Model Repository (trained SMT models)
• SMT Resource Directory
• SMT System Directory
• Processing, Evaluation ...
• Upload
• Anonymous access
• Authenticated access
• System management, user authentication, access rights control ...
• Web page
• Web service
• Web page translation widget
• CAT tools
• Web browser plug-ins

Page 13:

• Integration with CAT tools
• Integration in web pages
• Integration in web browsers
• API-level integration

Page 14:

integration

Page 15:

• Training data on the LetsMT! platform
• 119 languages
• 2.1 B parallel units in total
• 253 language pairs
• 860 corpora
• 249 production MT systems currently on the platform

Page 16:

General Domain MT
English – Lithuanian

DATA
5.3 M parallel sentences
81 M monolingual sentences

QUALITY
LetsMT – 26.65 BLEU
Google – 25.85 BLEU

Beating Google Translate
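To put the two BLEU scores on this slide side by side, the gap can be worked out directly. This is only arithmetic on the quoted figures; the variable names are illustrative, and nothing here says anything about statistical significance or the test set used.

```python
# BLEU scores quoted on the slide (English-Lithuanian, general domain).
letsmt_bleu = 26.65
google_bleu = 25.85

# Absolute difference in BLEU points and the same gap relative to Google's score.
absolute_gain = letsmt_bleu - google_bleu
relative_gain = absolute_gain / google_bleu * 100

print(f"absolute gain: {absolute_gain:.2f} BLEU points")
print(f"relative gain: {relative_gain:.1f}%")
```

The gap is under one BLEU point, so it is best read as "on par or slightly ahead" rather than as a large margin.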

Page 17:

• MT service for e-Government
• Mobile Translation app
• Desktop Translation Tool

Page 18:

Productivity

► Average translation productivity:
    ► Baseline with TM only: 550 w/h
    ► With TM and MT: 731 w/h
► 32.9% productivity increase
► High variability in individual performance
► Increase of error score from 20.2 to 28.6 points, but still at the level “GOOD” (<30 points)

Productivity increase by language (chart labels): Czech 25.1%, Polish 28.5%, Latvian 32.9%
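The headline 32.9% follows directly from the two words-per-hour figures on this slide. The short check below reproduces it and also compares the error score against the "GOOD" threshold; the variable names are illustrative and the only inputs are the numbers quoted on the slide.

```python
# Words per hour quoted on the slide.
baseline_wph = 550   # translation memory (TM) only
with_mt_wph = 731    # TM plus machine translation

# Relative productivity increase in percent.
productivity_increase = (with_mt_wph - baseline_wph) / baseline_wph * 100

# Error scores quoted on the slide; below 30 points still counts as "GOOD".
error_before = 20.2
error_after = 28.6
good_threshold = 30.0

error_increase = error_after - error_before
still_good = error_after < good_threshold

print(f"productivity increase: {productivity_increase:.1f}%")
print(f"error score increase: {error_increase:.1f} points, still GOOD: {still_good}")
```

The speed gain comes with a quality cost of 8.4 error points, which leaves only a small margin below the threshold.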

Page 19:

How to instruct SMT to use the right terms?
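One generic way to answer this question with the Moses decoder is its XML input markup: a source phrase can be wrapped in a tag carrying a `translation` attribute, and the decoder started with `-xml-input exclusive` will use that translation for the span. The sketch below only prepares such input. It is an assumption for illustration, not necessarily the method used on the platform described in this talk; `mark_terms`, the `term` tag name, and the glossary entries are all hypothetical.

```python
import re
from typing import Dict
from xml.sax.saxutils import escape, quoteattr


def mark_terms(sentence: str, glossary: Dict[str, str]) -> str:
    """Wrap glossary terms in Moses-style XML markup with a forced translation."""
    if not glossary:
        return escape(sentence)

    # Longest terms first, so a multi-word term wins over a shorter term inside it.
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {t.lower(): glossary[t] for t in terms}

    pieces = []
    position = 0
    for match in pattern.finditer(sentence):
        # Text between matches is XML-escaped and copied through unchanged.
        pieces.append(escape(sentence[position:match.start()]))
        target = lookup[match.group(1).lower()]
        pieces.append(
            f"<term translation={quoteattr(target)}>{escape(match.group(1))}</term>"
        )
        position = match.end()
    pieces.append(escape(sentence[position:]))
    return "".join(pieces)


# Illustrative glossary entries, not taken from the talk.
glossary = {"brake pad": "bremžu klucis", "brake": "bremze"}
print(mark_terms("replace the brake pad", glossary))
```

Forcing a fixed translation ignores inflection, which matters for morphologically rich languages such as Latvian, so a production setup would likely need more than this plain substitution.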

Page 20:

terminology as a service

Page 21:

cloud-based platform for acquiring, cleaning, sharing, and reusing multilingual terminological data

Page 22:

TaaS Services

Page 23:

Term identification and annotation

Page 24:

Identifying and marking terms

Diagram labels recovered from the slide:
• TaaS Terminology Services
• Terminology Annotation Web Service API
• Showcase Web Page
• export / visualisation
• Plaintext
• Term-annotated content
• ITS 2.0 enriched content
• ITS 2.0 term-annotated content
• Human users (e.g., translators, terminologists)
• Machine users: CAT Tools, MT Systems

• New W3C standard: Internationalization Tag Set (ITS) 2.0
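As a rough illustration of what term-annotated content can look like, ITS 2.0 defines local HTML attributes for its Terminology data category: `its-term`, `its-term-info-ref`, and `its-term-confidence`, with `its-annotators-ref` naming the tool that produced a confidence value. The helper below is a minimal sketch that emits one annotated `<span>`; the function name, the example URLs, and the choice of a `<span>` wrapper are assumptions, not the output format of the service described here.

```python
from html import escape
from typing import Optional


def its_term_span(
    text: str,
    info_ref: Optional[str] = None,
    confidence: Optional[float] = None,
    annotator: Optional[str] = None,
) -> str:
    """Return an HTML span marking `text` as a term with ITS 2.0 attributes."""
    attrs = ['its-term="yes"']
    if info_ref is not None:
        # Link to more information about the term, e.g. a term base entry.
        attrs.append(f'its-term-info-ref="{escape(info_ref, quote=True)}"')
    if confidence is not None:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if annotator is None:
            # ITS 2.0 expects the producing tool to be named when confidence is given.
            raise ValueError("annotator is required when confidence is given")
        attrs.append(f'its-term-confidence="{confidence}"')
        attrs.append(
            f'its-annotators-ref="terminology|{escape(annotator, quote=True)}"'
        )
    return f"<span {' '.join(attrs)}>{escape(text)}</span>"


# Hypothetical URLs, used only to show the shape of the output.
print(
    its_term_span(
        "machine translation",
        info_ref="https://example.org/terms/42",
        confidence=0.9,
        annotator="https://example.org/annotator",
    )
)
```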

Page 25:

HTML Term Annotation

Term entries for terms identified in EuroTermBank are stored in TBX format in a <script> element that is placed in the HTML5 document.
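The idea of carrying term entries inside the page can be sketched as follows. The code builds a bare TBX-style skeleton (`martif` / `text` / `body` / `termEntry` / `langSet` / `tig` / `term`) and wraps it in a `<script>` element. The `type="text/xml"` value, the `id`, the entry identifier, and the sample terms are assumptions for illustration; the slide does not show the exact markup, and a valid TBX file would also need a `martifHeader`.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable

# ElementTree serializes this namespace with the reserved "xml" prefix (xml:lang).
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def term_entry(entry_id: str, translations: Dict[str, str]) -> ET.Element:
    """Build one termEntry with a langSet per language code."""
    entry = ET.Element("termEntry", {"id": entry_id})
    for lang, term in translations.items():
        lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
        tig = ET.SubElement(lang_set, "tig")
        ET.SubElement(tig, "term").text = term
    return entry


def tbx_script(entries: Iterable[ET.Element], script_id: str = "tbx-terms") -> str:
    """Wrap term entries in a TBX-style skeleton inside an HTML script element."""
    martif = ET.Element("martif", {"type": "TBX", XML_LANG: "en"})
    body = ET.SubElement(ET.SubElement(martif, "text"), "body")
    body.extend(entries)
    xml = ET.tostring(martif, encoding="unicode")
    # The script type and id are placeholders chosen for this sketch.
    return f'<script type="text/xml" id="{script_id}">{xml}</script>'


# Illustrative entry, not taken from EuroTermBank.
entry = term_entry("t1", {"en": "brake pad", "lv": "bremžu klucis"})
print(tbx_script([entry]))
```

Annotated terms in the page body could then point at an entry through a fragment reference such as `#t1`, for example from an `its-term-info-ref` attribute.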

Page 26:

XLIFF Term Annotation

Page 29:

Narrow Domain Automotive MT
English – Latvian

DATA
2 M unique parallel sentences
1.9 M monolingual sentences
0.2 M in-domain monolingual

QUALITY
16% improvement from terminology integration

Beating Google Translate

Page 30:

synergy of machine translation and terminology services on the cloud

Page 31:

tilde.com

The research within the projects LetsMT! and TaaS has received funding from the European Commission ICT Policy Support Programme (ICT PSP) and FP7 Programme

thank you