Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic

IMPROVING HYBRID TRANSLATION

FULL-TEXT SEARCH ENGINE APPROACH

Lianet Sepulveda, Alexander Raginsky

Pangeanic

▪ Improving translation memory matching

▪ Natural Language Processing (NLP)

▪ Full-text search engine

▪ TM database

Agenda

Translation Memory (TM)

▪ Pre-translations stored in a database and offered as suggestions

▪ A matching algorithm proposes relevant translations

▪ exact match and fuzzy match

▪ segment similarity based on characters or tokens
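
To make the exact/fuzzy distinction concrete, here is a minimal sketch of token-based matching. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; the function names, the threshold and the tiny TM are assumptions for illustration, and the resulting scores will not coincide with the percentages a production TM system reports.

```python
from difflib import SequenceMatcher

def fuzzy_score(query: str, candidate: str) -> float:
    """Token-level similarity in [0, 1]; 1.0 means an exact match (ignoring case)."""
    q, c = query.lower().split(), candidate.lower().split()
    return SequenceMatcher(None, q, c).ratio()

def best_match(query, tm, threshold=0.7):
    """Return (score, source, target) for the closest TM entry, or None below the threshold."""
    scored = [(fuzzy_score(query, src), src, tgt) for src, tgt in tm.items()]
    score, src, tgt = max(scored)
    return (score, src, tgt) if score >= threshold else None

# Hypothetical two-entry TM, just for the demonstration.
tm = {"I have 3 cats": "Yo tengo 3 gatos", "I am very happy": "Estoy muy feliz"}
print(best_match("I have 2 cats", tm))
```

Switching from tokens to characters only requires passing the raw strings to `SequenceMatcher` instead of the token lists.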

NLP to improve the matching algorithm

Approach

➢ Statistical Machine Translation (SMT)

➢ Computer-Aided Translation (CAT) environment

Run maintenance

● Search and replace

● Update TM entries

● Import & export entries

Translation Memory

Improving TM entries

ElasticTM

Full-text search engine

NLP techniques

Improving TM Matching

perfect match by substitution

fuzzy match

{ "source_TM" : "I have 3 cats", "target_TM" : "Yo tengo 3 gatos", "score" : "80%" }

{ "source_TM" : "I have <number> cats", "target_TM" : "Yo tengo <number> gatos", "score" : "100%" }

Original TM

{ "input_source": "I have 2 cats", "output_target": " " }

TM after preprocessing

● URLs

● Emails

● Dates

● Units
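
A minimal sketch of "perfect match by substitution", assuming only numbers are generalized (URLs, e-mails, dates and units would need their own patterns). The regular expression, the function names and the dictionary-based TM are illustrative assumptions, not the actual implementation.

```python
import re

# Assumed pattern: integers or decimals such as 3, 3.5 or 3,5.
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")

def generalize(segment):
    """Replace each number with <number>; return the template and the numbers found."""
    values = NUMBER.findall(segment)
    return NUMBER.sub("<number>", segment), values

def translate_by_substitution(source, tm_templates):
    """Look up the generalized source and refill the placeholders in the target."""
    template, values = generalize(source)
    target = tm_templates.get(template)
    if target is None:
        return None
    # Refill left to right; assumes numbers keep their order in the target language.
    for value in values:
        target = target.replace("<number>", value, 1)
    return target

tm_templates = {"I have <number> cats": "Yo tengo <number> gatos"}
print(translate_by_substitution("I have 2 cats", tm_templates))  # Yo tengo 2 gatos
```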

fuzzy match

{ “source_TM” : “I have a cat and I am very happy”, “target_TM” : “Yo tengo un gato y estoy muy feliz”, “score” : “44%” }

{ "target_TM" : "Yo tengo un gato y estoy muy feliz" }
{ "source_TM" : "I have a cat", "target_TM" : "Yo tengo un gato", "score" : "100%" }
{ "source_TM" : "I am very happy", "target_TM" : "Estoy muy feliz" }

Original TM

{ "input_source": "I have a cat", "output_target": " " }

TM after preprocessing

perfect match by substitution

▪ Several languages → maximise the reuse of existing human translations

▪ Linguistic features → improve fuzzy matching

▪ string transformation

▪ segmentation rules

▪ POS tagger

▪ tokenizer

Linguistic approach to improve match

● Segment the text by sentence

○ Using delimiters like . ? ! , - :

○ Limiting the total number of words

● Intra-sentence segmentation

○ Using conjunctions, adverbs, determiners, pronouns

○ Other approaches

● Replace segments

○ Numbers, dates, proper nouns and identifiers, URLs, e-mail addresses, punctuation marks, acronyms, variables.

● POS pattern string

● Named entity recognition
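
The two segmentation steps above can be sketched as follows. This is a rough illustration under stated assumptions: sentences are split only after `.`, `?` and `!`, the conjunction list is a hypothetical English-only sample, and `min_words` is an invented parameter; a real system would rely on a POS tagger and language-specific rules.

```python
import re

SENTENCE_END = re.compile(r"(?<=[.?!])\s+")
CONJUNCTIONS = {"and", "but", "or"}  # hypothetical sample, English only

def split_sentences(text):
    """Split text after sentence-final punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]

def split_on_conjunctions(sentence, min_words=3):
    """Cut a sentence at conjunctions, but only once the chunk before
    the conjunction has at least min_words words."""
    words = sentence.rstrip(".?!").split()
    chunks, current = [], []
    for word in words:
        if word.lower() in CONJUNCTIONS and len(current) >= min_words:
            chunks.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

print(split_sentences("I have a cat. Do you? Yes!"))
print(split_on_conjunctions("I have a cat and I am very happy."))
```

Each resulting sub-segment can then be looked up in the TM on its own, which is how "I have a cat" can reach a 100% match even though it was stored as part of a longer sentence.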

ElasticTM

source text

(Puscasu, 2004; Eriksson and Myhrman, 2010; Orasan, 2000)

▪ TM database built from TMX files

▪ Based on a state-of-the-art full-text search engine

▪ Fast indexing, search and retrieval

▪ Supports advanced text retrieval techniques (fuzzy match, regular expressions)

▪ Easily scalable

▪ Role-based security

ElasticTM

▪ PangeaMT

▪ As a preprocessing step before Moses SMT

▪ PangeaCrawler

▪ Automatic website translation

▪ Plugins for CAT tools

▪ As an auxiliary tool for human translators

ElasticTM - Intended Usage

ElasticTM - Design

[Diagram: ElasticTM = Search Engine (monolingual indices EN, ES, FR, ..., NL) + Map DB]

ElasticTM - Design (cont’d)

▪ Monolingual indices

▪ Memory-efficient

▪ Implicit transitive language pairs

▪ Bilingual mappings

▪ Fast bidirectional id <-> id mapping

▪ Role-based security system

▪ Admin, project admin, user etc.

▪ Considered Lucene-based search engines:

▪ Solr and ElasticSearch

▪ Mature open source projects

▪ Have similar capabilities & performance

▪ ElasticSearch was picked mainly because of:

▪ Out-of-the-box scalability

▪ Powerful Query DSL (query language)

▪ Role-based security (via plugin)
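
For a flavour of the Query DSL, the sketch below builds two request bodies as plain Python dictionaries: a `match` query with `fuzziness` and a `regexp` query. The field name `text` and the helper names are assumptions; the slides do not show the actual index schema or queries. Note that `fuzziness` here is a per-term edit distance and `regexp` is applied to indexed terms, so neither is by itself a segment-level fuzzy-match score.

```python
import json

def fuzzy_segment_query(field, text, size=5):
    """Body for a match query that tolerates small per-term typos."""
    return {
        "size": size,
        "query": {"match": {field: {"query": text, "fuzziness": "AUTO"}}},
    }

def regexp_segment_query(field, pattern):
    """Body for a regexp query on the given field."""
    return {"query": {"regexp": {field: {"value": pattern}}}}

# "text" is a hypothetical field holding the segment string.
print(json.dumps(fuzzy_segment_query("text", "I have a cat"), indent=2))
print(json.dumps(regexp_segment_query("text", "cat(s)?"), indent=2))
```

Either body could be sent to the `_search` endpoint of a monolingual index.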

ElasticTM - Search Engine

▪ Mapping source language segments to a target language

▪ Bidirectional map (id to id)

▪ Supports quick bulk incremental updates
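
A minimal in-memory sketch of such a map, assuming segment ids are plain strings and one source segment may map to several targets. The class and method names are hypothetical; the real map lives in a database, but the sketch shows the two properties listed above: lookups in both directions and bulk updates in which repeated pairs are harmless.

```python
from collections import defaultdict

class SegmentMap:
    """In-memory stand-in for a bidirectional source-id <-> target-id map."""

    def __init__(self):
        self.forward = defaultdict(set)   # source id -> target ids
        self.backward = defaultdict(set)  # target id -> source ids

    def bulk_upsert(self, pairs):
        """Add many (source_id, target_id) pairs; pairs already present are
        skipped. Returns the number of new pairs."""
        added = 0
        for source_id, target_id in pairs:
            if target_id not in self.forward[source_id]:
                self.forward[source_id].add(target_id)
                self.backward[target_id].add(source_id)
                added += 1
        return added

    def targets(self, source_id):
        return set(self.forward.get(source_id, ()))

    def sources(self, target_id):
        return set(self.backward.get(target_id, ()))

en_es = SegmentMap()
en_es.bulk_upsert([("en-1", "es-7"), ("en-2", "es-8")])
en_es.bulk_upsert([("en-1", "es-7"), ("en-3", "es-7")])  # first pair is a duplicate
print(en_es.targets("en-1"), en_es.sources("es-7"))
```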

ElasticTM - Map

▪ NoSQL key-value databases

▪ MongoDB

▪ CouchDB

▪ Redis

▪ ElasticSearch

▪ … many others …

▪ SQL databases

▪ MySQL

▪ PostgreSQL

ElasticTM - Map - Alternatives

Lack of upsert support for bulk updates

Handling duplicate entries

Scalability

ElasticTM - Map - Benchmarks

[Chart: time (s) and memory (MB) per database; lower is better]

ElasticTM - Map - Benchmarks

               ElasticSearch   MongoDB   CouchDB   Redis
Add (47K)      83s             432s      67s       458s
Add (440K)     858s            6112s     644s      621s
Query (10K)    51s             187s      458s      72s
Query (440K)   1400s           6451s     19647s    1210s
Memory         252M            549M      771M      148M
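
To read the table at a glance, the short sketch below picks the lowest value in each row. The figures are copied from the table; the two Query (440K) values that appear without a unit are assumed to be seconds like the rest of the row.

```python
SYSTEMS = ["ElasticSearch", "MongoDB", "CouchDB", "Redis"]

# Values in table order; 6451 and 19647 are assumed to be seconds.
BENCHMARKS = {
    "Add (47K), s": [83, 432, 67, 458],
    "Add (440K), s": [858, 6112, 644, 621],
    "Query (10K), s": [51, 187, 458, 72],
    "Query (440K), s": [1400, 6451, 19647, 1210],
    "Memory, MB": [252, 549, 771, 148],
}

def best_per_row(benchmarks, systems):
    """Map each benchmark row to the system with the lowest value."""
    return {row: systems[values.index(min(values))]
            for row, values in benchmarks.items()}

for row, winner in best_per_row(BENCHMARKS, SYSTEMS).items():
    print(f"{row}: {winner}")
```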

ElasticTM - Scaling

[Diagram: scaling across clusters, with monolingual indices (EN, ES, ...) and bilingual map shards (EN-ES1, EN-ES2)]

▪ Status

▪ Benchmarked alternatives

▪ Implemented and tested prototype

▪ Analyzed feasibility of linguistic methods

▪ Plans

▪ Build & scale ElasticTM to cover all available TMs in Pangeanic

▪ Implement plugins for CAT tools

▪ Develop linguistic processing for major language pairs

Status & Plans
