Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
![Page 1: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/1.jpg)
IMPROVING HYBRID TRANSLATION TOOL
FULL-TEXT SEARCH ENGINE APPROACH
Lianet Sepulveda, Alexander Raginsky
Pangeanic
![Page 2: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/2.jpg)
Agenda
▪ Improving translation memory matching
▪ Natural Language Processing (NLP)
▪ Full-text search engine
▪ TM database
![Page 3: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/3.jpg)
Translation Memory (TM)
▪ Previous translations stored in a database and offered as suggestions
▪ A matching algorithm proposes relevant translations
▪ Exact match and fuzzy match
▪ Segment similarity based on characters or tokens
NLP to improve the matching algorithm
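A minimal sketch of token-based and character-based fuzzy matching, to make the exact/fuzzy distinction concrete. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure; this is an assumption for illustration, not the metric used in ElasticTM, so the scores will differ from those shown on the slides. The function names and the 0.7 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] computed over lowercased whitespace tokens."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def char_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] computed over lowercased characters."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query: str, tm: dict, threshold: float = 0.7):
    """Return (score, source, target) for the best TM entry, or None.

    `tm` maps source segments to target segments. A score of 1.0 is an
    exact match; anything between the threshold and 1.0 is a fuzzy match.
    """
    scored = [(token_similarity(query, src), src, tgt) for src, tgt in tm.items()]
    if not scored:
        return None
    best = max(scored, key=lambda item: item[0])
    return best if best[0] >= threshold else None


tm = {"I have 3 cats": "Yo tengo 3 gatos"}
print(best_match("I have 2 cats", tm))
```

Three of the four tokens line up in this example, so the token-level score is below 1.0 and the entry is returned as a fuzzy match.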
![Page 4: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/4.jpg)
Approach
➢ Statistical Machine Translation (SMT)
➢ Computer-Aided Translation (CAT) environment
Run maintenance
● Search and replace
● Update TM entries
● Import & export entries
Translation Memory
Improving TM entries
ElasticTM: full-text search engine + NLP techniques
![Page 5: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/5.jpg)
Improving TM Matching
Input segment:
{ "input_source": "I have 2 cats", "output_target": " " }
Original TM (fuzzy match):
{ "source_TM": "I have 3 cats", "target_TM": "Yo tengo 3 gatos", "score": "80%" }
TM after preprocessing (perfect match by substitution):
{ "source_TM": "I have <number> cats", "target_TM": "Yo tengo <number> gatos", "score": "100%" }
Other elements that can be substituted:
● URLs
● Emails
● Dates
● Units
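The substitution idea can be sketched as follows. This is an illustrative sketch only: the regular expression, the function names and the assumption that numbers appear in the same order in source and target are all hypothetical simplifications, not the ElasticTM implementation.

```python
import re

# Matches integers and simple decimals such as 3, 2.5 or 1,5.
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def mask_numbers(text: str):
    """Replace each number with <number>; return the masked text and the numbers."""
    values = NUMBER.findall(text)
    return NUMBER.sub("<number>", text), values


def translate_with_tm(source: str, tm: dict):
    """Look up the masked source in the TM and restore the input's numbers.

    `tm` maps masked source segments to masked target segments.
    Returns None when there is no perfect match after masking.
    """
    masked, values = mask_numbers(source)
    target = tm.get(masked)
    if target is None:
        return None
    remaining = iter(values)
    # Fill placeholders left to right; keep the placeholder if values run out.
    return re.sub(r"<number>", lambda m: next(remaining, m.group(0)), target)


tm = {"I have <number> cats": "Yo tengo <number> gatos"}
print(translate_with_tm("I have 2 cats", tm))  # Yo tengo 2 gatos
```

The same pattern extends to URLs, e-mails, dates and units by adding one placeholder type per category.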
![Page 6: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/6.jpg)
Improving TM Matching
Input segment:
{ "input_source": "I have a cat", "output_target": " " }
Original TM (fuzzy match):
{ "source_TM": "I have a cat and I am very happy", "target_TM": "Yo tengo un gato y estoy muy feliz", "score": "44%" }
TM after preprocessing (perfect match by substitution), with the original entry (target "Yo tengo un gato y estoy muy feliz") split into sub-segments:
{ "source_TM": "I have a cat", "target_TM": "Yo tengo un gato", "score": "100%" }
{ "source_TM": "I am very happy", "target_TM": "Estoy muy feliz" }
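One naive way to obtain such sub-segments is to split both sides of a TM entry at a coordinating conjunction and pair the pieces in order. The sketch below assumes a tiny hand-written conjunction list per language and assumes the pieces align one-to-one; both are hypothetical simplifications, and real alignment would need linguistic processing.

```python
import re

# Hypothetical, deliberately tiny conjunction patterns per language.
CONJUNCTIONS = {"en": r"\band\b", "es": r"\by\b"}


def split_on_conjunction(text: str, lang: str):
    """Split a segment at conjunctions and drop empty pieces."""
    parts = re.split(CONJUNCTIONS[lang], text, flags=re.IGNORECASE)
    return [part.strip() for part in parts if part.strip()]


def split_entry(source: str, target: str, src_lang: str = "en", tgt_lang: str = "es"):
    """Split a TM entry into aligned sub-entries.

    Falls back to the unsplit entry when the two sides do not yield
    the same number of pieces, since the pairing would be unreliable.
    """
    src_parts = split_on_conjunction(source, src_lang)
    tgt_parts = split_on_conjunction(target, tgt_lang)
    if len(src_parts) != len(tgt_parts):
        return [(source, target)]
    return list(zip(src_parts, tgt_parts))


entry = ("I have a cat and I am very happy", "Yo tengo un gato y estoy muy feliz")
for sub_source, sub_target in split_entry(*entry):
    print(sub_source, "->", sub_target)
```

The sketch does not restore capitalisation of the second target piece; that would be a separate post-processing step.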
![Page 7: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/7.jpg)
Improving TM Matching
▪ Several languages → maximise the reuse of existing human translations
▪ Linguistic features → improve fuzzy matching
▪ String transformation
▪ Segmentation rules
▪ POS tagger
▪ Tokenizer
[Diagram: languages EN, ES, PT, JA, ..., FR mapped to EN, ES, PT, JA, ..., FR]
![Page 8: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/8.jpg)
Improving TM Matching
Linguistic approach to improve match
● Segment the text by sentence
○ Using delimiters like . ? ! , - :
○ Limiting the total number of words
● Intra-sentence segmentation (Puscasu, 2004; Eriksson and Myhrman, 2010; Orasan, 2000)
○ Using conjunctions, adverbs, determiners, pronouns
○ Other approaches
● Replace segments
○ Numbers, dates, proper nouns and identifiers, URLs, e-mail addresses, punctuation marks, acronyms, variables
● POS pattern string
● Named entity recognition
[Diagram: ElasticTM, TMX files, source text]
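The first step, sentence segmentation with a cap on segment length, might look like the sketch below. It is an assumption-laden illustration: it splits only on `.`, `?` and `!` (the slide also lists `,`, `-` and `:`), the 20-word limit is an arbitrary default, and all names are hypothetical.

```python
import re

# Split after sentence-final punctuation that is followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.?!])\s+")


def split_sentences(text: str):
    """Split text into sentences, keeping the final punctuation mark."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def limit_words(sentence: str, max_words: int):
    """Cut a sentence into chunks of at most `max_words` words."""
    words = sentence.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def segment(text: str, max_words: int = 20):
    """Sentence-split the text, then enforce the word limit on each sentence."""
    segments = []
    for sentence in split_sentences(text):
        segments.extend(limit_words(sentence, max_words))
    return segments


print(segment("I have a cat. Are you happy? Yes!"))
```

A blind cut at the word limit can break a phrase in the middle, which is one motivation for the intra-sentence segmentation rules listed above.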
![Page 9: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/9.jpg)
ElasticTM
▪ TM database built from TMX files
▪ Based on a state-of-the-art full-text search engine
▪ Fast indexing, search and retrieval
▪ Supports advanced text retrieval techniques (fuzzy match, regular expressions)
▪ Easily scalable
▪ Role-based security
![Page 10: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/10.jpg)
ElasticTM - Intended Usage
▪ PangeaMT
  ▪ As a preprocessing step before Moses SMT
▪ PangeaCrawler
  ▪ Automatic website translation
▪ Plugins for CAT tools
  ▪ As an auxiliary tool for human translators
![Page 11: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/11.jpg)
ElasticTM - Design
[Diagram: ElasticTM components]
Search Engine: EN, ES, FR, ..., NL
Map DB: EN <-> ES, FR <-> ES, FR <-> NL, ...
![Page 12: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/12.jpg)
ElasticTM - Design (cont’d)
▪ Monolingual indices
  ▪ Memory-efficient
  ▪ Implicit transitive language pairs
▪ Bilingual mappings
  ▪ Fast bidirectional id <-> id mapping
▪ Role-based security system
  ▪ Admin, project admin, user etc.
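The design can be pictured with a small in-memory model: one index per language holding each segment once, plus id-to-id maps per language pair. The sketch below is one possible reading of "implicit transitive language pairs" (EN <-> ES and FR <-> ES together give EN <-> FR through the shared ES segment). The class and method names are hypothetical, and plain dictionaries stand in for the search engine and the map database.

```python
from collections import defaultdict


class TinyTM:
    """Toy model: monolingual indices plus bilingual id-to-id mappings."""

    def __init__(self):
        self.indices = defaultdict(dict)  # lang -> {segment_id: text}
        self.ids = defaultdict(dict)      # lang -> {text: segment_id}
        self.maps = defaultdict(dict)     # (src_lang, tgt_lang) -> {src_id: tgt_id}

    def _index(self, lang: str, text: str) -> int:
        """Store a segment once per language and return its id."""
        if text not in self.ids[lang]:
            segment_id = len(self.ids[lang])
            self.ids[lang][text] = segment_id
            self.indices[lang][segment_id] = text
        return self.ids[lang][text]

    def add_pair(self, src_lang: str, src: str, tgt_lang: str, tgt: str) -> None:
        """Index both segments and link their ids in both directions."""
        src_id, tgt_id = self._index(src_lang, src), self._index(tgt_lang, tgt)
        self.maps[(src_lang, tgt_lang)][src_id] = tgt_id
        self.maps[(tgt_lang, src_lang)][tgt_id] = src_id

    def translate(self, src_lang: str, src: str, tgt_lang: str, pivot: str = None):
        """Follow the id mappings directly, or through a pivot language."""
        segment_id = self.ids[src_lang].get(src)
        if segment_id is None:
            return None
        hops = [src_lang, tgt_lang] if pivot is None else [src_lang, pivot, tgt_lang]
        for a, b in zip(hops, hops[1:]):
            segment_id = self.maps[(a, b)].get(segment_id)
            if segment_id is None:
                return None
        return self.indices[tgt_lang][segment_id]


tm = TinyTM()
tm.add_pair("en", "I have a cat", "es", "Yo tengo un gato")
tm.add_pair("fr", "J'ai un chat", "es", "Yo tengo un gato")
print(tm.translate("en", "I have a cat", "fr", pivot="es"))  # J'ai un chat
```

Because the Spanish segment is stored only once, both pairs share it, which is where the memory saving of monolingual indices comes from in this toy model.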
![Page 13: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/13.jpg)
ElasticTM - Search Engine
▪ Considered Lucene-based search engines:
  ▪ Solr and ElasticSearch
  ▪ Mature open source projects
  ▪ Have similar capabilities & performance
▪ ElasticSearch was picked mainly because of:
  ▪ Out-of-the-box scalability
  ▪ Powerful Query DSL (query language)
  ▪ Role-based security (via plugin)
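As an illustration of the Query DSL, the sketch below builds the JSON body of a fuzzy `match` query for a segment. The `match` query with the `fuzziness` and `minimum_should_match` options is part of the ElasticSearch Query DSL; the field name `text`, the 75% setting and the function name are assumptions made for this example, not the ElasticTM configuration. The body would be sent to the `_search` endpoint of a monolingual index.

```python
import json


def fuzzy_segment_query(segment: str, field: str = "text", size: int = 5) -> dict:
    """Build a search body that tolerates small edits in the query terms."""
    return {
        "size": size,  # number of candidate TM segments to return
        "query": {
            "match": {
                field: {
                    "query": segment,
                    "fuzziness": "AUTO",            # edit distance chosen per term length
                    "minimum_should_match": "75%",  # most terms must match
                }
            }
        },
    }


print(json.dumps(fuzzy_segment_query("I have 2 cats"), indent=2))
```

The relevance score returned by the engine is not a TM match percentage, so candidates would still need rescoring with a TM-style similarity measure.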
![Page 14: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/14.jpg)
ElasticTM - Map
▪ Maps source language segments to a target language
▪ Bidirectional map (id to id)
▪ Supports quick bulk incremental updates
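A bidirectional id-to-id map with bulk upserts can be sketched with two dictionaries kept in sync. This is a toy in-memory version under a one-to-one assumption (each source id maps to exactly one target id); the real map is backed by a database, and the class and method names here are hypothetical.

```python
class BiMap:
    """One-to-one bidirectional map between source ids and target ids."""

    def __init__(self):
        self.forward = {}   # src_id -> tgt_id
        self.backward = {}  # tgt_id -> src_id

    def upsert(self, src_id, tgt_id) -> None:
        """Insert a link, replacing any stale link on either side."""
        old_tgt = self.forward.pop(src_id, None)
        if old_tgt is not None:
            self.backward.pop(old_tgt, None)
        old_src = self.backward.pop(tgt_id, None)
        if old_src is not None:
            self.forward.pop(old_src, None)
        self.forward[src_id] = tgt_id
        self.backward[tgt_id] = src_id

    def bulk_upsert(self, pairs) -> None:
        """Apply many links at once; later pairs win over earlier ones."""
        for src_id, tgt_id in pairs:
            self.upsert(src_id, tgt_id)


en_es = BiMap()
en_es.bulk_upsert([(1, 10), (2, 20), (1, 11)])  # the link for id 1 is updated
print(en_es.forward, en_es.backward)
```

Repeating a pair is harmless in this sketch, which is the behaviour an incremental import of overlapping TMX files would need.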
![Page 15: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/15.jpg)
ElasticTM - Map - Alternatives
▪ NoSQL key-value databases
  ▪ MongoDB
  ▪ CouchDB
  ▪ Redis
  ▪ ElasticSearch
  ▪ … many others …
▪ SQL databases
  ▪ MySQL
  ▪ PostgreSQL
Lack of upsert support for bulk updates
Handling duplicate entries
Scalability
![Page 16: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/16.jpg)
ElasticTM - Map - Benchmarks
[Chart: time (s) and memory (MB) per database; the lower, the better]
![Page 17: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/17.jpg)
ElasticTM - Map - Benchmarks
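| Operation | ElasticSearch | MongoDB | CouchDB | Redis |
|---|---|---|---|---|
| Add (47K) | 83 s | 432 s | 67 s | 458 s |
| Add (440K) | 858 s | 6112 s | 644 s | 621 s |
| Query (10K) | 51 s | 187 s | 458 s | 72 s |
| Query (440K) | 1400 s | 6451 s | 19647 s | 1210 s |
| Memory | 252 MB | 549 MB | 771 MB | 148 MB |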
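To compare the backends on a common scale, the "Add (440K)" timings can be converted into throughput. The sketch below only re-expresses the reported figures; it assumes that 440K means 440,000 entries, and the variable and function names are hypothetical.

```python
# Reported "Add (440K)" times in seconds, copied from the benchmark figures.
ADD_440K_SECONDS = {
    "ElasticSearch": 858,
    "MongoDB": 6112,
    "CouchDB": 644,
    "Redis": 621,
}
ENTRIES = 440_000  # assumption: "440K" means 440,000 entries


def entries_per_second(seconds_by_backend: dict, entries: int = ENTRIES) -> dict:
    """Convert total insertion time into entries inserted per second."""
    return {name: entries / seconds for name, seconds in seconds_by_backend.items()}


rates = entries_per_second(ADD_440K_SECONDS)
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.0f} entries/s")
```

Throughput alone does not decide the choice; the query timings and memory footprint in the same benchmark pull in different directions for some backends.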
![Page 18: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/18.jpg)
ElasticTM - Scaling
[Diagram: two cluster layouts]
1) Cluster: EN, ES, ...; EN-ES1, EN-ES2
2) Cluster: EN1, ES1, ES2, EN2, ...; EN-ES1, EN-ES2
![Page 19: Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic](https://reader038.fdocuments.net/reader038/viewer/2022100800/58ee878c1a28abb1638b463f/html5/thumbnails/19.jpg)
Status & Plans
▪ Status
  ▪ Benchmarked alternatives
  ▪ Implemented and tested prototype
  ▪ Analyzed feasibility of linguistic methods
▪ Plans
  ▪ Build & scale ElasticTM to cover all available TMs in Pangeanic
  ▪ Implement plugins for CAT tools
  ▪ Develop linguistic processing for major language pairs