![Page 1: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/1.jpg)
Introduction to cross-lingual word-embeddings
Diego Sáez-Trumper, Wikimedia Research
September 2019
![Page 2: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/2.jpg)
![Page 3: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/3.jpg)
What you will learn
● What is a word embedding
● How can you use cross-lingual word embeddings
● What you can’t do with word embeddings
![Page 4: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/4.jpg)
Embeddings
Taking a set in one domain and representing it in another domain while preserving some notion of distance.
Examples:
● Words → Vectors; distance: word meaning
● Documents → Vectors; distance: document topic
● Images → Vectors; distance: image content
![Page 5: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/5.jpg)
Word Embeddings
● Transformation: Words → Vectors
● Distance to preserve: Semantic
− Words that have similar meaning should be close in the vector space
− Toy Example:
■ Cat → [0.8, 0]
■ Tiger → [0.9, 0]
■ Car → [0, 0.6]
■ Truck → [0, 0.8]
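The toy example can be checked numerically. The sketch below uses the four vectors from the slide and cosine similarity as the closeness measure; the choice of cosine similarity (rather than, say, Euclidean distance) is an assumption made for illustration, and the function name is ours.

```python
import math

# Toy 2-D vectors copied from the slide.
vectors = {
    "cat": [0.8, 0.0],
    "tiger": [0.9, 0.0],
    "car": [0.0, 0.6],
    "truck": [0.0, 0.8],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 = same direction, 0 = orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Animals point the same way, vehicles point the same way,
# and the two groups are orthogonal to each other.
for a, b in [("cat", "tiger"), ("car", "truck"), ("cat", "car")]:
    print(a, b, round(cosine_similarity(vectors[a], vectors[b]), 3))
```

In this toy space the two animals are as similar as they can be, and so are the two vehicles, while an animal and a vehicle share nothing.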
![Page 6: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/6.jpg)
Sentence Embedding:
● Similarly, we can transform a sentence into a vector
− ‘This is a sentence’→ [x0, x1]
− Toy example:
● ‘This is a great day’ -> [0.8, 0]
● ‘This is a beautiful day’ -> [0.7, 0]
● ‘Open Source is great’ -> [0, 0.8]
![Page 7: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/7.jpg)
![Page 8: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/8.jpg)
Difference with other approaches
● Word embeddings are not directly based on string similarity
− For embeddings, cat and car are very different
● For string similarity, take a look at edit-distance metrics such as Levenshtein distance
● Embeddings won't be script dependent!
○ You can use any script (Latin, Cyrillic, Arabic, etc.)
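To make the contrast concrete, here is a minimal sketch of Levenshtein distance in plain Python (the function name is ours). From the string point of view, cat is one edit away from car and much further from tiger, which is the opposite of what a semantic embedding should say.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca by cb (or keep it)
            ))
        previous = current
    return previous[-1]

print(levenshtein("cat", "car"))    # close as strings, far in meaning
print(levenshtein("cat", "tiger"))  # far as strings, close in meaning
```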
![Page 9: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/9.jpg)
Doing math with words (Example for an ideal embedding)
Check this app: https://rare-technologies.com/word2vec-tutorial/
King - Man + Woman = Queen
Paris - France + Portugal = Lisbon
Eat + Past = Ate
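The arithmetic can be sketched with a hand-made embedding. Everything below is hypothetical: the 2-D vectors are invented so that one axis roughly means "royalty" and the other "gender", and the `analogy` helper is ours. Real embeddings only approximate this behaviour.

```python
import math

# Hypothetical 2-D embedding: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.2],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

print(analogy("king", "man", "woman"))
```

Excluding the three input words from the candidates is a common convention, since otherwise one of them is often the nearest neighbour of the result.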
![Page 10: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/10.jpg)
Relationships between entities
https://rare-technologies.com/word2vec-tutorial/
![Page 11: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/11.jpg)
Relationships between entities (ii)
https://rare-technologies.com/word2vec-tutorial/
![Page 12: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/12.jpg)
Get similar concepts
https://rare-technologies.com/word2vec-tutorial/
![Page 13: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/13.jpg)
Embeddings are far from perfect!
https://rare-technologies.com/word2vec-tutorial/
![Page 14: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/14.jpg)
Warning!
● Word embeddings are corpus dependent!
If you train your embeddings on a news dataset, don’t expect them to work well on Wikipedia!
![Page 15: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/15.jpg)
How are word embeddings computed?
You shall know a word by the company it keeps [Firth, 1957]
Words that occur in similar contexts tend to have similar meanings [Harris, 1954]
Predicting words by the context (CBOW):
‘yesterday was a really [...] day’ → Strong candidates: ‘nice’, ‘beautiful’
Less probable: ‘delightful’
Given a word, predict the context (Skip-gram):
Is it probable that ‘yesterday was a really [...] day’ is a suitable context for ‘delightful’?
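The two framings differ only in what is treated as input and what as the prediction target. The sketch below builds the training examples each one would see for a single sentence; the function names and the window size of 2 are assumptions, and it leaves out everything else a real trainer does (the neural model itself, negative sampling, subsampling of frequent words).

```python
def cbow_pairs(tokens, window=2):
    """Yield (context_words, target_word): CBOW predicts the target from its context."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target

def skipgram_pairs(tokens, window=2):
    """Yield (target_word, context_word): skip-gram predicts each context word from the target."""
    for context, target in cbow_pairs(tokens, window):
        for word in context:
            yield target, word

sentence = "yesterday was a really nice day".split()

for context, target in cbow_pairs(sentence):
    print("CBOW      ", context, "->", target)
for target, word in skipgram_pairs(sentence):
    print("Skip-gram ", target, "->", word)
```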
![Page 16: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/16.jpg)
Word Embeddings Approaches
Skip-gram: works well with a small amount of training data and represents even rare words or phrases well.
CBOW: several times faster to train than skip-gram, with slightly better accuracy for frequent words.
Example from StackOverflow
![Page 17: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/17.jpg)
Word Embeddings Implementations
● Examples:
− Word2Vec
− GloVe
− FastText
![Page 18: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/18.jpg)
FastText
● Some characteristics
− Works stand-alone (in bash) or as a Python package
− Allows supervised and unsupervised tasks
− Uses subword information (character level)
● FastText(‘Dog’) ≈ FastText(‘Dogs’)
− Pre-trained embeddings trained on Wikipedia in multiple languages
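Why ‘Dog’ and ‘Dogs’ end up close can be seen by looking at the character n-grams the two words share. The sketch below is an illustration of the idea only, not FastText's code: it wraps the word in `<` and `>` boundary markers and extracts n-grams of length 3 to 6, and the helper names and the Jaccard overlap measure are our own choices.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of '<word>' (angle brackets mark the word boundaries)."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

def jaccard(a, b):
    """Share of n-grams that two sets have in common."""
    return len(a & b) / len(a | b)

dog, dogs, cat = (char_ngrams(w) for w in ("dog", "dogs", "cat"))
print(sorted(dog & dogs))                      # n-grams shared by 'dog' and 'dogs'
print(jaccard(dog, dogs), jaccard(dog, cat))   # overlap with 'dogs' vs. with 'cat'
```

Because a word's vector is built from the vectors of its n-grams, words that share many n-grams start out with related vectors, and even words never seen during training can be given a vector.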
![Page 19: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/19.jpg)
Word Embeddings Size:
● In practice we usually work with vectors of 150 to 300 dimensions:
− Example:
● ‘Word’ → [x0, x1, …, x299]
● Where -1 ≤ xn ≤ 1, and ‘Word’ could be any sequence of characters
![Page 20: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/20.jpg)
Recap
● Word Embeddings
− transform strings into vectors
− words with similar meanings will have similar vectors
− embeddings are computed using words’ context
![Page 21: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/21.jpg)
Cross-lingual word embeddings
![Page 22: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/22.jpg)
We use multilingual word embeddings to compare text across different languages
![Page 23: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/23.jpg)
Working with Multilingual embeddings
● Problem
− Vector values don’t have a meaning per se.
− Values will change depending on the corpus.
− Therefore, training in different languages (different corpora) will result in different embedding values.
![Page 24: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/24.jpg)
![Page 25: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/25.jpg)
![Page 26: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/26.jpg)
Cross-lingual Embeddings
● Solutions:
− Force the embeddings to have specific values, using some anchors (like Facebook LASER).
− Learn a transformation using a bilingual dictionary
■ Knowing a (few) points that match in the two vector spaces, you can rotate (more precisely, apply a linear transformation to) one of the vector spaces to align it with the other.
![Page 27: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/27.jpg)
Knowing that:
Publicaciones -> Publications
Discografía -> Discography
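A rough sketch of the dictionary-based alignment, in two dimensions so that it fits in plain Python. All vectors are hypothetical and built so that the Spanish space is just a rotated copy of the English one; the two dictionary pairs act as anchors, and the least-squares rotation angle is recovered in closed form. With real 300-dimensional embeddings the same problem is usually solved with a matrix factorisation rather than a single angle.

```python
import math

# Hypothetical 2-D embeddings, trained separately for each language.
es = {"publicaciones": [0.0, -1.0], "discografía": [1.0, 0.0], "historia": [0.8, -0.6]}
en = {"publications": [1.0, 0.0], "discography": [0.0, 1.0], "history": [0.6, 0.8]}

# Bilingual dictionary used as anchors.
anchors = [("publicaciones", "publications"), ("discografía", "discography")]

def fit_rotation(pairs):
    """Angle that best rotates the source anchors onto the target anchors (least squares, 2-D)."""
    dot = sum(es[s][0] * en[t][0] + es[s][1] * en[t][1] for s, t in pairs)
    cross = sum(es[s][0] * en[t][1] - es[s][1] * en[t][0] for s, t in pairs)
    return math.atan2(cross, dot)

def rotate(v, theta):
    c, s = math.cos(theta), math.sin(theta)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

theta = fit_rotation(anchors)

# 'historia' was NOT in the dictionary; map it and look for its nearest English word.
mapped = rotate(es["historia"], theta)
nearest = min(en, key=lambda w: math.dist(mapped, en[w]))
print(round(math.degrees(theta)), nearest)
```

The point of the exercise is the last step: the rotation is learned from two known pairs only, yet it also places a word outside the dictionary next to its English counterpart.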
![Page 28: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/28.jpg)
![Page 29: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/29.jpg)
Warning!
● Cross-lingual embeddings will learn analogies, not identities!
You shouldn’t use cross-lingual embeddings for direct translation!
![Page 30: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/30.jpg)
When will cross-lingual embeddings work?
When, given a word W (or sentence) in language X and a (small) set of candidates in language Y, you want to measure which of the candidates is the most similar to W.
Distance(‘Buenos días’, ‘Good Morning’) < Distance(‘Buenos días’, ‘Thank you’)
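This use case amounts to ranking a short candidate list by distance to the query. A minimal sketch, assuming the vectors already live in one aligned space; the 2-D values are invented for illustration, and cosine distance is one reasonable choice of `Distance`, not necessarily the one used in the slides.

```python
import math

# Hypothetical sentence vectors in a shared, already aligned space.
vectors = {
    "Buenos días": [0.9, 0.1],
    "Good morning": [0.85, 0.2],
    "Thank you": [0.1, 0.9],
}

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def rank_candidates(query, candidates):
    """Candidates in language Y, ordered from most to least similar to the query."""
    return sorted(candidates, key=lambda c: cosine_distance(vectors[query], vectors[c]))

ranking = rank_candidates("Buenos días", ["Thank you", "Good morning"])
print(ranking)
```

Note that the output is a ranking among the given candidates, not a translation: if the right answer is not in the candidate set, the best-ranked item is simply the least bad one.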
![Page 31: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/31.jpg)
Examples
![Page 32: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/32.jpg)
Section heading alignment across languages
● Problem:
− Given the most popular section headings in two languages, create an alignment among them.
● Solution:
− Cross-lingual embeddings
− plus other features
![Page 33: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/33.jpg)
Section Alignment API
● Input language: es
● Output language: en
● Section: Historia
● API call
○ https://secrec.wmflabs.org/API/alignment/es/en/Historia
API's Documentation
![Page 34: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/34.jpg)
Section Alignment API
● Input language: en
● Output language: ru
● Section: History
● API call
○ https://secrec.wmflabs.org/API/alignment/en/ru/History
API's Documentation
![Page 35: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/35.jpg)
Section Alignment API
● Input language: es
● Output language: ja
● Section: Historia
● API call
○ https://secrec.wmflabs.org/API/alignment/es/ja/Historia
API's Documentation
![Page 36: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/36.jpg)
Section Alignment API
● Input language: en
● Output language: ru
● Section: History
● API call
○ https://secrec.wmflabs.org/API/alignment/en/ru/History
API's Documentation
Current languages supported: ar, fr, en, es, ja, ru
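The calls shown on these slides follow one pattern: base URL, input language, output language, section heading. The helper below only builds that URL; the function name is ours, the language list is the one given on the slide, and the response format is not shown in the slides, so no parsing is attempted here.

```python
from urllib.parse import quote

BASE_URL = "https://secrec.wmflabs.org/API/alignment"
SUPPORTED = {"ar", "fr", "en", "es", "ja", "ru"}  # languages listed on the slide

def alignment_url(source_lang, target_lang, section):
    """Build the alignment URL for a section heading (hypothetical helper)."""
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED:
            raise ValueError(f"unsupported language: {lang}")
    # Percent-encode the heading so non-ASCII or multi-word sections stay valid in a URL.
    return f"{BASE_URL}/{source_lang}/{target_lang}/{quote(section, safe='')}"

print(alignment_url("es", "en", "Historia"))
print(alignment_url("en", "ru", "History"))
```

The resulting string can be opened in a browser or passed to any HTTP client, provided the service is reachable.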
![Page 37: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/37.jpg)
Named Parameter Templates Alignments
● Problem:
− The Content Translation (CX) tool needs to translate templates, but automatic translation engines are not designed for translating templates.
− Template parameters differ in name and number from language to language.
● (Not perfect) Solution:
− Use cross-lingual word embeddings
− Languages covered:
■ es, en, fr, ar, ru, uk, pt, vi, zh, he, it, ta, id, fa, ca
Check T221211
![Page 38: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/38.jpg)
![Page 39: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/39.jpg)
![Page 40: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/40.jpg)
![Page 41: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/41.jpg)
Template:Infobox publisher->Plantilla:Ficha de editorial
![Page 42: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/42.jpg)
Template:Infobox publisher->Plantilla:Ficha de editorial
![Page 43: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/43.jpg)
Warning!
Cross-lingual embeddings won’t be as good as bilingual humans at creating alignments!
But:
● They can do the work really fast.
● You can use them in all Wikipedia languages.
● Even for unusual language pairs.
![Page 44: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/44.jpg)
Takeaways
● Word embeddings allow machines to understand similarities among words.
● Cross-lingual embeddings allow machines to compare words across different languages.
● For Wikipedia, use word embeddings trained on Wikipedia!
![Page 45: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/45.jpg)
Do you want more?
● Do you want to know more about other possible usages of embeddings on Wikipedia?
− Check out our white paper about topic embeddings.
● Clone this repository and start working with multilingual embeddings with Python:
− https://github.com/digitalTranshumant/TutorialCrossLingualEmbeddings
![Page 46: September 2019 Introduction to cross-lingual word ... · Introduction to cross-lingual word-embeddings Diego Sáez-Trumper, Wikimedia Research September 2019. ... Word embeddings](https://reader034.fdocuments.net/reader034/viewer/2022042805/5f6098a74ba6dc16f20b4edb/html5/thumbnails/46.jpg)
Thanks!