Multilingual Natural Language Processing
Cristina Espana-Bonet
RICOH Institute of ICT, Tokyo, Japan
11th December 2017
Outline: Multilingual NLP
1 Semantic Representations
2 Multilingual Resources
3 Projects
Semantic Representations: (Monolingual) Sentence Representations
Composition of word embeddings using operations (+, ×) on vectors and matrices
Internal representations in seq2seq architectures or auto-encoders (NMT context vectors, skip-thought vectors...)
Latent paragraph vectors in word2vec-like NNs
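The first option in the list above, composing word embeddings with + and ×, can be sketched in a few lines. This is only an illustration: the three-dimensional embedding values are made up, and the function names `compose_add` and `compose_mul` are hypothetical, not taken from the talk.

```python
from functools import reduce

# Toy 3-dimensional word embeddings (made-up values, for illustration only).
EMBEDDINGS = {
    "the": [0.1, 0.3, 0.5],
    "marshmallow": [0.7, 0.2, 0.4],
    "melts": [0.6, 0.9, 0.1],
}


def compose_add(words, table=EMBEDDINGS):
    """Sentence vector as the element-wise sum of its word vectors."""
    vectors = [table[w] for w in words]
    return [sum(dims) for dims in zip(*vectors)]


def compose_mul(words, table=EMBEDDINGS):
    """Sentence vector as the element-wise product of its word vectors."""
    vectors = [table[w] for w in words]
    return [reduce(lambda a, b: a * b, dims) for dims in zip(*vectors)]


sentence = ["the", "marshmallow", "melts"]
additive = compose_add(sentence)
multiplicative = compose_mul(sentence)
```

Both compositions ignore word order, which is one reason the internal representations of sequence models are considered as an alternative.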
Semantic Representations: (Multilingual) Sentence Representations
ML-NMT Context Vectors, why?
Machine Translation is naturally a bilingual task
Neural Machine Translation (NMT) encodes semantics in word vectors
Straightforward extension of NMT to multilingual NMT (ML-NMT)
ML word (or context) vectors lie in the same space
Semantic Representations in Neural Machine Translation Architectures
[Figure: encoder-decoder NMT architecture. Encoder with forward hidden states →h1 … →h8 and backward hidden states ←h1 … ←h8, vector q, decoder states z1 … z6, attention mechanism, softmax.]
[Figure: the same encoder-decoder architecture, with additional labels: context vectors, dense vectors, one-hot encoding.]
[Figure: the same architecture on an example. Encoder input: "The marshmallow has to be on top <eos>". Decoder output so far: "Der Marshmallow muss".]
[Figure: the same example with separate vocabularies. Encoder input: "The marshmallow has to be on top <eos>". Decoder output so far: "Der Marshmallow muss". Vocabulary (English): marshmallow, has, to, be, top... Vocabulary (German): Marshmallow, muss, oben, drauf, sein...]
[Figure: the same architecture with a target-language tag. Encoder input: "<2de> The marshmallow has to be on top". Decoder output: "Der Marshmallow muss oben drauf sein".]
[Figure: the same system with a Dutch source. Encoder input: "<2de> De marshmallow moet er bovenop . <eos>". Decoder output: "Der Marshmallow muss oben drauf sein".]
[Figure: the same Dutch source with an English target tag. Encoder input: "<2en> De marshmallow moet er bovenop . <eos>". Decoder output so far: "The marshmallow has to be on".]
[Figure: the same example with mixed-language vocabularies. Encoder input: "<2en> De marshmallow moet er bovenop . <eos>". Decoder output so far: "The marshmallow has to be on". One vocabulary: Marshmallow, muss, oben, has, to, be, top... The other vocabulary: marshmallow, has, top, oben, bovenop, moet ...]
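The `<2de>` and `<2en>` tokens in the examples above tell a single multilingual system which language to produce. A minimal sketch of how such training pairs could be prepared is given below; the helper names `tag_source` and `build_training_pairs` and the tuple layout of the toy corpus are assumptions made for illustration, not the preprocessing actually used for the systems in the talk.

```python
def tag_source(sentence, target_lang):
    """Prepend a target-language token such as <2de> to a source sentence."""
    return f"<2{target_lang}> {sentence}"


def build_training_pairs(corpus):
    """Turn (src_lang, tgt_lang, src_sentence, tgt_sentence) tuples into
    (tagged_source, target) pairs for one shared encoder-decoder."""
    return [
        (tag_source(src, tgt_lang), tgt)
        for _src_lang, tgt_lang, src, tgt in corpus
    ]


# Toy corpus built from the example sentences on the slides.
corpus = [
    ("en", "de", "The marshmallow has to be on top",
     "Der Marshmallow muss oben drauf sein"),
    ("nl", "de", "De marshmallow moet er bovenop .",
     "Der Marshmallow muss oben drauf sein"),
    ("nl", "en", "De marshmallow moet er bovenop .",
     "The marshmallow has to be on top"),
]
pairs = build_training_pairs(corpus)
```

Because the source language is never marked, the same Dutch sentence can be paired with a German or an English target simply by changing the tag.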
Semantic Representations: Performance of Context Vectors
ML-NMT: {de, en, nl, it, ro} → {de, en, nl, it, ro}
2D t-SNE representation of context vectors
Semantic Representations: Performance of Context Vectors
ML-NMT: {ar, en, es} → {ar, en, es}
2D t-SNE representation of context vectors
Semantic Representations: Performance of Context Vectors
Semantic Textual Similarity Task (SemEval 2017), Pearson correlation

             track1  track2  track3  track4a  track5
             ar–ar   ar–en   es–es   es–en    en–en
w2v 300-D    0.49    0.28    0.55    0.40     0.56
w2v 1024-D   0.51    0.33    0.59    0.45     0.60
ctx 1024-D   0.59    0.44    0.78    0.49     0.76
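The scores above are Pearson correlations between system similarity scores and human judgements. A minimal sketch of such an evaluation is shown here, assuming that the system score for a sentence pair is the cosine between its two sentence vectors; the toy vectors and gold scores are invented and do not come from the SemEval data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Invented sentence-vector pairs and gold similarity scores (0 to 5).
vector_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
gold = [5.0, 3.0, 0.0]

predicted = [cosine(u, v) for u, v in vector_pairs]
score = pearson(predicted, gold)
```

The same routine applies whether the vectors are summed word2vec embeddings or encoder context vectors; only the way each sentence vector is produced changes.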
Semantic Representations in Neural Machine Translation Architectures II
[Figure: encoder-decoder architecture with encoder states h1 … h8 in both directions, vector q, decoder states z1 … z6, attention mechanism and softmax; labels: context vectors, dense vectors, one-hot encoding. Encoder input: "The marshmallow has to be on top <eos>".]
[Figure: the same encoder-decoder architecture (labels: attention mechanism, softmax, context vectors, dense vectors, one-hot encoding), now with factored input.]
Factor concatenation:
$$\overrightarrow{h}_j = \tanh\left(\overrightarrow{W}\,\big\Vert_{k=1}^{|F|} E_k x_{jk} + \overrightarrow{U}\,\overrightarrow{h}_{j-1}\right)$$
$$\overleftarrow{h}_j = \tanh\left(\overrightarrow{W}\,\big\Vert_{k=1}^{|F|} E_k x_{jk} + \overrightarrow{U}\,\overleftarrow{h}_{j-1}\right)$$
Factored input: 2de|-|- the|DET|0 marsh|NOUN|MRXML has|VERB|HS to|PREP|T be|VERB|P on|PREP|AN top|NOUN|TP
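The concatenation term in the equations above, one embedding lookup per factor followed by joining the results into a single input vector, can be sketched as follows. The embedding tables and their tiny dimensions are made up, and the function name `embed_factored_token` is hypothetical; the recurrent update with W and U is left out.

```python
# One embedding table E_k per factor; all values are invented for illustration.
FACTOR_TABLES = [
    {"the": [0.1, 0.2], "top": [0.3, 0.4]},  # E_1: word embeddings
    {"DET": [0.5], "NOUN": [0.6]},           # E_2: PoS embeddings
    {"0": [0.7], "TP": [0.8]},               # E_3: phonetic-code embeddings
]


def embed_factored_token(token, tables=FACTOR_TABLES):
    """Look up each factor of 'word|PoS|phonetic' in its own table and
    concatenate the embeddings into a single input vector."""
    factors = token.split("|")
    if len(factors) != len(tables):
        raise ValueError(f"expected {len(tables)} factors, got {len(factors)}")
    vector = []
    for value, table in zip(factors, tables):
        vector.extend(table[value])
    return vector


x_the = embed_factored_token("the|DET|0")
x_top = embed_factored_token("top|NOUN|TP")
```

The size of the resulting vector is the sum of the per-factor embedding sizes, so each factor can be given a dimension that matches how much information it carries.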
Semantic Representations: Multilinguality and Interlinguality through Factors
Approximate phonetic encodings with Metaphone 3
Babel Synsets for nouns (incl. named entities, foreign words and numerals), adjectives, adverbs and verbs. Negation particles are tagged with NEG
PoS, Lemma & Stem
<2de>|-|-|- the|DET|the|0|- marshmallow|NOUN|marshmallow|MRXML|bn:00053559n
has|VERB|has|HS|bn:00089240v to|PREP|to|T|- be|VERB|be|P|bn:00083181v on|PREP|on|AN|-
top|NOUN|top|TP|bn:00077607n
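A factored line like the one above can be read back into named fields with a small parser. The field order (word, PoS, lemma, phonetic code, synset) is inferred from the example and is an assumption, as are the names `FACTOR_NAMES` and `parse_factored_token`.

```python
# Assumed field order, inferred from the example line on the slide.
FACTOR_NAMES = ("word", "pos", "lemma", "phonetic", "synset")


def parse_factored_token(token):
    """Split 'word|PoS|lemma|phonetic|synset' into a dict; '-' means no value."""
    values = token.split("|")
    # Pad short tokens (e.g. the language tag) so every token has five fields.
    values += ["-"] * (len(FACTOR_NAMES) - len(values))
    return {
        name: (None if value == "-" else value)
        for name, value in zip(FACTOR_NAMES, values)
    }


line = (
    "the|DET|the|0|- "
    "marshmallow|NOUN|marshmallow|MRXML|bn:00053559n "
    "top|NOUN|top|TP|bn:00077607n"
)
tokens = [parse_factored_token(t) for t in line.split()]
tag = parse_factored_token("<2de>|-|-|-")
```

Function words such as "the" carry no synset, so the parser maps the `-` placeholder to `None` instead of treating it as an identifier.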
Semantic Representations: Beyond Zero-Shot Languages with Factors
ML-NMT: {de, en, nl, it, ro} → {de, en, nl, it, ro} # es, fr !
2D t-SNE representation of context vectors
Semantic Representations: Improving NMT with Factors
Average for 20 language pairs
Semantic Representations: Beyond Zero-Shot NMT with Factors
Semantic Representations ... from Multilingual NMT Systems
Context vectors provide a better semantic representation at sentence level than the composition of word embeddings
They are multilingual by construction
Interlingual factors help to position new (related) languages in the same space
... but you need time, GPUs and data
Multilingual Resources
1 Semantic Representations
2 Multilingual Resources: BabelNet, Wikipedia
3 Projects
Multilingual Resources: Importance of Resources
Data-driven systems are (even) more prevalent with the boom of deep learning architectures
Data quality is (even) more important for neural systems, e.g. SMT vs. NMT
Multilingual Resources
Multilingual Resources: BabelNet
Multilingual encyclopedic dictionary
Semantic network
271 languages
14 million entries
Multilingual Resources: BabelNet, Related Research & Challenges
Multilingual Word Sense Disambiguation
de: Es|- war|bn:00083181v ein|- riesiger|- Erfolg|bn:15350982n
en: And|- it|- was|bn:00083181v a|- huge|bn:00098905a success|bn:00075023n
it: Ed|- e|bn:00083181v stato|bn:00083181v un|- enorme|bn:00102268a successo|bn:00078365n
nl: En|- het|- was|bn:00083181v een|- groot|- succes|bn:06512571n
ro: Si|bn:00012706n a|- fost|bn:00083181v un|- mare|bn:00098342a succes|bn:00075024n
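To see how interlingual the annotations above are, one can collect the synset identifiers per language and intersect them. The sketch below does this for the five example sentences; the helper name `synsets` is hypothetical and the parsing assumes the `word|synset` layout shown on the slide.

```python
# The five annotated sentences from the slide, as word|synset tokens.
ANNOTATED = {
    "de": "Es|- war|bn:00083181v ein|- riesiger|- Erfolg|bn:15350982n",
    "en": "And|- it|- was|bn:00083181v a|- huge|bn:00098905a success|bn:00075023n",
    "it": "Ed|- e|bn:00083181v stato|bn:00083181v un|- enorme|bn:00102268a "
          "successo|bn:00078365n",
    "nl": "En|- het|- was|bn:00083181v een|- groot|- succes|bn:06512571n",
    "ro": "Si|bn:00012706n a|- fost|bn:00083181v un|- mare|bn:00098342a "
          "succes|bn:00075024n",
}


def synsets(sentence):
    """Return the set of synset ids in a 'word|synset' annotated sentence."""
    ids = set()
    for token in sentence.split():
        _word, _sep, synset = token.rpartition("|")
        if synset != "-":
            ids.add(synset)
    return ids


per_language = {lang: synsets(sent) for lang, sent in ANNOTATED.items()}
shared = set.intersection(*per_language.values())
```

In this example only the verb synset `bn:00083181v` is shared by all five languages, while the noun for "success" receives a different identifier in each one, which illustrates why multilingual disambiguation is listed as a challenge.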
Multilingual Resources: Wikipedia
Multilingual, web-based encyclopedia
299 languages
46 million articles
English: 5,529,144 content articles; Japanese: 1,087,058 content articles
Multilingual Resources: Wikipedia, Related Research & Challenges
Same article in multiple languages connected via interlanguage links
Comparable corpora are easy to build
Categories allow extracting in-domain comparable corpora (not so easy!)
Parallel sentences can be extracted from comparable corpora (not so easy!)
Multilingual Resources: Wikipedia, Related Research & Challenges
In-domain Comparable Corpora
Identify comparable articles
Build a characteristic vocabulary for the domain of interest automatically from root articles
Explore the Wikipedia categories' graph to select the subset of categories in the domain
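The graph exploration step can be sketched as a breadth-first traversal that keeps a category only if its title contains a term from the domain vocabulary. This is a simplification under stated assumptions: the category graph, the vocabulary and the pruning rule below are invented for illustration and are not the actual selection criterion of the tool presented in the talk.

```python
from collections import deque

# Hypothetical fragment of a category graph: parent -> subcategories.
CATEGORY_GRAPH = {
    "Psychology": ["Cognitive psychology", "Psychologists", "Behavioural sciences"],
    "Cognitive psychology": ["Memory", "Attention"],
    "Psychologists": ["Psychologists by nationality"],
    "Behavioural sciences": ["Economics"],
}

# Hypothetical domain vocabulary, as if extracted from the root articles.
DOMAIN_VOCABULARY = {"psychology", "psychologists", "memory", "attention", "cognitive"}


def in_domain(title, vocabulary):
    """A category counts as in-domain if any word of its title is in the vocabulary."""
    return any(word in vocabulary for word in title.lower().split())


def select_categories(root, graph, vocabulary):
    """Breadth-first exploration from the root; out-of-domain branches are pruned."""
    selected, seen = [], {root}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        if not in_domain(category, vocabulary):
            continue  # do not descend below an out-of-domain category
        selected.append(category)
        for child in graph.get(category, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return selected


selected = select_categories("Psychology", CATEGORY_GRAPH, DOMAIN_VOCABULARY)
```

Pruning matters because a category graph is not a clean tree: without a stopping rule the traversal quickly drifts into unrelated domains, here "Economics".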
Multilingual Resources: Wikipedia, Related Research & Challenges
Graph Exploration
Multilingual Resources: Wikipedia, Related Research & Challenges
Parallel Sentence Extraction
Brute-force sentence-wise comparison for parallel pairs identification
Affordable (sentences aligned at document level)
Similarity measures to detect them
Cosine on character n-grams and pseudo-cognates, length factors
Cosine on context vectors
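The brute-force comparison above can be sketched with one of the listed measures, the cosine on character n-grams. The sketch covers only that measure (no pseudo-cognates, length factors or context vectors), and the trigram size, the 0.2 threshold and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Bag of character n-grams of a lower-cased, space-padded sentence."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_pairs(src_sentences, tgt_sentences, threshold=0.2):
    """Compare every source sentence with every target sentence of an aligned
    document pair and keep the best match if it reaches the threshold."""
    pairs = []
    for src in src_sentences:
        scored = [
            (cosine(char_ngrams(src), char_ngrams(tgt)), tgt)
            for tgt in tgt_sentences
        ]
        score, tgt = max(scored)
        if score >= threshold:
            pairs.append((src, tgt, score))
    return pairs


en_doc = ["The marshmallow has to be on top", "It was a huge success"]
nl_doc = ["En het was een groot succes", "De marshmallow moet er bovenop ."]
pairs = best_pairs(en_doc, nl_doc)
```

Character n-grams work here because English and Dutch share many cognates; for unrelated languages or different scripts the overlap vanishes, which is where a cosine on multilingual context vectors becomes the more suitable measure.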
Projects
1 Semantic Representations
2 Multilingual Resources
3 Projects: CLuBS, QT21, WikiTailor
Projects: The CLuBS Project
CLuBS
Cross-Lingual Bibliographic Search
Initial Status: Multilingual en/es/de/fr retrieval platform (PubPsych) online
Aim: Make PubPsych cross-lingual and improve translation and retrieval with the latest advances
Projects: PubPsych – https://pubpsych.zpid.de/pubpsych/
Projects: The CLuBS Project
German Project (Leibniz Gemeinschaft)
Multilingual NMT systems
In-domain parallel sentence extraction
Query translation?
Cross-lingual retrieval?
Projects: The QT21 Project
Aim: Improve translation quality for all European languages
Initial Status: 3 well covered languages, problems with morphologically complex and low-resourced languages
Projects: The QT21 Project
European Project (Horizon 2020)
Multilingual NMT systems
Factored ML-NMT with interlingual information
Projects: WikiTailor
Your a-la-carte in-domain corpora extraction tool from Wikipedia
Joint work with Alberto Barron-Cedeno
Aim: Extraction of parallel corpora in any domain and language from Wikipedia
Current Status: Extraction of comparable corpora in any domain and language from Wikipedia
Thanks!