Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages
![Page 1: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/1.jpg)
Morphological Analyzer and Generator for Russian and Ukrainian Languages
Mikhail Korobov AIST 2015
![Page 2: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/2.jpg)
Morphological Analysis: word -> possible grammatical tags
• стали: VERB,perf,intr plur,past,indc (ГЛ,сов,неперех мн,прош,изъяв);
• стали: NOUN,inan,femn [sing,gent; sing,datv; sing,loct; plur,nomn; plur,accs] (СУЩ,неод,жр [ед,рд; ед,дт; ед,пр; мн,им; мн,вн]);
• бутявка: NOUN,inan,femn sing,nomn (СУЩ,неод,жр ед,им)
![Page 3: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/3.jpg)
Morphological Generation
• lemmatization: стали -> стать, ежом -> ёж
• inflection: стали -> (sing,3per,futr) -> станет
• inflection: ёж -> (datv) -> ежу
![Page 4: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/4.jpg)
pymorphy2: features
• morphological analysis of Russian words;
• morphological generation: lemmatization, inflection, number agreement;
• P(tag | word) estimates;
• out-of-vocabulary words handling;
• experimental support for Ukrainian language.
![Page 5: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/5.jpg)
pymorphy2: implementation
• Python library and a command line tool
• Permissive open-source license: MIT for code, Creative Commons BY-SA for data
• 600+ unit tests; 90%+ test coverage
• Memory usage: 30MB = 15MB pymorphy2 + 15MB Python interpreter
• Speed: 20-100K words per second with an optional C++ extension
![Page 6: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/6.jpg)
Analysis of Vocabulary Words
• OpenCorpora dictionary for Russian (5M word forms, 400K lemmas);
• a dictionary for Ukrainian based on LanguageTool data (2.5M word forms) by Andrey Rysin, Dmitry Chaplinsky, Mariana Romanyshyn, Vladimir Sevastyanov & others.
![Page 7: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/7.jpg)
Analysis of Vocabulary Words
Source dictionaries provide lexemes:
ёж     NOUN,anim,masc sing,nomn
ежа    NOUN,anim,masc sing,gent
ежу    NOUN,anim,masc sing,datv
...
ежами  NOUN,anim,masc plur,ablt
ежах   NOUN,anim,masc plur,loct
![Page 8: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/8.jpg)
Tasks
• Analyze: get a word from dictionary, return its tag
• Lemmatize: find a word in dictionary, get 1st word from its lexeme
• Inflect: find a word in dictionary, get a compatible word from its lexeme
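The three tasks above can be sketched over a toy in-memory dictionary. This is only an illustration, not pymorphy2's implementation: the names `LEXEMES`, `INDEX`, `analyze`, `lemmatize` and `inflect` and the data layout are assumptions, and the lexeme is abbreviated.

```python
# Toy dictionary: each lexeme is a list of (word form, tag) pairs with the
# lemma stored first. The layout is an assumption made for this sketch.
LEXEMES = [
    [("ёж", "NOUN,anim,masc sing,nomn"),
     ("ежа", "NOUN,anim,masc sing,gent"),
     ("ежу", "NOUN,anim,masc sing,datv"),
     ("ежом", "NOUN,anim,masc sing,ablt"),
     ("ежами", "NOUN,anim,masc plur,ablt"),
     ("ежах", "NOUN,anim,masc plur,loct")],
]

# Index: word form -> list of (lexeme, position of the form in the lexeme).
INDEX = {}
for lexeme in LEXEMES:
    for pos, (form, _tag) in enumerate(lexeme):
        INDEX.setdefault(form, []).append((lexeme, pos))


def grammemes(tag):
    """Split a tag such as 'NOUN,anim,masc sing,nomn' into a set of grammemes."""
    return set(tag.replace(" ", ",").split(","))


def analyze(word):
    """Analyze: return the tags of every dictionary entry for the word."""
    return [lexeme[pos][1] for lexeme, pos in INDEX.get(word, [])]


def lemmatize(word):
    """Lemmatize: return the first form of each lexeme containing the word."""
    return [lexeme[0][0] for lexeme, _pos in INDEX.get(word, [])]


def inflect(word, required):
    """Inflect: return forms from the word's lexeme whose tags contain
    all of the required grammemes."""
    required = set(required)
    return [form
            for lexeme, _pos in INDEX.get(word, [])
            for form, tag in lexeme
            if required <= grammemes(tag)]
```

For example, `lemmatize("ежом")` gives `["ёж"]` and `inflect("ёж", {"datv"})` gives `["ежу"]`. A fuller version would also keep the grammemes of the source form that the request does not override.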
![Page 9: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/9.jpg)
Efficiency considerations
• OpenCorpora XML dictionary is 400MB on disk
• Lookup in the XML file is O(N)
• When loaded into an in-memory hash table (Python dict), the dictionary takes several GB of RAM
![Page 10: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/10.jpg)
Solution
• Extract paradigms from lexemes; encode words as DAFSA.
• Also tried: succinct tries, two double-array tries
• 5M Russian word forms in DAFSA == 3MB RAM
![Page 11: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/11.jpg)
Lexeme
word         tag
хомяковый    ADJF,Qual masc,sing,nomn
хомякового   ADJF,Qual masc,sing,gent
...
хомяковы     ADJS,Qual plur
хомяковее    COMP,Qual
хомяковей    COMP,Qual V-ej
похомяковее  COMP,Qual Cmp2
похомяковей  COMP,Qual Cmp2,V-ej
![Page 12: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/12.jpg)
Lexeme
prefix  stem     suffix  tag
        хомяков  ый      ADJF,Qual masc,sing,nomn
        хомяков  ого     ADJF,Qual masc,sing,gent
...
        хомяков  ы       ADJS,Qual plur
        хомяков  ее      COMP,Qual
        хомяков  ей      COMP,Qual V-ej
по      хомяков  ее      COMP,Qual Cmp2
по      хомяков  ей      COMP,Qual Cmp2,V-ej
![Page 13: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/13.jpg)
Paradigm
prefix  suffix  tag
        ый      ADJF,Qual masc,sing,nomn
        ого     ADJF,Qual masc,sing,gent
...
        ы       ADJS,Qual plur
        ее      COMP,Qual
        ей      COMP,Qual V-ej
по      ее      COMP,Qual Cmp2
по      ей      COMP,Qual Cmp2,V-ej
![Page 14: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/14.jpg)
Paradigm, encoded
prefix_id  suffix_id  tag_id
0          66         78
0          67         79
...
0          37         94
0          82         95
0          121        96
1          82         97
1          121        98
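The lexeme -> paradigm -> encoded paradigm steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: the prefix list, the heuristic for choosing a prefix, and the names `extract_paradigm`, `Interner` and `encode` are hypothetical, and the ids it produces are local to the example rather than the ones shown on the slide.

```python
import os


def common_len(a, b):
    """Length of the common prefix of two strings."""
    return len(os.path.commonprefix([a, b]))


def extract_paradigm(lexeme, prefixes=("", "по")):
    """Split every form into prefix + stem + suffix.

    lexeme: list of (form, tag) pairs, lemma first.
    Returns (stem, paradigm), where paradigm is a tuple of
    (prefix, suffix, tag) rows and no longer contains the stem.
    """
    lemma = lexeme[0][0]
    split = []
    for form, tag in lexeme:
        best_prefix, best_rest = "", form
        for p in prefixes:
            if p and form.startswith(p):
                rest = form[len(p):]
                # Strip the prefix only if the remainder looks more like the lemma.
                if common_len(rest, lemma) > common_len(best_rest, lemma):
                    best_prefix, best_rest = p, rest
        split.append((best_prefix, best_rest, tag))
    stem = os.path.commonprefix([rest for _p, rest, _t in split])
    paradigm = tuple((p, rest[len(stem):], tag) for p, rest, tag in split)
    return stem, paradigm


class Interner:
    """Assigns consecutive integer ids to distinct values."""

    def __init__(self):
        self.ids = {}

    def __call__(self, value):
        return self.ids.setdefault(value, len(self.ids))


def encode(paradigm, prefix_ids, suffix_ids, tag_ids):
    """Replace strings with (prefix_id, suffix_id, tag_id) triples."""
    return [(prefix_ids(p), suffix_ids(s), tag_ids(t)) for p, s, t in paradigm]


LEXEME = [
    ("хомяковый", "ADJF,Qual masc,sing,nomn"),
    ("хомякового", "ADJF,Qual masc,sing,gent"),
    ("хомяковы", "ADJS,Qual plur"),
    ("хомяковее", "COMP,Qual"),
    ("хомяковей", "COMP,Qual V-ej"),
    ("похомяковее", "COMP,Qual Cmp2"),
    ("похомяковей", "COMP,Qual Cmp2,V-ej"),
]

stem, paradigm = extract_paradigm(LEXEME)
encoded = encode(paradigm, Interner(), Interner(), Interner())
```

Because the stem is factored out, any other lexeme with the same prefixes, suffixes and tags yields an identical paradigm tuple, so the paradigm only needs to be stored once.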
![Page 15: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/15.jpg)
DAFSA
[Figure: DAFSA graph encoding the triples listed below; node and edge labels were lost in extraction.]
(word, paradigm_id, form_index) triples: (двор, 103, 0); (ёж, 104, 0); (дворник, 101, 2); (дворник, 102, 2); (ёжик, 101, 2); (ёжик, 102, 2)
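A rough sketch of how a (word, paradigm_id, form_index) triple can be turned back into a tag and a lemma. A sorted list with binary search stands in for the DAFSA here, and the paradigm contents, the id `7` and the function name `parse` are hypothetical; the contents of paradigms 101-104 are not given in the slides.

```python
from bisect import bisect_left

# Hypothetical paradigm table: paradigm_id -> list of (prefix, suffix, tag).
# The first row describes the lemma.
PARADIGMS = {
    7: [("", "ый", "ADJF,Qual masc,sing,nomn"),
        ("", "ого", "ADJF,Qual masc,sing,gent"),
        ("по", "ее", "COMP,Qual Cmp2")],
}

# Sorted (word, paradigm_id, form_index) triples. A real implementation
# would keep these in a DAFSA; a sorted list is used only for illustration.
WORDS = sorted([
    ("хомяковый", 7, 0),
    ("хомякового", 7, 1),
    ("похомяковее", 7, 2),
])


def parse(word):
    """Return (lemma, tag) for every triple stored for the word."""
    results = []
    i = bisect_left(WORDS, (word,))
    while i < len(WORDS) and WORDS[i][0] == word:
        _word, paradigm_id, form_index = WORDS[i]
        paradigm = PARADIGMS[paradigm_id]
        prefix, suffix, tag = paradigm[form_index]
        # Cut the form's prefix and suffix to recover the stem.
        stem = word[len(prefix):len(word) - len(suffix)]
        lemma_prefix, lemma_suffix, _lemma_tag = paradigm[0]
        results.append((lemma_prefix + stem + lemma_suffix, tag))
        i += 1
    return results
```

With this layout the word list stores no tags and no lemmas, only two small integers per word form, which is what makes the compact encoding possible.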
![Page 16: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/16.jpg)
Out of Vocabulary Words
![Page 17: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/17.jpg)
Common prefixes removal: language-specific lists of common immutable prefixes (e.g. "не", "псевдо")
• недопсевдоавиашоу == недо + псевдоавиашоу
• псевдоавиашоу == псевдо + авиашоу
• авиашоу == авиа + шоу
• шоу - a known word
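The decomposition above can be sketched as a recursive prefix-stripping function. The prefix list, the toy dictionary with its tag, and the name `analyze_with_prefixes` are assumptions made for this example, not the lists or API of pymorphy2.

```python
# Hypothetical data for the sketch: a tiny dictionary and a prefix list.
KNOWN_WORDS = {"шоу": ["NOUN,inan,neut,Fixd sing,nomn"]}
PREFIXES = ["недо", "не", "псевдо", "авиа"]


def analyze_with_prefixes(word, known=KNOWN_WORDS, prefixes=PREFIXES):
    """Return (stripped_prefixes, base_word, tag) triples.

    A known word is returned as is; otherwise each matching prefix is
    removed and the remainder is analyzed recursively.
    """
    if word in known:
        return [([], word, tag) for tag in known[word]]
    results = []
    for prefix in prefixes:
        rest = word[len(prefix):]
        if word.startswith(prefix) and rest:
            for stripped, base, tag in analyze_with_prefixes(rest, known, prefixes):
                results.append(([prefix] + stripped, base, tag))
    return results
```

Since the prefixes are immutable, the tag of the base word is reused unchanged for the whole word.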
![Page 18: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/18.jpg)
Words Ending with Other Dictionary Words
Example: котопсина
• a word being analyzed has another word from a dictionary as a suffix;
• the length of this "suffix" word is no less than 3;
• the length of the word without the "suffix" is no greater than 5;
• "suffix" word is of an open class (noun, verb, adjective, participle, gerund)
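The four conditions can be sketched as a single function. The toy dictionary, the mapping of "open class" to a set of part-of-speech tags, and the name `analyze_by_known_suffix` are assumptions made for illustration.

```python
# Assumed mapping of the open classes named above to part-of-speech tags.
OPEN_CLASSES = {"NOUN", "VERB", "INFN", "ADJF", "ADJS", "PRTF", "PRTS", "GRND"}

# Hypothetical toy dictionary: word -> list of tags.
KNOWN_WORDS = {
    "псина": ["NOUN,anim,femn sing,nomn"],
    "при": ["PREP"],
}


def analyze_by_known_suffix(word, known=KNOWN_WORDS,
                            min_suffix_len=3, max_prefix_len=5):
    """Return (unknown_part, suffix_word, tag) triples for every split where
    the tail is a dictionary word of an open class."""
    results = []
    for cut in range(1, max_prefix_len + 1):
        unknown_part, suffix_word = word[:cut], word[cut:]
        if len(suffix_word) < min_suffix_len:
            break  # later cuts only make the "suffix" word shorter
        for tag in known.get(suffix_word, []):
            pos = tag.split(",")[0].split()[0]
            if pos in OPEN_CLASSES:
                results.append((unknown_part, suffix_word, tag))
    return results
```

For "котопсина" the only accepted split is "кото" + "псина", so the word inherits the analysis of "псина".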
![Page 19: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/19.jpg)
Endings Matching
Example: бурбуляторовый
• words with common endings often have the same grammatical form
• pymorphy2 builds an index of all 1-5 char word endings and their analyses
• (frequency, paradigm_id, form_index) triple is stored for each ending
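A minimal sketch of such an ending index, assuming the dictionary is available as (word, paradigm_id, form_index) triples. The sample entries, the paradigm ids and the function names are hypothetical, and a real index would be built from the full dictionary with additional filtering.

```python
from collections import Counter, defaultdict


def build_ending_index(entries, max_len=5):
    """Map every 1..max_len character ending to a Counter of
    (paradigm_id, form_index) analyses seen with that ending."""
    index = defaultdict(Counter)
    for word, paradigm_id, form_index in entries:
        for n in range(1, min(max_len, len(word)) + 1):
            index[word[-n:]][(paradigm_id, form_index)] += 1
    return index


def predict(word, index, max_len=5):
    """Use the longest known ending of the word and return its
    (frequency, paradigm_id, form_index) triples, most frequent first."""
    for n in range(min(max_len, len(word)), 0, -1):
        counts = index.get(word[-n:])
        if counts:
            return [(freq, pid, idx) for (pid, idx), freq in counts.most_common()]
    return []


# Hypothetical dictionary entries.
ENTRIES = [
    ("хомяковый", 7, 0),
    ("бобровый", 7, 0),
    ("новый", 8, 0),
    ("хомякового", 7, 1),
]
INDEX = build_ending_index(ENTRIES)
```

Here "бурбуляторовый" shares its 5-character ending "ровый" with "бобровый", so `predict` proposes paradigm 7, form 0 for it.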
![Page 20: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/20.jpg)
Words with a Hyphen
• adverbs with a hyphen: по-хорошему
• particles separated by a hyphen: смотри-ка
• compound words: интернет-магазин, человек-паук
![Page 21: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/21.jpg)
P(tag | word) estimation
• Based on partially disambiguated OpenCorpora data;
• MLE with Laplace smoothing
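One plausible reading of "MLE with Laplace smoothing" is add-one smoothing over the candidate tags of a word. The exact formula is not given in the slides, so the function below, its name and its `alpha` parameter are assumptions.

```python
def tag_probabilities(candidate_tags, corpus_counts, alpha=1.0):
    """Estimate P(tag | word) for one word.

    candidate_tags: tags the analyzer returns for the word.
    corpus_counts: how often each tag was assigned to this word in the
        disambiguated part of the corpus (missing tags count as 0).
    alpha: smoothing constant; alpha=1 is Laplace (add-one) smoothing.
    """
    total = sum(corpus_counts.get(tag, 0) for tag in candidate_tags)
    denominator = total + alpha * len(candidate_tags)
    return {tag: (corpus_counts.get(tag, 0) + alpha) / denominator
            for tag in candidate_tags}
```

With counts `{"VERB": 8}` and candidates `["VERB", "NOUN"]` this gives 9/10 for VERB and 1/10 for NOUN, so a reading never seen in the corpus still keeps a non-zero probability.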
![Page 22: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/22.jpg)
Evaluation: bad ideas
• evaluate pymorphy2 on OpenCorpora data
• evaluate Mystem on ruscorpora.ru (НКРЯ) data
![Page 23: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/23.jpg)
Evaluation Setup
• pymorphy2 and Mystem 3.0;
• 100 randomly selected sentences from OpenCorpora ("microcorpus");
• 100 randomly selected sentences from ruscorpora.ru;
• tagsets are different; evaluation requires complicated tag matching and manual checking of all errors;
• available online (http://goo.gl/BNXQXf)
![Page 24: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/24.jpg)
Evaluation: errors (full grammatical tags, recall, errors in hyphenated words are not considered errors)
[Bar chart: number of errors made by pymorphy2 and Mystem 3.0 on microcorpus and ruscorpora. Bar labels as extracted: 89, 15, 10.]
![Page 25: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/25.jpg)
Evaluation: errors
[Bar chart: number of errors by category (Abbreviations, People Names, Regular Words, Other, Hyphenated Words*) for pymorphy2 and Mystem 3.0. Bar labels as extracted: 11, 2, 6, 1, 14, 02, 44, 9.]
![Page 26: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/26.jpg)
Evaluation: results
• Both pymorphy2 and Mystem had an error rate below 1% (without disambiguation); most errors are in special cases.
• Hard to draw a conclusion; interpretation of evaluation results is important.
• Parsing the ruscorpora.ru gold data with pymorphy2 revealed 6 errors in it; parsing the microcorpus gold data with Mystem revealed 1 error.
![Page 27: Mikhail Korobov - Morphological Analyzer and Generator for Russian and Ukrainian Languages](https://reader033.fdocuments.net/reader033/viewer/2022052311/55a9a64b1a28aba5518b4858/html5/thumbnails/27.jpg)
Future work
• improve parsing of people names, abbreviations and hyphenated words;
• improve non-contextual P(tag|word) estimates;
• improve Ukrainian language support;
• add Belarusian language support;
• there is room for speed improvements;
• nicer command-line utility;
• ideas?