1
Russian National Corpus
ruscorpora.ru
Ekaterina Rakhilina, Vladimir Plungian, Olga Lyashevskaya, Dmitry Sichinava
RNC Workshop, SCLC 2014
16 Feb 2014 Harvard University
2
Preliminary plan
• Russian National Corpus Season 2014:
• hints and tricks
• new features and plans
• Corpus data for offline research
• Discussion
Your input is much appreciated!
3
Main participants
• V. V. Vinogradov Russian Language Institute, Russian Academy of Sciences, Moscow
• Yandex, Internet and technology company
5
RNC non-commercial partnership
• universities (Moscow, Saint Petersburg, Saratov, etc.)
• research institutes (IPPI RAN, ILI RAN)
• IT companies
• personal membership
You are welcome to share your corpus data through RNC!
New goals: Licensing issues and data distribution.
7
RUSCORPORA.RU family
• The main corpus of written Modern Russian (1700-present, 230 MW)
• Newspapers & news (2000-present, 174 MW)
• Corpus of Russian poetry (10 MW)
• Spoken corpus (11 MW)
• Multimedia corpus (4 MW)
• Accentuated corpus (14 MW)
• Parallel corpora (54 MW)
• Syntactic treebank (0.7 MW)
• Corpus of Russian dialects
• Russian-for-Schools corpus
8
RUSCORPORA.RU - new corpora
• Diachronic corpora:
• Old Russian
• Church Slavonic
• Middle Russian
• Blogger corpus
• Learner corpora
13
Customizing subcorpus
The main corpus:
Modern fiction of various genres
Modern drama
Memoirs and biographies
Journalism and literary criticism
Scientific, popular scientific and teaching texts
Religious and philosophical texts
Technical texts
Business and jurisprudence texts
Day-to-day life texts, including texts not intended for publication (letters, diaries, etc.)
14
Hints & tricks
• sorting: надо же было ...
• раз...ся (рас...ся)
• Мама мыла раму.
• hypocoristic personal names not ending in *чка, *нька
• use word-formation
• вс- prefix
• also with possible alternations
• also in second position
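The pattern hints above (раз...ся / рас...ся, names not ending in *чка or *нька) can be tried offline on a word list. The sketch below is only an illustration: `wildcard_to_regex` and `search` are hypothetical helpers, the word list is invented, and the code does not reproduce the query syntax of the RNC search form.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a corpus-style wildcard (* = any run of letters) into a compiled regex."""
    parts = (re.escape(chunk) for chunk in pattern.split("*"))
    return re.compile(r"\w*".join(parts), re.IGNORECASE)

def search(words, include, exclude=()):
    """Keep words matching any `include` pattern and no `exclude` pattern."""
    inc = [wildcard_to_regex(p) for p in include]
    exc = [wildcard_to_regex(p) for p in exclude]
    return [
        w for w in words
        if any(r.fullmatch(w) for r in inc)
        and not any(r.fullmatch(w) for r in exc)
    ]

# Invented sample data, not corpus output.
words = ["разбежаться", "расплакаться", "разбег", "собраться"]
names = ["Ванечка", "Ваня", "Оленька", "Саша"]

# Both spellings of the prefix are covered by listing two patterns.
verbs = search(words, include=["раз*ся", "рас*ся"])
# Negative condition: any name, minus the two diminutive endings.
short_names = search(names, include=["*"], exclude=["*чка", "*нька"])

print(verbs)
print(short_names)
```

Listing both prefix variants explicitly mirrors the hint about alternations: the spelling рас- appears before voiceless consonants, so a single pattern would miss half of the verbs.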
15
Recent news from the RNC
• Poetry: up to 1990-2002
• MURCO: Multimedia corpus (movies, talks, etc.)
types of speech situations (welcome, questioning, interview, dispute, quarrel, etc.)
gestures + gestures provided by speech
+ academic talks & discussions
+ Parallel Spoken Russian:
Gogol's Revizor on many stages (MultiParC)
• Diachronic evidence (Russian in XII-XVII cc.)
• Parallel corpora
18
RUSCORPORA.RU - new corpora
• Diachronic corpora
• Old Russian & Birch letters
• Church Slavonic
• Middle Russian
• Slavic parallel corpora
• Blogger corpus
• Learner corpora
21
RNC annotation: the main corpus
Four major annotation layers:
• meta-textual annotation
register/genre, author, creation date, size, etc.
• word-level morphosyntactic annotation
lemma, POS, inflectional categories, distorted or anomalous forms, etc.
• accentual annotation
normative place of accent, accentual shifts in fixed expressions
• lexico-semantic annotation
lexical classes of verbs, nouns, pronouns, adjectives and adverbs
+ new! word-formation annotation
prefixes, suffixes, roots
22
N-gram viewer
http://ruscorpora.ru/ngram.html
• word forms - Графики (Graphs)
• cf. Google Books Ngram Viewer
• + wildcards: *сторонился
• year span by date of creation, not date of publishing (cf. Google Books)
• smoothing (3 to 20 is recommended)
• lemmas, not words - Распределение по годам (Distribution by year; output page)
• Статистика по метаатрибутам (Statistics by meta-attributes)
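The slides do not say how the viewer computes smoothing. A common choice, and the one assumed here, is a centered moving average in which the smoothing value is the number of neighbouring years taken on each side. The function name `smooth` and the yearly frequencies below are invented for illustration.

```python
def smooth(values, window):
    """Centered moving average over yearly frequencies.

    Each year is averaged with up to `window` neighbours on each side;
    at the edges of the series the window is simply truncated.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Invented relative frequencies for five consecutive years.
freq_per_year = [0.0, 6.0, 0.0, 3.0, 0.0]
result = smooth(freq_per_year, 1)
print(result)  # [3.0, 2.0, 3.0, 1.0, 1.5]
```

A larger window flattens single-year spikes, which matters for sparse early periods where one text can dominate a year.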
25
Annotation mistakes and how to fix them
• Please tag mistakes if you come across them in the output data
26
Even more Russian corpora
in cooperation with the RNC
• "Simple" Russian (HSE in Nizhny Novgorod)
• "we cannot ask 5-year-old children to read examples from the corpus" (NB students!)
• a subcorpus of short simple sentences, frequent words from the "lexical minimum"
• "Non-perfect" Russian
• Heritage language in Finland and USA (study of language interference)
• Russian as L2 in Daghestan and other parts of Russia
• Learner corpus of academic writing
28
Corpus of Academic Writing (Корпус академического письма)
http://web-corpora.net/RussianAcademCorpus/search/
Essays, drafts of term papers, other academic texts written by students
• sociology, economics, politics, law, psychology, linguistics, management, etc.
• 1 MW available so far
30
Corpus of academic writing
• 3 levels of mistake annotation
1) linguistic type (orthography, punctuation, lexical choice, grammatical choice & form, discourse-oriented)
2) weight (minor mistake, medium level, major/critical mistake)
3) interpretation: what is the cause? (misprint, wrong synonym, mix of constructions, etc.)
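The three annotation levels map naturally onto a small record type. The sketch below is an assumption about how such a record could look, not the corpus's actual storage format: the class and field names are hypothetical, and the weight and cause given to the sample span are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class LinguisticType(Enum):
    """Level 1: linguistic type of the mistake."""
    ORTHOGRAPHY = "orthography"
    PUNCTUATION = "punctuation"
    LEXICAL_CHOICE = "lexical choice"
    GRAMMATICAL = "grammatical choice & form"
    DISCOURSE = "discourse-oriented"

class Weight(Enum):
    """Level 2: how serious the mistake is."""
    MINOR = 1
    MEDIUM = 2
    MAJOR = 3

@dataclass
class MistakeAnnotation:
    span: str                        # the text fragment marked as a mistake
    linguistic_type: LinguisticType
    weight: Weight
    cause: str                       # Level 3: free-text interpretation

# Illustrative values only; not taken from the corpus annotation.
example = MistakeAnnotation(
    span="менее компактнее",
    linguistic_type=LinguisticType.GRAMMATICAL,
    weight=Weight.MEDIUM,
    cause="mix of constructions",
)

def at_least(annotations, threshold):
    """Filter annotations whose weight is at or above `threshold`."""
    return [a for a in annotations if a.weight.value >= threshold.value]

serious = at_least([example], Weight.MAJOR)
print(example.linguistic_type.value, len(serious))
```

Keeping the cause as free text while the first two levels are closed sets reflects the slide: type and weight are enumerated, whereas the list of causes ends in "etc.".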
31
Heritage language
http://web-corpora.net/RussianLearnerCorpus/search
• National Heritage Language Resource Center (UCLA)
• Polinsky Lab at Harvard
• O. Kisselev, A. Alsufieva, I. Dubinina et al.
• E. Rakhilina and her research lab at HSE
Russian learner corpus
33
Some examples
• Эти ноутбуки потребляли меньше энергии, но были менее компактнее по объему.
• И прибыль от разрушения гораздо более заметна и быстра, нежели чем от строительства.
• В русском языке семантический диапазон данного слова чрезвычайно широк, нежели в английском
(Academic Writing Corpus)
• В России человек больше (! чаще) считается расистом из за действий (Heritage Corpus)
35
RusGram
Corpus-based Russian reference grammar
• traditional academic grammar (академическая грамматика)
• morphology (inflection)
• syntax
• + RNC-based statistics
• + lexical anchors in focus
• substandard Russian: negative evidence or "points of future development"?
37
Corpus-based dictionaries
• http://dict.ruslang.ru/
Frequency dictionary of Modern Russian
offline version available from my homepage
New grammatical dictionary
Russian idiomaticity in real usage (with frequencies):
Which adjectival intensifier can we use with nouns?
Which verb can we use with abstract nouns?
Framebank (the dictionary of argument-predicate constructions attested in the RNC)
offline release summer 2014
38
Corpus-based dictionaries
In progress: Grammatical forms of Russian lexemes
• Paradigms of verbs, nouns, adjectives
• Distribution by time & text registers
• Lexical classes: comparative study
40
Statistics & offline use
Overall idea: to show patterns in your output
• statistics
• visualization
But: the RNC corpus workbench is not adapted to work with customized sets of data
Step 1: N-grams
41
N-grams search (beta!)
2-, 3-, 4-, 5-word chains
• не до *
• потрясающе (*о | *е)
Most frequent N-grams - ЧАСТОТЫ (Frequencies)
In progress: Search by lemma, morphology, semantics, word formation
In progress: Explore time & text registers
+ in any subcorpus of your choice
In progress: Search with distance between words (incl. repetitions)
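For offline work, word chains of this kind can be counted with a few lines of code. The following is a minimal sketch under stated assumptions: tokenization is a plain whitespace split, the sentences are invented, and `matches` only handles the `*` wildcard (not the alternation syntax shown in the second query). None of the function names come from the RNC.

```python
from collections import Counter
import re

def ngrams(tokens, n):
    """All contiguous n-token chains of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, n):
    """Count n-grams sentence by sentence, so chains never cross a sentence boundary."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts

def matches(ngram, pattern):
    """Check an n-gram against a space-separated pattern where * is a wildcard."""
    slots = pattern.split()
    if len(slots) != len(ngram):
        return False
    for word, slot in zip(ngram, slots):
        regex = ".*".join(re.escape(part) for part in slot.split("*"))
        if not re.fullmatch(regex, word):
            return False
    return True

# Invented sample sentences, not corpus data.
sentences = ["мне не до шуток", "ему не до смеха", "не до шуток сейчас"]
trigrams = count_ngrams(sentences, 3)
hits = {g: c for g, c in trigrams.items() if matches(g, "не до *")}
print(hits)
```

Counting per sentence is a deliberate choice here: it keeps the example consistent with a search that does not look across sentence boundaries.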
42
Offline data for advanced users & computational resources
NB! We are linguists, not lawyers: we cannot distribute texts
But: we can share annotations & statistics on this data
So far:
• ЧАСТОТЫ (Frequencies): 2-, 3-, 4-, 5-grams http://ruscorpora.ru/corpora-freq.html
• 1 MW Morphological standard (manually disambiguated, shuffled sentences)
Plans:
• N-grams for other corpora + annotated data
• POS annotations etc.: V-S-S-CONJ-ADJ-S.
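Reading "V-S-S-CONJ-ADJ-S" as a hyphen-joined sequence of POS tags is an assumption; under that reading, pattern statistics can be shared without the underlying text. The sketch below uses invented tagged sentences and a hypothetical helper `pos_signature`.

```python
from collections import Counter

# Invented, hand-tagged sample; tags follow the S (noun) / V (verb) style used in the slides.
tagged_sentences = [
    [("мама", "S"), ("мыла", "V"), ("раму", "S")],
    [("папа", "S"), ("читал", "V"), ("газету", "S")],
    [("дождь", "S"), ("шёл", "V")],
]

def pos_signature(sentence):
    """Drop the word forms and keep only the hyphen-joined POS tags."""
    return "-".join(tag for _, tag in sentence)

pattern_counts = Counter(pos_signature(s) for s in tagged_sentences)
print(pattern_counts.most_common())
```

Because the word forms are discarded, a table of such signatures carries annotation statistics only, which fits the constraint that texts themselves cannot be distributed.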
43
studiorum.ruscorpora.ru
A companion web site to the RNC
• Corpus methods in linguistic research
• Corpus in teaching Russian as a second language
• Corpus in teaching linguistics, Russian stylistics, philology and social sciences
• Corpus in teaching Russian in school
• References (incl. PhD manuscripts and term papers)
• Corpus resources
• F.A.Q.
44
Discussion
Any questions?
comments?
complaints?
What would you like to see in the corpus?
45
Known issues
1. A bag of words
• Lemma: дуло 'muzzle'
• Gram: V
2. *базар* (разбазарить, разбазаривать, пробазарить, базарчик, Базаров)
• NB word-formation: just words in the dictionary
3. Search across sentence boundaries
4. Unbalanced portions of data across time
• который
• и, в, на, они
• не
Solution: TBA soon
annotated n-grams database search
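One possible reading of issue 1, offered as an assumption rather than the presenters' own explanation: in non-disambiguated text a token keeps every analysis the parser proposes, and if lemma and grammatical conditions are checked against the pooled set of features, the query "lemma дуло + V" matches even though no single analysis is both. The token structure and function names below are hypothetical.

```python
# A non-disambiguated token: the form "дуло" is either a noun or a past-tense verb form.
token = {
    "form": "дуло",
    "analyses": [
        {"lex": "дуло", "gr": {"S", "n", "sg", "nom"}},           # noun 'muzzle'
        {"lex": "дуть", "gr": {"V", "ipf", "praet", "n", "sg"}},  # verb 'to blow'
    ],
}

def bag_match(token, lemma, gram):
    """Pool lemmas and tags from all analyses, then test each condition separately."""
    lemmas = {a["lex"] for a in token["analyses"]}
    grams = set().union(*(a["gr"] for a in token["analyses"]))
    return lemma in lemmas and gram in grams

def per_analysis_match(token, lemma, gram):
    """Require one and the same analysis to satisfy both conditions."""
    return any(a["lex"] == lemma and gram in a["gr"] for a in token["analyses"])

print(bag_match(token, "дуло", "V"))           # True: a spurious hit
print(per_analysis_match(token, "дуло", "V"))  # False
```

The second function shows the stricter behaviour a user would probably expect; whether the announced n-gram database search works this way is not stated in the slides.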
50
Morphological parsing
Zaliznjak's (1967, 1977) formal model of Russian inflection
A set of parsers based on the Grammatical Dictionary
MYSTEM (Segalovich 2003) and DIALING (Sokirko 2004) morphological parsers in use
Lemma, POS and grammatical features:
Examples: взял ‘take.PAST’
<ana lex="взять" gr="act,indic,m,pf,praet,sg,tran,V"/>
жалеючи ‘pity.GER’
<ana lex="жалеть" gr="act,anom,ger,ipf,praes,tran,V"/>
Hypotheses for words not in the dictionary: "Рогочим"
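The `ana` elements shown above can be read with a standard XML parser. This is a minimal sketch: it assumes straight quotes around attribute values and treats any all-uppercase tag in `gr` as the part of speech, which holds for the two examples on the slide but is a heuristic, not a documented rule of the format.

```python
import xml.etree.ElementTree as ET

def parse_ana(xml_fragment):
    """Split one <ana> element into lemma, POS and remaining grammatical features."""
    elem = ET.fromstring(xml_fragment)
    tags = elem.get("gr").split(",")
    # Heuristic (assumption): POS tags are written in capitals, features in lower case.
    pos = [t for t in tags if t.isupper()]
    features = sorted(t for t in tags if not t.isupper())
    return {
        "lemma": elem.get("lex"),
        "pos": pos[0] if pos else None,
        "features": features,
    }

parsed = parse_ana('<ana lex="взять" gr="act,indic,m,pf,praet,sg,tran,V"/>')
print(parsed["lemma"], parsed["pos"], parsed["features"])
```

A non-disambiguated word would carry several such elements, so in practice the function would be applied to each `ana` child in turn.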
52
• 6-million-word corpus of manually disambiguated texts
• Other texts are not disambiguated
• Applying automatic disambiguation techniques (training on the disambiguated corpus and its evaluation)
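The slides do not name the disambiguation techniques being applied. To make the train-then-evaluate loop concrete, the sketch below uses the simplest possible baseline, picking for each word form the analysis seen most often in the manually disambiguated data. The data, the `lemma/POS` label format and the function names are all invented for illustration.

```python
from collections import Counter, defaultdict

def train(tagged_tokens):
    """Learn the most frequent analysis for each word form."""
    counts = defaultdict(Counter)
    for form, analysis in tagged_tokens:
        counts[form][analysis] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}

def evaluate(model, tagged_tokens):
    """Share of tokens whose predicted analysis equals the gold one (unseen forms count as errors)."""
    correct = sum(1 for form, gold in tagged_tokens if model.get(form) == gold)
    return correct / len(tagged_tokens)

# Invented miniature "gold standard", split into training and held-out parts.
train_data = [
    ("дуло", "дуть/V"), ("дуло", "дуть/V"), ("дуло", "дуло/S"),
    ("стекло", "стекло/S"),
]
test_data = [
    ("дуло", "дуть/V"), ("дуло", "дуло/S"),
    ("стекло", "стекло/S"), ("мыла", "мыть/V"),
]

model = train(train_data)
accuracy = evaluate(model, test_data)
print(accuracy)  # 0.5
```

A context-free baseline like this gets ambiguous forms wrong whenever the rarer reading occurs, which is exactly why context-aware methods trained on the disambiguated corpus are needed.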
Morphosyntactic annotation & disambiguation
[Diagram: manually disambiguated corpus (6 million) vs. non-disambiguated corpus]
53
• The traditions of Moscow lexical semantics (Apresjan 1974/1992, Mel’chuk 1996, Paducheva 1974, etc.)
• Dictionaries of lexical classes (Kuznecova 1982, Babenko 2001, Shvedova 2004, 2007)
• DB LEXICOGRAPHER: verbs and nouns (Kustova & Paducheva 1994, Kustova 2004, Paducheva 2004, Rakhilina 2000)
Main principles:
• Coarse-grained classification
• Well-known classes traditionally used in linguistic research
• The classification aims to explore the semantically motivated peculiarities of Russian grammar
• and allows identifying various constructions in the text
Lexical taxonomy
54
Includes 6 independent classifications (some of them hierarchical):
• Category (prime lexical divisions that determine main semantic features: concrete, abstract, proper nouns; qualitative, relative, possessive adjectives);
• Taxonomy (e.g. luk ‘bow’: «weapon», radost’ ‘joy’: «emotion», bystryj ‘quick’: «speed», staryj ‘old’: «age»);
• Mereology (e.g. rukav ‘sleeve’: «parts of clothes», buket ‘bunch’: «sets and aggregates», kaplja ‘drop’: «quanta and portions of stuff»);
• Topology (e.g. kastrjulja ‘pot’: «container», nos ‘nose’: «juts», zmeja ‘snake’: «ropes»);
• Evaluation (e.g. blagouxanije ‘odor’: «positive», presmykat’sja ‘lick the boots’: «negative»);
• Derivational classes (e.g. knizhechka ‘little book’: «diminutives», sosnovyj ‘piny’: «adjectives derived from nouns»).
RNC semantic database
55
Separate entry for each meaning of the word:
Pojas ‘belt’, ‘waist’, ‘zone’
1. ‘belt of a dress’
Category: non-predicate
Taxonomy: accessory
Mereology: part(cloth)
Topology: stripe
2. ‘to bow from a waist’
Category: non-predicate
Mereology: bodypart(human/animal)
3. ‘time zone’
Category: non-predicate
Taxonomy: space
Topology: stripe
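The pojas entry above can be pictured as one record per meaning, with a value for each classification that applies. The layout below is a sketch with hypothetical field names, not the schema of the RNC semantic database; only the values are taken from the slide.

```python
# One list entry per meaning of the lemma; a classification is simply absent when it does not apply.
SEMANTIC_DB = {
    "pojas": [
        {"gloss": "belt of a dress", "category": "non-predicate",
         "taxonomy": "accessory", "mereology": "part(cloth)", "topology": "stripe"},
        {"gloss": "to bow from a waist", "category": "non-predicate",
         "mereology": "bodypart(human/animal)"},
        {"gloss": "time zone", "category": "non-predicate",
         "taxonomy": "space", "topology": "stripe"},
    ],
}

def candidate_tags(lemma, classification):
    """All values a lemma can take for one classification, pooled across its meanings."""
    senses = SEMANTIC_DB.get(lemma, [])
    return sorted({s[classification] for s in senses if classification in s})

taxonomy_tags = candidate_tags("pojas", "taxonomy")
topology_tags = candidate_tags("pojas", "topology")
print(taxonomy_tags, topology_tags)
```

Without word sense disambiguation, a token of pojas in running text would receive every candidate value, so a search for «accessory» would also return the 'time zone' uses.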
57
• All content words (nouns, verbs, adjectives, adverbs, pronouns, numerals) are automatically assigned semantic tagsets
• Currently more than 350,000 entries (ca. 100,000 lemmas) in the database
• Large-scale, word-by-word annotation
• Disambiguation is still needed
Lexico-semantic annotation