NOOJ Conference Inalco, Paris June 16th, 2012
Vincent BÉNET, INALCO
CREE, computer-assisted research
Design and implementation of grammatical and lexical resources for the Russian language
for Max Silberztein’s NooJ software
Russian Module for NooJ: design and implementation
Design of linguistic resources
Description of the realization: dictionaries / paradigms / grammars
Job left to be done…
Writing lexical resources for the Russian language
Build dictionaries from texts
Create one « small » dictionary and many grammars for derivational forms:
раб + а (slave)
раб + от + а + ть (work)
за + раб + от + к + а (salary)
Complete one « big » existing dictionary and create many grammars
ZALIZNIAK’s grammatical dictionary: 96,000 entries
Complete dictionary, in inverted alphabetical order, with all grammatical annotation
To obtain, to reach:
Достигать нсв нп 1a$3 (достигнуть//достичь) имеется страд
Dostigat’ ipf nt 1a$3 (dostignut’/dostich’) has a passive form
Problems encountered:
- The classification is complete, but some tags are absent (V, N…)
- The classification is based on accent markers
- Many informal, unclassified annotations have been added
The problem of accent markers was postponed.
Zalizniak’s dictionary was re-sorted; its classification was modified, simplified and completed for computer use.
The design of lexical resources for the Russian language has consisted in:
1. creating grammatical tags
2. recoding the dictionary with these tags
3. sorting the dictionary (inverted alphabetical order for each word)
4. fixing a paradigm model list (karta instead of zh1a)
5. writing paradigms
6. problem with ё / е
7. allocating models to the words
8. verifying the results
9. testing with texts
10. correcting and proofreading
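The sorting step can be illustrated with a minimal Python sketch. It assumes that "inverted alphabetical order" means comparing words from their last letter backwards, and it uses plain code-point comparison rather than a full Russian collation. The function name `inverted_sort` is hypothetical and not part of NooJ.

```python
def inverted_sort(words):
    """Sort words by their reversed spelling (last letter compared first)."""
    return sorted(words, key=lambda w: w[::-1])


words = ["карта", "корова", "книга", "собака", "завод"]
for word in inverted_sort(words):
    print(word)
```

With this ordering, words that share an ending end up next to each other, which may make it easier to review which inflectional model each group needs.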
1. Creating tags and properties: N, A, V, ADV…
A_Forme = fc | fl | adv ;
A_Genre = m | f | n ;
A_SGenr = an | inan ;
A_Nombre = s | p ;
A_Cas = Im | Vi | Ro | Da | Tv | Pr | Zv ;
A_Deg = Comp | Sup ;
ADV_Deg = Comp ;
V_Pers = 1 | 2 | 3 ;
V_Asp = Ipf | Pf ;
V_Type = Mvt ;
V_Morph = Pvb | Simp | Sufx | PvbSufx ;
V_SsAsp = Det | Indet ;
V_Temps = Pre | Pa | Fu ;
V_Mode = Inf | Ind | Imp | Cond | Ger | Prtp ;
V_Voix = Act | Pss ;
V_Genre = m | f | n ;
V_Nombre = s | p ;
V_Constr = intr | tr | sja ;
V_Cas = Im | Vi | Ro | Da | Tv | Pr ;
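As an illustration of how such property definitions can be used to check hand-written feature strings, here is a small Python sketch. The values are copied from the verb properties above; the dictionary layout and the function `unknown_tags` are assumptions made for this example and are not NooJ's own mechanism.

```python
# Verb property values copied from the definitions above.
V_PROPERTIES = {
    "Pers": {"1", "2", "3"},
    "Asp": {"Ipf", "Pf"},
    "Type": {"Mvt"},
    "Morph": {"Pvb", "Simp", "Sufx", "PvbSufx"},
    "SsAsp": {"Det", "Indet"},
    "Temps": {"Pre", "Pa", "Fu"},
    "Mode": {"Inf", "Ind", "Imp", "Cond", "Ger", "Prtp"},
    "Voix": {"Act", "Pss"},
    "Genre": {"m", "f", "n"},
    "Nombre": {"s", "p"},
    "Constr": {"intr", "tr", "sja"},
    "Cas": {"Im", "Vi", "Ro", "Da", "Tv", "Pr"},
}


def unknown_tags(features, properties=V_PROPERTIES):
    """Return the tags of a '+'-separated feature string that no property declares."""
    known = set().union(*properties.values())
    return [tag for tag in features.split("+") if tag not in known]


print(unknown_tags("1+s+Pre"))             # every tag is declared
print(unknown_tags("Prtp+Pa+Pss+x+s+Tv"))  # 'x' is not a declared value
```

A flat check like this only catches undeclared values; it would not notice a declared tag used in the wrong slot, for example a case tag written where a tense tag was intended.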
2. Recoding the dictionary
3. Sorting the dictionary to get inverted alphabetical ordering
4. Paradigm model list
#j1a=karta
#jo1a=korova
#j2a=nedelja
#jo2a=boginja
#j3a=kniga
#jo3a=sobaka
#j4a=tuča
#jo4a=kassirša
#j5a=ulica
#jo5a=volčica
#j6a=statuja
#jo6a=feja
#j7a=linija
#jo7a=furija
5. Writing paradigms
карта = <E>/Im+f+s + <B>у/Vi+f+s + <B>ы/Ro+f+s + <B>е/Da+f+s + <B>ой/Tv+f+s + <B>е/Pr+f+s
 + <B>ы/Im+f+p + <B>ы/Vi+f+p + <B>/Ro+f+p + <B>ам/Da+f+p + <B>ами/Tv+f+p + <B>ах/Pr+f+p ;
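To make the paradigm notation concrete, the following Python sketch expands a paradigm into inflected forms. It assumes the usual reading of the operators: `<E>` adds nothing, `<B>` deletes the last letter of the lemma, and `<Bn>` deletes the last n letters, after which the suffix is appended. The functions `apply_rule` and `inflect` are hypothetical helpers written for this illustration, and only part of the карта paradigm is used.

```python
import re

# Matches <E>, <B> or <Bn> at the start of a rule.
OPERATOR = re.compile(r"<(E|B(\d*))>")


def apply_rule(lemma, rule):
    """Apply one rule such as '<B>у' or '<B4>озьму' to a lemma."""
    match = OPERATOR.match(rule)
    if match is None:
        raise ValueError(f"unsupported rule: {rule!r}")
    suffix = rule[match.end():]
    if match.group(1) == "E":
        return lemma + suffix
    count = int(match.group(2) or 1)  # a bare <B> deletes one letter
    return lemma[:len(lemma) - count] + suffix


def inflect(lemma, paradigm):
    """Expand a paradigm whose alternatives are separated by ' + ' or ' | '."""
    body = paradigm.strip().rstrip(";").strip()
    forms = []
    for alternative in re.split(r"\s+[+|]\s+", body):
        rule, _, features = alternative.partition("/")
        forms.append((apply_rule(lemma, rule), features))
    return forms


KARTA = "<E>/Im+f+s + <B>у/Vi+f+s + <B>ы/Ro+f+s + <B>/Ro+f+p + <B>ами/Tv+f+p ;"
for form, features in inflect("карта", KARTA):
    print(form, features)
```

Because the same rules apply to any lemma assigned to the model, calling `inflect("корова", KARTA)` would produce the corresponding forms of that noun only if it really follows the same pattern, which is why the model list distinguishes several classes.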
5. Paradigm for verbs
взять = <E>/Inf
 | <B4>озьму/1+s+Pre | <B4>озьмешь/2+s+Pre | <B4>озьмет/3+s+Pre | <B4>озьмем/1+p+Pre | <B4>озьмете/2+p+Pre
 | <B4>озьмёшь/2+s+Pre | <B4>озьмёт/3+s+Pre | <B4>озьмём/1+p+Pre | <B4>озьмёте/2+p+Pre | <B4>озьмут/3+p+Pre
 | <B2>л/m+s+Pa | <B2>ла/f+s+Pa | <B2>ло/n+s+Pa | <B2>ли/p+Pa
 | <B4>озьми/2+s+Imp | <B4>озьмите/2+p+Imp
 | <B2>в/Ger | <B2>вши/Ger
 | <B2>вший/Prtp+Pa+Act+m+s+Im | <B2>вший/Prtp+Pa+Act+m+s+Vi | <B2>вшего/Prtp+Pa+Act+m+an+s+Vi | <B2>вшего/Prtp+Pa+Act+m+s+Ro | <B2>вшему/Prtp+Pa+Act+m+s+Da | <B2>вшим/Prtp+Pa+Act+m+s+Tv | <B2>вшем/Prtp+Pa+Act+m+s+Pr
 | <B2>вшая/Prtp+Pa+Act+f+s+Im | <B2>вшую/Prtp+Pa+Act+f+s+Vi | <B2>вшей/Prtp+Pa+Act+f+s+Ro | <B2>вшей/Prtp+Pa+Act+f+s+Da | <B2>вшей/Prtp+Pa+Act+f+s+Tv | <B2>вшею/Prtp+Pa+Act+f+s+Tv | <B2>вшей/Prtp+Pa+Act+f+s+Pr
 | <B2>вшее/Prtp+Pa+Act+n+s+Im | <B2>вшее/Prtp+Pa+Act+n+s+Vi | <B2>вшего/Prtp+Pa+Act+n+s+Vi | <B2>вшего/Prtp+Pa+Act+n+s+Ro | <B2>вшему/Prtp+Pa+Act+n+s+Da | <B2>вшим/Prtp+Pa+Act+n+s+Tv | <B2>вшем/Prtp+Pa+Act+n+s+Pr
 | <B2>вшие/Prtp+Pa+Act+p+Im | <B2>вшие/Prtp+Pa+Act+p+Vi | <B2>вших/Prtp+Pa+Act+an+p+Vi | <B2>вших/Prtp+Pa+Act+p+Ro | <B2>вшим/Prtp+Pa+Act+p+Da | <B2>вшими/Prtp+Pa+Act+p+Tv | <B2>вших/Prtp+Pa+Act+p+Pr
 | <B2>тый/Prtp+Pa+Pss+m+s+Im | <B2>тый/Prtp+Pa+Pss+m+s+Vi | <B2>того/Prtp+Pa+Pss+m+an+s+Vi | <B2>того/Prtp+Pa+Pss+m+s+Ro | <B2>тому/Prtp+Pa+Pss+m+s+Da | <B2>тым/Prtp+Pa+Pss+m+s+Tv | <B2>том/Prtp+Pa+Pss+m+s+Pr
 | <B2>тая/Prtp+Pa+Pss+f+s+Im | <B2>тую/Prtp+Pa+Pss+f+s+Vi | <B2>той/Prtp+Pa+Pss+f+s+Ro | <B2>той/Prtp+Pa+Pss+f+s+Da | <B2>той/Prtp+Pa+Pss+f+s+Tv | <B2>тою/Prtp+Pa+Pss+f+s+Tv | <B2>той/Prtp+Pa+Pss+f+s+Pr
 | <B2>тое/Prtp+Pa+Pss+n+s+Im | <B2>тое/Prtp+Pa+Pss+n+s+Vi | <B2>того/Prtp+Pa+Pss+n+s+Ro | <B2>тому/Prtp+Pa+Pss+n+s+Da | <B2>тым/Prtp+Pa+Pss+n+s+Tv | <B2>том/Prtp+Pa+Pss+n+s+Pr
 | <B2>тые/Prtp+Pa+Pss+p+Im | <B2>тые/Prtp+Pa+Pss+p+Vi | <B2>тых/Prtp+Pa+Pss+an+p+Vi | <B2>тых/Prtp+Pa+Pss+p+Ro | <B2>тым/Prtp+Pa+Pss+p+Da | <B2>тыми/Prtp+Pa+Pss+p+Tv | <B2>тых/Prtp+Pa+Pss+p+Pr
 | <B2>т/Prtp+Pa+Pss+m+s+fc | <B2>та/Prtp+Pa+Pss+f+s+fc | <B2>то/Prtp+Pa+Pss+n+s+fc | <B2>ты/Prtp+Pa+Pss+p+fc ;
6. Problem of the letter ё / е (partially solved: two entries or two paradigms)
ёжик,N+m+an+FLX=бульдог
ёж,N+m+an+FLX=богач
ежик,N+m+an+FLX=бульдог
еж,N+m+an+FLX=богач
жевать = <E>/Inf | <B5>ую/1+s+Pre
 | <B5>уёшь/2+s+Pre | <B5>уёт/3+s+Pre | <B5>уём/1+p+Pre | <B5>уёте/2+p+Pre
 | <B5>уешь/2+s+Pre | <B5>ует/3+s+Pre | <B5>уем/1+p+Pre | <B5>уете/2+p+Pre
 | <B5>уют/3+p+Pre ;
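The "two entries" option can be sketched in Python as follows. The sketch assumes that each dictionary line has the form `lemma,information` and that the second entry is simply the same line with every lowercase ё in the lemma written as е; the helper name `with_e_variants` is hypothetical, and capital Ё is not handled.

```python
def with_e_variants(entries):
    """Add, after each entry whose lemma contains 'ё', a copy spelled with 'е'."""
    result = []
    for entry in entries:
        result.append(entry)
        lemma, comma, info = entry.partition(",")
        if "ё" in lemma:
            # Only the lemma is respelled; the model name after FLX= is left untouched.
            result.append(lemma.replace("ё", "е") + comma + info)
    return result


for line in with_e_variants(["ёжик,N+m+an+FLX=бульдог", "ёж,N+m+an+FLX=богач"]):
    print(line)
```

Duplicating entries keeps the paradigms unchanged, at the cost of a larger dictionary; the alternative shown in the verb paradigm is to list both spellings inside the paradigm itself.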
7. Allocating models to words
abažur,N+m+inan+FLX=zavod
abazinec,N+m+an+FLX=ukrainec
abazin,N+m+an+FLX=artist
abaz,N+m+inan+FLX=zavod
abak,N+m+inan+FLX=čajnik
abbat,N+m+an+FLX=artist
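The entry format shown above (lemma, category, features, and the model name after `FLX=`) can be read with a few lines of Python. This is only a sketch of that surface format as it appears on the slide; `parse_entry` and the field names are hypothetical, and real dictionary files may contain further conventions not covered here.

```python
from collections import Counter


def parse_entry(line):
    """Split 'lemma,CAT+feat+...+FLX=model' into its parts."""
    lemma, _, info = line.partition(",")
    category, *properties = info.split("+")
    entry = {"lemma": lemma, "category": category, "features": [], "FLX": None}
    for prop in properties:
        if prop.startswith("FLX="):
            entry["FLX"] = prop[len("FLX="):]
        else:
            entry["features"].append(prop)
    return entry


lines = [
    "abažur,N+m+inan+FLX=zavod",
    "abazinec,N+m+an+FLX=ukrainec",
    "abazin,N+m+an+FLX=artist",
    "abaz,N+m+inan+FLX=zavod",
    "abak,N+m+inan+FLX=čajnik",
    "abbat,N+m+an+FLX=artist",
]
entries = [parse_entry(line) for line in lines]
model_counts = Counter(entry["FLX"] for entry in entries)
print(model_counts)
```

Counting how many words are attached to each model is one simple way to spot models that are rarely used or entries that have no model at all.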
8. Verifying paradigms
9. Testing with Russian texts:
« The Nose » by Gogol
« The Gambler » by Dostoievsky
« The Prisoner of the Caucasus » by Tolstoy
« The Lady with the Dog » by Chekhov
« Short stories » by Harms
10. Correcting errors:
- bad encoding (mixed Latin/Cyrillic letters): A B E K M H O P C y X, e.g. MOCKBA
- errors in paradigms
- bad allocation of models to words
  mobile vowel / palatalization
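The encoding problem in the first item can be detected automatically. The Python sketch below flags words that mix Latin and Cyrillic letters by looking at Unicode character names through the standard `unicodedata` module; the function names are hypothetical, and the example strings use escape sequences so that the script of each letter is unambiguous.

```python
import unicodedata


def scripts(word):
    """Return the set of scripts (Latin, Cyrillic) used by the letters of a word."""
    found = set()
    for char in word:
        if not char.isalpha():
            continue
        name = unicodedata.name(char, "")
        if name.startswith("CYRILLIC"):
            found.add("Cyrillic")
        elif name.startswith("LATIN"):
            found.add("Latin")
    return found


def is_mixed(word):
    """True when a single word contains both Latin and Cyrillic letters."""
    return len(scripts(word)) > 1


cyrillic = "\u041c\u041e\u0421\u041a\u0412\u0410"  # all six letters are Cyrillic
mixed = "M\u041e\u0421\u041a\u0412\u0410"          # Latin M followed by Cyrillic letters
print(is_mixed(cyrillic), is_mixed(mixed))
```

A word typed entirely with look-alike Latin letters is not mixed, so it passes this check; catching it would need a separate test, for instance reporting Latin-only words found in a Russian dictionary.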
Improving lexical resources
- useless words, a source of unnecessary ambiguities: the names of letters а, б, в, и, к, о, с, у, я; archaic unused words.
- repetitions of the same word in different parts of speech (adjectives / nouns; adjectives / pronouns; interjections / particles / parenthetical words)
Increase the number of different models?
To avoid generating unexpected or incongruous forms or failing to recognize existing forms.
Читав? (Čitav?) Пиша? (Piša?) Счастие? (Ŝastie?)
Suppress word entries and / or forms?
Available lexical resources for Russian
1 COMPILED BASIC DICTIONARY containing:
- 1 dictionary of 45,000 nouns (350 paradigms)
- 1 dictionary of 20,000 adjectives (50 paradigms)
- 1 dictionary of 25,000 verbs (600 paradigms)
- 1 dictionary of 880 prepositions & conjunctions, numerals, pronouns, 1600 adverbs, parenthetical words etc…
2 COMPILED ADDITIONAL DICTIONARIES (optional use):
- 1 dictionary of proper nouns (cities, countries, rivers … first names with diminutives)
- 1 dictionary of substantives-adjectives
Writing Russian grammars for NooJ
Designing disambiguation grammars for:
- grammatical agreement between adjectives & nouns
- case usage with numerals
- case usage with prepositions
- case usage with verbs
Designing grammars to locate syntagms:
- date and time expressions
- adverbial phrases of time, place…
- idiomatic structures (my name is…, I’m … old)
- verbs of motion
Syntactic grammar for Russian
Grammar to locate the verbs of motion
The prepositions in Russian
The disambiguation of « NA » (on, onto)
Annotating and disambiguating texts
The text with its ambiguities:
Verifying grammars
The text was disambiguated with the grammar of « NA »:
The disambiguation of « V » (in, into)
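The grammars themselves were shown as graphs and are not reproduced in this transcript. As a rough illustration of the idea of using a preposition to filter the case readings of the following noun, here is a Python sketch. It relies only on the fact that на and в govern the accusative (Vi) or the prepositional (Pr); the table `GOVERNED_CASES` and the function `filter_readings` are hypothetical and do not reproduce the NooJ grammars of the module.

```python
# Cases governed by each preposition (tags as in the paradigms: Vi, Pr).
GOVERNED_CASES = {
    "на": {"Vi", "Pr"},
    "в": {"Vi", "Pr"},
}


def filter_readings(preposition, readings):
    """Keep the noun readings whose case is compatible with the preposition.

    If the preposition is unknown, or no reading survives, all readings are
    returned so that the filter never deletes every analysis.
    """
    allowed = GOVERNED_CASES.get(preposition.lower())
    if allowed is None:
        return list(readings)
    kept = [r for r in readings if allowed & set(r.split("+"))]
    return kept or list(readings)


# 'карты' is ambiguous between three readings in the карта paradigm.
print(filter_readings("на", ["Ro+f+s", "Im+f+p", "Vi+f+p"]))
# 'карте' is ambiguous between dative and prepositional singular.
print(filter_readings("в", ["Da+f+s", "Pr+f+s"]))
```

Falling back to the full list when nothing matches is a deliberate choice in this sketch: an over-strict rule then leaves the ambiguity in place instead of removing the correct analysis.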
Russian grammars for NooJ
All these grammars need improvement:
They are very sensitive to word order:
- they fail to recognize structures when the word order of the Russian sentence is unusual (expressive or non-standard).
There are no grammars (yet):
- to disambiguate adverbs / adjectives
- to disambiguate adjectives / nouns
- to disambiguate conjunctions / interjections
To get reliable resources for the Russian language, the job left to be done is to design and implement:
• A data bank of verified and annotated texts
• Efficient syntactic grammars
• Semantic tagging
• Unified or harmonized tags for (Slavic, Romance, Germanic, etc.) languages to allow further multilingual processing
Russian Module for NooJ
http://www.nooj4nlp.net/pages/russian.html
Thank you for your attention