NOOJ Conference Inalco, Paris June 16th, 2012


Russian Module for NooJ: design and implementation. Conception and realization of grammatical & lexical resources for the Russian language for Max Silberztein’s NooJ software. NOOJ Conference, Inalco, Paris, June 16th, 2012. Vincent BÉNET, INALCO, CREE Recherche assistée par ordinateur.

Transcript of NOOJ Conference Inalco, Paris June 16th, 2012

Page 1: NOOJ Conference Inalco, Paris June 16th, 2012

NOOJ Conference, Inalco, Paris

June 16th, 2012

Vincent BÉNET, INALCO

CREE Recherche assistée par ordinateur (computer-assisted research)

Conception and realization of grammatical & lexical resources

for the Russian language

for Max Silberztein’s NooJ software

Russian Module for NooJ: design and implementation

ORDIDOM
Page 2: NOOJ Conference Inalco, Paris June 16th, 2012

Designing linguistic resources

Description of the realization: dictionaries / paradigms / grammars

Work left to be done…

Russian Module for NooJ:

design and implementation

Page 3: NOOJ Conference Inalco, Paris June 16th, 2012

Writing lexical resources for the Russian language

Build dictionaries from texts

Create one « small » dictionary and many grammars for derivational forms:
раб + а (slave)
раб + от + а + ть (work)
за + раб + от + к + а (salary)

Complete one « big » existing dictionary and create many grammars

Page 4: NOOJ Conference Inalco, Paris June 16th, 2012

Writing lexical resources for the Russian language

ZALIZNIAK’s grammatical dictionary: 96,000 entries. Complete dictionary, in inverted alphabetical order, with all grammatical annotation.

To obtain, to reach:
Достигать нсв нп 1a$3 (достигнуть//достичь) имеется страд
Dostigat’ ipf nt 1a$3 (dostignut’/dostich’) has a passive form

Page 5: NOOJ Conference Inalco, Paris June 16th, 2012

Writing lexical resources for the Russian language

The problem of accent (stress) markers was postponed.

Problems encountered:
- The classification is complete, but some tags are absent (V, N…)
- The classification is based on accent markers
- Many informal, unclassified annotations have been added

Zalizniak’s dictionary was re-sorted; its classification was modified, simplified and completed for computer use.

Page 6: NOOJ Conference Inalco, Paris June 16th, 2012

The design of lexical resources for the Russian language consisted of:

1. creating grammatical tags

2. recoding the dictionary with these tags

3. sorting the dictionary (inverted alphabetical order for each word)

4. fixing a paradigm model list (karta instead of zh1a)

5. writing paradigms

6. problem with ё / е

7. allocating models to the words

8. verifying the results

9. testing with texts

10. correcting and proofreading

Page 7: NOOJ Conference Inalco, Paris June 16th, 2012

Writing lexical resources for Russian

1. Creating tags and properties N, A, V, ADV…

A_Forme = fc | fl | adv ;
A_Genre = m | f | n ;
A_SGenr = an | inan ;
A_Nombre = s | p ;
A_Cas = Im | Vi | Ro | Da | Tv | Pr | Zv ;
A_Deg = Comp | Sup ;
ADV_Deg = Comp ;

V_Pers = 1 | 2 | 3 ;
V_Asp = Ipf | Pf ;
V_Type = Mvt ;
V_Morph = Pvb | Simp | Sufx | PvbSufx ;
V_SsAsp = Det | Indet ;
V_Temps = Pre | Pa | Fu ;
V_Mode = Inf | Ind | Imp | Cond | Ger | Prtp ;
V_Voix = Act | Pss ;
V_Genre = m | f | n ;
V_Nombre = s | p ;
V_Constr = intr | tr | sja ;
V_Cas = Im | Vi | Ro | Da | Tv | Pr ;
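The property declarations above follow a regular `Name = value | value ;` pattern, so they can be loaded into a lookup table and reused for checks later on. The following is a minimal Python sketch of that idea, not part of the NooJ module itself; the helper names `parse_properties` and `allowed_values` are hypothetical, and the sample string copies only three of the declarations.

```python
def parse_properties(text):
    """Parse 'Name = v1 | v2 ;' declarations into {name: set of values}."""
    properties = {}
    for declaration in text.split(";"):
        if "=" not in declaration:
            continue  # skip the empty piece after the last ';'
        name, values = declaration.split("=", 1)
        properties[name.strip()] = {v.strip() for v in values.split("|")}
    return properties


def allowed_values(properties, category):
    """Union of all values declared for one category prefix, e.g. 'V'."""
    result = set()
    for name, values in properties.items():
        if name.startswith(category + "_"):
            result |= values
    return result


# Three declarations copied from the slide (the full list works the same way).
SAMPLE = "V_Asp = Ipf | Pf ;V_Temps = Pre | Pa | Fu ;V_Nombre = s | p ;"
props = parse_properties(SAMPLE)
print(sorted(allowed_values(props, "V")))
```

Keeping the values per property name (rather than one flat set) makes it possible to notice that a value such as `Pr` belongs to a case property and not to a tense property.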

Page 8: NOOJ Conference Inalco, Paris June 16th, 2012

Writing lexical resources for Russian

2. Recoding the dictionary

3. Sorting the dictionary to get inverted alphabetical ordering
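Inverted alphabetical ordering means comparing words from their last letter backwards, which groups together words that share an ending (and therefore, often, a paradigm). A minimal Python sketch, assuming plain code-point comparison is acceptable; the function name `inverted_sort` is hypothetical, and the sample words are model names from the paradigm list.

```python
def inverted_sort(words):
    """Sort words by their spelling read from right to left."""
    return sorted(words, key=lambda word: word[::-1])


words = ["карта", "книга", "собака", "корова", "неделя"]
print(inverted_sort(words))
```

Note that code-point order places ё after я, so a real implementation would probably need a collation key that treats ё next to е.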

Page 9: NOOJ Conference Inalco, Paris June 16th, 2012

Writing lexical resources for Russian

4. Paradigm model list

#j1a=karta
#jo1a=korova
#j2a=nedelja
#jo2a=boginja
#j3a=kniga
#jo3a=sobaka
#j4a=tuča
#jo4a=kassirša
#j5a=ulica
#jo5a=volčica
#j6a=statuja
#jo6a=feja
#j7a=linija
#jo7a=furija

5. Writing paradigms

карта = <E>/Im+f+s + <B>у/Vi+f+s + <B>ы/Ro+f+s + <B>е/Da+f+s + <B>ой/Tv+f+s + <B>е/Pr+f+s + <B>ы/Im+f+p + <B>ы/Vi+f+p + <B>/Ro+f+p + <B>ам/Da+f+p + <B>ами/Tv+f+p + <B>ах/Pr+f+p ;
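To make the notation of the карта paradigm concrete: `<E>` stands for the empty string, and `<B>` (or `<Bn>`) deletes the last character (or the last n characters) of the lemma before the ending is appended. The Python sketch below emulates only these two operators in order to expand a paradigm into forms; it is an illustration under that assumption, not NooJ's own engine, and `apply_operations` and `inflect` are hypothetical names.

```python
import re


def apply_operations(lemma, operations):
    """Apply a command such as '<B>ой' or '<B4>озьму' to a lemma."""
    form = lemma
    # Tokens are either <E>, <B>, <Bn> operators or literal ending text.
    for op, count, literal in re.findall(r"<(E|B)(\d*)>|([^<]+)", operations):
        if op == "B":
            form = form[: -(int(count) if count else 1)]
        elif literal:
            form += literal
        # <E> is the empty string: nothing to do
    return form


def inflect(lemma, paradigm):
    """Return (form, tags) pairs for every alternative of a paradigm body."""
    body = paradigm.strip().rstrip(";").strip()
    forms = []
    # Alternatives are separated by ' + ' or ' | ' (with surrounding spaces).
    for alternative in re.split(r"\s+[+|]\s+", body):
        operations, _, tags = alternative.partition("/")
        forms.append((apply_operations(lemma, operations.strip()), tags.strip()))
    return forms


KARTA = ("<E>/Im+f+s + <B>у/Vi+f+s + <B>ы/Ro+f+s + <B>е/Da+f+s + <B>ой/Tv+f+s + "
         "<B>е/Pr+f+s + <B>ы/Im+f+p + <B>ы/Vi+f+p + <B>/Ro+f+p + <B>ам/Da+f+p + "
         "<B>ами/Tv+f+p + <B>ах/Pr+f+p ;")

for form, tags in inflect("карта", KARTA):
    print(form, tags)
```

Because a paradigm never mentions the stem, the same body applied to another lemma of the model (any feminine noun in -а that declines like карта) should yield that word's forms as well.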

Page 10: NOOJ Conference Inalco, Paris June 16th, 2012

Writing lexical resources for Russian

5. Paradigm for verbs

взять = <E>/Inf |
<B4>озьму/1+s+Pre | <B4>озьмешь/2+s+Pre | <B4>озьмет/3+s+Pre | <B4>озьмем/1+p+Pre | <B4>озьмете/2+p+Pre |
<B4>озьмёшь/2+s+Pre | <B4>озьмёт/3+s+Pre | <B4>озьмём/1+p+Pre | <B4>озьмёте/2+p+Pre | <B4>озьмут/3+p+Pre |
<B2>л/m+s+Pa | <B2>ла/f+s+Pa | <B2>ло/n+s+Pa | <B2>ли/p+Pa |
<B4>озьми/2+s+Imp | <B4>озьмите/2+p+Imp | <B2>в/Ger | <B2>вши/Ger |
<B2>вший/Prtp+Pa+Act+m+s+Im | <B2>вший/Prtp+Pa+Act+m+s+Vi | <B2>вшего/Prtp+Pa+Act+m+an+s+Vi | <B2>вшего/Prtp+Pa+Act+m+s+Ro | <B2>вшему/Prtp+Pa+Act+m+s+Da | <B2>вшим/Prtp+Pa+Act+m+s+Tv | <B2>вшем/Prtp+Pa+Act+m+s+Pr |
<B2>вшая/Prtp+Pa+Act+f+s+Im | <B2>вшую/Prtp+Pa+Act+f+s+Vi | <B2>вшей/Prtp+Pa+Act+f+s+Ro | <B2>вшей/Prtp+Pa+Act+f+s+Da | <B2>вшей/Prtp+Pa+Act+f+s+Tv | <B2>вшею/Prtp+Pa+Act+f+s+Tv | <B2>вшей/Prtp+Pa+Act+f+s+Pr |
<B2>вшее/Prtp+Pa+Act+n+s+Im | <B2>вшее/Prtp+Pa+Act+n+s+Vi | <B2>вшего/Prtp+Pa+Act+n+s+Vi | <B2>вшего/Prtp+Pa+Act+n+s+Ro | <B2>вшему/Prtp+Pa+Act+n+s+Da | <B2>вшим/Prtp+Pa+Act+n+s+Tv | <B2>вшем/Prtp+Pa+Act+n+s+Pr |
<B2>вшие/Prtp+Pa+Act+p+Im | <B2>вшие/Prtp+Pa+Act+p+Vi | <B2>вших/Prtp+Pa+Act+an+p+Vi | <B2>вших/Prtp+Pa+Act+p+Ro | <B2>вшим/Prtp+Pa+Act+p+Da | <B2>вшими/Prtp+Pa+Act+p+Tv | <B2>вших/Prtp+Pa+Act+p+Pr |
<B2>тый/Prtp+Pa+Pss+m+s+Im | <B2>тый/Prtp+Pa+Pss+m+s+Vi | <B2>того/Prtp+Pa+Pss+m+an+s+Vi | <B2>того/Prtp+Pa+Pss+m+s+Ro | <B2>тому/Prtp+Pa+Pss+m+s+Da | <B2>тым/Prtp+Pa+Pss+m+s+Tv | <B2>том/Prtp+Pa+Pss+m+s+Pr |
<B2>тая/Prtp+Pa+Pss+f+s+Im | <B2>тую/Prtp+Pa+Pss+f+s+Vi | <B2>той/Prtp+Pa+Pss+f+s+Ro | <B2>той/Prtp+Pa+Pss+f+s+Da | <B2>той/Prtp+Pa+Pss+f+s+Tv | <B2>тою/Prtp+Pa+Pss+f+s+Tv | <B2>той/Prtp+Pa+Pss+f+s+Pr |
<B2>тое/Prtp+Pa+Pss+n+s+Im | <B2>тое/Prtp+Pa+Pss+n+s+Vi | <B2>того/Prtp+Pa+Pss+n+s+Ro | <B2>тому/Prtp+Pa+Pss+n+s+Da | <B2>тым/Prtp+Pa+Pss+n+s+Tv | <B2>том/Prtp+Pa+Pss+n+s+Pr |
<B2>тые/Prtp+Pa+Pss+p+Im | <B2>тые/Prtp+Pa+Pss+p+Vi | <B2>тых/Prtp+Pa+Pss+an+p+Vi | <B2>тых/Prtp+Pa+Pss+p+Ro | <B2>тым/Prtp+Pa+Pss+p+Da | <B2>тыми/Prtp+Pa+Pss+p+Tv | <B2>тых/Prtp+Pa+Pss+p+Pr |
<B2>т/Prtp+Pa+Pss+m+s+fc | <B2>та/Prtp+Pa+Pss+f+s+fc | <B2>то/Prtp+Pa+Pss+n+s+fc | <B2>ты/Prtp+Pa+Pss+p+fc ;

Page 11: NOOJ Conference Inalco, Paris June 16th, 2012

Writing lexical resources for Russian

6. Problem of the letter ё / е (partially solved: two entries or two paradigms)

ёжик,N+m+an+FLX=бульдог
ёж,N+m+an+FLX=богач
ежик,N+m+an+FLX=бульдог
еж,N+m+an+FLX=богач

жевать = <E>/Inf | <B5>ую/1+s+Pre | <B5>уёшь/2+s+Pre | <B5>уёт/3+s+Pre | <B5>уём/1+p+Pre | <B5>уёте/2+p+Pre | <B5>уешь/2+s+Pre | <B5>ует/3+s+Pre | <B5>уем/1+p+Pre | <B5>уете/2+p+Pre | <B5>уют/3+p+Pre
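The "two entries" solution shown for ёжик / ежик can be generated mechanically: for every lemma spelled with ё, emit a second entry with е and the same paradigm. A minimal Python sketch of that step, assuming entries in the `lemma,TAGS+FLX=model` layout used above; the function name `yo_variants` is hypothetical.

```python
def yo_variants(entry):
    """Return the entry, plus its е-spelling when the lemma contains ё."""
    lemma, separator, info = entry.partition(",")
    plain = lemma.replace("ё", "е").replace("Ё", "Е")
    if plain == lemma:
        return [entry]
    # Only the lemma is respelled; the FLX model name is left untouched.
    return [entry, plain + separator + info]


for line in yo_variants("ёжик,N+m+an+FLX=бульдог"):
    print(line)
```

The "two paradigms" alternative, as in the жевать example, instead lists both the ё and the е endings inside one paradigm, which avoids duplicating lemmas when the ё appears only in inflected forms.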

Page 12: NOOJ Conference Inalco, Paris June 16th, 2012

Writing lexical resources for Russian

7. Allocating models to words

abažur,N+m+inan+FLX=zavod
abazinec,N+m+an+FLX=ukrainec
abazin,N+m+an+FLX=artist
abaz,N+m+inan+FLX=zavod
abak,N+m+inan+FLX=čajnik
abbat,N+m+an+FLX=artist
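Each dictionary line pairs a lemma with its category, its features and the paradigm model named after `FLX=`. A short Python sketch, given as an illustration only, shows how such lines can be split and grouped by model, for instance to review every word allocated to one paradigm; `parse_entry` and `group_by_model` are hypothetical helper names.

```python
from collections import defaultdict


def parse_entry(line):
    """Split 'lemma,CAT+feat+...+FLX=model' into its parts."""
    lemma, _, info = line.strip().partition(",")
    category, *features = info.split("+")
    entry = {"lemma": lemma, "category": category, "features": [], "FLX": None}
    for feature in features:
        if feature.startswith("FLX="):
            entry["FLX"] = feature[len("FLX="):]
        else:
            entry["features"].append(feature)
    return entry


def group_by_model(lines):
    """Map each paradigm model to the lemmas allocated to it."""
    groups = defaultdict(list)
    for line in lines:
        entry = parse_entry(line)
        groups[entry["FLX"]].append(entry["lemma"])
    return dict(groups)


ENTRIES = [
    "abažur,N+m+inan+FLX=zavod",
    "abazinec,N+m+an+FLX=ukrainec",
    "abazin,N+m+an+FLX=artist",
    "abaz,N+m+inan+FLX=zavod",
    "abak,N+m+inan+FLX=čajnik",
    "abbat,N+m+an+FLX=artist",
]
print(group_by_model(ENTRIES))
```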

8. Verifying paradigms
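Part of this verification could be automated by checking that every tag used in a paradigm is a declared property value. The Python sketch below is one possible check, not the procedure actually used for the module: the `DECLARED` set is assembled by hand from the V_ and A_ property values of step 1, `unknown_tags` is a hypothetical name, and the sample alternatives are illustrative (the tag `mo` is a deliberate typo for `m`).

```python
import re

# Values copied by hand from the V_ and A_ property declarations (step 1).
DECLARED = {
    "1", "2", "3", "Ipf", "Pf", "Pre", "Pa", "Fu",
    "Inf", "Ind", "Imp", "Cond", "Ger", "Prtp", "Act", "Pss",
    "m", "f", "n", "s", "p", "an", "inan", "fc", "fl",
    "Im", "Vi", "Ro", "Da", "Tv", "Pr",
}


def unknown_tags(paradigm):
    """List (ending, [undeclared tags]) for each faulty alternative."""
    body = paradigm.strip().rstrip(";").strip()
    problems = []
    for alternative in re.split(r"\s+[+|]\s+", body):
        ending, _, tags = alternative.partition("/")
        bad = [tag for tag in tags.strip().split("+") if tag not in DECLARED]
        if bad:
            problems.append((ending.strip(), bad))
    return problems


SAMPLE = ("<B2>тый/Prtp+Pa+Pss+m+s+Im | <B2>тым/Prtp+Pa+Pss+mo+s+Tv | "
          "<B2>т/Prtp+Pa+Pss+m+s+fc ;")
print(unknown_tags(SAMPLE))
```

A flat set cannot catch a value used in the wrong slot (for example the case value `Pr` written where the tense `Pre` was meant), so a stricter check would compare each tag against the property it is supposed to fill.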

Page 13: NOOJ Conference Inalco, Paris June 16th, 2012

Writing lexical resources for Russian

9. Testing with Russian texts:

« The Nose » by Gogol

« The Gambler » by Dostoevsky

« The Prisoner of the Caucasus » by Tolstoy

« The Lady with the Dog » by Chekhov

« Short stories » by Kharms

Page 14: NOOJ Conference Inalco, Paris June 16th, 2012

Writing lexical resources for Russian

10. Correcting errors:

- bad encoding (mixed Latin/Cyrillic letters): A B E K M H O P C y X (e.g. MOCKBA)

- errors in paradigms

- bad allocation of models to words (mobile vowel / palatalization)
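The encoding errors listed above are invisible to the eye, because Latin letters such as A, B, E, K, M, H, O, P, C, y, X look identical to Cyrillic ones. A minimal Python sketch of one way to detect and repair them, using the standard `unicodedata` module; the function names and the `LOOKALIKES` table are assumptions made for this illustration.

```python
import unicodedata


def scripts(word):
    """Return the set of scripts (Cyrillic, Latin) used by the letters of a word."""
    found = set()
    for char in word:
        name = unicodedata.name(char, "")
        if name.startswith("CYRILLIC"):
            found.add("Cyrillic")
        elif name.startswith("LATIN"):
            found.add("Latin")
    return found


def is_mixed(word):
    """True when a single word mixes Latin and Cyrillic letters."""
    return scripts(word) == {"Cyrillic", "Latin"}


# Latin A B E K M H O P C y X mapped to their Cyrillic look-alikes
# (А В Е К М Н О Р С у Х), written as escapes to avoid any ambiguity.
LOOKALIKES = str.maketrans(
    "ABEKMHOPCyX",
    "\u0410\u0412\u0415\u041a\u041c\u041d\u041e\u0420\u0421\u0443\u0425",
)


def to_cyrillic(word):
    """Replace look-alike Latin letters by the corresponding Cyrillic letters."""
    return word.translate(LOOKALIKES)


suspect = "MOCKB" + "\u0410"  # five Latin letters followed by Cyrillic А
print(is_mixed(suspect), to_cyrillic(suspect))
```

Blind conversion is only safe for words already known to be Russian; genuine Latin-script tokens in a text should be left alone.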

Page 15: NOOJ Conference Inalco, Paris June 16th, 2012

Improving lexical resources

Suppress word entries and / or forms?

- useless words, a source of unnecessary ambiguities: the names of letters (а, б, в, и, к, о, с, у, я), archaic unused words
- repetitions of the same word in different parts of speech (adjectives / nouns; adjectives / pronouns; interjections / particles / parenthetical words)

Increase the number of different models?

To avoid generating unexpected or incongruous forms or failing to recognize existing forms:
Читав? (Čitav?) Пиша? (Piša?) Счастие? (Sčastie?)

Page 16: NOOJ Conference Inalco, Paris June 16th, 2012

Available lexical resources for Russian

1 COMPILED BASIC DICTIONARY containing:

- 1 dictionary of 45,000 nouns (350 paradigms)
- 1 dictionary of 20,000 adjectives (50 paradigms)
- 1 dictionary of 25,000 verbs (600 paradigms)
- 1 dictionary of 880 prepositions & conjunctions, numerals, pronouns, 1,600 adverbs, parenthetical words, etc.

2 COMPILED ADDITIONAL DICTIONARIES (optional use):

- 1 dictionary of proper nouns (cities, countries, rivers…, first names with diminutives)
- 1 dictionary of substantives-adjectives

Page 17: NOOJ Conference Inalco, Paris June 16th, 2012

Writing Russian grammars for NooJ

Designing disambiguation grammars for:

- grammatical agreement between adjectives & nouns
- case usage with numerals
- case usage with prepositions
- case usage with verbs

Designing grammars to locate syntagms:

- date and time expressions
- adverbial phrases of time, place…
- idiomatic structures (my name is…, I’m … old)
- verbs of motion
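The agreement-based disambiguation mentioned above can be illustrated outside NooJ: among all analyses of an adjective and of the following noun, keep only the pairs that share case, number and (in the singular) gender. The Python sketch below is a simplified illustration of that principle, not the graph grammar used in the module; `features`, `agrees` and `disambiguate` are hypothetical names, and the example is новой книги.

```python
GENDERS = {"m", "f", "n"}
NUMBERS = {"s", "p"}
CASES = {"Im", "Vi", "Ro", "Da", "Tv", "Pr"}


def features(tags):
    """Extract gender, number and case from a tag string such as 'f+s+Ro'."""
    parts = set(tags.split("+"))
    return {
        "gender": next(iter(parts & GENDERS), None),
        "number": next(iter(parts & NUMBERS), None),
        "case": next(iter(parts & CASES), None),
    }


def agrees(adjective_tags, noun_tags):
    adjective, noun = features(adjective_tags), features(noun_tags)
    if adjective["case"] != noun["case"] or adjective["number"] != noun["number"]:
        return False
    # Plural adjective forms carry no gender tag, so gender is only
    # compared when the adjective analysis specifies one.
    return adjective["gender"] is None or adjective["gender"] == noun["gender"]


def disambiguate(adjective_analyses, noun_analyses):
    """Keep only the (adjective, noun) analysis pairs that agree."""
    return [(a, n) for a in adjective_analyses for n in noun_analyses if agrees(a, n)]


novoj = ["f+s+Ro", "f+s+Da", "f+s+Tv", "f+s+Pr"]   # новой
knigi = ["f+s+Ro", "f+p+Im", "f+p+Vi"]              # книги
print(disambiguate(novoj, knigi))
```

Here twelve possible combinations reduce to a single reading, genitive singular feminine.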

Page 18: NOOJ Conference Inalco, Paris June 16th, 2012

Writing Russian grammars for NooJ

Syntactic grammar for Russian

Page 19: NOOJ Conference Inalco, Paris June 16th, 2012

Writing Russian grammars for NooJ

Syntactic grammar for Russian

Page 20: NOOJ Conference Inalco, Paris June 16th, 2012


Grammar to locate the verbs of motion

Page 21: NOOJ Conference Inalco, Paris June 16th, 2012


Grammar to locate the verbs of motion

Page 22: NOOJ Conference Inalco, Paris June 16th, 2012


The prepositions in Russian

Page 23: NOOJ Conference Inalco, Paris June 16th, 2012


The disambiguation of « NA » (on, onto)
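The preposition на governs either the accusative (direction, "onto") or the prepositional case (location, "on"), so any other case reading of the following noun can be discarded. A minimal Python sketch of that filter, given as an illustration rather than as the graph shown on the slide; the table `GOVERNED_CASES` and the function name are assumptions, and в is included because it governs the same two cases.

```python
# Cases that each preposition can govern (tag names as in the module:
# Vi = accusative, Pr = prepositional).
GOVERNED_CASES = {
    "на": {"Vi", "Pr"},
    "в": {"Vi", "Pr"},
}


def filter_after_preposition(preposition, noun_analyses):
    """Keep only the noun analyses whose case the preposition can govern."""
    allowed = GOVERNED_CASES[preposition]
    return [tags for tags in noun_analyses if allowed & set(tags.split("+"))]


# стол is ambiguous between nominative and accusative singular;
# after на only the accusative reading survives.
stol = ["m+s+Im", "m+s+Vi"]
print(filter_after_preposition("на", stol))
```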

Page 24: NOOJ Conference Inalco, Paris June 16th, 2012

Annotating and disambiguating texts

The text with its ambiguities:

Page 25: NOOJ Conference Inalco, Paris June 16th, 2012

Verifying grammars

The text was disambiguated with the grammar of « NA »:

Page 26: NOOJ Conference Inalco, Paris June 16th, 2012


The disambiguation of « V » (in, into)

Page 27: NOOJ Conference Inalco, Paris June 16th, 2012

Russian grammars for NooJ

All these grammars need improvement:

They are very sensitive to syntactic order:
- they fail to recognize structures when the word order of the Russian sentence is unusual (expressive or non-standard).

There are no grammars (yet):
- to disambiguate adverbs / adjectives
- to disambiguate adjectives / nouns
- to disambiguate conjunctions / interjections

Page 28: NOOJ Conference Inalco, Paris June 16th, 2012

To get reliable resources for the Russian language:

The job left to be done is to design and implement:

• A data bank of verified and annotated texts

• Efficient syntactic grammars

• Semantic tagging

• Unified or harmonized tags for (Slavic, Romance, Germanic, etc.) languages to allow further multilingual processing

Page 29: NOOJ Conference Inalco, Paris June 16th, 2012

Russian Module for NooJ

http://www.nooj4nlp.net/pages/russian.html

Page 30: NOOJ Conference Inalco, Paris June 16th, 2012

NOOJ Conference, Inalco, June 16th, 2012

[email protected]

Russian Module for NooJ: design and implementation

Спасибо за внимание

Thank you for your attention

Merci de votre attention

ORDIDOM