UGTag: morphological analyzer and tagger for Ukrainian language


Natalia Kotsyba, Andriy Mykulyak, Igor V. Shevchenko


UGTag as a set of NLP tools

• developed within the Polish-Ukrainian parallel corpus to provide grammatical annotation for its Ukrainian part

• inspired by a functionally similar TaKIPI toolset for Polish

• unified output format for both language parts of the corpus

• suitable for search with such programmes as Poliqarp

Differences from TaKIPI:

• interactive annotation of texts with manual disambiguation

• modular design allows plugging in additional grammatical dictionaries as well as modification of the existing ones

• code for UGTag was written from scratch


Programme architecture

 


UGTag package

• enriches raw texts with grammatical information taken by default from UGD (Ukrainian Grammatical Dictionary)

• data in UGD are stored in a relational database
  • 180 thousand lemmas
  • 56 thousand endings
  • more than 2000 paradigmatic classes

• major part of the data was transformed into a set of XML files and adjusted for specific UGTag needs

• any compatible dictionary can be used instead or along
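Using several dictionaries side by side can be pictured as a lookup that merges the results from every active dictionary. The following is a minimal sketch, assuming each dictionary is a plain mapping from a word form to a list of (lemma, tag) pairs. The names `lookup`, `built_in` and `user_dict`, the sample entries and the tag strings are illustrative assumptions, not UGTag's actual data structures or tagset.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: word form -> list of (lemma, tag) interpretations.
Interpretation = Tuple[str, str]
Dictionary = Dict[str, List[Interpretation]]


def lookup(form: str, dictionaries: List[Dictionary]) -> List[Interpretation]:
    """Collect interpretations of `form` from all active dictionaries, in order."""
    found: List[Interpretation] = []
    for dictionary in dictionaries:
        for interpretation in dictionary.get(form.lower(), []):
            if interpretation not in found:  # skip duplicates across dictionaries
                found.append(interpretation)
    return found


# Toy entries with made-up tags, standing in for UGD and a user dictionary.
built_in: Dictionary = {"до": [("до", "prep"), ("до", "noun")]}
user_dict: Dictionary = {"блог": [("блог", "noun")]}

print(lookup("До", [built_in, user_dict]))
print(lookup("блог", [built_in, user_dict]))
```

Passing only `[user_dict]` would correspond to using another dictionary "instead", and passing both to using it "along" with the default one.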


Stages of analysis

• pre-processing stage: tokenization and chunking

• morphological tagging

• disambiguation


Process of analysis

 


| Stage | Role | Input | Output |
|---|---|---|---|
| reader | Separate the (different) external representations of the text from its internal representation (one or more character sequences). In other words, convert text to a standard format. | Text in different formats | One or more sequences of characters (usually one sequence per line of the input file) |
| tokenizer | Split sequences into tokens (smallest meaningful pieces of text) | Character sequence | List of tokens |
| tagger | Add morphological information to tokens (additionally it can split or group some tokens based on their meaning, e.g. abbreviations, complex words like „зелено-червоний”) | List of tokens | List of tokens with morphological information attached to each token |
| sentencer | Group tokens into sentences | List of annotated tokens | List of sentences |
| disambiguator | Choose appropriate grammatical interpretation of the token | List of annotated tokens (optionally augmented with list of sentences) | List of tokens with most probable annotation |
| writer | Convert list of tokens to the format most appropriate for the user | List of tokens, list of sentences | File with annotations in specified format |
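The chain of stages listed above can be sketched as a sequence of small functions, each consuming the output of the previous one. This is only an illustration of the data flow. Every function body, the `Token` record, the tab-separated output and the toy tags are simplifying assumptions, not the UGTag implementation.

```python
from typing import Any, Dict, List

Token = Dict[str, Any]  # hypothetical token record: {"text": ..., "tags": [...]}


def reader(raw: str) -> List[str]:
    """External text -> character sequences (one per non-empty line)."""
    return [line for line in raw.splitlines() if line.strip()]


def tokenizer(sequence: str) -> List[Token]:
    """Character sequence -> tokens (naive whitespace split for brevity)."""
    return [{"text": piece} for piece in sequence.split()]


def tagger(tokens: List[Token], dictionary: Dict[str, List[str]]) -> List[Token]:
    """Attach every interpretation found in the dictionary to each token."""
    return [{**t, "tags": dictionary.get(t["text"].lower(), [])} for t in tokens]


def sentencer(tokens: List[Token]) -> List[List[Token]]:
    """Group tokens into sentences (here: a sentence ends at a "." token)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for token in tokens:
        current.append(token)
        if token["text"] == ".":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def disambiguator(sentences: List[List[Token]]) -> List[List[Token]]:
    """Keep one interpretation per token (placeholder: the first one listed)."""
    return [[{**t, "tags": t["tags"][:1]} for t in s] for s in sentences]


def writer(sentences: List[List[Token]]) -> str:
    """One token per line as text<TAB>tag, blank line between sentences."""
    blocks = []
    for sentence in sentences:
        lines = []
        for token in sentence:
            tags = ",".join(token["tags"]) or "?"  # "?" marks unrecognized tokens
            lines.append(token["text"] + "\t" + tags)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


def annotate(raw: str, dictionary: Dict[str, List[str]]) -> str:
    """Run the stages in the order given in the table."""
    tokens: List[Token] = []
    for sequence in reader(raw):
        tokens.extend(tokenizer(sequence))
    return writer(disambiguator(sentencer(tagger(tokens, dictionary))))


toy_dictionary = {"до": ["prep", "noun"], "хати": ["noun"], ".": ["punct"]}
print(annotate("до хати .", toy_dictionary))
```

Keeping each stage behind a narrow input/output contract is what makes it possible to swap a single stage, for example a different reader or writer, without touching the others.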


Premorphological analysis

Procedures that do not involve the use of the grammatical dictionary

Reading phase and input formats

• plain, HTML or XML texts; XML files structured according to the XCES standard

• strips all tags from input HTML or XML files and turns them into raw texts

• user-defined file readers that take into account logical mark-up of input XML files and incorporate it into the output XML format

• file reader separates the external representation of texts from their unified internal representation fed to the tokenizer

• extracts the text itself, possibly portioning it in chunks for further processing
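The tag-stripping part of the reading phase could look roughly like the sketch below. It uses Python's standard `html.parser` module and treats each non-empty line of the remaining text as one chunk. The class and function names are hypothetical, and the sketch ignores logical mark-up and XCES structure entirely. It also does not filter out the contents of `<script>` or `<style>` elements.

```python
from html.parser import HTMLParser
from typing import List


class TagStripper(HTMLParser):
    """Collects text content and discards all mark-up."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def read_chunks(markup: str) -> List[str]:
    """Strip tags and return the non-empty lines of text as chunks."""
    stripper = TagStripper()
    stripper.feed(markup)
    stripper.close()  # flush any text still buffered by the parser
    text = "".join(stripper.parts)
    return [line.strip() for line in text.splitlines() if line.strip()]


sample = "<html><body><p>Перший рядок.</p>\n<p>Другий &amp; останній.</p></body></html>"
print(read_chunks(sample))
```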


Tokenizer

• first divides chunks into blocks delimited by whitespace characters

• block can consist of one or more tokens, e.g. a quote and a word with no white space in between (”token).

• next divides blocks into tokens that are minimal structural units

• five categories of tokens: words, numbers, punctuation marks, whitespace characters and unrecognized tokens

• word is a sequence of alphabetical characters with an optional hyphen
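The two-step tokenization described above (blocks first, then tokens) can be sketched with regular expressions. The pattern, the function names and the use of Unicode character categories to tell punctuation from unrecognized symbols are assumptions made for this illustration. UGTag's real rules may differ.

```python
import re
import unicodedata
from typing import List, Tuple

# Tried left to right; the first alternative that matches wins.
TOKEN_PATTERN = re.compile(
    r"(?P<word>[^\W\d_]+(?:-[^\W\d_]+)*)"  # letters with optional internal hyphens
    r"|(?P<number>\d+)"
    r"|(?P<whitespace>\s+)"
    r"|(?P<other>.)",                       # any single remaining character
    re.DOTALL,
)


def split_blocks(chunk: str) -> List[str]:
    """Step 1: cut the chunk into whitespace runs and non-whitespace blocks."""
    return re.findall(r"\s+|\S+", chunk)


def classify_other(char: str) -> str:
    """Unicode punctuation categories start with 'P'; anything else is unrecognized."""
    return "punctuation" if unicodedata.category(char).startswith("P") else "unrecognized"


def tokenize(chunk: str) -> List[Tuple[str, str]]:
    """Step 2: split every block into (category, text) tokens."""
    tokens: List[Tuple[str, str]] = []
    for block in split_blocks(chunk):
        for match in TOKEN_PATTERN.finditer(block):
            category = match.lastgroup
            if category == "other":
                category = classify_other(match.group())
            tokens.append((category, match.group()))
    return tokens


for token in tokenize("”зелено-червоний 42 ©"):
    print(token)
```

In this example the block `”зелено-червоний` yields two tokens, a punctuation mark and a hyphenated word, which mirrors the `”token` case mentioned above.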


Grammatical dictionary

• structure of grammatical information in UGD was rearranged, and further division into finer categories was carried out and implemented to meet the requirements of the intended tagsets:
  • compatible with MULTEXT-East, version 4
  • common tagset for Polish and Ukrainian [Kotsyba, Turska, Shypnivska 2008], slightly modified and simplified to achieve this compatibility

• the category of degree of comparison for adjectives and adverbs was reintroduced, and adjectives and adverbs were regrouped and relemmatized accordingly

• category of predicatives was regrouped based on the conclusions in [Derzhanski, Kotsyba 2008]

• word splitting: original UGD collocations containing white space characters or hyphens are split and their parts are treated as individual units

• information about those combinations is preserved and can be used for syntactic analysis in the future


Morphological analysis

• users can watch the progress of tagging as it goes

• tagged tokens of different categories are displayed on the screen colour-coded:
  • unrecognized tokens (red)
  • words with only one available grammatical interpretation (green)
  • words with multiple grammatical interpretations (blue)

• panel in the top right corner displays grammatical characteristics of the selected item

• manual disambiguation is possible for words with multiple available interpretations
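The colour coding follows directly from the number of interpretations the dictionary returns for a token. A small sketch, assuming interpretations are available as a list of tag strings (the function name and toy tags are hypothetical):

```python
from typing import List


def colour_for(interpretations: List[str]) -> str:
    """Map the number of available interpretations to a display colour."""
    if not interpretations:
        return "red"    # unrecognized token
    if len(interpretations) == 1:
        return "green"  # exactly one grammatical interpretation
    return "blue"       # several interpretations, open to manual disambiguation


for form, tags in [("xyz", []), ("хати", ["noun"]), ("до", ["prep", "noun"])]:
    print(form, colour_for(tags))
```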


Automatic disambiguation

• rudimentary automatic disambiguation based on statistical analysis for a small but frequently used word class of prepositions
  • “до”: 15 grammatical interpretations, one for the preposition and 14 for all possible grammatical characteristics of the invariable noun “до” (musical note)
  • “на”: colloquial use as interjection

• further disambiguation policy foresees combination of rules and statistical analysis of manually disambiguated data
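One simple way to use statistics from manually disambiguated data is to pick, for each ambiguous form, the interpretation that annotators chose most often. The sketch below shows only this frequency baseline. It is an assumption about how such a step might work, not a description of UGTag's actual procedure, and the tag names and counts are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def train(disambiguated: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each tag was chosen manually for each word form."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for form, tag in disambiguated:
        counts[form.lower()][tag] += 1
    return counts


def disambiguate(form: str, candidates: List[str], counts: Dict[str, Counter]) -> str:
    """Pick the candidate chosen most often for this form; ties go to the first candidate."""
    if not candidates:
        raise ValueError("no candidate interpretations to choose from")
    seen = counts.get(form.lower(), Counter())
    return max(candidates, key=lambda tag: seen[tag])


# Invented sample: "до" was marked as a preposition three times, as a noun once.
manual = [("до", "prep")] * 3 + [("до", "noun")]
counts = train(manual)

print(disambiguate("до", ["noun", "prep"], counts))
print(disambiguate("на", ["prep", "interj"], counts))
```

A form never seen in the training data falls back to its first candidate, which is why the order of interpretations coming from the dictionary matters in this baseline.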


Enriching the dictionary database

• during annotation UGTag automatically creates a list of words not found in the dictionary and displays it to the user, who can add them to one of the user dictionaries

• list of words not recognized by the active built-in dictionary is displayed

• user can select a word from this list and add it to the dictionary

• programme gives hints as to the paradigm of the word

• definition of the wordforms can be done manually


Adding a new word

 


Sentencing

• sentence splitting is rule-based and some of those rules require grammatical information

• the rules implemented so far are partially based on Rudolf’s work for Polish [Rudolf 2004]

• heuristics use popular abbreviations and words starting with a capital letter, whose meaning is also taken into account
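Two heuristics of this kind, an abbreviation list and a check for a following capital letter, can be sketched as follows. The rules, the tiny abbreviation list and the function name are assumptions for illustration. They are not the rules from [Rudolf 2004] or from UGTag, and they ignore the grammatical information that the real sentencer can consult.

```python
from typing import List, Set

# Deliberately tiny, illustrative abbreviation list (assumption, not UGTag data).
ABBREVIATIONS: Set[str] = {"м", "вул", "р", "ім", "див"}
SENTENCE_END: Set[str] = {".", "!", "?", "…"}


def split_sentences(tokens: List[str]) -> List[List[str]]:
    """Group tokens into sentences using two simple rules."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for index, token in enumerate(tokens):
        current.append(token)
        if token not in SENTENCE_END:
            continue
        previous = tokens[index - 1] if index > 0 else ""
        following = tokens[index + 1] if index + 1 < len(tokens) else ""
        # Rule 1: a full stop right after a known abbreviation does not end a sentence.
        if token == "." and previous.lower() in ABBREVIATIONS:
            continue
        # Rule 2: the next token, if there is one, must start with a capital letter.
        if following and not following[0].isupper():
            continue
        sentences.append(current)
        current = []
    if current:
        sentences.append(current)
    return sentences


tokens = ["Він", "живе", "в", "м", ".", "Київ", ".", "Там", "гарно", "."]
for sentence in split_sentences(tokens):
    print(" ".join(sentence))
```

Rule 1 misses sentences that genuinely end with an abbreviation, which is one reason a purely orthographic approach is not enough and grammatical information about the capitalized word is useful.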


Writing phase and writing format

• two output tag formats for resulting XML files

• default format is based on TaKIPI 1.8 for Polish, extended for Ukrainian-specific features [Kotsyba, Turska, Shypnivska 2008]

• retains maximum grammatical information that can be provided by Polish and Ukrainian grammatical dictionaries

• MULTEXT-East compatible tagset, which has a coarser granularity of grammatical information [Derzhanski, Kotsyba 2009]


Plans for further development

• depend on results of extensive experimenting with real corpus texts

• enriching the dictionary database using both manual and automatic ways

• enhancing the quality of automatic disambiguation

• preliminary syntactic parsing and word grouping: complex words like numerals, e.g. “двадцять три” (twenty-three), currently recognized as separate words (“двадцять” and “три”), complex passive structures, prepositional phrases, etc.