UGTag: morphological analyzer and tagger for the Ukrainian language


Natalia Kotsyba, Andriy Mykulyak, Igor V. Shevchenko

UGTag as a set of NLP tools

• developed within the Polish-Ukrainian parallel corpus to provide grammatical annotation for its Ukrainian part

• inspired by TaKIPI, a functionally similar toolset for Polish

• unified output format for both language parts of the corpus

• suitable for search with such programmes as Poliqarp

Differences from TaKIPI:

• interactive annotation of texts with manual disambiguation

• modular design allows plugging in additional grammatical dictionaries as well as modifying the existing ones

• code for UGTag was written from scratch

Programme architecture

UGTag package

• enriches raw texts with grammatical information taken by default from UGD (Ukrainian Grammatical Dictionary)

• data in UGD are stored in a relational database

• UGD contains 180 thousand lemmas, 56 thousand endings and more than 2000 paradigmatic classes

• major part of the data was transformed into a set of XML files and adjusted for specific UGTag needs

• any compatible dictionary can be used instead of UGD or alongside it

Stages of analysis

• pre-processing stage: tokenization and chunking

• morphological tagging

• disambiguation

Process of analysis

Stage: reader
Role: Separates the different external representations of the text from its internal representation (one or more character sequences). In other words, it converts text to a standard format.
Input: Text in different formats
Output: One or more sequences of characters (usually one sequence per line of the input file)

Stage: tokenizer
Role: Splits sequences into tokens (smallest meaningful pieces of text)
Input: Character sequence
Output: List of tokens

Stage: tagger
Role: Adds morphological information to tokens (additionally it can split or group some tokens based on their meaning, e.g. abbreviations, complex words like „зелено-червоний”)
Input: List of tokens
Output: List of tokens with morphological information attached to each token

Stage: sentencer
Role: Groups tokens into sentences
Input: List of annotated tokens
Output: List of sentences

Stage: disambiguator
Role: Chooses the appropriate grammatical interpretation of the token
Input: List of annotated tokens (optionally augmented with list of sentences)
Output: List of tokens with most probable annotation

Stage: writer
Role: Converts the list of tokens to the format most appropriate for the user
Input: List of tokens, list of sentences
Output: File with annotations in specified format
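The stage sequence above can be pictured as a chain of functions, each consuming the previous stage's output. The following is only a minimal sketch under stated assumptions: the `Token` class, the toy dictionary, the tag labels and every function body are hypothetical stand-ins, not UGTag's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    tags: list = field(default_factory=list)  # candidate interpretations


# Hypothetical toy dictionary: word form -> possible tags (invented labels).
TOY_DICT = {"на": ["prep", "interj"], "столі": ["noun"]}


def reader(raw):
    """External format -> character sequences (one per non-empty line)."""
    return [line for line in raw.splitlines() if line.strip()]


def tokenizer(sequence):
    """Character sequence -> list of tokens (naive whitespace split)."""
    return [Token(piece) for piece in sequence.split()]


def tagger(tokens):
    """Attach every interpretation found in the toy dictionary."""
    for token in tokens:
        token.tags = list(TOY_DICT.get(token.text.lower(), []))
    return tokens


def sentencer(tokens):
    """Group tokens into sentences (here: one sentence per sequence)."""
    return [tokens]


def disambiguator(tokens):
    """Keep one interpretation per token (here simply the first one)."""
    for token in tokens:
        token.tags = token.tags[:1]
    return tokens


def writer(sentences):
    """Render sentences in a simple word/tag text format."""
    return "\n".join(
        " ".join(f"{t.text}/{t.tags[0] if t.tags else '?'}" for t in sentence)
        for sentence in sentences
    )


def analyse(raw):
    """Run all stages in the order listed above."""
    sentences = []
    for sequence in reader(raw):
        tokens = tagger(tokenizer(sequence))
        for sentence in sentencer(tokens):
            sentences.append(disambiguator(sentence))
    return writer(sentences)
```

Because each stage only depends on the data shape it receives, any single function could be swapped for a better implementation without touching the others, which is the point of the modular design.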

Premorphological analysis

Procedures that do not involve the use of the grammatical dictionary

Reading phase and input formats

• plain, HTML or XML texts; XML files structured according to the XCES standard

• strips all tags from input HTML or XML files and turns them into raw texts

• user-defined file readers that take into account logical mark-up of input XML files and incorporate it into the output XML format

• file reader separates the external representation of texts from their unified internal representation fed to the tokenizer

• extracts the text itself, possibly portioning it in chunks for further processing
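The tag-stripping behaviour of the default reader can be illustrated with Python's standard `html.parser` module. This is a sketch only: the class and function names are hypothetical, and the choice of one chunk per non-empty line is an assumption rather than UGTag's documented behaviour.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data and ignores all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def read_markup(source):
    """Strip tags and return non-empty text chunks, one per line."""
    parser = TextExtractor()
    parser.feed(source)
    parser.close()
    text = "".join(parser.parts)
    return [line.strip() for line in text.splitlines() if line.strip()]
```

A user-defined reader, as mentioned above, would instead keep selected elements (for example paragraph boundaries) and pass them on to the writer.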

Tokenizer

• first divides chunks into blocks delimited by whitespace characters

• block can consist of one or more tokens, e.g. a quote and a word with no white space in between (”token).

• next divides blocks into tokens that are minimal structural units

• five categories of tokens: words, numbers, punctuation marks, whitespace characters and unrecognized tokens

• word is a sequence of alphabetical characters with an optional hyphen
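The two-step splitting described above (whitespace-delimited blocks first, then tokens inside each block) might look roughly as follows. The regular expressions and the use of Unicode categories to detect punctuation are assumptions made for this sketch; the slides do not state how UGTag recognizes each category.

```python
import re
import unicodedata

# A word: letters, optionally followed by one hyphen and more letters.
WORD = r"[^\W\d_]+(?:-[^\W\d_]+)?"
NUMBER = r"\d+"
TOKEN_RE = re.compile(
    rf"(?P<word>{WORD})|(?P<number>{NUMBER})|(?P<other>.)", re.DOTALL
)


def split_block(block):
    """Divide one whitespace-free block into (category, text) tokens."""
    tokens = []
    for match in TOKEN_RE.finditer(block):
        kind, text = match.lastgroup, match.group()
        if kind == "other":
            is_punct = unicodedata.category(text).startswith("P")
            kind = "punctuation" if is_punct else "unrecognized"
        tokens.append((kind, text))
    return tokens


def tokenize(chunk):
    """Split a chunk into blocks at whitespace, then blocks into tokens."""
    chunk = unicodedata.normalize("NFC", chunk)  # compose accented letters
    tokens = []
    for piece in re.split(r"(\s+)", chunk):
        if not piece:
            continue
        if piece.isspace():
            tokens.append(("whitespace", piece))
        else:
            tokens.extend(split_block(piece))
    return tokens
```

With this sketch, a block such as `”token` yields a punctuation token followed by a word token, matching the example in the list above.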

Grammatical dictionary

• structure of grammatical information in UGD was rearranged and further division into finer categories was carried out and implemented to meet the requirements of the intended tagsets:

• compatible with MULTEXT-East, V.4

• common tagset for Polish and Ukrainian [Kotsyba, Turska, Shypnivska 2008] slightly modified and simplified to achieve this compatibility

• the category of degree of comparison for adjectives and adverbs was reintroduced, and adjectives and adverbs were regrouped and relemmatized accordingly

• category of predicatives was regrouped based on the conclusions in [Derzhanski, Kotsyba 2008]

• word splitting: original UGD collocations with white space characters or hyphens treated as individual units

• information about those combinations is preserved and can be used for syntactic analysis in the future

Morphological analysis

• users can watch the progress of tagging as it goes

• tagged tokens of different categories are displayed on the screen colour-coded:

• unrecognized tokens (red)

• words with only one available grammatical interpretation (green)

• words with multiple grammatical interpretations (blue)

• panel in the top right corner displays grammatical characteristics of the selected item

• manual disambiguation is possible for words with multiple available interpretations
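The colour coding above amounts to counting how many interpretations the dictionary returns for a word. A minimal sketch, assuming a hypothetical dictionary that maps word forms to lists of interpretations (the entries and labels are invented for illustration):

```python
# Hypothetical toy dictionary: word form -> list of interpretations.
TOY_DICT = {
    "до": ["preposition", "noun (musical note)"],
    "стіл": ["noun"],
}


def colour_for(word, dictionary=TOY_DICT):
    """Return the display colour for a word, following the scheme above."""
    interpretations = dictionary.get(word.lower(), [])
    if not interpretations:
        return "red"    # unrecognized
    if len(interpretations) == 1:
        return "green"  # exactly one interpretation
    return "blue"       # ambiguous, a candidate for manual disambiguation
```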

Automatic disambiguation

• rudimentary automatic disambiguation based on statistical analysis for a small but frequently used word class of prepositions

• “до”: 15 grammatical interpretations, one for preposition and 14 for all possible grammatical characteristics of the invariable noun “до” (musical note)

• “на”: colloquial use as interjection

• further disambiguation policy foresees combination of rules and statistical analysis of manually disambiguated data
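The slides do not specify the statistical model, so the sketch below assumes the simplest possible one: pick the interpretation that was chosen most often for the same word form in manually disambiguated data. The function names, tag labels and sample counts are all hypothetical.

```python
from collections import Counter


def train(samples):
    """Count (word form, chosen tag) pairs from manually disambiguated text."""
    counts = {}
    for form, tag in samples:
        counts.setdefault(form.lower(), Counter())[tag] += 1
    return counts


def disambiguate(form, candidates, counts):
    """Return the most frequently chosen candidate, or None if there is none.

    On ties (including unseen forms) the first candidate wins.
    """
    if not candidates:
        return None
    seen = counts.get(form.lower(), Counter())
    return max(candidates, key=lambda tag: seen[tag])
```

For a form like “до”, where the prepositional reading dominates in running text, even such a unigram count would select the preposition over the fourteen noun readings; combining it with rules, as planned above, would be needed for less lopsided cases.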

Enriching the dictionary database

• during annotation UGTag automatically creates a list of words not found in the dictionary and displays it to the user, who can add them to one of the user dictionaries

• list of words not recognized by the active built-in dictionary is displayed

• user can select a word from this list and add it to the dictionary

• programme gives hints as to the paradigm of the word

• definition of the wordforms can be done manually
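The workflow above can be sketched in two small functions: one collects unknown words during annotation, the other proposes paradigms. How UGTag actually computes its hints is not described in the slides; matching the word against known endings, as done here, is purely an assumption, and the paradigm names and endings are invented toy data.

```python
def unknown_words(tokens, dictionary):
    """Return words missing from the dictionary, in first-seen order."""
    seen, result = set(), []
    for token in tokens:
        key = token.lower()
        if key not in dictionary and key not in seen:
            seen.add(key)
            result.append(key)
    return result


def paradigm_hints(word, paradigm_endings):
    """Suggest paradigms whose endings match the word, longest match first."""
    hints = []
    for name, endings in paradigm_endings.items():
        matching = [e for e in endings if e and word.endswith(e)]
        if matching:
            hints.append((max(len(e) for e in matching), name))
    return [name for _, name in sorted(hints, key=lambda h: (-h[0], h[1]))]
```

The user would then pick one of the suggested paradigms or, as noted above, define the word forms manually.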

Adding a new word

Sentencing

• sentence splitting is rule-based and some of those rules require grammatical information

• the rules implemented so far are partially based on Rudolf’s work for Polish [Rudolf 2004]

• heuristics use popular abbreviations and words starting with a capital letter, whose meaning is also taken into account
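A much simplified version of the abbreviation and capital-letter heuristics could look like this. The abbreviation list is a hypothetical sample, the input is assumed to be whitespace-delimited strings, and the sketch ignores the grammatical information that the real rules consult.

```python
# Hypothetical sample of abbreviations that do not end a sentence.
ABBREVIATIONS = {"проф.", "див.", "напр."}


def split_sentences(tokens):
    """Group tokens into sentences using two simple heuristics.

    A token ending in '.', '!' or '?' closes a sentence unless it is a
    known abbreviation or the next token does not start with a capital.
    """
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith((".", "!", "?")):
            continue
        following = tokens[i + 1] if i + 1 < len(tokens) else None
        is_abbreviation = token.lower() in ABBREVIATIONS
        next_is_capital = following is None or following[:1].isupper()
        if not is_abbreviation and next_is_capital:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```

An abbreviation followed by a capitalized proper name (as in “проф. Шевченко”) shows why the capital letter alone is not enough and why the meaning of the capitalized word matters.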

Writing phase and writing format

• two output tag formats for resulting XML files

• default format is based on TaKIPI 1.8 for Polish, extended for Ukrainian-specific features [Kotsyba, Turska, Shypnivska 2008]

• retains maximum grammatical information that can be provided by Polish and Ukrainian grammatical dictionaries

• MULTEXT-East compatible tagset with a coarser granularity of grammatical information [Derzhanski, Kotsyba 2009]
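To give an idea of what the writer stage produces, here is a small sketch that serializes annotated tokens to XML with Python's standard `xml.etree.ElementTree`. The element names (`chunk`, `tok`, `orth`, `lex`, `base`, `ctag`) are an assumption loosely modelled on TaKIPI-style output and may not match UGTag's actual schema; the tag value is invented.

```python
import xml.etree.ElementTree as ET


def write_tokens(tokens):
    """Serialize (orth, [(base, ctag), ...]) pairs as one XML chunk."""
    chunk = ET.Element("chunk")
    for orth, interpretations in tokens:
        tok = ET.SubElement(chunk, "tok")
        ET.SubElement(tok, "orth").text = orth  # surface form
        for base, ctag in interpretations:
            lex = ET.SubElement(tok, "lex")
            ET.SubElement(lex, "base").text = base  # lemma
            ET.SubElement(lex, "ctag").text = ctag  # grammatical tag
    return ET.tostring(chunk, encoding="unicode")
```

Switching between the two tag formats would then only change the `ctag` strings, not the surrounding structure.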

Plans for further development

• depend on results of extensive experimenting with real corpus texts

• enriching the dictionary database using both manual and automatic ways

• enhancing the quality of automatic disambiguation

• preliminary syntactic parsing and word grouping for complex units: numerals such as “двадцять три” (twenty-three), currently recognized as separate words (“двадцять” and “три”), complex passive structures, prepositional phrases, etc.
