Prague Arabic Dependency Treebank

Center for Computational Linguistics
Institute of Formal and Applied Linguistics
Charles University in Prague

MorphoTrees of Arabic and Their Annotation in the TrEd Environment

Otakar Smrž
Petr Pajas

September 22, 2004 MorphoTrees of Arabic and Their Annotation in the TrEd Environment 2

MorphoTrees … TrEd … ???

MorphoTrees mean turning unorganized sets of complex morphological analyses into hierarchies
    Intuitive, decision-efficient, multi-purpose, interesting
    General: not limited to the language, the system of morphology, the levels, or the implementation

TrEd is a fully programmable graphical editor for tree-like graphs and an excellent suite of tools for batch data processing (local/network)
    Analytical and tectogrammatical dependency annotation
    Viewing and converting Arabic phrase-structure trees
    Evaluating and merging parser/tagger/human results

MorphoTrees in TrEd

Files with two types of trees

Criteria & restrictions

Automatic decisions

Hiding modes

Viewing options

Short-cut keys & mouse

Consistency checks

Processing & update macros

Arabic … the Questions

Is there a syntactic difference between sawfa ′arā ′abā ′Aḥmada and sa′as′alu wālidahu? Is there a morphological difference?
    The only difference is in the use of lexical units and morphs. The grammatical categories are unchanged, and morphology and syntax should clearly show this.

How do we find syntactic units? How do we get back word-forms from the lexical units and tags?

How much does an improper morphological reading disturb the subsequent syntactic representation? Improper in tags, lemmas, diacritics, or in tokenization?

Reminder of the Terms

Grapheme / Phoneme
    The least units capable of distinguishing meanings
    ~ 40 letters, context-dependent forms
    28 consonants, 6 vowels

Morph
    Composition of graphemes / phonemes
    Abstract derivational forms

Morpheme
    The least unit representing some linguistic meaning
    Function of morphs
    Projection of grammatical categories

Token
    The least syntactic unit
    Bearer of a uniform vector of grammatical categories

Tim Buckwalter’s Morphology

PADT MorphoTrees are generated based on the information provided by Buckwalter Arabic Morphological Analyzer

+ Updateable stem-based lexicon, finite-state model, implementation in Perl and published under GNU GPL

– Morphs, mapping only to Quasi-Functional Morphology
    The tokenization, clustering, modeling of conditionality, …

(wabijAnibihA) [jAnib_1] wa/CONJ + bi/PREP + jAnib/NOUN + i/CASE_DEF_GEN + hA/POSS_PRON_3FS

C---------  wa       CONJ               and
P---------  bi       PREP               at
N-------2R  jAnib+i  NOUN+CASE_DEF_GEN  side of
S----3FS2-  hA       POSS_PRON_3FS      her
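
The step from the analyzer's morph sequence to the four tokens listed above can be sketched in a few lines of Python. This is only an illustration, not the PADT implementation: the function names are hypothetical, and the grouping rule (a morph whose tag starts with CASE_ attaches to the preceding token) is an assumption chosen because it reproduces this one example.

```python
from typing import List, Tuple

Morph = Tuple[str, str]  # (form, tag)

# Analysis string for the entity wabijAnibihA, as shown on the slide.
ANALYSIS = "wa/CONJ + bi/PREP + jAnib/NOUN + i/CASE_DEF_GEN + hA/POSS_PRON_3FS"


def parse_morphs(analysis: str) -> List[Morph]:
    """Split 'wa/CONJ + bi/PREP' into [('wa', 'CONJ'), ('bi', 'PREP')]."""
    morphs = []
    for part in analysis.split(" + "):
        form, tag = part.strip().split("/", 1)
        morphs.append((form, tag))
    return morphs


def group_tokens(morphs: List[Morph],
                 attach_prefixes: Tuple[str, ...] = ("CASE_",)) -> List[Morph]:
    """Merge morphs into tokens.

    ASSUMPTION (illustrative only): a morph whose tag starts with one of
    attach_prefixes is glued onto the previous token with '+'.
    """
    tokens: List[Morph] = []
    for form, tag in morphs:
        if tokens and tag.startswith(attach_prefixes):
            prev_form, prev_tag = tokens[-1]
            tokens[-1] = (prev_form + "+" + form, prev_tag + "+" + tag)
        else:
            tokens.append((form, tag))
    return tokens


if __name__ == "__main__":
    for form, tag in group_tokens(parse_morphs(ANALYSIS)):
        print(form, tag)
```

Run on the slide's example, the sketch yields the tokens wa, bi, jAnib+i and hA with the Buckwalter tags shown in the table. The positional tags (C---------, N-------2R, and so on) are not derived here.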

Xerox Morphological Analyzer

MorphoTrees Hierarchy

MorphoTrees of Arabic propose these levels
    Entity – the analyzed elements of the discourse
    Partitioning into the standard forms of the tokens
    Non-vocalized standard orthographical forms
    Lemmas/identifiers of lexical units
    Tokens – syntactic units including the form and the tag

Independence of the language / implementation
    More/different levels, inclusion of spelling variations, …
    Annotation of various tagsets, other features of tokens

Efficiency of decision-making
    Distance between analyses becomes recognized
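
A rough Python sketch of how flat analyses might be folded into the five levels listed above. Everything here is hypothetical and simplified: the nested-dictionary layout, the function names, and the input data are assumptions. The first reading follows the wabijAnibihA example; the second reading (accusative case) is invented purely to show two leaves sharing a lemma, and the lemma identifiers other than jAnib_1 are placeholders.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str, str]  # (lemma, vocalized form, tag)

# Buckwalter transliteration symbols for short vowels, nunation,
# shadda and sukun; removing them gives a non-vocalized form.
DIACRITICS = set("aiuFKN~o")


def unvocalize(form: str) -> str:
    """Drop diacritic symbols and the '+' morph separator."""
    return "".join(ch for ch in form if ch not in DIACRITICS and ch != "+")


def build_morphotree(entity: str, readings: List[List[Token]]) -> Dict:
    """Nest readings as entity > partitioning > non-vocalized form > lemma > tokens.

    Simplification: identical non-vocalized forms inside one partitioning
    are merged regardless of their position.
    """
    partitions: Dict = {}
    for reading in readings:
        partition = " ".join(unvocalize(form) for _, form, _ in reading)
        forms = partitions.setdefault(partition, {})
        for lemma, form, tag in reading:
            leaves = forms.setdefault(unvocalize(form), {}).setdefault(lemma, [])
            if (form, tag) not in leaves:
                leaves.append((form, tag))
    return {entity: partitions}


READINGS = [
    # Reading taken from the slide; 'wa', 'bi', 'hA' lemma ids are placeholders.
    [("wa", "wa", "CONJ"), ("bi", "bi", "PREP"),
     ("jAnib_1", "jAnib+i", "NOUN+CASE_DEF_GEN"), ("hA", "hA", "POSS_PRON_3FS")],
    # HYPOTHETICAL alternative reading, invented for illustration only.
    [("wa", "wa", "CONJ"), ("bi", "bi", "PREP"),
     ("jAnib_1", "jAnib+a", "NOUN+CASE_DEF_ACC"), ("hA", "hA", "POSS_PRON_3FS")],
]

TREE = build_morphotree("wabijAnibihA", READINGS)
```

In this toy tree both readings fall under one partitioning and one lemma, so they differ only at the token level. That is the sense in which the hierarchy makes the distance between analyses visible.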

MorphoTrees Annotation

Selecting the leaves that correspond to the proper reading of the tokens constituting the entity
    Quick use of keyboard and/or mouse for annotations

Restricting the tree according to the criteria/categories required by the context
    Natural control over the inheritance of restrictions

Employing automatic restrictions and annotation actions, both generic and linguistic

Learning about the discriminative categories and “human tagging”
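
The restrict-then-select workflow described above might look like the following Python sketch. It is not TrEd macro code (TrEd macros are written in Perl); the leaf representation, the category names and the function names are assumptions, and the candidate leaves are hypothetical case variants of one token. Inheritance of restrictions through the tree is not modeled.

```python
from typing import Dict, List, Optional

Leaf = Dict[str, str]  # one candidate token reading with its categories

# HYPOTHETICAL candidate leaves for a single token, for illustration only.
LEAVES: List[Leaf] = [
    {"form": "jAnib+u", "pos": "NOUN", "case": "NOM"},
    {"form": "jAnib+a", "pos": "NOUN", "case": "ACC"},
    {"form": "jAnib+i", "pos": "NOUN", "case": "GEN"},
]


def restrict(leaves: List[Leaf], **criteria: str) -> List[Leaf]:
    """Keep only the leaves whose categories match every given criterion."""
    return [
        leaf for leaf in leaves
        if all(leaf.get(key) == value for key, value in criteria.items())
    ]


def auto_select(leaves: List[Leaf]) -> Optional[Leaf]:
    """Generic automatic action: if exactly one leaf is left, select it."""
    return leaves[0] if len(leaves) == 1 else None


if __name__ == "__main__":
    # A preceding preposition requires the genitive, so restrict on case.
    remaining = restrict(LEAVES, case="GEN")
    print(auto_select(remaining))
```

With no restriction, three leaves remain and nothing is selected automatically. Once the context fixes the case, a single leaf survives and the annotator has nothing left to decide.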

Discussion and Conclusion

MorphoTrees
    Important in morphological annotation and in evaluation
    PADT 1.0 provides 148 000 annotated tokens

Functional Morphology
    … more in Prague Arabic Dependency Treebank: Development in Data and Tools
    Even its approximation is promising and welcome

Feature-Based Tagger trained on Penn ATB 2
    3.6% error rate in major part-of-speech (15 values)
    10.8% in the full tagset (317 evidenced combinations)
    0.8–0.6% error rate in tokenization of the input