Prague Arabic Dependency Treebank
Center for Computational Linguistics
Institute of Formal and Applied Linguistics
Charles University in Prague
MorphoTrees of Arabic and Their Annotation in the TrEd Environment
Otakar Smrž, Petr Pajas
September 22, 2004 MorphoTrees of Arabic and Their Annotation in the TrEd Environment 2
MorphoTrees … TrEd … ???
MorphoTrees turn unorganized sets of complex morphological analyses into hierarchies
  Intuitive, decision-efficient, multi-purpose, interesting
  In general, not limited to the language, nor the system of morphology, nor the levels, nor the implementation
TrEd is a fully programmable graphical editor for tree-like graphs and an excellent suite of tools for batch data processing (local/network)
  Analytical and tectogrammatical dependency annotation
  Viewing and converting of Arabic phrase-structure trees
  Evaluating and merging of parser/tagger/human results
MorphoTrees in TrEd
Files with two types of trees
Criteria & restrictions
Automatic decisions
Hiding modes
Viewing options
Short-cut keys & mouse
Consistency checks
Processing & update macros
Arabic … the Questions
Is there a syntactic difference between sawfa ′arā ′abā ′Aḥmada and sa′as′alu wālidahu? Is there a morphological difference?
  The only difference is in the use of lexical units and morphs. The grammatical categories are unchanged, and morphology and syntax should clearly show this.
How do we find syntactic units? How do we get back word-forms from the lexical units and tags?
How much does an improper morphological reading disturb the consequent syntactic representation?
  Improper in tags, lemmas, diacritics, or in tokenization?
Reminder of the Terms
Grapheme / Phoneme
  The smallest units capable of distinguishing meanings
  ~ 40 letters, context-dependent forms
  28 consonants, 6 vowels
Morph
  Composition of graphemes / phonemes
  Abstract derivational forms
Morpheme
  The smallest unit representing some linguistic meaning
  Function of morphs
  Projection of grammatical categories
Token
  The smallest syntactic unit
  Bearer of a uniform vector of grammatical categories
Tim Buckwalter’s Morphology
PADT MorphoTrees are generated from the information provided by the Buckwalter Arabic Morphological Analyzer
  + Updateable stem-based lexicon, finite-state model, implementation in Perl, published under the GNU GPL
  – Morphs, mapping only to Quasi-Functional Morphology
    The tokenization, clustering, modeling of conditionality, …

(wabijAnibihA) [jAnib_1] wa/CONJ + bi/PREP + jAnib/NOUN + i/CASE_DEF_GEN + hA/POSS_PRON_3FS

  C---------   wa        CONJ                and
  P---------   bi        PREP                at
  N-------2R   jAnib+i   NOUN+CASE_DEF_GEN   side of
  S----3FS2-   hA        POSS_PRON_3FS       her
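
The step from the analyzer's morph string to the four tokens listed above can be sketched roughly as follows. This is only an illustration, not the PADT or Buckwalter implementation: the function names `parse_morphs` and `group_tokens` are hypothetical, and the grouping rule (a morph whose tag starts with `CASE_` attaches to the preceding token) is an assumption chosen because it reproduces this one example.

```python
def parse_morphs(analysis):
    """Split a 'form/TAG + form/TAG' string into (form, tag) pairs."""
    pairs = []
    for chunk in analysis.split(" + "):
        form, tag = chunk.strip().split("/", 1)
        pairs.append((form, tag))
    return pairs


def group_tokens(morphs, attach_prefixes=("CASE_",)):
    """Merge morphs into tokens.

    Assumption for this sketch only: a morph whose tag starts with one of
    attach_prefixes is glued to the token before it; every other morph
    starts a new token.
    """
    tokens = []
    for form, tag in morphs:
        if tokens and tag.startswith(attach_prefixes):
            prev_form, prev_tag = tokens[-1]
            tokens[-1] = (prev_form + "+" + form, prev_tag + "+" + tag)
        else:
            tokens.append((form, tag))
    return tokens


analysis = "wa/CONJ + bi/PREP + jAnib/NOUN + i/CASE_DEF_GEN + hA/POSS_PRON_3FS"
tokens = group_tokens(parse_morphs(analysis))
for form, tag in tokens:
    print(form, tag)
```

The positional tags in the first column of the listing (for example `N-------2R`) are not derived here, since the slides do not spell out the mapping from morph tags to positions.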
Xerox Morphological Analyzer
MorphoTrees Hierarchy
MorphoTrees of Arabic propose these levels
  Entity – the analyzed elements of the discourse
  Partitioning to the standard forms of the tokens
  Non-vocalized standard orthographical forms
  Lemmas/identifiers of lexical units
  Tokens – syntactic units including the form and the tag
Independence of the language / implementation
  More/different levels, inclusion of spelling variations, …
  Annotation of various tagsets, other features of tokens
Efficiency of decision-making
  Distance between analyses becomes recognized
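
One way to picture the five levels is as a nested mapping built from a flat list of analyses. The sketch below is an assumption about a convenient data layout, not the format TrEd actually stores: `build_morphotree` and `count_leaves` are hypothetical names, and the entity, forms, lemmas and tags are artificial placeholders rather than real Arabic data.

```python
def build_morphotree(entity, analyses):
    """Group flat analyses into entity > partitioning > form > lemma > tokens.

    Each analysis is a tuple (partitioning, form, lemma, token), where
    partitioning is a tuple of non-vocalized forms and token is a
    (vocalized_form, tag) pair.
    """
    tree = {}
    for partitioning, form, lemma, token in analyses:
        by_form = tree.setdefault(partitioning, {})
        by_lemma = by_form.setdefault(form, {})
        by_lemma.setdefault(lemma, []).append(token)
    return {entity: tree}


def count_leaves(node):
    """Count token leaves below a node of the nested structure."""
    if isinstance(node, list):
        return len(node)
    return sum(count_leaves(child) for child in node.values())


# Placeholder analyses of a made-up entity "XY" (not real Arabic data).
analyses = [
    (("XY",), "XY", "lemma_1", ("XaY", "TAG_A")),
    (("XY",), "XY", "lemma_1", ("XuY", "TAG_B")),
    (("XY",), "XY", "lemma_2", ("XaYi", "TAG_C")),
    (("X", "Y"), "X", "lemma_3", ("Xa", "TAG_D")),
    (("X", "Y"), "Y", "lemma_4", ("Yi", "TAG_E")),
]
tree = build_morphotree("XY", analyses)
print(count_leaves(tree))
```

With this layout, analyses that share a partitioning, form and lemma sit under the same branch, which is one reading of the point that the distance between analyses becomes recognizable.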
MorphoTrees Annotation
Selecting the leaves that correspond to the proper reading of the tokens constituting the entity
  Quick use of keyboard and/or mouse for annotations
Restricting the tree according to the criteria/categories required by the context
  Natural control over the inheritance of restrictions
Employing automatic restrictions and annotation actions, both generic and linguistic
Learning about the discriminative categories and “human tagging”
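
Restricting the tree by required categories can be imagined as filtering token leaves on their positional tags. The following is a minimal sketch under stated assumptions: a restriction is modeled as a mapping from a 0-based tag position to a required character, inheritance is modeled as a child restriction overriding its parent, and the names `matches`, `restrict` and `inherit` are hypothetical. The leaves reuse the four tags shown on the Buckwalter slide, and no meaning is assigned to individual positions beyond what that slide shows.

```python
def matches(tag, restriction):
    """True if the tag has the required character at every restricted position."""
    return all(tag[pos] == value for pos, value in restriction.items())


def restrict(leaves, restriction):
    """Keep only the (form, tag) leaves whose tag satisfies the restriction."""
    return [leaf for leaf in leaves if matches(leaf[1], restriction)]


def inherit(parent, child):
    """Combine restrictions; in this sketch the child's values win on conflict."""
    combined = dict(parent)
    combined.update(child)
    return combined


leaves = [
    ("wa", "C---------"),
    ("bi", "P---------"),
    ("jAnib+i", "N-------2R"),
    ("hA", "S----3FS2-"),
]

nouns = restrict(leaves, {0: "N"})
with_2_at_pos_8 = restrict(leaves, {8: "2"})
both = restrict(leaves, inherit({8: "2"}, {0: "S"}))
print(nouns, with_2_at_pos_8, both)
```

An empty restriction keeps every leaf, so an unrestricted tree is simply the special case `restrict(leaves, {})`.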
Discussion and Conclusion
MorphoTrees
  Important in morphological annotation and in evaluation
  PADT 1.0 provides 148 000 annotated tokens
Functional Morphology
  … more in Prague Arabic Dependency Treebank: Development in Data and Tools
  Even its approximation is promising and welcome
Feature-Based Tagger trained on Penn ATB 2
  3.6% error rate in major part-of-speech (15 values)
  10.8% in the full tagset (317 evidenced combinations)
  0.8–0.6% error rate in tokenization of the input