Structure and Analysis of the Arabic Verb



AN HPSG ANALYSIS OF ARABIC VERB

Md. Shariful Islam Bhuyan*, and Reaz Ahmed*

*Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka-1000, Bangladesh

[email protected], reaz@cse.buet.ac.bd

ABSTRACT

Although Head-driven Phrase Structure Grammar (HPSG) is a successful syntactic theory in many respects, it has inadequate coverage of morphological constructions, especially of nonconcatenative morphology, which is prominent in Semitic languages such as Arabic and Hebrew. In this paper, we extend the HPSG framework to support the rich nonconcatenative morphology of the Arabic verbal system, the clearest instance of nonconcatenative morphology among the living languages. We also introduce the features needed for the syntactic and semantic aspects of an Arabic verb.

Keywords: Nonconcatenative Morphology, Head-driven Phrase Structure Grammar, Arabic Verbal Morphology, Constraint-based Grammar

1. INTRODUCTION

The development of broad-coverage precision grammars [1]-[3] and computational lexicons for deep linguistic processing is a research-intensive area with several potential applications [4]. Amidst the vast literature on formal linguistic theory [5], Head-driven Phrase Structure Grammar (HPSG) [6] holds a unique position: it combines the best features of contemporary approaches and establishes an integrated framework for cross-layer representation comprising phonology, morphology, syntax, semantics, pragmatics and discourse. Although HPSG successfully describes numerous syntactic and semantic phenomena, it lacks rigorous analyses of morphological phenomena, especially of nonconcatenative morphology [7]. Nonconcatenative morphology is an interesting paradigm of morphological operations that is prominent in Semitic languages such as Arabic and Hebrew [10]-[11].

Among the living languages, Arabic offers the clearest instance of nonconcatenative morphology. The Arabic verb system exhibits both concatenative and nonconcatenative morphology and can lexically express diverse syntactic and semantic phenomena. The formalisms of existing morphological analyzers for Arabic are not powerful enough to capture this higher-layer diversity. In this paper, we extend the HPSG framework to support rich nonconcatenative morphology, giving the first comprehensive HPSG construction of the Arabic verbal system.

2. ARABIC VERBAL SYSTEM

The Arabic language exhibits an extremely rich morphology [8]-[9]. Both concatenative and nonconcatenative operations take place in the formation of an Arabic word: inflection is carried out by concatenative operations, whereas derivation is carried out by nonconcatenative operations. Morpho-syntactic operations performed over morphemes thus come in two flavors, concatenative and nonconcatenative. Concatenative operations are those where morphemes are linearly concatenated. For example:

i. Prefixation: clear | unclear

ii. Suffixation: walk | walked

iii. Circumfixation: mind | unmindful

Nonconcatenative operations are those where morphemes are nonlinearly embedded. For example:

i. Infixation: kataba | kattaba

ii. Simulfixation: eat | ate

iii. Modification: man | men

iv. Suppletion: go | went

There are many other morpho-syntactic operations as well. In this paper, we focus mainly on nonconcatenative operations and give a mathematical formalism to capture their rich diversity.

Arabic word formation is an excellent example of nonconcatenative root-pattern morphology. A combination of root letters is plugged into a variety of morphological patterns, each with a priori fixed letters and a particular vowel melody, giving rise to the corresponding syntactic and semantic phenomena. To convey the richness of Arabic morphological patterns, which we call “measures” in this paper, we give the following example. Here, the root letters ‘k’, ‘t’, ‘b’, bearing the concept of writing, are plugged into various measures to obtain a myriad of syntactic and semantic phenomena. A set of measures with a particular semantic paradigm is called a “Form”. Arabic has many forms; among them, ten are used regularly. The root letters ‘k’, ‘t’, ‘b’ can be plugged into nine of them.

i. Form I (Transitive): kataba – He wrote


ii. Form II (Causative): kattaba – He caused to write

iii. Form III (Ditransitive): kaataba – He corresponded

iv. Form IV (Factitive): aktaba – He dictated

v. Form V (Reflexive): takattaba – It was written on its own

vi. Form VI (Reciprocity): takaataba – They wrote to each other

vii. Form VII (Submissive): inkataba – He was subscribed

viii. Form VIII (Reciprocity): iktataba – They wrote to each other

ix. Form X (Control): istaktaba – He asked to write
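The nine derived words listed above can be reproduced with a small root-and-pattern sketch. This is only an illustration, not the HPSG formalism of this paper: the template notation (digits 1 to 3 marking the slots for the root letters) and the names FORM_TEMPLATES, interdigitate and derive are assumptions made for this example. The templates cover only the active perfect, third person masculine singular shapes shown in the list.

```python
# Hypothetical template notation: the digits 1, 2, 3 mark the slots for the
# first, second and third root letter; every other character is fixed
# material supplied by the measure.
FORM_TEMPLATES = {
    "I": "1a2a3a",
    "II": "1a22a3a",
    "III": "1aa2a3a",
    "IV": "a12a3a",
    "V": "ta1a22a3a",
    "VI": "ta1aa2a3a",
    "VII": "in1a2a3a",
    "VIII": "i1ta2a3a",
    "X": "ista12a3a",
}


def interdigitate(root, template):
    """Fill the numbered slots of a template with the root letters."""
    return "".join(
        root[int(ch) - 1] if ch.isdigit() else ch for ch in template
    )


def derive(root, form):
    """Return the word obtained by plugging a root into the given Form."""
    return interdigitate(root, FORM_TEMPLATES[form])


if __name__ == "__main__":
    for form in FORM_TEMPLATES:
        print(form, derive(("k", "t", "b"), form))
```

Keeping the root as a tuple of letters, separate from the template, mirrors the idea that the root supplies the concept while the measure supplies the fixed letters and the vowel melody.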

The above example illustrates the derivational paradigm of an Arabic word. However, there is also an inflectional paradigm, which is governed by agreement information. Every entry of Table 1 can take fourteen inflectional forms according to number, gender and person. For the imperfect form, there are three such inflectional paradigms. Tables 2 and 3 show the inflectional paradigms for the active perfect and passive perfect entries of Form I. An Arabic word can encode a complete sentence, for example sayaktubuhu (He will write it). We can break this word into the following components.

[Diagram: the word sa-yaktubu-hu split into a prefix sa (future particle), a root (writing concept), a measure (3rd/sg/masc/ind/imperf/act/form-I) and a suffix hu (the attached pronoun “it”).]

From the diagram, we can conclude that an Arabic word has four components:

1) Prefix: sa – the particle indicating future

2) Suffix: hu – the object pronoun attached as a clitic

3) Root: k, t, b – the root letters bearing the concept of writing

4) Measure: ya_ _u_u – bearing the syntactic and semantic information of the event

It may be possible to concatenate multiple prefixes and suffixes. However, there must be a single measure and a single set of root letters, where the measure packages the syntactic and semantic features and the root supplies the core concept. Here the measure indicates that the actor is 3rd person, singular and masculine; that the verb is in the indicative mood, active voice and derived Form I; and that the event has not yet been completed. If we plug in another set of root letters, for example n, S, r, which bears the concept of helping, we get sayanSuruhu (He will help him).

In our analysis, a measure may contain two parts: a stem-measure and an affix-measure. From Tables 2 and 3, we note a crucial point: for a particular inflectional paradigm, a certain portion of the word, containing all the root letters, is always constant. For Tables 2 and 3, these portions are “katab” and “kutib” respectively. We call this fixed portion the “stem-measure” and the remaining part, containing a prefix and/or suffix, the “affix-measure”.

Table 2 Inflectional Paradigm of Form I-Active-Perfect

Ind/Sub/Juss   Singular    Dual          Plural
3rd/Masc.      katab-a     katab-aa      katab-uua
3rd/Fem.       katab-at    katab-ataa    katab-na
2nd/Masc.      katab-ta    katab-tuma    katab-tum
2nd/Fem.       katab-ti    katab-tuma    katab-tunna
1st            katab-tu    katab-na      katab-na

Based on this analysis, we can give the following model of an Arabic word.

A root-derived Arabic word =

Prefix + affix-measure (stem-measure (Root)) + Suffix
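The word model above can be read as nested function application. The following is a minimal sketch under stated assumptions: the function names, the digit-slot template notation and the representation of an affix-measure as a (prefix, suffix) pair are all hypothetical, chosen only to make the composition explicit. The stem templates and affixes are taken from the examples and from Tables 1 to 3.

```python
def stem_measure(root, template):
    """Plug the root letters into a stem template.

    Digits 1-3 in the (hypothetical) template mark the root-letter slots.
    """
    return "".join(
        root[int(ch) - 1] if ch.isdigit() else ch for ch in template
    )


def affix_measure(stem, affix):
    """Wrap a stem in the inflectional (prefix, suffix) pair of the measure."""
    measure_prefix, measure_suffix = affix
    return measure_prefix + stem + measure_suffix


def build_word(root, stem_template, affix=("", ""), prefix="", suffix=""):
    """Prefix + affix-measure(stem-measure(Root)) + Suffix."""
    stem = stem_measure(root, stem_template)
    return prefix + affix_measure(stem, affix) + suffix


if __name__ == "__main__":
    ktb = ("k", "t", "b")
    nSr = ("n", "S", "r")
    # Form I active imperfect stem "ktub", affix-measure ya-...-u,
    # future particle sa- and object clitic -hu.
    print(build_word(ktb, "12u3", ("ya", "u"), prefix="sa", suffix="hu"))
    print(build_word(nSr, "12u3", ("ya", "u"), prefix="sa", suffix="hu"))
    # Form I perfect: stems "katab" (active) and "kutib" (passive).
    print(build_word(ktb, "1a2a3", ("", "tu")))
    print(build_word(ktb, "1u2i3", ("", "a")))
```

Swapping only the root tuple changes the concept (writing versus helping) while the measure, prefix and suffix stay fixed, which is the separation the model is meant to express.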

Table 1 Derivational Paradigm of root “ktb”

                          FORM I         FORM II        FORM III       FORM …
Active perfect            katab-a        kattab-a       kaatab-a       …
Passive perfect           kutib-a        kuttib-a       kuutib-a       …
Active imperfect          ya-ktub-u      yu-kattib-u    yu-kaatib-u    …
Passive imperfect         yu-ktab-u      yu-kattab-u    yu-kaatab-u    …
Active imperative         u-ktub         kattib         kaatib         …
Passive imperative        litu-ktab      litu-kattab    litu-kaatab    …
Verbal noun               kitaab-atun    ta-ktiib-un    kitaab-un      …
Active participle         kaatib-un      mu-kattib-un   mu-kaatib-un   …
Passive participle        ma-ktuub-un    mu-kattab-un   mu-kaatab-un   …
Locative participle       ma-ktab-un     …              …              …
Instrumental participle   mi-ktab-un     …              …              …
…                         …              …              …              …


Table 3 Inflectional Paradigm of Form I-Passive-Perfect

Ind/Sub/Juss   Singular    Dual          Plural
3rd/Masc.      kutib-a     kutib-aa      kutib-uua
3rd/Fem.       kutib-at    kutib-ataa    kutib-na
2nd/Masc.      kutib-ta    kutib-tuma    kutib-tum
2nd/Fem.       kutib-ti    kutib-tuma    kutib-tunna
1st            kutib-tu    kutib-na      kutib-na

There are syntactic and semantic features that govern the derivational and inflectional paradigms for Arabic roots. Based on a linguistic investigation, we have listed some features that will be used in this paper. The attributes in Tables 4 and 5 govern the derivational and inflectional paradigms for an Arabic root, respectively.

Table 4 Attributes Governing Derivational Paradigm

Attribute   Values
POS         noun, verb, particle
FORM        I, II, III, IV, …
VOICE       active, passive
VFORM       perfect, imperfect, imperative

Table 5 Attributes Governing Inflectional Paradigm

Attribute      Values
MODALITY       emphatic, uncertainty
MOOD           indicative, subjunctive, jussive
PERSON         1st, 2nd, 3rd
NUMBER         singular, dual, plural
GENDER         masculine, feminine
CASE           nominative, accusative, genitive
DEFINITENESS   definite, indefinite
POLARITY       affirmative, negative
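The attribute inventories of Tables 4 and 5 can be written down as a simple lookup and used to check a feature bundle. This is a sketch for illustration only: the dictionary names and the function invalid_features are hypothetical, and FORM is left unconstrained because Table 4 lists its values only up to an ellipsis.

```python
# Attribute inventories copied from Tables 4 and 5. FORM is left open (None)
# because the table elides its full value list.
DERIVATIONAL = {
    "POS": {"noun", "verb", "particle"},
    "FORM": None,
    "VOICE": {"active", "passive"},
    "VFORM": {"perfect", "imperfect", "imperative"},
}

INFLECTIONAL = {
    "MODALITY": {"emphatic", "uncertainty"},
    "MOOD": {"indicative", "subjunctive", "jussive"},
    "PERSON": {"1st", "2nd", "3rd"},
    "NUMBER": {"singular", "dual", "plural"},
    "GENDER": {"masculine", "feminine"},
    "CASE": {"nominative", "accusative", "genitive"},
    "DEFINITENESS": {"definite", "indefinite"},
    "POLARITY": {"affirmative", "negative"},
}


def invalid_features(bundle):
    """Return a list of problems found in an attribute-to-value mapping."""
    allowed = {**DERIVATIONAL, **INFLECTIONAL}
    problems = []
    for attribute, value in bundle.items():
        if attribute not in allowed:
            problems.append(f"unknown attribute {attribute}")
        elif allowed[attribute] is not None and value not in allowed[attribute]:
            problems.append(f"bad value {value!r} for {attribute}")
    return problems


if __name__ == "__main__":
    kataba = {
        "POS": "verb", "FORM": "I", "VOICE": "active", "VFORM": "perfect",
        "PERSON": "3rd", "NUMBER": "singular", "GENDER": "masculine",
    }
    print(invalid_features(kataba))
```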

This is not the whole story of Arabic morphology. Another facet is the concept of root class. We call a set of roots that share a common derivational and inflectional paradigm a root class. The class is determined by the characteristics of the root letters. The roots ‘k’, ‘t’, ‘b’ and ‘n’, ‘S’, ‘r’ are both members of the same root class, the sound root class.

3. AN HPSG PRIMER

A natural language generally consists of two components: first, the utterances that humans can use, and second, the linguistic rules that license those utterances. For example, in English, “He writes books”, “writes books” and “writes” are all valid utterances. However, “Writes he books”, “writes he” and “rwite” are not valid, since the rules do not license them. HPSG is a mathematical theory of natural language that formally captures these two core linguistic components. Utterances are modeled using a mathematical object called a Sign (a formal representation of linguistic objects such as phrases and words), and rules are captured using another mathematical object called a Construct (a formal representation of the grammar rules or schemata that are used to license signs). Both signs and constructs are described using feature structures: collections of features of the corresponding linguistic objects along with their values. These features and their values constitute a very detailed type hierarchy (see Figure 1). We use constructional HPSG [6] in this paper.

Figure 1: An HPSG Type Hierarchy

An utterance can have linguistic features spanning multiple layers, e.g. phonological, morphological, syntactic, semantic, pragmatic and others. To capture these features, the description of a typical HPSG sign looks like Figure 2. To capture grammatical rules, the feature structure of a construct has a mother (MTR) feature and a daughters (DTRS) feature. The value of MTR is a sign and the value of DTRS is a nonempty list of signs. The description of a typical HPSG construct looks like Figure 3. The licensing of signs follows the Sign Principle, which states that “Every sign must be lexically or constructionally licensed, where, lexically licensed only if it satisfies some lexical entry, and constructionally licensed only if it is the mother of some construct” [6].
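The relation between signs, constructs and the Sign Principle can be sketched as data structures. The sketch below is a deliberate simplification and not the feature-structure formalism itself: a sign is reduced to a form and a category label, “satisfies some lexical entry” is approximated by set membership, and the class and function names are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sign:
    """A drastically reduced sign: just a surface form and a category."""
    form: str
    category: str


@dataclass(frozen=True)
class Construct:
    """A construct with a mother sign (MTR) and daughter signs (DTRS)."""
    mtr: Sign
    dtrs: tuple

    def __post_init__(self):
        # DTRS must be a nonempty list of signs.
        if len(self.dtrs) == 0:
            raise ValueError("DTRS must be nonempty")


def is_licensed(sign, lexicon, constructs):
    """Sign Principle: lexically or constructionally licensed."""
    lexically = sign in lexicon
    constructionally = any(c.mtr == sign for c in constructs)
    return lexically or constructionally


if __name__ == "__main__":
    writes = Sign("writes", "V")
    books = Sign("books", "NP")
    vp = Sign("writes books", "VP")
    lexicon = {writes, books}
    constructs = [Construct(mtr=vp, dtrs=(writes, books))]
    print(is_licensed(writes, lexicon, constructs))
    print(is_licensed(vp, lexicon, constructs))
    print(is_licensed(Sign("rwite", "V"), lexicon, constructs))
```

A fuller model would also check that each daughter is itself licensed and that the mother satisfies the constraints of the construct type; both are omitted here to keep the sketch short.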

Figure 2: An HPSG Sign

From the type hierarchy of Figure 1, we can see that there are two types of feature structure. Functions are feature structures that are described using an attribute value matrix (AVM); they map features to feature structures. Atoms are atomic types that can be used as the values of features. Notable functions are sign, cxt (construction), lexeme, phrase and others. To model a linguistic phenomenon, we first need to identify the signs involved, together with their hierarchy. Next, we need to design functional feature structures for them with linguistically motivated features.


Figure 3: An HPSG Construction

Then we also need to define the necessary constructs as well as the atomic type hierarchy. In the next section, we build these ingredients for Arabic verbs.

4. ARABIC IN HPSG

Here we give the attribute value matrix (AVM) for the Arabic verb kataba (“He wrote”) in the active voice and for its corresponding passive form kutiba (“It was written”), using our analysis.

Figure 4: An HPSG Sign for kataba

Figure 5: An HPSG Sign for kutiba

In Figure 4, we have three features associated with morphology. First, the feature TYPE denotes the associated root class. Arabic roots are classified into several root classes according to their derivational and inflectional paradigms. This feature affects both root and measure; therefore, it has been lifted out to a first-level morphological feature. In this case, its value is sound. Next, the feature ROOT contains the list of root letters as well as the CONTENT feature, which gives the semantic contribution made by the root letters. In this case, its value is structure-shared with the write-fr in the FRAMES feature. Next, the feature MEASURE contains the morphological, syntactic and semantic information contributed by the measure. Within it, the feature FORM denotes the semantic paradigm; kataba is a Form I derivative. Next, the feature PATTERN captures the stem along with the root letters using structure sharing. Then, the feature CAT contains the syntactic category for this measure; its value is structure-shared with the syntactic feature CAT. Finally, the feature PNG captures the PERSON, NUMBER and GENDER information of the semantic actor in the case where it is not syntactically realized.

We present the syntactic and semantic information for the word kataba using the SYN and SEM features. First, the CAT feature identifies the syntactic category of kataba. It contains the VFORM and VOICE features of Arabic, which govern the derivational paradigm of the verb lexeme. In this case, their values are perfect and active respectively. Next, the VAL feature captures the subcategorization of verbs. VAL is a list of signs that are required by the syntactic head. In this case, the verb kataba requires an object, since the verb “write” is a transitive verb that takes an object. We should note that the hidden pronoun “he” is encoded by the inflectional morphology when no explicit subject is used. The semantic actor is not realized syntactically, so the verb subcategorizes only for a syntactic object. We can also see the constraints imposed on the object. In this version, its syntactic head should be a noun phrase with the value of its CASE feature set to accusative. The negative value of the OPT feature indicates that this object is not optional but is required for the utterance to be syntactically correct.

Figure 6: An HPSG Sign for yaktubu

Next, we need to consider some semantic features. Here, we use a typed-feature version of predicate logic to capture the semantics of natural language. First, we consider the INDEX feature, which is a reference to a discourse entity. Then comes the PNG feature, which captures the semantics of PERSON, NUMBER and GENDER. Next is the FRAMES feature, which serves as a bag of elementary predicates that describe the situation at hand. For example, in the case of kataba, the event of writing is expressed. The event is completed in the past and there is a discourse referent to the actor. To capture the core event, a write-predicate is introduced. To capture the temporal constraint, we use the perfect-predicate. Finally, to express the actor of the event, the hidden pronoun, we introduce a discourse referent with the corresponding PNG feature.

Predicates have their respective arguments. The write-predicate has a situation hook, expressed by the feature SIT. There are two semantic roles associated with this predicate. First, we consider the role of the writer, who plays a doer role, expressed by the feature ACTOR. Second, we consider the role of the thing written, which plays an undergoer role, expressed by the feature UNDGR. The perfect-predicate takes a situation hook as an argument, which is expressed by the feature ARG.

We use the technique of co-indexing for sharing semantic objects. The discourse referent predicate is actually the actor of the write-predicate. To denote this constraint, the INDEX value of the hidden pronoun and the ACTOR value of the write-predicate are co-indexed: both are given the value i. This is an example of reference co-indexing. We also use event co-indexing. The event hook SIT of the write-predicate, the situation hook of the entire scenario and the argument ARG of the perfect-predicate are all co-indexed and expressed using the value s. Another important issue in an HPSG representation is the syntax-semantics interface. In this example, the interface is established by co-indexing the INDEX value of the syntactic object and the UNDGR value of the write-predicate with the value j. This indicates that the syntactic object is the semantic undergoer, whereas, as noted in the previous discussion, the semantic actor is not syntactically realized.
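Co-indexing can be mimicked in code by letting several features point to one and the same object, so that the shared values are identical rather than merely equal. The sketch below encodes only the semantic part of the sign for kataba described above. The Index class, the PRED key and the predicate labels are hypothetical conventions for this example, not part of the analysis; the PNG values follow the third person masculine singular reading “He wrote”.

```python
class Index:
    """A discourse or situation index; identity, not name, is what matters."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"Index({self.name})"


def kataba_sign():
    """Build a nested-dictionary stand-in for the AVM of kataba."""
    s, i, j = Index("s"), Index("i"), Index("j")
    return {
        "SYN": {
            "CAT": {"VFORM": "perfect", "VOICE": "active"},
            # The required object: an accusative noun phrase, not optional.
            "VAL": [{"CAT": "noun", "CASE": "accusative",
                     "OPT": False, "INDEX": j}],
        },
        "SEM": {
            "INDEX": s,
            "FRAMES": [
                {"PRED": "write", "SIT": s, "ACTOR": i, "UNDGR": j},
                {"PRED": "perfect", "ARG": s},
                {"PRED": "referent", "INDEX": i,
                 "PNG": {"PERSON": "3rd", "NUMBER": "singular",
                         "GENDER": "masculine"}},
            ],
        },
    }


if __name__ == "__main__":
    sign = kataba_sign()
    write, perfect, referent = sign["SEM"]["FRAMES"]
    # Event co-indexing: one situation object shared three ways.
    print(write["SIT"] is perfect["ARG"] is sign["SEM"]["INDEX"])
    # Reference co-indexing: the hidden pronoun is the actor.
    print(referent["INDEX"] is write["ACTOR"])
    # Syntax-semantics interface: the syntactic object is the undergoer.
    print(sign["SYN"]["VAL"][0]["INDEX"] is write["UNDGR"])
```

Using object identity (the `is` operator) rather than equality reflects structure sharing: two features that are co-indexed hold the very same value, not two copies of it.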

Figure 7: An HPSG Sign for yuktabu

In Figure 5, we show the HPSG representation of the passive kutiba. We identify the changes associated with this conversion. The PATTERN feature is changed to capture the derivational morphological operation. The next change is in the feature VOICE, whose value becomes passive. Unlike English, which can have a prepositional complement in passives, Arabic passives do not subcategorize for a subject or any other argument. For this reason, the VAL list is empty. Moreover, the discourse referent in the feature FRAMES is now co-indexed with the UNDGR feature of the write-predicate, expressed by the value j. The semantic actor is now completely unknown, having no syntactic or semantic reference, which is a distinctive property of Arabic passives.

In Figures 6 and 7, we also give the HPSG signs for yaktubu and yuktabu, the active imperfect and passive imperfect forms of kataba. A newly introduced feature, MOOD, takes the value indicative in this case, and the value of VFORM changes to imperfect. An imperfect-predicate denotes the non-completion aspect of the event.

5. CONCLUSION

In this paper, we propose a way to capture nonconcatenative morphology, especially Arabic verb morphology, within the framework of HPSG. Much work remains for the future. To construct the full matrix of Table 1, we need to cope with the wide range of forms that an Arabic verb can take. The results should be helpful for the construction of resource grammars for languages with rich nonconcatenative morphology.

REFERENCES

[1] Copestake A., and Flickinger D., “An open-source grammar development environment and broad-coverage English grammar using HPSG,” Second Conference on Language Resources and Evaluation, 2000.

[2] Marimon M., Bel N., Espeja S., and Seghezzi N., “The Spanish Resource Grammar: pre-processing strategy and lexical acquisition,” ACL Workshop on Deep Linguistic Processing, 2007.

[3] Comrie B., Fabri R., Hume B., Mifsud M., Stolz T., and Vanhove M., (Eds), “Towards an HPSG Analysis of Maltese,” 1st International Conference on Maltese Linguistics, 2007.

[4] Bond F., Oepen S., Siegel M., Copestake A., and Flickinger D., “Open source machine translation with DELPH-IN,” Open-Source Machine Translation Workshop at the 10th Machine Translation Summit, pp. 15-22, 2005.

[5] Sells P., Lectures on Contemporary Syntactic Theories, Stanford: CSLI Publications, 1985.

[6] Sag I., and Wasow T., Syntactic Theory: A Formal Introduction, Stanford: CSLI Publications, 1999.

[7] Bird S., and Klein E., “Phonological Analysis in Typed Feature Systems,” Computational Linguistics, vol. 20, pp. 55-90, 1994.

[8] Beesley K., “Finite-State Morphological Analysis and Generation of Arabic at Xerox Research: Status and Plans in 2001,” ACL Workshop on Arabic Language Processing: Status and Prospects, pp. 1-8, 2001.

[9] Smrž O., Functional Arabic Morphology. Formal System and Implementation, PhD Dissertation, Charles University in Prague, 2007.

[10] Riehemann S., A Constructional Approach to Idioms and Word Formation, PhD Dissertation, Stanford University, 2001.

[11] Riehemann S., “Type-Based Derivational Morphology,” Journal of Comparative Germanic Linguistics, vol. 2, pp. 49-77, 1998.