PAN LOCALIZATION PROJECT Phase 2/CCs/Mongolia/NUM... · 2012-03-26 · 1. Abstract This report...


PAN LOCALIZATION PROJECT

REPORT ON (Tagset for Mongolian)

Phase 1.1

September, 27th 2007

CENTER FOR RESEARCH ON LANGUAGE PROCESSING NATIONAL UNIVERSITY OF MONGOLIA, ULAANBAATAR MONGOLIA

Report: Tagset for Mongolian 1


Table of Contents

1. Abstract
2. Introduction
3. Overview and explanation of part-of-speech and their tags
   Common Noun (NC)
   Proper noun (NP)
   Numeral (NN)
   Transitive Verb (Vt)
   Intransitive Verb (Vin)
   Auxiliary Verb (AUX)
   Adjective (ADJ)
   Adverb (ADV)
   Modal Word (MW)
   Modal Morph (MM)
   Non-possessive pronoun (NPPN)
   Possessive pronoun (PPN)
   Pro-numeral (PNN)
   Pro-adjective (PADJ)
   Pro-adverb (PADV)
   Pro-verb (PV)
   Conjunction (CJ)
   Interjection (INT)
   Abbreviation (ABR)
   Punctuation (PUN)
4. Conclusion
Appendix A
Appendix B
Appendix C


1. Abstract

This report presents the first tagset for the Mongolian language, consisting of 20 tags. It is a high-level tagset intended for tagging a Mongolian corpus that is currently being built. To test the tagset, around 1,000 words have been tagged with it.


2. Introduction

It is estimated that approximately 6,000 languages are currently spoken in the world. Since these languages have both commonalities and unique peculiarities, it is possible to categorize them by origin as well as typologically. As to its origins, the Mongolian language belongs to the Altaic language family, and typologically, it is an agglutinative language. That is to say, in the formation of words, a base or stem is followed by derivational and inflectional suffixes, which are concatenated linearly. This is considerably different from the inflectional morphology of Indo-European languages and the isolating morphology of Sinitic languages. Mongolian has a long history as well as a rich lexicon.

With the intention of using technology to study the Mongolian language, and of keeping in step with international developments in this new era of information technology, we propose this tagset for Mongolian words. In classifying the words, we have primarily taken into consideration their forms and distribution, as well as their meanings and functions. Altogether, we have created 20 tags, each of which may be further subdivided by meaning and function. For example, proper nouns are divided into the following categories:

1. Personal names (anthroponyms)
2. Toponyms
3. Hydronyms
4. Animal names (zoonyms)
5. Names of institutions
6. Names of books, newspapers, and journals
7. Neologisms
8. Names of planets and stars

These 20 tags may be divided into the following foundation categories:

1. Noun (NC, NP, NN)
2. Verb (Vt, Vin, AUX)
3. Ad-word (ADJ, ADV)
4. Modal (MW, MM)
5. Pro-word (NPPN, PPN, PNN, PADJ, PADV, PV)
6. Conjunction (CJ)
7. Interjection (INT)
8. Abbreviation (ABR)
9. Punctuation (PUN)
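For readers who want to use the tagset programmatically, the nine foundation categories and their 20 tags could be represented roughly as follows. This is only an illustrative sketch: the names FOUNDATION_CATEGORIES, TAG_TO_CATEGORY, and category_of are hypothetical and are not part of the report or of any project software.

```python
# Tags grouped by foundation category, exactly as listed in the report.
FOUNDATION_CATEGORIES = {
    "Noun": ["NC", "NP", "NN"],
    "Verb": ["Vt", "Vin", "AUX"],
    "Ad-word": ["ADJ", "ADV"],
    "Modal": ["MW", "MM"],
    "Pro-word": ["NPPN", "PPN", "PNN", "PADJ", "PADV", "PV"],
    "Conjunction": ["CJ"],
    "Interjection": ["INT"],
    "Abbreviation": ["ABR"],
    "Punctuation": ["PUN"],
}

# Hypothetical reverse lookup: tag -> foundation category.
TAG_TO_CATEGORY = {
    tag: category
    for category, tags in FOUNDATION_CATEGORIES.items()
    for tag in tags
}


def category_of(tag: str) -> str:
    """Return the foundation category of a tag; raises KeyError if unknown."""
    return TAG_TO_CATEGORY[tag]


if __name__ == "__main__":
    print(len(TAG_TO_CATEGORY), "tags in", len(FOUNDATION_CATEGORIES), "categories")
    print("PADV belongs to:", category_of("PADV"))
```

A reverse lookup of this kind makes it easy to verify that every tag belongs to exactly one foundation category and that the total is 20.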

In traditional Mongolian linguistics, there are three schools of classifying words—the Roman-European, Indo-Tibetan, and structuralist. The structuralist method has been followed by many scholars in classifying Mongolian words, including N. N. Poppe (1960), John Charles Street (1963), Sh. Luwsanwandan (1967), T. A. Bertagaev (1969), P. Byambasan (1989), and Ts. Önörbayan (2004). We largely follow Ts. Önörbayan’s dual system classification, though with slight modification. For example, in this recent classification, words and morphemes such as mash (very), nen (even more), tun (extraordinary), xaga (to pieces), xuga (to piece), tsoo (through), shüü (really), daa (indeed), bidz (probably), magad (undoubted(ly)/probably), and ünexeer (truly), which add various kinds of modal meanings to the basic meaning of a sentence or its parts, and which are known variously throughout the literature as “adverbs” (Mo. daiwar üg, lit. “extra word”), “uninflected words” (Mo. nöxtsölgüi üg), “particles” (Mo. sul üg, literally “weak word”), “onomatopoeic words” (Mo. awia duuraix üg), and “verb-like adverbs” (Mo. üiliin dür baidliin daiwar üg), are termed “decorative words” (in Mongolian, chimex üg). Since these “decorative words” often occur at word boundaries and sentence margins, we have generally divided them into sentential particles and modal particles.


However, since words such as mash (very), nen (even more), and tun (extraordinary) appear only before adjectives and modify them, we have included them in the adjective category. Likewise, since xaga (to pieces), xuga (to piece), and tsoo (through) occur before verbs and have the same function as adverbs, we have included these words in the adverb category. Within these two classifications we have made further semantic classifications as well.

At first, we made a deeper classification of the parts of speech and created 61 tags (see Appendix B). When we tagged a sample of 700 words, the tagging of some words was open to dispute. The main difficulty arose because decorative words include both words and morphemes. We therefore discussed and reduced the tagset several times, and finally arrived at 16 general tags. With these 16 tags, we manually tagged the approximately 1,500 words of the editorial article in issue No. 2 (17219) of the 1989 Ünen newspaper (see footnote 1).

However, since tagging all pro-words with the PW tag was ambiguous, and insufficient for expressing exactly which word class the pro-word was representing, it became necessary to change the pro-word tag. For this reason, we split PW into six categories: NPPN (non-possessive pronoun), PPN (possessive pronoun), PNN (pro-numeral), PV (pro-verb), PADJ (pro-adjective), and PADV (pro-adverb).

Defective verbs, together with auxiliary verbs, were likewise included in this classification under a single tag (AUX). This is because the defective verbs a- and bö- preserve some qualities of auxiliary verbs. Auxiliary and defective verbs thus share common features: they are subordinate to nouns and verbs and express their tense, aspect, and state. Some scholars hold that defective verbs and auxiliary verbs should be treated as distinct categories, while others regard the defective verbs a- and bö- as early forms of the auxiliary verbs bai- and bol-.

The final 20 tags generally correspond to the part-of-speech classifications of some Altaic languages, such as Japanese, Korean, and Turkish, as well as that of English.

1 We chose the editorial article in the newspaper Ünen because it is an excellent specimen of Mongolian writing, which brought the most important information and news to its readers at the time, and which was free of any typographic mistakes or stylistic errors.


3. Overview and explanation of part-of-speech and their tags

Common Noun (NC)

Nouns, which are fully inflected according to number, case, and possession, are classified into common and proper nouns [12]. Common nouns are nouns that denote any or all members of a class. According to their meanings, common nouns are divided into human, animal, plant, sensation, mind, etc.

Proper noun (NP)

Proper nouns are nouns which represent unique entities [12]. Proper nouns are common to all languages and differ cross-linguistically only in their subcategories. Proper nouns inflect according to case and possession. They are divided into anthroponyms, toponyms, hydronyms, zoonyms, and names of books, periodicals, and institutions. Proper nouns sometimes function as adjectives. They do so when they express an array of things or a certain feature, having lost their main meaning. Proper nouns also play the role of adjectives when they are formed with plural suffixes. In Cyrillic orthography, the first letter of adjectives and common nouns is written in lowercase, to distinguish them from their proper noun equivalents.

Numeral (NN)

Numerals are words which represent numbers. Comparing Mongolian with other languages (e.g. Korean, Turkish, and English) and dialects (e.g. Inner Mongolian dialects) reveals that we may classify numerals into categories such as cardinal, ordinal, collective, approximative, distributive, multiplicative, and partitive numerals.

Transitive Verb (Vt)

Transitive verbs are verbs which may govern nouns that carry accusative case inflections and serve as objects [9]. Transitive verbs are found throughout the languages of the world.

Intransitive Verb (Vin)

Intransitive verbs are verbs which may govern nouns that do not serve as objects and that carry any inflection other than that of the accusative case [9]. Intransitive verbs are common throughout the languages of the world.

Auxiliary Verb (AUX)

The verbs with the roots bai-, bol-, a-, and bö- assist nouns and verbs in forming both nominal and verbal predicates, and in expressing the tense, aspect, and mood of the predicate [3]. In the preclassical period, the auxiliary verbs a- and bö- were used instead of today’s bai- and bol- [15]. There is an abundance of examples of the defective verbs a- and bö- in the old literary Mongolian language. They commonly occur in both spoken and written Mongolian.


Adjective (ADJ)

Adjectives modify nouns and specify attributes and qualities of things or people, usually making their meaning more specific. In Mongolian, adjectives largely modify only nouns, just as in English.

Since as early as the 1930s, linguists in Mongolia and abroad who have studied Mongolian parts of speech have classified adjectives variously as “quality nouns” (Mo. chanarïn ner) [16, 20], “property nouns” (Mo. xürteel ner) [17, 21, 22, 23], “symbolic nouns” (Mo. beleg temdgiin ner) [4], “marker nouns” (Mo. temdeg ner) [2, 12], “nouns of markers” (Mo. temdgiin ner) [7, 11], “marker word” (Mo. temdeg üg) [6], “non-numeral property nouns” (Mo. toonï bus xürteel ner) [5], “noun marker words” (Mo. neriin temdeg üg) [15], “noun attributes” (Mo. neriin todotgol) [24], “attributive nouns” (Mo. xawsral ner) [13, 14, 25], etc. However, we prefer to term this category “adjective” (Mo. ner xawsral). This is because these words are subordinated to nouns and because they only modify nouns. The difference between our classification and other classifications of this category is that our usage of the term “adjective” (Mo. ner xawsral), also includes such morphemes as emphatic particles (Mo. xüch nemegdüülsen sul üg), “uninflecting words” (Mo. nöxtsölgüi üg) [8], and “premodifier decorative words” (Mo. ugtan chimex üg) [12], e.g. mash (very), nen (even more), tun (extraordinary), ülemj (big, very, large).

Adverb (ADV)

Adverbs are words which refer to the duration, place, direction, or situation of actions which have been completed, are in progress, or will occur in the future. Such words always modify only verbs. In Mongolian, there are a few original adverbs that do not serve as adjectives or nouns and exist only as adverbs. There are also passive-root adverbs, which can be used only with certain verbs. In traditional Mongolian linguistic research, such words have been largely termed “extra words” (Mo. daiwar üg) and have been subdivided into categories such as situational, temporal, and spatial. Inner Mongolian dialects, as well as languages such as Turkish, Japanese, Korean, and Nepali, have such “extra words” (Mo. daiwar üg) as independent parts of speech.

In our classification of such words as “adverbs” (Mo. üil xawsral), we have included some adjectives which come before the verb and modify it, including such onomatopoeic words as xaga (to pieces), xuga (to piece), tsoo (through), tsöm (to pierce), newt (through), chag chag (the sound of a clock), and jiw jiw (the sound of a lark). This is because these words come only before the verb and modify a given action. Onomatopoeic words are often considered types of “extra words” (Mo. daiwar üg) [12, 23], particles (Mo. sul üg, literally “weak words”) [4], or “uninflecting words” (Mo. nöxtsölgüi üg) [8].

Modal Word (MW)

Modal words are auxiliary words which add various modal meanings to the parts of a sentence or to the sentence as a whole. In this category, we have included “post-sentential decorative words” (Mo. ögüülberiig dagan chimex üg), e.g. xeregtei (necessary/should) and yostoi (must), as well as “pre-sentential decorative words” (Mo. ögüülberiig ugtan chimex üg), e.g. medeej (of course), chuxamdaa (properly), and yalanguya (specially). Since these words function primarily to express the speaker’s mood, we have termed them modal words. Modal words are similar to modal morphs in expressing various modal meanings, but the two differ considerably in structure (1) and distribution (2). It is therefore very important to distinguish them.

(1) In terms of structure, a modal word can be separated into its root and its derivational and inflectional suffixes.
(2) A modal word alternates its distribution freely and, like other parts of the sentence, independently expresses modal ideas and extra meanings, depending on which content is considered important.

Modal Morph (MM)

Modal morphs are auxiliary morphemes which add various modal meanings to the basic meaning of a sentence or to parts of it. They are not full words, since they do not denote things and simply express emotion. In earlier literature, modal morphs have been variously termed “uninflecting words” (Mo. nöxtsölgüi üg), “particles” (Mo. sul üg, lit. “weak words”), and “sentence endings” (Mo. ögüülberiin nöxtsöl), all included in the category of “extra words” (Mo. daiwar üg). Within our classification of sentence particles, we have included words such as shüü (really), daa (indeed), bidz (probably), vii (aha), yüm (really), uu, be, and yuu (question modal morphs), which Ts. Önörbayan has termed “post-sentential decorative words” (Mo. ögüülberiig dagan chimex üg). This is because these morphs do not add semantic information to the individual words they appear next to, but rather add various modal nuances to the sentence as a whole. Other morphs that are connected to words or phrases include the negative morphs es and ül (no/not). The restrictive focus morph l and the additive focus morph ch (with) are also included in this category. Modal morphs are not common in the languages of the world, and are related to the agglutinative morphology of Mongolian.

Modal morphs are similar to modal words in expressing various modal meanings, but the two differ considerably in structure (1) and distribution (2). It is therefore very important to distinguish them.

(1) A modal morph does not take a suffix, because it is not a word; it is just a free morph.
(2) A modal morph does not express additional information or modal meaning in a sentence on its own. It expresses modal meaning only together with appropriate words.

Non-possessive pronoun (NPPN)

Pronouns can be divided into possessive and non-possessive pronouns, in accordance with their distribution in the noun phrase of the sentence. Non-possessive pronouns represent either common or proper nouns. In this classification, we include pronouns that relate to people, animals, and things, other than personal pronouns taking genitive case endings: for example, bi ‘I’, chi ‘you (singular, familiar)’, ta ‘you (singular, respectful)’, bid ‘we’, ta nar ‘you (plural)’, ter ‘he, she, it’, ted ‘they’, xen ‘who’, yuu ‘what’, etc.

Possessive pronoun (PPN)

Possessive pronouns represent nouns which relate to the person only. In this classification, we have included the genitive forms of personal pronouns, including minii ‘my’, manai ‘our’, chinii ‘your (singular, familiar)’, tanii ‘your (singular, respectful)’, etc.

Pro-numeral (PNN)

Pro-numerals are words that represent the quantity, order, combination, outline, distribution, or wholeness of people, animals, things, or phenomena. Also included in this category are such interrogative words as xed ‘how many’ and xichneen ‘how many’, and such demonstrative words as öd töd ‘quite a lot of’ and ödii tödii ‘quite a lot of’.


Pro-adjective (PADJ)

Pro-adjectives represent a feature of people, animals, and things. This classification includes such interrogative pro-words as ali ‘which’, yamar ‘what kind of’, etc., as well as demonstrative pro-words such as iim ‘this way, in this manner’, tiim ‘that way, in that manner’, etc.

Pro-adverb (PADV)

Pro-adverbs represent the place, time, purpose, cause, and manner of actions. Also included in this classification are pro-words asking about or describing verbal tenses or locations, e.g. end ‘here’, tend ‘there’, ödiid ‘now, at this time’, tödiid ‘then, at that time’, xedzee ‘when’, xaana ‘where’, etc.

Pro-verb (PV)

Pro-verbs represent an action of people, animals, and things. This classification includes the demonstrative pro-verbs inge- ‘to do this way, to do in this manner’ and teg- ‘to do that way, to do in that manner’, and such interrogative pro-verbs as yaa- ‘to do in what manner, how’, xerx- ‘to do what’, etc.

Conjunction (CJ)

Conjunctions are words that join independent sentences or independent parts of sentences. In our classification, we have included both conjunctions which join sentences, e.g. tiim uchraas (that’s why) and üünchlen (thus), and conjunctions which join words within sentences, e.g. tuxai (about), talaar (on, with regard to), and tuld ((in order) to). Conjunctions may be further classified into contrastive conjunctions and coordinating conjunctions. We have also included the verbal stem ge- as a conjunction in our classification. Linguists and researchers have termed this category variously as “conjunctions” (Mo. xolboos) [6, 12, 18, 23], “connective words” (Mo. xolbox üg) [2, 12], “connective morphemes” (Mo. xolbox büteewer) [1], and “extra-sentential conjunctions” (Mo. ögüülberiin gadaad xolboos) [1].

Interjection (INT)

Interjections (Mo. ayalga üg) are words that have no grammatical connection to the utterance they occur in and only express emotion on the part of the speaker. Interjections are traditionally known as “non-structural sentences” (Mo. bütets bus ögüülber) [1, 8], “accent sentences” (Mo. ayalga üg) [6, 17, 20, 23], and “accent word-sentences” or “interjection-sentences” (Mo. ayalga üg-ögüülber) [19], all of which should be seen as “accent sentences” (Mo. ayalga üg). This is because interjections such as tii tii (brr), yoo yoo (ouch), and pööx (ugh; whoa) can fully express a speaker’s emotions at the sentential level.

Abbreviation (ABR)

In most languages, there are rules for abbreviating words in speech and writing. They are as follows:

Words are abbreviated to their first letters in writing but pronounced in their full form. For example, UIH (УИХ, read as Ulsïn Ix Xural ‘State Great Khural’), OHU (ОХУ, read as Orosïn Xolboonï Uls ‘Russian Federation’), and BSSHUY (БСШУЯ, read as Bololwsrol Soyol Shinjlex Uxaanï Yaam ‘Ministry of Education, Culture, and Science’).

Words are abbreviated to their first letters in writing and pronounced as the abbreviation. For example, Nüb (НҮБ, an abbreviation for Negdsen Ündestnii Baiguullaga); and APU (АПУ), an abbreviation for Arxi Piwo Undaa ‘Liquor, Beer, and Beverages’ (a company name), which is read as Apuu.

Words are abbreviated to their first syllables and read as the abbreviation. For example, MONEL (МОНЕЛ), an abbreviation for Mongol Elektronik ‘Electronics Company’; and MONTSAME (МОНЦАМЭ), an abbreviation for Mongoliin Tsahilgaan Medee ‘Mongolian Tele News’.

Punctuation (PUN)

Punctuation in Mongolian follows essentially the same system as that of English and some other languages, and is used in accordance with grammar, semantics, and accent. The following punctuation symbols are used in Mongolian: (? ! . , ... : ; - ∼ § () {} [] “” «»).
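Because PUN is a closed class, it is the one tag that can be assigned by simple lookup. The sketch below illustrates this under the assumption that the text has already been split into tokens; the names PUNCTUATION_MARKS and tag_punctuation are hypothetical and do not come from the report.

```python
# Punctuation symbols listed in the report; paired marks are stored individually.
PUNCTUATION_MARKS = {
    "?", "!", ".", ",", "...", ":", ";", "-", "∼", "§",
    "(", ")", "{", "}", "[", "]", "“", "”", "«", "»",
}


def tag_punctuation(tokens):
    """Return (token, tag) pairs: "PUN" for listed marks, None for anything else.

    Tokens tagged None still need one of the other 19 tags, which cannot
    be decided by lookup alone.
    """
    return [(tok, "PUN" if tok in PUNCTUATION_MARKS else None) for tok in tokens]


if __name__ == "__main__":
    for token, tag in tag_punctuation(["«", "энд", "»", ",", "..."]):
        print(repr(token), tag)
```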


4. Conclusion

We designed the 20 tags and classified them primarily according to word forms and structures. Of these, 14 tags largely correspond to their equivalents in English: NC (common noun), NP (proper noun), NN (numeral), ADJ (adjective), ADV (adverb), Vt (transitive verb), Vin (intransitive verb), AUX (auxiliary verb), PPN (possessive pronoun), NPPN (non-possessive pronoun), CJ (conjunction), INT (interjection), ABR (abbreviation), and PUN (punctuation). The remaining six, however, differ from English: PNN (pro-numeral), PADJ (pro-adjective), PADV (pro-adverb), PV (pro-verb), MM (modal morph), and MW (modal word).

The part-of-speech tags which we have designed differ from previous classifications in naming and content, because our tags are oriented solely toward language-processing research. When we designed the tags, we researched other Altaic languages, such as varieties of Inner Mongolian, Turkish, Japanese, and Korean (see Appendix C), as well as other languages such as English and Nepali. We mainly used and studied books concentrating on the Mongolian language, and also books on language research and study, grammar books, linguistic and non-linguistic dictionaries, research articles, and books on Mongolian syntax and modality.

The Mongolian part-of-speech tags have been revised and shortened several times. As a result, the 20 tags that were designed generally match those proposed for other languages. This may be taken as evidence that the basic Mongolian tags were designed correctly. To check the tagset, we also tagged a text of around 1,500 words and classified these words according to their POS tags (see Appendix A).

NOTE: The files, part of speech and summary of part of speech, are attached.


References:

1. Badzarragchaa M.Монгол хэлний өгүүлбэр зүй (Syntax of Mongolian language), Ulaanbaatar, 2005.

2. Byambasan P. “Орчин цагийн монгол хэлний үгсийг аймаглах тухай асуудалд”, Шинжлэх ухаан амьдрал (“On the question of the classification of the parts of speech”, Science and life), № 2, Ulaanbaatar, 1989.

3. Byambasan P. et al. Орчин цагийн монгол хэлний үг зүйн байгуулалт (Morphological structure of modern Mongolian), Ulaanbaatar, 1987.

4. Išdorj M. Монгол хэлний дүрэм (Rules of Mongolian language), Ulaanbaatar, 1930.

5. John Charles Street. Khalkha Structure, Bloomington, 1963.

6. Luwsanwandan Sh. Монгол хэлний зүй (Mongolian grammar), Ulaanbaatar, 1939.

7. Luwsanwandan Sh. Орчин цагийн монгол хэл (Modern Mongolian language), Beejin [Beijing], 1961.

8. Luwsanwandan Sh. “Монгол хэлний үгсийг аймаглах тухай асуудалд”, Хэл зохиол судлал (“On the question of the classification of the parts of speech”, Studies of language and literature), Ulaanbaatar, 1968.

9. Luwsanwandan Sh. Орчин цагийн монгол хэлний бүтэц (Modern Mongolian structure), Ulaanbaatar, 1999.

10. Mönkh-Amgalan Yu. Монгол хэлний баймжийн ай (Modal category of Mongolian language), Ulaanbaatar, 1998.

11. Nadmid J, Janchiwdorj Ts, Ragchaa B. Монгол хэлний зүй (Mongolian grammar), Ulaanbaatar, 1960.

12. Önörbayan Ts. Орчин цагийн монгол хэлний үг зүй (Morphology of modern Mongolian), Ulaanbaatar, 2004.

13. Oxford, Monsudar English-Mongolian Dictionary, First edition, Ulaanbaatar, 2006.

14. Oyuuntsetseg J. Англи хэлний зүй (English grammar), Ulaanbaatar, 2005.

15. Rita Kullmann, Tserenpil D. Монгол хэлний зүй (Mongolian grammar), Hong Kong, 1996.

16. Poppe N. N. “О частях речи в монгольском языке”, Советское востоковедение (“On the parts of speech in the Mongolian language”, Soviet oriental studies), М, 1940.

17. Poppe N. N. Khalkha-Mongolisch grammatic, Wiesbaden, 1951.

18. Poppe N. N. Buriad grammar, 1960.

19. Pürew-Ochir B. Орчин цагийн монгол хэлний өгүүлбэр зүй (Syntax of modern Mongolian), Ulaanbaatar, 1997.

20. Sanžeev G. D. Грамматика бурят-монгольского языка (Grammar of Buriad - Mongolian), М-Л, 1941.

21. Sanžeev G. D, Bertagaev T. A, Cïdėndambaev C. B. Грамматика бурятского языка. Фонетика и морфология (Buriad grammar. Phonetics and morphology), М, 1962.


22. Šmidt Ya. Грамматика Монгольского языка (Mongolian grammar), СПБ, 1832.

23. Todaeva B. X. Грамматика современного монгольского языка. Фонетика и морфология (Grammar of modern Mongolian. Phonetics and morphology), М, 1951.

24. Tömörtogoo D. Хэлшинжилэлийн нэр томъёоны хураангуй толь (A concise dictionary of the linguistic terms), Ulaanbaatar, 2004.

25. Enkhbat D. Хэл шинжлэлийн англи-орос-монгол толь бичиг (English - Russian - Mongolian linguistic terms), Ulaanbaatar, 2003.


Appendix A

Excerpt from the list of words tagged NC in an editorial article.
<word id="1">ярианаас</word>
<word id="2">явцад</word>
<word id="3">явдлаас</word>
. . .
<word id="323">аж ахуйг</word>
<word id="324">аж амьдрал</word>

Excerpt from the list of words tagged NP in an editorial article.
<word id="1">Ардын хянан шалгах хороодын</word>
<word id="2">Ардьн депутатуудын хурал</word>
<word id="3">Батмөнх</word>
. . .
<word id="10">Хурлыг</word>
<word id="11">Хурлын</word>

Excerpt from the list of words tagged NN in an editorial article.
<word id="1">V</word>
<word id="2">Дөчөөд</word>
<word id="3">нэг</word>
<word id="4">нэгд</word>
<word id="5">таван</word>
<word id="6">тэргүүнээ</word>
<word id="7">хоёр</word>

Excerpt from the list of words tagged Vin in an editorial article.
<word id="1">ажиллагч</word>
<word id="2">ажилладаг</word>
<word id="3">ажиллаж</word>
. . .
<word id="53">шинэчлэгдэж</word>
<word id="54">эхэлсэн</word>

Excerpt from the list of words tagged Vt in an editorial article.
<word id="1">авах</word>
<word id="2">авснаас</word>
<word id="3">авч</word>
. . .
<word id="155">эхлэх нь </word>
<word id="156">явуулдаг</word>

Excerpt from the list of words tagged ADJ in an editorial article.
<word id="1" pos="ADJ">ажил хэрэгч</word>
<word id="2" pos="ADJ">ажил хэрэгчээр</word>
<word id="3" pos="ADJ">албан</word>
. . .
<word id="83" pos="ADJ">эрхтэй</word>
<word id="84" pos="ADJ">язгуур</word>


Excerpt from the list of words tagged MW, taken from an editorial article:
<word id="1">ер нь</word>
<word id="2">ёстой</word>
<word id="3">жинхэнэ ёсоор</word>
<word id="4">зайлшгүй</word>
<word id="5">Зөвхөн</word>
<word id="6">зүйтэй</word>
<word id="7">ихзвчлэн</word>
<word id="8">нөгөө талаар</word>
<word id="9">нэг ёсондоо</word>
<word id="10">товчоор хэлбэл </word>
<word id="11">төдий</word>
<word id="12">тухайлбал</word>
<word id="13">тэр дундаа</word>
<word id="14">улмаар</word>
<word id="15">үүнээс болж</word>
<word id="16">учиртай</word>
<word id="17">харин</word>
<word id="18">хэрэгтэй</word>
<word id="19">Чухамхүү</word>
<word id="20">шаардлагатай</word>
<word id="21">юуны өмнө</word>
<word id="22">ялангуяа</word>

Excerpt from the list of words tagged PADJ, taken from an editorial article:
<word id="1">Ийм</word>
<word id="2">уг</word>
<word id="3">энэ</word>
<word id="4">Энэ</word>

Excerpt from the list of words tagged PADV, taken from an editorial article:
<word id="1">тэндхийн</word>
<word id="2">энд</word>
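The word lists in this appendix use a simple `<word id="…">` markup, with an optional `pos` attribute on the ADJ entries. As a minimal sketch of how such a list could be read, the Python snippet below parses a few entries copied from the lists above. The enclosing `<words>` root element, the function name `read_words` and the `default_pos` parameter are assumptions made for illustration; the report shows only the `<word>` items themselves.

```python
import xml.etree.ElementTree as ET

# A few entries copied from the ADJ and MW lists in Appendix A.
# The <words> root element is an assumption: the report prints only
# the individual <word> items, not the element that encloses them.
SAMPLE = """<words>
  <word id="1" pos="ADJ">ажил хэрэгч</word>
  <word id="3" pos="ADJ">албан</word>
  <word id="10">товчоор хэлбэл </word>
</words>"""


def read_words(xml_text, default_pos=None):
    """Return (id, pos, text) tuples for every <word> element.

    Entries without a pos attribute receive default_pos, for example
    the tag named in the heading of the list they were taken from.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.iter("word"):
        text = (element.text or "").strip()  # drop stray trailing spaces
        rows.append((int(element.get("id")), element.get("pos", default_pos), text))
    return rows


words = read_words(SAMPLE, default_pos="MW")
for word_id, pos, text in words:
    print(word_id, pos, text)
```

Stripping the element text removes the trailing space that some entries carry, which keeps later dictionary lookups from missing otherwise identical word forms.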


Appendix B

First tagset for Mongolian

Columns: Category, Sub-Category, POS name, Tag, Example. Categories and sub-categories are given as group headings together with their own tags; each row under a heading lists POS name | Tag | Example.

Category: Нэр үг (Noun), N

Sub-category: Оноосон нэр (Proper noun), NP
хүний овог нэр (anthroponymy) | NPa | Зундуйн Энхбаяр
газар орны нэр (toponymy) | NPt | Чингэлтэй, Булган
усны нэр (hydronymy) | NPh | Сэлэнгэ, Хөвсгөл
амьтны нэр (zoonymy) | NPz | Банхар, Си си
албан байгууллагын нэр (name of the organization) | NPno | Залуучууд, Хатагтай
ном, сонин, сэтгүүлийн нэр (name of the book) | NPnb | Өнөөдөр, Гялбаа, Тунгалаг Тамир
шинэ нэрлэлт /нээлт, хамтлаг бүтээгдэхүүн, чуулга/ (new nomination) | NPnn | Чингис хаан, Жигмэд Тогмид
од гариг (cosmonymy) | NPc | Буд, Ангараг

Sub-category: Ерийн нэр (Common noun), NC
хүн (human) | NCh | хаан, ерөнхийлөгч
адгуус (animals) | NCa | баавгай, хонь
ургамал (vegetation) | NCv | мод, навч
хүний бүтээсэн зүйл (people made) | NCpm | онгоц, бичиг, гэр
адгуусны бүтээсэн зүйл (animals made) | NCam | нүх, үүр
эс бүтээсэн зүйл (unmade) | NCin | тэнгэр, цөл, чулуу, нулимс, хумхи
сэтгэхүй (thinking) | NCth | бодол, таамаг, дуртгал
мэдрэхүй (feeling) | NCf | айдас, баяр, гуниг
зөн билэг (presentiment) | NCp | зүүд, зөн, заяа
хий үзэгдэл (hallucination) | NCha | буг, чөтгөр

Sub-category: Орны нэр (Local noun), NL
Орших орон (Locative) | NLL | наана, цаана, дээр, доор
Хөдлөх орон (Movement) | NLM | наагуур, цаагуур, дээгүүр, доогуур, нааш, арагш, дээш

Sub-category: Цагийн нэр (Tense), NT
Өнгөрсөн (Past tense) | NTpa | урьд, эрт, эдүгээ
Одоо (Present tense) | NTpr | одоо, өнөө, эдүгээ
Ирээдүй (Future tense) | NTf | хожим, дараа, маргааш, удахгүй

Sub-category: Тооны нэр (Number), NN
Үндсэн (Cardinal) | NNca | нэг, арав
Дэс (Ordinal) | NNo | нэг дэхи, аравдугаар
Хам (Collective) | NNco | арвуул
Тойм (Approximate) | NNa | арваад
Түгээх (Percentage) | NNp | арваад, орчим, гаруй
Дахих (Enumerative) | NNe | арав дахин, арван удаа
Бутархай (Fraction) | NNf | аравны нэг

Sub-category: Тэмдэг нэр (Adjective), NA
Шинж (indication) | NAi | сайхан, муухай
Чанар (quality) | NAq | сайн муу, мэргэн
Хэлбэр дүрс (shape) | NAsh | гурвалжин, хурц
Өнгө (colour) | NAc | хар, шаравтар
Зүс (colour of the animal) | NAca | алаг, тарлан, зээрд
Орон зай (space) | NAs | алс хол, ойр дөт
Цаг хугацаа (time) | NAt |

Sub-category: Төлөөний нэр (Pronoun) / Мон. Төлөөний үг (Pro-word), NPN

Нэр (Noun):
асуух (interrogative) | NPNNint | хэн, юу, хэд, ямар?
биеийн (personal) | NPNNp | би, чи, та
заах (demonstrative) | NPNNd | энэ, өнөө, уг, тус
тодорхойгүй (indefinite) | NPNNind | хэн ч, юу ч, аливаа
ялгах (disjunctive) | NPNNd | бүх, цөм, нийт
өөрийн (reflexive) | NPNNr | өөрөө, өөрсдөө

Үйл (Verb):
асуух (interrogative) | NPNVint | яа-, хэрх-
заах (demonstrative) | NPNVd | ингэ-, тэг-, чингэ-

Category: Үйл үг (Verb)

Sub-category: Төгс хувилах (perfect inflective)
тусах (transitive) | Vt | идэх, өмсөх, барих
эс тусах (intransitive) | Vin | унтах, босох, очих

Sub-category: Дутмаг хувилах (defective verb)
Дутмаг хувилах (defective verb) | VD | а-, бө-

Category: Холбох үг (Connective word)

Sub-category: Зэрэгцүүлэх (coordinate)
гишүүдийг зэрэгцүүлэн холбох (coordinating sentence members) | CWCM | ба, буюу, болон
өгүүлбэрийг зэрэгцүүлэн холбох (coordinating sentences) | CWCS | гэвч, гэтэл, хэрэв

Sub-category: Угсруулах (subordinate)
гишүүдийг угсруулан холбох (subordinating sentence members) | CWSM | орчим, гаруй, дахин
өгүүлбэрийг угсруулан холбох (subordinating sentences) | CWSS | тулд, төлөө, учир

Category: Чимэх үг (Particle)

Sub-category: Угтан чимэх үг (Prefix particle)
гишүүнийг угтан чимэх (preceding a sentence member) | PPreM | маш, нэн, тун, хага
өгүүлбэрийг угтан чимэх (preceding a sentence) | PPreS | ер нь, ялангуяа, чухамдаа

Sub-category: Даган чимэх үг (Post particle)
гишүүнийг даган чимэх (following a sentence member) | PPosM | ч, л, биш
өгүүлбэрийг даган чимэх (following a sentence) | PPosS | шүү, даа, даг, уу, бэ, хэрэгтэй

Category: Гадаад үг (Foreign word), FW
Гадаад үг (foreign word) | FW | тариф, пальто, клуб, ресторан

Category: Аялга үг (Interjection), I
мэдрэх (feel) | IF | ёо ёо, ай, тий тий
зөвших /батлах/ (prove) | IP | мөн, за, тийм
илэрхийлэх (express) | IE | ашгүй, хөөрхий, чааваас, гялай
харилцах (communicate) | IC | Май, бүүвэй, еэ еэ бөө вөө, өөв, сөөг
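The tagset is hierarchical: a fine-grained tag such as NPa sits under the sub-category NP, which in turn sits under the category N. A minimal sketch of how that hierarchy could be stored and used to collapse fine-grained tags into coarser ones is given below. The table `PARENT` and the functions `ancestors` and `coarsen` are hypothetical names introduced here, and only a subset of the noun tags is listed. Parent links are stored explicitly rather than derived from tag prefixes, because prefixes alone are ambiguous (NPNNint begins with NP but belongs to NPN).

```python
# Explicit parent links for part of the noun branch of the tagset.
# This is an illustrative subset, not the complete table.
PARENT = {
    "NP": "N", "NC": "N", "NL": "N", "NT": "N", "NN": "N",
    "NPa": "NP", "NPt": "NP", "NPh": "NP", "NPz": "NP",
    "NCh": "NC", "NCa": "NC", "NCv": "NC",
    "NLL": "NL", "NLM": "NL",
    "NTpa": "NT", "NTpr": "NT", "NTf": "NT",
    "NNca": "NN", "NNo": "NN", "NNf": "NN",
}


def ancestors(tag):
    """Return the chain from a tag up to its top-level category."""
    chain = [tag]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def coarsen(tag, depth):
    """Map a tag to its ancestor at the given depth (0 = category).

    Tags that are shallower than the requested depth, or that are not
    in the table at all, are returned unchanged.
    """
    chain = ancestors(tag)[::-1]  # category first, finest tag last
    return chain[min(depth, len(chain) - 1)]


print(ancestors("NPa"))     # chain from NPa up to N
print(coarsen("NTpr", 1))   # sub-category level
```

Collapsing to depth 1 gives sub-category tags such as NP or NT, which is one way to compare annotations made at different levels of detail.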


Appendix C

Comparison of the parts of speech of Mongolian with those of Inner Mongolian, Turkish, Japanese and Korean

Columns: Part of Speech, Mongolian, Inner Mongolian, Turkish, Japanese, Korean

1. Noun + + + + +
2. Proper noun + + + + +
3. Anthroponymy + + +
4. Toponym + + +
5. Hydronymy +
6. Zoonymy +
7. Name of the organization + + +
8. Name of the book +
9. New nomination +
10. Cosmonymy +
11. Terminology +
12. Countable +
13. Uncountable +
14. Dependent noun +
15. Expression before numerals +
16. Adnominal + +
17. Common noun + + +
18. Human +
19. Animals +
20. Vegetation +
21. People made +
22. Animals made +
23. Unmade +
24. Thinking +
25. Feeling +
26. Presentiment +
27. Hallucination +
28. After+noun from adjective +
29. After+noun from noun +
30. Diminutive +
31. Pronoun + + + + +
32. Interrogative + + + +
33. Personal + + + +
34. Demonstrative + + + +
35. Indefinite + + +


36. Disjunctive +
37. Reflexive + + +
38. Quantity + +
39. Numeral + + + + +
40. Cardinal + + +
41. Ordinal + + + +
42. Collective + + +
43. Approximate + +
44. Percentage + +
45. Enumerative + +
46. Fraction + +
47. Dissatisfy +
48. Real +
49. Time +
50. Unit +
51. Adjective + + + + +
52. Determination +
53. Past +
54. Present +
55. Future +
56. Indication +
57. Quality + +
58. Shape +
59. Colour +
60. Colour of animal +
61. Space +
62. Time +
63. Adnominal adjective +
64. Adjectival noun followed by “na” or “zero” +
65. Adjectival noun followed by “na”, “no”, “taru”, “zero” +
66. Classical inflection of adjectival noun is NARI and TARI +
67. Comparative +
68. Distinguish +
69. After+ADJ derived from nouns /With/ +
70. After+ADJ derived from nouns /Without/ +
71. After+ADJ derived from nouns /SuitableFor/ +
72. After+ADJ derived from nouns /InBetween/ +
73. After+ADJ derived from nouns /Relation/ +

74. Local noun +
75. Locative local noun +
76. Movement local noun +
77. Tense + +
78. Past tense +
79. Present tense +
80. Future tense +
81. Tense and local noun +
82. For characteristic of adjectives +
83. For characteristic of adverbs +
84. For characteristic of nouns +
85. Measurement +
86. Noun measurement +
87. Verbal measurement +
88. Foreign word + +
89. Verb + + + + +
90. Transitive verb + + +
91. Intransitive verb + + +
92. Transitive and intransitive verb +
93. Defective verb +
94. Pro-verb +
95. Connective verb + +
96. Vowel-Stem verb +
97. 2-DAN verb in classical Japanese +
98. 4-DAN verb in classical Japanese +
99. Consonant-Stem verb +
100. Verbal noun +
101. SURU-Verb
102. Irregular verb +
103. Continuative (only in ADJACE.DBF) +
104. Auxiliary verb + +
105. After+Become derived from nouns or adjective “become” +
106. After+“acquire” derived from nouns or adjective +
107. Copula + + +
108. Conjunction + + +


109. Interjection + + + +
110. Feels interjection +
111. Proves interjection +
112. Expresses interjection +
113. Communicates interjection +
114. Suol interjection +
115. Treating interjection +
116. Treat words +
117. Exclamation +
118. Modal word +
119. Modal morph +
120. Question +
121. Particle + + +
122. Interrogative particle +
123. Affirmative particle +
124. Negative particle +
125. Emphatic particle +
126. Imagine particle +
127. Specify particle +
128. Possessive +
129. Particles same as phrase +
130. Particles same as words meaning +
131. Postposition + +
132. Particle postposition +
133. Connective postposition +
134. Auxiliary postposition +
135. Liken postposition +
136. Tense postposition +
137. Standard postposition +
138. Cause postposition +
139. Infinitive postposition +
140. Adverb + + + +
141. Standard adverb +
142. Tense adverb +
143. Mood adverb +
144. Pre-Verb adverb +
145. Pre-Sentence adverb +
146. Conjunctive adverb +
147. After+ADV derived from verb /“AfterDoingSo”/ +
148. After+ADV derived from verb /“SinceDoingSo”/ +

149. After+ADV derived from verb /“As”/ +
150. After+ADV derived from verb /“When”/ +
151. After+ADV derived from verb /“ByDoingSo”/ +
152. After+ADV derived from verb /“While”/ +
153. After+ADV derived from verb /“WithoutHavingDoneSo”/ +
154. After+ADV “ly” derived from verb +
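Each entry of the comparison consists of a running number, a category name and one + for every language that has the category. As a rough sketch, assuming one entry per line in that form, the snippet below splits an entry into its parts and counts the + marks. The names `ROW` and `parse_row` are hypothetical, and the count says only how many of the five languages share a category, not which ones.

```python
import re

# number, dot, category name, then any run of "+" marks at the end.
# The name is matched lazily so that a "+" inside a name such as
# "After+noun from noun" is not mistaken for a language mark.
ROW = re.compile(r"^(\d+)\.\s+(.*?)((?:\s*\+)*)\s*$")


def parse_row(line):
    """Split one comparison entry into (number, name, count of + marks)."""
    match = ROW.match(line)
    if match is None:
        raise ValueError("not a comparison entry: " + repr(line))
    number, name, marks = match.groups()
    return int(number), name.strip(), marks.count("+")


entries = [
    "1. Noun + + + + +",
    "29. After+noun from noun +",
    "101. SURU-Verb",
]
for entry in entries:
    print(parse_row(entry))
```

A category marked for all five languages yields a count of 5, while an entry with no marks, as for SURU-Verb, yields 0.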
