© ISO 2008 – All rights reserved

ISO TC 37/SC 4 N 482

Date:   2008-06-24

ISO/CD 24614-1

ISO TC 37/SC 4/WG 2

Secretariat:   KATS

Language resource management — Word segmentation of written texts for mono-lingual and multi-lingual information processing — Part 1: Basic concepts and general principles

Gestion des ressources des langues — Segmentation des mots dans textes uni-lingues et multi-lingues écrits — Partie 1: Notions fondamentales et principes généraux

Warning

This document is not an ISO International Standard. It is distributed for review and comment. It is subject to change without notice and may not be referred to as an International Standard.

Recipients of this draft are invited to submit, with their comments, notification of any relevant patent rights of which they are aware and to provide supporting documentation.

Document type: International Standard
Document subtype:
Document stage: (30) Committee
Document language: E

Copyright notice

This ISO document is a working draft or committee draft and is copyright-protected by ISO. While the reproduction of working drafts or committee drafts in any form for use by participants in the ISO standards development process is permitted without prior permission from ISO, neither this document nor any extract from it may be reproduced, stored or transmitted in any form for any other purpose without prior written permission from ISO.

Requests for permission to reproduce this document for the purpose of selling it should be addressed as shown below or to ISO's member body in the country of the requester:

[Indicate the full address, telephone number, fax number, telex number, and electronic mail address, as appropriate, of the Copyright Manager of the ISO member body responsible for the secretariat of the TC or SC within the framework of which the working document has been prepared.]

Reproduction for sales purposes may be subject to royalty payments or a licensing agreement.

Violators may be prosecuted.

Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms and definitions
4 Basic framework of word segmentation
4.1 Essential concept systems related to word segmentation
4.2 Metamodel of word segmentation
5 General principles in word segmentation
5.1 The universal principle of morphology
5.2 Principles for validating a word
5.2.1 Principles from the linguistic perspective
5.2.2 Principles from the practical (pragmatic) perspective
5.3 The full entry principle of the lexicon
5.4 Principles for word segmentation result
5.5 Principle of full coverage and consistency in applying this standard to text
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 24614-1 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content resources, Subcommittee SC 4, Language resource management.

This second/third/... edition cancels and replaces the first/second/... edition (), [clause(s) / subclause(s) / table(s) / figure(s) / annex(es)] of which [has / have] been technically revised.

ISO 24614 consists of the following parts, under the general title Language resource management — Word segmentation of written texts for mono-lingual and multi-lingual information processing:

Part 1: Basic concepts and general principles

Part 2: Word segmentation for Chinese, Japanese and Korean

Part 3: Word segmentation for other languages

Introduction

Word segmentation remains a challenging technology in natural language processing. The task becomes much more complicated for any natural language in which word boundaries in written text cannot be fully identified by typographic properties (such as spaces in English), for example Chinese, Japanese, Korean, Thai, Vietnamese, and Mongolian.

In practice, there is great concern about what the right outcome of word segmentation applied to a text should be. Standards are needed to pursue consistency in word segmentation within and among texts to the maximum extent, so as to meet the requirements of a variety of applications in language information processing, both mono-lingual and multi-lingual. The applications of such standards include, but are not limited to, natural language processing, information retrieval, search engines, question answering, machine translation and machine-aided translation, pre-processing for text-to-speech, post-processing for speech recognition, OCR and other character input methods, proofreading, digital libraries, terminology and ontology, the semantic web, eBusiness and eCommerce, content management, and natural-language-based computer-aided eLearning (including language learning and second language learning). They will also be helpful for orthographic processing (Romanization) of text in some languages such as Chinese.

COMMITTEE DRAFT ISO/CD 24614-1

Language resource management — Word segmentation of written texts for mono-lingual and multi-lingual information processing — Part 1: Basic concepts and general principles

1 Scope

This standard is Part 1 of a series of ISO standards on word segmentation for written languages in which word boundaries cannot be fully identified by typographic properties (such as spaces in English). Such languages include, but are not limited to, Chinese, Japanese, Korean, Thai, Vietnamese, and Mongolian.

This part emphasizes the basic concepts and general principles of word segmentation. It does not specify word segmentation algorithms, although most of the factors relevant to them are touched on here. For example, the lexicon is specified in this document, and it is also a necessary factor in algorithm design and implementation.

In real applications, particularly when representing lexical items in lexicons and word segmentation results in sentences, this standard should be used in close conjunction with ISO 24613 Language resource management - Lexical markup framework (LMF) [at FDIS stage], ISO 16642:2003 Terminology Markup Framework, ISO 12620 Terminology and other language resources ― Data categories for electronic lexical resources (DCR) and ISO 24611 Language resource management -- Morpho-syntactic annotation framework [at DIS stage].

2 Normative references

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 1087-1:2000, Terminology – Vocabulary – Part 1: Theory and application

ISO 1087-2:1999, Terminology – Vocabulary – Part 2: Computer application.

ISO/IEC 11179-3:2003, Information Technology – Data management and interchange – Metadata Registries (MDR) – Part 3: Registry Metamodel (MDR3)

ISO FDIS 24613 Language resource management — Lexical markup framework (LMF)

ISO 12620, Terminology and other language resources ― Data categories for electronic lexical resources (DCR)

ISO DIS 24611 Language resource management -- Morpho-syntactic annotation framework

3 Terms and definitions

For the purposes of this document, the terms and definitions given in ISO 1087-1 and ISO 1087-2, as well as the following terms and definitions, apply.

3.1
morphology
study of the structure and formation of words

NOTE There are two sub-types of morphology: lexical morphology (3.17) and inflectional morphology (3.18).

(cf. ISO 24613 FDIS 3.31)

3.2
word
basic grammatical unit and relatively independent carrier of meaning of a language that can stand alone to make up sentences

NOTE A word, in general, is intuitively and mentally available to native speakers, with acoustic and semantic identity, morphological stability, and syntactic mobility. It is codified in the lexicon (3.5), with at least a part of speech (3.16). A word may consist of a single morpheme (3.8) or a combination of morphemes (3.8). In this standard, word is used to refer to both lexeme (3.3) and word form (3.4).

(cf. ISO 24611 DIS 3.36)

3.3
lexeme
abstract unit generally associated with a set of word forms (3.4) sharing a common meaning

NOTE A lexeme may be a part of another lexeme, as a consequence of derivation (3.25) and compounding (3.28). In this standard, lexeme is defined broadly to include not only words (3.2) as precisely defined in linguistics, but also multiword expressions (3.32), some of which may not be regarded as words from a linguistic point of view (e.g. phrasal compounds (3.31), idioms, proverbs and familiar quotations).

(cf. ISO 24613 FDIS 3.24) (cf. ISO 24611 DIS 3.17)

3.4
word form
form that a lexeme (3.3) takes when used in a sentence

EXAMPLE (English) find, found, and finding are word forms of the lexeme (3.3) FIND (in this document, lexemes are generally distinguished by the use of capital letters).

(cf. ISO 24613 FDIS 3.46) (cf. ISO 24611 DIS 3.30)

3.5
lexicon
resource containing lexical items (3.6) for a language

(cf. ISO 24613 FDIS 3.27) (cf. ISO 24611 DIS 3.18)

3.6
lexical item
linguistic unit that basically refers to a lemma (3.7) but can also refer to a word form (3.4), a morpheme (3.8) or any other item included in the lexicon (3.5) for the purpose of information processing

3.7
lemma
lemmatised form
canonical form
citation form of a lexeme (3.3) in the lexicon (3.5)

NOTE A lemma is the word form (3.4) of a lexeme (3.3) that is conventionally chosen to represent that lexeme. Where applicable, the lemma of a lexical item (3.6) can be derived from the lemmas of its constituent elements.

EXAMPLE (English) find is the citation form, or lemma, of the lexeme (3.3) FIND. The lemma of kicked the bucket is kick the bucket.

(cf. ISO 24613 FDIS 3.23) (cf. ISO 24611 DIS 3.16)

3.8
morpheme
abstract unit of a language that is used, as a basic phonological and smallest meaningful element, to constitute lexemes (3.3)

NOTE There are two sub-types of morpheme: free morpheme (3.9) and bound morpheme (3.10).

(cf. ISO 24611 DIS 3.19)

3.9
free morpheme
morpheme (3.8) that can stand by itself

EXAMPLE (English) BOY; (Chinese) 猪 (pig).

3.10
bound morpheme
morpheme (3.8) that appears only together with one or several other morphemes (3.8)

EXAMPLE (Chinese) 伟: as a character (3.38) it means great, but it cannot stand by itself as a word (3.2) in text. Instead, it is used as a constituent element of many words, such as 伟大 (great), 伟人 (giant) and 雄伟 (majesty).

3.11
morph
form that is used to represent a morpheme (3.8) phonetically and phonologically

EXAMPLE (English) The morphs of the plural morpheme (3.8) ‘s’ are -s, -en, and -Φ (as in boys, oxen, and sheep). Thus, the word (3.2) boys consists of two morphs, boy and -s, whereas the morphemes corresponding to boy and -s are BOY and ‘s’ respectively.

(cf. ISO 24613 FDIS 3.30)

3.12
realization
linguistic process in which abstract entities are realized by entities which have a form

NOTE A lexeme (3.3) is realized by word forms (3.4), and a morpheme (3.8) is realized by morphs (3.11). During realization, an abstract entity may undergo some degree of variation or transformation depending on its linguistic context.

3.13
lemmatization
process of determining the lemma (3.7) for a given word form (3.4) in a sentence, usually accompanied by determining its part of speech (3.16)

EXAMPLE (English) The lemma (3.7) for finding is determined as find, and that for kicked the bucket is determined as kick the bucket by the process of lemmatization.

3.14
root
portion of a word (3.2) which remains when all inflectional affixes (3.22) and derivational affixes (3.23) have been removed

EXAMPLE (English) The root of the word (3.2) destabilized is stabil-, obtained by removing the derivational affixes (3.23) de- and -ize as well as the inflectional suffix -(e)d.

3.15
stem
portion of a word (3.2) which remains when all inflectional affixes (3.22) have been removed

NOTE A stem consists minimally of a root (3.14), but may be analyzable into one or more roots together with the associated derivational affixes (3.23). If a stem does not occur by itself in a meaningful way in a language, it is referred to as a bound morpheme (3.10).

EXAMPLE (English) The stem of the word (3.2) destabilized is destabilize, obtained by removing the inflectional suffix -(e)d.

(cf. ISO 24613 FDIS 3.39)

3.16
part of speech
lexical category
word class
category assigned to a lexeme (3.3) based on its grammatical properties

(cf. ISO 24611 DIS 3.27)

3.17
lexical morphology
branch of morphology (3.1) that deals with word formation (3.19)

3.18
inflectional morphology
branch of morphology (3.1) that deals with inflection (3.20)

3.19
word formation
process of creating or building words (3.2) in a language

NOTE The great majority of word formation can be subsumed under the processes of derivation (3.25), compounding (3.28), abbreviation (3.33) and borrowing (3.34). Word formation is a lexical process.

3.20
inflection
process in which a word form (3.4) is formed by adding an inflectional affix (3.22) to a stem (3.15) or a lexeme (3.3)

NOTE Inflection is a grammatical process, rather than a lexical process.

(cf. ISO 24613 FDIS 3.20) (cf. ISO 24611 DIS 3.12)

3.21
affix
bound morpheme (3.10) which may be added to a stem (3.15) or a lexeme (3.3)

NOTE Affixes can be classified into three main sub-types according to their placement on the stem (3.15) or lexeme (3.3): prefix, suffix and infix. Affixes also fall into two categories, inflectional and derivational.

(cf. ISO 24613 FDIS 3.3)

3.22
inflectional affix
affix (3.21) that can produce word forms (3.4) of a lexeme (3.3)

3.23
derivational affix
affix (3.21) that can produce a new lexeme (3.3) from an existing one

3.24
affixation
process in which an affix (3.21) is added to a stem (3.15) or lexeme (3.3)

NOTE There are three main sub-types of affixation: prefixation, suffixation and infixation. Affixation can be inflectional, derivational, or a mixture of both (e.g. in agglutinative languages).

(cf. ISO 24613 FDIS 3.4)

3.25
derivation
process of word formation (3.19) in which a derivational affix (3.23) is added to a stem (3.15) or a lexeme (3.3) to create a new lexeme (3.3)

NOTE Derivation is a lexical process.

(cf. ISO 24613 FDIS 3.12)

3.26
conversion
zero derivation
process of word formation (3.19) in which a lexeme (3.3) is created from an existing lexeme (3.3) without any change in form, but often with a change in part of speech (3.16)

3.27
reduplication
process in which the entire word (3.2), or part of it, is repeated

NOTE Reduplication is used both in inflection (3.20), to convey a grammatical function, and in lexical derivation (3.25), to create new words. The position of reduplication may be initial, final, or internal. In some cases it can be viewed as a special way of forming affixes (3.21), both inflectional and derivational.

3.28
composition
compounding
process of word formation (3.19) in which new lexemes (3.3) are formed by adjoining at least two lexemes (3.3)

NOTE Composition is a lexical process. It should not be confused with derivation (3.25), where a bound morpheme (3.10) is added to a stem (3.15) or a lexeme (3.3).

(cf. ISO 24613 FDIS 3.9)

3.29
compound
lexeme (3.3) resulting from the process of composition (3.28)

NOTE A compound is endocentric if it has a head, i.e. the fundamental part that contains the basic meaning of the whole compound, together with modifiers that restrict this meaning. It is exocentric if it does not have a head. A compound can be rather long. There are two main sub-types of compound according to their degree of lexicalization (3.53): word-compound (3.30) and phrasal compound (3.31).

(cf. ISO 24613 FDIS 3.10)

3.30
word-compound
compound (3.29) whose overall meaning is often not predictable from its constituent elements

NOTE A word-compound is a sub-type of word (3.2) as strictly defined in linguistics.

3.31
phrasal compound
compound (3.29) used steadily and frequently in a language, although its overall meaning is predictable from its constituent elements

NOTE Some linguists may regard phrasal compounds as phrases. In practice, there is no clear-cut boundary between word-compound (3.30) and phrasal compound, nor between phrasal compound and phrase, because semantic predictability and the degree of lexicalization (3.53) are fuzzy, although the boundary could be clear in theory. Lexico-statistics (3.51), word frequency (3.52) in particular, play an important role in this respect.

EXAMPLE (English) Apple pie is a phrasal compound composed of two words, apple and pie. (Chinese) 猪肉 (pork) is a phrasal compound composed of two single-character words, 猪 (pig) and 肉 (meat).

3.32
multiword expression
MWE
unit made up of a sequence of two or more words and used steadily and frequently in a language

NOTE A multiword expression can be a compound (3.29) (either a word-compound (3.30) or a phrasal compound (3.31)), an idiom, a fragment of a sentence, or a sentence (e.g. a proverb or a familiar quotation). It is not always possible to assign a part of speech (3.16) to an MWE.

(cf. ISO 24613 FDIS 3.32) (cf. ISO 24611 DIS 3.22)

3.33
abbreviation
process of word formation (3.19) in which a shortened form of a word (3.2), phrase or term which represents its full form is created by omitting words or letters/characters (3.38) from the full form

NOTE Abbreviation is a lexical process.

3.34
borrowing
process of word formation (3.19) in which a linguistic expression is borrowed from one language into another language, usually when no term exists for the new object or concept

NOTE Borrowing is a lexical process.

3.35
loan word
word (3.2) resulting from the process of borrowing (3.34)

3.36
proper noun
proper name
name of a unique entity

EXAMPLE Names of persons, places, and organizations are typical proper nouns.

3.37
grapheme
smallest distinctive unit in a writing system of a language

NOTE There are two major types of writing systems: non-phonological systems (pictographic, ideographic, cuneiform, and logographic) and phonological systems (syllabic and alphabetic). A grapheme often represents a morpheme (3.8) or a whole word (3.2) in non-phonological systems, whereas it represents a phoneme or a syllable (3.39) in phonological systems. In non-phonological systems (e.g. Chinese, Japanese Kanji) and syllabic phonological systems (e.g. Japanese kana, Korean Hangul), a grapheme is conventionally called a character (3.38), or a character component in particular cases (e.g. Korean Hangul), whereas in alphabetic phonological systems (e.g. English) a grapheme is conventionally called a letter. The number of graphemes usually ranges from 20 or 30 to several dozen in phonological systems, and is usually several thousand or more in non-phonological systems.

3.38
character
grapheme (3.37) (including both letters and characters in writing systems), number, space, punctuation mark, or other symbol that can be processed in computers

NOTE The list of characters, or character set, is defined by ISO/IEC 10646.

3.39
syllable
basic phonetic-phonological unit of a word (3.2) or of speech that can be identified intuitively

NOTE Most Chinese characters (3.38) are monosyllabic and monomorphemic.

3.40
type
linguistic unit representing a defined class

3.41
token
occurrence of a type (3.40) in text

NOTE If the class is defined as all the word forms (3.4) of a lexeme (3.3), then the linguistic unit, i.e. the lexeme (3.3), is called a word type in this setting, and all the occurrences of these word forms (3.4) are called word tokens of this lexeme (3.3).

(cf. ISO 24611 DIS 3.28)

3.42
tokenization
process of splitting up a character (3.38) string into a sequence of tokens (3.41) according to the tokens (3.41) defined

NOTE Tokenization is applicable to text in natural language as well as text in artificial language.

(cf. ISO 24611 DIS 3.29)
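The distinction between type (3.40), token (3.41) and tokenization (3.42) can be illustrated with a minimal Python sketch. It is not part of this standard. The whitespace tokenizer only suits space-delimited text, and the `LEXEME_OF` table, the function names and the sample sentence are hypothetical, chosen to mirror the FIND and BOY examples above.

```python
from collections import Counter

# Hypothetical mapping from word forms to lexemes (written in capitals, as in 3.4).
LEXEME_OF = {
    "find": "FIND", "found": "FIND", "finding": "FIND",
    "boy": "BOY", "boys": "BOY",
}

def tokenize(text: str) -> list[str]:
    """Split on whitespace; adequate only for space-delimited text."""
    return text.lower().split()

def count_word_types(tokens: list[str]) -> Counter:
    """Count word tokens per word type (lexeme).

    Word forms missing from LEXEME_OF are treated as their own type.
    """
    return Counter(LEXEME_OF.get(tok, tok.upper()) for tok in tokens)

tokens = tokenize("Boys find what a boy found")
types = count_word_types(tokens)
print(len(tokens), "tokens,", len(types), "types")
```

Here six word tokens fall under four word types, because boys/boy and find/found are occurrences of the lexemes BOY and FIND respectively.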

3.43
word segmentation unit
WSU
unit that is either (1) a lexical item (3.6) in the lexicon (3.5), or (2) a word form (3.4), numeric string, foreign character (3.38) string, bound morpheme (3.10) (including affixes (3.21)), punctuation mark or miscellaneous item that may appear in text

3.44
word segmentation
tokenization (3.42) in which a natural language text is split into a sequence of word segmentation units (3.43)

NOTE If the object of tokenization (3.42) is natural language text, and tokens (3.41) are basically defined as word forms (3.4), then the tokenization (3.42) in this setting would be almost identical to word segmentation.

3.45
word structure
internal structure of a word (3.2) resulting from morphological analysis

NOTE In agglutinative languages, e.g. Korean, Japanese and Turkish, a word may consist of a sequence of morphemes (3.8), with a comparatively high morpheme-per-word ratio, where each affix (3.21) involved, whether derivational or inflectional, typically expresses a particular grammatical meaning in a clear one-to-one way. The structure of a word in these languages can be very complex, with free morphemes (3.9) and separate affixes (3.21) as its constituent elements.

3.46
word segmentation ambiguity
text fragment for which at least two different word sequences over it can be found by string matching with lexical items (3.6) in the lexicon (3.5)

NOTE Word segmentation ambiguities can affect the accuracy of a word segmentation (3.44) program. Resolving them is difficult for computers but fundamentally easy for human annotators. Their resolution is a basic concern in algorithm design for word segmentation (3.44).
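The definition in 3.46 can be made concrete with a minimal Python sketch that enumerates every word sequence obtainable by string matching against a lexicon. It is an illustration only, not an algorithm specified by this standard. The function names and the four-entry `LEXICON` are hypothetical, and the exhaustive recursion is exponential in the worst case, so it suits short fragments only.

```python
# Hypothetical toy lexicon: research, graduate student, life, fate.
LEXICON = {"研究", "研究生", "生命", "命"}

def segmentations(text: str, lexicon: set[str]) -> list[list[str]]:
    """Return every way to split `text` into a sequence of lexicon entries."""
    if not text:
        return [[]]  # one way to segment the empty string: no units
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Extend each segmentation of the remainder with this head.
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results

def is_ambiguous(text: str, lexicon: set[str]) -> bool:
    """True if at least two different word sequences cover `text` (cf. 3.46)."""
    return len(segmentations(text, lexicon)) >= 2

for seq in segmentations("研究生命", LEXICON):
    print(" / ".join(seq))
```

With this lexicon the fragment 研究生命 yields two sequences, 研究 / 生命 and 研究生 / 命, so it counts as a word segmentation ambiguity, whereas 生命 alone yields only one.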

3.47
corpus
collection of text concerning actual language use, usually electronically stored and processed

3.48
representative corpus of a language
corpus (3.47) that is large enough and well balanced enough to depict the whole picture of the language use

3.49
raw corpus
corpus (3.47) without any linguistic processing or annotation

3.50
annotated corpus
corpus (3.47) with linguistic annotation at a certain linguistic level

NOTE Annotated corpora at the level of word segmentation (3.44) are key resources for this standard.

3.51
lexico-statistics
statistics that may aid the quantitative study of morphology (3.1)

NOTE Word frequency (3.52) is one of the most commonly used lexico-statistics.

3.52
frequency
number of times a type (3.40) occurs in a corpus (3.47)

NOTE Word frequencies are established by means of frequency counts on the basis of a corpus (3.47) annotated at the word segmentation (3.44) level. In general, an adequate estimate of frequencies can be derived directly from the annotated representative corpus of the language (3.48). An improved estimate might be achieved by considering a variety of related factors, for example the frequency distribution over different corpora, different genres, and different periods of time. Evidence of this kind can help in selecting lexical items (3.6) for the lexicon (3.5), particularly in quantifying their degree of lexicalization (3.53).
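As a rough illustration of the frequency counts described in the NOTE, the following Python sketch counts units in corpora that are already annotated at the word segmentation level, both per genre and overall. The toy corpora, the genre labels and the function name are hypothetical, and real estimates would need a representative corpus.

```python
from collections import Counter

def word_frequencies(segmented_sentences: list[list[str]]) -> Counter:
    """Count how often each unit occurs in a segmented corpus."""
    counts = Counter()
    for sentence in segmented_sentences:
        counts.update(sentence)
    return counts

# Hypothetical toy corpora; each sentence is a list of word segmentation units.
corpora = {
    "news": [["apple pie", "sales", "rose"], ["apple pie", "recipe"]],
    "fiction": [["she", "baked", "pear pie"]],
}

# Frequency distribution per genre, then pooled over all genres.
per_genre = {genre: word_frequencies(sents) for genre, sents in corpora.items()}
total = sum(per_genre.values(), Counter())

print(total["apple pie"], total["pear pie"])
```

Comparing `per_genre` with `total` gives a simple view of how a unit's frequency is distributed over genres, which is one of the factors the NOTE suggests for improving the estimate.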

3.53
lexicalization
process of creating a word to express a concept

NOTE A possible word is said to be lexicalized if it has become an established word. Possible words may be lexicalized if their meaning is no longer the sum of the meanings of their parts, or if their formation is unproductive. They may also be lexicalized in other ways: phrasal compounds (3.31), for example, are lexicalized even when their formation is quite productive or they lack semantic idiosyncrasy. Across these cases the degree of lexicalization varies from high to low, with full semantic idiosyncrasy at one extreme (high) and phrasal compounds (3.31) at the other (low).

EXAMPLE (English) The degree of lexicalization of honeymoon is high, that of apple pie is medium, and that of pear pie is low.

3.54
unknown word
word that appears in text but is not in the lexicon (3.5)

NOTE Unknown words can significantly affect the accuracy of a word segmentation (3.44) program. Resolving them is difficult for computers but fundamentally easy for human annotators. Their resolution is a basic concern in algorithm design for word segmentation (3.44).
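A minimal Python sketch of the notion in 3.54, assuming the text has already been segmented and contains only words (no punctuation or numeric strings): it lists the units that are absent from the lexicon. The function name, the lexicon and the sample text are hypothetical.

```python
def unknown_words(segmented_text: list[str], lexicon: set[str]) -> list[str]:
    """Return units absent from the lexicon, once each, in first-seen order."""
    seen = set()
    unknown = []
    for unit in segmented_text:
        if unit not in lexicon and unit not in seen:
            seen.add(unit)
            unknown.append(unit)
    return unknown

# Hypothetical lexicon and segmented text; "Zorvex" stands in for an unlisted name.
lexicon = {"the", "boy", "found", "and", "left"}
text = ["the", "boy", "found", "Zorvex", "and", "Zorvex", "left"]
print(unknown_words(text, lexicon))
```

In practice the harder problem is that unknown words are not yet segmented when a program meets them in running text, which is why the NOTE treats them as a concern for algorithm design.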

3.55
homograph
lexical ambiguity in which two lexemes are orthographically identical but have different meanings

3.56
orthography
study of correct spelling according to established usage in a language

(cf. ISO 24613 FDIS 3.34)

3.57
transcription
1) process and result of representing speech sounds in phonetic symbols in a systematic and consistent way

2) process and result of converting speech sounds described in one writing system to an equivalent representation of the same speech sounds described in another writing system

NOTE (transcription, sense 1) Chinese is transcribed according to the Pinyin system, and Japanese according to the romaji system.

(cf. ISO 24613 FDIS 3.43) (cf. ISO 24611 DIS 3.34)

3.58
transliteration
process and result of converting one writing system into another by converting each character (3.38) of the source language into a character (3.38) of the target language

(cf. ISO 24613 FDIS 3.44) (cf. ISO 24611 DIS 3.35)

3.59 romanization
representation of words written in a non-Latin script by means of the Latin alphabet, through either transliteration (3.58) or transcription (3.57)

NOTE Romanization, orthography (3.56) and word segmentation (3.44) are three related issues in languages that need word segmentation (3.44). In general, the correct romanization and the correct orthography (3.56) of a text depend heavily on the correct word segmentation (3.44) of the text.

EXAMPLE (Chinese) Pinyin; (Japanese) romaji.

(cf. ISO 24611 DIS 3.31)


4 Basic framework of word segmentation

4.1 Essential concept systems related to word segmentation

The following four concept systems, among the terms defined in section 3, are critical to word segmentation (3.44), as illustrated in Figures 1, 2, 3 and 4.

Figure 1 — The concept system of abstract and concrete entities in morphology of languages

Figure 2 — The concept system of morphology in languages


Figure 3 — The concept system of multiword expressions (3.32) in languages

Figure 4 — The concept system of WSUs (3.43) in languages

4.2 Metamodel of word segmentation

The following components and resources should be well defined and carefully prepared for performing word segmentation (3.44):

1) A lexicon (3.5) with high coverage of the text

2) A complete affix (3.21) list, including prefixes, suffixes and infixes

3) A complete list of bound morphemes (3.10) other than affixes (3.21)

4) A specification of the morphology (3.1) of the language, stating what the word segmentation (3.44) result should be for language-dependent phenomena, under the principles set out in section 5

5) A representative corpus of the language, to support the quantitative analysis of the lexicon (3.5)


Figure 5 outlines the process of word segmentation (3.44):

Figure 5 — The word segmentation process

As can be seen in Figure 5, the lexicon (3.5) plays a central role in word segmentation (3.44): it serves as the basis and gold standard for keeping word segmentation (3.44) as consistent as possible.

NOTE (1) Two words which are homographic should be kept as two separate entries in the lexicon (3.5). (2) Lexical items (3.6) in the lexicon (3.5) can be pre-annotated with their word structures (3.45), if any.
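This standard does not prescribe a segmentation algorithm. As one illustrative possibility, the sketch below applies forward maximum matching, a common dictionary-based technique that is an assumption here rather than part of the metamodel, to show how a lexicon (3.5) can drive the process outlined in Figure 5. The function name `segment` and the toy lexicon are hypothetical.

```python
def segment(text, lexicon, max_len=None):
    """Greedy forward maximum matching over an unsegmented string.

    At each position the longest lexicon entry is taken; if no entry of
    two or more characters matches, a single character is emitted.
    """
    if max_len is None:
        max_len = max((len(entry) for entry in lexicon), default=1)
    units = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character (possibly an unknown word)
        # Try candidate end positions from the longest span down to two characters.
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        units.append(match)
        i += len(match)
    return units


# Hypothetical toy lexicon; the sentence means "I eat pork".
lexicon = {"我", "吃", "猪", "肉", "猪肉"}
print(segment("我吃猪肉", lexicon))  # ['我', '吃', '猪肉']
```

Because the matching is greedy, the output depends entirely on which entries the lexicon contains, which is one way of seeing why the lexicon is treated as the reference for consistency.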

5 General principles in word segmentation

5.1 The universal principle of morphology

All languages have words and all languages have morphemes (3.8).

NOTE This principle is a foundation for this standard.

5.2 Principles for validating a word

For some languages, the boundary between words and phrases is fuzzy (for example, between compounds and phrases in Chinese). This seriously affects the quality of the lexicon (3.5) and thus the quality of word segmentation (3.44). To handle this key issue, principles from two perspectives are set up.

5.2.1 Principles from the linguistic perspective

In general, all the linguistic principles concerning word-formation hold, including but not limited to the following:

1) Principle of bound morpheme (3.10): If a bound morpheme (3.10) is attached to a word, then the result is a word.

2) Principle of lexical integrity hypothesis: Syntactic rules may not refer to the internal structure of a word. If a word candidate satisfies this principle, then it is likely to be a word.

3) Principle of unpredictability of a word meaning from its subparts: If a word candidate has the property of semantic unpredictability, then it is a word.

(Figure 5 labels: Text; Word segmentation; Sequence of WSUs, with structure; Lexicon)

4) Principle of idiomatization: If a word candidate has the property of idiomatization, then it is a word.

5) Principle of collocation: If a word candidate has the property of collocation, then it is likely to be a word.

6) Principle of unproductivity: If a word candidate is unproductive in formation, then it is likely to be a word.

NOTE Some of the principles above, namely (1), (3) and (4), are definite, whereas the others, namely (2), (5) and (6), only indicate a tendency for a word candidate to be a word (leading to an MWE (3.32)). The principles can be used independently, possibly in an overlapping and sometimes even incompatible way. Principle (2) is likely to be a sufficient condition rather than a necessary one. Consider, for example, the Chinese word 洗澡 (have a bath), where 洗 (wash) is a free morpheme (3.9) and 澡 (bath) a bound morpheme (3.10): other linguistic constituents can quite easily be inserted between 洗 and 澡 in a sentence. Principles (5) and (6) need to be applied together with the degree of lexicalization (3.53).

5.2.2 Principles from the practical (pragmatic) perspective

1) Principle of frequency (3.52): Frequency is a basic criterion for quantifying the degree of lexicalization (3.53) of a word candidate.

2) Gestalt principle (from cognitive science): Things are likely to be perceived as a whole. This principle provides evidence for including some perceivable phrasal compounds (3.31) in the lexicon (3.5) even though they seem to be free combinations of their perceivable constituent parts.

3) Principle of prototype members in categories (from cognitive linguistics): According to the prototype theory regarding the mental lexicon (3.5), prototype members in categories are more salient than non-prototype members; humans remember them more accurately in short-term memory and retain and access them more easily in long-term memory. This principle provides a rationale for including in the lexicon (3.5) some phrasal compounds (3.31) which can serve as prototypes in a productive word-formation pattern, like apple pie in English and 猪肉 (pork) in Chinese, with the patterns “fruit + pie” and “animal + meat” respectively.

4) Principle of language economy: If the inclusion of a word candidate in the lexicon (3.5) can decrease the difficulty of its linguistic analysis, then it is likely to be a word. For example, 大中小学 in Chinese (university, middle school, and primary school) is an abbreviation (3.33) with a quite complex formation “big middle small school”, where “big school” means “university”, and “small school” means “primary school”. 大中小学 is not easily identified as a word if it is not registered in the lexicon (3.5).
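The principle of frequency (3.52) in item 1) can be made concrete with a small sketch that counts how often each word candidate occurs in a corpus. This is only an illustration under stated assumptions: the function name `candidate_frequencies` and the three-sentence corpus are hypothetical, and the standard does not fix any threshold or counting method.

```python
def candidate_frequencies(corpus, candidates):
    """Count non-overlapping occurrences of each candidate across the corpus sentences."""
    return {
        candidate: sum(sentence.count(candidate) for sentence in corpus)
        for candidate in candidates
    }


# Hypothetical toy corpus, for illustration only.
corpus = [
    "apple pie is sweet",
    "she baked an apple pie",
    "pear pie is rare",
]
freqs = candidate_frequencies(corpus, ["apple pie", "pear pie"])
print(freqs)  # {'apple pie': 2, 'pear pie': 1}
```

A real study would use a representative corpus of the language, and would typically normalize the counts by corpus size before comparing candidates.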

5.3 The full entry principle of the lexicon

All the lexical items (3.6) which ‘exist’ in language use can be included in the lexicon (3.5), if needed by practical applications. The lexicon (3.5) should be dynamic, adapting to changes in language use.

5.4 Principles for word segmentation result

1) Principle of granularity: Where needed, the words identified by word segmentation (3.44) may carry structures, rather than merely being separated by spaces. The generated word structures (3.45) introduce granularity, and thus a high degree of flexibility, into the result of word segmentation (3.44), so as to meet requirements ranging from information retrieval to machine translation.

2) Principle of scope maximization of affixations (3.24): All the affixes (3.21) attached to a stem (3.15) or a lexeme (3.3) in text should be grouped with it as a single WSU (3.43), together with its word structure (3.45).


3) Principle of scope maximization of compounding (3.28): If a compound (3.29) covers another compound (3.29) in text, according to the lexicon (3.5), then the longer compound (3.29) should be regarded as a WSU (3.43), together with its word structure (3.45) involving the shorter one.

4) Principle of segmentation for numeric strings, foreign character (3.38) strings, isolated bound morphemes (3.10), punctuation marks and miscellaneous items in text: Basically each item should be regarded as a WSU (3.43), with proper treatment by convention of the language.
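The principles above can be sketched together in a few lines. The example below is a hedged illustration, not a normative procedure: the `WSU` class, the toy lexicon with pre-annotated word structures, and the rule that runs of ASCII digits or ASCII letters each form one unit are all assumptions made for the sake of the example.

```python
import re
from dataclasses import dataclass


@dataclass
class WSU:
    text: str
    parts: tuple = ()  # word structure: constituent parts, empty if unanalysed


# Hypothetical lexicon mapping each entry to its pre-annotated word structure.
LEXICON = {
    "猪": (),
    "肉": (),
    "馅": (),
    "猪肉": ("猪", "肉"),
    "猪肉馅": ("猪肉", "馅"),
}

# Assumed convention: a run of ASCII digits or of ASCII letters is one unit.
RUN = re.compile(r"[0-9]+|[A-Za-z]+")


def segment(text, lexicon=LEXICON):
    """Return WSUs, preferring the longest lexicon entry at each position."""
    max_len = max((len(entry) for entry in lexicon), default=1)
    units = []
    i = 0
    while i < len(text):
        run = RUN.match(text, i)
        if run:
            units.append(WSU(run.group()))
            i = run.end()
            continue
        # Scope maximization: try the longest candidate first.
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            if piece in lexicon:
                units.append(WSU(piece, lexicon[piece]))
                break
        else:
            # Punctuation, isolated morphemes and other items: one unit each.
            units.append(WSU(text[i]))
            j = i + 1
        i = j
    return units


for unit in segment("猪肉馅500g。"):
    print(unit.text, unit.parts)
```

Here 猪肉馅 (pork filling) is returned as one WSU whose structure names the shorter compound 猪肉 (pork), so an application that needs finer granularity can descend into the structure instead of re-segmenting the text.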

5.5 Principle of full coverage and consistency in applying this standard to text

The standard should be used in a consistent way to cover any text in the language.

