BabelStone1357 : Tibetan : Precomposed...

1 of 27 1/2/2003 2:31 PM

Tibetan Script : Precomposed Tibetan

Background

Document N2558

In the Proposal on Tibetan BrdaRten Characters Encoding for ISO/IEC 10646 in BMP (Document N2558), presented by the Chinese government on 6th December 2002 for consideration by WG2, 956 precomposed Tibetan glyphs are proposed for inclusion in the ISO/IEC 10646 standard (and therefore also for inclusion in Unicode).

Existing Character Encoding Model

The Tibetan script is currently catered for in ISO/IEC 10646 and Unicode by the Tibetan block (codepoints U+0F00 through U+0FFF).

Tibetan is written horizontally from left to right, but characters may combine to form vertical "stacks" of consonant and vowel elements within the horizontal flow of text. In the existing character encoding model for Tibetan, such stacks are encoded as a sequence of one or many Unicode characters in this order:

1. One base consonant in the range U+0F40 through U+0F6A (this is the first consonant in the stack, reading from top to bottom, and will either be the root consonant or a superfixed head consonant RA, LA or SA).
2. Zero or many subjoined consonants in the range U+0F90 through U+0FBC (these will be the root consonant underneath a superfixed head consonant RA, LA or SA, and/or one of the subjoined consonants WA, YA, RA, LA or HA, and/or potentially any consonant in a consonant cluster transliterating a word from Sanskrit or some other foreign language).
3. Zero or one vowel lengthener letter, the so-called "a-chung" [U+0F71].
4. Zero or many of the subjoined or superjoined vowel signs I [U+0F72], reversed I [U+0F80], U [U+0F74], E [U+0F7A], EE [U+0F7B], O [U+0F7C] or OO [U+0F7D] (no vowel sign indicates an implicit A vowel, whilst two or more vowel signs are only used in shorthand abbreviations), or the Virama or Halanta sign [U+0F84] that indicates that the stack continues horizontally (this sign may be used when transliterating Sanskrit mantras, but is only very rarely employed).
5. Zero or one special signs used in transliterating Sanskrit words, such as the Anusvara [U+0F7E] and Candrabindu [U+0F83].

In addition, the consonant modifier mark TSA -PHRU [U+0F39] may be inserted into this sequence immediately after the consonant it modifies (this sign is normally only used with the letters PHA and BA to represent the non-Tibetan sounds of FA and VA respectively).

The existing Unicode character encoding model is able to represent any conceivable stack, with the exception of highly unusual stacks that contain more than one consonant-vowel combination in a vertical arrangement (these contravene the normal rules of Tibetan writing and are considered beyond the scope of plain text rendering; no such compound stacks are included in the Chinese proposal to encode BrdaRten characters).

Proposed Character Encoding Model

The Chinese proposal is to encode the vast majority of vertical stacks that are normally encountered as individual precomposed characters, each represented by a single codepoint rather than a sequence of two or more codepoints. The 956 proposed precomposed characters all comprise at least two consonant and/or vowel elements, so that minimal stacks comprising a single base consonant only (as well as the prefixed letters GA, DA, BA, MA and -A and the postfixed letters GA, NGA, DA, NA, BA, MA, -A, RA, LA and SA) would continue to be represented using the existing codepoints U+0F40 through U+0F69.

The 956 proposed precomposed characters cover all the stacks that would normally be used in writing native Tibetan, both colloquial and literary (including orthographic forms such as reversed I that are found in the earliest Tibetan texts), as well as the great majority of complex stacks used to transliterate Sanskrit words that may be encountered in religious texts. The only commonly found glyphs that are not included in the proposal are those for the non-Tibetan syllables FA, FI, FU, FE, FO and VA, VI, VU, VE, VO (used for transliterating foreign words), which are composed by applying the consonant modifier mark TSA -PHRU [U+0F39] to the consonants PHA and BA respectively. This is because this method of representing the sounds of F and V is not used within the People's Republic of China (instead the letter H with a subjoined letter PH is used to represent the sound of F).
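The ordering rules given under Existing Character Encoding Model can be sketched as a regular expression. This is only an illustrative sketch, not part of the original document: the names `STACK_RE`, `is_well_ordered_stack` and `split_stacks` are hypothetical, the "special signs" class is limited to the two signs the text names (the text says "such as", so the real set may be larger), and the vowel/virama slot is read as "one or more vowel signs, or a single virama", which is one possible interpretation of the description.

```python
import re

# Illustrative pattern following the order described in the text.
# Assumption: only the two special signs named in the text are accepted.
STACK_RE = re.compile(
    "[\u0F40-\u0F6A]\u0F39?"        # one base consonant, optional TSA -PHRU
    "(?:[\u0F90-\u0FBC]\u0F39?)*"   # zero or many subjoined consonants
    "\u0F71?"                        # zero or one a-chung
    "(?:[\u0F72\u0F80\u0F74\u0F7A\u0F7B\u0F7C\u0F7D]+|\u0F84)?"  # vowel signs, or virama
    "[\u0F7E\u0F83]?"               # zero or one special sign (anusvara, candrabindu)
)


def is_well_ordered_stack(seq: str) -> bool:
    """Return True if the whole string is one stack in the described order."""
    return STACK_RE.fullmatch(seq) is not None


def split_stacks(text: str) -> list[str]:
    """Return the stacks found in text, skipping punctuation such as the tsheg."""
    return STACK_RE.findall(text)


# SA + subjoined GA + subjoined RA + vowel sign U: a stack in the expected order.
print(is_well_ordered_stack("\u0F66\u0F92\u0FB2\u0F74"))
# The same elements with the vowel sign placed before a subjoined consonant.
print(is_well_ordered_stack("\u0F66\u0F74\u0F92"))
# A seven-codepoint syllable splits into four stacks of 1, 4, 1 and 1 codepoints.
print([len(s) for s in split_stacks("\u0F56\u0F66\u0F92\u0FB2\u0F74\u0F56\u0F66")])
```

Under the proposal, only stacks of two or more elements (such as the four-codepoint stack in the last example) would be candidates for a single precomposed codepoint; the single-consonant stacks would keep their existing codepoints.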


[email protected]/03-002 Andrew West


It should be noted that this proposal does not include any glyphs that could not be encoded using the existing Tibetan character encoding model.

Although the great majority of texts, secular and religious, could be encoded using only these 956 precomposed stacks and the existing base consonants (U+0F40 through U+0F69), they do not represent the complete repertoire of all possible consonant-vowel stacks. This means that it would still be necessary to revert to the current character encoding model to encode unusual forms that are not represented by these 956 precomposed characters.

For example, this proposal includes a single, apparently arbitrary, example of a consonant plus triple E vowel (Glyph 107) that is found only in Tibetan shorthand abbreviations, but many other consonant plus multiple vowel sign shorthand abbreviations that are frequently encountered on prayer flags and elsewhere are not covered by this proposal.

This means that some texts would have to be encoded using a mixture of precomposed glyphs and decomposed character sequences. It also means that there is a strong possibility that extensions to these 956 precomposed characters may be proposed in the future, in which case the BrdaRten characters would no longer all be encoded in a sequence that corresponds to the dictionary collation order.

However, the main issue with this proposal is that, if accepted, there would be two co-existing and equally legitimate character encoding models for Tibetan, and a text could be encoded using either model, or even a mixture of the two models. This would cause havoc for text processing applications, and yet bring no appreciable benefit to the end-user of Tibetan word processing systems: it takes exactly the same effort for a user to select or otherwise input a Tibetan stack whether it is encoded as a single character or as a sequence of several characters (the encoding should be opaque to the ordinary user).
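The difficulty of two co-existing encoding models can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the text does not give the codepoints proposed in N2558, so two Private Use Area codepoints (U+E000, U+E001) stand in for precomposed stacks, and the mapping table, the chosen stacks and the function names are hypothetical.

```python
# HYPOTHETICAL mapping: Private Use Area codepoints stand in for precomposed
# stacks, because the codepoints proposed in N2558 are not given in the text.
PRECOMPOSED_TO_SEQUENCE = {
    "\uE000": "\u0F66\u0F92\u0FB2\u0F74",  # SA + subjoined GA + subjoined RA + U
    "\uE001": "\u0F42\u0FB1",              # GA + subjoined YA
}


def decompose(text: str) -> str:
    """Rewrite any (hypothetical) precomposed stack as its decomposed sequence."""
    return "".join(PRECOMPOSED_TO_SEQUENCE.get(ch, ch) for ch in text)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after mapping both to the decomposed model."""
    return decompose(a) == decompose(b)


# The same syllable encoded under each model.
decomposed = "\u0F56\u0F66\u0F92\u0FB2\u0F74\u0F56\u0F66"  # existing model, 7 codepoints
mixed = "\u0F56\uE000\u0F56\u0F66"                          # one precomposed stack, 4 codepoints

print(decomposed == mixed)           # naive comparison treats them as different
print(same_text(decomposed, mixed))  # equal only after an extra normalisation pass
```

In this sketch every search, comparison or sort would need such a normalisation step first, and the table would have to be kept in step with any later extensions to the precomposed repertoire.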
Neither would this situation facilitate Tibetan information processing (as claimed in Document N2558), as applications would still have to be able to cope with decomposed Tibetan character sequences, which would only add to the complexity of Tibetan text processing, rather than reduce it.

The introductory text of Document N2558, entitled Explanation on Tibetan BrdaRten Characters Encoding for ISO/IEC 10646 in BMP, is quoted in full below without further comment or criticism.

Explanation on Tibetan BrdaRten Characters Encoding for ISO/IEC 10646 in BMP

1. Introduction

The written form of Tibetan language could be regarded as a horizontal stream of basic Tibetan characters and BrdaRten characters without vertical combining. The BrdaRten characters are vertically pre-composed Tibetan characters. The repertoire of BrdaRten characters proposed in this document are those stably structured, widely and frequently used BrdaRten with high usage coverage to modern Tibetan, including modern Tibetan characters, frequently used Sanskrit characters, borrowed character used to represent modern terms by pronunciations. This document proposes 956 BrdaRten characters.

192 Tibetan characters encoded in ISO/IEC 10646, are Tibetan letters, Sanskrit letters, punctuations, astronomy and special symbols. They enable to represent thousands Tibetan characters using dynamic combing methods with minimum code points. However, because of technical reasons, this encoding scheme is not compatible with traditional education, publication and electronic desktop publishing systems. It seems still quite difficult to properly solve the problems with Tibetan information interchange and processing.

1) The biggest difficulty for Tibetan information processing is the vertical composition of Tibetan characters. After the composition, each component would be changed greatly in shape and size, especially the vertical composition of the Sanskrit would reach to 7 layers where each letter requires different spans in height and width at the same layer that is quite hard to be dealt with. Up to now, there is no report showing any system platform has implemented Tibetan processing system using dynamic combining method.

2) In the implementation level 1 of ISO/IEC 10646, one code corresponds one character but one Tibetan BrdaRten character needs several codes to represent with very length which is a big block to the implementation of Tibetan system. In practical applications, the bilingual processing such as Tibetan-Chinese or Tibetan-English at the same level of implementation is an underlying requirement of Tibetan users.

Since 1990s, from DOS to Windows, both domestic and overseas applications have been using Tibetan BrdaRten character set at implementation level 1. For example, the Founder desktop publishing system for Tibetan is based on BrdaRten characters which has become the de-facto industry standard for Tibetan information interchange and processing in China and even outside of China.

Tibetan BrdaRten characters are structure-stable characters widely used in education, publication, classics documentation including Tibetan medicine. The electronic data containing BrdaRten characters are estimated beyond billions. Once the Tibetan BrdaRten characters are encoded in BMP, many curre