BabelStone1357 : Tibetan : Precomposed Tibetan Tibetan Script : Precomposed Tibetan Background...

  • BabelStone1357 : Tibetan : Precomposed Tibetan file:///C:/Documents%20and%20Settings/

    1 of 30 1/22/2003 9:00 AM

    Tibetan Script : Precomposed Tibetan

    Background

    Document N2558

    In the Proposal on Tibetan BrdaRten Characters Encoding for ISO/IEC 10646 in BMP (Document N2558), presented by the Chinese government on 6th December 2002 for consideration by WG2, 956 precomposed Tibetan glyphs are proposed for inclusion in the ISO/IEC 10646 standard (and therefore also for inclusion in Unicode).

    Existing Character Encoding Model

    The Tibetan script is currently catered for in ISO/IEC 10646 and Unicode by the Tibetan block (codepoints U+0F00 through U+0FFF). Tibetan is written horizontally from left to right, but characters may combine to form vertical "stacks" of consonant and vowel elements within the horizontal flow of text. In the existing character encoding model for Tibetan, such stacks are encoded as a sequence of one or more Unicode characters in this order:

    1. One base consonant in the range U+0F40 through U+0F6A (this is the first consonant in the stack, reading from top to bottom, and will either be the root consonant or a superfixed head consonant RA, LA or SA) or one of the superfixed transliteration letters U+0F88 through U+0F8B that are most commonly found in Kalachakra literature.

    2. Zero or more subjoined consonants in the range U+0F90 through U+0FBC (these will be the root consonant underneath a superfixed head consonant RA, LA or SA, and/or one of the subjoined consonants WA, YA, RA, LA or HA, and/or potentially any consonant in a consonant cluster transliterating a word from Sanskrit or some other foreign language).

    3. Zero or one vowel lengthener letter, the so-called "a-chung" [U+0F71].

    4. Zero or more of the subjoined or superjoined vowel signs I [U+0F72], reversed I [U+0F80], U [U+0F74], E [U+0F7A], EE [U+0F7B], O [U+0F7C] or OO [U+0F7D] (no vowel sign indicates an implicit A vowel, whilst two or more vowel signs are only used in shorthand abbreviations), or the Virama or Halanta sign [U+0F84] that indicates that the stack continues horizontally (this sign may be used when transliterating Sanskrit mantras, but is only very rarely employed).

    5. Zero or one special sign used in transliterating Sanskrit words, such as the Anusvara [U+0F7E] and Candrabindu [U+0F83].
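The ordering above can be sketched as a regular expression. This is only an illustration built from the code point ranges listed in the text, not a complete validator. The names `STACK_RE` and `is_well_ordered_stack` are made up for this example, and the "special signs" class is assumed to contain only the two signs named above.

```python
import re

# Character classes built from the code point ranges listed above.
BASE = "[\u0F40-\u0F6A\u0F88-\u0F8B]"        # base consonant or transliteration letter
SUBJOINED = "[\u0F90-\u0FBC]"                # subjoined consonants
A_CHUNG = "\u0F71"                           # vowel lengthener
VOWEL = "[\u0F72\u0F74\u0F7A-\u0F7D\u0F80]"  # I, U, E, EE, O, OO, reversed I
VIRAMA = "\u0F84"                            # Virama / Halanta
SPECIAL = "[\u0F7E\u0F83]"                   # Anusvara, Candrabindu (assumed to be the full set)

# One stack: base, subjoined*, a-chung?, (vowel signs* or virama), special sign?
STACK_RE = re.compile(
    f"{BASE}{SUBJOINED}*{A_CHUNG}?(?:{VOWEL}*|{VIRAMA}){SPECIAL}?"
)


def is_well_ordered_stack(text: str) -> bool:
    """Return True if text is exactly one stack in the order described above."""
    return STACK_RE.fullmatch(text) is not None


# SA + subjoined GA + subjoined RA + vowel sign U: well ordered.
print(is_well_ordered_stack("\u0F66\u0F92\u0FB2\u0F74"))
# A subjoined consonant with no base consonant before it: rejected.
print(is_well_ordered_stack("\u0F92\u0F66"))
```

A pattern like this only checks the order of elements; it says nothing about whether a given stack actually occurs in Tibetan or Sanskrit transliteration.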

    In addition the consonant modifier mark TSA -PHRU [U+0F39] may be inserted into this sequence immediately after the consonant it modifies (this sign is normally only used with the letters PHA and BA to represent the non-Tibetan sounds of FA and VA respectively).

    The existing Unicode character encoding model is able to represent any conceivable stack with the exception of highly unusual stacks that contain more than one consonant-vowel combination in a vertical arrangement (these contravene the normal rules of Tibetan writing, and are considered beyond the scope of plain text rendering; no such compound stacks are included in the Chinese proposal to encode BrdaRten characters).
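As a small illustration of where TSA -PHRU sits in the sequence, the sketch below builds FA and VI from PHA and BA. The helper `with_tsa_phru` is a hypothetical name used only here; the code points are those of the existing Tibetan block.

```python
import unicodedata

TSA_PHRU = "\u0F39"          # consonant modifier mark
PHA, BA = "\u0F55", "\u0F56"
VOWEL_I = "\u0F72"


def with_tsa_phru(consonant: str, vowel: str = "") -> str:
    """Place TSA -PHRU directly after the consonant, before any vowel sign."""
    return consonant + TSA_PHRU + vowel


fa = with_tsa_phru(PHA)          # PHA + TSA -PHRU
vi = with_tsa_phru(BA, VOWEL_I)  # BA + TSA -PHRU + vowel sign I

# List the code points of VI in encoding order.
for ch in vi:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```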

    Proposed Character Encoding Model

    The Chinese proposal is to encode the vast majority of vertical stacks that are normally encountered as individual precomposed characters represented by a single codepoint rather than a sequence of two or more codepoints. The 956 proposed precomposed characters all comprise at least two consonant and/or vowel elements, so that minimal stacks comprising a single base consonant only (as well as the prefixed letters GA, DA, BA, MA and -A and the postfixed letters GA, NGA, DA, NA, BA, MA, -A, RA, LA and SA) would continue to be represented using the existing codepoints U+0F40 through U+0F69.

    The 956 proposed precomposed characters cover all the stacks that would normally be used in writing native Tibetan, both colloquial and literary (including orthographic forms such as reversed I that are found in the earliest Tibetan texts), as well as the great majority of complex stacks used to transliterate Sanskrit words that may be encountered in religious texts. The only commonly found glyphs that are not included in the proposal are those for the non-Tibetan syllables FA, FI, FU, FE, FO and VA, VI, VU, VE, VO (used for transliterating foreign words) that are composed by the application of the consonant modifier mark TSA -PHRU [U+0F39] to the consonants PHA and BA respectively. This is because this method of representing the sounds of F and V is not used within the People's Republic of China (instead the letter HA with a subjoined letter PHA is used to represent the sound of F).

    Andrew C West L2/03-002R

    It should be noted that this proposal does not include any glyphs that could not be encoded using the existing Tibetan character encoding model. Although the great majority of texts, secular and religious, could be encoded using only these 956 precomposed stacks and the existing base consonants (U+0F40 through U+0F69), they do not represent the complete repertoire of all possible consonant-vowel stacks. This means that it would still be necessary to revert to the current character encoding model to encode unusual forms that are not represented by these 956 precomposed characters. For example, this proposal includes a single, apparently arbitrary, example of a consonant plus triple E vowel (Glyph 107) that is found only in Tibetan shorthand abbreviations, but many other consonant plus multiple vowel sign shorthand abbreviations that are frequently encountered in prayer flags and elsewhere are not covered by this proposal. This means that some texts would have to be encoded using a mixture of precomposed glyphs and decomposed character sequences. It also means that there is a strong possibility that extensions to these 956 precomposed characters may be proposed in the future, in which case the BrdaRten characters would no longer all be encoded in a sequence that corresponds to the dictionary collation order.

    However, the main issue with this proposal is that, if accepted, there would be two co-existing and equally legitimate character encoding models for Tibetan, and a text could be encoded using either model, or even a mixture of the two models. This would cause havoc for text processing applications, and yet bring no appreciable benefit to the end-user of Tibetan word processing systems: it takes exactly the same effort for a user to select or otherwise input a Tibetan stack whether it is encoded as a single character or as a sequence of several characters (the encoding should be opaque to the ordinary user). Neither would this situation facilitate Tibetan information processing (as claimed in Document N2558), as applications would still have to be able to cope with decomposed Tibetan character sequences, which would only add to the complexity of Tibetan text processing, rather than reduce it.

    The introductory text of Document N2558, entitled Explanation on Tibetan BrdaRten Characters Encoding for ISO/IEC 10646 in BMP, is quoted in full below without further comment or criticism.
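Before the quotation, here is a minimal sketch of the text-processing concern raised above. It is purely hypothetical: the code points in N2558 are not reproduced here, so a Private Use Area code point (U+F300) stands in for one precomposed stack. The mapping table and the `decompose` helper are invented for illustration.

```python
# Hypothetical stand-in: U+F300 (Private Use Area) plays the role of one
# proposed precomposed character; the real proposed code points are not used.
PRECOMPOSED_TO_SEQUENCE = {
    "\uF300": "\u0F66\u0F92\u0FB2\u0F74",  # SA + subjoined GA + subjoined RA + U
}


def decompose(text: str) -> str:
    """Replace each (hypothetical) precomposed character by its sequence."""
    return "".join(PRECOMPOSED_TO_SEQUENCE.get(ch, ch) for ch in text)


# The same syllable (BA, stack, BA, SA) under the two encoding models.
decomposed_word = "\u0F56\u0F66\u0F92\u0FB2\u0F74\u0F56\u0F66"
precomposed_word = "\u0F56\uF300\u0F56\u0F66"

# Naive comparison and search treat the two encodings as different text.
print(decomposed_word == precomposed_word)
print("\u0F66\u0F92\u0FB2\u0F74" in precomposed_word)

# Only after mapping back to the existing model do they compare equal.
print(decompose(precomposed_word) == decomposed_word)
```

Under these assumptions, every comparison, search or sort would need such a mapping step, and the decomposed sequences would still have to be handled as well.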

    Explanation on Tibetan BrdaRten Characters Encoding for ISO/IEC 10646 in BMP

    1. Introduction

    The written form of Tibetan language could be regarded as a horizontal stream of basic Tibetan characters and BrdaRten characters without vertical combining. The BrdaRten characters are vertically pre-composed Tibetan characters. The repertoire of BrdaRten characters proposed in this document are those stably structured, widely and frequently used BrdaRten with high usage coverage to modern Tibetan, including modern Tibetan characters, frequently used Sanskrit characters, “borrowed character” used to represent modern terms by pronunciations. This document proposes 956 BrdaRten characters.

    192 Tibetan characters encoded in ISO/IEC 10646, are Tibetan letters, Sanskrit letters, punctuations, astronomy and special symbols. They enable to represent thousands Tibetan characters using dynamic combing methods with minimum code points. However, because of technical reasons, this encoding scheme is not compatible with traditional education, publication and electronic desktop publishing systems. It seems still quite difficult to properly solve the problems with Tibetan information interchange and processing.

    1) The biggest difficulty for Tibetan information processing is the vertical composition of Tibetan characters. After the composition, each component would be changed greatly in shape and size, especially the vertical composition of the Sanskrit would reach to 7 layers where each letter requires different spans in height and width at the same layer that is quite hard to be dealt with. Up to now, there is no report showing any system platform has implemented Tibetan processing system using dynamic combining method.

    2) In the implementation level 1 of ISO/IEC 10646, one code corresponds one character but one Tibetan BrdaRten character needs several codes to represent with very length which is a big block to the implementation of Tibetan system.

    In practical applications, the bilingual processing such as Tibetan-Chinese or Tibetan-English at the same level of implementation is an underlying requirement of Tibetan users. Since 1990s, from DOS to Windows, both domestic and overseas applications have been using Tibetan BrdaRten character set at implementation level 1. For example, the Founder desktop publishing system for Tibetan is based on BrdaRten characters which has become the de-facto industry standard for Tibetan information interchange and processing in China and even outside of China.
