Sorting Unicode Tibetan using a Multi-Weight Collation...

33
Sorting Unicode Tibetan using a Sorting Unicode Tibetan using a Multi Multi - - Weight Collation Algorithm Weight Collation Algorithm Robert R. Chilton Robert R. Chilton Technical Director, Technical Director, The Asian Classics Input Project (ACIP) The Asian Classics Input Project (ACIP) [email protected] [email protected] www.asianclassics.org www.asianclassics.org

Transcript of Sorting Unicode Tibetan using a Multi-Weight Collation...

Page 1: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Sorting Unicode Tibetan using a Sorting Unicode Tibetan using a MultiMulti--Weight Collation AlgorithmWeight Collation Algorithm

Robert R. ChiltonRobert R. ChiltonTechnical Director,Technical Director,

The Asian Classics Input Project (ACIP)The Asian Classics Input Project (ACIP)[email protected]@well.com

www.asianclassics.orgwww.asianclassics.org

Page 2: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Why do we need a sorting Why do we need a sorting algorithm for Unicode Tibetan?algorithm for Unicode Tibetan?

Tibetan script encoded in Unicode and Tibetan script encoded in Unicode and ISO/IEC 10646ISO/IEC 10646Full support of Tibetan within a computer Full support of Tibetan within a computer environment also requires:environment also requires:–– Keyboard(sKeyboard(s) or other input methods) or other input methods–– Rendering: readable, printable display of the Rendering: readable, printable display of the

encoded Tibetanencoded Tibetan--script data (fonts, etc.)script data (fonts, etc.)–– Collation rules for generating culturally Collation rules for generating culturally

acceptable sortingacceptable sorting

Page 3: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Previous efforts to sort Tibetan dataPrevious efforts to sort Tibetan data

Utilized a singleUtilized a single--weight sorting model weight sorting model

Generally adequate for sorting native Generally adequate for sorting native Tibetan orthographies within a specific Tibetan orthographies within a specific application/environmentapplication/environment

Not able to robustly handle foreign Not able to robustly handle foreign transcriptions and other nontranscriptions and other non--standard standard orthographiesorthographies

Page 4: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Designed for Romanized or fontDesigned for Romanized or font--encoded encoded Tibetan, not Unicode TibetanTibetan, not Unicode Tibetan

Treat TibetanTreat Tibetan--script sorting in an exclusive, script sorting in an exclusive, special case fashion; such proprietary special case fashion; such proprietary sorting methods are not likely to be widely sorting methods are not likely to be widely implementedimplemented

Page 5: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Features of multiFeatures of multi--weight sorting weight sorting methodologymethodology

WellWell--understood and widely implemented*understood and widely implemented*Uses a Uses a collation element tablecollation element table to achieve to achieve culturally acceptable sortingculturally acceptable sortingEnables searching at different degrees of Enables searching at different degrees of precision (e.g., caseprecision (e.g., case--sensitive searches)sensitive searches)

*References:*References:

•• ISO/IEC 14651 (2001ISO/IEC 14651 (2001--04) Ed. 1.004) Ed. 1.0Information technology Information technology ---- International string ordering and comparison International string ordering and comparison ----Method for comparing character strings and description of the coMethod for comparing character strings and description of the common mmon template template tailorabletailorable ordering.ordering.

•• Unicode Technical Standard #10:Unicode Technical Standard #10: Unicode Collation Algorithm (UCA).Unicode Collation Algorithm (UCA).

Page 6: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Advantages for implementers and Advantages for implementers and users of Unicode Tibetanusers of Unicode Tibetan

Collation element table for Tibetan can Collation element table for Tibetan can ““plug intoplug into”” existing sort logic at the existing sort logic at the operating system leveloperating system levelRobust searching and sorting of Tibetan Robust searching and sorting of Tibetan data thus becomes automatically available data thus becomes automatically available to all compliant applications running within to all compliant applications running within that operating system environment that operating system environment

Page 7: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

The same collation element table can be The same collation element table can be used across multiple platforms used across multiple platforms –– resulting resulting in consistent sorting of Tibetan data within in consistent sorting of Tibetan data within different different operating system environmentsoperating system environments

Page 8: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Moving from methodology to Moving from methodology to algorithm: an overviewalgorithm: an overview

A look at dictionary sortingA look at dictionary sortingUnderstanding the multiUnderstanding the multi--weight sorting weight sorting ((““international string orderinginternational string ordering””) model and ) model and extending this model to Tibetanextending this model to TibetanDetermining the collation elements needed Determining the collation elements needed for sorting Unicode Tibetanfor sorting Unicode Tibetan

Page 9: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Dictionary sorting of TibetanDictionary sorting of TibetanGeneral agreement (since 1900) on dictionary General agreement (since 1900) on dictionary order of native orthographiesorder of native orthographies–– Main exception: treatment of Main exception: treatment of wazurwazur

Lack of consensus on details of sorting foreignLack of consensus on details of sorting foreign--origin orthographiesorigin orthographiesUniversal agreement that all entries must appear Universal agreement that all entries must appear under one of the 30 letters under one of the 30 letters ((ཀ་ཀ་ to to ཨ་ཨ་))–– Example of Example of ††ññ -- (Sanskrit: (Sanskrit: ““skandhaskandha””): sorts under the ): sorts under the

collation slot for collation slot for ††

Page 10: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

How foreign words are sorted in How foreign words are sorted in dictionaries generallydictionaries generally

Words from foreign languages are sorted Words from foreign languages are sorted according to the sort rules of the according to the sort rules of the dictionarydictionary’’s language (and not the sort s language (and not the sort rules of the origin language)rules of the origin language)–– a Danish word beginning with a Danish word beginning with åå appears after appears after

letter letter ZZ in a Danish dictionaryin a Danish dictionary–– the same word is sorted under letter the same word is sorted under letter AA in an in an

English dictionaryEnglish dictionary

Page 11: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Extending this Extending this convention to Tibetan, all convention to Tibetan, all words in a Tibetan dictionary words in a Tibetan dictionary –– including including foreign words foreign words –– are sorted under 30 lettersare sorted under 30 lettersExtending this convention still further, all Extending this convention still further, all vowel signs are treated in terms of the 5 vowel signs are treated in terms of the 5 standard Tibetan vowelsstandard Tibetan vowels–– implicit vowel implicit vowel ཨ་ཨ་–– 4 explicit vowel signs4 explicit vowel signs

Page 12: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

The multiThe multi--weight sorting model for weight sorting model for international string orderinginternational string ordering

Weights are generally assigned at three Weights are generally assigned at three (or more) levels(or more) levelsIn Latin scripts these levels correspond to:In Latin scripts these levels correspond to:

1. alphabetic ordering = primary level1. alphabetic ordering = primary level2. diacritic ordering = secondary level2. diacritic ordering = secondary level3. case ordering = tertiary level3. case ordering = tertiary level

(Additional levels (Additional levels may be used for tiemay be used for tie--breaking breaking between strings not distinguished at the first between strings not distinguished at the first three levels)three levels)

Page 13: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Examples in Latin scriptExamples in Latin script

rolerole and and rulerule differ at the primary (alphabetic) differ at the primary (alphabetic) levellevel

role role and and rôlerôle differ at the secondary (diacritic) differ at the secondary (diacritic) levellevel

rolerole and and ROLEROLE differ at the tertiary (case) leveldiffer at the tertiary (case) level

rolerole and and RÔLE RÔLE differ at both the secondary level differ at both the secondary level and the tertiary leveland the tertiary level

Page 14: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Extending the multiExtending the multi--weight model weight model to Tibetanto Tibetan

KK-- and and MM -- differ at the primary leveldiffer at the primary level

K-K- andand Kı-Kı- differ at the secondary leveldiffer at the secondary level

K-K- and and ff -- differ at the tertiary leveldiffer at the tertiary level

K-K- and and gÛgÛ -- differ at both the secondary differ at both the secondary

level and the tertiary levellevel and the tertiary level

Page 15: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Determining collation elements for Determining collation elements for Unicode Tibetan: an overviewUnicode Tibetan: an overview

Prescripts in Tibetan orthographies Prescripts in Tibetan orthographies The Unicode model for encoding Tibetan The Unicode model for encoding Tibetan scriptscriptWhat is a collation element?What is a collation element?167 primary167 primary--weighted collation elementsweighted collation elements9 secondary9 secondary--weighted collation elementsweighted collation elements

Page 16: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Prescripts in Tibetan orthographiesPrescripts in Tibetan orthographies

In English, all 26 letters always have In English, all 26 letters always have primary weight; thus primary weight; thus ““atat”” sorts far away sorts far away from from ““vatvat””In Tibetan, letters written before the radical In Tibetan, letters written before the radical letter have less than primary weight; thus letter have less than primary weight; thus ཀནཀན,, སྐནསྐན,, བཀནབཀན,, and and བསྐནབསྐན sort relatively near sort relatively near to each other, underto each other, under letter letter ཀ་ཀ་

Page 17: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

11 possible prescripts (or 11 possible prescripts (or ““prepre--radicalsradicals””) ) might occur before the radical letter:might occur before the radical letter:–– 5 prefix letters: 5 prefix letters: གག དད བབ མམ འའ–– 3 head letters: 3 head letters: རར ལལ སས–– 3 two3 two--letter sequences of letter sequences of བབ prefix followed by prefix followed by

one of the head letters, i.e.: one of the head letters, i.e.: བརབར བལབལ བསབས

Page 18: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Grammar rules define which radical letters Grammar rules define which radical letters can take which prescriptscan take which prescripts–– For letterFor letter ཀཀ་་ there are 7 possible prescripts:there are 7 possible prescripts:

N@N@ T@ T@ áá ññ †† TáTá T†T†No radical letter can take all 11 prescript No radical letter can take all 11 prescript forms (and some take none at all)forms (and some take none at all)

Page 19: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

The Unicode encoding model for The Unicode encoding model for Tibetan scriptTibetan script

193 distinct characters defined in Unicode193 distinct characters defined in UnicodeThe 30 letters (along with conjunct and reversed The 30 letters (along with conjunct and reversed forms) are encoded twice: in nominal position forms) are encoded twice: in nominal position and in orthographicand in orthographic--subjoined position subjoined position –– reflects the fact that the Tibetan script is written from reflects the fact that the Tibetan script is written from

top to bottom as well as from left to righttop to bottom as well as from left to right

Example of Example of Tå‰P-Tå‰P- encoded as 6 characters:encoded as 6 characters:

བབ ++ རར ++ ྟྟ ++ ེེ ++ ནན ++ ་་0F56 0F62 0F9F 0F7A 0F53 0F0B 0F56 0F62 0F9F 0F7A 0F53 0F0B

Page 20: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

What is a collation element?What is a collation element?

A A collation elementcollation element enables clustering of enables clustering of multiple Unicode characters such that they multiple Unicode characters such that they can be treated together as a single item can be treated together as a single item for determining sort weightsfor determining sort weightsSingle characters also function as collation Single characters also function as collation elementselementsThe weights assigned to the collation The weights assigned to the collation elements determine their sort (or collation) elements determine their sort (or collation) order relative to one another order relative to one another

Page 21: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Defining Tibetan preDefining Tibetan pre--scribed radical scribed radical sequences as collation elements sequences as collation elements For letter For letter ཀཀ་་ we can define each of the 7 we can define each of the 7 prescriptprescript ++ radical clusters (radical clusters (N@N@ T@T@ áá ññ ††TáTá T†T†)) as a collation element (alsoas a collation element (alsocalled a called a ““collation graphemecollation grapheme””))

We can then assign sort weights to these We can then assign sort weights to these collation elements such that they sort in a collation elements such that they sort in a culturally acceptable relative orderculturally acceptable relative order

Page 22: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

PrimaryPrimary--weighted collation weighted collation elementselements

30 nominal letters: 30 nominal letters: ཀ་ཀ་ to to ཨ་ཨ་ –– which may be which may be either radical letters either radical letters ((མིང་གཞི་མིང་གཞི་)) or suffix lettersor suffix letters

103 multi103 multi--letter preletter pre--scribed radical formsscribed radical forms–– In many of these 103 forms the prescript is written at In many of these 103 forms the prescript is written at

the head line (encoded as 1 or 2 nominal characters) the head line (encoded as 1 or 2 nominal characters) and the radical letter is encoded as a subjoined and the radical letter is encoded as a subjoined charactercharacter

–– In the example of In the example of Tå‰P-Tå‰P-, the radical letter , the radical letter KK is encoded is encoded

in subjoined positionin subjoined position

Page 23: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Defining the 4 explicit vowels as Defining the 4 explicit vowels as collation elementscollation elements

As collation elements, suffix letters cannot be As collation elements, suffix letters cannot be distinguished from bare radicals* distinguished from bare radicals* Because a nominal letter serving as a radical Because a nominal letter serving as a radical letter carries the implicit vowel letter carries the implicit vowel ཨ་ཨ་,, the 4 explicit the 4 explicit vowels must be given primary weights; and must vowels must be given primary weights; and must be weighted heavier than the nominal letters be weighted heavier than the nominal letters ----since a radical letter marked with an explicit since a radical letter marked with an explicit vowel will sort vowel will sort afterafter the same letter not marked the same letter not marked by an explicit vowelby an explicit vowel

* in a stateless implementation* in a stateless implementation

Page 24: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Defining the 30 postDefining the 30 post--radical letters radical letters as collation elementsas collation elements

PostPost--radicals = the 30 letters in subjoined radicals = the 30 letters in subjoined position (when not functioning as the radical position (when not functioning as the radical letter in a preletter in a pre--scribed radical form)scribed radical form)–– Requires maximumRequires maximum--length substring matchinglength substring matching

Only 4 postOnly 4 post--radical (subscribed) letters occur in radical (subscribed) letters occur in native Tibetan orthographiesnative Tibetan orthographies::

ྭྭ ྱྱ ྲྲ ླླ

Remaining 26 are required to treat nonRemaining 26 are required to treat non--native native orthographies in a consistent mannerorthographies in a consistent mannerMust be given primary weights; and heavier than Must be given primary weights; and heavier than the 4 explicit vowelsthe 4 explicit vowels

Page 25: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

Relative order of the 167 primaryRelative order of the 167 primary--weighted collation elementsweighted collation elements

First: 30 nominal letters and 103 multiFirst: 30 nominal letters and 103 multi--letter preletter pre--scribed radical forms (= 133 collation elements)scribed radical forms (= 133 collation elements)–– given sort weights such that the 103 pregiven sort weights such that the 103 pre--scribed scribed

radical forms are interleaved as appropriate with the radical forms are interleaved as appropriate with the 30 nominal letters30 nominal letters

Next: 4 explicit vowelsNext: 4 explicit vowelsNext: 30 postNext: 30 post--radical letters (i.e., in orthographic radical letters (i.e., in orthographic subscribed position) subscribed position) Thus, total collation slots at the primaryThus, total collation slots at the primary--weight weight level: 133 + 4 + 30 = 167level: 133 + 4 + 30 = 167

Page 26: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

SecondarySecondary--weighted collation weighted collation elementselements

These 9 have no primary weightThese 9 have no primary weight

–– 4 combining marks: 4 combining marks:

◌྄◌྄ ◌ཱ◌ཱ ◌༹◌༹ ◌ཿ◌ཿ

–– 5 signs: 5 signs: ྅྅ ྈྈ ྉྉ ྊྊ ྋྋ

Page 27: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

The remaining 120 Unicode The remaining 120 Unicode Tibetan charactersTibetan characters

30 + 4 + 30 + 9 = 73 of the 193 Unicode Tibetan 30 + 4 + 30 + 9 = 73 of the 193 Unicode Tibetan characters have been treated above, leaving 120characters have been treated above, leaving 120

59 of these 120 have a primary weight (in addition 59 of these 120 have a primary weight (in addition to a secondary and/or tertiary weight):to a secondary and/or tertiary weight):–– 19 can be decomposed into simple elements and thus 19 can be decomposed into simple elements and thus

need not be treated in the collation element tableneed not be treated in the collation element table–– 9 are variants (primary and 9 are variants (primary and tertiary weighted) of certain tertiary weighted) of certain

of the 30 nominal lettersof the 30 nominal letters–– 3 are variants 3 are variants (primary and (primary and tertiary weighted) of certain tertiary weighted) of certain

of the 4 explicit vowelsof the 4 explicit vowels

Page 28: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

–– 8 are 8 are variants (primary and variants (primary and tertiary weighted) of tertiary weighted) of certain of the 30 subscribed letterscertain of the 30 subscribed letters

–– 20 are the digits and half20 are the digits and half--digitsdigits

The remaining 61 characters are punctuation The remaining 61 characters are punctuation marks and other symbols which generally have marks and other symbols which generally have no impact on dictionary sort order and thus have no impact on dictionary sort order and thus have no primary, secondary or tertiary weight no primary, secondary or tertiary weight

Page 29: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

AppendicesAppendices

•• The Unicode (and ISO/IEC 10646) The Unicode (and ISO/IEC 10646) charactercharacter--encoding chart for Tibetanencoding chart for Tibetan–– highlighting characters in example: highlighting characters in example: Tå‰P-Tå‰P-

•• An ordered list of collation elements for Unicode Tibetan

Page 30: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

The Unicode Standard 3.0, Copyright © 1991-2000, Unicode, Inc. All rights reserved 435

0F7FTibetan0F00

0F0 0F1 0F2 0F3 0F4 0F5 0F6 0F7

�������

��� ���

����������

�����

!"#$%&'()*+,-./0

123456

78

9:;<=>?@

ABCDEFGH

IJKLMNO

PQRSTUVWXYZ[\]^_

`abcdefghij

klmno

pq

rs

tuvwxy

0F00

0F01

0F02

0F03

0F04

0F05

0F06

0F07

0F08

0F09

0F0A

0F0B

0F0C

0F0D

0F0E

0F0F

0F10

0F11

0F12

0F13

0F14

0F15

0F16

0F17

0F18

0F19

0F1A

0F1B

0F1C

0F1D

0F1E

0F1F

0F20

0F21

0F22

0F23

0F24

0F25

0F26

0F27

0F28

0F29

0F2A

0F2B

0F2C

0F2D

0F2E

0F2F

0F30

0F31

0F32

0F33

0F34

0F35

0F36

0F37

0F38

0F39

0F3A

0F3B

0F3C

0F3D

0F3E

0F3F

0F40

0F41

0F42

0F43

0F44

0F45

0F46

0F47

0F49

0F4A

0F4B

0F4C

0F4D

0F4E

0F4F

0F50

0F51

0F52

0F53

0F54

0F55

0F56

0F57

0F58

0F59

0F5A

0F5B

0F5C

0F5D

0F5E

0F5F

0F60

0F61

0F62

0F63

0F64

0F65

0F66

0F67

0F68

0F69

0F6A

0F71

0F72

0F73

0F74

0F75

0F76

0F77

0F78

0F79

0F7A

0F7B

0F7C

0F7D

0F7E

0F7F

0

1

2

3

4

5

6

7

8

9

A

B

C

D

E

F

Robert
V
Robert
b
Robert
t
Robert
S
Robert
Page 31: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

The Unicode Standard 3.0, Copyright © 1991-2000, Unicode, Inc. All rights reserved436

0FFFTibetan0F80

0F8 0F9 0FA 0FB 0FC 0FD 0FE 0FF

��

��

��

��

!

"#

$

%

&

'(

)*

+

,

-

.

/

0

1

2

3

45

6

7

8

9

:

;

<

=

>

?

@

A

B

C

D

E

F

G

H

0F80

0F81

0F82

0F83

0F84

0F85

0F86

0F87

0F88

0F89

0F8A

0F8B

0F90

0F91

0F92

0F93

0F94

0F95

0F96

0F97

0F99

0F9A

0F9B

0F9C

0F9D

0F9E

0F9F

0FA0

0FA1

0FA2

0FA3

0FA4

0FA5

0FA6

0FA7

0FA8

0FA9

0FAA

0FAB

0FAC

0FAD

0FAE

0FAF

0FB0

0FB1

0FB2

0FB3

0FB4

0FB5

0FB6

0FB7

0FB8

0FB9

0FBA

0FBB

0FBC

0FBE

0FBF

0FC0

0FC1

0FC2

0FC3

0FC4

0FC5

0FC6

0FC7

0FC8

0FC9

0FCA

0FCB

0FCC

0FCF

0

1

2

3

4

5

6

7

8

9

A

B

C

D

E

F

Robert
Page 32: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

An Ordered List of Collation Elements for Unicode Tibetan

[*Note: a comprehensive Collation Element Table for Tibetan script will include additional collation elements, such as , , ྋྙ, ྉྤ, ྉྥ, beyond those listed here.]

A. Primary Weighted Collation Elements A.1. The 133 radical-initial sequences (also covers the suffix letters):

ཀ དཀ བཀ རྐ ལྐ སྐ བརྐ བསྐ ཁ མཁ འཁ ག དག བག མག འག རྒ ལྒ སྒ བརྒ བསྒ ང དང མང རྔ ལྔ སྔ བརྔ བསྔ ཅ གཅ བཅ ལྕ བལྕ ཆ མཆ འཆ ཇ མཇ འཇ རྗ ལྗ བརྗ ཉ གཉ མཉ རྙ སྙ བརྙ བསྙ ཏ གཏ བཏ རྟ ལྟ སྟ བརྟ བལྟ བསྟ ཐ མཐ འཐ ད གད བད མད འད རྡ ལྡ སྡ བརྡ བལྡ བསྡ ན གན མན རྣ སྣ བརྣ བསྣ པ དཔ ལྤ སྤ ཕ འཕ བ དབ འབ རྦ ལྦ སྦ མ དམ རྨ སྨ ཙ གཙ བཙ རྩ སྩ བརྩ བསྩ ཚ མཚ འཚ ཛ མཛ འཛ རྫ བརྫ ཝ ཞ གཞ བཞ ཟ གཟ བཟ འ ཡ གཡ ར བར(seen in བརླ) ལ ཤ གཤ བཤ ས གས བས ཧ ལྷ ཨ

Key: Black for the 30 nominal letters. Note that whereas any of these 30 can serve as a bare radical,

10 of these can also appear in suffix position in native orthographies. Blue for (relatively) unambiguous cases of prefixed and/or superscribed radical letters. Note that

certain unavoidable ambiguities arise between native orthographies and transcriptions from foreign languages.

Red for ambiguous cases where a 3rd codepoint is required to distinguish the sequence as being a prefixed radical letter (as opposed to a root letter followed by a suffix). Note that certain cases (in Dzongkha) require a 4th codepoint in order to distinguish a case of a prefixed radical letter from a case of a suffix letter followed by a secondary syllable that involves a vowel (i.e., ནི or མོ ).

Magenta for an ambiguous case (in Dzongkha) where a 3rd (or possibly 4th) codepoint is required to distinguish the sequence as being a prefixed radical letter (as opposed to a suffix letter ད followed by a secondary syllable པ or པ ོ).

A.2. The 4 explicit vowels:

◌ི ◌ུ ◌ེ ◌ོ A.3. The 30 post-radicals: ◌ྐ ◌ྑ ◌ྒ ◌ྔ ◌ ྕ◌ྖ ◌ྗ ◌ྙ ◌ྟ ◌ ྠ◌ྡ ◌ྣ ◌ྤ ◌ྥ ◌ ྦ◌ྨ ◌ྩ ◌ ྪ◌ྫ ◌ ྭ◌ྮ ◌ྯ ◌ཱ ◌ྱ ◌ ྲ◌ླ ◌ྴ ◌ྶ ◌ྷ ◌ ྸ

Page 33: Sorting Unicode Tibetan using a Multi-Weight Collation …ph2046/iats/it/IATS-X_Chilton_slides.pdfTibetan script 193 distinct characters defined in Unicode The 30 letters (along with

B. Secondary Weighted Collation Elements (have no primary weight) B.1. The 4 combining marks:

◌྄ ◌ཱ ◌༹ ◌ཿ B.2. The 5 signs (used in transliteration):

྅ ྈ ྉ ྊ ྋ C. The 120 Remaining Unicode Tibetan Characters The characters listed above (in items A and B) account for 73 of the 193 Tibetan characters defined in Unicode. This leaves 120 characters, of which 19 can be decomposed into simple elements and thus need not be treated in the collation element table. There is also no need to assign primary secondary or tertiary weights to the 61 characters that function as punctuation marks and other symbols since these generally have no impact on dictionary sort order. [Note that "Syllable OM" at U+0F00 is here treated as an ornamental symbol rather than as having any lexical value due to the fact that there is no canonical or compatibility decompostion specified for this character.] The digits and half digits account for 20 further characters. The remaining 20 characters are variations (i.e., having both primary and tertiary weights) of certain of the 30 nominal letters, 4 vowels, and 30 subjoined post-initial letters listed previously. 9 Nominal letter variants:

ཊ ཋ ཌ ཎ ◌ཾ ◌ྂ ◌ྃ ར(fixed form) ཥ 3 Vowel variants:

◌ྀ ◌ཻ ◌ཽ 8 Subjoined letter variants:

◌ྚ ◌ྛ ◌ྜ ◌ྞ ◌ྺ ◌ྻ ◌ྼ ◌ྵ