P. Chellappan

Tamil Encoding Standards, P. Chellappan


8/2/2019 P. Chellappan

http://slidepdf.com/reader/full/p-chellappan 1/27


What is Encoding?

Computers store all data as numbers,
not as glyphs (shapes)

A letter is not stored as its shape but as a number

A number should be assigned to every alphabet

This is called encoding


What is Encoding?

A Byte is a unit of storage and consists of 8 bits

A Byte usually represents one character

A Byte can contain a maximum of 256 combinations of 1s and 0s

00000000, 00000001, ..., 11111110, 11111111

Only numbers from 0 to 255 can be stored in a byte
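
The byte arithmetic on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the variable names are not from the original deck.

```python
# A byte has 8 bits, so it can hold 2**8 distinct bit patterns.
BITS_PER_BYTE = 8
combinations = 2 ** BITS_PER_BYTE

# Render every possible byte value as an 8-digit binary string.
patterns = [format(value, "08b") for value in range(combinations)]

print(combinations)                 # 256
print(patterns[0], patterns[1])     # 00000000 00000001
print(patterns[-2], patterns[-1])   # 11111110 11111111
```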


Tamil Encoding

Tamil has 313 characters

Number of slots available is less than 256

Glyphs are encoded and not characters

e.g. £, ª, «, ¬ etc. are encoded

The combined letter «è£ is not encoded, but « + è + £

This is called Glyph Encoding


TAM Encoding


English Text input, display and storage

[Diagram: keyboard (Q W E / Z X C) → English driver → English font → Screen. The word "and" is stored as 97 110 100]


Tamil Text input, display and storage

[Diagram: keyboard (Q W E / Z X C) → Tamil driver → Tamil font → Screen, with Ü as the sample glyph. The word Ü‹ñ£ is stored as 220 139 241 163]
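
The stored numbers in the English and Tamil diagrams can be reproduced in Python. This is a hedged sketch: the choice of Windows-1252 (`cp1252`) as the code page used to view the Tamil bytes without a Tamil font is an assumption, since the slides do not name one.

```python
# English: each letter of "and" is stored as one byte (its ASCII number).
english_bytes = list("and".encode("ascii"))
print(english_bytes)  # [97, 110, 100]

# Tamil (8-bit font encoding): the slide lists the stored bytes 220 139 241 163.
tamil_bytes = bytes([220, 139, 241, 163])

# Viewed without the Tamil font, here as Windows-1252 text (an assumption
# made for illustration), the same bytes appear as unrelated Latin symbols.
shown_without_tamil_font = tamil_bytes.decode("cp1252")
print(shown_without_tamil_font)
print([f"U+{ord(ch):04X}" for ch in shown_without_tamil_font])
```

The stored bytes are identical in both views; only the font decides whether they appear as Tamil letters or as Latin symbols.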


8 Bit Encodings - Problems

•Same 256 locations are used by all languages
e.g. 97 = a

•Use of wrong font will display garbage
[Figure: the word "and" typed once, shown in an English font and in a Tamil font]

•Hence language information is also required
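
The same effect can be shown with standard 8-bit code pages. The sketch below uses Latin-1, Windows-1251 (Cyrillic) and ISO 8859-7 (Greek) as stand-ins for different fonts; these code pages are an assumption chosen for illustration and are not the Tamil font encodings discussed in the slides.

```python
# One stored byte value, interpreted under three different 8-bit code pages.
raw = bytes([0xE1])
code_pages = ["latin-1", "cp1251", "iso8859_7"]

# The number never changes; only the table used to display it does.
decoded = {name: raw.decode(name) for name in code_pages}

for name, ch in decoded.items():
    print(f"{name}: {ch!r} (U+{ord(ch):04X})")
```

Without knowing which table (language) was intended, the byte 0xE1 alone cannot be displayed correctly.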


8 Bit Encodings - Problems

•Font change is difficult
e.g. Ü‹ñ£ and mother ‹‹ñ£

•Cross platform data exchange is difficult


Solution

•Characters of different languages should be given different numbers

•256 locations are not sufficient

•A two-byte (16 bit) character system can have 65,536 characters

•Sufficient to handle most of the world's languages!


Solution

Moving to a 16 Bit encoding system is a must


Unicode

•Originally designed as a 16 Bit character encoding scheme (the standard has since grown beyond 65,536 code points)

•Contains characters of most of the world's languages

•9 Indian Scripts including Tamil are encoded

•One Script could cover more than one language


Unicode

•Primarily encodes characters, not glyphs

•Character to glyph mapping is not necessarily 1:1

•Many characters can be represented by one glyph
e.g. è + § will be displayed as °
è + ªo£ will be displayed as ªè£


Unicode

Level 1 Implementation

•TTF font is sufficient

Level 2 Implementation (Tamil)

•OTF Fonts are required


Unicode

•Tamil block - U+0B80 to U+0BFF (128 locations)

•All Vowels Ü to å÷ and ayutham ç

•Akaram-eriya mei (consonants carrying the inherent vowel a) è to ù, and ú, û, ü and ý (consonants)

•Vowel Modifiers – o¢, o£, o¤, o¦, o§
¨ ÷

•Tamil numerals and symbols are encoded
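
The block range quoted above can be inspected with Python's standard `unicodedata` module. This is a small sketch, not part of the original deck; how many locations are actually assigned depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Range of the Tamil block as given on the slide.
TAMIL_BLOCK = range(0x0B80, 0x0BFF + 1)
print(len(TAMIL_BLOCK))  # 128

# Not every location is assigned. unicodedata.name() raises ValueError for
# unassigned code points unless a default is supplied, so pass None.
named = [(cp, unicodedata.name(chr(cp), None)) for cp in TAMIL_BLOCK]
assigned = [(cp, name) for cp, name in named if name is not None]

# Show the first few assigned characters with their official names.
for cp, name in assigned[:5]:
    print(f"U+{cp:04X} {chr(cp)} {name}")
```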


Unicode Code Chart


Unicode Sequences

•All Uyirmeis are encoded only as consonant + vowel modifiers

è + o¢ = è¢
è + o£ = è£
è + o§ = °
è + ªo£ = ªè£

•þ is encoded as è + o¢ + û

•ÿ is encoded as ú + o¢ + ó + o¦
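
The sequences above can be written out with real Unicode code points. In this sketch it is assumed that þ and ÿ in the slide's legacy font stand for the conjunct க்ஷ and the ligature ஸ்ரீ; the constant names are illustrative only.

```python
import unicodedata

KA = "\u0b95"       # TAMIL LETTER KA
SIGN_O = "\u0bca"   # TAMIL VOWEL SIGN O
PULLI = "\u0bcd"    # TAMIL SIGN VIRAMA (pulli)
SSA = "\u0bb7"      # TAMIL LETTER SSA
SA = "\u0bb8"       # TAMIL LETTER SA
RA = "\u0bb0"       # TAMIL LETTER RA
SIGN_II = "\u0bc0"  # TAMIL VOWEL SIGN II

# A uyirmei is a consonant followed by a vowel modifier.
ko = KA + SIGN_O
# Assumed readings of the two special cases on the slide.
ksha = KA + PULLI + SSA
sri = SA + PULLI + RA + SIGN_II

for text in (ko, ksha, sri):
    print(text, len(text), [unicodedata.name(ch) for ch in text])
```

Each displayed unit is one glyph on screen but two to four code points in storage.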


Advantages of Unicode

•It is a common encoding being used for all languages by all operating systems

•It is the standard used in the Internet for all data exchange

•Systems automatically detect the language and process information correctly


Disadvantages of Unicode

•File sizes are bigger

•Data processing time is more

•NLP is more complex

– Meis are not encoded 

– Character widths are not uniform
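
The file-size point can be illustrated by comparing byte counts. This sketch assumes that the four bytes 220 139 241 163 from the 8-bit example earlier in the deck spell the Tamil word for "mother" (அம்மா); that reading is an assumption based on the slides pairing the word with "mother".

```python
# The word as Unicode: five code points (A, MA, PULLI, MA, VOWEL SIGN AA).
word = "\u0b85\u0bae\u0bcd\u0bae\u0bbe"

# The same word in the 8-bit glyph encoding, as listed on the earlier slide.
legacy_bytes = bytes([220, 139, 241, 163])

print(len(legacy_bytes))               # 4
print(len(word.encode("utf-16-le")))   # 10
print(len(word.encode("utf-8")))       # 15
```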


Problems with Unicode

•Not all applications support complex scripts

•Major DTP applications also do not support complex scripts at present

•Same characters are represented by more than one, not necessarily correct, sequences
e.g. è + ªo£ = ªè£
è + ª + £ = ªè£
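
The two sequences in the example can be compared in Python. The sketch below uses the standard `unicodedata.normalize` function; Unicode normalization is not mentioned in the slides and is brought in here only to show how the two spellings relate.

```python
import unicodedata

KA = "\u0b95"                          # TAMIL LETTER KA
one_sign = KA + "\u0bca"               # KA + VOWEL SIGN O
two_signs = KA + "\u0bc6" + "\u0bbe"   # KA + VOWEL SIGN E + VOWEL SIGN AA

# Both look the same on screen, yet a plain comparison treats them as different.
print(one_sign == two_signs)                                  # False

# Normalizing to NFC maps the two-sign spelling onto the single sign.
print(unicodedata.normalize("NFC", two_signs) == one_sign)    # True
```

Software that compares or searches text without normalizing first will treat the two spellings as different words.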


TACE – An alternate encoding

•It is an alternate 16 bit encoding

•It encodes all uyirs, meis and uyirmeis


TACE – Code chart


TACE – Design features

•Every uyirmei has a unique code point

•Since every character is encoded, complex script support is not required

•The mei and uyir component of every character can be easily separated by simple bit operations

e.g. xxxx xxxc cccc vvvv
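
The bit layout above (7 prefix bits, 5 consonant bits, 4 vowel bits) suggests a split like the following. This is a minimal sketch under that layout only: the function names and the sample value 0xE213 are hypothetical and are not taken from the TACE16 code chart.

```python
VOWEL_BITS = 4       # vvvv
CONSONANT_BITS = 5   # ccccc
VOWEL_MASK = (1 << VOWEL_BITS) - 1          # 0b1111
CONSONANT_MASK = (1 << CONSONANT_BITS) - 1  # 0b11111


def split_code(code):
    """Return (prefix, consonant index, vowel index) of a 16-bit code."""
    vowel = code & VOWEL_MASK
    consonant = (code >> VOWEL_BITS) & CONSONANT_MASK
    prefix = code >> (VOWEL_BITS + CONSONANT_BITS)
    return prefix, consonant, vowel


def join_code(prefix, consonant, vowel):
    """Rebuild the 16-bit code from its three fields."""
    return (prefix << (VOWEL_BITS + CONSONANT_BITS)) | (consonant << VOWEL_BITS) | vowel


# Hypothetical value laid out as 1110 001 | 00001 | 0011.
sample = 0xE213
print(split_code(sample))             # (113, 1, 3)
print(hex(join_code(113, 1, 3)))      # 0xe213
```

Because the mei and uyir parts sit in fixed bit fields, a shift and a mask are enough to separate them, with no lookup table.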


TACE – Advantages

•Complex Script support is not required

•Compact file sizes

•Fast data processing

•Fast NLP processing


TACE – Disadvantages

•The major bottleneck is that it is encoded in the Private Use Area (PUA)

•It cannot be used for global data interchange

•It can be used only within a closed user group
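
Why Private Use Area code points do not travel well can be seen from how standard libraries treat them. The sketch below uses Python's `unicodedata`; the sample code point U+E200 is an arbitrary PUA value chosen for illustration and is not taken from the TACE16 chart.

```python
import unicodedata

# Private Use Area of the Basic Multilingual Plane.
PUA_START, PUA_END = 0xE000, 0xF8FF


def is_private_use(ch):
    """True if the character's general category is 'Co' (private use)."""
    return unicodedata.category(ch) == "Co"


sample = "\ue200"  # arbitrary PUA code point (assumption, for illustration)
print(is_private_use(sample))        # True
print(is_private_use("\u0b95"))      # False: TAMIL LETTER KA is a standard character

# The standard defines no name or meaning for a PUA code point.
print(unicodedata.name(sample, None))  # None
```

The meaning of such a code point exists only by private agreement between sender and receiver, which is why use is limited to a closed user group.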


Tamil Nadu Government Standards

•The Government of Tamil Nadu has taken a policy decision to convert to the 16 Bit encoding system

•UNICODE will be the primary 16 bit standard

•TACE16 will be the only alternate encoding standard in areas where support for Unicode is not available fully or partially.