4 character encoding-unicode
Unicode
• A standard character encoding designed to support all of the world's languages
• Unicode represents characters differently than ASCII
• Each character is mapped to a code point
– Example: A → code point 65 → binary 100 0001
• A code point can be stored using UTF-32, UTF-16, or UTF-8
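The mapping on this slide can be reproduced with Python built-ins. This is a minimal sketch; the variable names are illustrative, and the big-endian codec variants are an assumption made here only so that Python does not prepend a byte order mark.

```python
# Look up the code point of "A" and show it in binary.
ch = "A"
code_point = ord(ch)             # 65
binary = bin(code_point)[2:]     # "1000001"

# The same code point serialized by the three encoding forms.
utf32 = ch.encode("utf-32-be")   # 4 bytes
utf16 = ch.encode("utf-16-be")   # 2 bytes
utf8 = ch.encode("utf-8")        # 1 byte

print(code_point, binary, len(utf32), len(utf16), len(utf8))
```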
UTF-32
• Uses 4 bytes (32 bits) per character
• Example:
– A (100 0001)
0 1 0 0 0 0 0 1
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
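The four rows above can be checked in Python. This sketch assumes the slide lists the low-order byte first, which corresponds to the little-endian form (`utf-32-le`); the slide itself does not state a byte order.

```python
# Encode "A" as UTF-32, low-order byte first, with no byte order mark.
encoded = "A".encode("utf-32-le")

# Format each of the four bytes as 8 binary digits.
rows = [format(b, "08b") for b in encoded]
for row in rows:
    print(row)
```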
UTF-32
• Problem: text that takes 1 KB in ASCII takes 4 KB in UTF-32
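A quick way to see the fourfold growth, as a sketch: the sample text is made up, and `utf-32-le` is used so that the byte order mark does not add 4 extra bytes.

```python
text = "a" * 1024                            # 1024 ASCII characters

ascii_size = len(text.encode("ascii"))       # one byte per character
utf32_size = len(text.encode("utf-32-le"))   # four bytes per character

print(ascii_size, utf32_size)
```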
UTF-16
• Stores each character in either one 16-bit unit or two 16-bit units
– One unit: 16 bits
– Two units: 16 bits + 16 bits
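The one-unit and two-unit cases can be observed directly. In this sketch, `utf16_units` is a hypothetical helper name, and the two sample characters are chosen here for illustration; they do not come from the slides.

```python
def utf16_units(ch: str) -> list:
    """Return the 16-bit code units that UTF-16 uses for one character."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

one = utf16_units("A")            # code point 0x41 fits in a single unit
two = utf16_units("\U0001F600")   # code point 0x1F600 needs two units

print([hex(u) for u in one], [hex(u) for u in two])
```

Code points above 0xFFFF are the ones that need the second unit.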
UTF-16
• Problem: text that takes 1 KB in ASCII takes 2 KB in UTF-16
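The doubling can be confirmed the same way. This is a sketch with made-up sample text; `utf-16-le` is used so that no byte order mark is counted.

```python
text = "a" * 1024                            # 1024 ASCII characters

ascii_bytes = len(text.encode("ascii"))      # one byte per character
utf16_bytes = len(text.encode("utf-16-le"))  # one 16-bit unit per character

print(ascii_bytes, utf16_bytes)
```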
UTF-8
• It supports every language you’ll probably ever need.
• No need for Windows-1252 this and Windows-1253 that.
• Its code point range is from 0x00 to 0x10FFFF.
• It uses a variable (1 to 4) byte encoding.
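The variable length is easy to observe. In this sketch the four sample characters are chosen here, one for each length; they are not taken from the slides.

```python
# One sample character per UTF-8 length (illustrative choices).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Map each character's code point to the number of bytes UTF-8 uses for it.
lengths = {hex(ord(ch)): len(ch.encode("utf-8")) for ch in samples}

for code_point, size in lengths.items():
    print(code_point, size)
```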
UTF-8 (1-byte)
• 1-byte UTF-8 is used for code points in the range 0x00 to 0x7F.
• 1-byte UTF-8 = ASCII
• The MSBit is 0
• code point = representation
0 X X X X X X X
• Examples of 1-byte UTF-8:
– “A” -> 0100 0001
– “&” -> 0010 0110
– “5” -> 0011 0101
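The three examples can be checked in Python. This is a small sketch; the dictionary name `bits` is illustrative.

```python
bits = {}
for ch in "A&5":
    utf8 = ch.encode("utf-8")
    # For code points up to 0x7F the UTF-8 byte is the ASCII byte.
    assert utf8 == ch.encode("ascii")
    # The single byte equals the code point, and its MSBit is 0.
    assert utf8[0] == ord(ch) and utf8[0] < 0x80
    bits[ch] = format(utf8[0], "08b")

print(bits)
```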
UTF-8 (2-byte)
• 2-byte UTF-8 is used for code points in the range 0x80 to 0x07FF.
• 2-byte UTF-8: code point != representation
• The code point is broken apart into two pieces.
• The five MSBits of the code point are assigned to the first byte and the six LSBits are assigned to the second byte.
UTF-8 (2-byte)
For the first byte of 2-byte UTF-8:
• The three MSBits are set to 110
• The remaining bits are the five MSBits of the code point.
For the second byte of 2-byte UTF-8:
• The two MSBits are set to 10
• The remaining bits are the six LSBits of the code point.
UTF-8 (2-byte)
1 1 0 X X X X X   Leading Byte
1 0 X X X X X X   Continuation Byte
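The 2-byte layout can be sketched as a short Python function. `encode_2byte` is a hypothetical name, the sample code point 0xE9 is chosen here for illustration, and the range check uses the standard 2-byte range of 0x80 to 0x7FF. The result is compared against Python's built-in encoder.

```python
def encode_2byte(code_point: int) -> bytes:
    """Encode a code point in 0x80..0x7FF as 2-byte UTF-8."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point is outside the 2-byte range")
    leading = 0b11000000 | (code_point >> 6)             # 110 + five MSBits
    continuation = 0b10000000 | (code_point & 0b111111)  # 10 + six LSBits
    return bytes([leading, continuation])

encoded = encode_2byte(0xE9)
print(encoded.hex(), encoded == chr(0xE9).encode("utf-8"))
```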
UTF-8 (3-byte)
• 3-byte UTF-8 is used for code points in the range 0x0800 to 0xFFFF.
• 3-byte UTF-8: code point != representation
• The code point is broken apart into three pieces.
UTF-8 (3-byte)
• The four MSBits of the code point are assigned to the first byte.
• The middle six bits are assigned to the second byte.
• The six LSBits are assigned to the third byte.
UTF-8 (3-byte)
For the first byte of 3-byte UTF-8:
• The four MSBits are set to 1110
• The remaining bits are the four MSBits of the code point.
For the second byte of 3-byte UTF-8:
• The two MSBits are set to 10
• The remaining bits are the six middle bits of the code point.
UTF-8 (3-byte)
For the third byte of 3-byte UTF-8:
• The two MSBits are set to 10
• The remaining bits are the six LSBits of the code point.
UTF-8 (3-byte)
1 1 1 0 X X X X   Leading Byte
1 0 X X X X X X   Continuation Byte
1 0 X X X X X X   Continuation Byte
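The 3-byte layout can be sketched in the same way. `encode_3byte` is a hypothetical name and the sample code point 0x20AC is chosen here for illustration. The sketch does not reject the surrogate range 0xD800 to 0xDFFF, which a full encoder would refuse.

```python
def encode_3byte(code_point: int) -> bytes:
    """Encode a code point in 0x0800..0xFFFF as 3-byte UTF-8."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point is outside the 3-byte range")
    leading = 0b11100000 | (code_point >> 12)             # 1110 + four MSBits
    middle = 0b10000000 | ((code_point >> 6) & 0b111111)  # 10 + six middle bits
    last = 0b10000000 | (code_point & 0b111111)           # 10 + six LSBits
    return bytes([leading, middle, last])

encoded = encode_3byte(0x20AC)
print(encoded.hex(), encoded == chr(0x20AC).encode("utf-8"))
```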
UTF-8 (4-byte)
• 4-byte UTF-8 is used for code points in the range 0x10000 to 0x10FFFF.
• 4-byte UTF-8: code point != representation
• The code point is broken apart into four pieces.
UTF-8 (4-byte)
• The three MSBits of the code point (written as 21 bits) are assigned to the first byte.
• The next six bits are assigned to the second byte.
• The six bits after that are assigned to the third byte.
• The six LSBits are assigned to the fourth byte.
UTF-8 (4-byte)
For the first byte of 4-byte UTF-8:
• The five MSBits are set to 11110
• The remaining bits are the three MSBits of the code point.
For the second byte of 4-byte UTF-8:
• The two MSBits are set to 10
• The remaining bits are the next six middle bits of the code point.
UTF-8 (4-byte)
For the third byte of 4-byte UTF-8:
• The two MSBits are set to 10
• The remaining bits are the next six middle bits of the code point.
For the fourth byte of 4-byte UTF-8:
• The two MSBits are set to 10
• The remaining bits are the six LSBits of the code point.
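The four-byte rules can be sketched as follows. `encode_4byte` is a hypothetical name and the sample code point 0x1F600 is chosen here for illustration; the result is compared against Python's built-in encoder.

```python
def encode_4byte(code_point: int) -> bytes:
    """Encode a code point in 0x10000..0x10FFFF as 4-byte UTF-8."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the 4-byte range")
    first = 0b11110000 | (code_point >> 18)                # 11110 + three MSBits
    second = 0b10000000 | ((code_point >> 12) & 0b111111)  # 10 + next six bits
    third = 0b10000000 | ((code_point >> 6) & 0b111111)    # 10 + next six bits
    fourth = 0b10000000 | (code_point & 0b111111)          # 10 + six LSBits
    return bytes([first, second, third, fourth])

encoded = encode_4byte(0x1F600)
print(encoded.hex(), encoded == chr(0x1F600).encode("utf-8"))
```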
Examples
10011100101001
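Assuming the bit string above is a code point written in binary (the slide does not say so explicitly), it has 14 bits and equals 0x2729, which falls in the 3-byte range. A sketch that applies the 3-byte rules by slicing the bit string directly:

```python
bits = "10011100101001"
code_point = int(bits, 2)

# 3-byte UTF-8 carries 16 code point bits, so pad on the left with zeros.
padded = bits.zfill(16)

# Split into four MSBits, six middle bits, and six LSBits.
pieces = [padded[:4], padded[4:10], padded[10:]]

# Prefix the leading byte with 1110 and each continuation byte with 10.
encoded_bits = ["1110" + pieces[0], "10" + pieces[1], "10" + pieces[2]]
encoded = bytes(int(b, 2) for b in encoded_bits)

print(hex(code_point), encoded_bits, encoded.hex())
```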