Agenda Data Representation – Characters Encoding Schemes ASCII ISO 8859 (sets 1-15) EBCDIC UNICODE UTF-8

Agenda

Data Representation – Characters

Encoding Schemes: ASCII, ISO 8859 (sets 1-15), EBCDIC, UNICODE, UTF-8

Representing Characters

In the previous lesson, you learned how numbers are stored in computers as binary code. This lesson shows how other data (such as alphanumeric characters, special characters and control codes) are stored.

One question you may ask is “how can a binary code be used to represent numbers and characters at the same time?”

The answer is that computer programming languages can define and understand variables that store different kinds of data.

In the C programming language, variables that store data must be declared with a type that identifies how the stored binary code is interpreted: as numbers (int, float and double) or as characters and strings (char or an array of char).
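The idea that one bit pattern can be a number or a character, depending on how the program chooses to read it, can be sketched as follows. Python is used here rather than C only to keep the example short, and the variable names are illustrative, not part of the lesson.

```python
# One byte, two interpretations: the bit pattern is identical,
# only the way the program reads it differs.
raw = bytes([0b01000001])            # the single byte 0100 0001

as_number = raw[0]                   # read the byte as an unsigned integer
as_character = raw.decode("ascii")   # read the same byte as an ASCII character

print(f"bits:         {raw[0]:08b}")
print(f"as number:    {as_number}")      # 65
print(f"as character: {as_character}")   # A
```

In C the same choice is made by the declared type (for example int versus char) and by the format used when printing.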

Representing Characters

To allow consistent data transfer among computer systems (for example, with the ftp command), rules had to be created for how characters are assigned binary code combinations.

These rules or “encoding standards” have evolved over a period of time and are still evolving.

We will discuss 5 popular “encoding standards” – ASCII, ISO 8859, EBCDIC, UNICODE and UTF-8.

Encoding Scheme #1 ASCII (American Standard Code for Information Interchange)

A consistent set of rules in which series of 0’s and 1’s are used to represent characters. This allows uniform data transfer among computer systems.

ASCII is a 7-bit code (128 combinations) that evolved from computers that could only work on 7-bit codes at a time.

Computers then evolved into 8-bit machines, so a leading “0” (zero) was placed at the beginning of each code to keep the original ASCII values. The 128 codes that begin with “1” became available for additional characters, which are often referred to as the “extended ASCII” character set.

When writing programs, programmers may need to refer to these ASCII characters (or control codes) by decimal, octal or hexadecimal number, so an ASCII table is available to provide assistance. You can issue the command “man ascii” to find out more about this encoding scheme.
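As a rough illustration of looking up one character in all three number bases, the sketch below uses a hypothetical helper named describe (not a standard function) built on Python's ord and format.

```python
def describe(ch):
    """Return the ASCII code of one character in decimal, octal and hex."""
    code = ord(ch)
    if code > 0x7F:
        raise ValueError("not a 7-bit ASCII character")
    return {"dec": code, "oct": format(code, "o"), "hex": format(code, "X")}

for ch in ("A", "a", "0", "\n"):
    print(repr(ch), describe(ch))
```

The values should match the corresponding columns shown by “man ascii”.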

ASCII Table (rows: first hex digit, columns: second hex digit)

  |  0|  1|  2|  3|  4|  5|  6|  7|  8|  9|  A|  B|  C|  D|  E|  F|
--|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 |NUL|SOH|STX|ETX|EOT|ENQ|ACK|BEL| BS| HT| LF| VT| FF| CR| SO| SI|
1 |DLE|DC1|DC2|DC3|DC4|NAK|SYN|ETB|CAN| EM|SUB|ESC| FS| GS| RS| US|
2 | SP|  !|  "|  #|  $|  %|  &|  '|  (|  )|  *|  +|  ,|  -|  .|  /|
3 |  0|  1|  2|  3|  4|  5|  6|  7|  8|  9|  :|  ;|  <|  =|  >|  ?|
4 |  @|  A|  B|  C|  D|  E|  F|  G|  H|  I|  J|  K|  L|  M|  N|  O|
5 |  P|  Q|  R|  S|  T|  U|  V|  W|  X|  Y|  Z|  [|  \|  ]|  ^|  _|
6 |  `|  a|  b|  c|  d|  e|  f|  g|  h|  i|  j|  k|  l|  m|  n|  o|
7 |  p|  q|  r|  s|  t|  u|  v|  w|  x|  y|  z|  {|  ||  }|  ~|DEL|

Each character is referred to by two hexadecimal digits (read the row first, then the column). For example, “A” is in row 4, column 1, so its code is hex 41.

Special control characters, such as <LF>, run from hex 00 up to hex 1F; <DEL> (hex 7F) is also a control character.

Upper and lower case letters are arranged exactly 20 (hex) apart (“A” is hex 41, “a” is hex 61), which means a simple mathematical calculation can convert between upper and lower case.
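The row/column reading of the table and the 20 (hex) relationship between the cases can be sketched as below. The function names split_hex and swap_case are made up for this illustration.

```python
def split_hex(ch):
    """Return (row, column) of a character in the ASCII table."""
    code = ord(ch)
    return code >> 4, code & 0x0F    # first hex digit, second hex digit

def swap_case(ch):
    """Flip the case of an ASCII letter by toggling the 0x20 bit."""
    if "A" <= ch <= "Z" or "a" <= ch <= "z":
        return chr(ord(ch) ^ 0x20)
    return ch                        # non-letters are left unchanged

print(split_hex("A"))                # (4, 1)
print(swap_case("a"), swap_case("Z"), swap_case("7"))
```

Toggling the bit with XOR is one way to do the calculation; adding 20 (hex) to an upper case letter or subtracting it from a lower case letter gives the same result.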

Encoding Scheme #2 Extended ASCII: ISO 8859 (International Organization for Standardization #8859)

An encoding scheme that provides additional characters by using the extra bit added to the already existing 7-bit ASCII code.

Set 1 (ISO 8859-1) and the more recent set 15 (ISO 8859-15) are used to represent most western European symbols. Set 15 replaces a few rarely used symbols of set 1, for example to add the euro sign.

Other sets include set 2 (ISO 8859-2), used to represent most eastern European symbols, and set 10 (ISO 8859-10), used to represent Nordic, Sami and Inuit symbols, and so forth.

ISO 8859 tables are accessible from the internet by performing a net search on “ISO 8859 Tables”. There are also links on the UNX122 webpage.

You can issue command “man iso_8859_1”, etc. to find out more information regarding these encoding schemes.
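To see why the set matters, the sketch below decodes the same byte values with three ISO 8859 sets, using the codec names that ship with Python. The helper decode_byte is hypothetical, and code points are printed instead of the characters themselves so the output does not depend on the terminal.

```python
def decode_byte(value, charset):
    """Decode a single byte value using the named ISO 8859 set."""
    return bytes([value]).decode(charset)

for charset in ("iso-8859-1", "iso-8859-2", "iso-8859-15"):
    # 0x41 is in the ASCII half and decodes identically in every set;
    # 0xA3 and 0xA4 are in the upper half and depend on the set.
    points = [f"U+{ord(decode_byte(v, charset)):04X}" for v in (0x41, 0xA3, 0xA4)]
    print(charset, points)
```

Byte A4 (hex), for instance, is the generic currency sign in set 1 but the euro sign in set 15, so a file must be read with the same set it was written with.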

Encoding Scheme #3 EBCDIC (Extended Binary Coded Decimal Interchange Code)

An 8-bit binary code used on IBM mainframe computers.

The rules for how 0’s and 1’s in a binary code represent characters are different from ASCII, but there are programs (including the FTP command) that can convert ASCII files to EBCDIC files to allow transfer of data between different types of computers.

An EBCDIC Table also exists (see link on UNX122 webpage).
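A small sketch of the difference, assuming Python's "cp037" codec (one common EBCDIC code page; real mainframes may use a different one) as a stand-in for EBCDIC in general:

```python
text = "HELLO 123"

ascii_bytes = text.encode("ascii")     # ASCII byte values
ebcdic_bytes = text.encode("cp037")    # EBCDIC (code page 037) byte values

print("ASCII :", ascii_bytes.hex())
print("EBCDIC:", ebcdic_bytes.hex())

# Converting back gives the original text, which is what a
# transfer program has to do when moving files between systems.
print(ebcdic_bytes.decode("cp037"))
```

The same characters get completely different byte values, for example “A” is hex 41 in ASCII but hex C1 in this EBCDIC code page.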

Encoding Schemes for the Future

As economies move to a more “global environment” there is a move towards an encoding scheme that will simultaneously incorporate all the world language symbols into one large encoding scheme.

This would avoid “fragmented” encoding schemes previously discussed and allow programs to easily translate and transfer data among different countries.

Encoding Scheme #4 UNICODE (Universal Character Set / ISO 10646)

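Originally designed as a 16-bit encoding scheme, able to represent over 65,000 characters.

The additional 8 bits (compared with the 8-bit schemes above) make room for most world language symbols. The standard has since been extended beyond 16 bits: code points now run from 0 to 10FFFF (hex), and characters above FFFF (hex) need more than one 16-bit unit.

Unicode is in use on many PCs running Windows 98 and up, and is considered to be the latest trend in data representation to foster global communication.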
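As an illustration, the sketch below looks up the Unicode code point of a few characters and counts how many 16-bit units each needs when stored as UTF-16. The helper utf16_units is a made-up name, and the sample characters are arbitrary choices.

```python
def utf16_units(ch):
    """Number of 16-bit units needed to store one character as UTF-16."""
    return len(ch.encode("utf-16-be")) // 2

samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]   # A, e-acute, euro sign, an emoji
for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf16_units(ch)} unit(s)")
```

Characters up to FFFF (hex) fit in a single 16-bit unit; anything above that takes two.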

Encoding Scheme #5 UTF-8

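Storing UNICODE as 16-bit codes has some potential problems, such as backward compatibility for programs that were originally written for the ASCII and ISO 8859 encoding schemes.

UTF-8 is an encoding scheme for UNICODE characters that solves this problem and others. It stores each ASCII character as the same single byte that ASCII uses, so existing ASCII files are already valid UTF-8. All other characters, including the extra characters from the ISO 8859 sets, are stored as sequences of two to four bytes.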
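The backward compatibility with ASCII can be checked with a short sketch like the one below. The helper name utf8_hex is hypothetical, and the sample characters are arbitrary.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of one character as a hex string."""
    return ch.encode("utf-8").hex()

for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")

# An ASCII character is the same single byte in ASCII and in UTF-8.
print("A".encode("ascii") == "A".encode("utf-8"))    # True
```

The first line of output shows a single byte (41), while the other characters take two, three and four bytes respectively.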