Unicode basics in python

Page 1: Unicode basics in python

Unicode

in python

Page 2: Unicode basics in python

What we cover

● Unicode history
● Key terms (code point, BOM, UTF-8, UTF-16)
● Decoding and encoding in Python
● How Django handles them
● Helpful Python modules for tackling it

Note: a BOM (byte order mark) is used with UTF-16 because each of its code units spans more than one byte, so the reader has to be told the byte order.

Page 3: Unicode basics in python

How did it come about?

Americans came up with 7-bit ASCII, which covers only the English alphabet, as a standard for exchanging information ('A' = 65, 'a' = 97).

The rest of the world added accented and other non-English characters (such as 'ä') in their own incompatible ways, and it became a mess.

Page 4: Unicode basics in python

Why was Unicode born?

To exchange information in all languages, there were some requirements:

● A unique and simple rule was needed
● It had to be adoptable across all machines (Windows, IBM, etc.)
● Storage had to be as efficient as possible

Page 5: Unicode basics in python

Unicode

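Unicode = UCS (Universal Character Set) + bit representation logic

UCS: character + code point ('a', 97)

Bit representation: the BOM (byte order mark) tells the reader whether the bytes are big endian or little endian. For example, "Hello" in UTF-16:

Big endian:    00 48 00 65 00 6C 00 6C 00 6F
Little endian: 48 00 65 00 6C 00 6C 00 6F 00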
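The two byte sequences on this slide can be reproduced with a short sketch. It assumes Python 3 (where text is `str` and encoded data is `bytes`); the helper name `as_hex` is made up for this illustration.

```python
import codecs

text = "Hello"

# Same code points, two possible byte orders for each 16-bit code unit.
big_endian = text.encode("utf-16-be")      # most significant byte first
little_endian = text.encode("utf-16-le")   # least significant byte first


def as_hex(data):
    """Format bytes as space-separated uppercase hex pairs (illustrative helper)."""
    return " ".join(f"{byte:02X}" for byte in data)


print(as_hex(big_endian))      # 00 48 00 65 00 6C 00 6C 00 6F
print(as_hex(little_endian))   # 48 00 65 00 6C 00 6C 00 6F 00

# The BOM is U+FEFF written at the start of the data; the order of its
# two bytes tells the reader which layout follows.
print(as_hex(codecs.BOM_UTF16_BE))  # FE FF
print(as_hex(codecs.BOM_UTF16_LE))  # FF FE

# The generic "utf-16" codec reads the BOM and picks the byte order itself.
with_bom = codecs.BOM_UTF16_LE + little_endian
print(with_bom.decode("utf-16"))    # Hello
```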

Page 6: Unicode basics in python

UTF-8 is popular because

● It is a multi-byte encoding
● It is a variable-width encoding
● A code point takes 1 to 4 bytes
● It usually needs no BOM, because its code unit is a single byte (8 bits)
● It is memory efficient

How? For non-ASCII characters, the leading bits of the first byte indicate how many bytes the character uses. ASCII characters stay at one byte each, so mostly-ASCII text stays compact.
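The first-byte rule can be checked with a small sketch, again assuming Python 3. The function name `length_from_first_byte` is invented for this example; it simply reads the leading bits as described above.

```python
def length_from_first_byte(first_byte):
    """Return how many bytes a UTF-8 sequence has, judging by its first byte."""
    if first_byte < 0x80:            # 0xxxxxxx: plain ASCII, one byte
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 leading byte")


# 'a', 'a' with diaeresis, the euro sign, and an emoji (written as escapes).
samples = ["a", "\u00e4", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    bits = format(encoded[0], "08b")
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), first byte {bits}")
    # The leading bits of the first byte agree with the real length.
    assert length_from_first_byte(encoded[0]) == len(encoded)
```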

Page 7: Unicode basics in python

decoding

● Converts bytes into characters, each identified by its numeric value (code point)
● In Python 2: from <type 'str'> to <type 'unicode'>
● When it fails, it usually raises UnicodeDecodeError
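A minimal decoding sketch, assuming Python 3, where the same step goes from `bytes` to `str`. The sample bytes are chosen only for illustration.

```python
raw = b"caf\xc3\xa9"          # five bytes: "caf" plus a two-byte UTF-8 sequence

# Decoding with the right codec gives four characters.
text = raw.decode("utf-8")
print(len(raw), len(text))    # 5 4

# Decoding with the wrong codec raises UnicodeDecodeError.
decode_failed = False
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    decode_failed = True
    print("decode failed at byte offset", exc.start)

# An error handler trades the exception for replacement characters (U+FFFD).
lossy = raw.decode("ascii", errors="replace")
```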

(samples demo)

Page 8: Unicode basics in python

encoding

● Converts characters (code points) into bytes
● In Python 2: from <type 'unicode'> to <type 'str'>
● When it fails, it usually raises UnicodeEncodeError
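The reverse direction, sketched under the same Python 3 assumption (`str` to `bytes`). The sample text is illustrative only.

```python
text = "caf\u00e9"                     # four characters, the last one non-ASCII

# UTF-8 can represent every code point, so this always succeeds.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)                      # b'caf\xc3\xa9'

# ASCII has no byte for U+00E9, so strict encoding raises UnicodeEncodeError.
encode_failed = False
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    encode_failed = True
    print("encode failed at character index", exc.start)

# With errors="replace" the unencodable character becomes "?".
fallback = text.encode("ascii", errors="replace")
print(fallback)                        # b'caf?'
```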

(samples demo)

Page 9: Unicode basics in python

Rules to Remember…

● Decode early, Unicode everywhere, encode late
● UTF-8 is the best guess for an encoding
● chardet.detect() can guess when the encoding is unknown

==========================

In Python 3 this is solved...

● str is a Unicode object (<class 'str'>)
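The "decode early, Unicode everywhere, encode late" rule might be sketched as follows in Python 3. The function `shout` is a hypothetical example, not something from the talk.

```python
def shout(raw_input, encoding="utf-8"):
    """Upper-case some encoded text (hypothetical example function)."""
    text = raw_input.decode(encoding)   # decode early: bytes -> str at the boundary
    result = text.upper()               # Unicode everywhere: work on str only
    return result.encode(encoding)      # encode late: str -> bytes on the way out


raw = b"caf\xc3\xa9"

# Working on the text upper-cases the accented letter too.
print(shout(raw))      # b'CAF\xc3\x89'

# Working directly on the bytes only touches the ASCII letters.
print(raw.upper())     # b'CAF\xc3\xa9'
```

Keeping the decode and encode steps at the edges of the program means everything in between deals with one type only.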

Page 10: Unicode basics in python

How does Django handle it?

>>> def to_unicode(obj, encoding='utf-8'):
...     if isinstance(obj, basestring):
...         if not isinstance(obj, unicode):
...             obj = unicode(obj, encoding)
...     return obj

Helpers in django.utils.encoding:

smart_text(s, encoding='utf-8', strings_only=False, errors='strict')
force_text(s, encoding='utf-8', strings_only=False, errors='strict')
smart_bytes(s, encoding='utf-8', strings_only=False, errors='strict')
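The to_unicode helper on this slide is Python 2 code (`basestring` and `unicode` no longer exist in Python 3). A rough Python 3 counterpart could look like the sketch below. The name `to_text` is made up here, and this is not Django's implementation, only the same idea in a few lines.

```python
def to_text(obj, encoding="utf-8", errors="strict"):
    """Return obj as str if it is bytes-like; otherwise return it unchanged.

    Hypothetical helper for illustration, not part of Django.
    """
    if isinstance(obj, (bytes, bytearray)):
        return bytes(obj).decode(encoding, errors)
    return obj


print(to_text(b"caf\xc3\xa9") == "caf\u00e9")   # True: bytes are decoded
print(to_text("already text"))                  # str passes through untouched
print(to_text(42))                              # non-strings pass through too
```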

Page 11: Unicode basics in python

How to set your Python default encoding?

>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
>>> sys.getdefaultencoding()
'utf-8'

(or)

# -*- coding: utf-8 -*-
(tells Python that you saved <mod_name.py> in UTF-8)

Page 12: Unicode basics in python

Related python modules..

● chardet.detect()
● unicodedata
● codecs
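A brief look at the two standard-library modules from this list, assuming Python 3. chardet is a third-party package and is left out so the sketch runs without extra installs.

```python
import codecs
import unicodedata

char = "\u00e4"   # 'a' with diaeresis

# unicodedata: look up properties of a code point.
print(unicodedata.name(char))       # LATIN SMALL LETTER A WITH DIAERESIS
print(unicodedata.category(char))   # Ll (letter, lowercase)

# unicodedata: the same visible character can be one or two code points.
decomposed = unicodedata.normalize("NFD", char)        # 'a' + combining U+0308
recomposed = unicodedata.normalize("NFC", decomposed)  # back to a single code point
print(len(decomposed), len(recomposed))                # 2 1

# codecs: resolve an encoding alias to its canonical codec name.
info = codecs.lookup("UTF8")
print(info.name)                    # utf-8
```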

Page 13: Unicode basics in python

Thanks for your time

Post your questions.

Page 14: Unicode basics in python

samples demo….
