Chapter 5. Bytes and Octets, ASCII and Unicode Early on bytes could be anywhere from 5 to 9 bits so...


Bytes and Octets, ASCII and Unicode

Early on, a byte could be anywhere from 5 to 9 bits, so the term octet came into use to say exactly what we were talking about: 8 bits.

Today bytes are universally 8 bits, so we have two names for the same thing.

Unicode is an expansion of ASCII: ASCII is a 7-bit code with 128 characters, while Unicode code points run up to U+10FFFF and do not fit in 16 bits.

The authors recommend always using Unicode for strings (but don't follow their own advice).

elvish = u'Namárië!'

Unicode 2 Network

Unicode characters that need to be transmitted across a network are sent as octets.

We need a Unicode2Network conversion scheme.

Enter 'utf-8'

For example, the utf-8 encoding of the character ë is the two octets C3 AB.

In the output below, printable bytes appear as themselves and unprintable bytes appear as \xnn, where nn is a hexadecimal value.

>>> elvish = u'Namárië!'
>>> elvish.encode('utf-8')
'Nam\xc3\xa1ri\xc3\xab!'
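The session above is Python 2. As a rough sketch, assuming Python 3 (where text is str and encoded data is bytes), the same encoding step might look like this:

```python
# Python 3 sketch (assumption: the slides themselves use Python 2).
elvish = 'Nam\u00e1ri\u00eb!'       # 'Namárië!' written with escapes

encoded = elvish.encode('utf-8')    # str -> bytes
print(encoded)                      # unprintable bytes are shown as \xnn

# Each accented letter becomes two octets, so the byte string is
# longer than the text it encodes.
print(len(elvish), len(encoded))

e_umlaut = '\u00eb'.encode('utf-8')  # the two octets C3 AB
```

The point is only that the character count and the octet count differ once non-ASCII characters are involved.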

Other Encodings

There are many choices for encoding schemes.

utf-16: the leading '\xff\xfe' is the byte order mark; every other character here is represented in 2 octets, typically <p>\x00, where <p> means “printable”.

>>> elvish.encode('utf-16')
'\xff\xfeN\x00a\x00m\x00\xe1\x00r\x00i\x00\xeb\x00!\x00'
>>> elvish.encode('cp1252')
'Nam\xe1ri\xeb!'
>>> elvish.encode('idna')
'xn--namri!-rta6f'
>>> elvish.encode('cp500')
'\xd5\x81\x94E\x99\x89SO'

Decodings:

Upon receipt, byte streams need to be decoded. To do this you need to know which encoding was used; then things are easy.

>>> print 'Nam\xe1ri\xeb!'.decode('cp1252')
Namárië!

Decodings:

Note that if you are not “printing”, decode returns a unicode object (a universal representation of the original string) and the interpreter shows its repr.

>>> 'Nam\xe1ri\xeb!'.decode('cp1252')
u'Nam\xe1ri\xeb!'
>>> print 'Nam\xe1ri\xeb!'.decode('cp1252')
Namárië!
>>> '\xd5\x81\x94E\x99\x89SO'.decode('cp500')
u'Nam\xe1ri\xeb!'
>>> 'xn--namri!-rta6f'.decode('idna')
u'nam\xe1ri\xeb!'
>>> '\xff\xfeN\x00a\x00m\x00\xe1\x00r\x00i\x00\xeb\x00!\x00'.decode('utf-16')
u'Nam\xe1ri\xeb!'
>>> 'Nam\xc3\xa1ri\xc3\xab!'.decod('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decod'
>>> 'Nam\xc3\xa1ri\xc3\xab!'.decode('utf-8')
u'Nam\xe1ri\xeb!'

Do it yourself; or not!

If you use high-level protocols (and their libraries) like HTTP, encoding is done for you.

If not, you'll need to do it yourself.

Not supported:

ASCII is a 7-bit code, so it can't encode characters outside its 128-character range.

>>> elvish.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 3: ordinal not in range(128)
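If raising an exception is not what you want, the codec machinery also accepts an error handler. A small sketch, assuming Python 3; note that both handlers shown here lose information, so they are a fallback and not a substitute for picking a suitable encoding:

```python
# Python 3 sketch: what happens when a codec cannot represent a character.
elvish = 'Nam\u00e1ri\u00eb!'   # 'Namárië!'

try:
    elvish.encode('ascii')
except UnicodeEncodeError as exc:
    failed_at = exc.start       # index of the first unencodable character

# Lossy alternatives selected with the errors= argument.
replaced = elvish.encode('ascii', errors='replace')  # '?' is substituted
ignored = elvish.encode('ascii', errors='ignore')    # characters are dropped

print(failed_at, replaced, ignored)
```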

Variable length encodings:

Some codecs encode different characters in different lengths.

For example, utf-16 uses either 16 or 32 bits to encode a character.

utf-16 also adds prefix bytes (the byte order mark \xff\xfe).

All these things make it hard to pick out individual characters.
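To make the variable lengths concrete, here is a short sketch, assuming Python 3. The four sample characters are chosen only for illustration:

```python
# Python 3 sketch: the same character can take a different number of
# octets depending on the codec, and lengths vary within one codec.
samples = ['A', '\u00eb', '\u20ac', '\U0001F600']  # A, e-umlaut, euro sign, an emoji

for ch in samples:
    utf8 = ch.encode('utf-8')
    utf16 = ch.encode('utf-16-le')   # explicit byte order, so no prefix is added
    print('U+%04X  utf-8: %d octets  utf-16: %d octets'
          % (ord(ch), len(utf8), len(utf16)))

# Plain 'utf-16' prepends a 2-octet byte order mark to the whole string.
with_bom = 'A'.encode('utf-16')
print(with_bom)
```

Because of this, slicing an encoded byte string at an arbitrary offset can cut a character in half; it is usually safer to decode first and index the text.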

Network Byte Order

Byte order is either big-endian or little-endian; network byte order is big-endian.

Typically needed for binary data. Text is handled by encoding and by knowing where your message ends (framing).

Problem: Send 4253 across a network connection.

Solution 1: Send '4253'

Problem: Need to convert string <--> number. Lots of arithmetic.

Still, lots of situations do exactly this (HTTP, for example, since it is a text protocol).

We used to use dense binary protocols, but we do so less and less.
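Solution 1 can be sketched in a couple of lines, assuming Python 3. The conversions in each direction are where the "lots of arithmetic" hides:

```python
# Python 3 sketch: sending a number as text.
number = 4253

wire = str(number).encode('ascii')     # four octets, one per decimal digit
received = int(wire.decode('ascii'))   # parse the digits back into an int

print(wire, received)
```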

How does Python see 4253?

Python stores a number in binary; we can look at its hex representation as follows:

>>> hex(4253)
'0x109d'

Each hex digit is 4 bits.

Computers store this value in memory in big-endian (most significant byte first) or little-endian (least significant byte first) format.
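The two byte orders can be seen directly with the integer methods to_bytes() and from_bytes(). This is a sketch assuming Python 3, and the 4-byte width is an arbitrary choice for the example:

```python
# Python 3 sketch: the same value laid out in both byte orders.
number = 4253   # 0x109d

big = number.to_bytes(4, byteorder='big')        # most significant byte first
little = number.to_bytes(4, byteorder='little')  # least significant byte first

print(big.hex(), little.hex())

# Reading the bytes back only works if both sides agree on the order.
round_trip = int.from_bytes(big, byteorder='big')
```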

Python's perspective on a religious war.

Python is agnostic.

'<': little-endian

'>': big-endian

'i': integer (4 bytes)

'!': network byte order (big-endian)

>>> import struct
>>> struct.pack('<i', 4253)
'\x9d\x10\x00\x00'
>>> struct.pack('>i', 4253)
'\x00\x00\x10\x9d'
>>> struct.pack('!i', 4253)
'\x00\x00\x10\x9d'
>>> struct.unpack('!i', '\x00\x00\x10\x9d')
(4253,)
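A format string can describe several fields at once. The following sketch assumes Python 3, and the two-field header layout is invented purely for illustration; it is not taken from the slides or the book:

```python
import struct

# Hypothetical header layout, invented for this example:
# a 2-byte unsigned message type followed by a 4-byte unsigned length,
# both in network byte order.
HEADER = struct.Struct('!HI')

packed = HEADER.pack(7, 4253)
msg_type, length = HEADER.unpack(packed)

print(HEADER.size, packed, msg_type, length)
```

With the '!' prefix the fields use standard sizes with no padding, so the header is always 6 bytes regardless of the machine that packs it.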

Older Approaches

htons(), htonl(), ntohs() and ntohl().

Authors say, “Don't do it”.
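One way to see why is to compare the older functions with struct. A sketch assuming Python 3: the older functions return another integer whose value depends on the host's byte order, while struct gives you the actual bytes to send.

```python
import socket
import struct
import sys

port = 4253   # 0x109d

converted = socket.htons(port)      # still an int; value depends on the host
restored = socket.ntohs(converted)  # converts back to the original value

# struct produces the bytes themselves, in network byte order.
wire = struct.pack('!H', port)

print(sys.byteorder, hex(converted), wire)
```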

Framing

UDP does framing for you. Data is delivered in the same chunks in which it was received from the application.

In TCP you have to frame your own transmitted data.

Framing answers the question, “When is it safe to stop calling recv()?”

Simple Example: Single Stream

Send data with no reply

import socket, sys

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
HOST = sys.argv.pop() if len(sys.argv) == 3 else '127.0.0.1'
PORT = 1060

if sys.argv[1:] == ['server']:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((HOST, PORT))
    s.listen(1)
    print 'Listening at', s.getsockname()
    sc, sockname = s.accept()
    print 'Accepted connection from', sockname
    sc.shutdown(socket.SHUT_WR)
    message = ''
    while True:
        more = sc.recv(8192)  # arbitrary value of 8k
        if not more:  # socket has closed when recv() returns ''
            break
        message += more
    print 'Done receiving the message; it says:'
    print message
    sc.close()
    s.close()

Simple Example

elif sys.argv[1:] == ['client']:
    s.connect((HOST, PORT))
    s.shutdown(socket.SHUT_RD)
    s.sendall('Beautiful is better than ugly.\n')
    s.sendall('Explicit is better than implicit.\n')
    s.sendall('Simple is better than complex.\n')
    s.close()

else:
    print >>sys.stderr, 'usage: streamer.py server|client [host]'

Simple Example: Streaming in both directions; one RQ, one RP

Important caveat: Always complete streaming in one direction before beginning in the opposite direction. If not, deadlock can happen.

Simple Example: Fixed Length Messages

In this case use TCP's sendall() and write your own recvall().

This rarely happens in practice.

def recvall(sock, length):
    data = ''
    while len(data) < length:
        more = sock.recv(length - len(data))
        if not more:
            raise EOFError('socket closed %d bytes into a %d-byte message'
                           % (len(data), length))
        data += more
    return data
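For comparison, here is a sketch of the same idea assuming Python 3, where recv() returns bytes. The socketpair() demonstration at the end is just a local stand-in for a real TCP connection:

```python
import socket

def recvall(sock, length):
    """Read exactly `length` bytes from `sock`, or raise EOFError."""
    chunks = []
    remaining = length
    while remaining:
        more = sock.recv(remaining)
        if not more:  # recv() returns b'' once the peer has closed
            raise EOFError('socket closed %d bytes into a %d-byte message'
                           % (length - remaining, length))
        chunks.append(more)
        remaining -= len(more)
    return b''.join(chunks)

# Local demonstration over a connected pair of sockets.
a, b = socket.socketpair()
a.sendall(b'0123456789ABCDEF')
first = recvall(b, 10)    # a fixed-length 10-byte message
second = recvall(b, 6)    # followed by a fixed-length 6-byte message
a.close()
b.close()
```

Collecting the pieces in a list and joining once avoids repeatedly copying a growing byte string, which is a design choice of this sketch and not something the slide prescribes.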

Simple Example: Delimit Message with Special Characters.

Use a delimiter character outside the range of possible message characters. This does not work if the message is binary, since any byte value may then occur.

The authors' recommendation is to use this only if you know the message “alphabet” is limited.

If the delimiter can occur in the message, then “escape” it inside the message.

Using this approach has issues: recognizing an escaped character, removing the escaping upon arrival, and message length.
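A minimal sketch of delimiter framing, assuming Python 3 and assuming '\n' never occurs inside a message (so no escaping is needed). The helper name split_messages and the sample chunks are made up for the example; the chunks stand in for successive recv() results:

```python
def split_messages(buffer, delimiter=b'\n'):
    """Return (complete_messages, leftover) for the bytes received so far."""
    *messages, leftover = buffer.split(delimiter)
    return messages, leftover

buffer = b''
received = []
# Chunk boundaries deliberately do not line up with message boundaries.
for chunk in [b'Beautiful is bet', b'ter than ugly.\nExplicit', b' is better\n']:
    buffer += chunk
    messages, buffer = split_messages(buffer)
    received.extend(messages)

print(received, buffer)
```

The leftover must be carried into the next round, because a recv() call can end in the middle of a message.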

Simple Example: Prefix message with its length

Popular with binary data.

Don't forget to “frame” the length itself.

What if this is your choice but you don't know in advance the length of the message? Divide your message up into known length segments and send them separately. Now all you need is a signal for the final segment.
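The segment idea can be sketched in memory, assuming Python 3. Here a zero-length block is used as the signal for the final segment; the function names and the 8-byte block size are illustrative choices. An io.BytesIO object stands in for the connection, so on a real socket each read() would need a loop that keeps calling recv() until enough bytes arrive:

```python
import io
import struct

LENGTH = struct.Struct('!I')   # 4-byte unsigned length, network byte order

def write_blocks(stream, data, block_size):
    """Write data as length-prefixed blocks, then a zero-length end marker."""
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        stream.write(LENGTH.pack(len(block)) + block)
    stream.write(LENGTH.pack(0))

def read_blocks(stream):
    """Read blocks until the zero-length marker; return the joined data."""
    parts = []
    while True:
        (length,) = LENGTH.unpack(stream.read(LENGTH.size))
        if length == 0:
            return b''.join(parts)
        parts.append(stream.read(length))

stream = io.BytesIO()
write_blocks(stream, b'Simple is better than complex.', block_size=8)
stream.write(b'trailing')      # whatever follows the framed message
stream.seek(0)
message = read_blocks(stream)
rest = stream.read()
```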

Listing 5-2.

#!/usr/bin/env python
# Foundations of Python Network Programming - Chapter 5 - blocks.py
# Sending data one block at a time.

import socket, struct, sys
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

HOST = sys.argv.pop() if len(sys.argv) == 3 else '127.0.0.1'
PORT = 1060
format = struct.Struct('!I')  # for messages up to 2**32 - 1 in length

def recvall(sock, length):
    data = ''
    while len(data) < length:
        more = sock.recv(length - len(data))
        if not more:
            raise EOFError('socket closed %d bytes into a %d-byte message'
                           % (len(data), length))
        data += more
    return data

Listing 5-2.

def get(sock):
    lendata = recvall(sock, format.size)
    (length,) = format.unpack(lendata)
    return recvall(sock, length)

def put(sock, message):
    # sendall() rather than send(): send() may transmit only part of the data
    sock.sendall(format.pack(len(message)) + message)

if sys.argv[1:] == ['server']:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((HOST, PORT))
    s.listen(1)
    print 'Listening at', s.getsockname()
    sc, sockname = s.accept()
    print 'Accepted connection from', sockname
    sc.shutdown(socket.SHUT_WR)
    while True:
        message = get(sc)
        if not message:  # a zero-length block marks the end
            break
        print 'Message says:', repr(message)
    sc.close()
    s.close()

Listing 5-2.

elif sys.argv[1:] == ['client']:
    s.connect((HOST, PORT))
    s.shutdown(socket.SHUT_RD)
    put(s, 'Beautiful is better than ugly.')
    put(s, 'Explicit is better than implicit.')
    put(s, 'Simple is better than complex.')
    put(s, '')  # zero-length block signals the end
    s.close()

else:
    print >>sys.stderr, 'usage: blocks.py server|client [host]'

HTTP Example:

• Uses a delimiter, '\r\n\r\n', to end the header, and the Content-Length field in the header to frame the body, which may be purely binary data.
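The combination of the two framing styles can be sketched as follows, assuming Python 3. This is a deliberately simplified illustration and not a real HTTP parser: the function name is made up, the whole message is assumed to be in memory already, and a missing Content-Length is treated as an empty body.

```python
def split_http_message(raw):
    """Split raw bytes into (header_lines, body, leftover).

    Simplified sketch: assumes all of the message is already in `raw`.
    """
    head, _, rest = raw.partition(b'\r\n\r\n')   # delimiter ends the header
    lines = head.decode('ascii').split('\r\n')
    length = 0
    for line in lines[1:]:                       # skip the status line
        name, _, value = line.partition(':')
        if name.strip().lower() == 'content-length':
            length = int(value.strip())
    return lines, rest[:length], rest[length:]   # length frames the body

raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Type: application/octet-stream\r\n'
       b'Content-Length: 4\r\n'
       b'\r\n'
       b'\x00\x01\r\nNEXT')
lines, body, leftover = split_http_message(raw)
print(lines[0], body, leftover)
```

Note that the body in this example contains '\r\n' itself, which is harmless because the body is framed by length and not by a delimiter.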

Pickles:

• pickle is the native serialization built into Python.

• Serialization is used to send objects that include pointers across the network, where the pointers will have to be rebuilt.

• Pickling is a mix of text and data:

>>> import pickle
>>> pickle.dumps([5,6,7])
'(lp0\nI5\naI6\naI7\na.'

• At the other end:

>>> pickle.loads('(lp0\nI5\naI6\naI7\na.An apple day')
[5, 6, 7]

Pickles:

• The problem in the network case is that we can't tell how many bytes of pickle data were consumed before we get to what follows (“An apple day”).

• If we use the load() function on a file instead, then the file pointer is maintained and we can ask its location.

• Remember that Python lets you turn a socket into a file object – makefile().

>>> from StringIO import StringIO
>>> f = StringIO('(lp0\nI5\naI6\naI7\na.An apple day')
>>> pickle.load(f)
[5, 6, 7]
>>> f.pos
18
>>> f.read()
'An apple day'
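A rough equivalent of the session above, assuming Python 3, where pickles are bytes and io.BytesIO plays the role of StringIO. The pickled bytes differ from the Python 2 output shown on the slide because the default protocol is different, so the sketch avoids hard-coding them:

```python
import io
import pickle

payload = pickle.dumps([5, 6, 7]) + b'An apple day'

f = io.BytesIO(payload)
obj = pickle.load(f)     # stops reading at the end of the pickle
consumed = f.tell()      # how many bytes the pickle occupied
rest = f.read()          # whatever followed it in the stream

print(obj, consumed, rest)
```

Keep in mind that unpickling data from an untrusted peer is unsafe, since a pickle can cause arbitrary code to run when it is loaded.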

JSON

• Popular and easily allows data exchange between software written in different languages.

• Does not support framing.

• JSON supports Unicode but not binary (see BSON)

• See Chapter 18

>>> import json
>>> json.dumps([51,u'Namárië!'])
'[51, "Nam\\u00e1ri\\u00eb!"]'
>>> json.loads('{"name": "lancelot", "quest" : "Grail"}')
{u'quest': u'Grail', u'name': u'lancelot'}
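Since JSON does not frame itself, the receiver has to work out where one value ends. One possible approach, sketched here assuming Python 3 and assuming the complete text of each value is already in memory, is the standard library's JSONDecoder.raw_decode(), which returns the decoded value together with the index where it stopped:

```python
import json

# Two JSON values back to back, with nothing separating them.
text = '[51, "Nam\\u00e1ri\\u00eb!"]{"quest": "Grail"}'

decoder = json.JSONDecoder()
first, end = decoder.raw_decode(text)         # end: index just past the first value
second, end2 = decoder.raw_decode(text, end)  # continue from that index

print(first, second)
```

This does not help when a value is only partially received; in that case you still need a length prefix or a delimiter around the JSON text.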

XML

• Popular and easily allows data exchange between software written in different languages.

• Does not support framing.

• Best for text documents.

• See Chapter 10

Compression

• Time spent transmitting is often much longer than time spent pre- and post-processing the exchanged data.

• HTTP lets client and server decide whether to compress or not.

• zlib is self-framing. Start feeding it a compressed data stream and it will know when the stream has come to an end.

>>> import zlib
>>> data = zlib.compress('sparse')+'.'+zlib.compress('flat')+'.'
>>> data
'x\x9c+.H,*N\x05\x00\t\r\x02\x8f.x\x9cK\xcbI,\x01\x00\x04\x16\x01\xa8.'
>>> len(data)
28

(The '.' separators were not compressed.)

Compression

• Suppose the previous data arrives in 8-byte chunks.

>>> dobj = zlib.decompressobj()
>>> dobj.decompress(data[0:8]), dobj.unused_data
('spars', '')

• The empty unused_data indicates we haven't reached EOF; we are still expecting more data.

>>> dobj.decompress(data[8:16]), dobj.unused_data
('e', '.x')

• The non-empty unused_data says we consumed the first compressed bit and some data was left unused.

Compression

• Skip over the '.' and start to decompress the rest of the compressed data.

>>> dobj2 = zlib.decompressobj()
>>> dobj2.decompress('x'), dobj2.unused_data
('', '')
>>> dobj2.decompress(data[16:24]), dobj2.unused_data
('flat', '')
>>> dobj2.decompress(data[24:]), dobj2.unused_data
('', '.')

The unused data is the final '.'; the point is that the stuff we have gathered so far, '' + 'flat' + '', consists of all the data compressed by the 2nd use of zlib.compress().

NOTE: Using zlib regularly provides its own framing.
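The chunk-by-chunk procedure can be wrapped in a loop. This is a sketch assuming Python 3 (where the data is bytes and decompression objects have an eof attribute); the function name and the 8-byte chunk size are illustrative:

```python
import zlib

data = zlib.compress(b'sparse') + b'.' + zlib.compress(b'flat') + b'.'

def read_one_stream(data, chunk_size=8):
    """Feed chunks to a decompressor until its stream ends.

    Returns (decompressed, leftover), where leftover is everything
    that followed the compressed stream.
    """
    dobj = zlib.decompressobj()
    out = []
    for start in range(0, len(data), chunk_size):
        out.append(dobj.decompress(data[start:start + chunk_size]))
        if dobj.eof:
            # unused_data holds the bytes of this chunk past the stream's end
            leftover = dobj.unused_data + data[start + chunk_size:]
            return b''.join(out), leftover
    raise EOFError('compressed stream is incomplete')

first, rest = read_one_stream(data)
second, rest = read_one_stream(rest[1:])   # skip the '.' separator

print(first, second, rest)
```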

Network Exceptions:

• Many possibilities, some specific (socket.timeout) and some generic (socket.error).

• Homework: Write two short Python scripts. The first opens a UDP socket connected to a remote socket. The second tries to send data to the first program's socket but will fail, since its socket is not the one the other was “connected” to. Find out the exact error that Python returns, along with the value of errno.

• Familiar exceptions – socket.gaierror, socket.error, socket.timeout.
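As a small illustration of catching a specific exception before a generic one, here is a sketch assuming Python 3, where socket.error is an alias of OSError. It only demonstrates socket.timeout on a local UDP socket that nobody sends to; it is not the homework scenario:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(('127.0.0.1', 0))      # any free local port
s.settimeout(0.1)             # wait at most 0.1 s for a datagram

try:
    s.recvfrom(1024)          # nothing is being sent, so this should time out
    outcome = 'received data'
except socket.timeout:        # the specific exception comes first
    outcome = 'timeout'
except OSError as exc:        # the generic case; exc.errno carries the error code
    outcome = 'error %s' % exc.errno
finally:
    s.close()

print(outcome)
```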