Chapter 5

Bytes and Octets, ASCII and Unicode

Early on, a byte could be anywhere from 5 to 9 bits, so the term octet came into use to say exactly what we were talking about: 8 bits.

Today bytes are universally 8 bits, so we have two names for the same thing.

Unicode is an expansion of ASCII: ASCII defines 128 characters (7-bit codes), while Unicode code points run up to U+10FFFF, well beyond what fits in 16 bits.

Authors recommend always using Unicode for strings (but don't follow their own advice).

elvish = u'Namárië!'

Unicode 2 Network

Unicode characters that need to be transmitted across a network are sent as octets.

We need a Unicode2Network conversion scheme.

Enter 'utf-8'

For example, the utf-8 encoding of the character ë is the two octets C3 AB.

In the output below, printable octets are shown as themselves and unprintable octets as \xnn, where nn is a hexadecimal value.

>>> elvish = u'Namárië!'
>>> elvish.encode('utf-8')
'Nam\xc3\xa1ri\xc3\xab!'
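The same encoding step can be sketched in Python 3, where text (str) and octets (bytes) are separate types. The slides use Python 2, so the u'' prefix and the way results are displayed differ; the escapes \u00e1 and \u00eb below are just á and ë written in a source-encoding-proof way.

```python
# Python 3 sketch: str holds Unicode text, bytes holds octets.
elvish = 'Nam\u00e1ri\u00eb!'      # 'Namárië!'

encoded = elvish.encode('utf-8')   # text -> octets for the network

print(encoded)                     # the two accented letters take two octets each
print(len(elvish), 'characters ->', len(encoded), 'octets')
```

The character count and the octet count differ, which is why lengths must always be measured on the encoded bytes when framing a message.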

Other Encodings

There are many choices for encoding schemes.

utf-16: the leading '\xff\xfe' is a byte order mark; every other character in this string is represented in 2 octets, typically <p>\x00 where <p> means “printable”.

>>> elvish.encode('utf-16')
'\xff\xfeN\x00a\x00m\x00\xe1\x00r\x00i\x00\xeb\x00!\x00'
>>> elvish.encode('cp1252')
'Nam\xe1ri\xeb!'
>>> elvish.encode('idna')
'xn--namri!-rta6f'
>>> elvish.encode('cp500')
'\xd5\x81\x94E\x99\x89SO'

Decodings:

Upon receipt, byte streams need to be decoded. To do this you need to know which encoding was used; after that, things are easy.

>>> print 'Nam\xe1ri\xeb!'.decode('cp1252')
Namárië!

Decodings:

Note that decode() returns a unicode object; if you are not “printing” it, the interpreter shows its escaped representation rather than the characters themselves.

>>> 'Nam\xe1ri\xeb!'.decode('cp1252')
u'Nam\xe1ri\xeb!'
>>> print 'Nam\xe1ri\xeb!'.decode('cp1252')
Namárië!
>>> '\xd5\x81\x94E\x99\x89SO'.decode('cp500')
u'Nam\xe1ri\xeb!'
>>> 'xn--namri!-rta6f'.decode('idna')
u'nam\xe1ri\xeb!'
>>> '\xff\xfeN\x00a\x00m\x00\xe1\x00r\x00i\x00\xeb\x00!\x00'.decode('utf-16')
u'Nam\xe1ri\xeb!'
>>> 'Nam\xc3\xa1ri\xc3\xab!'.decode('utf-8')
u'Nam\xe1ri\xeb!'
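A short Python 3 sketch of why the receiver has to know the encoding: the same octets decode cleanly with the codec that produced them and fail with another. The variable names are illustrative only.

```python
# Octets as they might arrive from the network, produced with cp1252.
wire = b'Nam\xe1ri\xeb!'

# Decoding with the right codec recovers the text.
text = wire.decode('cp1252')
print(text)

# Decoding with the wrong codec fails: \xe1 starts a multi-byte
# sequence in utf-8, but the octet that follows does not continue it.
try:
    wire.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError as exc:
    wrong_codec_failed = True
    print('utf-8 decode failed:', exc.reason)
```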

Do it yourself; or not!

If you use a high-level protocol like HTTP (and its libraries), encoding is done for you.

If not, you'll need to do it yourself.

Not supported:

ASCII is a 7-bit code, so it can't encode any character with a code above 127.

>>> elvish.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 3: ordinal not in range(128)
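As a hedged illustration in Python 3, the failure can be caught, or avoided with the errors argument of str.encode(). Whether lossy replacement is acceptable depends on the application; this only shows the available behaviors.

```python
elvish = 'Nam\u00e1ri\u00eb!'      # 'Namárië!'

# Strict encoding (the default) raises on the first unsupported character.
try:
    elvish.encode('ascii')
except UnicodeEncodeError as exc:
    print('cannot encode', repr(exc.object[exc.start:exc.end]),
          'at position', exc.start)

# Lossy alternatives: substitute '?' or keep an escaped form.
replaced = elvish.encode('ascii', errors='replace')
escaped = elvish.encode('ascii', errors='backslashreplace')
print(replaced, escaped)
```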

Variable length encodings:

Some codecs encode different characters in different numbers of octets.

For example, utf-16 uses either 16 or 32 bits to encode a character.

utf-16 also adds prefix bytes (a byte order mark): \xff\xfe.

All these things make it hard to pick out individual characters in the encoded bytes.
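The variable widths can be checked directly. This Python 3 sketch encodes four sample characters (chosen here for illustration) and records how many octets each takes in utf-8 and in utf-16 without the byte order mark.

```python
# One ASCII letter, one Latin-1 letter, the euro sign, and a musical
# symbol from outside the 16-bit range (U+1D11E).
samples = ['a', '\u00eb', '\u20ac', '\U0001d11e']

widths = {}
for ch in samples:
    utf8_len = len(ch.encode('utf-8'))
    utf16_len = len(ch.encode('utf-16-le'))   # '-le' variant emits no BOM
    widths[ch] = (utf8_len, utf16_len)
    print('U+%04X' % ord(ch), 'utf-8:', utf8_len, 'utf-16:', utf16_len)

# The plain 'utf-16' codec prepends a 2-octet byte order mark.
with_bom = len('a'.encode('utf-16'))
```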

Network Byte Order

Either big-endian or little-endian.

Typically needed for binary data. Text is handled by encoding, plus knowing where your message ends (framing).

Problem: Send 4253 across a network connection

Solution 1: Send '4253'

Problem: Need to convert string <--> number. Lots of arithmetic.

Still, lots of situations do exactly this (HTTP, for example, since it is a text protocol)

Dense binary protocols used to be common, but they are used less and less.
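Solution 1 amounts to a pair of conversions at each end. A minimal Python 3 sketch (the variable names are illustrative):

```python
number = 4253

# Sender: number -> decimal digits -> octets.
wire = str(number).encode('ascii')

# Receiver: octets -> decimal digits -> number.
received = int(wire.decode('ascii'))

print(wire, received)
```

The text form takes four octets here and would grow with the number of digits, which is the density cost that binary formats avoid.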

How does Python see 4253?

Python stores a number in binary; we can look at its hex representation as follows:

>>> hex(4253)
'0x109d'

Each hex digit is 4 bits.

Computers store this value in memory in either big-endian (most significant byte first) or little-endian (least significant byte first) format.

Python's perspective on a religious war.

Python is agnostic.

'<': little-endian

'>': big-endian

'i': integer (4 bytes, signed)

'!': network perspective (big-endian)

>>> import struct
>>> struct.pack('<i',4253)
'\x9d\x10\x00\x00'
>>> struct.pack('>i',4253)
'\x00\x00\x10\x9d'
>>> struct.pack('!i',4253)
'\x00\x00\x10\x9d'

>>> struct.unpack('!i','\x00\x00\x10\x9d')
(4253,)
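The same calls in Python 3 return bytes objects rather than str. This sketch repeats the slide's values so the three byte orders can be compared side by side.

```python
import struct

little = struct.pack('<i', 4253)    # least significant byte first
big = struct.pack('>i', 4253)       # most significant byte first
network = struct.pack('!i', 4253)   # network order, same as big-endian

# unpack() always returns a tuple, even for a single value.
(value,) = struct.unpack('!i', network)

print(little, big, network, value)
```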

Older Approaches

htons(), htonl(), ntohs() and ntohl().

Authors say, “Don't do it”.
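For reference, these functions are exposed in Python's socket module. The sketch below (Python 3) shows one reason they are awkward: the intermediate value depends on the byte order of the machine running the code, and the result is still an integer rather than bytes ready to send.

```python
import socket
import sys

port = 4253                          # 0x109d

wire_order = socket.htons(port)      # host -> network order, 16 bits
round_trip = socket.ntohs(wire_order)

# On a little-endian host the two bytes are swapped (0x9d10);
# on a big-endian host the value is unchanged.
print(sys.byteorder, hex(port), hex(wire_order), round_trip)
```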

Framing

UDP does framing for you. Data is transmitted in the same chunks in which it is received from the application.

In TCP you have to frame your own transmitted data.

Framing answers the question, “When is it safe to stop calling recv()?”

Simple Example: Single Stream

Send data with no reply

import socket, sys
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

HOST = sys.argv.pop() if len(sys.argv) == 3 else '127.0.0.1'
PORT = 1060

if sys.argv[1:] == ['server']:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((HOST, PORT))
    s.listen(1)
    print 'Listening at', s.getsockname()
    sc, sockname = s.accept()
    print 'Accepted connection from', sockname
    sc.shutdown(socket.SHUT_WR)
    message = ''
    while True:
        more = sc.recv(8192)  # arbitrary value of 8k
        if not more:  # socket has closed when recv() returns ''
            break
        message += more
    print 'Done receiving the message; it says:'
    print message
    sc.close()
    s.close()

Simple Example

elif sys.argv[1:] == ['client']:
    s.connect((HOST, PORT))
    s.shutdown(socket.SHUT_RD)
    s.sendall('Beautiful is better than ugly.\n')
    s.sendall('Explicit is better than implicit.\n')
    s.sendall('Simple is better than complex.\n')
    s.close()

else:
    print >>sys.stderr, 'usage: streamer.py server|client [host]'
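The receive-until-closed pattern can be tried in a single process. This is a Python 3 sketch, not the book's listing: it assumes socket.socketpair() as a stand-in for a real client and server, and the helper name recv_until_closed is made up for the example.

```python
import socket

def recv_until_closed(sock, bufsize=8192):
    """Collect data until recv() returns b'', meaning the peer is done."""
    chunks = []
    while True:
        more = sock.recv(bufsize)
        if not more:
            break
        chunks.append(more)
    return b''.join(chunks)

# A connected pair of sockets stands in for client and server.
sender, receiver = socket.socketpair()

sender.sendall(b'Beautiful is better than ugly.\n')
sender.sendall(b'Explicit is better than implicit.\n')
sender.shutdown(socket.SHUT_WR)     # signals end of stream to the receiver

message = recv_until_closed(receiver)
print(message.decode('ascii'), end='')

sender.close()
receiver.close()
```

The messages here are small enough to sit in the socket buffer; a real sender and receiver would run concurrently.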

Simple Example: Streaming in both directions; one RQ, one RP

Important caveat: Always complete streaming in one direction before beginning in the opposite direction. Otherwise, deadlock can occur.

Simple Example: Fixed Length Messages

In this case use TCP's sendall() and write your own recvall().

Fixed-length messages rarely occur in practice.

def recvall(sock, length):
    data = ''
    while len(data) < length:
        more = sock.recv(length - len(data))
        if not more:
            raise EOFError('socket closed %d bytes into a %d-byte message'
                           % (len(data), length))
        data += more
    return data
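A Python 3 version of the same helper, exercised over socket.socketpair() (an assumption made so the sketch runs in one process). It reads two 8-byte messages and then shows the EOFError raised when the peer closes early.

```python
import socket

def recvall(sock, length):
    """Read exactly `length` bytes, or raise EOFError if the peer closes."""
    data = b''
    while len(data) < length:
        more = sock.recv(length - len(data))
        if not more:
            raise EOFError('socket closed %d bytes into a %d-byte message'
                           % (len(data), length))
        data += more
    return data

a, b = socket.socketpair()
a.sendall(b'0123456789ABCDEF')      # two fixed-length messages, 8 bytes each

first = recvall(b, 8)
second = recvall(b, 8)

a.close()                           # no third message will ever arrive
try:
    recvall(b, 8)
    closed_early = False
except EOFError as exc:
    closed_early = True
    print(exc)
b.close()
```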

Simple Example: Delimit Message with Special Characters.

Use a delimiter character outside the range of possible message characters. This is not possible when the message is arbitrary binary data.

Authors' recommendation is to use this only if you know the message “alphabet” is limited.

If the delimiter can also appear as a message character, then “escape” it inside the message.

This approach has costs: recognizing an escaped character, removing the escaping upon arrival, and the change in message length.
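A minimal Python 3 sketch of delimiter framing, assuming a newline delimiter that never occurs inside a message (so no escaping is needed). The function name split_messages is hypothetical. It separates complete messages from a trailing partial one, which the caller keeps until more data arrives.

```python
def split_messages(buffer, delimiter=b'\n'):
    """Return (complete_messages, leftover) for the bytes received so far."""
    # split() always yields at least one element; the last one is whatever
    # follows the final delimiter (possibly empty, possibly a partial message).
    *complete, leftover = buffer.split(delimiter)
    return complete, leftover

# First recv() ends in the middle of the third message.
messages, pending = split_messages(b'one\ntwo\nthr')
print(messages, pending)

# The next recv() is appended to the leftover before splitting again.
more_messages, pending_after = split_messages(pending + b'ee\n')
print(more_messages, pending_after)
```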

Simple Example: Prefix message with its length

Popular with binary data.

Don't forget to “frame” the length itself.

What if this is your choice but you don't know in advance the length of the message? Divide your message up into known length segments and send them separately. Now all you need is a signal for the final segment.

Listing 5-2.

#!/usr/bin/env python
# Foundations of Python Network Programming - Chapter 5 - blocks.py
# Sending data one block at a time.

import socket, struct, sys
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

HOST = sys.argv.pop() if len(sys.argv) == 3 else '127.0.0.1'
PORT = 1060
format = struct.Struct('!I')  # for messages up to 2**32 - 1 in length

def recvall(sock, length):
    data = ''
    while len(data) < length:
        more = sock.recv(length - len(data))
        if not more:
            raise EOFError('socket closed %d bytes into a %d-byte message'
                           % (len(data), length))
        data += more
    return data

Listing 5-2.

def get(sock):
    lendata = recvall(sock, format.size)
    (length,) = format.unpack(lendata)
    return recvall(sock, length)

def put(sock, message):
    # sendall() rather than send(): send() may transmit only part of the data
    sock.sendall(format.pack(len(message)) + message)

if sys.argv[1:] == ['server']:
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((HOST, PORT))
    s.listen(1)
    print 'Listening at', s.getsockname()
    sc, sockname = s.accept()
    print 'Accepted connection from', sockname
    sc.shutdown(socket.SHUT_WR)
    while True:
        message = get(sc)
        if not message:  # a zero-length block marks the end
            break
        print 'Message says:', repr(message)
    sc.close()
    s.close()

Listing 5-2.

elif sys.argv[1:] == ['client']:
    s.connect((HOST, PORT))
    s.shutdown(socket.SHUT_RD)
    put(s, 'Beautiful is better than ugly.')
    put(s, 'Explicit is better than implicit.')
    put(s, 'Simple is better than complex.')
    put(s, '')  # zero-length block signals the final segment
    s.close()

else:
    print >>sys.stderr, 'usage: blocks.py server|client [host]'
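The block protocol can be exercised end to end in one process. This Python 3 sketch follows the listing's put()/get() design but is not the book's code: it assumes socket.socketpair() in place of a real connection and renames the Struct to header to avoid shadowing the built-in format().

```python
import socket
import struct

header = struct.Struct('!I')        # 4-byte unsigned length, network order

def recvall(sock, length):
    data = b''
    while len(data) < length:
        more = sock.recv(length - len(data))
        if not more:
            raise EOFError('socket closed %d bytes into a %d-byte message'
                           % (len(data), length))
        data += more
    return data

def put(sock, message):
    """Send one block: length prefix followed by the payload."""
    sock.sendall(header.pack(len(message)) + message)

def get(sock):
    """Receive one block by reading the prefix, then exactly that many bytes."""
    (length,) = header.unpack(recvall(sock, header.size))
    return recvall(sock, length)

client, server = socket.socketpair()
for block in (b'Beautiful is better than ugly.',
              b'Simple is better than complex.',
              b''):                 # empty block marks the end
    put(client, block)

received = []
while True:
    block = get(server)
    if not block:
        break
    received.append(block)

client.close()
server.close()
print(received)
```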

HTTP Example:

• Uses a delimiter, '\r\n\r\n', to mark the end of the header, and a Content-Length field in the header to frame the body, which may be purely binary data.
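The two framing techniques that HTTP combines can be sketched in Python 3 as follows. This is a deliberately simplified illustration (the function name split_http_message is hypothetical, and it ignores chunked transfer encoding and other HTTP details); real code should use an HTTP library.

```python
def split_http_message(raw):
    """Return (head, body, leftover), or None if the message is incomplete."""
    # Delimiter framing: the header ends at the first blank line.
    head, separator, rest = raw.partition(b'\r\n\r\n')
    if not separator:
        return None

    # Collect header fields, skipping the first (status or request) line.
    fields = {}
    for line in head.split(b'\r\n')[1:]:
        name, _, value = line.partition(b':')
        fields[name.strip().lower()] = value.strip()

    # Length framing: Content-Length says how many body octets follow.
    length = int(fields.get(b'content-length', b'0').decode('ascii'))
    if len(rest) < length:
        return None
    return head, rest[:length], rest[length:]

raw = (b'HTTP/1.1 200 OK\r\n'
       b'Content-Type: application/octet-stream\r\n'
       b'Content-Length: 4\r\n'
       b'\r\n'
       b'\x00\x01\x02\x03NEXT')

head, body, leftover = split_http_message(raw)
print(body, leftover)
```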

Pickles:

• Pickle is the native serialization format built into Python.

• Serialization is used to send objects that include pointers across the network, where the pointers will have to be rebuilt.

• Pickling is a mix of text and data:

>>> import pickle
>>> pickle.dumps([5,6,7])
'(lp0\nI5\naI6\naI7\na.'

• At the other end:

>>> pickle.loads('(lp0\nI5\naI6\naI7\na.An apple day')
[5, 6, 7]

Pickles:

• The problem in the network case is that we can't tell how many bytes of pickle data were consumed before we get to what follows (“An apple day”).

• If we use the load() function on a file instead, then the file pointer is maintained and we can ask for its location.

• Remember that Python lets you turn a socket into a file object – makefile().

>>> from StringIO import StringIO
>>> f = StringIO('(lp0\nI5\naI6\naI7\na.An apple day')
>>> pickle.load(f)
[5, 6, 7]
>>> f.pos
18
>>> f.read()
'An apple day'
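In Python 3 the same experiment uses io.BytesIO and tell() in place of StringIO and f.pos. The pickled bytes differ from the Python 2 output on the slide because the default pickle protocol has changed, so this sketch compares positions against len(payload) rather than a fixed number.

```python
import io
import pickle

payload = pickle.dumps([5, 6, 7])

# Pickle data followed by unrelated bytes, as on a shared stream.
stream = io.BytesIO(payload + b'An apple day')

value = pickle.load(stream)         # stops at the end of the pickle
consumed = stream.tell()            # file position right after the pickle
rest = stream.read()                # whatever followed it

print(value, consumed, rest)
```

As the pickle documentation warns, only unpickle data from a source you trust.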

JSON

• Popular and easily allows data exchange between software written in different languages.

• Does not support framing.

• JSON supports Unicode but not binary data (see BSON).

• See Chapter 18

>>> import json
>>> json.dumps([51,u'Namárië!'])
'[51, "Nam\\u00e1ri\\u00eb!"]'
>>> json.loads('{"name": "lancelot", "quest" : "Grail"}')
{u'quest': u'Grail', u'name': u'lancelot'}
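Because JSON does not frame itself, the sender has to add framing. One hedged option, sketched in Python 3 below, is a newline delimiter: json.dumps() with its default settings escapes newlines inside strings, so a raw newline can only be the separator. The helper names encode_line and decode_lines are made up for this example.

```python
import json

def encode_line(obj):
    """Serialize one object as a single newline-terminated line of UTF-8."""
    return json.dumps(obj).encode('utf-8') + b'\n'

def decode_lines(buffer):
    """Return (objects, leftover) for the bytes received so far."""
    *lines, leftover = buffer.split(b'\n')
    return [json.loads(line.decode('utf-8')) for line in lines], leftover

wire = (encode_line([51, 'Nam\u00e1ri\u00eb!']) +
        encode_line({'name': 'lancelot', 'quest': 'Grail'}))

objects, leftover = decode_lines(wire)
print(objects, leftover)
```

A length prefix, as in Listing 5-2, would work equally well.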

XML

• Popular and easily allows data exchange between software written in different languages.

• Does not support framing.

• Best for text documents.

• See Chapter 10

Compression

• The time spent transmitting is often much longer than the time spent pre- and post-processing the exchanged data, so compression can pay off.

• HTTP lets client and server decide whether to compress or not.

• zlib is self-framing. Start feeding it a compressed data stream and it will know when the stream has come to an end.

>>> import zlib
>>> data = zlib.compress('sparse') + '.' + zlib.compress('flat') + '.'
>>> data
'x\x9c+.H,*N\x05\x00\t\r\x02\x8f.x\x9cK\xcbI,\x01\x00\x04\x16\x01\xa8.'
>>> len(data)
28

Note: the two '.' separators were appended as-is; we did not try to compress them.

Compression

• Suppose the previous data arrives in 8-byte chunks.

• We are still expecting more data.

>>> dobj = zlib.decompressobj()
>>> dobj.decompress(data[0:8]), dobj.unused_data
('spars', '')

The empty unused_data indicates we haven't reached the end of the compressed stream.

>>> dobj.decompress(data[8:16]), dobj.unused_data
('e', '.x')

This says we consumed the first compressed stream and that some data after it was unused.

Compression

• Skip over the '.' and start to decompress the rest of the compressed data

>>> dobj2 = zlib.decompressobj()
>>> dobj2.decompress('x'), dobj2.unused_data
('', '')
>>> dobj2.decompress(data[16:24]), dobj2.unused_data
('flat', '')
>>> dobj2.decompress(data[24:]), dobj2.unused_data
('', '.')

The unused '.' is the final separator. The point is that what we have gathered so far, '' + 'flat' + '', is all of the data compressed by the second use of zlib.compress().

NOTE: Used in the normal way, zlib provides its own framing.
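The chunk-by-chunk session can be wrapped in a small Python 3 helper. The name read_one and the 8-byte chunk size are assumptions made for this sketch; it relies on the eof and unused_data attributes of zlib decompression objects.

```python
import zlib

def read_one(data, chunk_size=8):
    """Decompress one zlib stream from the front of `data`, fed in chunks.

    Returns (decompressed, remaining), where `remaining` is everything
    that followed the end of the compressed stream.
    """
    dobj = zlib.decompressobj()
    output = b''
    pos = 0
    while not dobj.eof and pos < len(data):
        output += dobj.decompress(data[pos:pos + chunk_size])
        pos += chunk_size
    # unused_data holds the part of the last chunk past the stream's end.
    return output, dobj.unused_data + data[pos:]

data = zlib.compress(b'sparse') + b'.' + zlib.compress(b'flat') + b'.'

first, rest = read_one(data)
second, tail = read_one(rest[1:])   # skip the '.' separator
print(first, second, tail)
```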

Network Exceptions:

• Many possibilities, some specific (socket.timeout) and some generic (socket.error).

• Homework: Write two short Python scripts. The first opens a UDP socket connected to a remote socket. The second tries to send data to the first socket but will fail, since its socket is not the one the first was “connected” to. Find out the exact error that Python returns, along with the value of errno.

• Familiar exceptions – socket.gaierror, socket.error, socket.timeout.
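A small Python 3 sketch of catching a specific exception before the generic one. It assumes socket.socketpair() as a local stand-in for a network peer, and the helper name try_recv is hypothetical. In Python 3, socket.error is an alias of OSError and socket.timeout is one of its subclasses, so the more specific handler has to come first.

```python
import socket

def try_recv(sock, timeout=0.1):
    """Return ('data', bytes), ('timeout', None) or ('error', errno)."""
    sock.settimeout(timeout)
    try:
        return ('data', sock.recv(1024))
    except socket.timeout:          # specific: nothing arrived in time
        return ('timeout', None)
    except OSError as exc:          # generic: any other socket failure
        return ('error', exc.errno)

a, b = socket.socketpair()

nothing = try_recv(b)               # no data has been sent yet
a.sendall(b'ping')
something = try_recv(b)

a.close()
b.close()
print(nothing, something)
```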