
AM 50: Homework 4 (due April 6th at 12 PM)

Program files: A number of program files for this homework can be downloaded as a single ZIP file from the course website.

1. (a) Let a be a real number where −1 < a < 1. By considering \((1 - a)\sum_{n=0}^{N} a^n\), show that

\[
\sum_{n=0}^{N} a^n = \frac{1 - a^{N+1}}{1 - a} \tag{1}
\]

and hence that

\[
\sum_{n=0}^{\infty} a^n = \frac{1}{1 - a}. \tag{2}
\]
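Before working through the algebra, it can help to confirm Eqs. (1) and (2) numerically; a minimal check, assuming Python and an illustrative value of a:

```python
a = 0.3
N = 10

# Partial sum computed term by term.
partial = sum(a**n for n in range(N + 1))

# Closed form from Eq. (1).
closed = (1 - a**(N + 1)) / (1 - a)
print(partial, closed)  # the two agree to floating-point precision

# For |a| < 1 the partial sums approach 1/(1 - a), as in Eq. (2).
print(sum(a**n for n in range(1000)), 1 / (1 - a))
```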

(b) By differentiating both sides of Eq. 2 or otherwise, find an expression for \(\sum_{n=0}^{\infty} n a^n\).

(c) Consider a coin with probability p of obtaining a head, and probability q = 1 − p of obtaining a tail. Let N be a random variable that is the number of coin tosses required until first obtaining a tail. Calculate the Shannon entropy of N.
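Any closed-form answer to part (c) can be checked by direct summation; a sketch assuming Python and the illustrative choice p = 0.5:

```python
import math

p = 0.5
q = 1 - p

# P(N = n) = p^(n-1) * q: n - 1 heads followed by the first tail.
probs = [p**(n - 1) * q for n in range(1, 200)]

# Direct evaluation of H = -sum P log2 P; the tail beyond n = 200 is negligible.
H = -sum(P * math.log2(P) for P in probs)
print(H)  # 2.0 for p = 0.5
```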

2. Channel entropy. Consider a channel where information is sent as three-digit decimal numbers of the form 000, 001, 002, . . . , 148, 149. Assume that each of the 150 numbers is equally probable. The entropy of a number is therefore

\[
H_N = -150 \left( \frac{1}{150} \log_2 \frac{1}{150} \right) = \log_2 150.
\]
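The value of H_N can be confirmed by summing over all 150 outcomes directly; a minimal check in Python:

```python
import math

# 150 equally likely numbers, each with probability 1/150.
probs = [1 / 150] * 150
H_N = -sum(P * math.log2(P) for P in probs)
print(H_N, math.log2(150))  # both are log2(150), roughly 7.23 bits
```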

(a) Calculate the Shannon entropy of the first digit, second digit, and third digit.

(b) Calculate the sum of the entropies of the three digits. Why is the sum not equal to \(H_N\)?

(c) Suppose that the second digit of a number is 3. What are the entropies of the first and third digits? Are the entropies the same as in part (a)?

3. Digram analysis of different languages. In his paper, Shannon considers the digram structure of English, where he looks at the probabilities of two-letter combinations. It turns out that digrams represent a simple but surprisingly powerful way to differentiate between languages. To illustrate this, we have downloaded out-of-copyright books in five different European languages from Project Gutenberg:

0. English: The Voyage Out by Virginia Woolf (1915)

1. French: L’Atlantide by Pierre Benoît (1920)

2. German: Siddhartha by Hermann Hesse (1922)

3. Italian: Dal Cellulare a Finalborgo by Paolo Valera (1899)

4. Spanish: La Voz de la Conseja edited by Emilio Carrère (1920)

These books are included in the ZIP file for the homework. They have been pre-processed to remove any special characters and accented characters. A program digram.py is supplied, which scans each book and calculates the probability distribution of the digrams. It removes punctuation, converts everything to lower case, and then considers each word. In the analysis, it considers 27 characters, with 0 corresponding to a space, 1 = a, . . . , 25 = y, and 26 = z. For a word such as toast, the program considers it with a space at both ends as

_toast_ (3)


and then counts the digrams _t, to, oa, as, st, and t_. After the program has run, it will have assembled probabilities \(p^k_{i,j}\) of the occurrence of digram (i, j) in language k. To simplify things, the program creates a tiny fictitious probability for any digram that was not seen when scanning the books (such as qx). Hence \(p^k_{i,j} > 0\) for all combinations of i, j, and k, and you do not have to worry about taking the logarithm of zero in subsequent computations.
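The word-scanning scheme described above can be sketched as follows; this is an illustrative reimplementation rather than the supplied digram.py, and the function name is an assumption:

```python
def word_digrams(word):
    """List the digrams of a word padded with a space at both ends,
    using 0 for a space and 1-26 for a-z."""
    codes = [0] + [ord(c) - ord('a') + 1 for c in word.lower()] + [0]
    return list(zip(codes[:-1], codes[1:]))

# For "toast" this yields the six pairs for _t, to, oa, as, st, and t_.
print(word_digrams("toast"))
```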

(a) Run the program digram.py. It will build the digram probabilities and then output a 2D matrix for the English digrams. What do you notice about the row for q? Comment on a few other features of the matrix.

(b) There are a few lines at the end of the program that can be modified to plot the difference between two languages. Modify the program as directed to plot the difference between Italian and Spanish. Comment on some key differences between the two languages.

(c) Add lines to the program to calculate the entropy of selecting a random digram for each of the five languages. What language has the most entropy? What language has the least? Compare the entropies against a hypothetical language in which all 27 × 27 digrams are equally probable.
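The equal-probability benchmark for this comparison can be computed directly; a sketch assuming the 27 × 27 table is held as nested Python lists:

```python
import math

# A stand-in uniform 27 x 27 table; the probabilities built by digram.py
# would take its place in the actual comparison.
p = [[1 / 729] * 27 for _ in range(27)]

# Shannon entropy of drawing one digram; every entry is positive, so log2 is safe.
H = -sum(pij * math.log2(pij) for row in p for pij in row)
print(H)  # the uniform table attains the maximum, log2(729), about 9.51 bits
```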

(d) Add lines to the program to calculate a measure of the difference between two languages,

\[
D(k, l) = \sqrt{\sum_{i=0}^{26} \sum_{j=0}^{26} \left( p^k_{i,j} - p^l_{i,j} \right)^2}.
\]

What pair of languages are the most similar? What pair of languages are the most different?
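The distance D(k, l) translates directly into code; a sketch assuming the probability tables are stored as 27 × 27 nested lists (the function and variable names are assumptions):

```python
import math

def digram_distance(p_k, p_l):
    """Euclidean distance between two 27 x 27 digram probability tables."""
    return math.sqrt(sum((p_k[i][j] - p_l[i][j])**2
                         for i in range(27) for j in range(27)))

# Identical tables are at distance zero.
uniform = [[1 / 729] * 27 for _ in range(27)]
print(digram_distance(uniform, uniform))  # 0.0
```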

4. Automatic language detection. Digrams can be used to differentiate between languages. Suppose that a given message consists of a sequence of digrams \((i_\alpha, j_\alpha)\) for α = 0, . . . , N − 1. Then for language k, the likelihood of observing the message is

\[
L(k) = \prod_{\alpha=0}^{N-1} p^k_{i_\alpha, j_\alpha}.
\]

The above formula is unwieldy to use computationally, since it involves multiplying many tiny numbers together. Taking the logarithm of both sides yields the log-likelihood formula

\[
\log_2 L(k) = \sum_{\alpha=0}^{N-1} \log_2 \left( p^k_{i_\alpha, j_\alpha} \right), \tag{4}
\]

which is easier to deal with. Write a Python program to take a string of text and evaluate the log-likelihood as in Eq. 4. For each word in the string, consider it as in (3) above. It may be useful to look at how the create_table function in digram.py works, since each word can be processed in a very similar way.
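A minimal sketch of evaluating Eq. 4 for a string, assuming the digram probabilities are held in a table indexed as p[i][j]; the function name and table layout are assumptions, and the word handling mirrors the padding scheme described above:

```python
import math

def log_likelihood(text, p):
    """Evaluate Eq. (4) for a string, padding each word with spaces."""
    total = 0.0
    for word in text.lower().split():
        # Keep only plain a-z, matching the pre-processed 27-character alphabet.
        letters = [c for c in word if c.isalpha() and c.isascii()]
        if not letters:
            continue
        codes = [0] + [ord(c) - ord('a') + 1 for c in letters] + [0]
        for i, j in zip(codes[:-1], codes[1:]):
            total += math.log2(p[i][j])
    return total

# With a uniform table, every digram contributes -log2(729).
uniform = [[1 / 729] * 27 for _ in range(27)]
print(log_likelihood("hello", uniform))  # 6 digrams, so 6 * -log2(729)
```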

For each of the following strings, use your program to calculate the log-likelihood for each of the five languages, and determine which language the string is most likely to be written in.

(a) “Tom Hanks”, “Penelope Cruz”, “Juliette Binoche”, and three other names.


(b) “New York”, “Colorado”, “Vermont”, “Alberta”, and three other states/provinces.

(c) “Los Angeles”, “Anaheim”, “Cincinnati”, “Portland”, and three other cities.

(d) “hello”, “bonjour”, “hola”, “hi”, and three other greetings.

(e) “oui”, “auf”, “uno”, and three other small words.

(f) “Words are, in my not-so-humble opinion, our most inexhaustible source of magic.”

(g) “En art comme en amour, l’instinct suffit.”

Finally, find an example that does not match your expectation, and briefly discuss a strategy that might make the program more accurate to handle this case.

5. Decoding error-correcting codes. Extra credit.

(a) The file code1.txt is an ASCII message that has been converted to binary, encoded using a three-bit repetition code, and has had some noise artificially introduced. As an example, the character q is 113 in ASCII, which is 01110001 in binary. This would be encoded as

000111111111000000000111

and after noise has been artificially introduced it could be

010011110111001001100101,

although due to the redundancy, the character can still be decoded. Write a Python program to decode the message. The Python command chr will be useful to convert an integer into an ASCII character.
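The majority-vote step can be sketched on the noisy example above; reading code1.txt and looping over all of its characters is left as part of the exercise:

```python
noisy = "010011110111001001100101"

# Majority vote over each group of three repeated bits.
bits = ""
for k in range(0, len(noisy), 3):
    triple = noisy[k:k + 3]
    bits += "1" if triple.count("1") >= 2 else "0"

# Each group of eight corrected bits is one ASCII character.
char = chr(int(bits, 2))
print(bits, char)  # 01110001 q
```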

(b) Decode the file code2.txt, which is encoded using a (7, 4) Hamming code.
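A syndrome decoder for a single (7, 4) block might look like the sketch below. It assumes the classic convention with parity bits at positions 1, 2, and 4 (1-indexed); the actual bit ordering used in code2.txt may differ and should be checked against the file:

```python
def hamming74_decode(block):
    """Decode one 7-bit Hamming block (a list of 7 bits), assuming
    parity bits at positions 1, 2, and 4."""
    b = [0] + list(block)            # pad so list indices match positions 1..7
    s1 = b[1] ^ b[3] ^ b[5] ^ b[7]   # parity check over positions 1,3,5,7
    s2 = b[2] ^ b[3] ^ b[6] ^ b[7]   # parity check over positions 2,3,6,7
    s4 = b[4] ^ b[5] ^ b[6] ^ b[7]   # parity check over positions 4,5,6,7
    pos = s1 + 2 * s2 + 4 * s4       # syndrome gives the error position (0 = no error)
    if pos:
        b[pos] ^= 1                  # correct the single-bit error
    return [b[3], b[5], b[6], b[7]]  # the four data bits

# The codeword for data 1011 is 0110011; flip its sixth bit and decode.
print(hamming74_decode([0, 1, 1, 0, 0, 0, 1]))  # [1, 0, 1, 1]
```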
