Chapter 6 Entropy and Shannon's First Theorem

Information

Existence: I(p) = log(1/p)

Units of information: in base 2 = a bit; in base e = a nat; in base 10 = a Hartley.


A quantitative measure of the amount of information any event represents. I(p) = the amount of information in the occurrence of an event of probability p.

Axioms:
A. I(p) ≥ 0 for any event p
B. I(p1∙p2) = I(p1) + I(p2) when p1 and p2 are independent events
C. I(p) is a continuous function of p

(Axiom B is the Cauchy functional equation.)

(single symbol from the source)

Uniqueness: Suppose I′(p) satisfies the axioms. Since I′(p) ≥ 0, take any 0 < p0 < 1 and use the base k = (1/p0)^(1/I′(p0)). Then k^I′(p0) = 1/p0, and hence logk(1/p0) = I′(p0). Now, any z ∈ (0,1) can be written as p0^r, with r a real number in R+ (r = log_p0 z). The Cauchy functional equation implies that I′(p0^n) = n·I′(p0), and for m ∈ Z+, I′(p0^(1/m)) = (1/m)·I′(p0), which gives I′(p0^(n/m)) = (n/m)·I′(p0), and hence by continuity I′(p0^r) = r·I′(p0). Hence I′(z) = r·logk(1/p0) = logk(1/p0^r) = logk(1/z).

Note: In this proof, we introduce an arbitrary p0, show how any z relates to it, and then eliminate the dependency on that particular p0.
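A quick numeric check of the axioms for the base-2 measure I(p) = log2(1/p) (a minimal Python sketch; the probabilities are arbitrary):

```python
import math

def info(p: float) -> float:
    """Information of an event with probability p, in bits."""
    return math.log2(1.0 / p)

p1, p2 = 0.5, 0.25
assert info(p1) >= 0 and info(p2) >= 0                    # Axiom A: non-negativity
assert math.isclose(info(p1 * p2), info(p1) + info(p2))   # Axiom B: additivity for independent events
print(info(p1), info(p2), info(p1 * p2))                  # 1.0 2.0 3.0
```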


Entropy

The average amount of information received on a per-symbol basis from a source S = {s1, …, sq} of symbols, where si has probability pi. It measures the information rate. In radix r, when all the probabilities are independent:

Hr(S) = Σ_{i=1}^{q} pi logr(1/pi)        [the weighted arithmetic mean of the information logr(1/pi)]

      = Σ_{i=1}^{q} logr(1/pi)^pi = logr Π_{i=1}^{q} (1/pi)^pi        [the log of the weighted geometric mean of the 1/pi]

• Entropy is the amount of information in a probability distribution.

• Alternative approach: consider a long message of N symbols from S = {s1, …, sq} with probabilities p1, …, pq. You expect si to appear Npi times, and the probability of this typical message is:


P = Π_{i=1}^{q} pi^(N·pi), whose information is log(1/P) = N Σ_{i=1}^{q} pi log(1/pi) = N·H(S).
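Both views can be checked numerically. A minimal Python sketch (the distribution is chosen arbitrarily for illustration):

```python
import math

def entropy(probs, base=2.0):
    """H(S) = sum of p * log_base(1/p): the average information per symbol."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]          # an arbitrary example distribution
N = 1000                                   # length of a "typical" message
H = entropy(probs)

# A typical message contains each s_i about N*p_i times, so
# log2(1/P) = -sum of N*p_i*log2(p_i), and the information per symbol is that divided by N.
log2_P = sum(N * p * math.log2(p) for p in probs)
print(H, -log2_P / N)                      # both 1.75
```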

Consider f(p) = p ln (1/p): (works for any base, not just e)

f′(p) = (-p ln p)′ = -p(1/p) – ln p = -1 + ln (1/p)

f″(p) = p∙(−p^(−2)) = −1/p < 0 for p ∈ (0,1), so f is concave down.

lim_{p→0+} f(p) = lim_{p→0+} p ln(1/p) = lim_{p→0+} ln(1/p) / (1/p) = lim_{p→0+} (−1/p) / (−1/p²) = lim_{p→0+} p = 0.

f(1) = 0;  f′(0+) = +∞;  f′(1) = −1;  f′(1/e) = 0;  f(1/e) = 1/e.

[Plot of f(p) = p ln(1/p) on (0, 1): it rises from 0 at p = 0 to its maximum 1/e at p = 1/e, then falls back to 0 at p = 1.]

Basic information about logarithm function

Tangent line to y = ln x at x = 1: (y − ln 1) = (ln x)′|_{x=1} ∙ (x − 1), i.e. y = x − 1.

(ln x)″ = (1/x)′ = −1/x² < 0, so ln x is concave down.

Conclusion: ln x ≤ x − 1.

[Plot: y = ln x lies below its tangent line y = x − 1, touching it at x = 1.]


• Minimum entropy occurs when one pi = 1 and all the others are 0.
• Maximum entropy occurs when? Consider:

Let Σ_{i=1}^{q} xi = 1 and Σ_{i=1}^{q} yi = 1 be two probability distributions, and consider

Σ_{i=1}^{q} xi log(yi/xi) ≤ (log e) Σ_{i=1}^{q} xi (yi/xi − 1) = (log e) (Σ_{i=1}^{q} yi − Σ_{i=1}^{q} xi) = 0,

using log x ≤ (x − 1) log e, with equality only when yi = xi for every i. Equivalently (the fundamental Gibbs inequality):

Σ_{i=1}^{q} xi log(1/xi) ≤ Σ_{i=1}^{q} xi log(1/yi).

Applying Gibbs with the distribution yi = 1/q:

H(S) − log q = Σ_{i=1}^{q} pi log(1/pi) − Σ_{i=1}^{q} pi log q = Σ_{i=1}^{q} pi log(1/(q∙pi)) ≤ 0.

• Hence H(S) ≤ log q, and equality occurs only when pi = 1/q.
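A small numeric check of the bound (Python; the skewed distribution is arbitrary):

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

q = 4
uniform = [1.0 / q] * q
skewed = [0.7, 0.1, 0.1, 0.1]             # any non-uniform distribution
print(entropy(uniform), math.log2(q))     # 2.0 2.0  -> equality at p_i = 1/q
print(entropy(skewed))                    # about 1.357, strictly below log2(q)
```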

Entropy Examples

S = {s1} p1 = 1 H(S) = 0 (no information)

S = {s1,s2} p1 = p2 = ½ H2(S) = 1 (1 bit per symbol)

S = {s1, …, sr} p1 = … = pr = 1/r Hr(S) = 1 but H2(S) = log2r.

• Run length coding (for instance, in binary predictive coding):

Let q be the probability of a 1 and p = 1 − q the probability of a 0. Then H2(S) = p∙log2(1/p) + q∙log2(1/q).

As q → 0 the term q∙log2(1/q) dominates (compare slopes). Cf. the average run length = 1/q and the average # of bits needed to encode a run = log2(1/q).

So q∙log2(1/q) = avg. amount of information per bit of the original code.
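A brief Python sketch of these quantities for a few small values of q (values chosen only for illustration):

```python
import math

for q in (0.1, 0.01, 0.001):                # q = probability of a 1 (the rare symbol)
    p = 1.0 - q
    H = p * math.log2(1.0 / p) + q * math.log2(1.0 / q)
    avg_run = 1.0 / q                       # average run length
    bits_per_run = math.log2(1.0 / q)       # bits needed to encode one run length
    # q*log2(1/q) is both the dominant part of H and bits_per_run / avg_run
    print(q, round(H, 4), round(q * math.log2(1.0 / q), 4), round(bits_per_run / avg_run, 4))
```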

Entropy as a Lower Bound for Average Code Length

Given an instantaneous code with length li in radix r, let

K = Σ_{i=1}^{q} r^(−li) ≤ 1   (Kraft),   Qi = r^(−li)/K,   so   Σ_{i=1}^{q} Qi = 1.

So by Gibbs, Σ_{i=1}^{q} pi logr(Qi/pi) ≤ 0. Applying logr(1/Qi) = li + logr K:

Hr(S) = Σ_{i=1}^{q} pi logr(1/pi) ≤ Σ_{i=1}^{q} pi logr(1/Qi) = Σ_{i=1}^{q} pi (li + logr K) = L + logr K.

Since K ≤ 1, logr K ≤ 0, and hence Hr(S) ≤ L.

By the McMillan inequality, this holds for all uniquely decodable codes. Equality occurs when K = 1 (the decoding tree is complete) and pi = r^(−li).
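A small Python check of Hr(S) ≤ L for two illustrative binary codes (in the second, pi = 2^(−li), so equality holds):

```python
import math

def check(probs, lengths, r=2):
    H = sum(p * math.log(1.0 / p, r) for p in probs)     # entropy in radix r
    L = sum(p * l for p, l in zip(probs, lengths))        # average code length
    K = sum(r ** -l for l in lengths)                     # Kraft sum
    print(f"H={H:.4f}  L={L:.4f}  K={K:.4f}")

check([0.4, 0.3, 0.2, 0.1], [2, 2, 3, 3])                 # H < L, K < 1
check([0.5, 0.25, 0.125, 0.125], [1, 2, 3, 3])            # p_i = 2^-l_i, so H = L, K = 1
```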


Shannon-Fano Coding

Simplest variable length method. Less efficient than Huffman, but allows one to code symbol si with length li directly from probability pi.

li = ⌈logr(1/pi)⌉

Then logr(1/pi) ≤ li < logr(1/pi) + 1, so 1/pi ≤ r^li < r∙(1/pi), and hence pi ≥ r^(−li) > pi/r.

Summing this inequality over i:

1 = Σ_{i=1}^{q} pi ≥ Σ_{i=1}^{q} r^(−li) = K > (1/r) Σ_{i=1}^{q} pi = 1/r.

Kraft inequality is satisfied, therefore there is an instantaneous code with these lengths.


Also, multiplying logr(1/pi) ≤ li < logr(1/pi) + 1 by pi and summing over i gives

Hr(S) = Σ_{i=1}^{q} pi logr(1/pi) ≤ Σ_{i=1}^{q} pi li = L < Hr(S) + 1.

Example: p’s: ¼, ¼, ⅛, ⅛, ⅛, ⅛ l’s: 2, 2, 3, 3, 3, 3 K = 1

H2(S) = 2.5 L = 5/2 = 2.5

[Decoding tree for the example code: two codewords of length 2 and four of length 3; the tree is complete.]
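The example can be reproduced with a brief Python sketch (lengths and averages only; the codeword assignment from the tree is not shown):

```python
import math

probs = [1/4, 1/4, 1/8, 1/8, 1/8, 1/8]
lengths = [math.ceil(math.log2(1 / p)) for p in probs]   # Shannon-Fano: l_i = ceil(log2(1/p_i))
K = sum(2 ** -l for l in lengths)                        # Kraft sum
H = sum(p * math.log2(1 / p) for p in probs)             # entropy H2(S)
L = sum(p * l for p, l in zip(probs, lengths))           # average code length
print(lengths, K, H, L)                                  # [2, 2, 3, 3, 3, 3]  1.0  2.5  2.5
```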


The Entropy of Code Extensions

Recall: The nth extension of a source S = {s1, …, sq} with probabilities p1, …, pq is the set of symbols T = S^n = {si1 ∙∙∙ sin : sij ∈ S, 1 ≤ j ≤ n} (concatenation), where ti = si1 ∙∙∙ sin has probability Qi = pi1 ∙∙∙ pin (multiplication), assuming independent probabilities. Let i = (i1 − 1, …, in − 1)q + 1, i.e. read the indices as an n-digit number base q. The entropy is:

H(S^n) = H(T) = Σ_{i=1}^{q^n} Qi log(1/Qi)
       = Σ_{i=1}^{q^n} Qi log(1/(pi1 ∙∙∙ pin))
       = Σ_{i=1}^{q^n} Qi [ log(1/pi1) + ∙∙∙ + log(1/pin) ]
       = Σ_{k=1}^{n} Σ_{i=1}^{q^n} Qi log(1/pik)


Consider the kth term:

Σ_{i=1}^{q^n} Qi log(1/pik) = Σ_{i1=1}^{q} ∙∙∙ Σ_{in=1}^{q} pi1 ∙∙∙ pin log(1/pik)
   = [ Σ_{ik=1}^{q} pik log(1/pik) ] ∙ [ Σ p̂ ] = H(S) ∙ 1 = H(S),

where p̂ = pi1 ∙∙∙ pik−1 pik+1 ∙∙∙ pin is just a probability in the (n−1)st extension, and adding them all up gives 1.


Hence H(S^n) = n∙H(S). The average S-F code length Ln for T therefore satisfies
H(T) ≤ Ln < H(T) + 1, i.e. n∙H(S) ≤ Ln < n∙H(S) + 1, and so
H(S) ≤ Ln/n < H(S) + 1/n   [now let n go to infinity].
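A quick numerical check of H(S^n) = n∙H(S) (Python sketch; the source distribution is arbitrary):

```python
import math
from itertools import product

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

probs = [0.5, 0.3, 0.2]                                   # an arbitrary source S
n = 3
ext = [math.prod(c) for c in product(probs, repeat=n)]    # the q^n probabilities of S^n
print(entropy(ext), n * entropy(probs))                   # both about 4.456
```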

Extension Example

S = {s1, s2}, p1 = 2/3, p2 = 1/3. H2(S) = (2/3)log2(3/2) + (1/3)log2(3/1) ≈ 0.9182958 …

Huffman: s1 = 0 s2 = 1 Avg. coded length = (2/3)∙1+(1/3)∙1 = 1

Shannon-Fano: l1 = 1 l2 = 2 Avg. length = (2/3)∙1+(1/3)∙2 = 4/3

2nd extension: p11 = 4/9 p12 = 2/9 = p21 p22 = 1/9 S-F:

l11 = ⌈log2 (9/4)⌉ = 2 l12 = l21 = ⌈log2 (9/2)⌉ = 3 l22 = ⌈log2 (9/1)⌉ = 4

LSF(2) = avg. coded length = (4/9)∙2+(2/9)∙3∙2+(1/9)∙4 = 24/9 = 2.666…
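These second-extension numbers can be reproduced with a short Python sketch:

```python
import math

p = {"1": 2/3, "2": 1/3}
H = sum(v * math.log2(1 / v) for v in p.values())          # H2(S), about 0.9183

# 2nd extension: probabilities and Shannon-Fano lengths
ext = {a + b: p[a] * p[b] for a in p for b in p}
lengths = {s: math.ceil(math.log2(1 / q)) for s, q in ext.items()}
L2 = sum(ext[s] * lengths[s] for s in ext)
print(H, lengths, L2, L2 / 2)   # lengths 2, 3, 3, 4; L2 = 24/9 = 2.667; per original symbol 4/3
```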

Extension cont.

S^n = (s1 + s2)^n: the probabilities of its symbols are the corresponding terms in the expansion of (p1 + p2)^n. So there are C(n, i) symbols with probability (2/3)^(n−i)∙(1/3)^i = 2^(n−i)/3^n, for i = 0, …, n.

The corresponding SF length is ⌈log2(3^n/2^(n−i))⌉ = ⌈n log2 3⌉ − (n − i), since n − i is an integer. Hence

LSF(n) = Σ_{i=0}^{n} C(n, i) (2^(n−i)/3^n) [ ⌈n log2 3⌉ − (n − i) ]
       = ⌈n log2 3⌉ − n + (1/3^n) Σ_{i=0}^{n} i C(n, i) 2^(n−i),

using Σ_{i=0}^{n} C(n, i) 2^(n−i) = (2 + 1)^n = 3^n. Differentiating (2 + x)^n = Σ_{i=0}^{n} C(n, i) 2^(n−i) x^i with respect to x and setting x = 1 gives

Σ_{i=0}^{n} i C(n, i) 2^(n−i) = n(2 + 1)^(n−1) = n∙3^(n−1).

Hence LSF(n) = ⌈n log2 3⌉ − n + n/3, and LSF(n)/n → log2 3 − 2/3 = H2(S) as n → ∞.
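A brief Python check of the closed form against a brute-force sum, showing LSF(n)/n approaching H2(S):

```python
import math

def lsf(n):
    """Average SF length for the nth extension of the source p1 = 2/3, p2 = 1/3 (brute force)."""
    total = 0.0
    for i in range(n + 1):
        prob = math.comb(n, i) * (2 ** (n - i)) / 3 ** n
        total += prob * math.ceil(math.log2(3 ** n / 2 ** (n - i)))
    return total

H = math.log2(3) - 2 / 3
for n in (1, 2, 5, 10, 50):
    formula = math.ceil(n * math.log2(3)) - n + n / 3
    print(n, round(lsf(n), 4), round(formula, 4), round(lsf(n) / n, 4))
print(round(H, 4))                      # LSF(n)/n approaches H2(S), about 0.9183
```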

Markov Process Entropy

For an mth order process, p(si | si1 ∙∙∙ sim) is the conditional probability that si follows si1 ∙∙∙ sim. Think of the state as the last m symbols si1, …, sim.

I(si | si1 ∙∙∙ sim) = log(1/p(si | si1 ∙∙∙ sim)), and so

H(S | si1 ∙∙∙ sim) = Σ_{si ∈ S} p(si | si1 ∙∙∙ sim) ∙ I(si | si1 ∙∙∙ sim).

Now let p(si1, …, sim) be the probability of being in state si1 ∙∙∙ sim. Then

H(S) = Σ_{states} p(si1, …, sim) ∙ H(S | si1 ∙∙∙ sim)
     = Σ_{states} Σ_{si ∈ S} p(si1, …, sim) ∙ p(si | si1 ∙∙∙ sim) ∙ log(1/p(si | si1 ∙∙∙ sim))
     = Σ p(si1, …, sim, si) ∙ log(1/p(si | si1 ∙∙∙ sim)),

since p(si1, …, sim, si) = p(si1, …, sim) ∙ p(si | si1 ∙∙∙ sim).


Example

si1 si2 | si | p(si | si1, si2) | p(si1, si2) | p(si1, si2, si)
 0   0  |  0 |       0.8        |    5/14     |     4/14
 0   0  |  1 |       0.2        |    5/14     |     1/14
 0   1  |  0 |       0.5        |    2/14     |     1/14
 0   1  |  1 |       0.5        |    2/14     |     1/14
 1   0  |  0 |       0.5        |    2/14     |     1/14
 1   0  |  1 |       0.5        |    2/14     |     1/14
 1   1  |  0 |       0.2        |    5/14     |     1/14
 1   1  |  1 |       0.8        |    5/14     |     4/14

H(S) = Σ_{(i1, i2, i) ∈ {0,1}³} p(si1, si2, si) ∙ log2(1/p(si | si1, si2))
     = 2∙(4/14)∙log2(1/0.8) + 2∙(1/14)∙log2(1/0.2) + 4∙(1/14)∙log2(1/0.5)
     = 0.801377…


[State diagram: the four states (0,0), (0,1), (1,0), (1,1), with arrows from previous state to next state labelled by the transition probabilities 0.8, 0.2, 0.5, 0.5 from the table above.]

equilibrium probabilities: p(0,0) = 5/14 = p(1,1), p(0,1) = 2/14 = p(1,0)
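A short Python sketch (using the conditional probabilities from the table) that finds the equilibrium probabilities by iterating the chain and then evaluates the entropy sum:

```python
import math

# Conditional probabilities p(next bit | state) for the 2nd-order chain in the table above.
cond = {
    (0, 0): {0: 0.8, 1: 0.2},
    (0, 1): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.2, 1: 0.8},
}

# Find the equilibrium state probabilities by iterating the chain to convergence.
states = list(cond)
p = {s: 0.25 for s in states}
for _ in range(1000):
    nxt = {s: 0.0 for s in states}
    for (a, b), ps in p.items():
        for bit, t in cond[(a, b)].items():
            nxt[(b, bit)] += ps * t
    p = nxt

# H(S) = sum over (state, bit) of p(state) * p(bit|state) * log2(1 / p(bit|state))
H = sum(p[s] * t * math.log2(1.0 / t) for s in states for t in cond[s].values())
print({s: round(v * 14, 3) for s, v in p.items()})   # 5, 2, 2, 5 (in fourteenths)
print(round(H, 6))                                    # 0.801377
```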

The Fibonacci numbers

Let f0 = 1, f1 = 2, f2 = 3, f3 = 5, f4 = 8, … be defined by fn+1 = fn + fn−1. Let φ = the golden ratio, a root of the equation x² = x + 1. Use the fn as the weights for a system of number representation with digits 0 and 1, without adjacent 1's (because (100)phi = (011)phi).

Base Fibonacci

Representation Theorem: every number from 0 to fn − 1 can be uniquely written as an n-bit number with no adjacent ones.

Existence: Basis: n = 0, 0 ≤ i ≤ 0; 0 is represented by the empty string ε.

Induction: Let 0 ≤ i < fn+1. If i < fn, we are done by the induction hypothesis. Otherwise, fn ≤ i < fn+1 = fn−1 + fn, so 0 ≤ (i − fn) < fn−1, and i − fn is uniquely representable as i − fn = (bn−2 … b0)phi with bi ∈ {0, 1} and no two adjacent 1's. Hence i = (10bn−2 … b0)phi, which also has no adjacent ones.

Uniqueness: Let i be the smallest number ≥ 0 with two distinct representations (no leading zeros). i = (bn−1 … b0)phi = (b′n−1 … b′0)phi. By minimality of i, bn−1 ≠ b′n−1 (otherwise deleting the leading digit would give a smaller counterexample), so without loss of generality let bn−1 = 1 and b′n−1 = 0. Then (b′n−2 … b′0)phi = i ≥ fn−1, which can't be true: an (n−1)-bit representation with no adjacent 1's is at most fn−1 − 1.

The golden ratio φ = (1+√5)/2 is a solution of x² − x − 1 = 0 and equals the limit of the ratio of adjacent Fibonacci numbers.
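A minimal Python sketch of the existence argument (greedy choice of the largest weight that fits, mirroring the induction; the function names are illustrative):

```python
def fib_weights(n):
    """Weights f0 = 1, f1 = 2, f2 = 3, f3 = 5, ... (the first n of them)."""
    w = [1, 2]
    while len(w) < n:
        w.append(w[-1] + w[-2])
    return w[:n]

def to_fibonacci(i, n):
    """Write 0 <= i < f_n as an n-bit string with no adjacent 1's."""
    bits = [0] * n
    for k, f in reversed(list(enumerate(fib_weights(n)))):
        if f <= i:
            bits[k] = 1
            i -= f
    return "".join(str(b) for b in reversed(bits))   # most significant weight first

print([to_fibonacci(i, 4) for i in range(8)])
# ['0000', '0001', '0010', '0100', '0101', '1000', '1001', '1010'] -- all distinct, no '11'
```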

Base Fibonacci

[Diagram: a source with r equally likely symbols 0, …, r − 1, each of probability 1/r, has H2 = log2 r.]

1st-order Markov process: from state 0, emit 0 with probability 1/φ or emit 1 with probability 1/φ²; from state 1, emit 0 with probability 1 (no adjacent 1's). Note that 1/φ + 1/φ² = 1.

Think of the source as emitting variable-length symbols, "0" with probability 1/φ and "10" with probability 1/φ²; taking the variable symbol lengths into account,

Entropy = (1/φ)∙log φ + ½∙(1/φ²)∙log φ² = log φ, which is maximal.
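A small Python check that the number of n-bit strings with no adjacent 1's is fn, so the information per bit of such strings approaches log2 φ, about 0.694:

```python
import math

phi = (1 + math.sqrt(5)) / 2

def count_no_adjacent_ones(n):
    """Number of n-bit strings with no two adjacent 1's (equals f_n with f_0 = 1, f_1 = 2)."""
    a, b = 1, 2                      # counts for n = 0 and n = 1
    for _ in range(n):
        a, b = b, a + b
    return a

for n in (10, 20, 40):
    c = count_no_adjacent_ones(n)
    print(n, math.log2(c) / n)       # approaches log2(phi), about 0.6942
print(math.log2(phi))
```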