Spring 2020, Venue: Haag 313
ECE 5578 Multimedia Communication Lec 02 - Entropy & Lossless Coding
Zhu Li
Dept of CSEE, UMKC
Office: FH560E, Email: [email protected], Ph: x 2346.
http://l.web.umkc.edu/lizhu
ECE 5578 Multimedia Communication, 2020. p.1
slides created with WPS Office Linux and EqualX equation editor
Outline
Lecture 01 ReCap
Info Theory on Entropy
Lossless Entropy Coding
ECE 5578 Multimedia Communication, 2020. p.2
Video Compression in Summary Compression Pipeline:
ECE 5578 Multimedia Communication, 2020. p.3
Video Coding Standards: Rate-Distortion Performance
Pre-HEVC
ECE 5578 Multimedia Communication, 2020. p.4
MPEG System: Storage & Communication Solution
HTTP Adaptive Streaming of Video
ECE 5578 Multimedia Communication, 2020. p.5
Outline
Lecture 01 ReCap
Info Theory on Entropy: Self Info of an event; Entropy of the source; Relative Entropy; Mutual Info
Entropy Coding
Entropy Coding
Thanks to SFU's Prof. Jie Liang for the slides!
ECE 5578 Multimedia Communication, 2020. p.6
Entropy and its Application
Entropy coding: the last part of a compression system.
Losslessly represent symbols. Key idea: assign short codes to common symbols and long codes to rare symbols.
Question: How to evaluate a compression method?
Need to know the lower bound we can achieve: the entropy.
[Encoder pipeline figure: Transform → Quantization → Entropy coding → bitstream 0100100101111]
ECE 5578 Multimedia Communication, 2020. p.7
Claude Shannon: 1916-2001
A distant relative of Thomas Edison.
1932: Went to the University of Michigan.
1937: Master's thesis at MIT became the foundation of digital circuit design: "the most important, and also the most famous, master's thesis of the century".
1940: PhD, MIT.
1940-1956: Bell Labs (back to MIT after that).
1948: The birth of Information Theory: "A Mathematical Theory of Communication," Bell System Technical Journal.
ECE 5578 Multimedia Communication, 2020. p.8
Axiomatic Definition of Information
Information is a measure of uncertainty or surprise.
Axiom 1: Information of an event is a function of its probability: i(A) = f(P(A)). What is the expression of f()?
Axiom 2: Rare events have high information content ("Water found on Mars!!!"); common events have low information content ("It's raining in Vancouver."). Information should be a decreasing function of the probability. Still numerous choices of f().
Axiom 3: Information of two independent events is the sum of the individual information: if P(AB) = P(A)P(B), then i(AB) = i(A) + i(B).
Only the logarithmic function satisfies all three axioms.
ECE 5578 Multimedia Communication, 2020. p.9
Self-information
Shannon's Definition [1948]:
X: discrete random variable with alphabet {A1, A2, …, AN}
Probability mass function: p(x) = Pr{X = x}
Self-information of an event X = x:
    i(x) = log_b( 1 / p(x) ) = -log_b p(x)
If b = 2, the unit of information is the bit.
Self-information indicates the number of bits needed to represent an event.
[Figure: -log_b P(x) as a function of P(x), decreasing from large values near P(x) = 0 to 0 at P(x) = 1]
ECE 5578 Multimedia Communication, 2020. p.10
Entropy of a Random Variable
Recall: the mean of a function g(X):  E_p[g(X)] = Σ_x p(x) g(x)
Entropy is the expected self-information of the r.v. X:
    H(X) = Σ_x p(x) log( 1 / p(x) ) = E_p[ log( 1 / p(X) ) ] = -E_p[ log p(X) ]
The entropy represents the minimal number of bits needed to losslessly represent one output of the source.
Also written as H(p): a function of the distribution of X, not of the values of X.
ECE 5578 Multimedia Communication, 2020. p.11
Example
P(X=0) = 1/2, P(X=1) = 1/4, P(X=2) = 1/8, P(X=3) = 1/8. Find the entropy of X.
Solution:
    H(X) = Σ_x p(x) log( 1 / p(x) )
         = 1/2 log2 + 1/4 log4 + 1/8 log8 + 1/8 log8
         = 1/2 + 2/4 + 3/8 + 3/8 = 7/4 bits/sample.
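As a quick numerical check (not part of the original slides), a minimal Python sketch of the two definitions above, reproducing the 7/4 bits/sample result:

import math

def self_info(p):
    # i(x) = -log2 p(x), in bits
    return -math.log2(p)

def entropy(pmf):
    # H(X) = sum_x p(x) log2(1/p(x)); zero-probability symbols contribute nothing
    return sum(-p * math.log2(p) for p in pmf if p > 0)

pmf = [1/2, 1/4, 1/8, 1/8]
print([self_info(p) for p in pmf])   # [1.0, 2.0, 3.0, 3.0] bits
print(entropy(pmf))                  # 1.75 bits/sample = 7/4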
ECE 5578 Multimedia Communication, 2020. p.12
Example
A binary source: only two possible outputs: 0, 1
Source output example: 000101000101110101……
p(X=0) = p, p(X=1) = 1 - p.
Entropy of X: H(p) = p (-log2(p)) + (1-p) (-log2(1-p))
H = 0 when p = 0 or p = 1: fixed output, no information.
H is largest when p = 1/2: highest uncertainty; H = 1 bit in this case.
Properties: H ≥ 0; H is concave (proved later).
[Figure: binary entropy H(p) versus p for 0 ≤ p ≤ 1, peaking at 1 bit at p = 1/2; equal probabilities maximize entropy]
ECE 5578 Multimedia Communication, 2020. p.13
Joint entropy
We can get a better understanding of the source S by looking at a block of output X1 X2 … Xn.
The joint probability of a block of output: p(X1 = i1, X2 = i2, …, Xn = in)
Joint entropy:
    H(X1, X2, …, Xn) = Σ_{i1} Σ_{i2} … Σ_{in} p(X1 = i1, …, Xn = in) log( 1 / p(X1 = i1, …, Xn = in) )
                     = -E[ log p(X1, …, Xn) ]
Joint entropy is the number of bits required to represent the sequence X1 X2 … Xn: this is the lower bound for entropy coding.
ECE 5578 Multimedia Communication, 2020. p.14
Conditional Entropy
Conditional self-information of an event X = x, given that event Y = y has occurred:
    i(x | y) = log( 1 / p(x | y) ) = log( p(y) / p(x, y) )
Conditional entropy H(Y | X): average conditional self-information, i.e., the remaining uncertainty about Y given the knowledge of X:
    H(Y | X) = Σ_x p(x) H(Y | X = x) = -Σ_x p(x) Σ_y p(y | x) log p(y | x)
             = -Σ_x Σ_y p(x, y) log p(y | x) = -E[ log p(Y | X) ]
Note: p(x | y), p(x, y) and p(y) are three different distributions: p1(x | y), p2(x, y) and p3(y).
ECE 5578 Multimedia Communication, 2020. p.15
Chain Rule
H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
[Venn diagram: circles H(X) and H(Y); regions H(X | Y) and H(Y | X); total area H(X, Y)]
Proof:
    H(X, Y) = -Σ_x Σ_y p(x, y) log p(x, y)
            = -Σ_x Σ_y p(x, y) log( p(x) p(y | x) )
            = -Σ_x Σ_y p(x, y) log p(x) - Σ_x Σ_y p(x, y) log p(y | x)
            = -Σ_x p(x) log p(x) + H(Y | X) = H(X) + H(Y | X).
Simpler notation:
    H(X, Y) = -E[ log p(X, Y) ] = -E[ log p(X) + log p(Y | X) ] = H(X) + H(Y | X).
ECE 5578 Multimedia Communication, 2020. p.16
Conditional Entropy Example: for the following joint distribution p(x, y), find H(X | Y).

          X=1    X=2    X=3    X=4
    Y=1   1/8    1/16   1/32   1/32
    Y=2   1/16   1/8    1/32   1/32
    Y=3   1/16   1/16   1/16   1/16
    Y=4   1/4    0      0      0

P(X): [1/2, 1/4, 1/8, 1/8]  >>  H(X) = 7/4 bits
P(Y): [1/4, 1/4, 1/4, 1/4]  >>  H(Y) = 2 bits
H(X|Y) = Σ_{i=1}^{4} p(Y = i) H(X | Y = i)
       = 1/4 H(1/2, 1/4, 1/8, 1/8) + 1/4 H(1/4, 1/2, 1/8, 1/8) + 1/4 H(1/4, 1/4, 1/4, 1/4) + 1/4 H(1, 0, 0, 0) = 11/8 bits
Indeed, H(X|Y) = H(X, Y) - H(Y) = 27/8 - 2 = 11/8 bits.
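A small Python sketch (an illustration added here, not from the slides) that recomputes these quantities directly from the joint table:

import math

# joint distribution p(x, y); rows are y = 1..4, columns are x = 1..4
P = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0   ],
]

def H(pmf):
    return sum(-p * math.log2(p) for p in pmf if p > 0)

px = [sum(P[y][x] for y in range(4)) for x in range(4)]   # marginal of X (column sums)
py = [sum(row) for row in P]                              # marginal of Y (row sums)
Hxy = H([p for row in P for p in row])                    # joint entropy H(X, Y)

print(H(px), H(py), Hxy)   # 1.75 (= 7/4), 2.0, 3.375 (= 27/8)
print(Hxy - H(py))         # H(X|Y) = 27/8 - 2 = 1.375 = 11/8 bits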
ECE 5578 Multimedia Communication, 2020. p.17
General Chain Rule
General form of the chain rule:
    H(X1, X2, …, Xn) = Σ_{i=1}^{n} H(Xi | X1, …, X_{i-1})
The joint encoding of a sequence can be broken into the sequential encoding of each sample, e.g. H(X1, X2, X3) = H(X1) + H(X2 | X1) + H(X3 | X2, X1).
Advantages: joint encoding needs the joint probability, which is difficult to obtain; sequential encoding only needs conditional entropies and can use local neighbors to approximate the conditional entropy → context-adaptive arithmetic coding.
Adding H(Z): H(X, Y | Z) + H(Z) = H(X, Y, Z) = H(Z) + H(X | Z) + H(Y | X, Z).
ECE 5578 Multimedia Communication, 2020. p.18
General Chain Rule
    p(x1, …, xn) = p(x1) p(x2 | x1) … p(xn | x1, …, x_{n-1})
Proof:
    H(X1, …, Xn) = -Σ_{x1,…,xn} p(x1, …, xn) log p(x1, …, xn)
                 = -Σ_{x1,…,xn} p(x1, …, xn) log Π_{i=1}^{n} p(xi | x1, …, x_{i-1})
                 = -Σ_{i=1}^{n} Σ_{x1,…,xn} p(x1, …, xn) log p(xi | x1, …, x_{i-1})
                 = -Σ_{i=1}^{n} Σ_{x1,…,xi} p(x1, …, xi) log p(xi | x1, …, x_{i-1})
                 = Σ_{i=1}^{n} H(Xi | X1, …, X_{i-1}).
ECE 5578 Multimedia Communication, 2020. p.19
General Chain Rule
The complexity of the conditional probability p(xi | x1, …, x_{i-1}) grows as i increases.
In many cases we can approximate the conditional probability with some nearest neighbors (contexts):
    p(xi | x1, …, x_{i-1}) ≈ p(xi | x_{i-L}, …, x_{i-1})
The low-dimensional conditional probability is more manageable.
How to measure the quality of the approximation? Relative entropy.
(Example context sequences: 0 1 1 0 1 0 1 …, a b c b c a b c b a b c b a)
ECE 5578 Multimedia Communication, 2020. p.20
Relative Entropy – Cost of Coding with the Wrong Distribution
Also known as Kullback-Leibler (K-L) distance, information divergence, information gain.
A measure of the "distance" between two distributions:
    D(p || q) = Σ_x p(x) log( p(x) / q(x) ) = E_p[ log( p(X) / q(X) ) ]
In many applications, the true distribution p(X) is unknown, and we only know an estimated distribution q(X). What is the inefficiency in representing X?
The true entropy:  R1 = -Σ_x p(x) log p(x)
The actual rate:   R2 = -Σ_x p(x) log q(x)
The difference:    R2 - R1 = D(p || q)
ECE 5578 Multimedia Communication, 2020. p.21
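As an illustration (not in the original slides), a short Python sketch that checks R2 - R1 = D(p||q) for a true distribution p and a wrong coding distribution q, assuming q(x) > 0 wherever p(x) > 0:

import math

def D(p, q):
    # relative entropy D(p||q) = sum_x p(x) log2(p(x)/q(x)), in bits
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [1/2, 1/4, 1/8, 1/8]   # true distribution
q = [1/4, 1/4, 1/4, 1/4]   # estimated (wrong) distribution

R1 = sum(-px * math.log2(px) for px in p if px > 0)    # true entropy
R2 = sum(-px * math.log2(qx) for px, qx in zip(p, q))  # actual rate with q
print(R1, R2, R2 - R1, D(p, q))   # 1.75, 2.0, 0.25, 0.25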
Relative Entropy
    D(p || q) = Σ_x p(x) log( p(x) / q(x) ) = E_p[ log( p(X) / q(X) ) ]
Properties:
D(p || q) ≥ 0 (proved later).
D(p || q) = 0 if and only if q = p.
What if p(x) > 0 but q(x) = 0 for some x? Then D(p || q) = ∞.
Caution: D(p || q) is not a true distance:
Not symmetric in general: D(p || q) ≠ D(q || p).
Does not satisfy the triangle inequality.
ECE 5578 Multimedia Communication, 2020. p.22
Relative Entropy
How to make it symmetric? Many possibilities, for example:
    (1/2) ( D(p || q) + D(q || p) )
Other symmetric combinations of D(p || q) and D(q || p) can also be used; these can be useful for pattern classification.
ECE 5578 Multimedia Communication, 2020. p.23
Mutual Information
Mutual information between two events:
    i(x; y) = i(x) - i(x | y) = log( p(x | y) / p(x) ) = log( p(x, y) / ( p(x) p(y) ) )
where i(x | y) = -log p(x | y) is the conditional self-information.
A measure of the amount of information that one event contains about another, or the reduction in the uncertainty of one event due to the knowledge of the other.
Note: i(x; y) can be negative, if p(x | y) < p(x).
ECE 5578 Multimedia Communication, 2020. p.24
Mutual Information
I(X; Y): mutual information between two random variables:
    I(X; Y) = Σ_x Σ_y p(x, y) i(x; y) = Σ_x Σ_y p(x, y) log( p(x, y) / ( p(x) p(y) ) )
            = D( p(x, y) || p(x) p(y) ) = E[ log( p(X, Y) / ( p(X) p(Y) ) ) ]
Mutual information is a relative entropy, and it is symmetric: I(X; Y) = I(Y; X).
If X, Y are independent: p(x, y) = p(x) p(y), so I(X; Y) = 0: knowing X does not reduce the uncertainty of Y.
Different from i(x; y), I(X; Y) ≥ 0 (due to averaging).
ECE 5578 Multimedia Communication, 2020. p.25
Entropy and Mutual Information
1. I(X; Y) = H(X) - H(X | Y)
Proof:
    I(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / ( p(x) p(y) ) ) = Σ_x Σ_y p(x, y) log( p(x | y) / p(x) )
            = -Σ_x Σ_y p(x, y) log p(x) + Σ_x Σ_y p(x, y) log p(x | y)
            = H(X) - H(X | Y)
2. Similarly: I(X; Y) = H(Y) - H(Y | X)
3. I(X; Y) = H(X) + H(Y) - H(X, Y)
Proof: expand the definition:
    I(X; Y) = Σ_x Σ_y p(x, y) ( log p(x, y) - log p(x) - log p(y) ) = H(X) + H(Y) - H(X, Y).
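A quick numerical check of identities 1-3 (an added sketch, not from the slides), reusing the joint distribution from the earlier conditional-entropy example:

import math

P = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0   ],
]

def H(pmf):
    return sum(-p * math.log2(p) for p in pmf if p > 0)

px = [sum(P[y][x] for y in range(4)) for x in range(4)]
py = [sum(row) for row in P]
Hxy = H([p for row in P for p in row])

# definition: I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
I = sum(P[y][x] * math.log2(P[y][x] / (px[x] * py[y]))
        for y in range(4) for x in range(4) if P[y][x] > 0)

print(I)                          # 0.375 bits
print(H(px) - (Hxy - H(py)))      # identity 1: H(X) - H(X|Y)
print(H(py) - (Hxy - H(px)))      # identity 2: H(Y) - H(Y|X)
print(H(px) + H(py) - Hxy)        # identity 3: H(X) + H(Y) - H(X,Y)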
ECE 5578 Multimedia Communication, 2020. p.26
Entropy and Mutual Information
[Venn diagram: circles H(X) and H(Y); intersection I(X; Y); regions H(X | Y) and H(Y | X); total area H(X, Y)]
It can be seen from this figure that I(X; X) = H(X).
Proof: let X = Y in I(X; Y) = H(X) + H(Y) - H(X, Y), or in I(X; Y) = H(X) - H(X | Y) (and use H(X | X) = 0).
ECE 5578 Multimedia Communication, 2020. p.27
The penalty of coding, say, pixels X and Y separately is the relative entropy between their joint distribution and the assumed independent distribution.
Application of Mutual Information
Mutual information can be used in the optimization of context quantization.
Example: if each neighbor has 26 possible values (a to z), then 5 neighbors have 26^5 combinations: too many conditional probabilities to estimate.
To reduce the number, we can group similar data patterns together → context quantization:
    p(xi | x1, …, x_{i-1}) ≈ p( xi | f(x1, …, x_{i-1}) )
(Example sequence: a b c b c a b c b a b c b a)
ECE 5578 Multimedia Communication, 2020. p.28
Application of Mutual Information in Compression
    p(xi | x1, …, x_{i-1}) ≈ p( xi | f(x1, …, x_{i-1}) ),   H(X1, X2, …, Xn) = Σ_{i=1}^{n} H(Xi | X1, …, X_{i-1})
We need to design the function f() to minimize the conditional entropy H( Xi | f(X1, …, X_{i-1}) ).
But H(X | Y) = H(X) - I(X; Y), so the problem is equivalent to maximizing the mutual information between Xi and f(X1, …, X_{i-1}).
For further info: Liu and Karam, "Mutual Information-Based Analysis of JPEG2000 Contexts," IEEE Trans. Image Processing, vol. 14, no. 4, April 2005, pp. 411-422.
ECE 5578 Multimedia Communication, 2020. p.29
Outline
Lecture 01 ReCap
Info Theory and Entropy
Entropy Coding: Prefix Coding; Kraft-McMillan Inequality; Shannon Codes
ECE 5578 Multimedia Communication, 2020. p.30
Variable Length Coding
Design the mapping from source symbols to codewords:
Lossless mapping.
Different codewords may have different lengths.
Goal: minimizing the average codeword length; the entropy is the lower bound.
ECE 5578 Multimedia Communication, 2020. p.31
Classes of Codes
Non-singular code: different inputs are mapped to different codewords (invertible).
Uniquely decodable code: any encoded string has only one possible source string, but may need delay to decode.
Prefix-free code (or simply prefix, or instantaneous): no codeword is a prefix of any other codeword. The focus of our studies.
Questions: What characterizes it? How to design it? Is it optimal?
[Figure: nested classes: all codes ⊃ non-singular codes ⊃ uniquely decodable codes ⊃ prefix-free codes]
ECE 5578 Multimedia Communication, 2020. p.32
Prefix Code
Examples (codes for X = 1, 2, 3, 4):
    Singular: 0, 0, 0, 0
    Non-singular, but not uniquely decodable: 0, 010, 01, 10  (needs punctuation: ……01011…)
    Uniquely decodable, but not prefix-free: 0, 01, 011, 0111  (need to look at the next bit to decode the previous code)
    Prefix-free: 0, 10, 110, 111
ECE 5578 Multimedia Communication, 2020. p.33
Carter-Gill's Conjecture [1974]
Every uniquely decodable code can be replaced by a prefix-free code with the same set of codeword compositions.
So we only need to study prefix-free codes.
ECE 5578 Multimedia Communication, 2020. p.34
Prefix-free Code
Can be uniquely decoded. No codeword is a prefix of another one. Also called a prefix code.
Goal: construct a prefix code with minimal expected length.
Can put all codewords in a binary tree:
[Binary code tree: root node, internal nodes, leaf nodes; example codewords 0, 10, 110, 111]
A prefix-free code contains leaves only. How to express the requirement mathematically?
ECE 5578 Multimedia Communication, 2020. p.35
Kraft-McMillan Inequality
The characteristic of prefix-free codes: the codeword lengths li, i = 1, …, N of a prefix code over an alphabet of size D (= 2) satisfy the inequality
    Σ_{i=1}^{N} D^{-li} ≤ 1.
Conversely, if a set of {li} satisfies the inequality above, then there exists a prefix code with codeword lengths li, i = 1, …, N.
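A one-line Kraft-sum check in Python (an added sketch, not from the slides), used for the valid and invalid examples on the next slides:

def kraft_sum(lengths, D=2):
    # sum_i D^(-l_i); a prefix code with these lengths exists iff the sum is <= 1
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))       # {0, 10, 110, 111}: 1.0, valid
print(kraft_sum([1, 2, 2, 3, 3]))    # {0, 10, 11, 110, 111}: 1.25 > 1, not a prefix code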
ECE 5578 Multimedia Communication, 2020. p.36
Kraft-McMillan Inequality
Consider D = 2: expand the binary code tree to full depth L = max(li), where li is the code length.
    Σ_{i=1}^{N} 2^{-li} ≤ 1  ⟺  Σ_{i=1}^{N} 2^{L - li} ≤ 2^L
Number of nodes in the last level: 2^L. Each code corresponds to a sub-tree, whose number of offsprings in the last level is 2^{L - li}.
K-M inequality: the number of L-th level offsprings of all codes is at most 2^L.
Example: {0, 10, 110, 111} with L = 3: the offspring counts are 2^2, 2^1, 2^0, 2^0 = {4, 2, 1, 1}, and 4 + 2 + 1 + 1 = 8 = 2^3.
ECE 5578 Multimedia Communication, 2020. p.37
Kraft-McMillan Inequality
Invalid code: {0, 10, 11, 110, 111}, li = [1 2 2 3 3].
[Code tree with codewords 0, 10, 11, 110, 111: the codeword 11 is also an internal node]
This leads to more than 2^L offsprings at level L = 3: 4 + 2 + 2 + 1 + 1 = 10 > 2^3 = 8, so the K-M inequality
    Σ_i 2^{-li} ≤ 1
is violated (here Σ_i 2^{-li} = 1.25 > 1).
ECE 5578 Multimedia Communication, 2020. p.38
Extended Kraft Inequality
A countably infinite prefix code (infinite number of codewords) also satisfies the Kraft inequality:
    Σ_{i=1}^{∞} 2^{-li} ≤ 1
Example: 0, 10, 110, 1110, 11110, 11……10, … (Golomb-Rice code, next lecture)
Each codeword can be mapped to a subinterval of [0, 1) that is disjoint from the others (revisited in arithmetic coding):
[Figure: 0 → [0, 0.5), 10 → [0.5, 0.75), 110 → [0.75, 0.875), …]
ECE 5578 Multimedia Communication, 2020. p.39
Shannon Code: Bounds on Optimal Code
The ideal code length l*_i = -log_D p_i is not an integer in general; practical codewords have to have integer lengths.
Shannon Code:
    li = ⌈ log_D( 1 / p_i ) ⌉
Is this a valid prefix code? Check the Kraft inequality:
    Σ_i D^{-⌈ log_D(1/p_i) ⌉} ≤ Σ_i D^{-log_D(1/p_i)} = Σ_i p_i = 1.   Yes!
Since log_D(1/p_i) ≤ li < log_D(1/p_i) + 1, the expected length satisfies
    H_D(X) ≤ L < H_D(X) + 1.
This is just one choice. It may not be optimal (see example later).
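A minimal Python sketch (added here for illustration) that computes binary Shannon code lengths and checks the Kraft sum and the bound H(X) ≤ L < H(X) + 1:

import math

def shannon_lengths(pmf):
    # binary Shannon code lengths: l_i = ceil(log2(1/p_i))
    return [math.ceil(math.log2(1.0 / p)) for p in pmf]

pmf = [1/2, 1/4, 1/8, 1/8]
L = shannon_lengths(pmf)
avg = sum(p * l for p, l in zip(pmf, L))
H = sum(-p * math.log2(p) for p in pmf)
print(L)                              # [1, 2, 3, 3]
print(sum(2 ** (-l) for l in L))      # Kraft sum 1.0 <= 1, so a prefix code exists
print(H, avg)                         # dyadic probabilities: average length equals H(X)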
ECE 5578 Multimedia Communication, 2020. p.40
Optimal Code
The optimal code with integer lengths should be no worse than the Shannon code:
    H_D(X) ≤ L* ≤ H_D(X) + 1
To reduce the overhead per symbol: encode a block of symbols {x1, x2, …, xn} together:
    L_n = (1/n) E{ l(x1, x2, …, xn) } = (1/n) Σ p(x1, x2, …, xn) l(x1, x2, …, xn)
    H(X1, X2, …, Xn) ≤ E{ l(X1, X2, …, Xn) } ≤ H(X1, X2, …, Xn) + 1
Assume i.i.d. samples: H(X1, X2, …, Xn) = n H(X), so
    H(X) ≤ L_n ≤ H(X) + 1/n,   and L_n → H(X) (the entropy rate) if the source is stationary.
ECE 5578 Multimedia Communication, 2020. p.41
Optimal Code
Impact of the wrong pdf: what is the penalty if the pdf we use is different from the true pdf?
True pdf: p(x); estimated pdf: q(x); codeword length: l(x); expected length: E_p l(X).
    H(p) + D(p || q) ≤ E_p l(X) < H(p) + D(p || q) + 1
Proof: assume a Shannon code, l(x) = ⌈ log( 1 / q(x) ) ⌉. Then
    E_p l(X) = Σ_x p(x) ⌈ log( 1 / q(x) ) ⌉ < Σ_x p(x) ( log( 1 / q(x) ) + 1 )
             = Σ_x p(x) ( log( p(x) / q(x) ) + log( 1 / p(x) ) ) + 1
             = H(X) + D(p || q) + 1.
The lower bound is derived similarly.
ECE 5578 Multimedia Communication, 2020. p.42
Shannon Code is not optimal
Example: binary r.v. X: p(0) = 0.9999, p(1) = 0.0001.
Entropy: 0.0015 bits/sample.
Assign binary codewords by the Shannon code l(x) = ⌈ log2( 1 / p(x) ) ⌉:
    ⌈ log2( 1 / 0.9999 ) ⌉ = 1,   ⌈ log2( 1 / 0.0001 ) ⌉ = 14.
Expected length: 0.9999 × 1 + 0.0001 × 14 = 1.0013, within the range [H(X), H(X) + 1].
But we can easily beat this with the code {0, 1}: expected length 1.
ECE 5578 Multimedia Communication, 2020. p.43
Summary & Break
Info Theory Summary:
Entropy: H(X)
Conditional Entropy: H(X|Y)
Mutual Info: I(X; Y)
Relative Entropy: D(p||q)
Practical Codes:
Huffman Coding
Golomb Coding and JPEG 2000 Lossless Coding
ECE 5578 Multimedia Communication, 2020. p.44
Practical Entropy Codes
ECE 5578 Multimedia Communication, 2020. p.45
Entropy
Self Info of an event:
    i(x) = -log Pr{X = x} = -log p(x)
Entropy of a source:
    H(X) = Σ_x p(x) log( 1 / p(x) )
Conditional Entropy, Mutual Information:
    H(X | Y) = H(X, Y) - H(Y),   I(X; Y) = H(X) + H(Y) - H(X, Y)
ECE 5578 Multimedia Communication, 2020. p.46
[Venn diagram: H(X), H(Y), I(X; Y), H(X | Y), H(Y | X); total area H(X, Y)]
Main application: Context Modeling (e.g., sequences a b c b c a b c b a b c b a)
Relative Entropy:
    D(p || q) = Σ_x p(x) log( p(x) / q(x) )
Lossless Coding
Prefix Coding: codes on leaves; no code is a prefix of other codes; simple encoding/decoding.
Kraft-McMillan Inequality: for a coding scheme with code lengths l1, l2, …, ln:
    Σ_i 2^{-li} ≤ 1
Given a set of integer lengths {l1, l2, …, ln} that satisfies the above inequality, we can always find a prefix code with code lengths l1, l2, …, ln.
ECE 5578 Multimedia Communication, 2020. p.47
[Binary code tree: root node, internal nodes, leaf nodes; codewords 0, 10, 110, 111]
Context Reduces Entropy Example
[Figure: causal pixel neighborhood x1 x2 x3 / x4 x5 from lenna.png]
Condition reduces entropy: H(x5) > H(x5 | x4, x3, x2, x1)
Context function: f(x4, x3, x2, x1) = sum(x4, x3, x2, x1), and H(x5) > H(x5 | f(x4, x3, x2, x1))
Context examples: H(x5 | f(x4, x3, x2, x1) == 100), H(x5 | f(x4, x3, x2, x1) < 100)
(See getEntropy.m, lossless_coding.m)
ECE 5578 Multimedia Communication, 2020. p.48
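The .m files above are the course's MATLAB demos; the following Python sketch (an assumption-laden illustration, using numpy and a random stand-in array in place of lenna.png) estimates the same quantities empirically: the entropy of a pixel and its conditional entropy given a sum-of-neighbors context.

import numpy as np

def entropy_bits(samples):
    # empirical entropy (bits) of a 1-D array of discrete samples
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# img: 8-bit grayscale image as a 2-D integer array; a real test would load lenna.png here
img = np.random.randint(0, 256, (256, 256))

cur = img[1:, 1:-1].ravel()                     # x5: current pixel
ctx = (img[:-1, :-2] + img[:-1, 1:-1] +         # x1, x2, x3: row above
       img[:-1, 2:] + img[1:, :-2]).ravel()     # x4: left neighbor
bucket = ctx // 64                              # quantize the context sum into a few bins

H_cur = entropy_bits(cur)
H_cond = sum((bucket == c).mean() * entropy_bits(cur[bucket == c])
             for c in np.unique(bucket))
print(H_cur, H_cond)   # on real image data H_cond should be noticeably smaller than H_cur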
Huffman Code (1952)
Design requirement: the source probability distribution (but not available in many cases).
Procedure:
1. Sort the probabilities of all source symbols in descending order.
2. Merge the last two into a new symbol; add their probabilities.
3. Repeat Steps 1, 2 until only one symbol (the root) is left.
4. Code assignment: traverse the tree from the root to each leaf node, assigning 0 to the top branch and 1 to the bottom branch.
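A compact Python sketch of this procedure (one possible implementation, not the slides' code) using a heap; the exact codewords depend on how ties are broken, but the average length is the same:

import heapq
from itertools import count

def huffman_code(probs):
    # probs: dict symbol -> probability; returns dict symbol -> binary codeword string
    tick = count()                        # tie-breaker so the heap never compares dicts
    heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # the two least probable groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}   # prepend a bit at each merge
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

probs = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
code = huffman_code(probs)
print(code)
print(sum(probs[s] * len(w) for s, w in code.items()))   # 2.2 bits/symbol, as on p.52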
ECE 5578 Multimedia Communication, 2020. p.49
Example
Source alphabet A = {a1, a2, a3, a4, a5}
Probability distribution: {0.2, 0.4, 0.2, 0.1, 0.1}
[Figure: repeated sort/merge steps: {0.4, 0.2, 0.2, 0.1, 0.1} → {0.4, 0.2, 0.2, 0.2} → {0.4, 0.4, 0.2} → {0.6, 0.4} → {1}; assigning 0/1 at each merge gives the codes a1 → 01, a2 → 1, a3 → 000, a4 → 0010, a5 → 0011]
ECE 5578 Multimedia Communication, 2020. p.50
Huffman code is prefix-free
[Code tree: a2 → 1, a1 → 01, a3 → 000, a4 → 0010, a5 → 0011]
All codewords are leaf nodes; no code is a prefix of any other code (prefix-free).
ECE 5578 Multimedia Communication, 2020. p.51
Average Codeword Length vs Entropy
Source alphabet A = {a, b, c, d, e} Probability distribution: {0.2, 0.4, 0.2, 0.1, 0.1} Code: {01, 1, 000, 0010, 0011}
Entropy: H(S) = -(0.2*log2(0.2)*2 + 0.4*log2(0.4) + 0.1*log2(0.1)*2) = 2.122 bits/symbol
Average Huffman codeword length: L = 0.2*2 + 0.4*1 + 0.2*3 + 0.1*4 + 0.1*4 = 2.2 bits/symbol
This verifies H(S) ≤ L < H(S) + 1.
ECE 5578 Multimedia Communication, 2020. p.52
Huffman Code is not unique
Multiple ordering choices for tied probabilities.
Two choices for each split: 0, 1 or 1, 0.
[Figure: alternative Huffman trees built from the same merged probabilities {0.4, 0.2, 0.4} → {0.6, 0.4} → {1}, giving e.g. the codes {1, 00, 01} or {0, 10, 11}, with the symbols a, b, c assigned in different orders]
ECE 5578 Multimedia Communication, 2020. p.53
Canonical Huffman Code
The Huffman algorithm is needed only to compute the optimal codeword lengths; the optimal codewords for a given data set are not unique.
Canonical Huffman code is well structured: given the codeword lengths, one can find the canonical Huffman code.
Also known as slice code, alphabetic code.
ECE 5578 Multimedia Communication, 2020. p.54
Canonical Huffman Code - Pruning the Binary Tree
Example: codeword lengths from the probabilities (e.g., l(x) = ⌈ log( 1 / p(x) ) ⌉): 2, 2, 3, 3, 3, 4, 4.
Verify that they satisfy the Kraft-McMillan inequality Σ_{i=1}^{N} 2^{-li} ≤ 1.
A non-canonical example: {01, 11, 000, 001, 100, 1010, 1011}
The canonical tree. Rules:
Assign 0 to the left branch and 1 to the right branch.
Build the tree from left to right in increasing order of depth.
Each leaf is placed at the first available position.
Canonical codewords: {00, 01, 100, 101, 110, 1110, 1111}
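A small Python sketch (an illustration, not from the slides) that assigns canonical codewords from a list of lengths, following the three rules above; it reproduces the canonical code for lengths 2, 2, 3, 3, 3, 4, 4:

def canonical_code(lengths):
    # assign codewords left to right, shortest first, each leaf at the first free spot
    lengths = sorted(lengths)
    codes, code, prev = [], 0, lengths[0]
    for l in lengths:
        code <<= (l - prev)               # descend to the next level when length grows
        codes.append(format(code, "0{}b".format(l)))
        code += 1
        prev = l
    return codes

print(canonical_code([2, 2, 3, 3, 3, 4, 4]))
# ['00', '01', '100', '101', '110', '1110', '1111']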
ECE 5578 Multimedia Communication, 2020. p.55
Advantages of Canonical Huffman
1. Reducing memory requirement
A non-canonical tree needs all codewords and the lengths of all codewords: a lot of space for a large table.
[Figure: non-canonical tree {01, 11, 000, 001, 100, 1010, 1011} vs. canonical tree {00, 01, 100, 101, 110, 1110, 1111}]
The canonical tree only needs:
Min: shortest codeword length
Max: longest codeword length
Distribution: number of codewords in each level
Here: Min = 2, Max = 4, # in each level: 2, 3, 2.
ECE 5578 Multimedia Communication, 2020. p.56
Golomb Code
Lecture 02 ReCap: Huffman Coding
Golomb Coding
ECE 5578 Multimedia Communication, 2020. p.57
Unary Code (Comma Code)
Encode a nonnegative integer n by n 1's and a 0 (or n 0's and a 1).
    n : codeword
    0 : 0
    1 : 10
    2 : 110
    3 : 1110
    4 : 11110
    5 : 111110
    … : …
Is this code prefix-free? Yes: every codeword is a leaf of the code tree.
[Code tree: codewords 0, 10, 110, 1110, …]
When is this code optimal? When the probabilities are 1/2, 1/4, 1/8, 1/16, 1/32, …: the Huffman code for this dyadic distribution becomes the unary code.
No need to store a codeword table; very simple.
ECE 5578 Multimedia Communication, 2020. p.58
Implementation – Very Efficient
Encoding:
UnaryEncode(n) {
    while (n > 0) {
        WriteBit(1);
        n--;
    }
    WriteBit(0);
}
Decoding:
UnaryDecode() {
    n = 0;
    while (ReadBit(1) == 1) {
        n++;
    }
    return n;
}
ECE 5578 Multimedia Communication, 2020. p.59
Golomb Code [Golomb, 1966]
A multi-resolution approach: divide all numbers into groups of equal size m.
Denoted Golomb(m) or Golomb-m.
Groups with smaller symbol values have shorter codes.
Symbols in the same group have codewords of similar lengths.
The codeword length grows much more slowly than in the unary code.
[Figure: number line from 0 to max divided into groups of size m]
Codeword: (unary, fixed-length) = (group ID coded in unary, index within the group coded in fixed length).
ECE 5578 Multimedia Communication, 2020. p.60
Golomb Code
    n = qm + r,   q = ⌊n/m⌋,   r = n - qm
q: quotient, coded with the unary code:
    q : codeword
    0 : 0
    1 : 10
    2 : 110
    3 : 1110
    4 : 11110
    5 : 111110
    6 : 1111110
    … : …
r: remainder, "fixed-length" code:
k bits if m = 2^k; e.g., m = 8: 000, 001, ……, 111
If m ≠ 2^k (not desired): ⌊log2 m⌋ bits for smaller r, ⌈log2 m⌉ bits for larger r; e.g., m = 5: 00, 01, 10, 110, 111
ECE 5578 Multimedia Communication, 2020. p.61
Golomb Code with m = 5 (Golomb-5)
    n  q  r  code        n  q  r  code        n   q  r  code
    0  0  0  000         5  1  0  1000        10  2  0  11000
    1  0  1  001         6  1  1  1001        11  2  1  11001
    2  0  2  010         7  1  2  1010        12  2  2  11010
    3  0  3  0110        8  1  3  10110       13  2  3  110110
    4  0  4  0111        9  1  4  10111       14  2  4  110111
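A Python sketch of Golomb(m) encoding for a general m (an added illustration; it uses the truncated binary remainder described above: ⌊log2 m⌋ bits for small r, ⌈log2 m⌉ bits for large r), reproducing the Golomb-5 table:

import math

def golomb_encode(n, m):
    # unary-coded quotient followed by the (truncated binary) remainder; assumes m >= 2
    q, r = divmod(n, m)
    bits = "1" * q + "0"                  # unary code for q
    b = math.ceil(math.log2(m))
    cutoff = (1 << b) - m                 # how many remainders get only b-1 bits
    if r < cutoff:
        bits += format(r, "0{}b".format(b - 1)) if b > 1 else ""
    else:
        bits += format(r + cutoff, "0{}b".format(b))
    return bits

print([golomb_encode(n, 5) for n in range(10)])
# ['000', '001', '010', '0110', '0111', '1000', '1001', '1010', '10110', '10111']
print([golomb_encode(n, 8) for n in range(4)])   # Rice case m = 2^k: k-bit remainder
# ['0000', '0001', '0010', '0011']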
ECE 5578 Multimedia Communication, 2020. p.62
Golomb vs Canonical Huffman
Golomb code is a canonical Huffman code, with more properties.
Codewords: 000, 001, 010, 0110, 0111, 1000, 1001, 1010, 10110, 10111
Canonical form: from left to right, from short to long, take the first valid spot.
ECE 5578 Multimedia Communication, 2020. p.63
Golomb-Rice Code
A special Golomb code with m = 2^k: the remainder r is simply the k LSBs of n.
m = 8:
    n  q  r  code        n   q  r  code
    0  0  0  0000        8   1  0  10000
    1  0  1  0001        9   1  1  10001
    2  0  2  0010        10  1  2  10010
    3  0  3  0011        11  1  3  10011
    4  0  4  0100        12  1  4  10100
    5  0  5  0101        13  1  5  10101
    6  0  6  0110        14  1  6  10110
    7  0  7  0111        15  1  7  10111
ECE 5578 Multimedia Communication, 2020. p.64
Implementation
Remainder bits: RBits = 3 for m = 8.
Encoding:
GolombEncode(n, RBits) {
    q = n >> RBits;
    UnaryCode(q);
    WriteBits(n, RBits);   // output the lower (RBits) bits of n
}
Decoding:
GolombDecode(RBits) {
    q = UnaryDecode();
    n = (q << RBits) + ReadBits(RBits);
    return n;
}
ECE 5578 Multimedia Communication, 2020. p.65
Exponential Golomb Code (Exp-Golomb)
Golomb code divides the alphabet into groups of equal size m; in the Exp-Golomb code the group size increases exponentially (1, 2, 4, 8, 16, …).
Codes still contain two parts: unary code followed by a fixed-length code. Proposed by Teuhola in 1978.
[Figure: number line divided into groups of sizes 1, 2, 4, 8, 16, …]
    n  code          n   code
    0  0             8   1110001
    1  100           9   1110010
    2  101           10  1110011
    3  11000         11  1110100
    4  11001         12  1110101
    5  11010         13  1110110
    6  11011         14  1110111
    7  1110000       15  111100000
ECE 5578 Multimedia Communication, 2020. p.66
Implementation
Decoding:
ExpGolombDecode() {
    GroupID = UnaryDecode();
    if (GroupID == 0) {
        return 0;
    } else {
        Base = (1 << GroupID) - 1;
        Index = ReadBits(GroupID);
        return (Base + Index);
    }
}
GroupID per symbol: 0 for n = 0; 1 for n = 1-2 (codes 100, 101); 2 for n = 3-6 (11000…11011); 3 for n = 7-14 (1110000…1110111).
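For completeness, an encoder counterpart to the decoder above (a sketch added here, not from the slides), using the same GroupID/Base/Index convention:

def exp_golomb_encode(n):
    # group g holds the 2**g values n in [2**g - 1, 2**(g+1) - 2]; Base = 2**g - 1
    g = (n + 1).bit_length() - 1          # GroupID = floor(log2(n + 1))
    index = n - ((1 << g) - 1)            # offset within the group
    bits = "1" * g + "0"                  # unary GroupID, matching UnaryDecode()
    if g > 0:
        bits += format(index, "0{}b".format(g))
    return bits

print([exp_golomb_encode(n) for n in range(8)])
# ['0', '100', '101', '11000', '11001', '11010', '11011', '1110000']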
ECE 5578 Multimedia Communication, 2020. p.67
Outline
Golomb Code Family: Unary Code; Golomb Code; Golomb-Rice Code; Exponential Golomb Code
Why Golomb code?
ECE 5578 Multimedia Communication, 2020. p.68
Geometric Distribution (GD)
Geometric distribution with parameter ρ: P(x) = ρ^x (1 - ρ), x ≥ 0, integer.
The probability of the number of failures before the first success in a series of independent Yes/No experiments (Bernoulli trials).
Unary code is the optimal prefix code for a geometric distribution with ρ ≤ 1/2:
ρ = 1/4: P(x): 0.75, 0.19, 0.05, 0.01, 0.003, …  Huffman coding never needs to re-order → equivalent to the unary code. The unary code is the optimal prefix code, but not efficient (avg length >> entropy).
ρ = 3/4: P(x): 0.25, 0.19, 0.14, 0.11, 0.08, …  Reordering is needed for the Huffman code; the unary code is not the optimal prefix code.
ρ = 1/2: expected length = entropy. The unary code is not only the optimal prefix code, but also optimal among all entropy coding (including arithmetic coding).
ECE 5578 Multimedia Communication, 2020. p.69
Geometric Distribution (GD)
Geometric distribution is very useful for image/video compression.
Example 1: run-length coding
Binary sequence with i.i.d. distribution and P(0) = ρ ≈ 1, e.g.: 0000010000000010000110000001
Entropy << 1, so a per-bit prefix code has poor performance.
Run-length coding is efficient to compress the data:
r: number of consecutive 0's between two 1's
Run-length representation of the sequence: 5, 8, 4, 0, 6
Probability distribution of the run-length r:
P(r = n) = ρ^n (1 - ρ): n 0's followed by a 1.
The run has a one-sided geometric distribution with parameter ρ.
[Figure: P(r) versus r, decaying geometrically]
ECE 5578 Multimedia Communication, 2020. p.70
Geometric Distribution
GD is the discrete analogy of the exponential distribution:
    f(x) = λ e^{-λx}, x ≥ 0
The two-sided geometric distribution is the discrete analogy of the Laplacian distribution (also called the double exponential distribution):
    f(x) = (λ/2) e^{-λ|x|}
[Figure: one-sided exponential and two-sided Laplacian density curves]
ECE 5578 Multimedia Communication, 2020. p.71
Why Golomb Code?
Significance of Golomb code: for any geometric distribution (GD), the Golomb code is the optimal prefix code and is as close to the entropy as possible (among all prefix codes).
How to determine the Golomb parameter? How to apply it in a practical codec?
ECE 5578 Multimedia Communication, 2020. p.72
Geometric Distribution
Example 2: GD is also a good model for the prediction error
    e(n) = x(n) - pred(x(1), …, x(n-1)).
Most e(n)'s have small values around 0 and can be modeled by a two-sided geometric distribution:
    p(n) = ( (1 - ρ)/(1 + ρ) ) ρ^{|n|},   0 < ρ < 1.
[Figure: p(n) versus n, peaked at 0; causal prediction neighborhood x1 x2 x3 / x4 x5 with example prediction weights 0.2, 0.3, 0.2, 0.2]
ECE 5578 Multimedia Communication, 2020. p.73
Optimal Code for Geometric Distribution
Geometric distribution with parameter ρ: P(X = n) = ρ^n (1 - ρ).
Unary code is the optimal prefix code when ρ ≤ 1/2, and also optimal among all entropy coding for ρ = 1/2.
How to design the optimal code when ρ > 1/2?
Transform into a GD with parameter ≤ 1/2 (as close as possible). How? By grouping m events together!
Each x can be written as x = m·xq + xr. Then
    P_{Xq}(q) = Σ_{r=0}^{m-1} P_X(qm + r) = Σ_{r=0}^{m-1} ρ^{qm+r} (1 - ρ) = ρ^{qm} (1 - ρ) (1 - ρ^m)/(1 - ρ) = (ρ^m)^q (1 - ρ^m),
so xq has a geometric distribution with parameter ρ^m.
Unary code is optimal for xq if ρ^m ≤ 1/2, i.e.
    m ≥ -1/log2 ρ,   and m = ⌈ -1/log2 ρ ⌉ is the minimal possible integer.
[Figure: P(x) versus x for ρ ≤ 1/2 and for ρ > 1/2]
ECE 5578 Multimedia Communication, 2020. p.74
Golomb Parameter Estimation (J2K book: pp. 55)
    P(x) = (1 - ρ) ρ^x
Goal of adaptive Golomb code: for the given data, find the best m such that ρ^m ≤ 1/2.
How to find ρ from the statistics of past data?
    E(x) = Σ_{x=0}^{∞} x (1 - ρ) ρ^x = ρ (1 - ρ)/(1 - ρ)^2 = ρ/(1 - ρ),   so   ρ = E(x)/(1 + E(x)).
Then
    ρ^m = ( E(x)/(1 + E(x)) )^m ≤ 1/2   ⟺   m ≥ 1 / log2( (1 + E(x))/E(x) ).
Let m = 2^k:
    k ≥ log2( 1 / log2( (1 + E(x))/E(x) ) ),  which is too costly to compute.
ECE 5578 Multimedia Communication, 2020. p.75
Golomb Parameter Estimation (J2K book: pp. 55)
A faster method: assume ρ ≈ 1, so 1 - ρ ≈ 0 and, from E(x) = ρ/(1 - ρ), 1 - ρ ≈ 1/E(x). Then
    1 - ρ^m = 1 - (1 - (1 - ρ))^m ≈ m (1 - ρ) ≈ m / E(x),
so ρ^m ≤ 1/2 requires
    m = 2^k ≥ (1/2) E(x),   i.e.   k = max{ 0, ⌈ log2( E(x)/2 ) ⌉ }.
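A tiny Python sketch of this rule (an added illustration) that estimates the Rice parameter k from the sample mean and tries it on synthetic geometric data with ρ = 0.9 (so E(x) = ρ/(1-ρ) = 9, giving k = 3, m = 8):

import math, random

def rice_parameter(samples):
    # k = max(0, ceil(log2(E(x)/2))), i.e. the smallest k with m = 2**k >= E(x)/2
    mean = sum(samples) / len(samples)
    if mean <= 1.0:
        return 0                          # tiny means: plain unary (k = 0) is enough
    return max(0, math.ceil(math.log2(mean / 2)))

rho = 0.9
data = [int(math.log(1.0 - random.random()) / math.log(rho)) for _ in range(10000)]
print(sum(data) / len(data), rice_parameter(data))   # sample mean close to 9, k = 3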
ECE 5578 Multimedia Communication, 2020. p.76
Summary
Huffman Coding
A prefix code that is optimal in (average) code length.
Canonical form to reduce variation of the code length.
Widely used.
Golomb Coding
Suitable for coding prediction errors in images.
Optimal prefix code for geometric distributions (e.g., ρ = 0.5).
Simple to encode and decode.
Many practical applications, e.g., JPEG-2000 lossless.
ECE 5578 Multimedia Communication, 2020. p.77