Spring 2020, Venue: Haag 313
ECE 5578 Multimedia Communication Lec 02 - Entropy & Lossless Coding
Zhu Li
Dept of CSEE, UMKC
Office: FH560E, Email: [email protected], Ph: x 2346.
http://l.web.umkc.edu/lizhu
ECE 5578 Multimedia Communication, 2020. p.1
slides created with WPS Office Linux and EqualX equation editor
Outline
Lecture 01 ReCap
Info Theory on Entropy
Lossless Entropy Coding
ECE 5578 Multimedia Communication, 2020. p.2
Video Compression in Summary Compression Pipeline:
ECE 5578 Multimedia Communication, 2020. p.3
Video Coding Standards: Rate-Distortion Performance
Pre-HEVC
ECE 5578 Multimedia Communication, 2020. p.4
MPEG System: Storage & Communication Solution
HTTP Adaptive Streaming of Video
ECE 5578 Multimedia Communication, 2020. p.5
Outline
Lecture 01 ReCap
Info Theory on Entropy: Self Info of an event; Entropy of the source; Relative Entropy; Mutual Info
Entropy Coding
Entropy Coding
Thanks to SFU's Prof. Jie Liang for the slides!
ECE 5578 Multimedia Communication, 2020. p.6
Entropy and its Application
Entropy coding: the last part of a compression system.
Losslessly represent symbols. Key idea: assign short codes to common symbols and long codes to rare symbols.
Question: How to evaluate a compression method?
Need to know the lower bound we can achieve: the entropy.
[Encoder pipeline figure: Transform → Quantization → Entropy coding → bitstream 0100100101111]
ECE 5578 Multimedia Communication, 2020. p.7
Claude Shannon: 1916-2001
A distant relative of Thomas Edison.
1932: Went to the University of Michigan.
1937: Master's thesis at MIT became the foundation of digital circuit design: "the most important, and also the most famous, master's thesis of the century".
1940: PhD, MIT.
1940-1956: Bell Labs (back to MIT after that).
1948: The birth of Information Theory: "A Mathematical Theory of Communication," Bell System Technical Journal.
ECE 5578 Multimedia Communication, 2020. p.8
Axiomatic Definition of Information
Information is a measure of uncertainty or surprise.
Axiom 1: Information of an event is a function of its probability: i(A) = f(P(A)). What is the expression of f()?
Axiom 2: Rare events have high information content ("Water found on Mars!!!"); common events have low information content ("It's raining in Vancouver."). Information should be a decreasing function of the probability. Still numerous choices of f().
Axiom 3: Information of two independent events is the sum of the individual information: if P(AB) = P(A)P(B), then i(AB) = i(A) + i(B).
Only the logarithmic function satisfies all three axioms.
ECE 5578 Multimedia Communication, 2020. p.9
Self-information
Shannon's Definition [1948]:
X: discrete random variable with alphabet {A1, A2, …, AN}
Probability mass function: p(x) = Pr{X = x}
Self-information of an event X = x:
    i(x) = log_b( 1 / p(x) ) = -log_b p(x)
If b = 2, the unit of information is the bit.
Self-information indicates the number of bits needed to represent an event.
[Figure: -log_b P(x) as a function of P(x), decreasing from large values near P(x) = 0 to 0 at P(x) = 1]
ECE 5578 Multimedia Communication, 2020. p.10
Entropy of a Random Variable
Recall: the mean of a function g(X):  E_p[g(X)] = Σ_x p(x) g(x)
Entropy is the expected self-information of the r.v. X:
    H(X) = Σ_x p(x) log( 1 / p(x) ) = E_p[ log( 1 / p(X) ) ] = -E_p[ log p(X) ]
The entropy represents the minimal number of bits needed to losslessly represent one output of the source.
Also written as H(p): a function of the distribution of X, not of the values of X.
ECE 5578 Multimedia Communication, 2020. p.11
Example
P(X=0) = 1/2, P(X=1) = 1/4, P(X=2) = 1/8, P(X=3) = 1/8. Find the entropy of X.
Solution:
    H(X) = Σ_x p(x) log( 1 / p(x) )
         = 1/2 log2 + 1/4 log4 + 1/8 log8 + 1/8 log8
         = 1/2 + 2/4 + 3/8 + 3/8 = 7/4 bits/sample.
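As a quick numerical check (not part of the original slides), a minimal Python sketch of the two definitions above, reproducing the 7/4 bits/sample result:

import math

def self_info(p):
    # i(x) = -log2 p(x), in bits
    return -math.log2(p)

def entropy(pmf):
    # H(X) = sum_x p(x) log2(1/p(x)); zero-probability symbols contribute nothing
    return sum(-p * math.log2(p) for p in pmf if p > 0)

pmf = [1/2, 1/4, 1/8, 1/8]
print([self_info(p) for p in pmf])   # [1.0, 2.0, 3.0, 3.0] bits
print(entropy(pmf))                  # 1.75 bits/sample = 7/4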
ECE 5578 Multimedia Communication, 2020. p.12
Example
A binary source: only two possible outputs: 0, 1
Source output example: 000101000101110101……
p(X=0) = p, p(X=1) = 1 - p.
Entropy of X: H(p) = p (-log2(p)) + (1-p) (-log2(1-p))
H = 0 when p = 0 or p = 1: fixed output, no information.
H is largest when p = 1/2: highest uncertainty; H = 1 bit in this case.
Properties: H ≥ 0; H is concave (proved later).
[Figure: binary entropy H(p) versus p for 0 ≤ p ≤ 1, peaking at 1 bit at p = 1/2; equal probabilities maximize entropy]
ECE 5578 Multimedia Communication, 2020. p.13
Joint entropy
We can get a better understanding of the source S by looking at a block of output X1 X2 … Xn.
The joint probability of a block of output: p(X1 = i1, X2 = i2, …, Xn = in)
Joint entropy:
    H(X1, X2, …, Xn) = Σ_{i1} Σ_{i2} … Σ_{in} p(X1 = i1, …, Xn = in) log( 1 / p(X1 = i1, …, Xn = in) )
                     = -E[ log p(X1, …, Xn) ]
Joint entropy is the number of bits required to represent the sequence X1 X2 … Xn: this is the lower bound for entropy coding.
ECE 5578 Multimedia Communication, 2020. p.14
Conditional Entropy
Conditional self-information of an event X = x, given that event Y = y has occurred:
    i(x | y) = log( 1 / p(x | y) ) = log( p(y) / p(x, y) )
Conditional entropy H(Y | X): average conditional self-information, i.e., the remaining uncertainty about Y given the knowledge of X:
    H(Y | X) = Σ_x p(x) H(Y | X = x) = -Σ_x p(x) Σ_y p(y | x) log p(y | x)
             = -Σ_x Σ_y p(x, y) log p(y | x) = -E[ log p(Y | X) ]
Note: p(x | y), p(x, y) and p(y) are three different distributions: p1(x | y), p2(x, y) and p3(y).
ECE 5578 Multimedia Communication, 2020. p.15
Chain Rule
H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)
[Venn diagram: circles H(X) and H(Y); regions H(X | Y) and H(Y | X); total area H(X, Y)]
Proof:
    H(X, Y) = -Σ_x Σ_y p(x, y) log p(x, y)
            = -Σ_x Σ_y p(x, y) log( p(x) p(y | x) )
            = -Σ_x Σ_y p(x, y) log p(x) - Σ_x Σ_y p(x, y) log p(y | x)
            = -Σ_x p(x) log p(x) + H(Y | X) = H(X) + H(Y | X).
Simpler notation:
    H(X, Y) = -E[ log p(X, Y) ] = -E[ log p(X) + log p(Y | X) ] = H(X) + H(Y | X).
ECE 5578 Multimedia Communication, 2020. p.16
Conditional Entropy Example: for the following joint distribution p(x, y), find H(X | Y).

          X=1    X=2    X=3    X=4
    Y=1   1/8    1/16   1/32   1/32
    Y=2   1/16   1/8    1/32   1/32
    Y=3   1/16   1/16   1/16   1/16
    Y=4   1/4    0      0      0

P(X): [1/2, 1/4, 1/8, 1/8]  >>  H(X) = 7/4 bits
P(Y): [1/4, 1/4, 1/4, 1/4]  >>  H(Y) = 2 bits
H(X|Y) = Σ_{i=1}^{4} p(Y = i) H(X | Y = i)
       = 1/4 H(1/2, 1/4, 1/8, 1/8) + 1/4 H(1/4, 1/2, 1/8, 1/8) + 1/4 H(1/4, 1/4, 1/4, 1/4) + 1/4 H(1, 0, 0, 0) = 11/8 bits
Indeed, H(X|Y) = H(X, Y) - H(Y) = 27/8 - 2 = 11/8 bits.
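A small Python sketch (an illustration added here, not from the slides) that recomputes these quantities directly from the joint table:

import math

# joint distribution p(x, y); rows are y = 1..4, columns are x = 1..4
P = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0   ],
]

def H(pmf):
    return sum(-p * math.log2(p) for p in pmf if p > 0)

px = [sum(P[y][x] for y in range(4)) for x in range(4)]   # marginal of X (column sums)
py = [sum(row) for row in P]                              # marginal of Y (row sums)
Hxy = H([p for row in P for p in row])                    # joint entropy H(X, Y)

print(H(px), H(py), Hxy)   # 1.75 (= 7/4), 2.0, 3.375 (= 27/8)
print(Hxy - H(py))         # H(X|Y) = 27/8 - 2 = 1.375 = 11/8 bits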
ECE 5578 Multimedia Communication, 2020. p.17
General Chain Rule
General form of the chain rule:
    H(X1, X2, …, Xn) = Σ_{i=1}^{n} H(Xi | X1, …, X_{i-1})
The joint encoding of a sequence can be broken into the sequential encoding of each sample, e.g. H(X1, X2, X3) = H(X1) + H(X2 | X1) + H(X3 | X2, X1).
Advantages: joint encoding needs the joint probability, which is difficult to obtain; sequential encoding only needs conditional entropies and can use local neighbors to approximate the conditional entropy → context-adaptive arithmetic coding.
Adding H(Z): H(X, Y | Z) + H(Z) = H(X, Y, Z) = H(Z) + H(X | Z) + H(Y | X, Z).
ECE 5578 Multimedia Communication, 2020. p.18
General Chain Rule
    p(x1, …, xn) = p(x1) p(x2 | x1) … p(xn | x1, …, x_{n-1})
Proof:
    H(X1, …, Xn) = -Σ_{x1,…,xn} p(x1, …, xn) log p(x1, …, xn)
                 = -Σ_{x1,…,xn} p(x1, …, xn) log Π_{i=1}^{n} p(xi | x1, …, x_{i-1})
                 = -Σ_{i=1}^{n} Σ_{x1,…,xn} p(x1, …, xn) log p(xi | x1, …, x_{i-1})
                 = -Σ_{i=1}^{n} Σ_{x1,…,xi} p(x1, …, xi) log p(xi | x1, …, x_{i-1})
                 = Σ_{i=1}^{n} H(Xi | X1, …, X_{i-1}).
ECE 5578 Multimedia Communication, 2020. p.19
General Chain Rule
The complexity of the conditional probability p(xi | x1, …, x_{i-1}) grows as i increases.
In many cases we can approximate the conditional probability with some nearest neighbors (contexts):
    p(xi | x1, …, x_{i-1}) ≈ p(xi | x_{i-L}, …, x_{i-1})
The low-dimensional conditional probability is more manageable.
How to measure the quality of the approximation? Relative entropy.
(Example context sequences: 0 1 1 0 1 0 1 …, a b c b c a b c b a b c b a)
ECE 5578 Multimedia Communication, 2020. p.20
Relative Entropy – Cost of Coding with the Wrong Distribution
Also known as Kullback-Leibler (K-L) distance, information divergence, information gain.
A measure of the "distance" between two distributions:
    D(p || q) = Σ_x p(x) log( p(x) / q(x) ) = E_p[ log( p(X) / q(X) ) ]
In many applications, the true distribution p(X) is unknown, and we only know an estimated distribution q(X). What is the inefficiency in representing X?
The true entropy:  R1 = -Σ_x p(x) log p(x)
The actual rate:   R2 = -Σ_x p(x) log q(x)
The difference:    R2 - R1 = D(p || q)
ECE 5578 Multimedia Communication, 2020. p.21
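As an illustration (not in the original slides), a short Python sketch that checks R2 - R1 = D(p||q) for a true distribution p and a wrong coding distribution q, assuming q(x) > 0 wherever p(x) > 0:

import math

def D(p, q):
    # relative entropy D(p||q) = sum_x p(x) log2(p(x)/q(x)), in bits
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [1/2, 1/4, 1/8, 1/8]   # true distribution
q = [1/4, 1/4, 1/4, 1/4]   # estimated (wrong) distribution

R1 = sum(-px * math.log2(px) for px in p if px > 0)    # true entropy
R2 = sum(-px * math.log2(qx) for px, qx in zip(p, q))  # actual rate with q
print(R1, R2, R2 - R1, D(p, q))   # 1.75, 2.0, 0.25, 0.25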
Relative Entropy
    D(p || q) = Σ_x p(x) log( p(x) / q(x) ) = E_p[ log( p(X) / q(X) ) ]
Properties:
D(p || q) ≥ 0 (proved later).
D(p || q) = 0 if and only if q = p.
What if p(x) > 0 but q(x) = 0 for some x? Then D(p || q) = ∞.
Caution: D(p || q) is not a true distance:
Not symmetric in general: D(p || q) ≠ D(q || p).
Does not satisfy the triangle inequality.
ECE 5578 Multimedia Communication, 2020. p.22
Relative Entropy
How to make it symmetric? Many possibilities, for example:
    (1/2) ( D(p || q) + D(q || p) )
Other symmetric combinations of D(p || q) and D(q || p) can also be used; these can be useful for pattern classification.
ECE 5578 Multimedia Communication, 2020. p.23
Mutual Information
Mutual information between two events:
    i(x; y) = i(x) - i(x | y) = log( p(x | y) / p(x) ) = log( p(x, y) / ( p(x) p(y) ) )
where i(x | y) = -log p(x | y) is the conditional self-information.
A measure of the amount of information that one event contains about another, or the reduction in the uncertainty of one event due to the knowledge of the other.
Note: i(x; y) can be negative, if p(x | y) < p(x).
ECE 5578 Multimedia Communication, 2020. p.24
Mutual Information
I(X; Y): mutual information between two random variables:
    I(X; Y) = Σ_x Σ_y p(x, y) i(x; y) = Σ_x Σ_y p(x, y) log( p(x, y) / ( p(x) p(y) ) )
            = D( p(x, y) || p(x) p(y) ) = E[ log( p(X, Y) / ( p(X) p(Y) ) ) ]
Mutual information is a relative entropy, and it is symmetric: I(X; Y) = I(Y; X).
If X, Y are independent: p(x, y) = p(x) p(y), so I(X; Y) = 0: knowing X does not reduce the uncertainty of Y.
Different from i(x; y), I(X; Y) ≥ 0 (due to averaging).
ECE 5578 Multimedia Communication, 2020. p.25
Entropy and Mutual Information
1. I(X; Y) = H(X) - H(X | Y)
Proof:
    I(X; Y) = Σ_x Σ_y p(x, y) log( p(x, y) / ( p(x) p(y) ) ) = Σ_x Σ_y p(x, y) log( p(x | y) / p(x) )
            = -Σ_x Σ_y p(x, y) log p(x) + Σ_x Σ_y p(x, y) log p(x | y)
            = H(X) - H(X | Y)
2. Similarly: I(X; Y) = H(Y) - H(Y | X)
3. I(X; Y) = H(X) + H(Y) - H(X, Y)
Proof: expand the definition:
    I(X; Y) = Σ_x Σ_y p(x, y) ( log p(x, y) - log p(x) - log p(y) ) = H(X) + H(Y) - H(X, Y).
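A quick numerical check of identities 1-3 (an added sketch, not from the slides), reusing the joint distribution from the earlier conditional-entropy example:

import math

P = [
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0   ],
]

def H(pmf):
    return sum(-p * math.log2(p) for p in pmf if p > 0)

px = [sum(P[y][x] for y in range(4)) for x in range(4)]
py = [sum(row) for row in P]
Hxy = H([p for row in P for p in row])

# definition: I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
I = sum(P[y][x] * math.log2(P[y][x] / (px[x] * py[y]))
        for y in range(4) for x in range(4) if P[y][x] > 0)

print(I)                          # 0.375 bits
print(H(px) - (Hxy - H(py)))      # identity 1: H(X) - H(X|Y)
print(H(py) - (Hxy - H(px)))      # identity 2: H(Y) - H(Y|X)
print(H(px) + H(py) - Hxy)        # identity 3: H(X) + H(Y) - H(X,Y)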
ECE 5578 Multimedia Communication, 2020. p.26
Entropy and Mutual Information
[Venn diagram: circles H(X) and H(Y); intersection I(X; Y); regions H(X | Y) and H(Y | X); total area H(X, Y)]
It can be seen from this figure that I(X; X) = H(X).
Proof: let X = Y in I(X; Y) = H(X) + H(Y) - H(X, Y), or in I(X; Y) = H(X) - H(X | Y) (and use H(X | X) = 0).
ECE 5578 Multimedia Communication, 2020. p.27
The penalty of coding, say, pixels X and Y separately is the relative entropy between their joint distribution and the assumed independent distribution.
Application of Mutual Information
Mutual information can be used in the optimization of context quantization.
Example: if each neighbor has 26 possible values (a to z), then 5 neighbors have 26^5 combinations: too many conditional probabilities to estimate.
To reduce the number, we can group similar data patterns together → context quantization:
    p(xi | x1, …, x_{i-1}) ≈ p( xi | f(x1, …, x_{i-1}) )
(Example sequence: a b c b c a b c b a b c b a)
ECE 5578 Multimedia Communication, 2020. p.28
Application of Mutual Information in Compression
    p(xi | x1, …, x_{i-1}) ≈ p( xi | f(x1, …, x_{i-1}) ),   H(X1, X2, …, Xn) = Σ_{i=1}^{n} H(Xi | X1, …, X_{i-1})
We need to design the function f() to minimize the conditional entropy H( Xi | f(X1, …, X_{i-1}) ).
But H(X | Y) = H(X) - I(X; Y), so the problem is equivalent to maximizing the mutual information between Xi and f(X1, …, X_{i-1}).
For further info: Liu and Karam, "Mutual Information-Based Analysis of JPEG2000 Contexts," IEEE Trans. Image Processing, vol. 14, no. 4, April 2005, pp. 411-422.
ECE 5578 Multimedia Communication, 2020. p.29
Outline
Lecture 01 ReCap
Info Theory and Entropy
Entropy Coding: Prefix Coding; Kraft-McMillan Inequality; Shannon Codes
ECE 5578 Multimedia Communication, 2020. p.30
Variable Length Coding
Design the mapping from source symbols to codewords:
Lossless mapping.
Different codewords may have different lengths.
Goal: minimizing the average codeword length; the entropy is the lower bound.
ECE 5578 Multimedia Communication, 2020. p.31
Classes of Codes
Non-singular code: different inputs are mapped to different codewords (invertible).
Uniquely decodable code: any encoded string has only one possible source string, but may need delay to decode.
Prefix-free code (or simply prefix, or instantaneous): no codeword is a prefix of any other codeword. The focus of our studies.
Questions: What characterizes it? How to design it? Is it optimal?
[Figure: nested classes: all codes ⊃ non-singular codes ⊃ uniquely decodable codes ⊃ prefix-free codes]
ECE 5578 Multimedia Communication, 2020. p.32
Prefix Code
Examples (codes for X = 1, 2, 3, 4):
    Singular: 0, 0, 0, 0
    Non-singular, but not uniquely decodable: 0, 010, 01, 10  (needs punctuation: ……01011…)
    Uniquely decodable, but not prefix-free: 0, 01, 011, 0111  (need to look at the next bit to decode the previous code)
    Prefix-free: 0, 10, 110, 111
ECE 5578 Multimedia Communication, 2020. p.33
Carter-Gill's Conjecture [1974]
Every uniquely decodable code can be replaced by a prefix-free code with the same set of codeword compositions.
So we only need to study prefix-free codes.
ECE 5578 Multimedia Communication, 2020. p.34
Prefix-free Code
Can be uniquely decoded. No codeword is a prefix of another one. Also called a prefix code.
Goal: construct a prefix code with minimal expected length.
Can put all codewords in a binary tree:
[Binary code tree: root node, internal nodes, leaf nodes; example codewords 0, 10, 110, 111]
A prefix-free code contains leaves only. How to express the requirement mathematically?
ECE 5578 Multimedia Communication, 2020. p.35
Kraft-McMillan Inequality
The characteristic of prefix-free codes: the codeword lengths li, i = 1, …, N of a prefix code over an alphabet of size D (= 2) satisfy the inequality
    Σ_{i=1}^{N} D^{-li} ≤ 1.
Conversely, if a set of {li} satisfies the inequality above, then there exists a prefix code with codeword lengths li, i = 1, …, N.
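A one-line Kraft-sum check in Python (an added sketch, not from the slides), used for the valid and invalid examples on the next slides:

def kraft_sum(lengths, D=2):
    # sum_i D^(-l_i); a prefix code with these lengths exists iff the sum is <= 1
    return sum(D ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))       # {0, 10, 110, 111}: 1.0, valid
print(kraft_sum([1, 2, 2, 3, 3]))    # {0, 10, 11, 110, 111}: 1.25 > 1, not a prefix code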
ECE 5578 Multimedia Communication, 2020. p.36
Kraft-McMillan Inequality
Consider D = 2: expand the binary code tree to full depth L = max(li), where li is the code length.
    Σ_{i=1}^{N} 2^{-li} ≤ 1  ⟺  Σ_{i=1}^{N} 2^{L - li} ≤ 2^L
Number of nodes in the last level: 2^L. Each code corresponds to a sub-tree, whose number of offsprings in the last level is 2^{L - li}.
K-M inequality: the number of L-th level offsprings of all codes is at most 2^L.
Example: {0, 10, 110, 111} with L = 3: the offspring counts are 2^2, 2^1, 2^0, 2^0 = {4, 2, 1, 1}, and 4 + 2 + 1 + 1 = 8 = 2^3.
ECE 5578 Multimedia Communication, 2020. p.37
Kraft-McMillan Inequality
Invalid code: {0, 10, 11, 110, 111}, li = [1 2 2 3 3].
[Code tree with codewords 0, 10, 11, 110, 111: the codeword 11 is also an internal node]
This leads to more than 2^L offsprings at level L = 3: 4 + 2 + 2 + 1 + 1 = 10 > 2^3 = 8, so the K-M inequality
    Σ_i 2^{-li} ≤ 1
is violated (here Σ_i 2^{-li} = 1.25 > 1).
ECE 5578 Multimedia Communication, 2020. p.38
Extended Kraft Inequality
A countably infinite prefix code (infinite number of codewords) also satisfies the Kraft inequality:
    Σ_{i=1}^{∞} 2^{-li} ≤ 1
Example: 0, 10, 110, 1110, 11110, 11……10, … (Golomb-Rice code, next lecture)
Each codeword can be mapped to a subinterval of [0, 1) that is disjoint from the others (revisited in arithmetic coding):
[Figure: 0 → [0, 0.5), 10 → [0.5, 0.75), 110 → [0.75, 0.875), …]
ECE 5578 Multimedia Communication, 2020. p.39
Shannon Code: Bounds on Optimal Code
The ideal code length l*_i = -log_D p_i is not an integer in general; practical codewords have to have integer lengths.
Shannon Code:
    li = ⌈ log_D( 1 / p_i ) ⌉
Is this a valid prefix code? Check the Kraft inequality:
    Σ_i D^{-⌈ log_D(1/p_i) ⌉} ≤ Σ_i D^{-log_D(1/p_i)} = Σ_i p_i = 1.   Yes!
Since log_D(1/p_i) ≤ li < log_D(1/p_i) + 1, the expected length satisfies
    H_D(X) ≤ L < H_D(X) + 1.
This is just one choice. It may not be optimal (see example later).
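A minimal Python sketch (added here for illustration) that computes binary Shannon code lengths and checks the Kraft sum and the bound H(X) ≤ L < H(X) + 1:

import math

def shannon_lengths(pmf):
    # binary Shannon code lengths: l_i = ceil(log2(1/p_i))
    return [math.ceil(math.log2(1.0 / p)) for p in pmf]

pmf = [1/2, 1/4, 1/8, 1/8]
L = shannon_lengths(pmf)
avg = sum(p * l for p, l in zip(pmf, L))
H = sum(-p * math.log2(p) for p in pmf)
print(L)                              # [1, 2, 3, 3]
print(sum(2 ** (-l) for l in L))      # Kraft sum 1.0 <= 1, so a prefix code exists
print(H, avg)                         # dyadic probabilities: average length equals H(X)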
ECE 5578 Multimedia Communication, 2020. p.40
Optimal Code
The optimal code with integer lengths should be no worse than the Shannon code:
    H_D(X) ≤ L* ≤ H_D(X) + 1
To reduce the overhead per symbol: encode a block of symbols {x1, x2, …, xn} together:
    L_n = (1/n) E{ l(x1, x2, …, xn) } = (1/n) Σ p(x1, x2, …, xn) l(x1, x2, …, xn)
    H(X1, X2, …, Xn) ≤ E{ l(X1, X2, …, Xn) } ≤ H(X1, X2, …, Xn) + 1
Assume i.i.d. samples: H(X1, X2, …, Xn) = n H(X), so
    H(X) ≤ L_n ≤ H(X) + 1/n,   and L_n → H(X) (the entropy rate) if the source is stationary.
ECE 5578 Multimedia Communication, 2020. p.41
Optimal Code
Impact of the wrong pdf: what is the penalty if the pdf we use is different from the true pdf?
True pdf: p(x); estimated pdf: q(x); codeword length: l(x); expected length: E_p l(X).
    H(p) + D(p || q) ≤ E_p l(X) < H(p) + D(p || q) + 1
Proof: assume a Shannon code, l(x) = ⌈ log( 1 / q(x) ) ⌉. Then
    E_p l(X) = Σ_x p(x) ⌈ log( 1 / q(x) ) ⌉ < Σ_x p(x) ( log( 1 / q(x) ) + 1 )
             = Σ_x p(x) ( log( p(x) / q(x) ) + log( 1 / p(x) ) ) + 1
             = H(X) + D(p || q) + 1.
The lower bound is derived similarly.
ECE 5578 Multimedia Communication, 2020. p.42
Shannon Code is not optimal
Example: binary r.v. X: p(0) = 0.9999, p(1) = 0.0001.
Entropy: 0.0015 bits/sample.
Assign binary codewords by the Shannon code l(x) = ⌈ log2( 1 / p(x) ) ⌉:
    ⌈ log2( 1 / 0.9999 ) ⌉ = 1,   ⌈ log2( 1 / 0.0001 ) ⌉ = 14.
Expected length: 0.9999 × 1 + 0.0001 × 14 = 1.0013, within the range [H(X), H(X) + 1].
But we can easily beat this with the code {0, 1}: expected length 1.
ECE 5578 Multimedia Communication, 2020. p.43
Summary & Break
Info Theory Summary:
Entropy: H(X)
Conditional Entropy: H(X|Y)
Mutual Info: I(X; Y)
Relative Entropy: D(p||q)
Practical Codes:
Huffman Coding
Golomb Coding and JPEG 2000 Lossless Coding
ECE 5578 Multimedia Communication, 2020. p.44
Practical Entropy Codes
ECE 5578 Multimedia Communication, 2020. p.45
Entropy
Self Info of an event:
    i(x) = -log Pr{X = x} = -log p(x)
Entropy of a source:
    H(X) = Σ_x p(x) log( 1 / p(x) )
Conditional Entropy, Mutual Information:
    H(X | Y) = H(X, Y) - H(Y),   I(X; Y) = H(X) + H(Y) - H(X, Y)
ECE 5578 Multimedia Communication, 2020. p.46
[Venn diagram: H(X), H(Y), I(X; Y), H(X | Y), H(Y | X); total area H(X, Y)]
Main application: Context Modeling (e.g., sequences a b c b c a b c b a b c b a)
Relative Entropy:
    D(p || q) = Σ_x p(x) log( p(x) / q(x) )
Lossless Coding
Prefix Coding: codes on leaves; no code is a prefix of other codes; simple encoding/decoding.
Kraft-McMillan Inequality: for a coding scheme with code lengths l1, l2, …, ln:
    Σ_i 2^{-li} ≤ 1
Given a set of integer lengths {l1, l2, …, ln} that satisfies the above inequality, we can always find a prefix code with code lengths l1, l2, …, ln.
ECE 5578 Multimedia Communication, 2020. p.47
[Binary code tree: root node, internal nodes, leaf nodes; codewords 0, 10, 110, 111]
Context Reduces Entropy Example
[Figure: causal pixel neighborhood x1 x2 x3 / x4 x5 from lenna.png]
Condition reduces entropy: H(x5) > H(x5 | x4, x3, x2, x1)
Context function: f(x4, x3, x2, x1) = sum(x4, x3, x2, x1), and H(x5) > H(x5 | f(x4, x3, x2, x1))
Context examples: H(x5 | f(x4, x3, x2, x1) == 100), H(x5 | f(x4, x3, x2, x1) < 100)
(See getEntropy.m, lossless_coding.m)
ECE 5578 Multimedia Communication, 2020. p.48
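The .m files above are the course's MATLAB demos; the following Python sketch (an assumption-laden illustration, using numpy and a random stand-in array in place of lenna.png) estimates the same quantities empirically: the entropy of a pixel and its conditional entropy given a sum-of-neighbors context.

import numpy as np

def entropy_bits(samples):
    # empirical entropy (bits) of a 1-D array of discrete samples
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# img: 8-bit grayscale image as a 2-D integer array; a real test would load lenna.png here
img = np.random.randint(0, 256, (256, 256))

cur = img[1:, 1:-1].ravel()                     # x5: current pixel
ctx = (img[:-1, :-2] + img[:-1, 1:-1] +         # x1, x2, x3: row above
       img[:-1, 2:] + img[1:, :-2]).ravel()     # x4: left neighbor
bucket = ctx // 64                              # quantize the context sum into a few bins

H_cur = entropy_bits(cur)
H_cond = sum((bucket == c).mean() * entropy_bits(cur[bucket == c])
             for c in np.unique(bucket))
print(H_cur, H_cond)   # on real image data H_cond should be noticeably smaller than H_cur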
Huffman Code (1952)
Design requirement: the source probability distribution (but not available in many cases).
Procedure:
1. Sort the probabilities of all source symbols in descending order.
2. Merge the last two into a new symbol; add their probabilities.
3. Repeat Steps 1, 2 until only one symbol (the root) is left.
4. Code assignment: traverse the tree from the root to each leaf node, assigning 0 to the top branch and 1 to the bottom branch.
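A compact Python sketch of this procedure (one possible implementation, not the slides' code) using a heap; the exact codewords depend on how ties are broken, but the average length is the same:

import heapq
from itertools import count

def huffman_code(probs):
    # probs: dict symbol -> probability; returns dict symbol -> binary codeword string
    tick = count()                        # tie-breaker so the heap never compares dicts
    heap = [(p, next(tick), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # the two least probable groups
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}   # prepend a bit at each merge
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, next(tick), merged))
    return heap[0][2]

probs = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
code = huffman_code(probs)
print(code)
print(sum(probs[s] * len(w) for s, w in code.items()))   # 2.2 bits/symbol, as on p.52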
ECE 5578 Multimedia Communication, 2020. p.49
Example
Source alphabet A = {a1, a2, a3, a4, a5}
Probability distribution: {0.2, 0.4, 0.2, 0.1, 0.1}
[Figure: repeated sort/merge steps: {0.4, 0.2, 0.2, 0.1, 0.1} → {0.4, 0.2, 0.2, 0.2} → {0.4, 0.4, 0.2} → {0.6, 0.4} → {1}; assigning 0/1 at each merge gives the codes a1 → 01, a2 → 1, a3 → 000, a4 → 0010, a5 → 0011]
ECE 5578 Multimedia Communication, 2020. p.50
Huffman code is prefix-free
[Code tree: a2 → 1, a1 → 01, a3 → 000, a4 → 0010, a5 → 0011]
All codewords are leaf nodes; no code is a prefix of any other code (prefix-free).
ECE 5578 Multimedia Communication, 2020. p.51
Average Codeword Length vs Entropy
Source alphabet A = {a, b, c, d, e} Probability distribution: {0.2, 0.4, 0.2, 0.1, 0.1} Code: {01, 1, 000, 0010, 0011}
Entropy: H(S) = -(0.2*log2(0.2)*2 + 0.4*log2(0.4) + 0.1*log2(0.1)*2) = 2.122 bits/symbol
Average Huffman codeword length: L = 0.2*2 + 0.4*1 + 0.2*3 + 0.1*4 + 0.1*4 = 2.2 bits/symbol
This verifies H(S) ≤ L < H(S) + 1.
ECE 5578 Multimedia Communication, 2020. p.52
Huffman Code is not unique
Multiple ordering choices for tied probabilities.
Two choices for each split: 0, 1 or 1, 0.
[Figure: alternative Huffman trees built from the same merged probabilities {0.4, 0.2, 0.4} → {0.6, 0.4} → {1}, giving e.g. the codes {1, 00, 01} or {0, 10, 11}, with the symbols a, b, c assigned in different orders]
ECE 5578 Multimedia Communication, 2020. p.53
Canonical Huffman Code
The Huffman algorithm is needed only to compute the optimal codeword lengths; the optimal codewords for a given data set are not unique.
Canonical Huffman code is well structured: given the codeword lengths, one can find the canonical Huffman code.
Also known as slice code, alphabetic code.
ECE 5578 Multimedia Communication, 2020. p.54
Canonical Huffman Code - Pruning the Binary Tree
Example: codeword lengths from the probabilities (e.g., l(x) = ⌈ log( 1 / p(x) ) ⌉): 2, 2, 3, 3, 3, 4, 4.
Verify that they satisfy the Kraft-McMillan inequality Σ_{i=1}^{N} 2^{-li} ≤ 1.
A non-canonical example: {01, 11, 000, 001, 100, 1010, 1011}
The canonical tree. Rules:
Assign 0 to the left branch and 1 to the right branch.
Build the tree from left to right in increasing order of depth.
Each leaf is placed at the first available position.
Canonical codewords: {00, 01, 100, 101, 110, 1110, 1111}
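A small Python sketch (an illustration, not from the slides) that assigns canonical codewords from a list of lengths, following the three rules above; it reproduces the canonical code for lengths 2, 2, 3, 3, 3, 4, 4:

def canonical_code(lengths):
    # assign codewords left to right, shortest first, each leaf at the first free spot
    lengths = sorted(lengths)
    codes, code, prev = [], 0, lengths[0]
    for l in lengths:
        code <<= (l - prev)               # descend to the next level when length grows
        codes.append(format(code, "0{}b".format(l)))
        code += 1
        prev = l
    return codes

print(canonical_code([2, 2, 3, 3, 3, 4, 4]))
# ['00', '01', '100', '101', '110', '1110', '1111']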
ECE 5578 Multimedia Communication, 2020. p.55
Advantages of Canonical Huffman
1. Reducing memory requirement
A non-canonical tree needs all codewords and the lengths of all codewords: a lot of space for a large table.
[Figure: non-canonical tree {01, 11, 000, 001, 100, 1010, 1011} vs. canonical tree {00, 01, 100, 101, 110, 1110, 1111}]
The canonical tree only needs:
Min: shortest codeword length
Max: longest codeword length
Distribution: number of codewords in each level
Here: Min = 2, Max = 4, # in each level: 2, 3, 2.
ECE 5578 Multimedia Communication, 2020. p.56
Golomb Code
Lecture 02 ReCap: Huffman Coding
Golomb Coding
ECE 5578 Multimedia Communication, 2020. p.57
Unary Code (Comma Code)
Encode a nonnegative integer n by n 1's and a 0 (or n 0's and a 1).
    n : codeword
    0 : 0
    1 : 10
    2 : 110
    3 : 1110
    4 : 11110
    5 : 111110
    … : …
Is this code prefix-free? Yes: every codeword is a leaf of the code tree.
[Code tree: codewords 0, 10, 110, 1110, …]
When is this code optimal? When the probabilities are 1/2, 1/4, 1/8, 1/16, 1/32, …: the Huffman code for this dyadic distribution becomes the unary code.
No need to store a codeword table; very simple.
ECE 5578 Multimedia Communication, 2020. p.58
Implementation – Very Efficient
Encoding:
UnaryEncode(n) {
    while (n > 0) {
        WriteBit(1);
        n--;
    }
    WriteBit(0);
}
Decoding:
UnaryDecode() {
    n = 0;
    while (ReadBit(1) == 1) {
        n++;
    }
    return n;
}
ECE 5578 Multimedia Communication, 2020. p.59
Golomb Code [Golomb, 1966]
A multi-resolution approach: divide all numbers into groups of equal size m.
Denoted Golomb(m) or Golomb-m.
Groups with smaller symbol values have shorter codes.
Symbols in the same group have codewords of similar lengths.
The codeword length grows much more slowly than in the unary code.
[Figure: number line from 0 to max divided into groups of size m]
Codeword: (unary, fixed-length) = (group ID coded in unary, index within the group coded in fixed length).
ECE 5578 Multimedia Communication, 2020. p.60
Golomb Code
    n = qm + r,   q = ⌊n/m⌋,   r = n - qm
q: quotient, coded with the unary code:
    q : codeword
    0 : 0
    1 : 10
    2 : 110
    3 : 1110
    4 : 11110
    5 : 111110
    6 : 1111110
    … : …
r: remainder, "fixed-length" code:
k bits if m = 2^k; e.g., m = 8: 000, 001, ……, 111
If m ≠ 2^k (not desired): ⌊log2 m⌋ bits for smaller r, ⌈log2 m⌉ bits for larger r; e.g., m = 5: 00, 01, 10, 110, 111
ECE 5578 Multimedia Communication, 2020. p.61
Golomb Code with m = 5 (Golomb-5)
    n  q  r  code        n  q  r  code        n   q  r  code
    0  0  0  000         5  1  0  1000        10  2  0  11000
    1  0  1  001         6  1  1  1001        11  2  1  11001
    2  0  2  010         7  1  2  1010        12  2  2  11010
    3  0  3  0110        8  1  3  10110       13  2  3  110110
    4  0  4  0111        9  1  4  10111       14  2  4  110111
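A Python sketch of Golomb(m) encoding for a general m (an added illustration; it uses the truncated binary remainder described above: ⌊log2 m⌋ bits for small r, ⌈log2 m⌉ bits for large r), reproducing the Golomb-5 table:

import math

def golomb_encode(n, m):
    # unary-coded quotient followed by the (truncated binary) remainder; assumes m >= 2
    q, r = divmod(n, m)
    bits = "1" * q + "0"                  # unary code for q
    b = math.ceil(math.log2(m))
    cutoff = (1 << b) - m                 # how many remainders get only b-1 bits
    if r < cutoff:
        bits += format(r, "0{}b".format(b - 1)) if b > 1 else ""
    else:
        bits += format(r + cutoff, "0{}b".format(b))
    return bits

print([golomb_encode(n, 5) for n in range(10)])
# ['000', '001', '010', '0110', '0111', '1000', '1001', '1010', '10110', '10111']
print([golomb_encode(n, 8) for n in range(4)])   # Rice case m = 2^k: k-bit remainder
# ['0000', '0001', '0010', '0011']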
ECE 5578 Multimedia Communication, 2020. p.62
Golomb vs Canonical Huffman
Golomb code is a canonical Huffman code, with more properties.
Codewords: 000, 001, 010, 0110, 0111, 1000, 1001, 1010, 10110, 10111
Canonical form: from left to right, from short to long, take the first valid spot.
ECE 5578 Multimedia Communication, 2020. p.63
Golomb-Rice Code
A special Golomb code with m = 2^k: the remainder r is simply the k LSBs of n.
m = 8:
    n  q  r  code        n   q  r  code
    0  0  0  0000        8   1  0  10000
    1  0  1  0001        9   1  1  10001
    2  0  2  0010        10  1  2  10010
    3  0  3  0011        11  1  3  10011
    4  0  4  0100        12  1  4  10100
    5  0  5  0101        13  1  5  10101
    6  0  6  0110        14  1  6  10110
    7  0  7  0111        15  1  7  10111
ECE 5578 Multimedia Communication, 2020. p.64
Implementation
Remainder bits: RBits = 3 for m = 8.
Encoding:
GolombEncode(n, RBits) {
    q = n >> RBits;
    UnaryCode(q);
    WriteBits(n, RBits);   // output the lower (RBits) bits of n
}
Decoding:
GolombDecode(RBits) {
    q = UnaryDecode();
    n = (q << RBits) + ReadBits(RBits);
    return n;
}
ECE 5578 Multimedia Communication, 2020. p.65
Exponential Golomb Code (Exp-Golomb)
Golomb code divides the alphabet into groups of equal size m; in the Exp-Golomb code the group size increases exponentially (1, 2, 4, 8, 16, …).
Codes still contain two parts: unary code followed by a fixed-length code. Proposed by Teuhola in 1978.
[Figure: number line divided into groups of sizes 1, 2, 4, 8, 16, …]
    n  code          n   code
    0  0             8   1110001
    1  100           9   1110010
    2  101           10  1110011
    3  11000         11  1110100
    4  11001         12  1110101
    5  11010         13  1110110
    6  11011         14  1110111
    7  1110000       15  111100000
ECE 5578 Multimedia Communication, 2020. p.66
Implementation
Decoding:
ExpGolombDecode() {
    GroupID = UnaryDecode();
    if (GroupID == 0) {
        return 0;
    } else {
        Base = (1 << GroupID) - 1;
        Index = ReadBits(GroupID);
        return (Base + Index);
    }
}
GroupID per symbol: 0 for n = 0; 1 for n = 1-2 (codes 100, 101); 2 for n = 3-6 (11000…11011); 3 for n = 7-14 (1110000…1110111).
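For completeness, an encoder counterpart to the decoder above (a sketch added here, not from the slides), using the same GroupID/Base/Index convention:

def exp_golomb_encode(n):
    # group g holds the 2**g values n in [2**g - 1, 2**(g+1) - 2]; Base = 2**g - 1
    g = (n + 1).bit_length() - 1          # GroupID = floor(log2(n + 1))
    index = n - ((1 << g) - 1)            # offset within the group
    bits = "1" * g + "0"                  # unary GroupID, matching UnaryDecode()
    if g > 0:
        bits += format(index, "0{}b".format(g))
    return bits

print([exp_golomb_encode(n) for n in range(8)])
# ['0', '100', '101', '11000', '11001', '11010', '11011', '1110000']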
ECE 5578 Multimedia Communication, 2020. p.67
Outline
Golomb Code Family: Unary Code; Golomb Code; Golomb-Rice Code; Exponential Golomb Code
Why Golomb code?
ECE 5578 Multimedia Communication, 2020. p.68
Geometric Distribution (GD)
Geometric distribution with parameter ρ: P(x) = ρ^x (1 - ρ), x ≥ 0, integer.
The probability of the number of failures before the first success in a series of independent Yes/No experiments (Bernoulli trials).
Unary code is the optimal prefix code for a geometric distribution with ρ ≤ 1/2:
ρ = 1/4: P(x): 0.75, 0.19, 0.05, 0.01, 0.003, …  Huffman coding never needs to re-order → equivalent to the unary code. The unary code is the optimal prefix code, but not efficient (avg length >> entropy).
ρ = 3/4: P(x): 0.25, 0.19, 0.14, 0.11, 0.08, …  Reordering is needed for the Huffman code; the unary code is not the optimal prefix code.
ρ = 1/2: expected length = entropy. The unary code is not only the optimal prefix code, but also optimal among all entropy coding (including arithmetic coding).
ECE 5578 Multimedia Communication, 2020. p.69
Geometric Distribution (GD)
Geometric distribution is very useful for image/video compression.
Example 1: run-length coding
Binary sequence with i.i.d. distribution and P(0) = ρ ≈ 1, e.g.: 0000010000000010000110000001
Entropy << 1, so a per-bit prefix code has poor performance.
Run-length coding is efficient to compress the data:
r: number of consecutive 0's between two 1's
Run-length representation of the sequence: 5, 8, 4, 0, 6
Probability distribution of the run-length r:
P(r = n) = ρ^n (1 - ρ): n 0's followed by a 1.
The run has a one-sided geometric distribution with parameter ρ.
[Figure: P(r) versus r, decaying geometrically]
ECE 5578 Multimedia Communication, 2020. p.70
Geometric Distribution
GD is the discrete analogy of the exponential distribution:
    f(x) = λ e^{-λx}, x ≥ 0
The two-sided geometric distribution is the discrete analogy of the Laplacian distribution (also called the double exponential distribution):
    f(x) = (λ/2) e^{-λ|x|}
[Figure: one-sided exponential and two-sided Laplacian density curves]
ECE 5578 Multimedia Communication, 2020. p.71
Why Golomb Code?
Significance of Golomb code: for any geometric distribution (GD), the Golomb code is the optimal prefix code and is as close to the entropy as possible (among all prefix codes).
How to determine the Golomb parameter? How to apply it in a practical codec?
ECE 5578 Multimedia Communication, 2020. p.72
Geometric Distribution
Example 2: GD is also a good model for the prediction error
    e(n) = x(n) - pred(x(1), …, x(n-1)).
Most e(n)'s have small values around 0 and can be modeled by a two-sided geometric distribution:
    p(n) = ( (1 - ρ)/(1 + ρ) ) ρ^{|n|},   0 < ρ < 1.
[Figure: p(n) versus n, peaked at 0; causal prediction neighborhood x1 x2 x3 / x4 x5 with example prediction weights 0.2, 0.3, 0.2, 0.2]
ECE 5578 Multimedia Communication, 2020. p.73
Optimal Code for Geometric Distribution
Geometric distribution with parameter ρ: P(X = n) = ρ^n (1 - ρ).
Unary code is the optimal prefix code when ρ ≤ 1/2, and also optimal among all entropy coding for ρ = 1/2.
How to design the optimal code when ρ > 1/2?
Transform into a GD with parameter ≤ 1/2 (as close as possible). How? By grouping m events together!
Each x can be written as x = m·xq + xr. Then
    P_{Xq}(q) = Σ_{r=0}^{m-1} P_X(qm + r) = Σ_{r=0}^{m-1} ρ^{qm+r} (1 - ρ) = ρ^{qm} (1 - ρ) (1 - ρ^m)/(1 - ρ) = (ρ^m)^q (1 - ρ^m),
so xq has a geometric distribution with parameter ρ^m.
Unary code is optimal for xq if ρ^m ≤ 1/2, i.e.
    m ≥ -1/log2 ρ,   and m = ⌈ -1/log2 ρ ⌉ is the minimal possible integer.
[Figure: P(x) versus x for ρ ≤ 1/2 and for ρ > 1/2]
ECE 5578 Multimedia Communication, 2020. p.74
Golomb Parameter Estimation (J2K book: pp. 55)
    P(x) = (1 - ρ) ρ^x
Goal of adaptive Golomb code: for the given data, find the best m such that ρ^m ≤ 1/2.
How to find ρ from the statistics of past data?
    E(x) = Σ_{x=0}^{∞} x (1 - ρ) ρ^x = ρ (1 - ρ)/(1 - ρ)^2 = ρ/(1 - ρ),   so   ρ = E(x)/(1 + E(x)).
Then
    ρ^m = ( E(x)/(1 + E(x)) )^m ≤ 1/2   ⟺   m ≥ 1 / log2( (1 + E(x))/E(x) ).
Let m = 2^k:
    k ≥ log2( 1 / log2( (1 + E(x))/E(x) ) ),  which is too costly to compute.
ECE 5578 Multimedia Communication, 2020. p.75
Golomb Parameter Estimation (J2K book: pp. 55)
A faster method: assume ρ ≈ 1, so 1 - ρ ≈ 0 and, from E(x) = ρ/(1 - ρ), 1 - ρ ≈ 1/E(x). Then
    1 - ρ^m = 1 - (1 - (1 - ρ))^m ≈ m (1 - ρ) ≈ m / E(x),
so ρ^m ≤ 1/2 requires
    m = 2^k ≥ (1/2) E(x),   i.e.   k = max{ 0, ⌈ log2( E(x)/2 ) ⌉ }.
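A tiny Python sketch of this rule (an added illustration) that estimates the Rice parameter k from the sample mean and tries it on synthetic geometric data with ρ = 0.9 (so E(x) = ρ/(1-ρ) = 9, giving k = 3, m = 8):

import math, random

def rice_parameter(samples):
    # k = max(0, ceil(log2(E(x)/2))), i.e. the smallest k with m = 2**k >= E(x)/2
    mean = sum(samples) / len(samples)
    if mean <= 1.0:
        return 0                          # tiny means: plain unary (k = 0) is enough
    return max(0, math.ceil(math.log2(mean / 2)))

rho = 0.9
data = [int(math.log(1.0 - random.random()) / math.log(rho)) for _ in range(10000)]
print(sum(data) / len(data), rice_parameter(data))   # sample mean close to 9, k = 3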
ECE 5578 Multimedia Communication, 2020. p.76
Summary
Huffman Coding
A prefix code that is optimal in (average) code length.
Canonical form to reduce variation of the code length.
Widely used.
Golomb Coding
Suitable for coding prediction errors in images.
Optimal prefix code for geometric distributions (e.g., ρ = 0.5).
Simple to encode and decode.
Many practical applications, e.g., JPEG-2000 lossless.
ECE 5578 Multimedia Communication, 2020. p.77