Data Compression Lempel-Ziv Coding
-
7/28/2019 Data Compression Lempel-Ziv Coding
Lempel-Ziv Coding
Prof. Ja-Ling Wu
Department of Computer Science and Information Engineering
National Taiwan University
-
J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Information Theory, 1978.
Algorithm :
The source sequence is sequentially parsed into strings that have not appeared so far.
For example, 1011010100010 will be parsed as 1, 0, 11, 01, 010, 00, 10.
After every comma, we look along the input sequence until we come to the shortest string that has not been marked off before. Since this is the shortest such string, all its prefixes must have occurred earlier. In particular, the string consisting of all but the last bit of this string must have occurred earlier.
-
We code this phrase by giving the location of the prefix and the value of the last bit.
Let c(n) be the number of phrases in the parsing of the input n-sequence. We need log c(n) bits to describe the location of the prefix to the phrase and 1 bit to describe the last bit.
In the above example, the code sequence is :
(000,1), (000,0), (001,1), (010,1), (100,0), (010,0), (001,0)
where the first number of each pair gives the index of the prefix (000 denotes the empty prefix) and the 2nd number gives the last bit of the phrase.
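The parsing and coding steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `lz78_parse` and `format_pairs` are hypothetical, and the pointer width is assumed to be ceil(log2 c) bits for c phrases.

```python
def lz78_parse(bits):
    """Parse `bits` into phrases that have not appeared before.

    Returns (phrases, pairs, tail): `pairs` holds (prefix_index, last_bit),
    where index 0 stands for the empty prefix, and `tail` is any leftover
    input that did not complete a new phrase.
    """
    index = {"": 0}              # phrase -> index in the dictionary
    phrases, pairs = [], []
    current = ""
    for b in bits:
        candidate = current + b
        if candidate in index:   # still a known phrase, keep extending
            current = candidate
        else:                    # shortest string not seen so far
            index[candidate] = len(index)
            phrases.append(candidate)
            pairs.append((index[current], b))
            current = ""
    return phrases, pairs, current


def format_pairs(pairs):
    """Render pairs with fixed-width binary pointers (assumed ceil(log2 c) bits)."""
    width = max(1, (len(pairs) - 1).bit_length())
    return ", ".join(f"({i:0{width}b},{b})" for i, b in pairs)


phrases, pairs, tail = lz78_parse("1011010100010")
print(phrases)               # the seven phrases of the example
print(format_pairs(pairs))   # the (pointer, bit) pairs in binary
```

For the example string the parse ends exactly on a phrase boundary, so `tail` is empty; a real coder would also need a rule for a non-empty tail.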
-
Decoding the coded sequence is straightforward and we can recover the source sequence without error!!
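A decoder only has to rebuild the same phrase list while reading the pairs. The sketch below assumes the pairs are already unpacked into (prefix index, last bit) tuples, with index 0 meaning the empty prefix; `lz78_decode` is a hypothetical name.

```python
def lz78_decode(pairs):
    """Rebuild the source string from (prefix_index, last_bit) pairs."""
    phrases = [""]               # index 0 is the empty prefix
    for prefix_index, last_bit in pairs:
        # Each new phrase is an earlier phrase extended by one bit.
        phrases.append(phrases[prefix_index] + last_bit)
    return "".join(phrases[1:])


example_pairs = [
    (0b000, "1"), (0b000, "0"), (0b001, "1"), (0b010, "1"),
    (0b100, "0"), (0b010, "0"), (0b001, "0"),
]
decoded = lz78_decode(example_pairs)
print(decoded)
```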
Two-pass algorithm :
pass-1 : parse the string and calculate c(n)
pass-2 : calculate the pointers and produce the code string
The two-pass algorithm allots an equal number of bits to all the pointers. This is not necessary, since the range of the pointers is smaller at the initial portion of the string.
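To see how much the equal-width pointers cost, one can compare them with pointers whose width grows with the dictionary. The growing scheme below is an assumption for illustration (the slide does not specify one): the pointer of the i-th phrase can take only i values (0 to i-1), so ceil(log2 i) bits are enough.

```python
def ceil_log2(i):
    """ceil(log2(i)) for an integer i >= 1, computed without floating point."""
    return (i - 1).bit_length()


def pointer_bits_fixed(c):
    """Pointer bits when all c pointers use ceil(log2 c) bits (two-pass scheme)."""
    return c * ceil_log2(c)


def pointer_bits_growing(c):
    """Pointer bits when the i-th pointer uses ceil(log2 i) bits (assumed scheme)."""
    return sum(ceil_log2(i) for i in range(1, c + 1))


c = 7  # number of phrases in the example 1, 0, 11, 01, 010, 00, 10
print(pointer_bits_fixed(c) + c)    # total bits including one last bit per phrase
print(pointer_bits_growing(c) + c)
```

For the 7-phrase example the totals are 28 and 21 bits, both larger than the 13 input bits; the gain from L-Z coding only appears for long sequences.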
T. A. Welch, "A technique for high-performance data compression," Computer, 1984.
Bell, Cleary, Witten, Text Compression, Prentice-Hall, 1990.
Dictionary construction
Pattern Matching
-
Definition :
A parsing S of a binary string $x_1, x_2, \ldots, x_n$ is a division of the string into phrases, separated by commas. A distinct parsing is a parsing such that no two phrases are identical.
The L-Z algorithm described above gives a distinct parsing of the source sequence. Let c(n) denote the number of phrases in the L-Z parsing of a sequence of length n.
The compressed sequence (after applying the L-Z algorithm) consists of a list of c(n) pairs of numbers, each pair consisting of a pointer to the previous occurrence of the prefix of the phrase (log c(n) bits) and the last bit of the phrase (1 bit).
-
The total length of the compressed sequence is $c(n)(\log c(n) + 1)$ bits.
We will show that
$$\frac{c(n)\,(\log c(n) + 1)}{n} \to H(\mathcal{X})$$
for a stationary ergodic sequence $X_1, X_2, \ldots, X_n$.
-
Lemma 1 (Lempel and Ziv) :
The number of phrases c(n) in a distinct parsing of a binary sequence $X_1, X_2, \ldots, X_n$ satisfies
$$c(n) \le \frac{n}{(1 - \epsilon_n)\log n},$$
where $\epsilon_n \to 0$ as $n \to \infty$.
-
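Let
$$n_k = \sum_{j=1}^{k} j\,2^j = (k-1)\,2^{k+1} + 2$$
: the sum of the lengths of all distinct strings of length less than or equal to k.
c(n) : the no. of phrases in a distinct parsing of a sequence of length n; this number is maximized when all the phrases are as short as possible.
If $n = n_k$, all the phrases are of length $\le k$, thus
$$c(n_k) \le \sum_{j=1}^{k} 2^j = 2^{k+1} - 2 < 2^{k+1} \le \frac{n_k}{k-1}.$$
If $n_k \le n < n_{k+1}$, we write $n = n_k + \Delta$, where $\Delta < (k+1)\,2^{k+1}$.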
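The closed form for $n_k$ and the count of strings of length at most k are easy to confirm by brute force for small k. The helper names below are hypothetical; this is only a sanity check of the two identities used in the proof.

```python
from itertools import product


def n_k_bruteforce(k):
    """Total length of all distinct binary strings of length <= k."""
    return sum(len(s) for j in range(1, k + 1) for s in product("01", repeat=j))


def n_k_closed(k):
    """Closed form (k - 1) * 2^(k+1) + 2."""
    return (k - 1) * 2 ** (k + 1) + 2


def max_phrases(k):
    """Number of distinct binary strings of length <= k, i.e. 2^(k+1) - 2."""
    return 2 ** (k + 1) - 2


for k in range(1, 11):
    print(k, n_k_bruteforce(k), n_k_closed(k), max_phrases(k))
```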
-
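Then the parsing into shortest phrases has each of the phrases of length $\le k$ and $\Delta/(k+1)$ phrases of length k+1. Thus
$$c(n) \le \frac{n_k}{k-1} + \frac{\Delta}{k+1} \le \frac{n_k + \Delta}{k-1} = \frac{n}{k-1}.$$
We now bound the size of k for a given n.
Let $n_k \le n < n_{k+1}$. Then
$$n \ge n_k = (k-1)\,2^{k+1} + 2 \ge 2^k \;\Rightarrow\; k \le \log n,$$
$$n \le n_{k+1} = k\,2^{k+2} + 2 \le (k+2)\,2^{k+2} \le (\log n + 2)\,2^{k+2} \;\Rightarrow\; k + 2 \ge \log\frac{n}{\log n + 2}.$$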
-
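Hence, for all $n \ge 4$,
$$\begin{aligned}
k - 1 &\ge \log n - \log(\log n + 2) - 3 \\
&= \left(1 - \frac{\log(\log n + 2) + 3}{\log n}\right)\log n \\
&\ge \left(1 - \frac{\log(2\log n) + 3}{\log n}\right)\log n \\
&= \left(1 - \frac{\log\log n + 4}{\log n}\right)\log n \\
&= (1 - \epsilon_n)\log n,
\end{aligned}$$
and therefore
$$c(n) \le \frac{n}{k-1} \le \frac{n}{(1 - \epsilon_n)\log n}, \quad\text{where } \epsilon_n = \frac{\log\log n + 4}{\log n} \to 0 \text{ as } n \to \infty.$$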
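A quick numerical check of Lemma 1 on the worst case used in the proof, where the sequence is made of all binary strings of length at most k, so that n = (k-1)2^(k+1)+2 and c(n) = 2^(k+1)-2. The sketch assumes logs to base 2 and treats the bound as vacuous (infinite) whenever ε_n ≥ 1; the function names are hypothetical.

```python
from math import inf, log2


def epsilon(n):
    """epsilon_n = (log log n + 4) / log n, logs to base 2."""
    return (log2(log2(n)) + 4) / log2(n)


def lemma1_bound(n):
    """Upper bound n / ((1 - epsilon_n) log n); vacuous when epsilon_n >= 1."""
    eps = epsilon(n)
    return inf if eps >= 1 else n / ((1 - eps) * log2(n))


def worst_case(k):
    """(n, c) when the phrases are exactly all binary strings of length <= k."""
    return (k - 1) * 2 ** (k + 1) + 2, 2 ** (k + 1) - 2


for k in (5, 10, 20, 30):
    n, c = worst_case(k)
    print(k, n, c, lemma1_bound(n), epsilon(n))
```

The printed ε_n values shrink slowly with k, which is why the bound is loose for short sequences.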
-
Lemma 2 :
Let Z be a positive integer valued r.v. with mean $\mu$. Then the entropy H(Z) is bounded by
$$H(Z) \le (\mu + 1)\log(\mu + 1) - \mu\log\mu.$$
pf : H.W.
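Lemma 2 can be spot-checked numerically on a few small distributions. This is only an illustration with logs to base 2, not a proof; the helper names and the example pmfs are assumptions.

```python
from math import log2


def entropy(pmf):
    """Entropy in bits of a pmf given as {value: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)


def mean(pmf):
    return sum(z * p for z, p in pmf.items())


def lemma2_bound(mu):
    """(mu + 1) log(mu + 1) - mu log(mu), for mu > 0."""
    return (mu + 1) * log2(mu + 1) - mu * log2(mu)


examples = {
    "uniform on 1..8": {z: 1 / 8 for z in range(1, 9)},
    "point mass at 1": {1: 1.0},
    "two-point": {1: 0.9, 20: 0.1},
}
for name, pmf in examples.items():
    print(name, entropy(pmf), lemma2_bound(mean(pmf)))
```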
-
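Let $\{X_i\}_{i=-\infty}^{\infty}$ be a stationary ergodic process with pmf $P(x_1, x_2, \ldots, x_n)$. For fixed integer k, define the k-th order Markov approximation to P as
$$Q_k(x_{-(k-1)}, \ldots, x_0, x_1, \ldots, x_n) \triangleq P\big(x_{-(k-1)}^{0}\big)\prod_{j=1}^{n} P\big(x_j \mid x_{j-k}^{j-1}\big),$$
where $x_i^j \triangleq (x_i, x_{i+1}, \ldots, x_j)$, $i \le j$, and the initial state $x_{-(k-1)}^{0}$ will be part of the specification of $Q_k$.
Since $P(X_n \mid X_{n-k}^{n-1})$ is itself an ergodic process, we have
$$\begin{aligned}
-\frac{1}{n}\log Q_k\big(X_1, X_2, \ldots, X_n \mid X_{-(k-1)}^{0}\big) &= -\frac{1}{n}\sum_{j=1}^{n}\log P\big(X_j \mid X_{j-k}^{j-1}\big) \\
&\to -E\log P\big(X_j \mid X_{j-k}^{j-1}\big) \\
&= H\big(X_j \mid X_{j-k}^{j-1}\big).
\end{aligned}$$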
-
We will bound the rate of the L-Z code by the entropy rate of the k-th order Markov approximation for all k. The entropy rate of the Markov approximation, $H(X_j \mid X_{j-k}^{j-1})$, converges to the entropy rate of the process as $k \to \infty$, and this will prove the result.
-
Suppose $X_{-(k-1)}^{n} = x_{-(k-1)}^{n}$, and suppose that $x_1^n$ is parsed into c distinct phrases, $y_1, y_2, \ldots, y_c$. Let $\nu_i$ be the index of the start of the i-th phrase, i.e., $y_i = x_{\nu_i}^{\nu_{i+1}-1}$.
For each $i = 1, 2, \ldots, c$, define $s_i = x_{\nu_i - k}^{\nu_i - 1}$. Thus, $s_i$ is the k bits of x preceding $y_i$; of course, $s_1 = x_{-(k-1)}^{0}$.
Let $c_{ls}$ be the number of phrases $y_i$ with length l and preceding state $s_i = s$, for $l = 1, 2, \ldots$ and $s \in \mathcal{X}^k$. We then have
$$\sum_{l,s} c_{ls} = c \quad\text{and}\quad \sum_{l,s} l\,c_{ls} = n.$$
-
Lemma 3 : (Ziv's inequality)
For any distinct parsing (in particular, the L-Z parsing) of the string $x_1, x_2, \ldots, x_n$, we have
$$\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \le -\sum_{l,s} c_{ls}\log c_{ls}.$$
Note that the right hand side does not depend on $Q_k$.
proof : we write
$$Q_k(x_1, x_2, \ldots, x_n \mid s_1) = Q_k(y_1, y_2, \ldots, y_c \mid s_1) = \prod_{i=1}^{c} P(y_i \mid s_i)$$
-
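or
$$\begin{aligned}
\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) &= \sum_{i=1}^{c}\log P(y_i \mid s_i) \\
&= \sum_{l,s}\;\sum_{i:\,|y_i| = l,\; s_i = s}\log P(y_i \mid s_i) \\
&= \sum_{l,s} c_{ls}\sum_{i:\,|y_i| = l,\; s_i = s}\frac{1}{c_{ls}}\log P(y_i \mid s_i) \\
&\le \sum_{l,s} c_{ls}\log\left(\sum_{i:\,|y_i| = l,\; s_i = s}\frac{1}{c_{ls}}\,P(y_i \mid s_i)\right),
\end{aligned}$$
where the last step is Jensen's inequality applied to the concave function log.
Now since the $y_i$ are distinct, we have
$$\sum_{i:\,|y_i| = l,\; s_i = s} P(y_i \mid s_i) \le 1.$$
Thus
$$\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \le \sum_{l,s} c_{ls}\log\frac{1}{c_{ls}}.$$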
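Ziv's inequality can be checked numerically on the earlier example string. The sketch below makes several assumptions that are not on the slides: k = 1, initial state s_1 = "0", and an arbitrary first-order Markov model given by `p_one` (the probability that the next bit is 1 after each state). All names are hypothetical, and the code expects k ≥ 1.

```python
from collections import Counter
from math import log2


def phrase_state_counts(bits, init_state):
    """Count c_ls: L-Z phrases of length l preceded by the k-symbol state s."""
    k = len(init_state)
    padded = init_state + bits          # padded[j + k] == bits[j]
    seen, counts = set(), Counter()
    start, current = 0, ""
    for j, b in enumerate(bits):
        current += b
        if current not in seen:         # a new phrase is complete
            seen.add(current)
            counts[(len(current), padded[start:start + k])] += 1
            start, current = j + 1, ""
    if current:
        raise ValueError("input ends inside a repeated phrase")
    return counts


def log_qk(bits, init_state, p_one):
    """log2 Q_k(x_1..x_n | s_1) for a k-th order Markov model (k >= 1)."""
    k = len(init_state)
    state, total = init_state, 0.0
    for b in bits:
        p1 = p_one[state]
        total += log2(p1 if b == "1" else 1.0 - p1)
        state = (state + b)[-k:]        # slide the k-symbol window
    return total


x = "1011010100010"
counts = phrase_state_counts(x, "0")
lhs = log_qk(x, "0", {"0": 0.6, "1": 0.3})   # assumed model, chosen arbitrarily
rhs = -sum(c * log2(c) for c in counts.values())
print(dict(counts))
print(lhs, rhs)
```

Changing `p_one` changes only the left-hand side, which matches the remark that the right-hand side does not depend on Q_k.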
-
Theorem:
Let $\{X_n\}$ be a stationary ergodic process with entropy rate $H(\mathcal{X})$, and let c(n) be the number of phrases in a distinct parsing of a sample of length n from this process. Then
$$\limsup_{n \to \infty}\frac{c(n)\log c(n)}{n} \le H(\mathcal{X})$$
with probability 1.
-
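pf :
$$\begin{aligned}
\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) &\le -\sum_{l,s} c_{ls}\log\frac{c_{ls}\,c}{c} \\
&= -c\log c - c\sum_{l,s}\frac{c_{ls}}{c}\log\frac{c_{ls}}{c}.
\end{aligned}$$
Writing $\pi_{ls} = c_{ls}/c$, we have
$$\sum_{l,s}\pi_{ls} = 1, \qquad \sum_{l,s} l\,\pi_{ls} = \frac{n}{c}.$$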
-
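We now define r.v.s U, V such that
$$\Pr(U = l, V = s) = \pi_{ls}.$$
Thus $E(U) = n/c$, and
$$\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \le c\,H(U,V) - c\log c,$$
or
$$-\frac{1}{n}\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \ge \frac{c}{n}\log c - \frac{c}{n}H(U,V).$$
Since $H(U,V) \le H(U) + H(V)$ and $H(V) \le \log|\mathcal{X}|^k = k$, from Lemma 2 we have
$$\begin{aligned}
H(U) &\le (EU + 1)\log(EU + 1) - EU\log EU \\
&= \left(\frac{n}{c} + 1\right)\log\left(\frac{n}{c} + 1\right) - \frac{n}{c}\log\frac{n}{c} \\
&= \log\frac{n}{c} + \left(\frac{n}{c} + 1\right)\log\left(\frac{c}{n} + 1\right).
\end{aligned}$$
Thus,
$$\frac{c}{n}H(U,V) \le \frac{c}{n}\,k + \frac{c}{n}\log\frac{n}{c} + o(1).$$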
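The rearrangement of the Lemma 2 bound with $EU = n/c$, and the reason the leftover term vanishes, can be spelled out as follows. The shorthand $a = n/c$ is introduced here only for this derivation.

```latex
\begin{aligned}
(a+1)\log(a+1) - a\log a
  &= (a+1)\left[\log a + \log\left(1 + \tfrac{1}{a}\right)\right] - a\log a \\
  &= \log a + (a+1)\log\left(1 + \tfrac{1}{a}\right), \qquad a = \tfrac{n}{c}, \\[4pt]
\frac{c}{n}\,(a+1)\log\left(1 + \tfrac{1}{a}\right)
  &= \left(1 + \tfrac{c}{n}\right)\log\left(1 + \tfrac{c}{n}\right)
  \;\to\; 0 \quad\text{as } \tfrac{c}{n} \to 0.
\end{aligned}
```

Since Lemma 1 gives $c/n \to 0$, the second line is the $o(1)$ term in the bound on $(c/n)H(U,V)$.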
-
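For a given n, the maximum of $\frac{c}{n}\log\frac{n}{c}$ is attained for the maximum value of c (for $\frac{c}{n} \le \frac{1}{e}$). But from Lemma 1,
$$c \le \frac{n}{\log n}\,(1 + o(1)).$$
Thus
$$\frac{c}{n}\log\frac{n}{c} \le O\!\left(\frac{\log\log n}{\log n}\right),$$
and therefore $\frac{c}{n}H(U,V) \to 0$ as $n \to \infty$.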
-
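Therefore,
$$\frac{c(n)\log c(n)}{n} \le -\frac{1}{n}\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) + \epsilon_k(n),$$
where $\epsilon_k(n) \to 0$ as $n \to \infty$. Hence, with probability 1,
$$\begin{aligned}
\limsup_{n \to \infty}\frac{c(n)\log c(n)}{n} &\le \lim_{n \to \infty} -\frac{1}{n}\log Q_k\big(X_1, X_2, \ldots, X_n \mid X_{-(k-1)}^{0}\big) \\
&= H(X_0 \mid X_{-1}, \ldots, X_{-k}) \\
&\to H(\mathcal{X}) \quad\text{as } k \to \infty.
\end{aligned}$$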
-
Theorem: Let $\{X_i\}_{i=-\infty}^{\infty}$ be a stationary ergodic stochastic process. Let $l(X_1, X_2, \ldots, X_n)$ be the L-Z codeword length associated with $X_1, X_2, \ldots, X_n$. Then
$$\limsup_{n \to \infty}\frac{1}{n}\,l(X_1, X_2, \ldots, X_n) \le H(\mathcal{X}) \quad\text{with probability 1.}$$
pf :
$$l(X_1, X_2, \ldots, X_n) = c(n)\,(\log c(n) + 1).$$
By Lemma 1, $\limsup_{n \to \infty} c(n)/n = 0$, so
$$\limsup_{n \to \infty}\frac{l(X_1, X_2, \ldots, X_n)}{n} = \limsup_{n \to \infty}\left(\frac{c(n)\log c(n)}{n} + \frac{c(n)}{n}\right) \le H(\mathcal{X}) \quad\text{with probability 1.}$$
The L-Z code is a universal code: the code does not depend on the distribution of the source!!
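A small experiment gives a feel for the theorem. The sketch below computes the L-Z rate c(n)(log c(n) + 1)/n for an all-zero string and for a pseudo-random Bernoulli(p) string, and prints the source entropy next to it. The names, the seed, p = 0.1 and the lengths are all assumptions of this illustration; a trailing incomplete phrase is simply counted as one more phrase.

```python
import random
from math import log2


def count_phrases(bits):
    """Number of phrases in the L-Z parsing (a trailing partial phrase counts as one)."""
    seen, current, c = set(), "", 0
    for b in bits:
        current += b
        if current not in seen:
            seen.add(current)
            c += 1
            current = ""
    return c + (1 if current else 0)


def lz_rate(bits):
    """Bits per source symbol: c(n) (log2 c(n) + 1) / n."""
    c = count_phrases(bits)
    return c * (log2(c) + 1) / len(bits)


def binary_entropy(p):
    return -p * log2(p) - (1 - p) * log2(1 - p)


zeros = "0" * 5050                       # parses as 0, 00, 000, ..., 100 phrases
rng = random.Random(0)                   # fixed seed so runs are repeatable
p = 0.1
sample = "".join("1" if rng.random() < p else "0" for _ in range(100_000))

print(count_phrases(zeros), lz_rate(zeros))
print(lz_rate(sample), binary_entropy(p))
```

At this length the measured rate for the random sample is still above the entropy, since the theorem is an asymptotic statement and the convergence is slow.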
-
Optimal Variable Length-to-Fixed Length Code
Algorithm :
step 1. node
step 2. 2^l leaf nodes
Example : Source data input : A B C C B A A A C
P_A = 0.7, P_B = 0.2, P_C = 0.1
code-tree :
root
  A 0.7
    AA 0.49
      AAA 0.343
        AAAA
        AAAB
        AAAC
      AAB 0.098
      AAC 0.049
    AB 0.14
    AC 0.07
  B 0.2
  C 0.1
-
A 1011
B 1010
C 1001
AA 1000
AB 0111
AC 0110
AAA 0101
AAB 0100
AAC 0011
AAAA 0010
AAAB 0001
AAAC 0000
Parsed input : A B, C, C, B, A A A C
Code string : 0111 1001 1001 1010 0000
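The tree in the example can be reproduced with the usual Tunstall construction: start from one leaf per symbol and keep splitting the most probable leaf. The sketch below is an assumption-laden illustration, not the slide's exact procedure: it stops at 9 leaves to match the tree shown, takes the codewords from the table above, and uses hypothetical function names. The table also lists codewords for the internal nodes A, AA and AAA, presumably for an input that ends in the middle of the tree; the sketch does not handle that case.

```python
import heapq


def tunstall_leaves(probs, max_leaves):
    """Split the most probable leaf while the leaf count stays <= max_leaves."""
    heap = [(-p, s) for s, p in probs.items()]   # max-heap via negated probability
    heapq.heapify(heap)
    growth = len(probs) - 1                      # each split adds this many leaves
    while len(heap) + growth <= max_leaves:
        neg_p, word = heapq.heappop(heap)
        for s, p in probs.items():
            heapq.heappush(heap, (neg_p * p, word + s))
    return sorted(word for _, word in heap)


def parse(message, leaves):
    """Cut the message into leaf words; returns (words, unfinished tail)."""
    leaf_set, words, current = set(leaves), [], ""
    for ch in message:
        current += ch
        if current in leaf_set:
            words.append(current)
            current = ""
    return words, current


# Codewords copied from the table above (leaf entries only).
codebook = {"B": "1010", "C": "1001", "AB": "0111", "AC": "0110", "AAB": "0100",
            "AAC": "0011", "AAAA": "0010", "AAAB": "0001", "AAAC": "0000"}

leaves = tunstall_leaves({"A": 0.7, "B": 0.2, "C": 0.1}, max_leaves=9)
words, tail = parse("ABCCBAAAC", leaves)
encoded = " ".join(codebook[w] for w in words)
print(leaves)
print(words, encoded)
```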
Tunstall 77, GIT Ph.D. thesis