Data Compression Lempel-Ziv Coding

  • 7/28/2019 Data Compression Lempel-Ziv Coding

    1/24

    Lempel-Ziv Coding

    Prof. Ja-Ling Wu

    Department of Computer Science and Information Engineering

    National Taiwan University

    2/24

    J. Ziv and A. Lempel, "Compression of individual sequences via variable-rate coding," IEEE Trans. Information Theory, 1978.

    Algorithm :

    The source sequence is sequentially parsed into strings that have not appeared so far.

    For example, 1011010100010 will be parsed as 1, 0, 11, 01, 010, 00, 10.

    After every comma, we look along the input sequence until we come to the shortest string that has not been marked off before. Since this is the shortest such string, all its prefixes must have occurred earlier. In particular, the string consisting of all but the last bit of this string must have occurred earlier.

    3/24

    We code this phrase by giving the location of the prefix and the value of the last bit.

    Let C(n) be the number of phrases in the parsing of the input n-sequence. We need log C(n) bits to describe the location of the prefix to the phrase and 1 bit to describe the last bit.

    In the above example, the code sequence is:

    (000,1), (000,0), (001,1), (010,1), (100,0), (010,0), (001,0)

    where the first number of each pair gives the index of the prefix (000 denotes the empty prefix) and the second number gives the last bit of the phrase.
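    The parsing and coding rule above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `lz_parse` and `lz_encode` are hypothetical, pointer 0 is assumed to stand for the empty prefix (as in the slide's example), and an incomplete phrase at the end of the input is simply returned uncoded.

```python
from math import ceil, log2

def lz_parse(bits):
    """Split a bit string into the shortest phrases not seen so far."""
    seen = set()
    phrases = []
    current = ""
    for b in bits:
        current += b
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    # Any leftover bits repeat an earlier phrase; return them separately.
    return phrases, current

def lz_encode(bits):
    """Two-pass encoder: fixed-width prefix pointer plus last bit per phrase."""
    phrases, tail = lz_parse(bits)              # pass 1: parse, giving C(n)
    c = len(phrases)
    width = max(1, ceil(log2(c))) if c else 1   # ceil(log C(n)) pointer bits
    index = {"": 0}                             # assumption: 0 = empty prefix
    for i, phrase in enumerate(phrases, start=1):
        index[phrase] = i
    # pass 2: pointer to the prefix (all but the last bit) and the last bit
    code = [(format(index[p[:-1]], f"0{width}b"), p[-1]) for p in phrases]
    return code, tail

code, tail = lz_encode("1011010100010")
print(code)
```

    For the example string, the seven phrases need 3-bit pointers, and the pairs agree with the code sequence listed on the slide.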

    4/24

    Decoding the coded sequence is straightforward, and we can recover the source sequence without error.
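    As a rough sketch of why decoding is straightforward: the decoder rebuilds the same phrase list the encoder used, one pair at a time. The function name `lz_decode` is hypothetical, and the convention that pointer 0 means the empty prefix is an assumption taken from the example code sequence (000,1), (000,0), ...

```python
def lz_decode(pairs):
    """Rebuild the source bits from (pointer, last_bit) pairs.

    Assumption: pointer 0 is the empty prefix and pointer i >= 1
    refers to the i-th phrase decoded so far.
    """
    phrases = [""]
    for pointer, bit in pairs:
        phrases.append(phrases[int(pointer, 2)] + bit)
    return "".join(phrases[1:])

code = [("000", "1"), ("000", "0"), ("001", "1"), ("010", "1"),
        ("100", "0"), ("010", "0"), ("001", "0")]
print(lz_decode(code))
```

    Each pointer always refers to a phrase that has already been decoded, which is what makes a single left-to-right pass sufficient.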

    Two-pass algorithm:

    pass 1: parse the string and calculate C(n)
    pass 2: calculate the pointers and produce the code string

    The two-pass algorithm allots an equal number of bits to all the pointers. This is not necessary, since the range of the pointers is smaller at the initial portion of the string.
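    To make the remark about pointer range concrete, the sketch below compares the two bit budgets. It assumes the i-th pointer can only take the values 0, ..., i-1 (the empty prefix or one of the i-1 earlier phrases), so ceil(log2 i) bits are enough for it; the function names are hypothetical.

```python
from math import ceil, log2

def fixed_width_bits(c):
    """Total bits when all c pointers use ceil(log2 c) bits, plus 1 bit each."""
    return c * (ceil(log2(c)) + 1)

def growing_width_bits(c):
    """Total bits when the i-th pointer uses only ceil(log2 i) bits."""
    return sum(ceil(log2(i)) + 1 for i in range(1, c + 1))

c = 7  # number of phrases in the example 1, 0, 11, 01, 010, 00, 10
print(fixed_width_bits(c), growing_width_bits(c))
```

    With growing pointer widths the coder no longer needs to know C(n) in advance, so a single pass suffices.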

    T. A. Welch, "A technique for high-performance data compression," Computer, 1984.

    Bell, Cleary, Witten, Text Compression, Prentice-Hall, 1990.

    Dictionary construction

    Pattern Matching

    5/24

    Definition:

    A parsing S of a binary string $x_1, x_2, \ldots, x_n$ is a division of the string into phrases, separated by commas. A distinct parsing is a parsing such that no two phrases are identical.

    The L-Z algorithm described above gives a distinct parsing of the source sequence. Let C(n) denote the number of phrases in the L-Z parsing of a sequence of length n.

    The compressed sequence (after applying the L-Z algorithm) consists of a list of C(n) pairs of numbers, each pair consisting of a pointer to the previous occurrence of the prefix of the phrase ($\log C(n)$ bits) and the last bit of the phrase (1 bit).

    6/24

    The total length of the compressed sequence is $C(n)\,(\log C(n) + 1)$ bits.

    We will show that

    $$\frac{C(n)\,(\log C(n) + 1)}{n} \to H(\mathcal{X})$$

    for a stationary ergodic sequence $X_1, X_2, \ldots, X_n$.

    7/24

    Lemma 1 (Lempel and Ziv):

    The number of phrases C(n) in a distinct parsing of a binary sequence $X_1, X_2, \ldots, X_n$ satisfies

    $$C(n) \le \frac{n}{(1 - \epsilon_n)\log n}$$

    where $\epsilon_n \to 0$ as $n \to \infty$.

    8/24

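    Let

    $$n_k = \sum_{j=1}^{k} j\,2^j = (k-1)\,2^{k+1} + 2$$

    be the sum of the lengths of all distinct strings of length less than or equal to k.

    C(n): the number of phrases in a distinct parsing of a sequence of length n. This number is maximized when all the phrases are as short as possible.

    If $n = n_k$, all the phrases are of length $\le k$, thus

    $$C(n_k) \le \sum_{j=1}^{k} 2^j = 2^{k+1} - 2 < 2^{k+1} \le \frac{n_k}{k-1}.$$

    If $n_k \le n < n_{k+1}$, we write $n = n_k + \Delta$, where $\Delta < (k+1)\,2^{k+1}$.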

    9/24

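    Then the parsing into shortest phrases has each of the phrases of length $\le k$ and $\Delta/(k+1)$ phrases of length k+1, where $\Delta = n - n_k$. Thus

    $$C(n) \le \frac{n_k}{k-1} + \frac{\Delta}{k+1} \le \frac{n_k + \Delta}{k-1} = \frac{n}{k-1}.$$

    We now bound the size of k for a given n.

    Let $n_k \le n < n_{k+1}$. Then

    $$n \ge n_k = (k-1)\,2^{k+1} + 2 \ge 2^k \quad\Rightarrow\quad k \le \log n,$$

    $$n \le n_{k+1} = k\,2^{k+2} + 2 \le (k+2)\,2^{k+2} \le (\log n + 2)\,2^{k+2},$$

    and therefore

    $$k + 2 \ge \log\frac{n}{\log n + 2}.$$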

    10/24

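    For all $n \ge 4$,

    $$\begin{aligned}
    k - 1 &\ge \log n - \log(\log n + 2) - 3 \\
          &= \left(1 - \frac{\log(\log n + 2) + 3}{\log n}\right)\log n \\
          &\ge \left(1 - \frac{\log(2\log n) + 3}{\log n}\right)\log n \\
          &= \left(1 - \frac{\log\log n + 4}{\log n}\right)\log n \\
          &= (1 - \epsilon_n)\log n,
    \end{aligned}$$

    and therefore

    $$C(n) \le \frac{n}{k-1} \le \frac{n}{(1 - \epsilon_n)\log n}, \quad\text{where } \epsilon_n = \frac{\log\log n + 4}{\log n} \to 0 \text{ as } n \to \infty.$$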
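    A quick numerical sanity check of Lemma 1 can be sketched as follows. The helper `max_distinct_phrases` (a hypothetical name) counts how many distinct binary strings fit into a budget of n bits when the shortest strings are used first, which upper-bounds C(n) for any distinct parsing. The bound uses $\epsilon_n = (\log\log n + 4)/\log n$, so it is only meaningful once $\epsilon_n < 1$ (roughly $n \ge 2^7$).

```python
from math import log2

def max_distinct_phrases(n):
    """Largest number of distinct binary strings with total length <= n."""
    count, length = 0, 1
    while n >= length:
        take = min(2 ** length, n // length)  # strings of this length we can afford
        count += take
        n -= take * length
        if take < 2 ** length:
            break
        length += 1
    return count

def lemma1_bound(n):
    """n / ((1 - eps_n) log n) with eps_n = (log log n + 4) / log n."""
    eps = (log2(log2(n)) + 4) / log2(n)
    return n / ((1 - eps) * log2(n))

for n in (2 ** 7, 2 ** 10, 2 ** 16, 2 ** 20):
    print(n, max_distinct_phrases(n), round(lemma1_bound(n)))
```

    The bound is loose for small n and tightens slowly, consistent with $\epsilon_n$ decaying only like $\log\log n / \log n$.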

    11/24

    Lemma 2:

    Let Z be a positive integer-valued r.v. with mean $\mu$. Then the entropy H(Z) is bounded by

    $$H(Z) \le (\mu + 1)\log(\mu + 1) - \mu\log\mu.$$

    pf: homework.
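    Lemma 2 can be spot-checked numerically. The sketch below is not a proof; it evaluates both sides for a few small distributions on the positive integers. The pmfs are illustrative choices, and the helper names are hypothetical.

```python
from math import log2

def entropy(pmf):
    """Entropy in bits; pmf[i] is Pr(Z = i + 1)."""
    return -sum(p * log2(p) for p in pmf if p > 0)

def mean(pmf):
    """Mean of Z when pmf[i] is Pr(Z = i + 1)."""
    return sum((i + 1) * p for i, p in enumerate(pmf))

def lemma2_bound(mu):
    """(mu + 1) log(mu + 1) - mu log(mu), logs base 2."""
    return (mu + 1) * log2(mu + 1) - mu * log2(mu)

examples = {
    "point mass at 1": [1.0],
    "uniform on 1..8": [1 / 8] * 8,
    "dyadic": [0.5, 0.25, 0.125, 0.125],
}
for name, pmf in examples.items():
    print(name, entropy(pmf), lemma2_bound(mean(pmf)))
```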

    12/24

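    Let $\{X_i\}_{i=-\infty}^{\infty}$ be a stationary ergodic process with pmf $P(x_1, x_2, \ldots, x_n)$. For fixed integer k, define the k-th order Markov approximation to P as

    $$Q_k(x_{-(k-1)}, \ldots, x_0, x_1, \ldots, x_n) \triangleq P\big(x_{-(k-1)}^{0}\big)\prod_{j=1}^{n} P\big(x_j \mid x_{j-k}^{j-1}\big),$$

    where $x_i^j \triangleq (x_i, x_{i+1}, \ldots, x_j)$, $i \le j$, and the initial state $x_{-(k-1)}^{0}$ will be part of the specification of $Q_k$.

    Since $P(X_n \mid X_{n-k}^{n-1})$ is itself an ergodic process, we have

    $$\begin{aligned}
    -\frac{1}{n}\log Q_k\big(X_1, X_2, \ldots, X_n \mid X_{-(k-1)}^{0}\big)
      &= -\frac{1}{n}\sum_{j=1}^{n}\log P\big(X_j \mid X_{j-k}^{j-1}\big) \\
      &\to -E\log P\big(X_j \mid X_{j-k}^{j-1}\big) \\
      &= H\big(X_j \mid X_{j-k}^{j-1}\big).
    \end{aligned}$$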

    13/24

    We will bound the rate of the L-Z code by the entropy rate of the k-th order Markov approximation for all k. The entropy rate of the Markov approximation, $H(X_j \mid X_{j-k}^{j-1})$, converges to the entropy rate of the process as $k \to \infty$, and this will prove the result.

    14/24

    Suppose $X_{-(k-1)}^{n} = x_{-(k-1)}^{n}$, and suppose that $x_1^n$ is parsed into C distinct phrases, $y_1, y_2, \ldots, y_C$. Let $\nu_i$ be the index of the start of the i-th phrase, i.e., $y_i = x_{\nu_i}^{\nu_{i+1}-1}$. For each $i = 1, 2, \ldots, C$, define $s_i = x_{\nu_i - k}^{\nu_i - 1}$. Thus, $s_i$ is the k bits of x preceding $y_i$; of course, $s_1 = x_{-(k-1)}^{0}$.

    Let $C_{ls}$ be the number of phrases $y_i$ with length l and preceding state $s_i = s$, for $l = 1, 2, \ldots$ and $s \in \mathcal{X}^k$. We then have

    $$\sum_{l,s} C_{ls} = C \quad\text{and}\quad \sum_{l,s} l\,C_{ls} = n.$$
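    The counts $C_{ls}$ are easy to tabulate for a concrete parsing. The sketch below uses the earlier example phrases 1, 0, 11, 01, 010, 00, 10 with k = 1; the initial state "0" is an arbitrary assumption made only for illustration, and `phrase_state_counts` is a hypothetical helper name.

```python
from collections import Counter

def phrase_state_counts(phrases, initial_state):
    """Count phrases by (length l, preceding k-bit state s), with k = len(initial_state) >= 1."""
    k = len(initial_state)
    history = initial_state
    counts = Counter()
    for y in phrases:
        counts[(len(y), history[-k:])] += 1  # state = the k bits just before y
        history += y
    return counts

phrases = ["1", "0", "11", "01", "010", "00", "10"]
counts = phrase_state_counts(phrases, initial_state="0")
print(dict(counts))
print(sum(counts.values()), sum(l * c for (l, _), c in counts.items()))
```

    The two printed totals are the number of phrases C and the sequence length n, matching the two identities above.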

    15/24

    Lemma 3 (Ziv's inequality):

    For any distinct parsing (in particular, the L-Z parsing) of the string $x_1, x_2, \ldots, x_n$, we have

    $$\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \le -\sum_{l,s} C_{ls}\log C_{ls}.$$

    Note that the right-hand side does not depend on $Q_k$.

    Proof: we write

    $$Q_k(x_1, x_2, \ldots, x_n \mid s_1) = Q_k(y_1, y_2, \ldots, y_C \mid s_1) = \prod_{i=1}^{C} P(y_i \mid s_i)$$

    16/24

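    or

    $$\begin{aligned}
    \log Q_k(x_1, x_2, \ldots, x_n \mid s_1)
      &= \sum_{i=1}^{C}\log P(y_i \mid s_i) \\
      &= \sum_{l,s}\;\sum_{i:\,|y_i| = l,\; s_i = s}\log P(y_i \mid s_i) \\
      &= \sum_{l,s} C_{ls}\sum_{i:\,|y_i| = l,\; s_i = s}\frac{1}{C_{ls}}\log P(y_i \mid s_i) \\
      &\le \sum_{l,s} C_{ls}\log\Bigg(\sum_{i:\,|y_i| = l,\; s_i = s}\frac{1}{C_{ls}}\,P(y_i \mid s_i)\Bigg),
    \end{aligned}$$

    where the inequality follows from Jensen's inequality and the concavity of the logarithm.

    Now since the $y_i$ are distinct, we have

    $$\sum_{i:\,|y_i| = l,\; s_i = s} P(y_i \mid s_i) \le 1.$$

    Thus

    $$\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \le \sum_{l,s} C_{ls}\log\frac{1}{C_{ls}}.$$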

    17/24

    Theorem:

    Let $\{X_n\}$ be a stationary ergodic process with entropy rate $H(\mathcal{X})$, and let C(n) be the number of phrases in a distinct parsing of a sample of length n from this process. Then

    $$\limsup_{n\to\infty}\frac{C(n)\log C(n)}{n} \le H(\mathcal{X})$$

    with probability 1.

    18/24

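    pf:

    $$\begin{aligned}
    \log Q_k(x_1, x_2, \ldots, x_n \mid s_1)
      &\le -\sum_{l,s} C_{ls}\log C_{ls} \\
      &= -\sum_{l,s} C_{ls}\log\frac{C_{ls}\,C}{C} \\
      &= -C\log C - C\sum_{l,s}\frac{C_{ls}}{C}\log\frac{C_{ls}}{C}.
    \end{aligned}$$

    Writing $\pi_{ls} = C_{ls}/C$, we have

    $$\sum_{l,s}\pi_{ls} = 1, \qquad \sum_{l,s} l\,\pi_{ls} = \frac{n}{C}.$$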

    19/24

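    We now define r.v.s U, V such that

    $$\Pr(U = l, V = s) = \pi_{ls}, \quad\text{where } \pi_{ls} = C_{ls}/C.$$

    Thus $E(U) = n/C$, and

    $$\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \le C\,H(U,V) - C\log C,$$

    or

    $$-\frac{1}{n}\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) \ge \frac{C}{n}\log C - \frac{C}{n}H(U,V).$$

    Since $H(U,V) \le H(U) + H(V)$ and $H(V) \le \log|\mathcal{X}|^k = k$, from Lemma 2 we have

    $$\begin{aligned}
    H(U) &\le (EU + 1)\log(EU + 1) - EU\log EU \\
         &= \Big(\frac{n}{C} + 1\Big)\log\Big(\frac{n}{C} + 1\Big) - \frac{n}{C}\log\frac{n}{C} \\
         &= \log\frac{n}{C} + \Big(\frac{n}{C} + 1\Big)\log\Big(\frac{C}{n} + 1\Big).
    \end{aligned}$$

    Thus

    $$\frac{C}{n}H(U,V) \le \frac{C}{n}\,k + \frac{C}{n}\log\frac{n}{C} + o(1).$$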

    20/24

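    For a given n, the maximum of $\frac{C}{n}\log\frac{n}{C}$ is attained for the maximum value of C (for $\frac{C}{n} \le \frac{1}{e}$). But from Lemma 1,

    $$C \le \frac{n}{\log n}\,\big(1 + o(1)\big).$$

    Thus

    $$\frac{C}{n}\log\frac{n}{C} \le O\!\left(\frac{\log\log n}{\log n}\right),$$

    and therefore $\frac{C}{n}H(U,V) \to 0$ as $n \to \infty$.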

    21/24

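    Therefore,

    $$\frac{C(n)\log C(n)}{n} \le -\frac{1}{n}\log Q_k(x_1, x_2, \ldots, x_n \mid s_1) + \epsilon_k(n),$$

    where $\epsilon_k(n) \to 0$ as $n \to \infty$. Hence, with probability 1,

    $$\begin{aligned}
    \limsup_{n\to\infty}\frac{C(n)\log C(n)}{n}
      &\le \lim_{n\to\infty} -\frac{1}{n}\log Q_k\big(X_1, X_2, \ldots, X_n \mid X_{-(k-1)}^{0}\big) \\
      &= H(X_0 \mid X_{-1}, \ldots, X_{-k}) \\
      &\to H(\mathcal{X}) \quad\text{as } k \to \infty.
    \end{aligned}$$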

    22/24

    Theorem: Let $\{X_i\}_{i=-\infty}^{\infty}$ be a stationary ergodic stochastic process. Let $l(X_1, X_2, \ldots, X_n)$ be the L-Z codeword length associated with $X_1, X_2, \ldots, X_n$. Then

    $$\limsup_{n\to\infty}\frac{1}{n}\,l(X_1, X_2, \ldots, X_n) \le H(\mathcal{X}) \quad\text{with probability 1.}$$

    pf: We have $l(X_1, X_2, \ldots, X_n) = C(n)\,(\log C(n) + 1)$. By Lemma 1, $\limsup_{n\to\infty} C(n)/n = 0$, so

    $$\limsup_{n\to\infty}\frac{l(X_1, X_2, \ldots, X_n)}{n} = \limsup_{n\to\infty}\left(\frac{C(n)\log C(n)}{n} + \frac{C(n)}{n}\right) \le H(\mathcal{X}) \quad\text{with probability 1.}$$

    The L-Z code is a universal code: the code does not depend on the distribution of the source.
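    The asymptotic statement can be explored empirically. The sketch below is an illustration under stated assumptions, not part of the proof: it draws an i.i.d. Bernoulli(0.1) sequence (a simple stationary ergodic source) with a fixed seed, counts the L-Z phrases, and prints the per-symbol code length C(n)(log C(n) + 1)/n next to the entropy. The source model, seed, sample sizes, and function names are all assumptions, and a trailing incomplete phrase is ignored.

```python
import random
from math import log2

def lz_phrase_count(bits):
    """Number of complete phrases in the L-Z parsing of a bit string."""
    seen, current, count = set(), "", 0
    for b in bits:
        current += b
        if current not in seen:
            seen.add(current)
            count += 1
            current = ""
    return count

def lz_rate(bits):
    """Per-symbol code length C(n) (log C(n) + 1) / n."""
    c = lz_phrase_count(bits)
    return c * (log2(c) + 1) / len(bits)

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) source, 0 < p < 1."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

rng = random.Random(0)  # fixed seed so runs are repeatable
p = 0.1
for n in (10 ** 3, 10 ** 4, 10 ** 5):
    bits = "".join("1" if rng.random() < p else "0" for _ in range(n))
    print(n, round(lz_rate(bits), 3), round(binary_entropy(p), 3))
```

    Convergence is slow, so at moderate n the measured rate can sit noticeably above the entropy; the theorem only describes the limit.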

    23/24

    Optimal Variable Length-to-Fixed Length Code

    Algorithm:

    step 1. node

    step 2. 2l leaf nodes

    Example: source data input: A B C C B A A A C

    P_A = 0.7, P_B = 0.2, P_C = 0.1

    Code tree:

    root
      A 0.7
        AA 0.49
          AAA 0.343
            AAAA
            AAAB
            AAAC
          AAB 0.098
          AAC 0.049
        AB 0.14
        AC 0.07
      B 0.2
      C 0.1

    24/24

    Phrase  Codeword
    A       1011
    B       1010
    C       1001
    AA      1000
    AB      0111
    AC      0110
    AAA     0101
    AAB     0100
    AAC     0011
    AAAA    0010
    AAAB    0001
    AAAC    0000

    Parsed input:  A B, C, C, B, A A A C
    Code string:   0111 1001 1001 1010 0000
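    The encoding of the example input can be reproduced with a short sketch. The code table is copied from the slide; the parsing rule (at each position, take the longest phrase present in the table) is an assumption that happens to match the slide's parse A B, C, C, B, A A A C, and the function name `encode` is hypothetical.

```python
CODE_TABLE = {
    "A": "1011", "B": "1010", "C": "1001",
    "AA": "1000", "AB": "0111", "AC": "0110",
    "AAA": "0101", "AAB": "0100", "AAC": "0011",
    "AAAA": "0010", "AAAB": "0001", "AAAC": "0000",
}

def encode(symbols, table=CODE_TABLE):
    """Greedy longest-match parse, emitting one fixed-length codeword per phrase."""
    longest = max(len(word) for word in table)
    out, pos = [], 0
    while pos < len(symbols):
        for size in range(min(longest, len(symbols) - pos), 0, -1):
            word = symbols[pos:pos + size]
            if word in table:
                out.append(table[word])
                pos += size
                break
        else:
            raise ValueError(f"no phrase in the table matches at position {pos}")
    return out

print(" ".join(encode("ABCCBAAAC")))
```

    Every codeword is 4 bits long, so the variable-length source phrases map to a fixed-length output, which is the defining property of this kind of code.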

    Tunstall 77, GIT Ph.D. thesis