Lempel-Ziv methods. Gabriele Monfardini - Corso di Basi di Dati Multimediali a.a. 2005-2006

  • Slide 1

Lempel-Ziv methods

  • Slide 2

Dictionary models - I

Dictionary-based compression methods replace substrings in a message with a codeword that identifies each substring in a dictionary, or codebook.
The dictionary contains a list of substrings and their associated codewords.
Unlike symbolwise methods, dictionary methods often use fixed codewords rather than explicit probability distributions.

  • Slide 3

Dictionary models - II

For example, we can insert into the dictionary the full set of 8-bit ASCII characters (how many? 256) and the 256 most common pairs of characters.
If we use fixed-length codewords, how many bits do we need to index the dictionary entries? SOL. 9 bits (512 entries = 2^9).
What about the performance in bits/character in the best and in the worst case? SOL. best: 4.5 bits/char (every codeword encodes a pair); worst: 9 bits/char (every codeword encodes a single character).

  • Slide 4

Dictionary models - III

Another possibility is to use longer words in the dictionary, perhaps common words like "the" or "and", or common components of words like "tion".
These strings are the phrases of the dictionary.
A dictionary with a predefined set of phrases does not achieve good compression.
Performance is better if we tune the dictionary to the input source, i.e. if we give up input independence.

  • Slide 5

Dictionary models - IV

For instance, phrases that are common in an Italian sports newspaper are very rare in a business management book.
To avoid the problem of the dictionary being unsuitable for the text at hand, we can build a new dictionary for each message to be compressed, but there is a significant overhead for transmitting and storing it.
Deciding the size of the dictionary in order to maximize compression is a very difficult problem.

  • Slide 6

The Lempel-Ziv methods

The only efficient solution to the problem is to use an adaptive dictionary scheme.
Practically all adaptive dictionary compression methods are based on one of the two methods developed by two Israeli researchers, Abraham Lempel and Jacob Ziv, in 1977 and 1978, and called LZ77 and LZ78.
"A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, May 1977.

  • Slide 7

The key idea - I

The key insight of the method is that it is possible to automatically build a dictionary of previously seen strings in the text being compressed.
The prior text makes a very good dictionary, since it usually has the same style and language as the upcoming text.

  • Slide 8

The key idea - II

The dictionary does not have to be transmitted with the compressed text, since the decompressor can build it the same way the compressor does.
The many variants of Lempel-Ziv methods differ in how pointers are represented and in the limitations on what the pointers are able to refer to.
The presence of so many variants is also due to some patents, and to the disputes over patenting.

  • Slide 9

The LZ77 family

+ Quite easy to implement
+ Fast decoding with little use of memory

The output of the encoding consists of a series of triples:
the first component indicates how far back to look in the previously decoded text;
the second component is the length of the phrase;
the third is the next character from the input.

  • Slide 10

An example - encoding

Alphabet {a,b}
[Figure: encoding example; surviving text: aaaabbb aabb]

  • Slide 11

An example - decoding

[Figure: decoding example]
SOL. x y xz xx yxzz xxyz zxz

  • Slide 12

A recursive example

Despite the recursive references, each character is available when needed.
[Figure: recursive reference example; surviving text: acaaacb??bbbbbbbbbba]

  • Slide 13

Further details on LZ77

The LZ77 algorithm places limitations on how far back a pointer can refer (i.e. on the size of the first component of the triple) and on the maximum size of the string referred to (i.e. on the size of the second component).
For example, in English text there is no gain in using a sliding window of more than a few thousand characters.
We can use a window of 8,192 characters, i.e. 13 bits for the pointer.

  • Slide 14

Further details on LZ77

At the same time, the length of the match is rarely over 16 characters, so the extra cost of allowing longer matches is usually not justified.
Exercise: encode the sequence 010020$0110$$0111 with a sliding window of 7 symbols and a maximal match length of 3. Calculate the compression ratio.
SOL. 0000000 0000001 0100100 0000010 0100111 1111001 1011011 1101101
Each triple takes 7 bits (3 for the offset, 2 for the length, 2 for the next symbol), and each of the 17 input symbols takes 2 bits in the 4-symbol alphabet {0, 1, 2, $}, so
C = (17*2)/(8*7) = 34/56 = 0.607
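
The triple format and the closing exercise can be checked with a small Python sketch. This is an illustration under stated assumptions, not the course's reference implementation: the names `lz77_encode`, `lz77_decode` and `compression_ratio` are hypothetical, the encoder is a plain greedy longest-match search, and ties between equally long matches are broken in favour of the nearest offset (an assumption that happens to reproduce the triples of the exercise solution).

```python
def lz77_encode(data, window=7, max_len=3):
    """Greedy LZ77 sketch: emit (offset, length, next_char) triples."""
    out, pos = [], 0
    while pos < len(data):
        best_off, best_len = 0, 0
        # Always keep one symbol back to serve as the explicit next character.
        limit = min(max_len, len(data) - pos - 1)
        for off in range(1, min(window, pos) + 1):
            k = 0
            # The match may run past `pos` (a recursive, overlapping reference).
            while k < limit and data[pos - off + k] == data[pos + k]:
                k += 1
            if k > best_len:  # strict '>' keeps the nearest offset on ties (assumption)
                best_off, best_len = off, k
        out.append((best_off, best_len, data[pos + best_len]))
        pos += best_len + 1
    return out


def lz77_decode(triples):
    """Rebuild the text; copying symbol by symbol makes overlapping copies work."""
    text = []
    for off, length, ch in triples:
        start = len(text) - off
        for k in range(length):
            text.append(text[start + k])
        text.append(ch)
    return "".join(text)


def compression_ratio(n_symbols, bits_per_symbol, n_triples, bits_per_triple):
    """Compressed size over original size, as in the exercise (hypothetical helper)."""
    return (n_triples * bits_per_triple) / (n_symbols * bits_per_symbol) ** -1 \
        if False else (n_symbols * bits_per_symbol) / (n_triples * bits_per_triple)


triples = lz77_encode("010020$0110$$0111", window=7, max_len=3)
ratio = compression_ratio(17, 2, len(triples), 7)
```

On the exercise sequence the sketch yields the eight triples (0,0,0) (0,0,1) (2,1,0) (0,0,2) (2,1,$) (7,2,1) (5,2,$) (6,3,1), which correspond to the 7-bit codewords of the solution when the symbols 0, 1, 2, $ are coded as 00, 01, 10, 11, and `ratio` rounds to 0.607. Decoding `[(0, 0, "a"), (0, 0, "b"), (1, 3, "a")]` gives `abbbba`: the second and third copied characters are read from text produced earlier in the same copy, which is why recursive references cause no trouble.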