Efficient encoding methods

Coding theory is the study of code properties and their suitability for specific applications.

Efficient codes are used, e.g., in data compression, cryptography, error correction, and group testing.

Codes play a central part in information theory, in particular in the design of efficient and reliable data transmission methods.

Encoding methods focus on the reduction of redundancy (in data compression) or on its clever use (in error detection and correction mechanisms).

20/06/22 Applied Algorithmics - week6 1



Data compression

Data compression is the process of encoding information using fewer bits or other information-bearing units.

Compression is possible where the input data have statistical redundancy (e.g., in text files) or when relatively minor changes leading to a smaller representation do not affect the quality/fidelity of the input (e.g., in pictures, video, or audio files).

Popular instances of data compression that many computer users are familiar with are the ZIP file format (texts), the JPEG format (pictures), and the MPEG formats (audio and video).


Data compression

Some compression schemes are reversible, so that the original data can be reconstructed (lossless data compression), while others accept some loss of data in order to achieve higher compression (lossy data compression).

Compression is important because it helps reduce the consumption of expensive resources, such as disk space or connection bandwidth. However, compression requires increased information processing power, which can also be expensive.


Data compression - simple example: Run-Length Encoding

Data files frequently contain the same character repeated many times in a row.

For example, text files use multiple spaces to separate sentences, indent paragraphs, format tables & charts, etc.

Digitized signals can also have runs of the same value, indicating that the signal is not changing.

For example, an image of the night-time sky would contain long runs of the character or characters representing the black background.


Data compression - simple example: Run-Length Encoding

In this scheme we focus on long runs of characters.

Each time a long run is encountered in the input data, two values are written to the output file.

The first of these values is the character itself, which also serves as a flag indicating that run-length compression is beginning.

The second value is the number of characters in the run.
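The idea can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption: every run, even a run of length 1, is written as a (character, count) pair, whereas the scheme above encodes only long runs. The names rle_encode and rle_decode are hypothetical.

```python
from itertools import groupby


def rle_encode(data: str) -> list[tuple[str, int]]:
    """Replace each run of equal characters by a (character, run length) pair."""
    # Simplification (assumption): short runs are encoded too.
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (character, run length) pairs back into the original string."""
    return "".join(symbol * count for symbol, count in pairs)


if __name__ == "__main__":
    pairs = rle_encode("aaabccdddd")
    print(pairs)
    print(rle_decode(pairs))
```

Encoding short runs as pairs can make the output longer than the input, which is why practical schemes restrict the encoding to long runs.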


Move to Front Transform

The Move to Front (MTF) transform is an encoding of data (typically a stream of bytes) designed to improve the performance of entropy encoding techniques of compression. Entropy encoding is a coding scheme that assigns codes to symbols so as to match code lengths with the probabilities of the symbols.

When properly implemented, it is fast enough that its benefits usually justify including it as an extra step in data compression algorithms.


Move to Front Transform

In the context of MTF each byte value is encoded by its index in a list, which changes over the course of the algorithm.

The list is initially stored, e.g., in order by byte value (0, 1, 2, 3, ..., 255). Therefore, the first byte is always encoded by its own value.

However, after encoding a byte, that value is moved to the front of the list before continuing to the next byte.


Move to Front Transform - example

Let S=<9,9,8,8,8,1,9,9,9> be an input sequence and let the initial content of the queue be Q=[0,1,2,3,4,5,6,7,8,9].

The encoding process will transform S as follows:

S=<9,9,8,8,8,1,9,9,9> and Q=[0,1,2,3,4,5,6,7,8,9]
S=<9,9,8,8,8,1,9,9,9> and Q=[9,0,1,2,3,4,5,6,7,8]
S=<9,0,8,8,8,1,9,9,9> and Q=[9,0,1,2,3,4,5,6,7,8]
S=<9,0,9,8,8,1,9,9,9> and Q=[8,9,0,1,2,3,4,5,6,7]
S=<9,0,9,0,8,1,9,9,9> and Q=[8,9,0,1,2,3,4,5,6,7]
S=<9,0,9,0,0,1,9,9,9> and Q=[8,9,0,1,2,3,4,5,6,7]
S=<9,0,9,0,0,3,9,9,9> and Q=[1,8,9,0,2,3,4,5,6,7]
S=<9,0,9,0,0,3,2,9,9> and Q=[9,1,8,0,2,3,4,5,6,7]
S=<9,0,9,0,0,3,2,0,9> and Q=[9,1,8,0,2,3,4,5,6,7]
S=<9,0,9,0,0,3,2,0,0> and Q=[9,1,8,0,2,3,4,5,6,7]

In each row, the newly encoded value in S is the position of the symbol in the previous instance of Q.
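The example can be reproduced with a short Python sketch. It keeps the queue as a plain list, which is simple but costs time proportional to the alphabet size per symbol; the function names mtf_encode and mtf_decode are illustrative, not taken from the slides.

```python
def mtf_encode(sequence, alphabet):
    """Encode each symbol by its current position in the queue, then move it to the front."""
    queue = list(alphabet)
    encoded = []
    for symbol in sequence:
        position = queue.index(symbol)
        encoded.append(position)
        queue.insert(0, queue.pop(position))
    return encoded


def mtf_decode(encoded, alphabet):
    """Invert mtf_encode by replaying the same queue updates."""
    queue = list(alphabet)
    decoded = []
    for position in encoded:
        symbol = queue.pop(position)
        decoded.append(symbol)
        queue.insert(0, symbol)
    return decoded


if __name__ == "__main__":
    s = [9, 9, 8, 8, 8, 1, 9, 9, 9]
    code = mtf_encode(s, range(10))
    print(code)                          # [9, 0, 9, 0, 0, 3, 2, 0, 0]
    print(mtf_decode(code, range(10)))   # the original sequence
```

Runs of equal symbols in the input become runs of zeros in the output, which is what makes the result easier for an entropy coder to compress.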


Burrows-Wheeler Transform

The Burrows-Wheeler transform (BWT), a.k.a. block-sorting compression, is one of the most popular methods in data compression.

It was invented by Michael Burrows and David Wheeler in the 1990s.

When a character string is transformed by the BWT, none of its characters change value. The transform cleverly rearranges the order of the characters in the string.

If the original string had several substrings that occurred frequently, then the transformed string will have several places where a single character is repeated multiple times in a row.

This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding.


Cyclic rotations

For 0 ≤ k ≤ n-1, the kth cyclic rotation of the string w = w[0..n-1] is another string v = v[0..n-1] such that v[i] = w[(i+k) mod n].

[Figure: w = xy with |x| = k, and its kth cyclic rotation v = yx.]


Burrows-Wheeler Transform

The Burrows-Wheeler transform of the string w = w[0..n-1] is defined as follows:

Create a square matrix M[n×n] in which the kth row contains the kth cyclic rotation of w.

Sort the rows of M in lexicographic order.

Store the string represented by the last column of M, together with the index of the row that contains the original string w (i.e., the 0th cyclic rotation).
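The definition translates directly into the following Python sketch. It builds all n rotations explicitly, so it needs quadratic space and is meant only to illustrate the definition, not to be an efficient implementation; the function name bwt is illustrative.

```python
def bwt(w: str) -> tuple[str, int]:
    """Return the last column of the sorted rotation matrix and the row index of w."""
    n = len(w)
    # Row k of M is the kth cyclic rotation: v[i] = w[(i + k) mod n].
    rotations = [w[k:] + w[:k] for k in range(n)]
    # Sort rotation numbers by the rotation they denote (lexicographic order).
    order = sorted(range(n), key=lambda k: rotations[k])
    last_column = "".join(rotations[k][-1] for k in order)
    # Position of the 0th rotation (the original string) among the sorted rows.
    return last_column, order.index(0)


if __name__ == "__main__":
    print(bwt("babbabab"))  # ('bbbbbaaa', 4)
```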


Burrows-Wheeler Transform (example)

Consider the 5th Fibonacci word f5=babbabab.

Matrix M of cyclic rotations:

[0] b a b b a b a b
[1] a b b a b a b b
[2] b b a b a b b a
[3] b a b a b b a b
[4] a b a b b a b b
[5] b a b b a b b a
[6] a b b a b b a b
[7] b b a b b a b a

Matrix M after sorting its rows (BWT). The first label is the row index in the sorted matrix, the second is the index of the cyclic rotation:

[0] [4] a b a b b a b b
[1] [1] a b b a b a b b
[2] [6] a b b a b b a b
[3] [3] b a b a b b a b
[4] [0] b a b b a b a b
[5] [5] b a b b a b b a
[6] [2] b b a b a b b a
[7] [7] b b a b b a b a

The output is the string bbbbbaaa and the position [4].


Burrows-Wheeler Transform

The Burrows-Wheeler transform can be computed by an algorithm that constructs suffix arrays, which means that it can be computed in linear time.

The Burrows-Wheeler transform is reversible, and the original string can be recovered efficiently via generation of consecutive columns of the matrix M.


Burrows-Wheeler (reverse) Transform - hard way

[Figure: the matrix M is regenerated column by column, starting from the BWT column bbbbbaaa and the first column aaabbbbb obtained by sorting it; the rows grow by one symbol per round until full rotations such as bababbab are recovered.]


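Burrows-Wheeler (reverse) Transform - easy way based on stable sorting property

[Figure: the BWT column bbbbbaaa shown next to the 1st column aaabbbbb, with corresponding symbols linked. By the stable sorting property, the i-th occurrence of a symbol in the 1st column corresponds to the i-th occurrence of that symbol in the BWT column.]

Reverse BWT: just follow the cycle of corresponding symbols, starting from position [4]:

b a b b a b a b
0 1 2 3 4 5 6 7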
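A minimal Python sketch of the easy way, assuming the transform is given as the last column together with the row index of the original string. It relies on the fact that Python's sorted is stable, so equal symbols keep their relative order; the names inverse_bwt and link are illustrative.

```python
def inverse_bwt(last_column: str, index: int) -> str:
    """Recover the original string by following the cycle of corresponding symbols."""
    n = len(last_column)
    # link[i] is the position in the last column of the symbol that stands
    # at position i of the first (sorted) column. Stability of sorted()
    # matches the i-th occurrence of a symbol in both columns.
    link = sorted(range(n), key=lambda i: last_column[i])
    result = []
    row = index
    for _ in range(n):
        row = link[row]
        result.append(last_column[row])
    return "".join(result)


if __name__ == "__main__":
    print(inverse_bwt("bbbbbaaa", 4))  # babbabab
```

Only one sort and one pass over the cycle are needed, in contrast to regenerating every column of the matrix.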


19/04/23 Applied Algorithmics - week7 16

Lempel-Ziv-Welch Compression

The Lempel-Ziv-Welch (LZW) compression algorithm is an example of dictionary-based methods, in which longer fragments of the input text are replaced by much shorter references to code words stored in a special set called the dictionary.

LZW is an implementation of a lossless data compression algorithm developed by Abraham Lempel and Jacob Ziv.

It was published by Terry Welch in 1984 as an improved version of the LZ78 dictionary coding algorithm developed by Lempel and Ziv.


LZW Compression

The key insight of the method is that it is possible to automatically build a dictionary of previously seen strings in the text being compressed.

The dictionary starts off with 256 entries, one for each possible character (single-byte string).

Every time a string not already in the dictionary is seen, it is stored in the dictionary. Such a string consists of the longest matching dictionary string followed by the single character that comes after it in the text.


LZW Compression

The output consists of integer indices into the dictionary.

These initially are 9 bits each, and as the dictionary grows, can increase to up to 16 bits.

A special symbol is reserved for "flush the dictionary" which takes the dictionary back to the original 256 entries, and 9 bit indices. This is useful if compressing a text which has variable characteristics, since a dictionary of early material is not of much use later in the text.

This use of variably increasing index sizes is one of Welch's contributions. Another was to specify an efficient data structure to store the dictionary.


LZW Compression - example

Fibonacci language: w0 = a, w1 = b, wi = wi-1·wi-2 for i>1.

For example, w6 = babbababbabba.

We show how LZW compresses babbababbabba.

[Figure: the text preceded by a virtual part that holds the two single-symbol code words at positions -2 and -1, i.e., b a | b a b b a b a b b a b b a, with the code words CW0, CW1, CW2, CW3, CW4, CW5 marked along the text.]

In particular:

CW4 = CW3 ∘ First(CW5)

And in general:

CWi = CWj ∘ First(CWi+1) and j<i
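The Fibonacci words used in the examples can be generated with a short Python sketch. It assumes the base cases are a and b with indices 0 and 1, which is the indexing that reproduces both words quoted in these slides (babbabab and babbababbabba); the function name fibonacci_word is illustrative.

```python
def fibonacci_word(i: int) -> str:
    """Return the i-th Fibonacci word, assuming w0 = a, w1 = b, wi = wi-1 + wi-2."""
    previous, current = "a", "b"  # assumed base cases w0 and w1
    if i == 0:
        return previous
    for _ in range(i - 1):
        previous, current = current, current + previous
    return current


if __name__ == "__main__":
    print(fibonacci_word(5))  # babbabab
    print(fibonacci_word(6))  # babbababbabba
```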


LZW Compression - example

cw-2 = b
cw-1 = a
cw0 = ba
cw1 = ab
cw2 = bb
cw3 = bab
cw4 = babb
cw5 = babba

[Figure: the dictionary drawn as a tree whose edges are labelled a and b and whose nodes are the code words cw-2, cw-1, cw0, ..., cw5.]


LZW Compression - compression stage

cw ← ε;
while ( read next symbol s from IN )
    if cw·s exists in the dictionary then
        cw ← cw·s;
    else
        add cw·s to the dictionary;
        save the index of cw in OUT;
        cw ← s;
if cw is not empty then
    save the index of cw in OUT;
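The compression stage can be sketched in Python as follows. Several simplifications are assumptions of this sketch rather than part of the slides: the output is a list of integers instead of 9 to 16 bit codes, the dictionary is never flushed, and the index of the last pending code word is written once the input ends. The name lzw_compress is illustrative.

```python
def lzw_compress(text: str) -> list[int]:
    """Return the list of dictionary indices produced by LZW for the given text."""
    # Initial dictionary: one entry per single-byte character.
    dictionary = {chr(i): i for i in range(256)}
    cw = ""
    out = []
    for s in text:
        if cw + s in dictionary:
            cw += s
        else:
            dictionary[cw + s] = len(dictionary)  # next free index: 256, 257, ...
            out.append(dictionary[cw])
            cw = s
    if cw:
        out.append(dictionary[cw])  # flush the last pending code word
    return out


if __name__ == "__main__":
    # 'b' is 98 and 'a' is 97; indices from 256 on denote ba, ab, bb, bab, babb, babba.
    print(lzw_compress("babbababbabba"))  # [98, 97, 98, 256, 259, 260, 97]
```

Keying the dictionary by whole strings keeps the sketch short; an implementation aiming at linear time would rather key it by the pair (index of cw, s).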


Decompression stage

Input IN - compressed file of integers.
Output OUT - decompressed file of characters.
|IN| = Z - size of the compressed file.

Copy all numbers from file IN to vector V[256..Z+255]
Create vector F[256..Z+255] containing the first characters of each code word
Create vector CW[256..Z+255] of all code words
for i=256 to Z+255 do
    if V[i] < 256 then
        CW[i] ← Concatenate(char(V[i]), F[i+1])
    else
        CW[i] ← Concatenate(CW[V[i]], F[i+1])
Write to the output file OUT all code words without their last symbols
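The decompression stage can be sketched in Python along the same lines. The sketch keeps the vectors V, F and CW with indices starting at 256, stored as Python dictionaries; it assumes the input is a plain list of integers and that the last code word simply has no following symbol. The name lzw_decompress is illustrative.

```python
def lzw_decompress(codes: list[int]) -> str:
    """Rebuild the text from LZW indices using the vectors V, F and CW."""
    z = len(codes)
    V = {256 + k: code for k, code in enumerate(codes)}
    # F[i]: first character of the phrase denoted by V[i] (and of code word i).
    F = {}
    for i in range(256, 256 + z):
        F[i] = chr(V[i]) if V[i] < 256 else F[V[i]]
    CW = {}
    out = []
    for i in range(256, 256 + z):
        phrase = chr(V[i]) if V[i] < 256 else CW[V[i]]
        # Code word i is the phrase extended by the first symbol of the next phrase;
        # the last code word has no successor (assumption: it is left unextended).
        CW[i] = phrase + F.get(i + 1, "")
        # The phrase equals code word i without its last symbol.
        out.append(phrase)
    return "".join(out)


if __name__ == "__main__":
    print(lzw_decompress([98, 97, 98, 256, 259, 260, 97]))  # babbababbabba
```

Because F is computed before CW, an index that refers to the code word created in the immediately preceding step needs no special handling.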


LZW text compression

Theorem: For any input string S, the LZW algorithm computes its compressed counterpart in time O(n), where n is the length of S.

Sketch of proof: The most complex operations are performed on the dictionary. With the help of hash tables (for instance, with entries keyed by the pair formed by the index of cw and the symbol s), each dictionary operation takes constant expected time, so all of them together take linear time.

The decompression stage is also linear.