
1

Binary files
Bit operations
Using trees, bit ops and binary files

Huffman compression

CSE 30331 Lecture 22: Huffman codes

2

File Structure

A text file contains ASCII characters with a newline sequence separating lines.

A binary file consists of data objects that vary from a single character (byte) to more complex structures that include integers, floating point values, programmer-generated class objects, and arrays.

Each data object in a file is a record.

Figure: File as a direct access structure. Records R0, R1, R2, ..., Rn-1 occupy positions 0 through n-1; currentPos marks the current position.

3

Direct File Access

The functions seekg() and seekp() reposition the read and write position, respectively.

They take an offset argument indicating the number of bytes from the beginning (beg), ending (end), or current position (cur) in the file.

The functions tellg() and tellp() return the current read and write position.

Figure: offsets measured from beg, cur, and end; an offset past end expands the file.

4

Reading & writing

To read from a binary file, use read(char *p, int num);
This reads num bytes of data from the file, beginning at the current read position in the file.

Example:

// read record 5 (records are numbered from 0) from the file
accountType acct;
int n = 5;
ifstream infile;
infile.open("accounts.dat", ios::in | ios::binary);
infile.seekg(n*sizeof(accountType), ios::beg);    // skip the first n records
infile.read((char *)&acct, sizeof(accountType));  // copy one record into acct

5

Reading & writing

To write to a binary file, use write(char *p, int num);
This writes num bytes of data to the file, beginning at the current write position in the file.

Example:

// write record 5 (records are numbered from 0) to the file
accountType acct;
int n = 5;
ofstream outfile;
// ios::in keeps the existing records; ios::out on its own would truncate the file
outfile.open("accounts.dat", ios::in | ios::out | ios::binary);
outfile.seekp(n*sizeof(accountType), ios::beg);     // move to the start of record n
outfile.write((char *)&acct, sizeof(accountType));  // copy acct into the file

6

Bit operations (a reminder)

Bitwise ops
And ( & )   0101 & 0110 -> 0100
Or  ( | )   0101 | 0110 -> 0111
Xor ( ^ )   0101 ^ 0110 -> 0011
Not ( ~ )   ~0101 -> 1010
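The four results can be checked at compile time. This is a sketch with one made-up helper, lowNibble, which keeps only the low four bits; it is needed because C++ applies ~ to a full-width integer, so the higher bits are flipped as well.

```cpp
// Keep only the low four bits so results can be compared with 4-bit examples.
constexpr unsigned lowNibble(unsigned v) { return v & 0x0Fu; }

// 0101 is 0x5, 0110 is 0x6
static_assert(lowNibble(0x5u & 0x6u) == 0x4u, "0101 & 0110 -> 0100");
static_assert(lowNibble(0x5u | 0x6u) == 0x7u, "0101 | 0110 -> 0111");
static_assert(lowNibble(0x5u ^ 0x6u) == 0x3u, "0101 ^ 0110 -> 0011");
static_assert(lowNibble(~0x5u) == 0xAu, "~0101 -> 1010 within four bits");
static_assert(~0x5u != 0xAu, "without the mask, ~ also flips every higher bit");
```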

7

Implementing a bitVector Class

Figure: setting and clearing bit i in member[vectorIndex(i)].

Set bit i: bitMask(i) | member[vectorIndex(i)]
(the mask 0 0 0 0 1 0 0 0 ORed with x x x x 0 x x x gives x x x x 1 x x x)

Clear bit i: ~bitMask(i) & member[vectorIndex(i)]
(the mask 1 1 1 1 0 1 1 1 ANDed with x x x x 1 x x x gives x x x x 0 x x x)

bitMask() returns an unsigned character value containing a 1 in the bit position representing i.
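A stripped-down sketch of the set and clear operations follows. MiniBitVector is a made-up name, not the course's bitVector class; it borrows the names member, vectorIndex() and bitMask() from the slide and assumes bit 0 is the left-most bit of each byte.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class MiniBitVector {
public:
    explicit MiniBitVector(std::size_t numBits)
        : size_(numBits), member_((numBits + 7) / 8, 0) {}

    // Set bit i: OR the byte with a mask that has a 1 only in position i.
    void set(std::size_t i) {
        assert(i < size_);
        member_[vectorIndex(i)] |= bitMask(i);
    }
    // Clear bit i: AND the byte with the complement of that mask.
    void clear(std::size_t i) {
        assert(i < size_);
        member_[vectorIndex(i)] &= static_cast<unsigned char>(~bitMask(i));
    }
    bool bit(std::size_t i) const {
        assert(i < size_);
        return (member_[vectorIndex(i)] & bitMask(i)) != 0;
    }
    unsigned char byte(std::size_t k) const { return member_.at(k); }

private:
    // Which byte holds bit i.
    static std::size_t vectorIndex(std::size_t i) { return i / 8; }
    // A byte with a single 1, counting positions from the left.
    static unsigned char bitMask(std::size_t i) {
        return static_cast<unsigned char>(1u << (7 - i % 8));
    }
    std::size_t size_;
    std::vector<unsigned char> member_;
};
```

With this layout, setting bit 4 of an empty vector leaves the first byte as 0000 1000.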

8

Lossless Compression

Data compression loses no information
Original data can be recovered exactly from the compressed data
Normally applied to "discrete" data, such as text, word processing files, computer applications, and so forth

Figure: a document ("This paper ... Submitted by J. Q. Student") is compressed to a bit stream, then reconstructed as an identical document.

9

Lossy Compression

Loses some information during compression and the data cannot be recovered exactly

Shrinks the data further than lossless compression techniques

Sound files often use this type of compression

Figure: a bit stream is compressed to a much shorter one; the reconstructed stream is not identical to the original.

10

Huffman Compression

A lossless compression technique
Counts occurrences of eight-bit characters in the data
Uses the counts to construct variable-length codes, shorter for more frequently occurring characters
No code is a prefix of any other code, so each code can be recognized unambiguously
The encoding (compression) process creates an “optimal” binary tree representing these prefix codes
Uses a “greedy approach”: makes use of data on hand to choose the best option at each step
Example: Dijkstra’s algorithm is a greedy approach
Typically achieves compression ratios of about 1.8 (roughly a 45% reduction) on text; not as good on binary data
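The counting step can be sketched in a few lines. The function name countBytes is made up for this example; it assumes the data has already been read into memory as a string of bytes.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Tally how often each 8-bit value occurs in the data.
std::array<std::size_t, 256> countBytes(const std::string& data) {
    std::array<std::size_t, 256> counts{};           // value-initialized to all zeros
    for (unsigned char byte : data) ++counts[byte];  // index by the unsigned byte value
    return counts;
}
```

Converting each char to unsigned char first matters on platforms where char is signed, since a negative value cannot be used as an array index.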

11

Example Huffman Tree

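Figure: the example tree, shown as an outline with the left child listed first.

57
    21
        c:8
        13
            d:6
            7
                f:3
                b:4
    36
        e:20
        a:16

Each internal node contains the sum of its children’s frequencies

Leaves contain the original letters and their frequencies

The edge to a left child is a 0 bit and the edge to a right child is a 1 bit

Codes
a  11
b  0111
c  00
d  010
e  10
f  0110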

12

Building Huffman Code Trees

Read the file and determine the frequency of each letter
Store nodes (letters and frequencies) in a minimum priority queue
    Probably implemented as a heap, with ordering based on frequencies
Loop until only one node is left in the queue
    Remove the two smallest-valued nodes from the queue
    Make them the two children of a new root node with value equal to their sum
    Add the new node to the queue
Result is a tree rooted at the last node remaining in the queue
Codes all have unique prefixes
    Derived for each letter (leaf node) by traversing links in the tree from root to leaf
    Left is a 0 bit in the code; right is a 1 bit in the code
    The length of each code is the depth of the leaf in the tree
So …
    Shortest codes for the most frequently occurring data values
    Longest codes for the least frequently occurring data values
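The loop above can be sketched with std::priority_queue. This is an illustration under stated assumptions, not the course's hCompress class: HNode, buildHuffmanTree and collectCodes are made-up names, nodes live in a vector and refer to each other by index, and the first node removed in each pass becomes the left child. That last choice is arbitrary, so individual bits may differ from the slides' codes while the code lengths stay the same.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

struct HNode {
    char ch;     // meaningful for leaves only
    int freq;
    int left;    // index of left child, -1 for a leaf
    int right;   // index of right child, -1 for a leaf
};

// Leaves are stored first, internal nodes are appended, and the root ends up last.
std::vector<HNode> buildHuffmanTree(const std::map<char, int>& freqs) {
    std::vector<HNode> tree;
    using Entry = std::pair<int, int>;  // (frequency, node index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    for (const auto& kv : freqs) {
        tree.push_back(HNode{kv.first, kv.second, -1, -1});
        pq.push(Entry(kv.second, static_cast<int>(tree.size()) - 1));
    }
    while (pq.size() > 1) {
        const Entry a = pq.top(); pq.pop();  // smallest
        const Entry b = pq.top(); pq.pop();  // second smallest
        tree.push_back(HNode{'\0', a.first + b.first, a.second, b.second});
        pq.push(Entry(a.first + b.first, static_cast<int>(tree.size()) - 1));
    }
    return tree;
}

// Walk down from node id, appending 0 for a left edge and 1 for a right edge.
void collectCodes(const std::vector<HNode>& tree, int id,
                  const std::string& prefix,
                  std::map<char, std::string>& codes) {
    const HNode& n = tree[static_cast<std::size_t>(id)];
    if (n.left == -1) { codes[n.ch] = prefix; return; }
    collectCodes(tree, n.left, prefix + '0', codes);
    collectCodes(tree, n.right, prefix + '1', codes);
}
```

Calling collectCodes(tree, tree.size() - 1, "", codes) starts the walk at the root. A file with a single distinct character would get an empty code here, so a real implementation needs a special case for it.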

13

Huffman tree

The Huffman code tree is optimal in this sense
    All internal nodes have two children, so there are no unused unique prefixes
    So, the number of shorter codes is the maximum number possible given the frequencies in the data

The degree of compression (size of compressed data) is …

$$
\text{cost} = \sum_{\text{all unique } ch \text{ in file}} f(ch)\, d(ch)
$$

Where f(ch) is the frequency of ch and d(ch) is the number of bits in its code
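As a worked instance of the sum of f(ch) times d(ch), take the six-letter example tree shown earlier (frequencies a:16, b:4, c:8, d:6, e:20, f:3, with code lengths 2, 4, 2, 3, 2, 4):

$$
\text{cost} = 16\cdot 2 + 4\cdot 4 + 8\cdot 2 + 6\cdot 3 + 20\cdot 2 + 3\cdot 4 = 32 + 16 + 16 + 18 + 40 + 12 = 134 \text{ bits}
$$

Stored as 8-bit characters, the same 57 letters take $57 \cdot 8 = 456$ bits, so the coded data alone is about $456/134 \approx 3.4$ times smaller. This figure ignores the space needed to store the tree.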

14

Building a Huffman Tree

Priority Queue (initial contents): a:16, b:4, c:8, d:6, e:20, f:3

15

Building a Huffman Tree (after first pass)

(f:3) and (b:4) were the lowest frequency nodes, so they were joined to a parent (7), which was then added back to the queue

Priority Queue: a:16, e:20, c:8, d:6, 7 (children f:3 and b:4)

16

Building a Huffman Tree (after second pass)

(d:6) and (7) were the lowest frequency nodes, so they were joined to a parent (13), which was then added back to the queue

Priority Queue: a:16, e:20, c:8, 13 (children d:6 and 7)

17

Building a Huffman Tree (after third pass)

(c:8) and (13) were the lowest frequency nodes, so they were joined to a parent (21), which was then added back to the queue

Priority Queue: a:16, e:20, 21 (children c:8 and 13)

18

Building a Huffman Tree (after fourth pass)

(e:20) and (a:16) were the lowest frequency nodes, so they were joined to a parent (36), which was then added back to the queue

Priority Queue: 21 (children c:8 and 13), 36 (children e:20 and a:16)

19

Building a Huffman Tree (after last pass)

(21) and (36) were the lowest frequency nodes, so they were joined to a parent (57), which was then added back to the queue

Priority Queue: 57 (children 21 and 36), the root of the finished tree

20

The Huffman tree in memory

Figure: the completed Huffman tree (root 57, with children 21 and 36), stored as the table below.

ID  ch   freq  pID  left  right  code
0   A    16    9    -1    -1     11
1   B    4     6    -1    -1     0111
2   C    8     8    -1    -1     00
3   D    6     7    -1    -1     010
4   E    20    9    -1    -1     10
5   F    3     6    -1    -1     0110
6   Int  7     7    5     1
7   Int  13    8    3     6
8   Int  21    10   2     7
9   Int  36    10   4     0
10  Int  57    0    8     9

Sample compression
“face” = 0110 11 00 10
# of bits: 4*8 = 32 vs. 4+2+2+2 = 10
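The sample compression can be reproduced with a lookup table of codes. This is a sketch with made-up names (encodeBits, packBits); it builds the bit sequence as a string of '0' and '1' characters for readability, then packs it into bytes with the left-most bit first and zero padding in the final byte.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Concatenate the code of every character in text.
std::string encodeBits(const std::string& text,
                       const std::map<char, std::string>& codes) {
    std::string bits;
    for (char ch : text) bits += codes.at(ch);  // at() throws for a character with no code
    return bits;
}

// Pack a string of '0'/'1' characters into bytes, eight bits per byte.
std::vector<unsigned char> packBits(const std::string& bits) {
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        assert(bits[i] == '0' || bits[i] == '1');
        if (bits[i] == '1')
            bytes[i / 8] |= static_cast<unsigned char>(1u << (7 - i % 8));
    }
    return bytes;
}
```

With the codes from the table, "face" becomes the 10 bits 0110110010, which pack into the two bytes 0x6C and 0x80. Because of the padding, the number of valid bits has to be stored alongside the bytes.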

21

The Huffman tree in file

ID  ch   freq  pID  left  right  code
0   A    16    9    -1    -1     11
1   B    4     6    -1    -1     0111
2   C    8     8    -1    -1     00
3   D    6     7    -1    -1     010
4   E    20    9    -1    -1     10
5   F    3     6    -1    -1     0110
6   Int  7     7    5     1
7   Int  13    8    3     6
8   Int  21    10   2     7
9   Int  36    10   4     0
10  Int  57    0    8     9

Only the gray fields (ch, left, right) are written to store the tree in the compressed file

The tree can then be rebuilt from the ch, left and right child indices read from the file.

The last node is the root, and codes can be rediscovered as bits are read from the file and the tree is followed from root to leaf

22

Format of compressed file

There are four parts
    Size of tree
    The tree: a vector of (ch, leftID, rightID) data
    Size of compressed data
    The compressed data

23

Uncompressing tree in file

ID  ch   left  right  code
0   A    -1    -1     11
1   B    -1    -1     0111
2   C    -1    -1     00
3   D    -1    -1     010
4   E    -1    -1     10
5   F    -1    -1     0110
6   Int  5     1
7   Int  3     6
8   Int  2     7
9   Int  4     0
10  Int  8     9

(1) Read size of tree

(2) Read tree from file into vector or array

(3) Read size of compressed data

(4) Start at the root (the last node in the vector, node[10] here)

(5) For each bit (b) read

    (a) If (b==0) move to left child

    (b) If (b==1) move to right child

    (c) If now at a leaf, append the leaf’s letter to the uncompressed data and return to the root
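Step (5) can be sketched as a short loop over the bits. TreeNode and decodeBits are made-up names; the sketch assumes the tree vector holds the root in its last slot, that the tree has at least two leaves, and that the bits arrive as a string of '0' and '1' characters rather than packed bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct TreeNode {
    char ch;     // letter for a leaf, unused for an internal node
    int left;    // index of left child, -1 for a leaf
    int right;   // index of right child, -1 for a leaf
};

std::string decodeBits(const std::vector<TreeNode>& tree, const std::string& bits) {
    assert(!tree.empty() && tree.back().left != -1);  // the root must be an internal node
    const int root = static_cast<int>(tree.size()) - 1;
    std::string out;
    int cur = root;
    for (char b : bits) {
        const TreeNode& n = tree[static_cast<std::size_t>(cur)];
        cur = (b == '0') ? n.left : n.right;          // 0 goes left, 1 goes right
        const TreeNode& next = tree[static_cast<std::size_t>(cur)];
        if (next.left == -1) {                        // reached a leaf
            out += next.ch;
            cur = root;                               // start the next code at the root
        }
    }
    return out;
}
```

Run on the table above with the bits 0110110010, this visits nodes 8, 7, 6, 5, then 9, 0, then 8, 2, then 9, 4, and yields "face".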

24

Uncompressing “face”

Bit data: 0110110010

Bit  node  letter
     10
0    8
1    7
1    6
0    5     ‘f’
     10
1    9
1    0     ‘a’
     10
0    8
0    2     ‘c’
     10
1    9
0    4     ‘e’
     10

ID  ch   left  right  code
0   A    -1    -1     11
1   B    -1    -1     0111
2   C    -1    -1     00
3   D    -1    -1     010
4   E    -1    -1     10
5   F    -1    -1     0110
6   Int  5     1
7   Int  3     6
8   Int  2     7
9   Int  4     0
10  Int  8     9

25

Summary

Binary File

A sequence of 8-bit characters, with no requirement that a character be printable and no concern for a newline sequence that terminates lines

Often organized as a sequence of records: record 0, record 1, record 2, ..., record n-1.

Used for both input and output; the C++ header <fstream> contains the operations that support these files.

The open() function must use the attribute ios::binary

26

Summary

Binary File (Cont…)

For direct access to a file record, use the function seekg(), which moves the file pointer to a file record

It accepts an argument that specifies motion from the beginning of the file (ios::beg), from the current position of the file pointer (ios::cur), or from the end of the file (ios::end)

Use the read() function to input a sequence of bytes from the file into a block of memory, and the write() function to output a block of memory to a binary file

27

Summary

Bit Manipulation Operators

| (OR), & (AND), ^ (XOR), ~ (NOT), << (shift left), and >> (shift right)

Used to perform operations on specific bits within a character or integer value.

The class bitVector uses operator overloading to treat a sequence of bits as an array, with bit 0 the left-most bit of the sequence

bit(), set(), and clear() allow access to specific bits

The class has I/O operations for binary files and the stream operator << that outputs a bit vector as an ASCII sequence of 0 and 1 values.

28

Summary

File Compression Algorithm

Encodes a file as a sequence of characters that consumes less disk space than the original file.

Two types of compression algorithms:

1) lossless compression
Restores the original file.
Approach: count the frequency of occurrence of each character in the file and assign a prefix bit code to each character

File size: the sum of the products of each bit-code length and the frequency of occurrence of the corresponding character.

29

Summary

File Compression Algorithm (Cont…)

2) lossy compression
Loses some information during compression, and the data cannot be recovered exactly
Normally used with sound and video files

The Huffman compression algorithm is a lossless process that builds optimal prefix codes by constructing a tree with
    the most frequently occurring characters, which get shorter bit codes, as leaves close to the root
    the less frequently occurring characters, which get longer bit codes, as leaves farther from the root.

30

Summary

File Compression Algorithm (Cont…)

If the file contains n distinct characters, the loop concludes after n-1 iterations, having built a Huffman tree containing n-1 internal nodes.

Implementation requires the use of a minimum priority queue (heap), bit operations, and binary files

The use of the bitVector class simplifies the construction of the classes hCompress and hDecompress, which perform Huffman compression and decompression.

Works better with text files; they tend to have fewer unique characters than binary files.
