Binary files
Bit operations
Using trees, bit ops and binary files: Huffman compression
CSE 30331 Lecture 22 – Huffman codes
2
File Structure
A text file contains ASCII characters with a newline sequence separating lines.
A binary file consists of data objects that vary from a single character (byte) to more complex structures that include integers, floating point values, programmer-generated class objects, and arrays.
Each data object in a file is a record
[Figure: File as a direct access structure. Records R0, R1, R2, ..., Rn-1 occupy positions 0 through n-1; currentPos marks the current record.]
3
Direct File Access
The functions seekg() and seekp() reposition the read and write position, respectively.
They take an offset argument indicating the number of bytes from the beginning (beg), ending (end), or current position (cur) in the file.
The functions tellg() and tellp() return the current read and write position.
[Figure: offsets measured from beg, cur, and end; an offset past end expands the file.]
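As a minimal sketch of these calls, the functions below write ten bytes to a scratch file and then use seekg() and tellg() to measure the file and to fetch a byte relative to each anchor. The helper names (fileLength, byteAt, makeDigitsFile) and the file name are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
using namespace std;

// Returns the file length in bytes: seek to the end, then ask for the position.
long fileLength(ifstream& in)
{
    assert(in.is_open());
    in.seekg(0, ios::end);
    return static_cast<long>(in.tellg());
}

// Returns the byte located `offset` bytes away from the anchor `dir`
// (ios::beg, ios::cur, or ios::end).
char byteAt(ifstream& in, long offset, ios::seekdir dir)
{
    in.seekg(offset, dir);
    char c = '\0';
    in.get(c);
    return c;
}

// Creates a scratch file holding the ten characters '0' through '9'.
void makeDigitsFile(const char* name)
{
    ofstream out(name, ios::out | ios::binary);
    out.write("0123456789", 10);
}
```

With the scratch file in place, byteAt(in, 3, ios::beg) fetches the fourth byte and byteAt(in, -1, ios::end) fetches the last one; a negative offset is the usual way to move backward from end.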
4
Reading & writing
To read from a binary file
  Use read(char *p, int num);
  This reads num bytes of data from the file, beginning at the current read position, into the memory that p points to
Example:
#include <fstream>
// read record 5 (records are numbered from 0) from the file
accountType acct;
int n = 5;
ifstream infile;
infile.open("accounts.dat", ios::in | ios::binary);
// move the read position to the start of record n
infile.seekg(n*sizeof(accountType), ios::beg);
// copy the raw bytes of the record into acct
infile.read((char *)&acct, sizeof(accountType));
5
Reading & writing
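To write to a binary file
  Use write(char *p, int num);
  This writes num bytes of data from memory to the file, beginning at the current write position in the file
Example:
#include <fstream>
// overwrite record 5 (records are numbered from 0) in the file
accountType acct;
int n = 5;
ofstream outfile;
// ios::in keeps the existing records (the file must already exist);
// ios::out alone would truncate the file
outfile.open("accounts.dat", ios::in | ios::out | ios::binary);
// move the write position to the start of record n
outfile.seekp(n*sizeof(accountType), ios::beg);
// copy the raw bytes of acct into the file
outfile.write((char *)&acct, sizeof(accountType));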
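The two slides above can be combined into a round trip. The sketch below is an illustration under stated assumptions: the slides never show the fields of accountType, so the struct here is hypothetical, as are the helper names and the file name used when calling them.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
using namespace std;

// Hypothetical fixed-size record; the slides do not show accountType's fields.
struct accountType
{
    int id;
    double balance;
};

// Writes `count` zeroed records so that every record position exists in the file.
void createAccounts(const char* name, int count)
{
    ofstream out(name, ios::out | ios::binary);
    accountType blank = {0, 0.0};
    for (int i = 0; i < count; i++)
        out.write((const char *)&blank, sizeof(accountType));
}

// Overwrites record n (numbered from 0) in place.
void putAccount(const char* name, int n, const accountType& acct)
{
    assert(n >= 0);
    // ios::in | ios::out opens for update without truncating the file
    fstream f(name, ios::in | ios::out | ios::binary);
    f.seekp(n * sizeof(accountType), ios::beg);
    f.write((const char *)&acct, sizeof(accountType));
}

// Reads record n (numbered from 0).
accountType getAccount(const char* name, int n)
{
    assert(n >= 0);
    accountType acct = {0, 0.0};
    ifstream in(name, ios::in | ios::binary);
    in.seekg(n * sizeof(accountType), ios::beg);
    in.read((char *)&acct, sizeof(accountType));
    return acct;
}
```

Because every record has the same size, record n always starts at byte n*sizeof(accountType), which is what makes direct access possible. Raw struct bytes written this way are only portable between programs built with the same compiler and platform.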
6
Bit operations (a reminder)
Bitwise ops
  And ( & )  0101 & 0110 -> 0100
  Or  ( | )  0101 | 0110 -> 0111
  Xor ( ^ )  0101 ^ 0110 -> 0011
  Not ( ~ )  ~0101 -> 1010
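The four examples can be checked directly in C++. This is a small sketch; the function names are hypothetical, and each result is masked to four bits because ~ on an unsigned value flips every bit of the full word, not just the low four.

```cpp
#include <cassert>

// Keeps only the low four bits so results match the 4-bit examples.
unsigned low4(unsigned x) { return x & 0xFu; }

// Each helper expects 4-bit inputs (0x0 through 0xF).
unsigned and4(unsigned a, unsigned b) { assert(a <= 0xF && b <= 0xF); return low4(a & b); }
unsigned or4(unsigned a, unsigned b)  { assert(a <= 0xF && b <= 0xF); return low4(a | b); }
unsigned xor4(unsigned a, unsigned b) { assert(a <= 0xF && b <= 0xF); return low4(a ^ b); }
unsigned not4(unsigned a)             { assert(a <= 0xF); return low4(~a); }
```

Here 0101 is 0x5 and 0110 is 0x6, so and4(0x5, 0x6) gives 0x4 (0100) and not4(0x5) gives 0xA (1010).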
7
Implementing a bitVector Class
[Figure: Set bit i: bitMask(i) | member[vectorIndex(i)] puts a 1 in position i and leaves the other bits (x) unchanged. Clear bit i: ~bitMask(i) & member[vectorIndex(i)] puts a 0 in position i and leaves the other bits unchanged.]
bitMask() returns an unsigned character value containing a 1 in the bit position representing i.
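A minimal sketch of the set and clear operations follows. It is not the course's bitVector class: the class name miniBitVector and the byte() accessor are hypothetical, and the layout (eight bits per unsigned char, bit 0 in the left-most position) is an assumption taken from the summary slide.

```cpp
#include <cassert>
#include <vector>
using namespace std;

// Illustrative only; the real bitVector class has more operations.
class miniBitVector
{
public:
    explicit miniBitVector(int numBits)
        : size(numBits), member((numBits + 7) / 8, 0) {}

    // Set bit i: OR the mask into the byte that holds bit i.
    void set(int i)
    {
        assert(0 <= i && i < size);
        int k = vectorIndex(i);
        member[k] = static_cast<unsigned char>(member[k] | bitMask(i));
    }

    // Clear bit i: AND the byte with the inverted mask.
    void clear(int i)
    {
        assert(0 <= i && i < size);
        int k = vectorIndex(i);
        member[k] = static_cast<unsigned char>(member[k] & ~bitMask(i));
    }

    // Returns 1 if bit i is set, otherwise 0.
    int bit(int i) const
    {
        assert(0 <= i && i < size);
        return (member[vectorIndex(i)] & bitMask(i)) != 0 ? 1 : 0;
    }

    // Exposes a raw byte so the layout can be inspected.
    unsigned char byte(int k) const { return member[k]; }

private:
    int size;
    vector<unsigned char> member;

    // Index of the byte holding bit i.
    int vectorIndex(int i) const { return i / 8; }

    // A byte with a single 1 in the position for bit i (bit 0 is left-most).
    unsigned char bitMask(int i) const
    {
        return static_cast<unsigned char>(0x80 >> (i % 8));
    }
};
```

Setting bits 0 and 9 of a 16-bit vector leaves the two bytes as 0x80 and 0x40, which matches the left-most-first numbering.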
8
Lossless Compression
Data compression loses no information
Original data can be recovered exactly from the compressed data
Normally applied to "discrete" data, such as text, word processing files, computer applications, and so forth
[Figure: a document ("This paper ... Submitted by J. Q. Student") is compressed to a bit stream and then reconstructed exactly.]
9
Lossy Compression
Loses some information during compression and the data cannot be recovered exactly
Shrinks the data further than lossless compression techniques
Sound files often use this type of compression
[Figure: Compress and Reconstruct; the reconstructed bit stream only approximates the original.]
10
Huffman Compression
A lossless compression technique
Counts occurrences of eight-bit characters in data
Uses counts to construct variable-length codes
  shorter for more frequently occurring characters
The codes are prefix codes: no code is a prefix of another code
The encoding (compression) process creates an "optimal" binary tree representing these prefix codes
Uses a "greedy approach"
  Makes use of data on hand to choose best option
  Example: Dijkstra's algorithm is a greedy approach
Often achieves compression ratios of 1.8 or more (about a 45% reduction) on text
  not as good on binary data
11
Example Huffman Tree
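[Figure: Huffman tree; 0 marks the edge to a left child, 1 the edge to a right child]
57
  0: 21
    0: c:8
    1: 13
      0: d:6
      1: 7
        0: f:3
        1: b:4
  1: 36
    0: e:20
    1: a:16
Internal nodes contain the sum of their children's frequencies
Leaves contain original letters and their frequencies
Edge to a left child is a 0 bit and to a right child is a 1
Codes
  a 11
  b 0111
  c 00
  d 010
  e 10
  f 0110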
12
Building Huffman Code Trees
Read file and determine frequencies of each letter
Store nodes (letters and frequencies) in a minimum priority queue
  Probably implemented as a heap, with ordering based on frequencies
Loop until only one node is left in the queue
  Remove the two smallest-valued nodes from the queue
  Make them the two children of a new root node with value equaling their sum
  Add the new node to the queue
Result is a tree rooted at the last node remaining in the queue
Codes all have unique prefixes
  Derived for each letter (leaf node) by traversing links in the tree from root to leaf
  Left is a 0 bit in the code – right is a 1 bit in the code
  The length of each code is the depth of its leaf in the tree
So …
  Shortest codes for most frequently occurring data values
  Longest codes for least frequently occurring data values
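The loop described above can be sketched with std::priority_queue. This is an illustration, not the course's hCompress class: the hNode layout, the function names, and the tie-breaking rule (first node removed becomes the left child) are assumptions.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Assumed node layout: leaves have left == right == -1.
struct hNode
{
    char ch;
    int freq;
    int left, right;
};

// Builds the tree in `tree` and returns the index of the root.
int buildHuffmanTree(const map<char,int>& freq, vector<hNode>& tree)
{
    assert(!freq.empty());
    // (frequency, node index) pairs; greater<> turns the queue into a min-heap
    typedef pair<int,int> entry;
    priority_queue<entry, vector<entry>, greater<entry> > pq;

    // one leaf per distinct character
    for (map<char,int>::const_iterator it = freq.begin(); it != freq.end(); ++it)
    {
        hNode leaf = {it->first, it->second, -1, -1};
        tree.push_back(leaf);
        pq.push(entry(leaf.freq, (int)tree.size() - 1));
    }

    // join the two smallest nodes until one tree remains
    while (pq.size() > 1)
    {
        entry a = pq.top(); pq.pop();   // smallest
        entry b = pq.top(); pq.pop();   // second smallest
        hNode parent = {'\0', a.first + b.first, a.second, b.second};
        tree.push_back(parent);
        pq.push(entry(parent.freq, (int)tree.size() - 1));
    }
    return pq.top().second;
}

// Walks from `node` to every leaf, appending 0 for left and 1 for right.
void assignCodes(const vector<hNode>& tree, int node,
                 const string& prefix, map<char,string>& codes)
{
    if (tree[node].left == -1)
    {
        codes[tree[node].ch] = prefix;
        return;
    }
    assignCodes(tree, tree[node].left, prefix + "0", codes);
    assignCodes(tree, tree[node].right, prefix + "1", codes);
}
```

For the frequencies a:16, b:4, c:8, d:6, e:20, f:3 this produces the same code lengths as the example tree. Which of two joined nodes becomes the left child is a free choice, so the codes for a and e come out exchanged relative to the slide (10 and 11) without changing the compressed size.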
13
Huffman tree
The Huffman code tree is optimal in this sense
  All internal nodes have two children, so there are no unused unique prefixes
  So, the number of shorter codes is the maximum number possible given the frequencies in the data
The degree of compression (size of compressed data, in bits) is
\[ \mathrm{cost}(t) = \sum_{\text{all unique } ch \text{ in file}} f(ch)\, d(ch) \]
where f(ch) is the frequency of ch and d(ch) is the number of bits in its code
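As a worked instance of the cost (the sum over all characters of frequency times code length), take the frequencies and code lengths from the example tree: a:16 with 2 bits, b:4 with 4 bits, c:8 with 2 bits, d:6 with 3 bits, e:20 with 2 bits, and f:3 with 4 bits.

```latex
\begin{align*}
\mathrm{cost}(t) &= f(a)d(a) + f(b)d(b) + f(c)d(c) + f(d)d(d) + f(e)d(e) + f(f)d(f) \\
                 &= 16\cdot 2 + 4\cdot 4 + 8\cdot 2 + 6\cdot 3 + 20\cdot 2 + 3\cdot 4 \\
                 &= 32 + 16 + 16 + 18 + 40 + 12 = 134 \text{ bits}
\end{align*}
```

Stored as 8-bit characters, the same 57 characters need 57 * 8 = 456 bits, so the data shrinks by a factor of about 3.4 before counting the space needed to store the tree.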
14
Building a Huffman Tree
[Figure: the priority queue initially holds the six leaf nodes a:16, f:3, e:20, d:6, c:8, and b:4.]
15
Building a Huffman Tree (after first pass)
(f:3) and (b:4) were the lowest-frequency nodes, so they were joined to a parent (7), which was then added back to the queue
[Figure: the priority queue holds a:16, e:20, d:6, c:8, and the subtree 7 (children f:3 and b:4).]
16
Building a Huffman Tree (after second pass)
(d:6) and (7) were the lowest-frequency nodes, so they were joined to a parent (13), which was then added back to the queue
[Figure: the priority queue holds a:16, e:20, c:8, and the subtree 13 (children d:6 and 7).]
17
Building a Huffman Tree (after third pass)
(c:8) and (13) were the lowest-frequency nodes, so they were joined to a parent (21), which was then added back to the queue
[Figure: the priority queue holds a:16, e:20, and the subtree 21 (children c:8 and 13).]
18
Building a Huffman Tree (after fourth pass)
(a:16) and (e:20) were the lowest-frequency nodes, so they were joined to a parent (36), which was then added back to the queue
[Figure: the priority queue holds the subtree 21 (children c:8 and 13) and the subtree 36 (children a:16 and e:20).]
19
Building a Huffman Tree (after last pass)
(21) and (36) were the lowest-frequency nodes, so they were joined to a parent (57), which was then added back to the queue
[Figure: the priority queue holds the single tree rooted at 57 (children 21 and 36).]
20
The Huffman tree in memory
[Figure: the completed Huffman tree rooted at 57, shown beside its table of nodes.]
ID  ch   freq  pID  left  right  code
0   A    16    9    -1    -1     11
1   B    4     6    -1    -1     0111
2   C    8     8    -1    -1     00
3   D    6     7    -1    -1     010
4   E    20    9    -1    -1     10
5   F    3     6    -1    -1     0110
6   Int  7     7    5     1
7   Int  13    8    3     6
8   Int  21    10   2     7
9   Int  36    10   4     0
10  Int  57    0    8     9
Sample compression
  "face" = 0110 11 00 10
  # of bits: 4*8 = 32 vs. 4+2+2+2 = 10
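The sample compression can be reproduced by looking up each letter in the code table. In this sketch the function names are hypothetical, and the result is a string of '0' and '1' characters for readability; a real compressor would pack the bits into bytes.

```cpp
#include <cassert>
#include <map>
#include <string>
using namespace std;

// Code table copied from the slide.
map<char,string> slideCodes()
{
    map<char,string> codes;
    codes['a'] = "11";
    codes['b'] = "0111";
    codes['c'] = "00";
    codes['d'] = "010";
    codes['e'] = "10";
    codes['f'] = "0110";
    return codes;
}

// Concatenates the code of every character in `text`.
string encode(const string& text, const map<char,string>& codes)
{
    string bits;
    for (size_t i = 0; i < text.size(); i++)
    {
        map<char,string>::const_iterator it = codes.find(text[i]);
        assert(it != codes.end());   // every character must have a code
        bits += it->second;
    }
    return bits;
}
```

Calling encode("face", slideCodes()) yields the ten bits 0110110010, against 32 bits for four 8-bit characters.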
21
The Huffman tree in file
ID  ch   freq  pID  left  right  code
0   A    16    9    -1    -1     11
1   B    4     6    -1    -1     0111
2   C    8     8    -1    -1     00
3   D    6     7    -1    -1     010
4   E    20    9    -1    -1     10
5   F    3     6    -1    -1     0110
6   Int  7     7    5     1
7   Int  13    8    3     6
8   Int  21    10   2     7
9   Int  36    10   4     0
10  Int  57    0    8     9
Only the gray fields (ch, left, right) are written to store the tree in the compressed file
The tree can then be rebuilt from ch and the left and right child indices read from the file.
The last node is the root, and codes can be rediscovered as bits are read from the file and the tree is followed from root to leaf
22
Format of compressed file
There are four parts
  Size of tree
  The tree – vector of (ch, leftID, rightID) data
  Size of compressed data
  The compressed data
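One way to lay out the four parts is sketched below. The slides do not give field widths, so the types used here (short for the tree size and child indices, unsigned int for the bit count) and the names diskNode, packBits, and writeCompressed are assumptions made for illustration.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>
using namespace std;

// One stored tree node: only ch and the child indices go to the file.
struct diskNode
{
    unsigned char ch;
    short left, right;
};

// Packs a string of '0'/'1' characters into bytes, first bit left-most.
vector<unsigned char> packBits(const string& bits)
{
    vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (size_t i = 0; i < bits.size(); i++)
        if (bits[i] == '1')
            bytes[i / 8] = static_cast<unsigned char>(bytes[i / 8] | (0x80 >> (i % 8)));
    return bytes;
}

// Writes: size of tree, the tree, size of compressed data (in bits), the data.
void writeCompressed(ostream& out, const vector<diskNode>& tree, const string& bits)
{
    assert(!tree.empty());
    short treeSize = static_cast<short>(tree.size());
    out.write((const char *)&treeSize, sizeof(treeSize));

    // fields are written one at a time so struct padding never reaches the file
    for (size_t i = 0; i < tree.size(); i++)
    {
        out.write((const char *)&tree[i].ch, sizeof(tree[i].ch));
        out.write((const char *)&tree[i].left, sizeof(tree[i].left));
        out.write((const char *)&tree[i].right, sizeof(tree[i].right));
    }

    unsigned int bitCount = static_cast<unsigned int>(bits.size());
    out.write((const char *)&bitCount, sizeof(bitCount));

    vector<unsigned char> bytes = packBits(bits);
    if (!bytes.empty())
        out.write((const char *)&bytes[0], bytes.size());
}
```

Storing the bit count rather than the byte count lets the reader ignore the padding bits in the final byte. The function takes an ostream so it works with an ofstream opened with ios::binary or with an in-memory stream.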
23
Uncompressing tree in file
ID  ch   left  right  code
0   A    -1    -1     11
1   B    -1    -1     0111
2   C    -1    -1     00
3   D    -1    -1     010
4   E    -1    -1     10
5   F    -1    -1     0110
6   Int  5     1
7   Int  3     6
8   Int  2     7
9   Int  4     0
10  Int  8     9
(1) Read size of tree
(2) Read tree from file into vector or array
(3) Read size of compressed data
(4) Start at root (the last node read, node[10] here)
(5) For each bit (b) read
  (1) If (b==0) move to left child
  (2) If (b==1) move to right child
  (3) If now at a leaf, append leaf's letter to uncompressed data and return to root
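Steps (4) and (5) can be sketched as a loop over the bits. The names treeNode, slideTree, and decode are hypothetical; the node table is the one from the slide, the root is taken to be the last node in the vector, and the bits are held as '0' and '1' characters for readability.

```cpp
#include <cassert>
#include <string>
#include <vector>
using namespace std;

// Stored node: leaves have left == right == -1.
struct treeNode
{
    char ch;
    int left, right;
};

// The eleven nodes from the slide; node 10 is the root.
vector<treeNode> slideTree()
{
    static const treeNode nodes[] = {
        {'a', -1, -1}, {'b', -1, -1}, {'c', -1, -1},
        {'d', -1, -1}, {'e', -1, -1}, {'f', -1, -1},
        {'\0', 5, 1}, {'\0', 3, 6}, {'\0', 2, 7},
        {'\0', 4, 0}, {'\0', 8, 9}
    };
    return vector<treeNode>(nodes, nodes + 11);
}

// Follows one edge per bit; on reaching a leaf, emits its letter and restarts.
// Assumes the tree has at least one internal node.
string decode(const vector<treeNode>& tree, const string& bits)
{
    assert(tree.size() >= 3);
    const int root = (int)tree.size() - 1;   // last node is the root
    int node = root;
    string text;
    for (size_t i = 0; i < bits.size(); i++)
    {
        node = (bits[i] == '0') ? tree[node].left : tree[node].right;
        if (tree[node].left == -1)           // reached a leaf
        {
            text += tree[node].ch;
            node = root;
        }
    }
    return text;
}
```

Because no code is a prefix of another, the decoder never needs to look ahead: the first leaf it reaches is always the right one.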
24
Uncompressing "face"
Bit data: 0110110010
Bit  node  letter
     10
0    8
1    7
1    6
0    5     'f'
     10
1    9
1    0     'a'
     10
0    8
0    2     'c'
     10
1    9
0    4     'e'
     10
ID  ch   left  right  code
0   A    -1    -1     11
1   B    -1    -1     0111
2   C    -1    -1     00
3   D    -1    -1     010
4   E    -1    -1     10
5   F    -1    -1     0110
6   Int  5     1
7   Int  3     6
8   Int  2     7
9   Int  4     0
10  Int  8     9
25
Summary
Binary File
  A sequence of 8-bit characters without the requirement that a character be printable and with no concern for a newline sequence that terminates lines
  Often organized as a sequence of records: record 0, record 1, record 2, ..., record n-1.
  Used for both input and output; the C++ header <fstream> contains the operations to support these types of files.
  The open() function must use the attribute ios::binary
26
Summary
Binary File (Cont…)
  For direct access to a file record, use the function seekg() (or seekp() when writing), which moves the file pointer to a file record
    Accepts an argument that specifies motion from the beginning of the file (ios::beg), from the current position of the file pointer (ios::cur), or from the end of the file (ios::end)
  Use the read() function to input a sequence of bytes from the file into a block of memory and the write() function to output from a block of memory to a binary file
27
Summary
Bit Manipulation
  Operators | (OR), & (AND), ^ (XOR), ~ (NOT), << (shift left), and >> (shift right)
    Use to perform operations on specific bits within a character or integer value.
  The class bitVector uses operator overloading to treat a sequence of bits as an array, with bit 0 the left-most bit of the sequence
    bit(), set(), and clear() allow access to specific bits
    The class has I/O operations for binary files and the stream operator << that outputs a bit vector as an ASCII sequence of 0 and 1 values.
28
Summary
File Compression Algorithm
  Encodes a file as a sequence of characters that consume less disk space than the original file.
  Two types of compression algorithms:
  1) lossless compression
    Restores the original file.
    Approach: count the frequency of occurrence of each character in the file and assign a prefix bit code to each character
    File size: the sum of the products of each bit-code length and the frequency of occurrence of the corresponding character.
29
Summary
File Compression Algorithm (Cont…)
  2) lossy compression
    Loses some information during compression and the data cannot be recovered exactly
    Normally used with sound and video files
  The Huffman compression algorithm is a lossless process that builds optimal prefix codes by constructing a tree with the …
    most frequently occurring characters and shorter bit codes as leaves close to the root
    less frequently occurring characters and longer bit codes as leaves farther from the root.
30
Summary
File Compression Algorithm (Cont…)
  If the file contains n distinct characters, the loop concludes after n-1 iterations, having built the Huffman tree containing n-1 internal nodes.
  Implementation requires the use of a minimum priority queue (heap), bit operations, and binary files
  The use of the bitVector class simplifies the construction of the classes hCompress and hDecompress, which perform Huffman compression and decompression.
  Works better with text files; they tend to have fewer unique characters than binary files.