5. Hashing
5.1 General Idea
5.2 Hash Function
5.3 Separate Chaining
5.4 Open Addressing
5.5 Rehashing
5.6 Extendible Hashing
Malek Mouhoub, CS340 Fall 2004 1
Sequential access : O(n).
Binary search : O(log(n)).
Direct access : O(1).
5.1 General Idea
Goal : reduce the number of disk accesses when searching for a
particular record.
Solution : use information in the record (its key) to compute its location.
Hash function : h(K)
• Transforms a key K into an address.
• The resulting address is used as the basis for storing and
retrieving records.
5.2 Hash Function
Case 1 : h(K) = K
• A student record with Id = 001111111 is stored in
record number 1111111.
• To find the student record with Id = 001001111, we read
record number 1001111.
– If the Id stored in that record is 001001111, we have found the record;
– otherwise, the record does not exist.
– With only 25,000 students, it is impractical to reserve more than one
million records to store them (for example, 10,000,000 possible record
numbers would leave 99.75% of the disk space unused).
– We must define a better hashing function that maps the key value (Id)
to a smaller range.
Case 2 : h(K) = K Mod TableSize
• A student record with Id = 001234567 can be stored in record
number 4567, i.e. with the hashing function h1(Id) = Id Mod 10000.
• Is h1 a good hashing function ?
– It wastes less memory space, but collisions are possible.
∗ Collision : given a hashing function h and keys k1 and k2, if
h(k1) = h(k2) = r, then k1 and k2 have a collision at r under h.
∗ For example, student records with Id = 001114567,
001104567, 001014567, and so on all map to record number 4567.
∗ Collisions are almost inevitable in most applications. We must have a
collision resolution policy before hashing can be used.
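As a small illustration of Case 2, the sketch below applies h1(Id) = Id Mod 10000 to the Ids mentioned above. The function name `h1` follows the text; storing Ids as strings is an assumption made here so that the leading zeros survive.

```python
def h1(student_id: str) -> int:
    """Home address = numeric Id modulo 10000 (its last four digits)."""
    return int(student_id) % 10000

# Ids taken from the text; kept as strings to preserve leading zeros.
ids = ["001234567", "001114567", "001104567", "001014567"]
addresses = {sid: h1(sid) for sid in ids}
for sid, addr in addresses.items():
    print(sid, "->", addr)

# Two keys collide when they hash to the same address.
collide = h1("001114567") == h1("001104567")
print("collision:", collide)
```

All four Ids land on the same record number, which is why a collision resolution policy is needed.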
Collision resolution policy
• Blocking
• Separate Chaining
• Open addressing
Blocking
• Allow more than one logical record to be stored in a physical record (location).
• Example :
Each physical record in the relative file could store three student records (logical
records).
Physical record 1 : record-1-1 record-1-2 record-1-3
Physical record 3 : record-3-1 record-3-2
Physical record 5 : record-5-1
• A physical record should not be larger than a cluster.
• Can blocking solve the collision problem ?
Example :
(Figure : physical records holding very different numbers of logical records r.)
• Record number 1 contains 1 logical record.
• Record number 2 contains 900 logical records.
• Record number 3 contains 5 logical records.
Problems :
1. An entire physical record (e.g. record number 2) cannot be read in one disk access.
2. The distribution may not be uniform, so disk space can be wasted.
3. Updating a record stored in record number 2 may require a sequential search
through 900 records.
Ideal case :
The hashing function distributes the logical records evenly. That
is, it generates a uniform distribution.
⇒ No overflowing in a physical record,
no waste of disk space,
and no problem arising from collision.
The distribution depends on the hashing function and it is almost
impossible to obtain a uniform distribution.
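To make the effect of a non-uniform distribution concrete, the sketch below counts how many logical records land in each physical record and reports those that exceed a given capacity. The keys, the modulus 1000 and the capacity of 3 are made-up values used only for illustration.

```python
from collections import Counter

def loads(keys, num_physical_records):
    """Count how many logical records land in each physical record."""
    return Counter(key % num_physical_records for key in keys)

def overflowing(load, capacity):
    """Physical records holding more logical records than they can store."""
    return {addr: n for addr, n in load.items() if n > capacity}

keys = [1001, 2001, 3001, 4001, 1002, 1003]   # hypothetical keys
load = loads(keys, 1000)
print(dict(load))                      # records per home address
print(overflowing(load, capacity=3))   # addresses that overflow
```

Here four keys share home address 1, so a block of three logical records overflows while most other addresses stay empty.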
Methods for generating random distributions
Methods to generate a random distribution and reduce the size of the
relative file :
• Prime division.
• Radix transformation.
• Truncation.
• Extraction.
• Folding.
• Mid-square method.
• Combine different methods.
Prime division
• Pick a prime number p which is approximately the size of the
desired relative file.
• Divide the record key by p.
• Add 1 to the remainder and use it as the home address for the
given record.
h(key) = key Mod p + 1
Note : We assume that all record keys are integers. A non-numeric
key can be converted to a numeric key first.
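The prime division rule h(key) = key Mod p + 1 can be sketched as follows. The prime p = 997 and the sample key are assumptions chosen for illustration; in practice p would be picked close to the desired size of the relative file.

```python
def is_prime(n: int) -> bool:
    """Trial division; adequate for the small table sizes used here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def prime_division(key: int, p: int) -> int:
    """Home address in the range 1..p, as in h(key) = key Mod p + 1."""
    if not is_prime(p):
        raise ValueError("p must be prime")
    return key % p + 1

p = 997                              # hypothetical file size (a prime)
print(prime_division(1234567, p))    # remainder 281, home address 282
```

Adding 1 shifts the remainders 0..p-1 to record numbers 1..p, which suits files whose records are numbered from 1.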
Example :
1. Name is the record key. Key = Tom.
2. Use the position of each letter in the alphabet to convert it to a numeric key :
T = 20, o = 15, m = 13.
3. The numeric value of the record key becomes :
20 ∗ 100 + 15 ∗ 10 + 13 = 2163
or add the ASCII values of the letters.
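Both conversions from the example can be sketched as below. The function names are hypothetical, and the first one assumes the key contains only the letters a to z (upper or lower case); it generalises the weights 100, 10, 1 to names of any length.

```python
def alphabet_key(name: str) -> int:
    """Combine letter positions (a=1 ... z=26) with weights ..., 100, 10, 1."""
    key = 0
    for ch in name.lower():
        key = key * 10 + (ord(ch) - ord("a") + 1)
    return key

def ascii_sum_key(name: str) -> int:
    """Alternative: add the ASCII values of the characters."""
    return sum(ord(ch) for ch in name)

print(alphabet_key("Tom"))    # 20*100 + 15*10 + 13 = 2163
print(ascii_sum_key("Tom"))   # 84 + 111 + 109 = 304
```

The resulting integer can then be passed to any of the numeric hashing methods, such as prime division.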
Radix transformation
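Assume that the value of the record key is written in a base other than 10.
Example (base 7) :
Record key | Operation | Home address
1100 | $1 \cdot 7^3 + 1 \cdot 7^2$ | 392
1020 | $1 \cdot 7^3 + 2 \cdot 7^1$ | 357
or (base 3) :
Record key | Operation | Home address
1100 | $1 \cdot 3^3 + 1 \cdot 3^2$ | 36
1020 | $1 \cdot 3^3 + 2 \cdot 3^1$ | 33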
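Radix transformation amounts to reading the digits of the key in another base. A minimal sketch using Python's built-in `int(string, base)`; the function name `radix_transform` is hypothetical, and every digit of the key must be smaller than the chosen base.

```python
def radix_transform(key: str, base: int) -> int:
    """Interpret the decimal digits of `key` as a number written in `base`."""
    return int(key, base)

for key in ("1100", "1020"):
    print(key, "base 7 ->", radix_transform(key, 7),
          "| base 3 ->", radix_transform(key, 3))
```

For the key 1100 this gives 1·7³ + 1·7² = 392 in base 7 and 1·3³ + 1·3² = 36 in base 3.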
Truncation and extraction
Truncation : take the rightmost n digits.
Example : for a student record with Id = 001234567, we can
take the last four digits and store it in record number 4567.
Extraction : choose some digits from the record key.
Example : for a student record with Id = 000923456, we can
take the digits 9, 2, 4, 5 and 6 and store it in record number 92456.
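Truncation and extraction are both digit selections, which the sketch below expresses on the Id written as a string. The function names and the choice of positions (0-based, counted from the left) are assumptions for illustration.

```python
from typing import Sequence

def truncate(key: str, n: int) -> int:
    """Keep the rightmost n digits of the key."""
    return int(key[-n:])

def extract(key: str, positions: Sequence[int]) -> int:
    """Keep the digits found at the given 0-based positions, in order."""
    return int("".join(key[p] for p in positions))

print(truncate("001234567", 4))                # last four digits: 4567
print(extract("000923456", [3, 4, 6, 7, 8]))   # digits 9, 2, 4, 5, 6: 92456
```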
Folding and Mid-square method
Folding : split the key into parts and add them together.
Example : for a student record with Id = 000923456, we can
add 923 to 456 and store it in record number 1379.
Mid-square method : square the key value and take the middle r bits
as the home address.
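A possible sketch of both methods follows. Splitting the key into fixed-length parts and centring the r bits on the binary representation of the square are assumptions, since the text does not fix either detail.

```python
def fold(key: str, part_len: int) -> int:
    """Split the key into parts of part_len digits and add the parts."""
    parts = [key[i:i + part_len] for i in range(0, len(key), part_len)]
    return sum(int(part) for part in parts)

def mid_square(key: int, r: int) -> int:
    """Square the key and return the middle r bits of the square."""
    square = key * key
    shift = max(0, (square.bit_length() - r) // 2)   # bits dropped on the right
    return (square >> shift) & ((1 << r) - 1)

print(fold("000923456", 3))   # 0 + 923 + 456 = 1379
print(mid_square(12, 4))      # 12*12 = 144 = 0b10010000, middle bits 0100 = 4
```

The middle bits of the square depend on all the digits of the key, which is the reason the method tends to spread keys well.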
5.3 Separate Chaining
• Similar to blocking
• Keep a list of all elements that hash to the same value.
• Implementation : use linked lists.
– To perform a find :
∗ use the hash function to determine which list to traverse,
∗ and then perform a find in this list.
– To perform an insert :
∗ check the appropriate list to see whether the element is
already in place.
∗ If the element is new, insert it at the front of the list.
0 : 0
1 : 81 → 1
2 :
3 :
4 : 64 → 4
5 : 25
6 : 36 → 16
7 :
8 :
9 : 49 → 9
A separate chaining hash table, with hash function : hash(x) = x mod 10
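The find and insert steps described for separate chaining can be sketched as follows, with hash(x) = x mod 10. Python lists stand in for the linked lists here, which is a simplification, and the class name `ChainedHashTable` is hypothetical.

```python
class ChainedHashTable:
    def __init__(self, size=10):
        self.size = size
        self.lists = [[] for _ in range(size)]   # one chain per cell

    def _hash(self, x):
        return x % self.size

    def find(self, x):
        # The hash function selects the chain; then search that chain.
        return x in self.lists[self._hash(x)]

    def insert(self, x):
        chain = self.lists[self._hash(x)]
        if x not in chain:       # already in place: nothing to do
            chain.insert(0, x)   # new element goes at the front

table = ChainedHashTable(10)
for i in range(10):
    table.insert(i * i)          # 0, 1, 4, 9, ..., 81
for cell, chain in enumerate(table.lists):
    print(cell, chain)
```

Inserting the first ten perfect squares in increasing order produces chains such as 81 → 1 in cell 1, because each new element is placed at the front of its list.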
5.4 Open Addressing
Problems with Separate Chaining :
• Requires the implementation of a second data structure.
• Using linked lists affects the performance (in time) of the
algorithm because of the time required to allocate new cells.
General idea :
• If a collision occurs, alternative cells are tried until an empty cell is
found.
• Instead of using a single hash function h(x) to calculate the address of
the element, $h_0(x), h_1(x), h_2(x), \ldots$ are tried in succession, where :
– $h_i(x) = (\mathrm{hash}(x) + f(i)) \bmod \mathrm{TableSize}$
– f is the collision resolution strategy
∗ $f(i) = i$ : linear probing
∗ $f(i) = i^2$ : quadratic probing
∗ $f(i) = i \cdot \mathrm{hash}_2(x)$ : double hashing
– $f(0) = 0$
Linear Probing
• $h_i(x) = (\mathrm{hash}(x) + i) \bmod \mathrm{TableSize}$
• Tries cells sequentially (with wraparound) in search of an empty cell.
• Problem : Primary Clustering
– Any key that hashes into the cluster will require several
attempts to resolve the collision, and then it will add to the
cluster.
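Example : inserting keys {89, 18, 49, 58, 69}
Cell | Empty Table | After 89 | After 18 | After 49 | After 58 | After 69
0    |             |          |          | 49       | 49       | 49
1    |             |          |          |          | 58       | 58
2    |             |          |          |          |          | 69
3    |             |          |          |          |          |
4    |             |          |          |          |          |
5    |             |          |          |          |          |
6    |             |          |          |          |          |
7    |             |          |          |          |          |
8    |             |          | 18       | 18       | 18       | 18
9    |             | 89       | 89       | 89       | 89       | 89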
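The linear probing example can be replayed with the sketch below. It assumes hash(x) = x mod TableSize with TableSize = 10, which is consistent with the table above; the function name `insert_linear` is hypothetical.

```python
def insert_linear(table, key):
    """Try cells hash(key), hash(key)+1, ... (mod size) until one is empty."""
    size = len(table)
    for i in range(size):
        cell = (key % size + i) % size
        if table[cell] is None:
            table[cell] = key
            return cell
    raise RuntimeError("table is full")

linear_table = [None] * 10
for key in (89, 18, 49, 58, 69):
    cell = insert_linear(linear_table, key)
    print(f"{key} stored in cell {cell}")
```

Keys 49, 58 and 69 all end up in the run of cells 0 to 2 next to the occupied cells 8 and 9, which is the primary clustering effect.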
Quadratic Probing
• $h_i(x) = (\mathrm{hash}(x) + i^2) \bmod \mathrm{TableSize}$
• Eliminates the primary clustering problem of linear probing.
• No guarantee of finding an empty cell once the table gets more
than half full, or even before the table gets half full if the table
size is not prime.
• However, it is simpler and faster in practice than double hashing, which needs a second hash function.
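Example : inserting keys {89, 18, 49, 58, 69}
Cell | Empty Table | After 89 | After 18 | After 49 | After 58 | After 69
0    |             |          |          | 49       | 49       | 49
1    |             |          |          |          |          |
2    |             |          |          |          | 58       | 58
3    |             |          |          |          |          | 69
4    |             |          |          |          |          |
5    |             |          |          |          |          |
6    |             |          |          |          |          |
7    |             |          |          |          |          |
8    |             |          | 18       | 18       | 18       | 18
9    |             | 89       | 89       | 89       | 89       | 89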
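The quadratic probing example can be replayed in the same way, again assuming hash(x) = x mod TableSize with TableSize = 10. The helper `reachable_cells` is a hypothetical addition that lists the cells a probe sequence can ever visit, to illustrate why an empty cell may not be found.

```python
def insert_quadratic(table, key):
    """Try cells hash(key) + i*i (mod size) for i = 0, 1, 2, ..."""
    size = len(table)
    for i in range(size):
        cell = (key % size + i * i) % size
        if table[cell] is None:
            table[cell] = key
            return cell
    raise RuntimeError("no empty cell found")

def reachable_cells(home, size):
    """Distinct cells visited by the quadratic probe sequence from `home`."""
    return sorted({(home + i * i) % size for i in range(size)})

quad_table = [None] * 10
for key in (89, 18, 49, 58, 69):
    insert_quadratic(quad_table, key)
print(quad_table)
print(reachable_cells(0, 16))   # non-prime size: only 4 of 16 cells
print(reachable_cells(0, 11))   # prime size: 6 of 11 cells
```

With a table of size 16 the probe sequence only ever reaches 4 cells, so an insertion can fail long before the table is half full; with the prime size 11 it reaches 6 cells, just over half the table.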
Double Hashing
• $h_i(x) = (\mathrm{hash}(x) + i \cdot \mathrm{hash}_2(x)) \bmod \mathrm{TableSize}$
• $\mathrm{hash}_2(x) = R - (x \bmod R)$
– R is a prime smaller than TableSize
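Example : inserting keys {89, 18, 49, 58, 69}
Cell | Empty Table | After 89 | After 18 | After 49 | After 58 | After 69
0    |             |          |          |          |          | 69
1    |             |          |          |          |          |
2    |             |          |          |          |          |
3    |             |          |          |          | 58       | 58
4    |             |          |          |          |          |
5    |             |          |          |          |          |
6    |             |          |          | 49       | 49       | 49
7    |             |          |          |          |          |
8    |             |          | 18       | 18       | 18       | 18
9    |             | 89       | 89       | 89       | 89       | 89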
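A sketch of the double hashing example follows. The value R = 7 is an assumption (the text only requires a prime smaller than TableSize), chosen because it reproduces the table above; hash(x) = x mod 10 is assumed as well.

```python
R = 7   # assumed prime smaller than TableSize = 10

def hash2(x):
    """Second hash function; never evaluates to 0."""
    return R - (x % R)

def insert_double(table, key):
    """Try cells hash(key) + i * hash2(key) (mod size) for i = 0, 1, 2, ..."""
    size = len(table)
    step = hash2(key)
    for i in range(size):
        cell = (key % size + i * step) % size
        if table[cell] is None:
            table[cell] = key
            return cell
    raise RuntimeError("no empty cell found")

double_table = [None] * 10
for key in (89, 18, 49, 58, 69):
    cell = insert_double(double_table, key)
    print(f"{key} (step {hash2(key)}) stored in cell {cell}")
```

For instance, 49 collides with 89 in cell 9, and its step hash2(49) = 7 sends it to cell (9 + 7) mod 10 = 6.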
5.5 Rehashing
• When the table gets too full :
– Operations start taking too long.
– Insertions may fail for quadratic probing (if the table gets more than half full).
• Solution :
– Build another table that is about twice as big and scan down the original hash
table.
– Use a new hash function to compute the new hash value of each
element and insert it in the new table.
– Running time : O(N) where N is the number of elements to rehash.
– In general, rehash after N/2 insertions.
– Rehashing can be used in other data structures (remember the case of the
ADT queue . . . in the midterm).
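Rehashing can be sketched as below for a table that uses linear probing. Choosing the first prime at or above twice the old size and using hash(x) = x mod TableSize are assumptions; the text only asks for a table about twice as big and a new hash function.

```python
def next_prime(n):
    """Smallest prime >= n (trial division)."""
    def is_prime(m):
        return m >= 2 and all(m % d for d in range(2, int(m ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n

def insert(table, key):
    """Linear probing insert with hash(x) = x mod len(table)."""
    size = len(table)
    for i in range(size):
        cell = (key % size + i) % size
        if table[cell] is None:
            table[cell] = key
            return
    raise RuntimeError("table is full")

def rehash(old_table):
    """Scan the old table and re-insert every element into a bigger one."""
    new_table = [None] * next_prime(2 * len(old_table))
    for key in old_table:
        if key is not None:
            insert(new_table, key)   # O(N) work in total
    return new_table

old = [49, 58, 69, None, None, None, None, None, 18, 89]
new = rehash(old)
print(len(new), new)
```

Because the new table size differs, every key gets a new home address; here the five keys no longer collide at all.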
5.6 Extendible Hashing
• If the amount of data is too large to fit in main memory, the main
consideration is the number of disk accesses required to
retrieve data.
• Collisions could cause several blocks to be examined during a
find.
• Rehashing is extremely expensive since it requires O(N) disk
accesses.
• Extendible hashing allows a find to be performed in two disk
accesses, and an insertion in a few disk accesses.
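Original data
Directory entry → (number of leading bits shared by the leaf) : keys in the leaf
00 → (2) : 000100 001000 001010 001011
01 → (2) : 010100 011000
10 → (2) : 100000 101000 101100 101110
11 → (2) : 111000 111001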
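After insertion of 100100 and directory split
Directory entries → (number of leading bits shared by the leaf) : keys in the leaf
000, 001 → (2) : 000100 001000 001010 001011
010, 011 → (2) : 010100 011000
100 → (3) : 100000 100100
101 → (3) : 101000 101100 101110
110, 111 → (2) : 111000 111001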
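After insertion of 000000 and leaf split
Directory entries → (number of leading bits shared by the leaf) : keys in the leaf
000 → (3) : 000000 000100
001 → (3) : 001000 001010 001011
010, 011 → (2) : 010100 011000
100 → (3) : 100000 100100
101 → (3) : 101000 101100 101110
110, 111 → (2) : 111000 111001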
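The directory split and leaf split shown in this example can be reproduced with the in-memory sketch below. The leaf capacity of 4 keys, the 6-bit string keys and all class and method names are assumptions made for illustration; a real implementation would keep each leaf in a disk block, so that a find costs one access for the directory and one for the leaf.

```python
class Leaf:
    def __init__(self, depth):
        self.depth = depth   # number of leading bits shared by all its keys
        self.keys = []

class ExtendibleHash:
    def __init__(self, key_bits=6, leaf_capacity=4):
        self.key_bits = key_bits
        self.capacity = leaf_capacity
        self.depth = 0               # directory is indexed by `depth` leading bits
        self.directory = [Leaf(0)]   # always 2**depth entries

    def _index(self, key):
        return int(key[:self.depth], 2) if self.depth else 0

    def find(self, key):
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        leaf = self.directory[self._index(key)]
        if key in leaf.keys:
            return
        if len(leaf.keys) < self.capacity:
            leaf.keys.append(key)
            return
        self._split(leaf)
        self.insert(key)             # retry; may trigger another split

    def _split(self, leaf):
        if leaf.depth == self.key_bits:
            raise RuntimeError("leaf cannot be split any further")
        if leaf.depth == self.depth:
            # Directory split: entry i becomes entries 2i and 2i + 1.
            self.directory = [l for l in self.directory for _ in (0, 1)]
            self.depth += 1
        # Leaf split: distribute the keys on their next leading bit.
        zero, one = Leaf(leaf.depth + 1), Leaf(leaf.depth + 1)
        for k in leaf.keys:
            (one if k[leaf.depth] == "1" else zero).keys.append(k)
        for i, l in enumerate(self.directory):
            if l is leaf:
                bits = format(i, f"0{self.depth}b")
                self.directory[i] = one if bits[leaf.depth] == "1" else zero

original = ["000100", "001000", "001010", "001011", "010100", "011000",
            "100000", "101000", "101100", "101110", "111000", "111001"]
eh = ExtendibleHash()
for key in original:
    eh.insert(key)
depth_before = eh.depth
eh.insert("100100")   # full leaf with depth == directory depth: directory split
eh.insert("000000")   # full leaf with smaller depth: leaf split only
for i, leaf in enumerate(eh.directory):
    print(format(i, f"0{eh.depth}b"), f"({leaf.depth})", sorted(leaf.keys))
```

Directory entries that still share a leaf (for example 010 and 011) print the same keys, which corresponds to several directory entries pointing to one leaf in the figures.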