
5. Hashing


5.1 General Idea

5.2 Hash Function

5.3 Separate Chaining

5.4 Open Addressing

5.5 Rehashing

5.6 Extendible Hashing

Malek Mouhoub, CS340 Fall 2004


Sequential access : O(n).

Binary search : O(log(n)).

Direct access : O(1).



5.1 General Idea

Goal : reduce the number of disk accesses when searching for a particular record.

Solution : use information in the record.

Hash function : h(K)

• Transforms a key K into an address.

• The resulting address is used as the basis for storing and retrieving records.



5.2 Hash Function

Case 1 : h(K) = K

• For a student record with Id = 001111111, we can store it in record number 1111111.

• To find a student record with Id = 001001111, we read record number 1001111.

– If the Id stored in that record is 001001111, we have found the record;

– otherwise, the record does not exist.

– With only 25,000 students, it is impractical to use more than one million records to store them (with a range of 10,000,000 record numbers, 99.75 % of the disk space is wasted).

– We must define a better hashing function that maps the key value (Id) to a smaller range.



5.2 Hash Function

Case 2 : h(K) = K Mod TableSize

• For a student record with Id = 001234567, we can store it in record number 4567, i.e., the hashing function is h(Id) = Id Mod 10000.

• Is h a good hashing function ?

– It wastes less space, but collisions are possible.

∗ Collision : given a hashing function h and keys k1 and k2, if h(k1) = h(k2) = r, then k1 and k2 have a collision at r under h.

∗ We may have student records with Id = 001114567, 001104567, 001014567, and so on, which all hash to 4567.

∗ Collisions are almost inevitable in most applications. We must have a collision resolution policy before hashing can be used.
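The collision described above can be checked with a short sketch. Python is used purely for illustration; the names `h`, `TABLE_SIZE`, `ids`, and `addresses` are assumptions and do not come from the course material.

```python
TABLE_SIZE = 10000  # as in the slide: h(Id) = Id Mod 10000


def h(student_id: int) -> int:
    """Map a student Id to a record number in the range 0..TABLE_SIZE-1."""
    return student_id % TABLE_SIZE


# The Ids from the slide, written without their leading zeros.
ids = [1234567, 1114567, 1104567, 1014567]
addresses = [h(i) for i in ids]
print(addresses)  # all four Ids map to the same record number
```

All four Ids end in 4567, so they collide at record number 4567 under h.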



Collision resolution policy

• Blocking

• Separate Chaining

• Open addressing



Blocking

• Allow more than one logical record to be stored in a physical record (location).

• Example :

Each physical record in the relative file could store three student records (logical records).

record-1-1 record-1-2 record-1-3

record-3-1 record-3-2

record-5-1

• A physical record should not be larger than a cluster.

• Can blocking solve the collision problem ?



Blocking

Example :


• Record number 1 contains 1 logical record.

• Record number 2 contains 900 logical records.

• Record number 3 contains 5 logical records.

Problems :

1. Cannot access the entire physical record in one disk access.

2. The distribution may not be uniform. That is, disk space can be wasted.

3. Updating a record stored in record number 2 may require a sequential search through 900 records.



Blocking

Ideal case :

The hashing function distributes the logical records evenly. That

is, it generates a uniform distribution.

⇒ No overflowing in a physical record,

no waste of disk space,

and no problem arising from collision.

The distribution depends on the hashing function and it is almost

impossible to obtain a uniform distribution.



Methods for generating random distributions

Methods to generate a random distribution and reduce the size of the relative file :

• Prime division.

• Radix transformation.

• Truncation.

• Extraction.

• Folding.

• Mid-square method.

• Combine different methods.



Prime division

• Pick a prime number p which is approximately the size of the

desired relative file.

• Divide the record key by p.

• Add 1 to the remainder and use it as the home address for the

given record.

h(key) = key Mod p + 1

Note : We assume that all record keys are integers. A non-numeric key can be converted to a numeric key first.
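A minimal sketch of prime division follows. The function name `prime_division`, the prime 9973 (the largest prime below 10,000, chosen here for a hypothetical file of about 10,000 records), and the sample key are assumptions made for illustration.

```python
def prime_division(key: int, p: int) -> int:
    """Home address in the range 1..p: remainder of key divided by p, plus 1."""
    return key % p + 1


P = 9973  # hypothetical choice: a prime close to the desired file size
home = prime_division(1234567, P)
print(home)
```

Adding 1 shifts the remainder range 0..p-1 to record numbers 1..p, which suits a relative file whose records are numbered from 1.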



Prime division

Example :

1. Name is the record key. Key = Tom.

2. Use the alphabetical order to convert it to a numeric key.

T = 20, o = 15, m = 13

3. The numeric value of the record key becomes :

20 ∗ 100 + 15 ∗ 10 + 13 = 2163

or add the ASCII values of the letters.
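Both conversions in this example can be sketched as below. The function names are assumptions; the weighting by powers of 10 generalizes the three-letter computation shown above to names of any length, which is one possible reading of the slide rather than a stated rule.

```python
def letter_position(ch: str) -> int:
    """Alphabetical position of a letter: a or A -> 1, ..., z or Z -> 26."""
    return ord(ch.lower()) - ord("a") + 1


def name_to_key(name: str) -> int:
    """Weight the letter positions by decreasing powers of 10 (Horner form)."""
    key = 0
    for ch in name:
        key = key * 10 + letter_position(ch)
    return key


def ascii_sum(name: str) -> int:
    """Alternative conversion: add the ASCII values of the letters."""
    return sum(ord(ch) for ch in name)


print(name_to_key("Tom"))  # 20*100 + 15*10 + 13 = 2163
print(ascii_sum("Tom"))    # 84 + 111 + 109 = 304
```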



Radix transformation

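Assume that the digits of the record key are written in a base other than 10, and convert the key to its base-10 value.

Example (base 7) :

Record key    Operation              Home address
1100          1 ∗ 7^3 + 1 ∗ 7^2      392
1020          1 ∗ 7^3 + 2 ∗ 7^1      357

or (base 3) :

Record key    Operation              Home address
1100          1 ∗ 3^3 + 1 ∗ 3^2      36
1020          1 ∗ 3^3 + 2 ∗ 3^1      33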
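The radix transformation amounts to reading the key's digits in the chosen base. The sketch below uses Horner's rule; the function name `radix_transform` is an assumption.

```python
def radix_transform(key_digits: str, base: int) -> int:
    """Interpret the decimal digits of the key as digits in the given base."""
    value = 0
    for d in key_digits:
        digit = int(d)
        if digit >= base:
            raise ValueError(f"digit {digit} is not valid in base {base}")
        value = value * base + digit
    return value


print(radix_transform("1100", 7))  # 1*7^3 + 1*7^2 = 392
print(radix_transform("1020", 7))  # 1*7^3 + 2*7^1 = 357
print(radix_transform("1100", 3))  # 1*3^3 + 1*3^2 = 36
```

The resulting value can still exceed the file size, so in practice it would be combined with another method such as prime division.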



Truncation and extraction

Truncation : take the rightmost n digits.

Example : for a student record with Id = 001234567, we can take the last four digits and store it in record number 4567.

Extraction : choose some digits from the record key.

Example : for a student record with Id = 000923456, we can take the digits 9, 2, 4, 5 and 6 (skipping the 3) and store it in record number 92456.
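Truncation and extraction are easiest to see on the key written as a string of digits. In this sketch the function names and the choice of digit positions for extraction are assumptions; the positions were picked so that the result matches record number 92456 from the example.

```python
def truncate(key_digits: str, n: int) -> int:
    """Keep the rightmost n digits of the key."""
    return int(key_digits[-n:])


def extract(key_digits: str, positions: list[int]) -> int:
    """Keep only the digits at the given 0-based positions, in order."""
    return int("".join(key_digits[p] for p in positions))


print(truncate("001234567", 4))                # 4567
print(extract("000923456", [3, 4, 6, 7, 8]))   # digits 9, 2, 4, 5, 6 -> 92456
```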



Folding and Mid-square method

Folding : split the key into parts and add them together.

Example : for a student record with Id = 000923456, we can add 923 to 456 and store it in record number 1379.

Mid-square method : square the key value and take the middle r bits as the home address.
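A possible sketch of both methods follows. The function names, the part length of three digits, and the way the middle bits are located (centering an r-bit window in the binary representation of the square) are assumptions, since the slide does not fix these details.

```python
def fold(key_digits: str, part_len: int) -> int:
    """Split the key into parts of part_len digits and add the parts."""
    parts = [key_digits[i:i + part_len]
             for i in range(0, len(key_digits), part_len)]
    return sum(int(p) for p in parts)


def mid_square(key: int, r: int) -> int:
    """Square the key and return the middle r bits of the result."""
    square = key * key
    width = max(square.bit_length(), r)   # number of bits to center within
    shift = (width - r) // 2              # bits dropped on the right
    return (square >> shift) & ((1 << r) - 1)


print(fold("000923456", 3))   # 000 + 923 + 456 = 1379
print(mid_square(12, 4))      # 144 = 0b10010000, middle four bits 0100 -> 4
```

With r bits the mid-square address always falls in the range 0 to 2^r - 1, so r determines the size of the file.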


5.3 Separate Chaining


• Similar to blocking

• Keep a list of all elements that hash to the same value.

• Implementation : use linked lists.

– To perform a find :

∗ use the hash function to determine which list to traverse,

∗ and then perform a find in this list.

– To perform an insert :

∗ check the appropriate list to see whether the element is

already in place.

∗ If the element is new, insert it at the front of the list.




A separate chaining hash table

hash function : hash(x) = x mod 10

0 : 0
1 : 81 → 1
2 :
3 :
4 : 64 → 4
5 : 25
6 : 36 → 16
7 :
8 :
9 : 49 → 9
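The find and insert steps of separate chaining can be sketched with a small linked-list implementation. The class and method names are assumptions; the keys inserted are the ones that appear in the table above (the first ten perfect squares), added in increasing order.

```python
class Node:
    """One cell of a singly linked list."""
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class ChainedHashTable:
    def __init__(self, table_size=10):
        self.table_size = table_size
        self.heads = [None] * table_size   # one list head per table slot

    def _hash(self, x):
        return x % self.table_size

    def find(self, x):
        # Use the hash function to pick the list, then search that list.
        node = self.heads[self._hash(x)]
        while node is not None:
            if node.value == x:
                return True
            node = node.next
        return False

    def insert(self, x):
        # Insert at the front of the list, but only if x is new.
        if not self.find(x):
            slot = self._hash(x)
            self.heads[slot] = Node(x, self.heads[slot])

    def chain(self, slot):
        """Return the values stored in one slot, front of the list first."""
        values, node = [], self.heads[slot]
        while node is not None:
            values.append(node.value)
            node = node.next
        return values


table = ChainedHashTable(10)
for key in [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]:
    table.insert(key)
print(table.chain(1))  # [81, 1]
print(table.chain(6))  # [36, 16]
```

Because new elements go to the front of the list, the most recently inserted key in each slot is found first.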



5.4 Open Addressing

Problems with Separate Chaining :

• Requires the implementation of a second data structure.

• Linked lists slow the algorithm down because of the time required to allocate new cells.



5.4 Open Addressing

General idea :

• If a collision occurs, alternative cells are tried until an empty cell is

found.

• Instead of using a single hash function h(x) to calculate the address of the element, h_0(x), h_1(x), h_2(x), . . . are tried in succession, where :

– h_i(x) = (hash(x) + f(i)) Mod TableSize

– f is the collision resolution strategy

∗ f(i) = i : linear probing

∗ f(i) = i^2 : quadratic probing

∗ f(i) = i × hash2(x) : double hashing

– f(0) = 0
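The three strategies differ only in f, which the following sketch makes explicit by printing the first four cells tried for one key. TableSize = 10, the key 49, and the second hash function hash2(x) = 7 - (x mod 7) are assumptions chosen for illustration; all function names are hypothetical.

```python
TABLE_SIZE = 10


def hash1(x):
    return x % TABLE_SIZE


def hash2(x, r=7):
    # Assumed second hash function for double hashing; r is a prime < TABLE_SIZE.
    return r - (x % r)


def linear(i, x):
    return i


def quadratic(i, x):
    return i * i


def double(i, x):
    return i * hash2(x)


def probe(x, i, f):
    """Cell examined on the i-th attempt: (hash(x) + f(i)) Mod TableSize."""
    return (hash1(x) + f(i, x)) % TABLE_SIZE


linear_cells = [probe(49, i, linear) for i in range(4)]
quadratic_cells = [probe(49, i, quadratic) for i in range(4)]
double_cells = [probe(49, i, double) for i in range(4)]
print(linear_cells)     # [9, 0, 1, 2]
print(quadratic_cells)  # [9, 0, 3, 8]
print(double_cells)     # [9, 6, 3, 0]
```

Since f(0) = 0 in all three cases, the first cell tried is always hash(x) itself.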



Linear Probing

• h_i(x) = (hash(x) + i) Mod TableSize

• Cells are tried sequentially (wrapping around at the end of the table) in search of an empty cell.

• Problem : Primary Clustering

– Blocks of occupied cells (clusters) start forming. Any key that hashes into a cluster will require several attempts to resolve the collision, and then it will add to the cluster.



Example : inserting keys {89, 18, 49, 58, 69} with linear probing, hash(x) = x mod 10

    Empty Table  After 89  After 18  After 49  After 58  After 69
0                                    49        49        49
1                                              58        58
2                                                        69
3
4
5
6
7
8                          18        18        18        18
9                89        89        89        89        89
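The table above can be reproduced with a short insertion routine. This is a sketch under the assumption hash(x) = x mod TableSize with TableSize = 10; the function name `insert_linear` is hypothetical. It also returns the number of collisions met by each key, which shows primary clustering at work.

```python
def insert_linear(table, x):
    """Insert x with linear probing; return the number of collisions met."""
    size = len(table)
    for i in range(size):
        cell = (x % size + i) % size
        if table[cell] is None:
            table[cell] = x
            return i
    raise RuntimeError("table is full")


table = [None] * 10
collisions = [insert_linear(table, key) for key in [89, 18, 49, 58, 69]]
print(table)       # [49, 58, 69, None, None, None, None, None, 18, 89]
print(collisions)  # [0, 0, 1, 3, 3]
```

Keys 58 and 69 each need three extra attempts because they hash into the cluster that has formed around cells 8, 9, 0 and 1, and each of them then extends that cluster.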



Quadratic Probing

• h_i(x) = (hash(x) + i^2) Mod TableSize

• Eliminates the primary clustering problem of linear probing.

• No guarantee of finding an empty cell once the table gets more than half full, or even before the table gets half full if the table size is not prime.

• However, it is simpler and faster in practice than double hashing, because no second hash function is needed.
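The warning about non-prime table sizes can be checked by listing every cell that quadratic probing is able to reach from one home address. The table sizes 16 and 17 and the function name are assumptions chosen for illustration.

```python
def reachable_cells(home, table_size):
    """All distinct cells (home + i^2) Mod table_size that probing can visit."""
    # i^2 Mod table_size repeats with period table_size, so this range suffices.
    return sorted({(home + i * i) % table_size for i in range(table_size)})


print(reachable_cells(0, 16))  # [0, 1, 4, 9]
print(len(reachable_cells(0, 17)))
```

With a table of size 16, only 4 of the 16 cells are ever tried, so an insertion can fail while the table is only a quarter full. With the prime size 17, 9 distinct cells are reachable, which is more than half of the table.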


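Example : inserting keys {89, 18, 49, 58, 69} with quadratic probing, hash(x) = x mod 10

    Empty Table  After 89  After 18  After 49  After 58  After 69
0                                    49        49        49
1
2                                              58        58
3                                                        69
4
5
6
7
8                          18        18        18        18
9                89        89        89        89        89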



Double Hashing

• h_i(x) = (hash(x) + i × hash2(x)) Mod TableSize

• hash2(x) = R − (x mod R)

– R is a prime smaller than TableSize


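Example : inserting keys {89, 18, 49, 58, 69} with double hashing, hash(x) = x mod 10 and R = 7

    Empty Table  After 89  After 18  After 49  After 58  After 69
0                                                        69
1
2
3                                              58        58
4
5
6                                    49        49        49
7
8                          18        18        18        18
9                89        89        89        89        89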
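The double hashing table above is consistent with hash(x) = x mod 10 and R = 7; both values are assumptions read off the table rather than stated in the notes. A sketch of the insertion, with the hypothetical function name `insert_double`:

```python
def insert_double(table, x, r=7):
    """Insert x with double hashing; return the cell where it was stored."""
    size = len(table)
    step = r - (x % r)            # hash2(x) = R - (x mod R), never zero
    for i in range(size):
        cell = (x % size + i * step) % size
        if table[cell] is None:
            table[cell] = x
            return cell
    raise RuntimeError("no empty cell found on the probe sequence")


table = [None] * 10
cells = [insert_double(table, key) for key in [89, 18, 49, 58, 69]]
print(cells)  # [9, 8, 6, 3, 0]
print(table)  # [69, None, None, 58, None, None, 49, None, 18, 89]
```

Because hash2(x) lies between 1 and R, the step is never zero, so successive probes always move to a different cell. Keys that collide at the same home address (49 and 69 both hash to 9) follow different probe sequences, since their steps differ (7 and 1).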



5.5 Rehashing

• When the table gets too full :

– operations start taking too long.

– Insertions may fail for quadratic probing (if the table gets more than half full).

• Solution :

– Build another table that is about twice as big and scan down the original hash table.

– A new hash function is used to compute the new hash value of each element, which is then inserted in the new table.

– Running time : O(N), where N is the number of elements to rehash.

– In general, rehash after N/2 insertions.

– Rehashing can be used in other data structures (remember the case of the

ADT queue . . . in the midterm).
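A minimal sketch of rehashing is given below. It assumes linear probing, hash(x) = x mod TableSize, and a new size equal to the first prime at least twice the old size; the notes only say "about twice" as big, so the prime choice is an assumption, as are the function names and the sample keys.

```python
def is_prime(n):
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n):
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


def insert(table, x):
    """Linear probing insert; the hash function depends on the table size."""
    size = len(table)
    for i in range(size):
        cell = (x % size + i) % size
        if table[cell] is None:
            table[cell] = x
            return
    raise RuntimeError("table is full")


def rehash(old_table):
    """Scan the old table and reinsert every element into a bigger table."""
    new_table = [None] * next_prime(2 * len(old_table))
    for x in old_table:
        if x is not None:
            insert(new_table, x)
    return new_table


old = [None] * 7
for key in [13, 15, 24, 6, 23]:
    insert(old, key)
new = rehash(old)
print(old)       # [6, 15, 23, 24, None, None, 13]
print(len(new))  # 17
```

Every element is hashed again because the hash function x mod TableSize changes with the table size, which is why the cost is O(N).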



5.6 Extendible Hashing

• If the amount of data is too large to fit in main memory, the main consideration is the number of disk accesses required to retrieve data.

• Collisions could cause several blocks to be examined during a find.

• Rehashing is extremely expensive since it requires O(N) disk accesses.

• Extendible hashing allows a find to be performed in two disk accesses and an insertion in a few disk accesses.



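Original data

Directory : 00 01 10 11

Leaves (directory entries pointing to the leaf, then in parentheses the number of leading bits shared by its keys) :

00 (2) : 000100 001000 001010 001011
01 (2) : 010100 011000
10 (2) : 100000 101000 101100 101110
11 (2) : 111000 111001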



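After insertion of 100100 and directory split

Directory : 000 001 010 011 100 101 110 111

Leaves (directory entries pointing to the leaf, then in parentheses the number of leading bits shared by its keys) :

000, 001 (2) : 000100 001000 001010 001011
010, 011 (2) : 010100 011000
100 (3) : 100000 100100
101 (3) : 101000 101100 101110
110, 111 (2) : 111000 111001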



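After insertion of 000000 and leaf split

Directory : 000 001 010 011 100 101 110 111

Leaves (directory entries pointing to the leaf, then in parentheses the number of leading bits shared by its keys) :

000 (3) : 000000 000100
001 (3) : 001000 001010 001011
010, 011 (2) : 010100 011000
100 (3) : 100000 100100
101 (3) : 101000 101100 101110
110, 111 (2) : 111000 111001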
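The two insertions above can be replayed with a small in-memory sketch of extendible hashing. It assumes 6-bit keys, leaves holding at most four keys, and a directory indexed by the leading bits of the key, which is what the example suggests; the class and method names are hypothetical, and disk blocks are simulated by Python objects.

```python
class Leaf:
    def __init__(self, depth):
        self.depth = depth   # number of leading bits shared by all keys here
        self.keys = []


class ExtendibleHash:
    def __init__(self, key_bits=6, leaf_capacity=4, global_depth=2):
        self.key_bits = key_bits
        self.capacity = leaf_capacity
        self.depth = global_depth   # number of bits used by the directory
        self.directory = [Leaf(global_depth) for _ in range(2 ** global_depth)]

    def _index(self, key):
        # Directory entry = the leading self.depth bits of the key.
        return key >> (self.key_bits - self.depth)

    def find(self, key):
        # One access for the directory entry, one for the leaf.
        return key in self.directory[self._index(key)].keys

    def insert(self, key):
        while True:
            leaf = self.directory[self._index(key)]
            if key in leaf.keys:
                return
            if len(leaf.keys) < self.capacity:
                leaf.keys.append(key)
                return
            self._split(leaf)   # leaf is full: split it and try again

    def _split(self, leaf):
        if leaf.depth == self.depth:
            # Directory split: every entry is duplicated, one more bit is used.
            self.directory = [l for l in self.directory for _ in (0, 1)]
            self.depth += 1
        new_depth = leaf.depth + 1
        zero, one = Leaf(new_depth), Leaf(new_depth)
        for k in leaf.keys:
            bit = (k >> (self.key_bits - new_depth)) & 1
            (one if bit else zero).keys.append(k)
        for i, l in enumerate(self.directory):
            if l is leaf:
                bit = (i >> (self.depth - new_depth)) & 1
                self.directory[i] = one if bit else zero


original = ["000100", "001000", "001010", "001011", "010100", "011000",
            "100000", "101000", "101100", "101110", "111000", "111001"]
table = ExtendibleHash()
for bits in original:
    table.insert(int(bits, 2))
table.insert(int("100100", 2))   # full leaf 10: directory split
table.insert(int("000000", 2))   # full leaf 00: leaf split only
for i, leaf in enumerate(table.directory):
    print(format(i, "03b"), leaf.depth, [format(k, "06b") for k in sorted(leaf.keys)])
```

The first insertion doubles the directory because the full leaf already uses as many bits as the directory does. The second only splits a leaf, because that leaf uses two bits while the directory already uses three.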

