5. Hashing

5.1 General Idea

5.2 Hash Function

5.3 Separate Chaining

5.4 Open Addressing

5.5 Rehashing

5.6 Extendible Hashing

Malek Mouhoub, CS340 Fall 2004 1

Cost of finding a record among n records :

• Sequential access : O(n).

• Binary search : O(log n).

• Direct access : O(1).

5.1 General Idea

Goal : reduce the number of disk accesses needed to find a particular record.

Solution : use information in the record (its key) to compute where the record is stored.

Hash function : h(K)

• Transforms a key K into an address.

• The resulting address is used as the basis for storing and retrieving records.

5.2 Hash Function

Case 1 : h(K) = K

• A student record with Id = 001111111 can be stored in record number 1111111.

• To find the student record with Id = 001001111, we read record number 1001111.

– If the Id stored in that record is 001001111, we have found the record,

– otherwise, the record does not exist.

• With only 25,000 students, it is impractical to reserve more than one million records for them : a file with 10,000,000 record numbers (seven digits) would waste 99.75 % of the disk space.

• We must define a better hashing function that maps the key value (Id) to a smaller range.

Case 2 : h(K) = K Mod TableSize

• A student record with Id = 001234567 can be stored in record number 4567, i.e. with the hashing function h1(Id) = Id Mod 10000.

• Is h1 a good hashing function ?

– It wastes less space, but collisions are possible.

∗ Collision : given a hashing function h and keys k1 and k2, if h(k1) = h(k2) = r, then k1 and k2 have a collision at r under h.

∗ For example, student records with Id = 001114567, 001104567, 001014567, and so on all map to record number 4567.

∗ Collisions are almost inevitable in most applications. We must have a collision resolution policy before hashing can be used.
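The collision described above can be checked with a few lines of Python. This is only an illustrative sketch: the names `h1` and `find_collisions` and the sample Id list are assumptions made for the example, and Ids are written without their leading zeros because Python integer literals cannot start with 0.

```python
from collections import defaultdict

TABLE_SIZE = 10_000  # h1(Id) = Id Mod 10000, as on the slide


def h1(student_id: int) -> int:
    """Map a student Id to a record number in 0..9999."""
    return student_id % TABLE_SIZE


def find_collisions(ids):
    """Group Ids by home address and keep only addresses hit more than once."""
    by_address = defaultdict(list)
    for student_id in ids:
        by_address[h1(student_id)].append(student_id)
    return {addr: keys for addr, keys in by_address.items() if len(keys) > 1}


# Sample Ids (leading zeros dropped); the last one does not collide.
sample_ids = [1234567, 1114567, 1104567, 1014567, 1001111]
collisions = find_collisions(sample_ids)
print(collisions)
```

The first four Ids share the last four digits, so they all land on record number 4567, while 1001111 maps to 1111 on its own.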

Collision resolution policy

• Blocking

• Separate Chaining

• Open addressing

Blocking

• Allow more than one logical record to be stored in a physical record (location).

• Example :

Each physical record in the relative file can store three student records (logical records) :

Physical record 1 : record-1-1 record-1-2 record-1-3

Physical record 3 : record-3-1 record-3-2

Physical record 5 : record-5-1

• A physical record should not be larger than a cluster.

• Can blocking solve the collision problem ?

Example :

• Record number 1 contains 1 logical record.

• Record number 2 contains 900 logical records.

• Record number 3 contains 5 logical records.

Problems :

1. A physical record that holds this many logical records cannot be read in one disk access.

2. The distribution may not be uniform, so disk space can be wasted.

3. Updating a record stored in record number 2 may require a sequential search through 900 records.

Ideal case :

The hashing function distributes the logical records evenly. That is, it generates a uniform distribution.

⇒ No overflow in any physical record, no wasted disk space, and no problems arising from collisions.

In practice, the distribution depends on the hashing function, and it is almost impossible to obtain a uniform distribution.

Methods for generating random distributions

The following methods aim to generate a random distribution and to reduce the size of the relative file :

• Prime division.

• Radix transformation.

• Truncation.

• Extraction.

• Folding.

• Mid-square method.

• Combine different methods.

Prime division

• Pick a prime number p that is approximately the size of the desired relative file.

• Divide the record key by p.

• Add 1 to the remainder and use the result as the home address of the given record :

h(key) = key Mod p + 1

Note : We assume that all record keys are integers. A non-numeric key can be converted to a numeric key first.
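The formula h(key) = key Mod p + 1 translates directly into code. In this minimal sketch the function name and the choice p = 9973 (a prime close to an assumed file size of about 10,000 records) are illustrative assumptions, not values from the slides.

```python
def prime_division_hash(key: int, p: int) -> int:
    """Home address in 1..p, computed as key Mod p + 1."""
    return key % p + 1


# Assumed example: a relative file of roughly 10,000 records, p = 9973 (prime).
P = 9973
address = prime_division_hash(1234567, P)
print(address)  # 1234567 = 123 * 9973 + 7888, so the address is 7889
```

Adding 1 shifts the remainders 0..p-1 to the record numbers 1..p, which suits files whose records are numbered from 1.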

Example :

1. Name is the record key. Key = Tom.

2. Use the alphabetical order to convert it to a numeric key :

T = 20, o = 15, m = 13

3. The numeric value of the record key becomes :

20 ∗ 100 + 15 ∗ 10 + 13 = 2163

Alternatively, add the ASCII values of the letters.
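Both conversions from the example can be sketched as follows. The function names are hypothetical, and the code assumes keys made of plain ASCII letters only.

```python
def letter_positions(name: str) -> list:
    """Alphabetical position of each letter: a/A = 1, ..., z/Z = 26."""
    return [ord(ch) - ord("a") + 1 for ch in name.lower()]


def positional_key(name: str) -> int:
    """Weight the letter positions by powers of 10, e.g. 20*100 + 15*10 + 13."""
    value = 0
    for position in letter_positions(name):
        value = value * 10 + position
    return value


def ascii_sum_key(name: str) -> int:
    """Alternative conversion: add the ASCII values of the letters."""
    return sum(ord(ch) for ch in name)


print(letter_positions("Tom"))  # [20, 15, 13]
print(positional_key("Tom"))    # 2163
print(ascii_sum_key("Tom"))     # 84 + 111 + 109 = 304
```

Either numeric key can then be fed to a hashing function such as prime division.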

Radix transformation

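Interpret the digits of the record key as a number written in a base other than 10, and use its value as the home address.

Example (base 7) :

Record key    Operation              Home address
1100          1 ∗ 7^3 + 1 ∗ 7^2      392
1020          1 ∗ 7^3 + 2 ∗ 7^1      357

or (base 3) :

Record key    Operation              Home address
1100          1 ∗ 3^3 + 1 ∗ 3^2      36
1020          1 ∗ 3^3 + 2 ∗ 3^1      33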
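A radix transformation can be sketched in a few lines. The function name is an assumption, and the key is passed as a digit string so that it can be reinterpreted in the chosen base; every digit has to be smaller than the base.

```python
def radix_transform(key_digits: str, base: int) -> int:
    """Read the digits of the key as a number written in the given base."""
    value = 0
    for ch in key_digits:
        digit = int(ch)
        if digit >= base:
            raise ValueError(f"digit {digit} is not valid in base {base}")
        value = value * base + digit  # Horner's rule
    return value


print(radix_transform("1100", 7))  # 1*7^3 + 1*7^2 = 392
print(radix_transform("1020", 7))  # 1*7^3 + 2*7^1 = 357
print(radix_transform("1100", 3))  # 1*3^3 + 1*3^2 = 36
print(radix_transform("1020", 3))  # 1*3^3 + 2*3^1 = 33
```

For valid digit strings the result agrees with Python's built-in `int(key_digits, base)`.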

Truncation and extraction

Truncation : take the rightmost n digits.

Example : for a student record with Id = 001234567, we can take the last four digits and store it in record number 4567.

Extraction : choose some digits from the record key.

Example : for a student record with Id = 000923456, we can take the digits 9, 2, 4, 5 and 6 (skipping the 3) and store it in record number 92456.

Folding and Mid-square method

Folding : split the key into parts and add them together.

Example : for a student record with Id = 000923456, we can add 923 to 456 and store it in record number 1379.

Mid-square method : square the key value and take the middle r bits as the home address.
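Truncation, extraction, folding and the mid-square method are all short computations on the digits (or bits) of the key. The sketch below uses hypothetical function names; the digit positions passed to `extract` and the bit widths passed to `mid_square` are assumptions chosen for the example.

```python
def truncate(key: int, n: int) -> int:
    """Truncation: keep the rightmost n decimal digits."""
    return key % 10 ** n


def extract(key_digits: str, positions) -> int:
    """Extraction: keep the digits at the given 0-based string positions."""
    return int("".join(key_digits[p] for p in positions))


def fold(key: int, part_digits: int) -> int:
    """Folding: cut the key into parts of part_digits digits and add them."""
    base = 10 ** part_digits
    total = 0
    while key > 0:
        total += key % base
        key //= base
    return total


def mid_square(key: int, key_bits: int, r: int) -> int:
    """Mid-square: middle r bits of key squared (the square has 2*key_bits bits)."""
    square = key * key
    shift = (2 * key_bits - r) // 2
    return (square >> shift) & ((1 << r) - 1)


print(truncate(1234567, 4))                    # 4567
print(extract("000923456", [3, 4, 6, 7, 8]))   # 92456
print(fold(923456, 3))                         # 923 + 456 = 1379
print(mid_square(13, 4, 4))                    # 13^2 = 10101001b, middle 4 bits 1010b = 10
```

As the last slide of the list suggests, these methods can be combined, for example by folding a key and then applying prime division to the sum.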

5.3 Separate Chaining

• Similar to blocking.

• Keep a list of all elements that hash to the same value.

• Implementation : use linked lists.

– To perform a find :

∗ use the hash function to determine which list to traverse,

∗ and then perform a find in this list.

– To perform an insert :

∗ check the appropriate list to see whether the element is already present.

∗ If the element is new, insert it at the front of the list.
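The find and insert steps above can be sketched as a small class. This is not the course's implementation: the class and method names are assumptions, Python lists stand in for linked lists, and the keys used in the demonstration are the first ten perfect squares with hash(x) = x mod 10.

```python
class ChainedHashTable:
    """Separate chaining: one list of keys per table cell."""

    def __init__(self, table_size: int = 10):
        self.table_size = table_size
        self.lists = [[] for _ in range(table_size)]

    def _hash(self, x: int) -> int:
        return x % self.table_size

    def find(self, x: int) -> bool:
        # The hash function selects the list, then the list is searched.
        return x in self.lists[self._hash(x)]

    def insert(self, x: int) -> None:
        chain = self.lists[self._hash(x)]
        if x not in chain:       # skip elements that are already present
            chain.insert(0, x)   # new elements go to the front of the list


table = ChainedHashTable(10)
for i in range(10):
    table.insert(i * i)          # 0, 1, 4, 9, 16, 25, 36, 49, 64, 81

for cell, chain in enumerate(table.lists):
    print(cell, chain)
```

Because new elements are placed at the front, cell 1 ends up holding 81 followed by 1, and cells 2, 3, 7 and 8 stay empty.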

A separate chaining hash table, hash function : hash(x) = x mod 10

0 : 0
1 : 81 → 1
2 : (empty)
3 : (empty)
4 : 64 → 4
5 : 25
6 : 36 → 16
7 : (empty)
8 : (empty)
9 : 49 → 9

5.4 Open Addressing

Problems with Separate Chaining :

• Requires the implementation of a second data structure.

• Linked lists slow the algorithm down because of the time required to allocate new cells.

General idea :

• If a collision occurs, alternative cells are tried until an empty cell is found.

• Instead of using a single hash function h(x) to calculate the address of the element, the cells h_0(x), h_1(x), h_2(x), . . . are tried in succession, where :

– h_i(x) = (hash(x) + f(i)) Mod TableSize

– f is the collision resolution strategy

∗ f(i) = i : linear probing

∗ f(i) = i^2 : quadratic probing

∗ f(i) = i × hash2(x) : double hashing

– f(0) = 0
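The three strategies differ only in f, which a short sketch makes visible by listing the first cells each one tries. The values TableSize = 10, hash(x) = x mod 10 and R = 7 for hash2 are assumptions picked for illustration, as are the function names.

```python
TABLE_SIZE = 10  # assumed table size
R = 7            # assumed prime smaller than TABLE_SIZE, used by hash2


def hash1(x: int) -> int:
    return x % TABLE_SIZE


def hash2(x: int) -> int:
    return R - (x % R)


def probe_cells(x: int, f, count: int = 5) -> list:
    """First `count` cells h_i(x) = (hash(x) + f(i)) Mod TableSize."""
    return [(hash1(x) + f(i)) % TABLE_SIZE for i in range(count)]


def linear(i: int) -> int:
    return i


def quadratic(i: int) -> int:
    return i * i


def double_hashing_for(x: int):
    """f(i) = i * hash2(x); the step size depends on the key itself."""
    step = hash2(x)
    return lambda i: i * step


print(probe_cells(49, linear))                  # [9, 0, 1, 2, 3]
print(probe_cells(49, quadratic))               # [9, 0, 3, 8, 5]
print(probe_cells(49, double_hashing_for(49)))  # [9, 6, 3, 0, 7]
```

All three sequences start at the home cell 9 because f(0) = 0; they only diverge once a collision forces a second probe.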

Linear Probing

• h_i(x) = (hash(x) + i) Mod TableSize

• Cells are tried sequentially in search of an empty cell.

• Problem : Primary Clustering

– Occupied cells form clusters. Any key that hashes into a cluster will require several attempts to resolve the collision, and then it will add to the cluster.

Example : inserting keys {89, 18, 49, 58, 69}

Cell   Empty Table   After 89   After 18   After 49   After 58   After 69
0      -             -          -          49         49         49
1      -             -          -          -          58         58
2      -             -          -          -          -          69
3      -             -          -          -          -          -
4      -             -          -          -          -          -
5      -             -          -          -          -          -
6      -             -          -          -          -          -
7      -             -          -          -          -          -
8      -             -          18         18         18         18
9      -             89         89         89         89         89
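The table can be reproduced with a minimal linear probing insert. The function name is an assumption; the table size of 10 and hash(x) = x mod 10 are inferred from the ten cells of the example. The function also returns the number of cells tried, which shows primary clustering at work.

```python
def insert_linear(table: list, x: int) -> int:
    """Insert x with linear probing; return how many cells were tried."""
    size = len(table)
    home = x % size
    for i in range(size):
        cell = (home + i) % size      # h_i(x) = (hash(x) + i) Mod TableSize
        if table[cell] is None:
            table[cell] = x
            return i + 1
    raise RuntimeError("table is full")


table = [None] * 10
attempts = {key: insert_linear(table, key) for key in [89, 18, 49, 58, 69]}

print(table)     # [49, 58, 69, None, None, None, None, None, 18, 89]
print(attempts)  # {89: 1, 18: 1, 49: 2, 58: 4, 69: 4}
```

Keys 58 and 69 each need four probes because they hash into the cluster formed around cells 8, 9 and 0, and each insertion makes that cluster longer.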

Quadratic Probing

• h_i(x) = (hash(x) + i^2) Mod TableSize

• Eliminates the primary clustering problem of linear probing.

• There is no guarantee of finding an empty cell once the table gets more than half full, or even before the table gets half full if the table size is not prime.

• However, it is simpler and faster in practice than double hashing, since no second hash function is needed.

Example : inserting keys {89, 18, 49, 58, 69}

Cell   Empty Table   After 89   After 18   After 49   After 58   After 69
0      -             -          -          49         49         49
1      -             -          -          -          -          -
2      -             -          -          -          58         58
3      -             -          -          -          -          69
4      -             -          -          -          -          -
5      -             -          -          -          -          -
6      -             -          -          -          -          -
7      -             -          -          -          -          -
8      -             -          18         18         18         18
9      -             89         89         89         89         89
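A quadratic probing insert reproduces this table and can also illustrate the warning about tables that are more than half full. The sketch below is illustrative only: the function names are assumptions, and the second scenario (a table of size 7 holding the keys 0, 1, 2 and 4) is made up for the demonstration.

```python
def insert_quadratic(table: list, x: int):
    """Insert x with quadratic probing; return its cell, or None on failure."""
    size = len(table)
    home = x % size
    # i^2 Mod size repeats with period size, so size probes reach
    # every cell this key can ever reach.
    for i in range(size):
        cell = (home + i * i) % size  # h_i(x) = (hash(x) + i^2) Mod TableSize
        if table[cell] is None:
            table[cell] = x
            return cell
    return None


def build(keys, size: int) -> list:
    table = [None] * size
    for key in keys:
        insert_quadratic(table, key)
    return table


slide_table = build([89, 18, 49, 58, 69], 10)
print(slide_table)  # [49, None, 58, 69, None, None, None, None, 18, 89]

# Made-up scenario: 4 of 7 cells are occupied (more than half full).
crowded = build([0, 1, 2, 4], 7)
print(insert_quadratic(crowded, 7))  # None: cells 3, 5, 6 are free but never probed
```

For a key whose home cell is 0 in a table of size 7, the probes only ever visit cells 0, 1, 2 and 4, so the insertion fails even though three cells are still empty.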

Double Hashing

• h_i(x) = (hash(x) + i × hash2(x)) Mod TableSize

• hash2(x) = R − (x Mod R)

– R is a prime smaller than TableSize

Example : inserting keys {89, 18, 49, 58, 69}

Cell   Empty Table   After 89   After 18   After 49   After 58   After 69
0      -             -          -          -          -          69
1      -             -          -          -          -          -
2      -             -          -          -          -          -
3      -             -          -          -          58         58
4      -             -          -          -          -          -
5      -             -          -          -          -          -
6      -             -          -          49         49         49
7      -             -          -          -          -          -
8      -             -          18         18         18         18
9      -             89         89         89         89         89
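The slides do not state the value of R used in this example; R = 7 is an assumption that turns out to reproduce the table, as the following sketch shows. The function names and the table size of 10 are likewise inferred for illustration.

```python
R = 7  # assumed prime smaller than the table size; it reproduces the table above


def hash2(x: int) -> int:
    return R - (x % R)  # always in 1..R, so the step is never 0


def insert_double(table: list, x: int):
    """Insert x with double hashing; return its cell, or None on failure."""
    size = len(table)
    home = x % size
    step = hash2(x)
    for i in range(size):
        cell = (home + i * step) % size  # h_i(x) = (hash(x) + i*hash2(x)) Mod TableSize
        if table[cell] is None:
            table[cell] = x
            return cell
    return None


table = [None] * 10
cells = {key: insert_double(table, key) for key in [89, 18, 49, 58, 69]}

print(table)  # [69, None, None, 58, None, None, 49, None, 18, 89]
print(cells)  # {89: 9, 18: 8, 49: 6, 58: 3, 69: 0}
```

The three colliding keys use different steps (7 for 49, 5 for 58, 1 for 69), so they scatter across the table instead of piling up next to cells 8 and 9.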

5.5 Rehashing

• When the table gets too full :

– operations start taking too long,

– insertions can fail with quadratic probing (if the table gets more than half full).

• Solution :

– Build another table that is about twice as big and scan down the original hash table.

– Use a new hash function to compute the new hash value of each element and insert it into the new table.

– Running time : O(N), where N is the number of elements to rehash.

– In general, rehash after N/2 insertions.

– Rehashing can be used in other data structures (remember the case of the ADT queue . . . in the midterm).
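A rehash can be sketched as follows. Several choices here are assumptions rather than details from the slides: the new size is the first prime at or above twice the old size, collisions are resolved with linear probing, the new hash function is x mod the new size, and the keys 13, 15, 24, 6 and 23 are made up for the demonstration.

```python
def is_prime(n: int) -> bool:
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def next_prime(n: int) -> int:
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n


def insert_linear(table: list, x: int) -> None:
    size = len(table)
    for i in range(size):
        cell = (x % size + i) % size
        if table[cell] is None:
            table[cell] = x
            return
    raise RuntimeError("table is full")


def rehash(old_table: list) -> list:
    """Build a table about twice as big and reinsert every element: O(N)."""
    new_table = [None] * next_prime(2 * len(old_table))
    for x in old_table:              # scan down the original table
        if x is not None:
            insert_linear(new_table, x)
    return new_table


old = [None] * 7
for key in [13, 15, 24, 6, 23]:
    insert_linear(old, key)
new = rehash(old)

print(old)       # [6, 15, 23, 24, None, None, 13]
print(len(new))  # 17
print(new)
```

Each element is hashed again with the new table size, so keys that collided in the old table (13 and 6) end up in their own home cells in the new one.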

5.6 Extendible Hashing

• If the amount of data is too large to fit in main memory, the main consideration is the number of disk accesses required to retrieve data.

• Collisions could cause several blocks to be examined during a find.

• Rehashing is extremely expensive since it requires O(N) disk accesses.

• Extendible hashing allows a find to be performed in two disk accesses and an insertion in a few disk accesses.

Original data

Directory entry     Leaf (depth) and its keys
00                  (2) 000100 001000 001010 001011
01                  (2) 010100 011000
10                  (2) 100000 101000 101100 101110
11                  (2) 111000 111001

After insertion of 100100 and directory split

Directory entries   Leaf (depth) and its keys
000, 001            (2) 000100 001000 001010 001011
010, 011            (2) 010100 011000
100                 (3) 100000 100100
101                 (3) 101000 101100 101110
110, 111            (2) 111000 111001

After insertion of 000000 and leaf split

Directory entries   Leaf (depth) and its keys
000                 (3) 000000 000100
001                 (3) 001000 001010 001011
010, 011            (2) 010100 011000
100                 (3) 100000 100100
101                 (3) 101000 101100 101110
110, 111            (2) 111000 111001
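The directory split and the leaf split shown in these figures can be reproduced with an in-memory sketch. It is a simplification under stated assumptions: keys are 6-bit integers, each leaf holds at most 4 keys (inferred from the figures), the directory is indexed by the leading bits of the key, and all class and method names are hypothetical. A real implementation would keep the leaves on disk.

```python
KEY_BITS = 6        # assumed key width, as in the figures
LEAF_CAPACITY = 4   # assumed maximum number of keys per leaf


class Leaf:
    def __init__(self, depth: int):
        self.depth = depth   # number of leading bits shared by all keys here
        self.keys = []


class ExtendibleHashTable:
    def __init__(self, depth: int = 2):
        self.depth = depth   # number of leading bits used by the directory
        self.directory = [Leaf(depth) for _ in range(2 ** depth)]

    def _index(self, key: int) -> int:
        return key >> (KEY_BITS - self.depth)   # leading `depth` bits

    def find(self, key: int) -> bool:
        return key in self.directory[self._index(key)].keys

    def insert(self, key: int) -> None:
        leaf = self.directory[self._index(key)]
        if key in leaf.keys:
            return
        if len(leaf.keys) < LEAF_CAPACITY:
            leaf.keys.append(key)
            return
        if leaf.depth == self.depth:
            # Directory split: every entry is duplicated, one more bit is used.
            self.directory = [l for l in self.directory for _ in range(2)]
            self.depth += 1
        self._split(leaf)
        self.insert(key)     # retry now that the full leaf has been split

    def _split(self, leaf: Leaf) -> None:
        """Leaf split: redistribute the keys on the next leading bit."""
        new_depth = leaf.depth + 1
        low, high = Leaf(new_depth), Leaf(new_depth)
        for k in leaf.keys:
            bit = (k >> (KEY_BITS - new_depth)) & 1
            (high if bit else low).keys.append(k)
        for i, entry in enumerate(self.directory):
            if entry is leaf:
                bit = (i >> (self.depth - new_depth)) & 1
                self.directory[i] = high if bit else low


ORIGINAL_KEYS = [
    0b000100, 0b001000, 0b001010, 0b001011, 0b010100, 0b011000,
    0b100000, 0b101000, 0b101100, 0b101110, 0b111000, 0b111001,
]

table = ExtendibleHashTable(depth=2)
for key in ORIGINAL_KEYS:
    table.insert(key)
table.insert(0b100100)   # leaf 10 is full: directory split, then leaf split
table.insert(0b000000)   # leaf 00 is full but has depth 2 < 3: leaf split only

for i in range(2 ** table.depth):
    leaf = table.directory[i]
    print(format(i, "03b"), leaf.depth, [format(k, "06b") for k in sorted(leaf.keys)])
```

The directory only doubles when the overflowing leaf already uses as many bits as the directory; otherwise splitting the leaf and redirecting the entries that pointed to it is enough, which is why the second insertion leaves the directory at depth 3.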
