CS2420: Lecture 33 Vladimir Kulyukin Computer Science Department Utah State University.

Date posted: 21-Dec-2015

CS2420: Lecture 33

Vladimir Kulyukin

Computer Science Department

Utah State University

Outline

• Hash Tables (Chapter 5)

Motivation

• Recall Big Question 4:

– How can I retrieve/search data efficiently?

• After investigating the balanced binary search trees (AVL, Red-Black), we can ask:

– Is it possible to break the log(n) barrier for insertion and deletion?

Hash Tables

• A hash table is a data structure that was invented as an attempt to break the log(N) insertion and deletion barrier of the balanced binary search trees.

• Conceptually, a hash table is an array of items plus a hash function that maps arbitrary objects to indices of the array.

• A hash function first extracts a key from a given object and then maps the key into a legal array index.

• For example, if an object is an employee record, the key could be the employee’s SSN or the employee’s first and last names.

• Typical keys are numbers and strings.
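The two steps described above (extract a key from the object, then map the key to a legal array index) can be sketched roughly as follows. This is only an illustration: the names Employee, extractKey, toIndex, and hashIndex are hypothetical, and std::hash stands in for whatever string hash function the table actually uses.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical record type, used only for this illustration.
struct Employee {
    std::string ssn;
    std::string name;
    double salary;
};

// Step 1: key extraction. Here the key is the SSN, stored as a string.
const std::string& extractKey(const Employee& e) {
    return e.ssn;
}

// Step 2: hashing. Map the key to a legal index in [0, tableSize).
// std::hash is an assumed stand-in for any string hash function.
std::size_t toIndex(const std::string& key, std::size_t tableSize) {
    assert(tableSize > 0);
    return std::hash<std::string>{}(key) % tableSize;
}

// Both steps combined: object -> key -> legal index.
std::size_t hashIndex(const Employee& e, std::size_t tableSize) {
    return toIndex(extractKey(e), tableSize);
}
```

Two records with the same SSN always map to the same cell, whatever their other fields contain, because only the extracted key is hashed.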

Example: A Hash Table

[Figure: an array with cells indexed 0 through 9 holding the keys “Mark”, “Rachel”, “David”, “Deborah”, and “John”.]

Hash Functions

[Figure: Object → Key Extraction → Key → Hashing → legal index into an array with cells 0 through 6.]

Hash Functions

• In general, it is impossible to find a hash function that computes distinct indices (two different array cells) for any two distinct keys. Why? Because there are infinitely many possible keys, but only finitely many cells in the table.

• Question: What are we to do?

• Answer: Look for hash functions that distribute keys evenly among the cells.

Three Hashing Problems

• Choose a hash function:

– Simple and fast;

– Distributes keys evenly.

• Choose a table size.

• Choose a collision resolution strategy (what to do when several keys are mapped to the same index).

Choosing a Hash Function

• If keys are integers, Key Mod TableSize is a sensible strategy.

• Caveat: Keys should be roughly random and should not share a pattern that interacts badly with TableSize.

• For example, if TableSize = 10 and all keys end in 0, Key Mod TableSize is not a sensible strategy, because every key is mapped to index 0.

Choosing a Table Size

• To avoid uneven key distributions, TableSize is typically chosen to be a prime number.

• When keys are random integers Key Mod TableSize works fairly well.
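The effect of the table size on Key Mod TableSize can be checked with a small experiment. The sketch below is not from the lecture; the helper names cellCounts and usedCells are hypothetical, and the keys are assumed to be non-negative integers.

```cpp
#include <cassert>
#include <vector>

// Count how many keys land in each cell under Key mod TableSize.
// Assumes non-negative keys and tableSize > 0.
std::vector<int> cellCounts(const std::vector<int>& keys, int tableSize) {
    assert(tableSize > 0);
    std::vector<int> counts(tableSize, 0);
    for (int key : keys) {
        assert(key >= 0);
        ++counts[key % tableSize];
    }
    return counts;
}

// Number of cells that received at least one key.
int usedCells(const std::vector<int>& counts) {
    int used = 0;
    for (int c : counts) {
        if (c > 0) {
            ++used;
        }
    }
    return used;
}
```

For the eleven keys 0, 10, 20, ..., 100, a table of size 10 puts every key in cell 0, so only one cell is used. With the prime size 11, the same keys land in eleven different cells.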

A Hash Function: Example 1

\[
\mathrm{hash1}(\mathit{Key}) = \sum_{i=0}^{\mathit{Keylength}-1} \mathit{Key}[i]
\]

A Hash Function: Example 1

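// Additive hash (called hash1 in the comments that follow):
// sum the character codes of the key, then reduce modulo the table size.
int hash(const string& key, int tableSize)
{
    int hashVal = 0;

    for (size_t i = 0; i < key.length(); i++) {
        hashVal += key[i];
    }

    return hashVal % tableSize;
}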

Comments on hash1

• Easy to compute and fast.

• If TableSize is large, the function may not distribute keys well.

• Why?

• Suppose TableSize = 10,007 (a prime) and all keys are ASCII strings of length 8 or smaller.

• Then hash1’s range is [0, 127*8 = 1016], so the keys can reach only the first 1,017 of the 10,007 cells.

• This is NOT an acceptable distribution.
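The range argument above can be verified directly. The sketch below restates the additive hash under the name hash1 so that it is self-contained; maxAsciiSum is a hypothetical helper, and the keys are assumed to contain only 7-bit ASCII characters (codes 0 to 127).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The additive hash from the lecture, restated so this block stands alone.
int hash1(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    int hashVal = 0;
    for (std::size_t i = 0; i < key.length(); i++) {
        hashVal += key[i];
    }
    return hashVal % tableSize;
}

// Upper bound on the character sum for ASCII keys of at most maxLength chars:
// every character contributes at most 127.
int maxAsciiSum(int maxLength) {
    return 127 * maxLength;
}
```

With maxLength = 8 the bound is 1016. Since 1016 is smaller than 10,007, the final modulo never changes the sum, and no key of this kind can reach a cell beyond index 1016.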

Hash Function: Example 2

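\[
\mathrm{hash2}(\mathit{Key}) = \sum_{i=0}^{\mathit{Keylength}-1} \mathit{Key}[\mathit{Keylength}-i-1] \cdot 37^{i}
= \mathit{Key}[\mathit{Keylength}-1] + \mathit{Key}[\mathit{Keylength}-2] \cdot 37 + \dots + \mathit{Key}[0] \cdot 37^{\mathit{Keylength}-1}
\]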

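Hash Function: Example 2

int hash2(const string &key, int tableSize)
{
    int hashVal = 0;

    // Horner's rule: evaluates the polynomial in 37 one character at a time.
    // On long keys hashVal can overflow (formally undefined for signed int;
    // in practice it usually wraps around to a negative value).
    for (size_t j = 0; j < key.length(); j++) {
        hashVal = 37 * hashVal + key[j];
    }

    hashVal %= tableSize;
    if (hashVal < 0) {
        hashVal += tableSize;   // bring a negative remainder back into range
    }

    return hashVal;
}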
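To see that the loop in hash2 computes the polynomial in 37, it may help to compare it with a direct term-by-term evaluation. The sketch below restates hash2 and adds a hypothetical helper, polyHash, which is meant only for short ASCII keys where nothing overflows.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Horner's-rule form, as in the lecture's hash2 (restated to stand alone).
int hash2(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    int hashVal = 0;
    for (std::size_t j = 0; j < key.length(); j++) {
        hashVal = 37 * hashVal + key[j];
    }
    hashVal %= tableSize;
    if (hashVal < 0) {
        hashVal += tableSize;
    }
    return hashVal;
}

// Direct evaluation: the sum of Key[Keylength - i - 1] * 37^i over all i.
// Hypothetical helper; assumes a short ASCII key so the sum fits in long long.
int polyHash(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    long long sum = 0;
    long long power = 1;  // holds 37^i
    for (std::size_t i = 0; i < key.length(); i++) {
        sum += key[key.length() - i - 1] * power;
        power *= 37;
    }
    return static_cast<int>(sum % tableSize);
}
```

For the key "abc" the polynomial is 97*37^2 + 98*37 + 99 = 136518, and 136518 mod 10007 = 6427, so both functions return 6427 for a table of size 10,007.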

Comments On Hash2

• Easy to compute.

• Fast on relatively short keys.

• Distributes keys fairly well.

• Potential problems with very long keys, because hashVal will overflow many times, which can lead to more collisions.

Collision Resolution

A collision occurs when an element is inserted under a key that hashes to a cell that is already occupied by a different element.

Collision Resolution Strategies

• Separate chaining

• Open addressing

Separate Chaining

• Separate chaining keeps a list of all elements whose keys hash to the same index.

• What does it mean?

• Under separate chaining, a hash table is an array of lists.

• The term “lists” is used rather loosely in the previous statement. It can be an array of AVL search trees or an array of hash tables. But the linked list remains the most common choice.
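A minimal sketch of separate chaining with linked lists might look as follows. It is not the lecture's CHashTable: the class name ChainedStringTable and its member functions are hypothetical, it stores only strings, and std::hash is an assumed stand-in for the hash function.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <vector>

// Hypothetical chained hash table of strings: an array (vector) of lists.
class ChainedStringTable {
public:
    explicit ChainedStringTable(std::size_t tableSize) : m_Lists(tableSize) {
        assert(tableSize > 0);
    }

    // Insert x into the list at its hash index. Duplicates are ignored.
    // Returns true if x was added.
    bool insert(const std::string& x) {
        std::list<std::string>& chain = m_Lists[indexOf(x)];
        if (std::find(chain.begin(), chain.end(), x) != chain.end()) {
            return false;
        }
        chain.push_back(x);
        return true;
    }

    // Search only the one list that x hashes to.
    bool contains(const std::string& x) const {
        const std::list<std::string>& chain = m_Lists[indexOf(x)];
        return std::find(chain.begin(), chain.end(), x) != chain.end();
    }

private:
    // Map x to a legal index in [0, m_Lists.size()).
    std::size_t indexOf(const std::string& x) const {
        return std::hash<std::string>{}(x) % m_Lists.size();
    }

    std::vector<std::list<std::string> > m_Lists;
};
```

A table of size 1 forces every key into the same list, which makes the collision handling easy to observe: all inserted keys remain retrievable even though they share one cell.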

Hash Table: Implementation

template <class T>
class CHashTable {
    …
private:
    vector<list<T> > m_Lists;   // one list per table cell
    int m_Size;
    …
};

int hash(const string &key) { … }

Hash Table: Implementation

class CEmployee {
private:
    string m_Name;
    double m_Salary;
    …
};

// Hash an employee by name, reusing the string version of hash.
int hash(const CEmployee &x)
{
    return hash(x.GetName());
}

Hash Table: Implementation

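template <class T>
int CHashTable<T>::hashIndex(const T& x) const
{
    // Convert the size to int first; otherwise index is converted to an
    // unsigned type and the negative check below can never fire.
    const int tableSize = static_cast<int>(m_Lists.size());

    int index = hash(x) % tableSize;
    if (index < 0)
        index += tableSize;

    return index;
}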
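The negative-index check matters because, since C++11, the % operator truncates toward zero, so a negative hash value yields a negative (or zero) remainder. The sketch below isolates that step; the function name legalIndex is hypothetical.

```cpp
#include <cassert>

// Map any int hash value (possibly negative) into [0, tableSize).
int legalIndex(int hashValue, int tableSize) {
    assert(tableSize > 0);
    int index = hashValue % tableSize;  // negative or zero when hashValue < 0
    if (index < 0) {
        index += tableSize;             // shift back into the legal range
    }
    return index;
}
```

For example, -7 % 5 evaluates to -2 in C++, and adding the table size 5 gives the legal index 3.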