Dictionaries and Hash Tables. Dictionary A dictionary, in computer science, implies a container that...

Post on 31-Dec-2015


Dictionaries and Hash Tables

Dictionary

A dictionary, in computer science, is a container that stores key-element pairs, called items, and allows for quick retrieval by key.

– Items must be stored in a way that allows them to be located with the key
– It is not necessary to store the items in order, so a dictionary can be either an unordered dictionary or an ordered dictionary

Dictionary ADT

Operations in a Dictionary ADT:
int size()
bool isEmpty()
iter elements()
iter keys()
pos find( key )
iter findAll( key )
void insertItem( key, elem )
void removeElement( key )
void removeAllElements( key )

Dictionary Examples

Natural language dictionary
• word is key
• element contains word, definition, pronunciation, etc.

Web pages
• URL is key
• HTML or other file is element

Any typical database (e.g. student record)
• has one or more search keys
• each key may require its own organizational dictionary

Implementing a Dictionary

There are many ways a dictionary can be implemented. Some of them are:
– Log file or Audit Trail
– Ordered Dictionary and Binary search trees
– Hash table

Log File or Audit Trail

This is the simplest way to implement a dictionary. It uses an unordered vector, list or sequence to store the key-element pairs.

void insertItem( key, elem )
Each new item is appended at the end – O(1)

pos find( key )
Scan the entire list and examine each key – O(n)

void removeElement( key )
Scan the entire list to find the item, then remove it – O(n)

This allows fast insertions, but finding and removing items is slow.

– A good solution for items that are stored frequently but retrieved rarely, such as archived database and operating-system transactions
– Typical use: storing a log file
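The log-file approach can be sketched in C++ as follows. This is a minimal illustration, not the deck's own implementation: the class name LogFileDictionary, the int keys and the std::string elements are assumptions made for the example, and find returns a pointer instead of the ADT's pos type.

```cpp
#include <string>
#include <utility>
#include <vector>

// Log-file dictionary: items are kept in arrival order in an unordered vector.
// (Hypothetical class name; keys are ints and elements are strings for illustration.)
class LogFileDictionary {
public:
    // Append at the end: O(1) amortized with std::vector.
    void insertItem(int key, const std::string& elem) {
        items_.push_back(std::make_pair(key, elem));
    }

    // Scan the items until the key matches: O(n).
    // Returns a pointer to the element, or nullptr if the key is absent.
    const std::string* find(int key) const {
        for (const auto& item : items_) {
            if (item.first == key) return &item.second;
        }
        return nullptr;
    }

    // Scan for the first item with this key, then erase it: O(n).
    void removeElement(int key) {
        for (auto it = items_.begin(); it != items_.end(); ++it) {
            if (it->first == key) { items_.erase(it); return; }
        }
    }

    int size() const { return static_cast<int>(items_.size()); }

private:
    std::vector<std::pair<int, std::string>> items_;
};
```

Inserting never looks at the existing items, which is why this layout suits data that is written often and read rarely.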

Ordered Dictionary ADT

All of the Dictionary operations, e.g. find(k), insertItem(k,e), removeElement(k)

Additional operations
pos closestBefore( key )
pos closestAfter( key )

Look-Up Tables

A look-up table is an implementation of an ordered dictionary (e.g. a trigonometry table)

Here is an example, where all items are stored in a vector, in ascending order of the keys.

index:  0  1  2  3  4  5  6  7  8  9 10
A:      5 13 15 16 21 26 37

Lookup Table Performance

In a look-up table, inserting or removing may require shifting elements

Example: insert an item with a key of 2

Before:
index:  0  1  2  3  4  5  6  7  8  9 10
A:      5 13 15 16 21 26 37

After:
index:  0  1  2  3  4  5  6  7  8  9 10
A:      2  5 13 15 16 21 26 37

All n elements are shifted one cell to the right to make room

insertItem(k,e) takes O(n) time in the worst case
removeElement(k) takes O(n) time in the worst case
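The shifting cost can be seen in a small C++ sketch. It stores keys only, and the function names insertSorted and removeSorted are hypothetical; std::vector::insert and std::vector::erase do the shifting.

```cpp
#include <algorithm>
#include <vector>

// Insert key k into a vector that is kept in ascending order.
void insertSorted(std::vector<int>& keys, int k) {
    // Binary search for the first position whose key is not less than k.
    auto pos = std::lower_bound(keys.begin(), keys.end(), k);
    // insert() shifts every element from pos onwards one cell to the right: O(n) worst case.
    keys.insert(pos, k);
}

// Remove one occurrence of k, shifting the later elements one cell to the left.
// Returns false if k is not stored.
bool removeSorted(std::vector<int>& keys, int k) {
    auto pos = std::lower_bound(keys.begin(), keys.end(), k);
    if (pos == keys.end() || *pos != k) return false;
    keys.erase(pos);
    return true;
}
```

Finding the position is cheap; it is the shift that makes both operations O(n) in the worst case.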

Lookup Table – find(k)

However, since the items in a lookup table are ordered, we can implement find(k) with a binary search algorithm.

A binary search algorithm (or binary chop) is a technique for finding a particular value in a sorted linear array by ruling out half of the data at each step. A binary search examines the middle element, makes a comparison to determine whether the desired value comes before or after it, and then searches the remaining half in the same manner, so find(k) takes O(log n) time. A binary search is an example of a divide and conquer algorithm.

Binary Search

Example: find(22)

index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
A:      2  4  5  7  8  9 12 14 17 19 22 25 27 28 33 37

Step 1: low = 0, high = 15, mid = 7; key(7) = 14 < 22, so search the upper half
Step 2: low = 8, high = 15, mid = 11; key(11) = 25 > 22, so search the lower half
Step 3: low = 8, high = 10, mid = 9; key(9) = 19 < 22, so search the upper half
Step 4: low = mid = high = 10; key(10) = 22, found

Binary Search Algorithm

Algorithm BinarySearch( A, k, low, high )
    if low > high then
        return Null
    else
        mid = (low + high) / 2
        if ( k == key(mid) ) then
            return Position(mid)
        else if ( k < key(mid) ) then
            return BinarySearch( A, k, low, mid - 1 )
        else
            return BinarySearch( A, k, mid + 1, high )
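A runnable C++ version of the pseudocode might look like this. It is a sketch under two assumptions: the table holds plain int keys in a std::vector, and -1 stands in for Null.

```cpp
#include <vector>

// Recursive binary search over the sorted range A[low..high].
// Returns the index of k, or -1 when k is absent (the pseudocode's Null).
int binarySearch(const std::vector<int>& A, int k, int low, int high) {
    if (low > high) return -1;
    // Same midpoint as (low + high) / 2, written so the sum cannot overflow.
    int mid = low + (high - low) / 2;
    if (k == A[mid]) return mid;
    if (k < A[mid]) return binarySearch(A, k, low, mid - 1);
    return binarySearch(A, k, mid + 1, high);
}
```

The first call passes low = 0 and high = A.size() - 1; each recursive call halves the range, which gives the O(log n) bound.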

Hash Tables

In computer science, a hash table, or a hash map, is a data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the corresponding value (e.g. that person's telephone number). It works by transforming the key using a hash function into a hash, a number that the hash table uses to locate the desired value.

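A hash table is generally considered the most efficient way to implement an unordered dictionary: with a good hash function, find, insert and remove run in O(1) expected time.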

Hash Table

Bucket Arrays

A bucket array for a hash table is an array A of size N, where each cell of A is thought of as a ‘bucket’, and N defines the capacity of the array.

Example
– Small company with fewer than 100 employees
– Each employee has an ID number in the range 0–99
– Store employee records in an array, so that the employee ID number matches the array index

A[0]: EMPTY
A[1]: 01 Turing, A.
A[2]: 02 Babbage, C.
A[3]: EMPTY
A[4]: 04 Gates, W.
…

Bucket Arrays

If the keys are unique, then searches, insertions and removals in the bucket array take worst-case time of O(1).

However, bucket arrays have 2 drawbacks.
– The array requires a capacity of N, the maximum number of elements possible
– The key has to be an integer in the range [0, N-1]

Hash Functions

A good hash function is essential for good hash table performance. If a hash function tends to map many keys to the same values, slow searches will result.

Example
– Small company with fewer than 100 employees
– Already uses a 5-digit ID number

A simple hash function for this example is ( ID % 100 )

A[0]: EMPTY
A[1]: 55301 Turing, A.
A[2]: 81202 Babbage, C.
A[3]: EMPTY
A[4]: 77404 Gates, W.
…

Hash Functions

A hash function is a way of creating a small digital "fingerprint" from any kind of data. The function chops and mixes the data to create the fingerprint, often called a hash value. A good hash function is one that yields few hash collisions in expected input domains.

To do this, the index into the hash table's array is generally calculated in two steps:

– A generic hash value is calculated to map the key to an integer ( hash code )

– This value is reduced to a valid array index ( compression map )

Hash Code

Take an arbitrary key k and assign it an integer value h. Then h is known as the hash code or hash value of k.

key -> integer

This integer h does not need to be in the range of the array that is being used for hashing and may even be a negative number, but we want the set of hash codes assigned to our keys to avoid collisions as much as possible.

Hash coding can be done in many ways:

– Integer cast

– Summing components

– Polynomial accumulations

Hash code – Integer Cast

int hashCode( int key )
{
    return key;
}

int hashCode( char key )
{
    return hashCode( int(key) ); // cast it to an integer
}

Hash code – Summing Components

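If the long int has twice as many bits as the int datatype (e.g. 32 bits for int, 64 bits for long), treat the high-order bits as one integer and the low-order bits as another, then sum them.

int hashCode( long key )
{
    typedef unsigned long ulong;                    // assumes long is 64 bits and int is 32 bits
    unsigned high = unsigned( ulong(key) >> 32 );   // high-order 32 bits
    unsigned low  = unsigned( ulong(key) );         // low-order 32 bits
    return hashCode( int( high + low ) );           // unsigned sum wraps around instead of overflowing
}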

Hash code – Summing Components Applied to Strings

One approach is to sum the ASCII values of all the chars in the string
– Problem: too many collisions because many different words will have the same result
– For example, stop, tops, pots, spot

ASCII
s = 115
t = 116
o = 111
p = 112

Hashcode = 115 + 116 + 111 + 112 = 454
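The collision is easy to reproduce. The following C++ sketch sums the character codes of a string; the function name sumHashCode is an assumption made for the example.

```cpp
#include <string>

// Hash code obtained by summing the character codes of the string.
// Any two strings made of the same characters get the same code.
int sumHashCode(const std::string& s) {
    int h = 0;
    for (unsigned char c : s) h += c;
    return h;
}
```

Calling it on "stop", "tops", "pots" and "spot" yields 454 each time, because addition ignores the order of the characters.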

Hash code – Polynomial Accumulation

Better approach for string keys
– Modify each char’s ASCII value by a number based on its position in the string
– Then sum the results
– Where x_i represents the char at position i, k is the total number of chars, and a is a constant (but not 1), the following formula can be used:

x_0 a^(k-1) + x_1 a^(k-2) + … + x_(k-2) a + x_(k-1)

Example: assume that the string is “stop” and a = 10

s = 115 * 10^3 = 115000
t = 116 * 10^2 = 11600
o = 111 * 10^1 = 1110
p = 112 * 10^0 = 112

Hashcode = 115000 + 11600 + 1110 + 112 = 127822
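The polynomial can be evaluated without computing any powers by using Horner's rule. The sketch below assumes a function name, polyHashCode, and uses unsigned arithmetic so that long strings wrap around instead of overflowing.

```cpp
#include <string>

// Polynomial hash code x0*a^(k-1) + x1*a^(k-2) + ... + x(k-1),
// evaluated left to right with Horner's rule: h = h*a + x_i.
unsigned polyHashCode(const std::string& s, unsigned a) {
    unsigned h = 0;
    for (unsigned char c : s) {
        h = h * a + c;  // unsigned arithmetic wraps around on overflow
    }
    return h;
}
```

With a = 10, "stop" gives 127822 while "tops" gives 128335, so the anagrams that collided under plain summation now receive different hash codes.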

Compression Maps

This is the second part of the hash function. Once we have a hash code, we need to map it to an integer in the range of array index numbers.

This can be accomplished in many ways:
– Truncation
– Truncation and Summation
– Division method
– MAD method

Compression Maps - Truncation

One way would be to simply ignore parts of the key and use the remaining part.

E.g.:
employee number: 15436578
bucket array size: 1000
possibility 1: k = last 3 digits = 578
possibility 2: k = digits 4, 6 and 8 = 358

This is a fast scheme, but it fails to give an even distribution of keys throughout the table.

Compression Maps – Truncation and Summation

This method might use a combination of truncating and summing parts of the key.

E.g.:
employee number: 15496578
bucket array size: 1000
possibility: partition the key into groups of 3 digits, add the groups together, and truncate if necessary
k = 154 + 965 + 78 = 1197, truncated to 197

This provides a better spread than simple truncation, but it still does not prevent collisions.

Compression Maps - Division Method

int k = hashCode( key );
int index = abs(k) % ARRAY_SIZE;

In practice, a prime array size works best: it reduces the number of collisions and spreads out the distribution of hashed values.

Example: Keys = {200,210,220,230,…,600}
– If array size = 100 (a non-prime number), every hash code collides with others: the 41 keys land in only 10 buckets (0, 10, …, 90)
– If array size = 101 (a prime number), there are fewer collisions: here every key gets its own bucket
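The effect of the array size can be checked by counting how many distinct buckets the example keys occupy. This C++ sketch uses hypothetical helper names (compress, bucketsUsed) and takes the keys themselves as hash codes.

```cpp
#include <cstdlib>
#include <set>

// Division-method compression map.
int compress(int hashCode, int arraySize) {
    return std::abs(hashCode) % arraySize;
}

// Number of distinct buckets occupied by the keys 200, 210, ..., 600.
int bucketsUsed(int arraySize) {
    std::set<int> used;
    for (int key = 200; key <= 600; key += 10) {
        used.insert(compress(key, arraySize));
    }
    return static_cast<int>(used.size());
}
```

bucketsUsed(100) is 10 and bucketsUsed(101) is 41: with the prime size, each of the 41 keys lands in a bucket of its own.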

Compression Maps - MAD Method

This is another method to convert the hash code into a known range. MAD stands for “Multiply, Add, and Divide”:

int k = hashCode( key );
int i = abs(a * k + b) % ARRAY_SIZE;

where
– a and b are non-negative integers
– (a % ARRAY_SIZE) must not be 0
– a and b are chosen at random when the program is written

Example: Keys = {200,210,220,230,…,600} where a=8, b=7, array size = 100
200 => (8*200+7) % 100 => 7
210 => (8*210+7) % 100 => 87
220 => (8*220+7) % 100 => 67
230 => (8*230+7) % 100 => 47
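As a sketch, the MAD map can be wrapped in a small C++ function. The name madCompress and the use of 64-bit intermediate arithmetic are choices made for this example; the deck's snippet uses plain int.

```cpp
#include <cstdlib>

// MAD compression map: |a*k + b| mod arraySize.
// The product is computed in 64 bits so that a*k + b cannot overflow for 32-bit inputs.
int madCompress(int k, int a, int b, int arraySize) {
    long long v = static_cast<long long>(a) * k + b;
    return static_cast<int>(std::llabs(v) % arraySize);
}
```

For the example values, madCompress(200, 8, 7, 100) returns 7 and madCompress(210, 8, 7, 100) returns 87, matching the table above.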

Collisions

Keys are not required to be unique, and the hash function is not required to generate a unique value for each key. This means that more than one element may be mapped to the same position. This creates a collision.

Collisions

Two different keys are mapped to the same location in the array

Best approach – minimize collisions by picking a good hash function

Example
– A bad hash function is ( key % 100 ) because it is too likely to cause collisions; ( key % 101 ) is better.

Collisions

If two keys hash to the same index, the corresponding records cannot be stored in the same location. So, if the location is already occupied, we must find another location to store the new record, and do it in a way that lets us find the record when we look it up later on.

Example
– Previous hash function of ( ID % 100 ) is too likely to cause collisions

A[0]: EMPTY
A[1]: 55301 Turing, A.
A[2]: 81202 Babbage, C.
A[3]: EMPTY
A[4]: 77404 Gates, W. (collision: 38104 McNealy, S. also hashes to index 4)
…

Collision Handling

There are a number of collision resolution techniques, but the most popular are chaining and open addressing.

Two different approaches
– Chaining
– Open addressing

Chaining

Separate chaining is a method for dealing with collisions. The hash table is an array of linked lists. Data elements that hash to the same value are stored in a linked list originating from the index equivalent of their hash value.

– Each location in the hash table holds a pointer to a list

– Each list can hold many items

– As long as the hash function is good, the lists will be small because there will be few collisions

Separate Chaining Example

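A has 13 cells (indices 0 to 12); cells not listed are empty. The positions assume h(k) = k mod 13, which is consistent with the keys in each chain.

A[2]: 28 -> 54 -> 41 -> NULL
A[5]: 18 -> NULL
A[10]: 10 -> 36 -> NULL
A[12]: 12 -> 38 -> 25 -> 90 -> NULL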
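A separate-chaining table can be sketched in C++ with a vector of linked lists. The class name ChainedHashTable, the int keys, the string elements and the hash function key mod N are all assumptions made for this illustration.

```cpp
#include <list>
#include <string>
#include <utility>
#include <vector>

// Separate chaining: each bucket holds a linked list of (key, element) items.
class ChainedHashTable {
public:
    explicit ChainedHashTable(int capacity) : buckets_(capacity) {}

    // Append the item to the chain of its bucket.
    void insertItem(int key, const std::string& elem) {
        buckets_[index(key)].push_back(std::make_pair(key, elem));
    }

    // Walk only the chain the key hashes to; nullptr if the key is absent.
    const std::string* find(int key) const {
        for (const auto& item : buckets_[index(key)]) {
            if (item.first == key) return &item.second;
        }
        return nullptr;
    }

    // Remove the first item with this key; returns false if there is none.
    bool removeElement(int key) {
        auto& chain = buckets_[index(key)];
        for (auto it = chain.begin(); it != chain.end(); ++it) {
            if (it->first == key) { chain.erase(it); return true; }
        }
        return false;
    }

    int chainLength(int i) const { return static_cast<int>(buckets_[i].size()); }

private:
    // Bucket index key mod N, kept non-negative even for negative keys.
    int index(int key) const {
        int n = static_cast<int>(buckets_.size());
        return ((key % n) + n) % n;
    }

    std::vector<std::list<std::pair<int, std::string>>> buckets_;
};
```

With 13 buckets, the keys 12, 38, 25 and 90 all end up in the chain at index 12, so a find for any of them scans at most four items.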

Open Addressing

In open addressing, each bucket stores at most one item. If several elements map to the same bucket, a probing scheme is used to find an empty bucket:

• Linear probing
  h’(k) = ( h(k) + j ) mod N where j = 0, 1, 2, 3, . . .
  » Keep adding 1 to the rank until an empty bucket is found

• Quadratic probing
  h’(k) = ( h(k) + j² ) mod N where j = 0, 1, 2, 3, . . .

• Double hashing
  h’(k) = ( h(k) + j * h’’(k) ) mod N where j = 0, 1, 2, 3, . . .
  where h’’(k) = q - ( k mod q ) is a second hash function and q is a constant, typically a prime smaller than N

Linear Probing

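If a bucket is already occupied, then try the next bucket, and keep going until an empty one is found

A[0]: EMPTY
A[1]: 55301 Turing, A.
A[2]: 81202 Babbage, C.
A[3]: EMPTY
A[4]: 77404 Gates, W. (occupied: 38104 McNealy, S. hashes here too and must probe onward)
…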

After probing, as drawn in the slide's figure:

A[0]: 38104 McNealy, S.
A[1]: 55301 Turing, A.
A[2]: 81202 Babbage, C.
A[3]: EMPTY
A[4]: 77404 Gates, W.
…

In the figure the new record ends up in A[0]: the probe moves forward from index 4 and wraps around to the start of the array.

Linear Probing – insertItem(k,e)

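If a location is already occupied, then try the next location, until an empty one is found

Example:
– h(k) = ( (k % cap) + j ) mod cap where j = 0, 1, 2, 3, . . .
– Insert the following keys into hash table A (cap = 11), in this order: {13,26,5,37,16,21,15}

index:  0  1  2  3  4  5  6  7  8  9 10
A:      -  - 13  - 26  5 37 16 15  - 21

(- marks an empty location; 37, 16 and 15 collide with earlier keys and move forward to locations 6, 7 and 8)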
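The insertion loop can be sketched in C++ as follows. The table stores keys only, and the sentinel EMPTY = -1 and the function name linearInsert are assumptions made for the example (keys are taken to be non-negative).

```cpp
#include <vector>

const int EMPTY = -1;  // hypothetical sentinel; assumes keys are non-negative

// Insert key with linear probing: try (key % cap) + j for j = 0, 1, 2, ...
// Returns the index used, or -1 if every location is occupied.
int linearInsert(std::vector<int>& A, int key) {
    int cap = static_cast<int>(A.size());
    for (int j = 0; j < cap; ++j) {
        int i = ((key % cap) + j) % cap;
        if (A[i] == EMPTY) {
            A[i] = key;
            return i;
        }
    }
    return -1;
}
```

Starting from std::vector<int> A(11, EMPTY) and inserting 13, 26, 5, 37, 16, 21, 15 in that order leaves 37 at index 6, 16 at index 7 and 15 at index 8.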

Linear Probing – Using Lazy Deletes

Problem:
– If the find() operation is looking for a key, it stops looking when it gets to an empty location and assumes the key isn’t there
– If several colliding items are stored in the hash table with linear probing and one of them is deleted, a “hole” is created, and find() might stop prematurely

• Solution: Implement removeElement so that it never empties a location; it just marks the location “FREE”. find() skips over FREE locations and keeps probing, while insertItem may reuse them.

Example: after removing key 5 (cap = 11)

index:     0     1     2     3     4     5     6     7     8     9    10
A:     EMPTY EMPTY    13 EMPTY    26  FREE    37    16    15 EMPTY    21
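A lazy-delete table can be sketched in C++ with a three-valued state per cell. The names CellState, Cell and LazyProbingTable are hypothetical, the table stores non-negative int keys only, and insertItem does not check for duplicates.

```cpp
#include <vector>

enum class CellState { Empty, Free, Occupied };  // Free marks a lazily deleted item

struct Cell {
    CellState state;
    int key;
};

// Linear-probing table of non-negative integer keys with lazy deletion.
class LazyProbingTable {
public:
    explicit LazyProbingTable(int cap) : cells_(cap, Cell{CellState::Empty, 0}) {}

    // Store key in the first Empty or Free cell of its probe sequence.
    bool insertItem(int key) {
        int cap = capacity();
        for (int j = 0; j < cap; ++j) {
            Cell& c = cells_[(key % cap + j) % cap];
            if (c.state != CellState::Occupied) {
                c.state = CellState::Occupied;
                c.key = key;
                return true;
            }
        }
        return false;  // table is full
    }

    // Returns the index of key, or -1. Stops at an Empty cell but walks past Free cells.
    int find(int key) const {
        int cap = capacity();
        for (int j = 0; j < cap; ++j) {
            int i = (key % cap + j) % cap;
            const Cell& c = cells_[i];
            if (c.state == CellState::Empty) return -1;
            if (c.state == CellState::Occupied && c.key == key) return i;
        }
        return -1;
    }

    // Mark the cell Free instead of emptying it, so later probes are not cut short.
    bool removeElement(int key) {
        int i = find(key);
        if (i < 0) return false;
        cells_[i].state = CellState::Free;
        return true;
    }

private:
    int capacity() const { return static_cast<int>(cells_.size()); }
    std::vector<Cell> cells_;
};
```

After inserting 13, 26, 5, 37, 16, 21, 15 into a table of capacity 11 and removing 5, find(37) still reaches index 6 because the probe walks over the FREE cell at index 5.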

Quadratic Probing

Quadratic Probing is another open addressing strategy to deal with collisions. It uses the following formula:

h(k) = ( (k % cap) + j² ) mod cap where j = 0, 1, 2, 3, . . .

Example: {13,26,5,37,16,21,15}, cap = 11
Inserting 37:
((37 % 11) + 0²) % 11 = 4 //collision
((37 % 11) + 1²) % 11 = 5 //collision
((37 % 11) + 2²) % 11 = 8 //OK

index:  0  1  2  3  4  5  6  7  8  9 10
A:      -  - 13  - 26  5 16  - 37 15 21

(- marks an empty location)
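The same insertion loop with a squared offset gives quadratic probing. In this C++ sketch the sentinel EMPTY_CELL = -1 and the function name quadraticInsert are assumptions, and the loop gives up after cap probes.

```cpp
#include <vector>

const int EMPTY_CELL = -1;  // hypothetical sentinel; assumes keys are non-negative

// Insert key with quadratic probing: try (key % cap) + j*j for j = 0, 1, 2, ...
// Returns the index used, or -1 if no empty location is reached within cap probes.
int quadraticInsert(std::vector<int>& A, int key) {
    int cap = static_cast<int>(A.size());
    for (int j = 0; j < cap; ++j) {
        int i = ((key % cap) + j * j) % cap;
        if (A[i] == EMPTY_CELL) {
            A[i] = key;
            return i;
        }
    }
    return -1;
}
```

For the example keys and cap = 11, 37 lands at index 8 after two collisions, and 15 needs five probes (4, 5, 8, 2, 9) before it finds index 9 free.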

Quadratic Probing Pros and Cons

Advantages
– Avoids the primary clustering of linear probing

Disadvantages
– Creates secondary clustering: a different pattern of filled array locations
– If the load factor is 0.5 or more, an empty location may not be found even if one exists

Double Hashing

Double hashing is another alternative to linear probing: if there is a collision, a second, different hash function h’’ determines the step between probes.

h’(k) = ( h(k) + j * h’’(k) ) mod N where j = 0, 1, 2, 3, . . .
and where h’’(k) = q - ( k mod q )

Written out with h(k) = key % cap:
h’(k) = ( (key % cap) + j * ( q - ( key % q ) ) ) % cap where j = 0, 1, 2, 3, . . .

Example: {13,26,5,37}, cap = 11
Let q = 7

Inserting 37:
j = 0: ( (37 % 11) + 0 * ( 7 - ( 37 % 7 ) ) ) % 11 = 4 //collision
j = 1: ( (37 % 11) + 1 * ( 7 - ( 37 % 7 ) ) ) % 11 = ( 4 + 5 ) % 11 = 9 //OK

index:  0  1  2  3  4  5  6  7  8  9 10
A:      -  - 13  - 26  5  -  -  - 37  -

(- marks an empty location)
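Double hashing differs from the other probing schemes only in the step size. The C++ sketch below assumes the names secondaryHash and doubleHashInsert and the sentinel NO_KEY = -1, and stores non-negative keys only.

```cpp
#include <vector>

const int NO_KEY = -1;  // hypothetical sentinel; assumes keys are non-negative

// Secondary hash h''(k) = q - (k mod q); the result lies in 1..q, so the step is never 0.
int secondaryHash(int key, int q) {
    return q - (key % q);
}

// Insert key with double hashing: try (key % cap) + j * h''(key) for j = 0, 1, 2, ...
// Returns the index used, or -1 if no empty location is reached within cap probes.
int doubleHashInsert(std::vector<int>& A, int key, int q) {
    int cap = static_cast<int>(A.size());
    int step = secondaryHash(key, q);
    for (int j = 0; j < cap; ++j) {
        int i = ((key % cap) + j * step) % cap;
        if (A[i] == NO_KEY) {
            A[i] = key;
            return i;
        }
    }
    return -1;
}
```

Because the step depends on the key, two keys that collide at the same index usually follow different probe sequences afterwards.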

Load Factor

The load factor of a hash table is the ratio of the number of items in the hash table to the number of buckets, and is denoted by λ (lambda)

– Expresses how “full” the hash table has become
– As a rule of thumb, should be kept below 0.75
– Example
  capacity = 11
  items stored = 7
  load factor = 7/11 ≈ 0.64

Rehashing

Maximum load factor, based on experimental data:
– 0.5 for open addressing schemes
– 0.9 for separate chaining

If the load factor is above that threshold, then the table should be resized

– The new table should be at least double the size of the old table so that the time cost can be amortized
– The hash function should be modified to match the new capacity
– Rehash the data: take each item out of the old array and insert it into the new one using the new hash function
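The rehashing steps can be sketched in C++ for an open-addressing table of keys. Everything named here (VACANT, isPrime, nextCapacity, loadFactor, rehash) is hypothetical, and two choices are assumptions made for the example rather than taken from the slides: the new capacity is the smallest prime at least twice the old one, and collisions are resolved by linear probing.

```cpp
#include <vector>

const int VACANT = -1;  // hypothetical sentinel; assumes keys are non-negative

bool isPrime(int n) {
    if (n < 2) return false;
    for (int d = 2; d * d <= n; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Smallest prime that is at least twice the old capacity.
int nextCapacity(int oldCap) {
    int n = 2 * oldCap;
    while (!isPrime(n)) ++n;
    return n;
}

// Ratio of stored keys to buckets.
double loadFactor(const std::vector<int>& A) {
    int used = 0;
    for (int k : A) {
        if (k != VACANT) ++used;
    }
    return static_cast<double>(used) / A.size();
}

// Build a larger table and reinsert every key with the new hash function (key % newCap).
std::vector<int> rehash(const std::vector<int>& old) {
    std::vector<int> bigger(nextCapacity(static_cast<int>(old.size())), VACANT);
    int cap = static_cast<int>(bigger.size());
    for (int key : old) {
        if (key == VACANT) continue;
        int i = key % cap;
        while (bigger[i] != VACANT) i = (i + 1) % cap;  // linear probing
        bigger[i] = key;
    }
    return bigger;
}
```

For the 11-bucket example with 7 keys, the load factor drops from about 0.64 to 7/23, roughly 0.30, after one rehash into 23 buckets.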