Introduction to Java - TIHE - HashTables.pdf · Hash Tables Hash tables use an array behind the...

Hash Tables

Tonga Institute of Higher Education

Introduction

Hash tables are another data structure that can hold data.

Advantages

Hash tables are very good at insertion and searching.

No matter how much data you have, insertions, searches and sometimes deletions are close to O(1) time.

Disadvantages

When hash tables become too full, performance degrades very quickly.

Hash tables are based on arrays, and arrays are difficult to expand.

You cannot move from data item to data item in any kind of order.

Therefore, you must make sure you have an accurate idea of how much data you will store.

Also, you must not need to visit the data in any order.

Arrays are Useful

Arrays are useful in certain situations.

If you have a system to keep track of your employees, you can use an array.

Each employee record occupies one cell of the array.

The array index could be the Employee ID number.

So looking up employee data is easy if you know the Employee ID number.

Array Shortcomings

However, when arrays get very large, they take a long time to search.

Unordered arrays take a long time to search for items.
Search: O(N) time

Ordered arrays take a long time when new data items are added.
Search: O(log N) time
Insert: O(N) time

Let’s say we are asked to make an English dictionary and put it on the web.
It stores 100,000 English words.
Each word needs to be quickly accessible.
Sometimes, new words are added.

A hash table is a good choice for a dictionary.
Search: O(1) time
Insert: O(1) time

Hash Tables

Hash tables use an array behind the scenes.

The index of each cell is calculated using a formula.

Hashing – Converting a value from one set to another.

Hash Value or Hash - A number generated from another value, like a string of text. The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.

Demonstration

Hash Applet

Hashing – Addition Formula

A simple formula where we add the values of the letters:

A = 1, B = 2, C = 3, …, S = 19, T = 20, …, Z = 26

CATS = 3 + 1 + 20 + 19 = 43

So the index of CATS would be 43.

But this is not a good choice. If we restrict ourselves to 10 letter words, the last word would potentially be:

zzzzzzzzzz = 26 + 26 + 26 + 26 + 26 + 26 + 26 + 26 + 26 + 26 = 260

So the range of indexes would be from 1 to 260 (a to zzzzzzzzzz).

But we know that there are more than 260 words, so many words must share the same index. For example, many words add up to 43: bails, was, tin, tick, give, tend, moan, …
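The addition formula can be sketched in Java roughly as follows. The class and method names are made up for illustration, and the sketch assumes the word contains only the letters a to z.

```java
// Sketch of the addition formula: A = 1, B = 2, ..., Z = 26, summed per letter.
// AdditionHash and hash are illustrative names, not from the slides.
class AdditionHash {
    static int hash(String word) {
        int sum = 0;
        for (int i = 0; i < word.length(); i++) {
            // 'a' -> 1, 'b' -> 2, ..., 'z' -> 26 (assumes letters only)
            sum += Character.toLowerCase(word.charAt(i)) - 'a' + 1;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(hash("CATS"));        // 43
        System.out.println(hash("was"));         // 43, the same index as CATS
        System.out.println(hash("zzzzzzzzzz"));  // 260
    }
}
```

Running it shows why the formula is a poor choice: different words land on the same index, and no 10 letter word can ever produce an index above 260.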

Hashing – Multiplication by Powers Formula - 1

With normal numbers:

Each digit can be from 0 to 9 (10 different values).

Each digit position represents a value 10 times as big as the digit position to the right.

7654 = 7 * 1000 + 6 * 100 + 5 * 10 + 4 * 1
     = 7 * 10^3 + 6 * 10^2 + 5 * 10^1 + 4 * 10^0

This guarantees that every possible digit sequence has a unique numerical value.

Hashing – Multiplication by Powers Formula - 2

With letters:

We can apply the same idea to guarantee that each letter sequence has a unique numerical value.

Each character can be from a to z (26 different values).

Each character position represents a value 26 times as big as the character position to the right.

A = 1, B = 2, C = 3, …, S = 19, T = 20, …, Z = 26

CATS = (3 * 26^3) + (1 * 26^2) + (20 * 26^1) + (19 * 26^0)
     = (3 * 17576) + (1 * 676) + (20 * 26) + (19 * 1)
     = 53943

This guarantees that every possible letter combination has a unique numerical value.
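A minimal Java sketch of the multiplication by powers formula is shown here. It is one possible way to write it, not the course's own code. Instead of computing each power of 26 separately, it multiplies the running value by 26 before adding each letter, which gives the same result. The names are illustrative, and a long is used because the values grow quickly.

```java
// Sketch of the multiplication-by-powers formula (illustrative names).
class PowersHash {
    static long hash(String word) {
        long value = 0;
        for (int i = 0; i < word.length(); i++) {
            // a = 1, b = 2, ..., z = 26 (assumes letters only)
            int letter = Character.toLowerCase(word.charAt(i)) - 'a' + 1;
            // Multiplying by 26 moves the earlier letters one position to the left
            value = value * 26 + letter;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(hash("CATS"));        // 53943
        System.out.println(hash("zzzzzzzzzz"));  // far too large to use as an array index
    }
}
```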

Hashing – Multiplication by Powers Formula - 3

But even this is not a good choice. If we restrict ourselves to 10 letter words, the last word would potentially be:

zzzzzzzzzz = 26 * 26^9 + 26 * 26^8 + 26 * 26^7 + 26 * 26^6 + 26 * 26^5 + 26 * 26^4 + 26 * 26^3 + 26 * 26^2 + 26 * 26^1 + 26 * 26^0

This number is very big: 26^9 alone is 5.4295E+12!

An array with this many cells is too big to store in memory!

This is because every single letter combination computes into a unique index. Not every letter combination is a word! (Example: afwe, oijaw, awioa)

Hashing – Modulo Operator

The Multiplication by Powers formula:
Gives us a unique number for every letter combination up to 10 letters long.
Has too many values.

We need a way to compress the huge range of numbers into a range that is smaller.
Our English dictionary will have 100,000 values.

We can use the Modulo operator (%) to accomplish this.

Modulo Operator

The Modulo operator gives us the remainder when one number is divided by another.

Example 1: 13 % 10 = 3
13 divided by 10 results in a remainder of 3.

Example 2: 26 % 5 = 1
26 divided by 5 results in a remainder of 1.

So what is the remainder for these?
55 % 6
73 % 73
13 % 8

Hashing with the Modulo Operator

Using the Modulo Operator, we can make every value in a large range of values map to a value in a small range of values.

In the huge range, each number represents a potential word, but few of the numbers represent real words.

In the small range, we can make it so half of the cells are full:

SizeOfSmallArray = numberOfPlannedDataItems * 2

Then, we use a hash function to map a value from the huge range to the small range:

IndexInSmallArray = KeyInLargeArray % SizeOfSmallArray

This formula is only true for open addressing.
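As a rough Java sketch, the two formulas above could look like this. The method names are hypothetical, and keys are assumed to be non-negative so that % never returns a negative index.

```java
// Sketch: compress a key from the huge range into an index in the small array.
class ModuloHash {
    // Twice the planned number of items, so about half of the cells stay empty
    static int sizeOfSmallArray(int numberOfPlannedDataItems) {
        return numberOfPlannedDataItems * 2;
    }

    // IndexInSmallArray = KeyInLargeArray % SizeOfSmallArray
    static int indexInSmallArray(long keyInLargeArray, int sizeOfSmallArray) {
        return (int) (keyInLargeArray % sizeOfSmallArray);
    }

    public static void main(String[] args) {
        int size = sizeOfSmallArray(100);                   // 200 cells for 100 items
        System.out.println(indexInSmallArray(53943L, size)); // 143
    }
}
```

The key 53943 is the value computed for CATS with the multiplication by powers formula. Whatever the key, the result always falls between 0 and size - 1.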

Collisions

We pay a price for squeezing a large range into a small range.

Sometimes, two values from the large range will map to the same value in the small range.

We hope that not too many words will hash to the same index.

Collision - When we have 2 large range values that hash to the same small range value. Both words occupy the same location.

Handling Collisions

There are 2 main ways to handle collisions:

1. Open Addressing – When a data item can’t be placed at a particular index, another location in the array is used
   Linear probing
   Quadratic probing
   Double hashing

2. Separate Chaining – When more than 1 data item needs to be placed at a particular index, linked lists are used

Open Addressing - Linear Probing

In linear probing, when we try to insert and have a collision, we search sequentially for an empty cell.

Example: If 53 is occupied, we try 54, then 55, and so on until we find an empty cell.

The index is incremented until we find an empty cell.

At the end of the array, loop around and continue at the beginning of the array.

This is called linear probing because it steps sequentially along the line of cells.

To simplify our examples, we will use number keys.
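Here is a minimal sketch of an insert with linear probing, using number keys. It is not the course's Hash Insert code. It assumes keys are positive so that 0 can stand for an empty cell, and it uses key % arraySize as the hash function.

```java
// Sketch of insertion with linear probing (illustrative, not the course's code).
class LinearProbeInsert {
    static final int EMPTY = 0;   // assumption: real keys are always positive
    final int[] cells;

    LinearProbeInsert(int size) {
        cells = new int[size];
    }

    // Returns the index where the key was stored, or -1 if the table is full.
    int insert(int key) {
        int index = key % cells.length;
        for (int probes = 0; probes < cells.length; probes++) {
            if (cells[index] == EMPTY) {
                cells[index] = key;
                return index;
            }
            index = (index + 1) % cells.length;   // next cell, looping around at the end
        }
        return -1;
    }

    public static void main(String[] args) {
        LinearProbeInsert table = new LinearProbeInsert(10);
        System.out.println(table.insert(53));  // 3
        System.out.println(table.insert(63));  // 4, because cell 3 is occupied
        System.out.println(table.insert(9));   // 9
        System.out.println(table.insert(19));  // 0, looped around from the end
    }
}
```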

Demonstration

Hash Applet: Insert

Code View

Hash Insert

Open Addressing - Linear Probing Searching

When searching for a data item, we follow these steps:

Use a hash function on the key to get an index for the small range.

Check the item located at the index to see if it has the same key.

Keep looking until we find an item with the same key or we find an empty cell.

The original key is 472. Using a hash function results in an index of 52.

The original key is 135. Using a hash function results in an index of 53.

Demonstration

Hash Applet: Searching

Code View

Hash Search

Open Addressing - Linear Probing Deleting

When we delete an item, we can’t clear the cell. This is because the find routine quits when it finds an empty cell.

Therefore, we mark the cell as being deleted with a -1.

The insert code should then be able to insert items in an empty cell or a cell with a deleted value.

If we cleared 413, how would 532 and 472 be found?
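The search and delete steps can be sketched together as below. This is an illustrative version, not the course's Hash Search or Hash Delete code. It assumes positive number keys, uses 0 for an empty cell and -1 for a deleted cell, and hashes with key % arraySize.

```java
// Sketch of find and delete with linear probing and a "deleted" marker.
class LinearProbeTable {
    static final int EMPTY = 0;      // assumption: real keys are always positive
    static final int DELETED = -1;   // marks a cell whose item was deleted
    final int[] cells;

    LinearProbeTable(int size) {
        cells = new int[size];
    }

    boolean insert(int key) {
        int index = key % cells.length;
        for (int probes = 0; probes < cells.length; probes++) {
            // An empty cell or a deleted cell can both take a new item
            if (cells[index] == EMPTY || cells[index] == DELETED) {
                cells[index] = key;
                return true;
            }
            index = (index + 1) % cells.length;
        }
        return false;   // table is full
    }

    // Returns the index of the key, or -1 if it is not in the table.
    int find(int key) {
        int index = key % cells.length;
        for (int probes = 0; probes < cells.length; probes++) {
            if (cells[index] == EMPTY) return -1;   // a truly empty cell ends the search
            if (cells[index] == key) return index;
            index = (index + 1) % cells.length;     // deleted cells are stepped over
        }
        return -1;
    }

    boolean delete(int key) {
        int index = find(key);
        if (index < 0) return false;
        cells[index] = DELETED;   // do not clear the cell, or later items become unreachable
        return true;
    }
}
```

For example, in a table of size 10 the keys 13, 23 and 33 land in cells 3, 4 and 5. After delete(13), find(23) still returns 4 because the search steps over the deleted marker in cell 3 instead of stopping there.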

Demonstration

Hash Applet: Delete

Code View

Hash Delete

Primary Clustering

When using linear probing, filled cells are not evenly distributed in our array.

Sometimes, there’s a sequence of empty cells.

Sometimes, there’s a sequence of filled cells.

Cluster - A sequence of filled cells.

Clustering can result in very long probe lengths. Therefore, getting to cells at the end of a sequence is slow.

The bigger the cluster, the faster it will grow.

Linear probing is not used very often because it suffers from too much primary clustering.

Avoiding Primary Clustering

If a hash table has many large clusters, the array may be too small.

Increasing the size of the array will help prevent further clustering.

This will require:
The creation of a new and larger hash table.
The copying of values from the old hash table to the new hash table.

Do not copy the values from the old hash table to cells that are next to each other. This will create 1 huge cluster. Instead, use the insert() method for the new hash table.

This process is called rehashing.
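A rough sketch of rehashing with linear probing follows. The helper names are made up, keys are assumed to be positive (0 means empty, -1 means deleted), and the new size of 11 in the example is just a convenient prime.

```java
// Sketch of rehashing: every item is re-inserted into a new, larger array.
class Rehash {
    // Linear-probing insert into the given array; returns false if it is full.
    static boolean insert(int[] table, int key) {
        int index = key % table.length;
        for (int probes = 0; probes < table.length; probes++) {
            if (table[index] <= 0) {          // empty (0) or deleted (-1)
                table[index] = key;
                return true;
            }
            index = (index + 1) % table.length;
        }
        return false;
    }

    static int[] rehash(int[] oldTable, int newSize) {
        int[] newTable = new int[newSize];
        for (int key : oldTable) {
            if (key > 0) {                    // skip empty and deleted cells
                insert(newTable, key);        // index is recomputed for the new size
            }
        }
        return newTable;
    }

    public static void main(String[] args) {
        int[] small = new int[5];
        insert(small, 5);                     // cell 0
        insert(small, 10);                    // cell 1 (collision at 0)
        insert(small, 15);                    // cell 2 (collisions at 0 and 1)
        int[] bigger = rehash(small, 11);
        System.out.println(java.util.Arrays.toString(bigger));
    }
}
```

In the small array the three keys form one cluster in cells 0 to 2. After rehashing into 11 cells they sit in cells 5, 10 and 4, because each index was recalculated rather than copied.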

Open Addressing - Quadratic Probing

Quadratic probing eliminates primary clustering.

In linear probing, when we try to insert and have a collision, we search sequentially for an empty cell:
1st index = x
2nd index = x + 1
3rd index = x + 2
4th index = x + 3

In quadratic probing, when we try to insert and have a collision, we search for an empty cell using this formula:
1st index = x
2nd index = x + 1^2 = x + 1
3rd index = x + 2^2 = x + 4
4th index = x + 3^2 = x + 9
5th index = x + 4^2 = x + 16

The index is increased until we find an empty cell.

This is called quadratic probing because the distance from the original index is the square of the step number.
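The probe sequence can be sketched as a small Java helper. The names are illustrative, and wrapping around with % at the end of the array is an assumption carried over from linear probing.

```java
// Sketch: the cell visited on each step of a quadratic probe.
class QuadraticProbe {
    // step 0 is the original index x, step 1 is x + 1, step 2 is x + 4, and so on
    static int probeIndex(int x, int step, int arraySize) {
        return (x + step * step) % arraySize;
    }

    public static void main(String[] args) {
        for (int step = 0; step < 5; step++) {
            System.out.print(probeIndex(3, step, 101) + " ");   // 3 4 7 12 19
        }
        System.out.println();
    }
}
```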

Secondary Clustering

Quadratic probing eliminates primary clustering.

However, its performance can still suffer if many items hash to the same index, because they all follow the same sequence of steps.

For example, if 184, 352, 973, 1352 and 1705 all hash to the same index, a probe for 1705 takes a long time.

This phenomenon is called secondary clustering.

Secondary clustering is not a serious problem.

Quadratic probing is not used very often because it can suffer from secondary clustering.

Open Addressing - Double Hashing

Double Hashing is better than Quadratic Probing.

Double hashing eliminates secondary clustering.
The step size is different for different keys.

The double hashing formula can be calculated faster than the quadratic probing formula.

The size of the step depends on the key, instead of the same sequence being used over and over again (1, 4, 9, 16, 25…).

This is done by hashing the key a second time, using a different hash function, and using the result as a step size.

The secondary hash function must follow these rules:
It must not be the same as the primary hash function.
It must never output a 0, because otherwise there would never be a step and the algorithm would be in a never-ending loop.

Experts have found that the following formula works well:

stepSize = constant – (KeyInBigArray % constant)

If the constant is 5, the step sizes will range from 1 to 5!
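A short Java sketch of the step size formula is given below. It is not the course's HashDouble code. The method names are made up, keys are assumed to be non-negative, and key % arraySize is assumed as the primary hash function.

```java
// Sketch of double hashing: the second hash gives a step size that depends on the key.
class DoubleHashStep {
    // stepSize = constant - (key % constant), always between 1 and constant
    static int stepSize(int key, int constant) {
        return constant - (key % constant);
    }

    // Cell visited on the given probe step (step 0 is the primary hash index)
    static int probeIndex(int key, int step, int arraySize, int constant) {
        int start = key % arraySize;                       // primary hash function
        return (start + step * stepSize(key, constant)) % arraySize;
    }

    public static void main(String[] args) {
        System.out.println(stepSize(10, 5));   // 5
        System.out.println(stepSize(11, 5));   // 4
        System.out.println(stepSize(14, 5));   // 1
    }
}
```

Two keys that hash to the same starting index usually get different step sizes, so they do not follow each other along the same path.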

Demonstration

HashDouble Applet

Code View

HashDouble

Open Addressing Hash Table Size

Double hashing requires that the table size be a prime number

A prime number is a number that cannot be evenly divided by any number other than 1 and itself: 2, 3, 5, 7, 11, 13, 17, 19, 23, etc.

A prime number is required to avoid a situation like this:
An array size is 15 (indices from 0 to 14).
A key hashes to 0 with a step size of 5.
This results in a never-ending step sequence: 0, 5, 10, 0, 5, 10…
If those three cells are full, an empty cell is never found and the program would crash.

Using a prime number makes it impossible for any step size to divide the array size evenly, so every remaining cell will be checked.
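The effect of the array size can be checked with a small sketch that records which cells a probe sequence visits before it starts repeating. The class and method names are illustrative.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch: which cells does a fixed step size reach for a given array size?
class ProbeCoverage {
    static Set<Integer> visited(int start, int stepSize, int arraySize) {
        Set<Integer> seen = new LinkedHashSet<>();
        int index = start;
        while (seen.add(index)) {                  // add returns false once a cell repeats
            index = (index + stepSize) % arraySize;
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(visited(0, 5, 15));         // [0, 5, 10]
        System.out.println(visited(0, 5, 13).size());  // 13, every cell is reached
    }
}
```

With an array size of 15 only 3 of the 15 cells are ever checked. With the prime size 13, the same step size reaches all 13 cells.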

Handling Collisions

There are 2 main ways to handle collisions:

1. Open Addressing – When a data item can’t be placed at a particular index, another location in the array is used
   Linear probing
   Quadratic probing
   Double hashing

2. Separate Chaining – When more than 1 data item needs to be placed at a particular index, linked lists are used

Separate Chaining

Separate chaining – When more than 1 data item needs to be placed at a particular index, linked lists are used.

The idea of separate chaining is easier to understand for many people.

However, it requires more code to implement the linked lists.
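A minimal sketch of separate chaining is shown below. It is not the course's HashChain code: to stay short it borrows java.util.LinkedList instead of implementing the linked lists by hand, and it assumes non-negative number keys with key % arraySize as the hash function.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Sketch of a separate chaining hash table: each cell holds a linked list of keys.
class ChainedTable {
    private final List<LinkedList<Integer>> cells = new ArrayList<>();

    ChainedTable(int size) {
        for (int i = 0; i < size; i++) {
            cells.add(new LinkedList<>());
        }
    }

    private LinkedList<Integer> listFor(int key) {
        return cells.get(key % cells.size());
    }

    void insert(int key) {
        listFor(key).addFirst(key);        // a collision just makes the list longer
    }

    boolean contains(int key) {
        return listFor(key).contains(key); // search only the list at this index
    }

    boolean delete(int key) {
        return listFor(key).remove(Integer.valueOf(key));
    }

    int chainLength(int index) {
        return cells.get(index).size();
    }
}
```

For example, in a table of size 7 the keys 10, 17 and 24 all hash to index 3, so that cell's list simply holds three items. Deleting 17 removes it from the list without affecting the other two.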

Demonstration

HashChain Applet

Code View

HashChain

Separate Chaining Hash Table Size

We know:

smallArrayIndex = largeKey % smallArraySize

Here, smallArraySize is the modulo value.

Later, we will cover why the modulo value must be a prime number.

Therefore, the size of the small array must also be a prime number.

Load Factors

Load Factor – The ratio of the number of items in a table to the table size:

loadFactor = numberOfItems / arraySize

In open addressing hash tables, performance degrades badly when the load factor is above 0.5.

In separate chaining hash tables, it is ok for load factors to be higher than 1. Finding the initial cell takes O(1) time, and searching through the list requires time proportional to the length of the list, which is O(N) in the worst case.

Thus, separate chaining hash tables are preferred over open addressing hash tables, especially when you don’t know in advance how much data will be in the hash table.
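In Java, the load factor formula needs one small precaution: dividing two int values throws away the fraction, so one of them should be converted to a double first. A sketch with an illustrative method name:

```java
// Sketch: computing the load factor without losing the fraction.
class LoadFactor {
    static double loadFactor(int numberOfItems, int arraySize) {
        // The cast forces floating-point division; 50 / 100 with ints would give 0
        return (double) numberOfItems / arraySize;
    }

    public static void main(String[] args) {
        System.out.println(loadFactor(50, 100));    // 0.5
        System.out.println(loadFactor(150, 100));   // 1.5, only sensible with separate chaining
    }
}
```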

Hash Functions

What makes a good hash function?

It must be quick to compute.
Addition is faster than multiplications, divisions and exponents.
A hash function with many multiplications, divisions and exponents is bad.

It must also produce values that are evenly distributed across the possible range of values.
Random distributions are even over the long run.

Random and Non-Random Keys

Random Keys

If our keys are random, our initial formula works well:

smallArrayIndex = largeKey % smallArraySize

Non-Random Keys

Often, we do not use random keys. For example, some companies may have an id like this:

033-400-03-94-05-5-535

Digits 0-2: Supplier number (1 to 999) (Currently up to 70)
Digits 3-5: Category code (1 to 999) (100, 150, 200, 250 & up to 850)
Digits 6-7: Month of introduction (1 to 12)
Digits 8-9: Year of introduction (00 to 99)
Digits 10-11: Serial Number (1 to 99)
Digit 12: Toxic risk flag (0 or 1)
Digits 13-15: Checksum (Sum of other fields, modulo 100)

In this case, many numbers may not be used.

How can we ensure that the hash function results will be truly random?

Non-Random Keys

Don’t use non-data.
Key fields should be reduced until every bit counts. For example, the category code should run from 0 to 15.
Also, the checksum is redundant, so remove it.

Use all the data.
Every part of the key that has real data should contribute to the key used in the hash function.

Always use a prime number for the modulo base.
If keys share a divisor with the array size, they may hash to the same location, causing clustering.
A prime number eliminates the possibility of this occurring.
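The effect of a prime modulo base can be seen with the category codes from the id example (100, 150, 200, … up to 850). The sketch below counts how many different indices those keys produce. The class name, method names and the two array sizes are chosen only for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: keys that share a divisor with the array size pile up on a few indices.
class PrimeModulo {
    // The category codes from the example: 100, 150, 200, ..., 850
    static int[] categoryCodes() {
        int[] codes = new int[16];
        for (int i = 0; i < codes.length; i++) {
            codes[i] = 100 + 50 * i;
        }
        return codes;
    }

    static int distinctIndices(int[] keys, int arraySize) {
        Set<Integer> used = new HashSet<>();
        for (int key : keys) {
            used.add(key % arraySize);
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println(distinctIndices(categoryCodes(), 100));   // 2
        System.out.println(distinctIndices(categoryCodes(), 101));   // 16
    }
}
```

With an array size of 100, all 16 codes land on index 0 or index 50. With the prime size 101, each code gets its own index.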

Folding - 1

Another good hash function involves folding. This means you divide the key into groups of digits and add the groups together. This ensures that all the digits influence the hash value.

For example, each US citizen is identified by a Social Security number:
975-27-8237
123-45-6789

First, pick the size you want your array to be:
Array size = 1000

smallArrayIndex = largeKey % smallArraySize
Therefore, use 1000 as the value used by the modulo.
Therefore, the largeKey must be big enough to give a big range of values when the modulo of 1000 is used on it.

When folding, we break the number up like this:

12 + 34 + 56 + 78 + 9 = 189
But this is not good, because sums of 2 digit groups are too small to reach most of the indices from 0 to 999. Using a modulo of 1000 with this number just gives us 189.

123 + 456 + 789 = 1368
This is better, because sums of 3 digit groups can reach every index from 0 to 999 once the modulo of 1000 is used.

Then, we get the remainder of a modulo operation to get our small array index:
1368 % 1000 = 368

The size of the array changes the digit group size.

Also, in real life, the smallArraySize would be a prime number. 1000 is used to make the example clear.

Folding - 2

If we want our array size to be 100, we break the number up like this when folding:

12 + 34 + 56 + 78 + 9 = 189

This is ok, because using a modulo of 100 with sums like this will give us a range of 0 to 99. Here, 189 % 100 = 89.

The size of the array changes the digit group size.
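Folding can be sketched in Java as follows. The names are illustrative. The sketch assumes the key is a non-negative number whose decimal digits are grouped from left to right, with a shorter last group if the digits do not divide evenly. A key with leading zeros would need to be kept as a String instead of a long.

```java
// Sketch of folding: split the key into digit groups, add the groups, then apply the modulo.
class Folding {
    static long fold(long key, int groupSize) {
        String digits = Long.toString(key);
        long sum = 0;
        for (int i = 0; i < digits.length(); i += groupSize) {
            int end = Math.min(i + groupSize, digits.length());
            sum += Long.parseLong(digits.substring(i, end));   // value of one digit group
        }
        return sum;
    }

    static int smallArrayIndex(long key, int groupSize, int smallArraySize) {
        return (int) (fold(key, groupSize) % smallArraySize);
    }

    public static void main(String[] args) {
        System.out.println(fold(123456789L, 3));                   // 1368
        System.out.println(smallArrayIndex(123456789L, 3, 1000));  // 368
        System.out.println(fold(123456789L, 2));                   // 189
        System.out.println(smallArrayIndex(123456789L, 2, 100));   // 89
    }
}
```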

Hashing Efficiency

If no collisions occur, insertion and searching in hash tables are O(1) time. This only involves a call to the hash function and a single array reference.

If a collision occurs, access times are the time described above, O(1), plus the probe length.

Probe length – How many times we need to search for a data item after the collision occurs.

As the load factor increases, the probe length increases.

Open Addressing Linear Probing Performance

The loss of efficiency with high load factors is more serious for open addressing than separate chaining

Unsuccessful searches generally take longer.

During a successful probe sequence, the algorithm can stop as soon as it finds the desired item, which is, on average, halfway through the probe sequence.

During an unsuccessful probe sequence, the algorithm must search the entire sequence before it’s sure the item is not found.

Open Addressing Quadratic Probing and Double Hashing Performance

Quadratic probing and double hashing performance is the same

The performance is better than linear probing

Higher load factors can be tolerated for quadratic probing and double hashing than linear probing

Separate Chaining Performance

A load factor of 1.0 is fairly common

Smaller load factors do not improve performance significantly

The time for all operations increases linearly with load factor

Open Addressing vs. Separate Chaining

Generally, if you use open addressing, use double hashing as it is better than linear probing and quadratic probing.

If you don’t know how many items will be inserted into a hash table, then use separate chaining.
Increasing the load factor causes major performance problems with open addressing.
Increasing the load factor degrades performance linearly with separate chaining.

When in doubt, use separate chaining.
It is more work at first.
But the reward is that adding more data won’t degrade performance too badly.

Using HashTables in Java 1

Each string has a hashCode method.

Some hash codes are the same!

Using HashTables in Java 2

The hashcode for a String object is computed as:

s[0]*31^(n-1)+s[1]*31^(n-2)+...+s[n-1]

Where s[i] is the ith character of a string of length n

The hash value of an empty string is defined as zero
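The formula can be checked against the real hashCode method with a small sketch. The manualHash helper is made up for illustration. It multiplies the running total by 31 before adding each character, which is equivalent to the formula above, and it uses int arithmetic, so long strings overflow and wrap around just as hashCode does.

```java
// Sketch: compute the String hash formula by hand and compare it with hashCode().
class StringHashDemo {
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);   // s[0]*31^(n-1) + ... + s[n-1], built step by step
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(manualHash("CATS") == "CATS".hashCode());   // true
        System.out.println("".hashCode());                             // 0
        System.out.println("Aa".hashCode());                           // 2112
        System.out.println("BB".hashCode());                           // 2112, same as "Aa"
    }
}
```

"Aa" and "BB" are a simple example of two different strings with the same hash code: 65 * 31 + 97 and 66 * 31 + 66 both equal 2112.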

Demonstration

hashCode Method

Using HashTables in Java 3

A Hashtable object exists in Java.

The hashCode method is used to get the index at which strings are inserted in a Hashtable object.

It uses separate chaining.

Put string into hashtable

Get string from hashtable

Demonstration

Using a Hashtable object
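A small sketch of putting strings into a java.util.Hashtable and getting them back is shown here. The words and definitions are made-up sample data, and the class and method names are illustrative.

```java
import java.util.Hashtable;

// Sketch: storing and retrieving strings with java.util.Hashtable.
class HashtableDemo {
    static Hashtable<String, String> buildDictionary() {
        Hashtable<String, String> dictionary = new Hashtable<>();
        // Put string into hashtable: the key's hashCode decides where it is stored
        dictionary.put("cat", "a small furry animal");
        dictionary.put("hash", "a number generated from another value");
        return dictionary;
    }

    public static void main(String[] args) {
        Hashtable<String, String> dictionary = buildDictionary();
        // Get string from hashtable
        System.out.println(dictionary.get("cat"));    // a small furry animal
        System.out.println(dictionary.get("dog"));    // null, the key is not in the table
    }
}
```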
