Copyright © 2002-2010 Curt Hill
Hashing: Key Transformation

What is a hash?
• Hashing is another name for key transformation
• The original key is usually a character string or other sparse key
• The result is usually a dense integer key

Example
• Suppose we have a three-digit integer key
  – Not every key is used
• Fewer (by definition) than 1000 items
• What would be a good structure for storing and searching these items?
• Clearly an array
  – Dimension should be 0..999
  – Mark empty slots in some way

Complication
• Now suppose we still have fewer than 1000 keys, but the key is a name
• Suppose the key is a 10-character name
• Then there are 26^10 ≈ 1.4 x 10^14 (about 141 trillion) possibilities
• A little bit large for memory
• Sparse coverage
  – Fewer than 1000 are used, a ratio of about 1 : 1.4 x 10^11

What are the alternatives?
• Must we resort to a tree or linked list?
• That is a dynamic data structure
• Or perhaps an array that is sorted on key name
• Or may we somehow transform that name into an integer key?
• Key transformation, aka hashing, aka scatter storage, does just that
• That is, it transforms the name into an integer in the range 0..999

Hashing Components
• An array (or vector) that holds the data
• A hash function that transforms the key into an integer in the correct range
• A set of functions that add to, remove from, and search the array using the hash function
• A collision strategy

One Example Function
• If the key is numeric, such as a product number, use just the bottom 3 digits
• If the bottom portion has a digit that does not span all the possibilities there may be a problem
  – Such as 0 meaning original, 1 first replacement
  – However, any three digits may be used

Another Example Function
• Multiply the ordinal values of the characters of the key
• Divide by 1000, keeping the remainder
• Should work on any character key
• See next screen for code

Example
• Assume the following
    char key[10];
• Use this code
    int index = key[0]*key[1]*…*key[9];
    index = index % 1000;

Another hash function
• Input is a character string
• Output is an integer in the range 0..N-1
• Sum the ordinal value of each character of the string
• Divide the result by N and take the remainder
  – This guarantees the right range
• Number theory tells us to make N prime

Example values with N = 256
• “Abcdef” returns 53
• “Hi there” returns 233
• “ABCDEF” returns 149
• “FEDCBA” returns 149
• “A character string” returns 197
• “A big character string” returns 23
• “Zoology” returns 243
• See the pattern yet?

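The values listed above can be reproduced with a short sketch of the summing function. The name `sum_hash` is made up for illustration, and the sketch assumes plain ASCII character codes.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: sum the character codes of `key` and keep the
// remainder after dividing by n, which gives an index in 0..n-1.
int sum_hash(const std::string& key, int n) {
    assert(n > 0);
    unsigned long sum = 0;
    for (unsigned char c : key) {
        sum += c;  // ordinal value of the character
    }
    return static_cast<int>(sum % static_cast<unsigned long>(n));
}
```

Calling `sum_hash("ABCDEF", 256)` and `sum_hash("FEDCBA", 256)` both give 149, because a sum does not depend on the order of the characters.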
Alternatively
• Maul the string into integers
• Operate on them
• Divide and keep the remainder
    union { char key[12]; int ints[3]; } u;
    …
    strcpy(u.key, key);  // key must fit in 12 bytes, including the null character
    ndx = u.ints[0] * u.ints[1] / u.ints[2] % 1000;  // u.ints[2] must not be zero

Commentary
• With strings, beware of things past the null character
• For reasons observable from number theory, the best strategy is to mod by a prime number
• Empty space is usually left in the table to ease the computation of the hash function

Variations
• Adding or multiplying the ordinal values of characters will make transpositions give the same result
  – ABC and BAC will map to the same integer
  – This is known as a collision
• One approach is to multiply each ordinal value by its position and sum the results:
    int index = 0;
    for (int i = 0; i < keylen; i++)
        index += key[i] * (i + 1);  // positions start at 1 so the first character counts
    index = index % 1000;

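The transposition problem can be checked with a small sketch. Both function names are hypothetical, and the weighting by 1-based position is one possible choice rather than the only one.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Plain sum of character codes, reduced to 0..n-1.
int plain_sum_hash(const std::string& key, int n) {
    assert(n > 0);
    long index = 0;
    for (unsigned char c : key) index += c;
    return static_cast<int>(index % n);
}

// Position-weighted sum: each character code is multiplied by its
// 1-based position, so swapping two different characters changes the
// sum before it is reduced.
int weighted_sum_hash(const std::string& key, int n) {
    assert(n > 0);
    long index = 0;
    for (std::size_t i = 0; i < key.size(); ++i) {
        index += static_cast<unsigned char>(key[i]) * static_cast<long>(i + 1);
    }
    return static_cast<int>(index % n);
}
```

With n = 1000, the plain sum maps both ABC and BAC to 198, while the weighted sum gives 398 and 397. Long keys can still collide after the remainder is taken.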
Distributions
• The distribution of the hashed keys will appear to be random compared to the original key values
• What we would like is for the keys to be uniformly scattered throughout the index space
• Often the hash function is tailored for the data at hand
  – With much experimentation to get a good distribution

Illustration
[Figure: the hash function maps the character key Aaron to index 2. Table entries, shown as character key (hash value): Otter (0), Aaron (2), Smith (5), Butler (6), Lawson (8).]

What problems exist?
• Data skew
• Collisions: two keys map into one array index
• A collision strategy is how to handle this
• A hash function that does not map two keys into one index is called a perfect hash function
• Depending on the collision strategy, deletions may be problematic

Data Skew
• A good hash function spreads the keys uniformly over the integer range
• How well does this work when the data is not uniformly distributed?
• For example consider
  – Names: there are many more Smiths than Garnjobsts
  – Numbers: there are many more 101 courses than most other numbers
• Can the hash function still give a good distribution?

Observations
• If the hash function is good, then the larger (more sparse) the table, the better the likelihood of avoiding a collision
• We need to be able to tell the difference between a used and an unused slot
• Since a hash table will necessarily have empty space in it, it is almost always undesirable to store the whole record there

What are we storing?
• What we do is make it an array of pointers, where the pointers point to the actual records
  – Each array item wastes only a pointer's worth of bytes (4 on Win32)
  – All the items are in heap storage
• The initialization then becomes setting all slots to NULL
  – An unused slot is NULL
  – Alternative is a boolean in the table specifying used

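A minimal sketch of such a pointer table follows. The `Record` type, the `PointerTable` name, and the size of 1000 are assumptions made for illustration, and collisions are deliberately not handled here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical record type; the slides do not define one.
struct Record {
    std::string key;
    int value;
};

constexpr std::size_t kTableSize = 1000;  // assumed size, matching 0..999

// The table stores only pointers; a null pointer marks an unused slot.
struct PointerTable {
    std::array<Record*, kTableSize> slots{};  // value-initialized: all null

    bool is_used(std::size_t index) const {
        assert(index < kTableSize);
        return slots[index] != nullptr;
    }

    // Store a pointer at an index that has already been computed.
    void put(std::size_t index, Record* record) {
        assert(index < kTableSize);
        slots[index] = record;
    }
};
```

The records themselves live elsewhere (for example on the heap), so an empty slot costs only one pointer.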
Collision Strategies
• Only limited by your creativity
• Here are four that are commonly used:
• Linear probing
• Quadratic probing
• Rehashing
• Overflow areas

Linear probing
• Add one (modulo size) to the index until an empty slot is found
• Tends to cluster (a problem with most strategies)
• Clusters are long strings of adjacent indexes that are filled up
• These need to be searched sequentially until an empty cell is found
• This sequential search ruins the inherent quickness of hashing if the groups get too long

Clustering
• A long series of filled slots
• Clustering is often a sign of a poor hash function or a table too close to full
• When the overflow of one key overlaps another, the problem becomes compounded

Linear Probing
[Figure: table entries, shown as character key (hash value): Otter (0), Aaron (2), Smith (5), Butler (6), Lawson (8), with Matthew (2) and Taylor (3) then added by linear probing.]
Taylor could have gone in slot 3 but Matthew was already there.
Once this exists any key between 2 and 6 will end in slot 7.

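The situation in the illustration can be replayed with a small sketch of linear probing. The class name and the table size of 10 are assumptions. The caller passes the hash value in, so the hash values from the figure can be used directly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

class LinearProbeTable {
public:
    explicit LinearProbeTable(std::size_t size) : keys_(size), used_(size, false) {
        assert(size > 0);
    }

    // Insert `key` starting at its home slot; returns the slot actually
    // used, or -1 if the table is full.
    int insert(const std::string& key, std::size_t hash) {
        std::size_t index = hash % keys_.size();
        for (std::size_t probes = 0; probes < keys_.size(); ++probes) {
            if (!used_[index]) {
                keys_[index] = key;
                used_[index] = true;
                return static_cast<int>(index);
            }
            index = (index + 1) % keys_.size();  // add one, modulo size
        }
        return -1;
    }

    // Search sequentially from the home slot until the key or an empty
    // slot is found; returns the slot, or -1 if the key is absent.
    int find(const std::string& key, std::size_t hash) const {
        std::size_t index = hash % keys_.size();
        for (std::size_t probes = 0; probes < keys_.size(); ++probes) {
            if (!used_[index]) return -1;
            if (keys_[index] == key) return static_cast<int>(index);
            index = (index + 1) % keys_.size();
        }
        return -1;
    }

private:
    std::vector<std::string> keys_;
    std::vector<bool> used_;  // marks which slots are occupied
};
```

Inserting the seven names in the order shown puts Matthew in slot 3 and Taylor in slot 4, after which a new key with hash value 2 lands in slot 7.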
Quadratic probing
• Instead of adding 1 to the index, add the square of the probe number (1, 4, 9, …)
  – Modulo size of table
• Keeps x and x+1 from mapping to the same area

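One common way to compute the probe sequence is sketched below, assuming the offset is the square of the attempt number. The function name and parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Slot to try on the given attempt (0 = home slot), assuming the offset
// from the home slot is the square of the attempt number.
std::size_t quadratic_probe(std::size_t home, std::size_t attempt, std::size_t size) {
    assert(size > 0);
    return (home + attempt * attempt) % size;
}
```

With a table of 11 slots, a key whose home slot is 2 tries 2, 3, 6, 0, while a key whose home slot is 3 tries 3, 4, 7, 1, so neighboring home slots separate after the first step. Unlike linear probing, this sequence is not guaranteed to visit every slot.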
Rehashing
• When a collision occurs with the first hash function, use a different one
  – AKA secondary hashing
• This doubles the difficulty, since we now have to come up with two hash functions which do not conflict with each other
• If the first hash function is good, there will be comparatively few collisions

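A rough sketch of rehashing with two hash functions follows. Both hash functions are arbitrary choices made for illustration, an empty string is assumed to mark an unused slot, and the sketch simply gives up if the second slot is also taken.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// First hash (assumed for illustration): sum of character codes.
std::size_t first_hash(const std::string& key, std::size_t size) {
    std::size_t sum = 0;
    for (unsigned char c : key) sum += c;
    return sum % size;
}

// Second hash (assumed for illustration): position-weighted sum.
std::size_t second_hash(const std::string& key, std::size_t size) {
    std::size_t sum = 0;
    for (std::size_t i = 0; i < key.size(); ++i) {
        sum += static_cast<unsigned char>(key[i]) * (i + 1);
    }
    return sum % size;
}

// Returns the slot used, or -1 if both candidate slots are taken.
// An empty string marks an unused slot, so keys must be non-empty.
int insert_with_rehash(std::vector<std::string>& table, const std::string& key) {
    assert(!table.empty() && !key.empty());
    std::size_t index = first_hash(key, table.size());
    if (!table[index].empty()) {
        index = second_hash(key, table.size());  // collision: use the other function
        if (!table[index].empty()) return -1;
    }
    table[index] = key;
    return static_cast<int>(index);
}
```

In a table of 1000 slots, ABC goes to slot 198; BAC collides there under the first function and is placed at 397 by the second.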
Overflow areas
• AKA chaining
• When a collision is detected upon an add, remove both items from the table
• Mark the slot with a special entry that redirects to another data structure
  – List, tree, or another array or vector

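Chain to Overflow
[Figure: a hash table with a separate overflow area; the marker OVER appears in place of colliding entries and redirects to the overflow area. Labels, shown as character key (hash value): Otter (0), Smith (5), Butler (6), Lawson (8), Taylor (3), Aaron (2), Matthew (2), Jones (5).]
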
Chain to List
[Figure: a hash table whose OVER markers redirect to lists in the overflow area. Labels, shown as character key (hash value): Otter (0), Smith (5), Butler (6), Lawson (8), Taylor (3), Aaron (2), Matthew (2), Jones (5).]

More on chaining
• Keeps the collisions from further degrading the hash table
• If the hash table is large then we get a good split, so the lists are short
• Also good if duplicate keys are allowed
• There is no systematic order of the items in any hash table
  – Thus sorting is sometimes needed after the fact

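A simplified sketch of chaining is shown below. It is a variation on the scheme described above: instead of an OVER marker that redirects to an overflow area, every slot directly holds a (usually short) list. The class and method names are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Simplified chaining: every slot holds a list of the keys that hashed there.
class ChainedTable {
public:
    explicit ChainedTable(std::size_t slots) : slots_(slots) {
        assert(slots > 0);
    }

    // Colliding keys simply share the list in their slot.
    void insert(const std::string& key, std::size_t hash) {
        slots_[hash % slots_.size()].push_back(key);
    }

    bool contains(const std::string& key, std::size_t hash) const {
        const std::list<std::string>& chain = slots_[hash % slots_.size()];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    std::size_t chain_length(std::size_t hash) const {
        return slots_[hash % slots_.size()].size();
    }

private:
    std::vector<std::list<std::string>> slots_;
};
```

A collision only lengthens one list, so it does not push other keys out of their home slots the way probing does.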
Deletion
• There are challenges to deleting in a hash table
  – Making a slot empty may prevent finding an item that actually exists
• Depends on collision strategy
• Linear probing
  – Rehash everything in the cluster around the deletion, from the nearest empty slot before it to the nearest empty slot after it
• Similar things may be needed with others

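One way to delete under linear probing is sketched here, assuming non-negative integer keys, a home slot of key % size, and -1 as the empty marker. It re-inserts only the keys that follow the deleted slot in the same cluster, which is enough to keep every remaining key reachable. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int kEmpty = -1;  // assumed marker for an unused slot

bool insert_key(std::vector<int>& table, int key) {
    assert(key >= 0 && !table.empty());
    std::size_t index = static_cast<std::size_t>(key) % table.size();
    for (std::size_t probes = 0; probes < table.size(); ++probes) {
        if (table[index] == kEmpty) { table[index] = key; return true; }
        index = (index + 1) % table.size();
    }
    return false;  // table is full
}

// Returns the slot holding `key`, or -1 if it is not in the table.
int find_key(const std::vector<int>& table, int key) {
    assert(key >= 0 && !table.empty());
    std::size_t index = static_cast<std::size_t>(key) % table.size();
    for (std::size_t probes = 0; probes < table.size(); ++probes) {
        if (table[index] == kEmpty) return -1;
        if (table[index] == key) return static_cast<int>(index);
        index = (index + 1) % table.size();
    }
    return -1;
}

bool remove_key(std::vector<int>& table, int key) {
    int found = find_key(table, key);
    if (found < 0) return false;
    std::size_t index = static_cast<std::size_t>(found);
    table[index] = kEmpty;
    // Re-insert each key after the gap, up to the next empty slot, so that
    // no search stops early at the slot that was just emptied.
    index = (index + 1) % table.size();
    while (table[index] != kEmpty) {
        int moved = table[index];
        table[index] = kEmpty;
        insert_key(table, moved);
        index = (index + 1) % table.size();
    }
    return true;
}
```

With 10 slots and keys 2, 12, 22, 3 inserted in that order, removing 12 moves 22 into slot 3 and 3 into slot 4, and both can still be found.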
Other Considerations
• Size of table is fixed
• We must have a good prediction of the number of keys or waste much space
• Otherwise we risk catastrophic table overflow
• Even if we have a good prediction, the table should be larger than required

Analysis
• Disregarding collisions, hashing is clearly O(1)
• How likely are collisions?
• Birthday collisions example
• Statistics from the analysis in Algorithms + Data Structures = Programs follow

The Birthday Paradox
• Most people think that you need a big room with many people before two will have the same birthday
• If 23 people are in the same room there is a better than 50% chance that two will have the same birthday
• Why is this number so low?
• The probability builds on the earlier probabilities

Consider
• If there is one person the probability is zero
• If one additional, then 1 in 365
• The third person has 2 chances in 365, plus the chance of the first 2 having the same birthday
• The twenty-third person has 22 chances in 365, in addition to the previous probability of two of the twenty-two having the same birthday
• This is about .51

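The figure for 23 people can be checked with a short computation. The sketch assumes 365 equally likely birthdays and works from the complement: the chance that everyone's birthday is different. The function name is made up.

```cpp
#include <cassert>

// Probability that at least two of `people` share a birthday, assuming
// 365 equally likely birthdays.
double shared_birthday_probability(int people) {
    assert(people >= 0);
    if (people > 365) return 1.0;  // more people than days: a match is certain
    double all_distinct = 1.0;
    for (int k = 0; k < people; ++k) {
        // Person k+1 must avoid the k birthdays already taken.
        all_distinct *= (365.0 - k) / 365.0;
    }
    return 1.0 - all_distinct;
}
```

For 22 people the result is still below one half; the 23rd person pushes it just past 50%.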
What do we learn from this?
• The likelihood of a collision is much greater than we intuitively believe
• We will always have collisions unless we go to the large amount of work of finding a perfect hash
• Let's consider some empirical data

Using Optimal Rehashing
  – Load factor = number of keys / number of slots

Load Factor   Probes
0.1           1.05
0.25          1.15
0.5           1.39
0.75          1.85
0.9           2.56
0.95          3.15
0.99          4.66

Using Linear Probing

Load Factor   Probes
0.1           1.06
0.25          1.17
0.5           1.50
0.75          2.50
0.9           5.50
0.95          10.50

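The tabulated probe counts are consistent with two closed-form estimates, written here with a for the load factor: -ln(1 - a) / a for optimal rehashing and (1 - a/2) / (1 - a) for linear probing. That these are the formulas behind the tables is an assumption of this sketch; they reproduce the listed values to within rounding. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Estimated probes per lookup under the optimal rehashing model:
// -ln(1 - a) / a, for a load factor a strictly between 0 and 1.
double probes_optimal(double load) {
    assert(load > 0.0 && load < 1.0);
    return -std::log(1.0 - load) / load;
}

// Estimated probes per lookup under linear probing:
// (1 - a/2) / (1 - a), for a load factor a in [0, 1).
double probes_linear(double load) {
    assert(load >= 0.0 && load < 1.0);
    return (1.0 - load / 2.0) / (1.0 - load);
}
```

Both estimates depend only on how full the table is, not on how many keys it holds, and the linear probing estimate grows much faster as the load factor approaches 1.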
Types of Hash Techniques
• Mapping
• Folding
• Shifting
• Pseudo Random Numbers
• Casts

Mapping
• Convert items (usually characters or integers) into values
• The character value itself will often not be acceptable
• Preferred method: create a vector that is subscripted by the character values
  – Each character value returns another value
  – By modifying these we can change the hash function

Folding
• Treat pieces of the key as if they were integers and compute the key
• Example:
  – Suppose an 8-character key that is always there
  – An int is 4 characters long
  – Form a union that has the 8 characters for one value and an array of 2 ints or 4 short ints as the other:

Folding Example
• Consider:
    union { char c[8]; int i[2]; short s[4]; } smash;
• Move the 8 characters into the c part and then take out the 2 ints or 4 short ints
• Do some computation on the ints or short ints

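A sketch of folding an 8-character key into two integers follows. It swaps the union for `std::memcpy`, because reading a union member other than the one last written is not well defined in C++. The XOR combination and the name `fold_hash` are arbitrary choices for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Fold up to 8 characters of `key` into two 32-bit integers and combine them.
unsigned fold_hash(const std::string& key, unsigned table_size) {
    assert(table_size > 0);
    char c[8] = {0};  // zero-filled, so short keys leave no debris
    std::size_t n = key.size() < sizeof c ? key.size() : sizeof c;
    std::memcpy(c, key.data(), n);
    std::uint32_t i[2];
    std::memcpy(i, c, sizeof c);        // view the 8 characters as two ints
    return (i[0] ^ i[1]) % table_size;  // one possible computation on the ints
}
```

The integer values depend on the byte order of the machine, so the same key may hash differently on different hardware. A key such as ABCDABCD, whose two halves are identical, always folds to 0 under XOR, which shows why the combining step needs as much care as the folding.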
Shifting
• Using the shift instructions, shift the values right to make the high-order bits more prominent
• For instance 'a'..'z' are from 97..122
• Shifting can bring this range down some

Pseudo Random Numbers
• Computer random number generators are not random
  – Rather they are a sequence of numbers based on an algorithm, usually based on overflow
• The seed of the sequence determines where it starts
• Use your value as the seed and then call the random number generator
• This mechanism will often generate integers in a particular range as well

Casts
• You may use a cast to turn anything into a character
• Often the same as folding
• Beware of short character strings!
  – Suppose you have "hi" in char x[8]
  – Do not use anything past x[2] in casts or folding
  – The debris past the null character may change from time to time
  – Using it makes your hash function unreliable

Minimal and Perfect Hashes
• A perfect hash function has no collisions
  – Life is simpler when there is no need for a collision strategy
• A hash function that computes indexes where the number of keys is the same as the number of entries is called minimal
  – No space is wasted in the hash table

Perfect Hash Requirements
• Know the keys in advance
• Key is of a constant size and makeup
• The table size must be greater than or equal to the number of keys
• The closer the table size is to the number of keys, the harder the function is to find
  – Although not necessarily to compute

Constructing a Hash
• Choose the hash method
• Bring down to range
• Choose a collision strategy

Choose the hash method
• Recall the methods:
  – Mapping, folding, shifting, random numbers, casting
  – Any combination or something you make up
• Decision is strongly influenced by the form and type of the key

Range
• Second step: bring down to the right range
• If the computation was a fold and the values were originally characters, then the resulting range is limited mostly by the variable type
• Could be positive or possibly negative
• Usually we do integer division and keep the remainder
  – The divisor is best if prime

Range
• We often choose the largest prime number smaller than the table
• Shifts
  – Table size is a power of two:
  – res = num >> 20;
  – Leaves twelve bits (for a 32-bit num), which is 0 - 4095
• Bitwise logical operations
  – And can mask out high-order bits
  – Table size will be a power of two
  – res = num & 0x000000ff;
  – Forces range 0 – 255

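The three ways of bringing a value into range can be compared side by side. This sketch assumes an unsigned 32-bit value, which avoids the negative results mentioned earlier; the function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// Remainder by a prime: result is in 0..prime-1.
std::uint32_t reduce_by_prime(std::uint32_t num, std::uint32_t prime) {
    assert(prime > 0);
    return num % prime;
}

// Shift right by 20: keeps the top twelve of 32 bits, result in 0..4095.
std::uint32_t reduce_by_shift(std::uint32_t num) {
    return num >> 20;
}

// Mask with 0xff: keeps the bottom eight bits, result in 0..255.
std::uint32_t reduce_by_mask(std::uint32_t num) {
    return num & 0x000000ffu;
}
```

For a table of 1000 slots the largest smaller prime is 997, so `reduce_by_prime(num, 997)` uses slots 0..996. The shift keeps only high-order bits and the mask keeps only low-order bits, so each works well only when those bits vary across the keys.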
Collision Strategy
• The simplest and worst is linear probing
  – Do not use this if the hash table will be close to full
  – May be acceptable if table is less than 75% full
• Chaining is the best when the number of entries is least well known
  – Also can degenerate to O(N) easily

In search of the minimal perfect hash
• Holy Grail for searches
  – What is better than O(1) search?
• Actually we will be happy if we find any good hash
• This task is never really easy
• Always requires programmer intervention
• A perfect or minimal perfect hash requires some luck to find, besides the requirements mentioned previously

Searching for Perfect Hash
• Generate a hash function that uses one or more characters from the items to be hashed, as well as perhaps the length of the item
• Have it use a lookup table so that we can assign values to the characters
• Letters not among the used characters are usually zeroed
• Using some systematic procedure, set the map values

Searching continued
• Attempt to hash all the items
  – Keep track of all the character maps
• At the first collision
  – If there is a map that was only used by one or both of the collidees, alter it to prevent the collision and continue from there
  – Otherwise alter one of the maps present in the two items and go back to attempting the hash
• If no collisions remain unresolved, the hash is successful

Case study
• I have such a searcher: perfhash
• Finds a perfect hash
• Uses several algorithms
• Written in C++

PerfHash algorithms
• 0 – Multiply characters and divide by length
• 1 – Add characters times square of position then multiply by length
• 2 – Fold using short ints, take product
• 3 – Fold using ints, alternately add then subtract
• 4 – Fold then use as seed for random number generator

Results

           Keys   0     1     2     3     4
C          32     151   293   139   337   227
C++        35     281   307   389   487   577
Pascal     40     251   251   271   151   277
Modula2    48     301   367   317   397   193
Java       51     577   431   337   331   463

Finally
• When hashing works well, it is hard to beat
  – Stable index
  – Good distribution from the hash function
• It is much harder to make dynamic than trees
• Fiddling with the hash function can make it better or worse in unpredictable ways