CS503: Thirteenth Lecture, Fall 2008 Hash Tables Michael Barnathan.


Here’s what we’ll be learning:
• Theory:
– Keys and values.
– What constitutes a good hash function?
• Data Structures:
– Hash Tables.
• Collision Resolution:
– Chaining
– Open Addressing / Linear Probing
• Perfect Hashing
• Cuckoo Hashing

Review: Arrays and Random Access
• Let’s review arrays for a moment:
• A size n array is indexed by a contiguous set of integers from 0 to n-1.
• Because the array is contiguous in memory, accessing any element of it can be performed in constant time. This is random access.
• If the index actually represents something about the dataset, we can use this to access desired elements in constant time.
• For example, asking “who is the 4th person up to bat?” in a baseball roster.
– Answer: roster[3] (remember, indices start at 0).
– This is an O(1) operation.

(Figure: the roster Bill Gates, Andrew Jackson, Henry Purcell, Reggie White, captioned “Worst team ever.” The fourth entry is labeled “I’m fourth!”)

Keys and Values
• An index is an example of a numeric key into the array.
• A key is an attribute or combination of attributes by which each record is identified.
– Arr[3] identifies the fourth element in the array. In this case, the key is simply an element’s position in the array.
• But we can also identify records by attributes such as employee names and salaries.
– These don’t map too well to array indices.
• The value of an element is the data accessed by the key.
– For example, if Arr[3] were an Employee, “3” is the key and the resulting Employee object is the value.
• A container that maps directly between keys and values is called a Map (surprise!) or an associative array.

Arrays’ Shortcomings
• Arrays work well if keys are contiguous integers.
– Years in a calendar, for example.
• However, what if we have a non-numeric key?
– In every data structure we’ve discussed so far, we have no choice but to search for it, which is an Ω(log n) operation.

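(Figure: Eve, Mallory, Bob, Alice, Trudy, John, and Charlie stored in no particular order. A searcher calls “John? John?” and the entries answer “I’m over here!”, “Who?”, “Don’t look at me!”, and “No one ever listens to me…”)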

Mapping Data

• Idea: What if we could map the word “John” to an array index somehow?
– “John” -> 5. Arr[5] = …
• Then finding “John” becomes equivalent to mapping “John” to 5 and accessing Arr[5].
• Arrays are random-access, so this is O(1).
• Obvious question: How do we turn “John” into 5? Why 5 and not 6?
• Less obvious question: What if “Bob” also maps to 5? What happens then?

Maps and Mathematical Functions

• Go waaay back and think about the first time you heard the word “function”.

• It was something that took input and transformed it into output.

(Figure: the inputs 2, 3, 4 pass through f(x) = 2x and come out as 4, 6, 8.)

Maps and Mathematical Functions

• So if we can do that, why not this?

(Figure: the inputs Alice, Bob, John pass through a black box h(x) and come out as 0, 1, 2.)

Hashing: The Idea

• We call the process of transforming input with a function and using the result as an index hashing.

• This allows us to use strings or other objects as keys.

(Figure: h(x) maps Alice, Bob, John to indices 0, 1, 2 of the array double[] Salaries, which holds 50,000, 25,000, and 75,000.)

Salaries[“Alice”] = 50000

Salaries[“Bob”] = 25000

Salaries[“John”] = 75000
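The picture above can be sketched in Java roughly as follows. The slides leave h(x) as a black box, so the h() here (Java's built-in String.hashCode() reduced modulo the table size) and the table size of 3 are assumptions made only for illustration.

```java
// Sketch only: h() and TABLE_SIZE are illustrative assumptions, not the lecture's h(x).
final class SalaryTableDemo {
    static final int TABLE_SIZE = 3;

    // Map a name to an index in [0, TABLE_SIZE).
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int h(String name) {
        return Math.floorMod(name.hashCode(), TABLE_SIZE);
    }

    public static void main(String[] args) {
        double[] salaries = new double[TABLE_SIZE];
        salaries[h("Alice")] = 50000;
        salaries[h("Bob")] = 25000;
        salaries[h("John")] = 75000;
        // Lookup is a hash computation followed by one array access.
        System.out.println(salaries[h("John")]);
    }
}
```

Nothing in this sketch handles two names landing on the same index; that is what the collision resolution slides address.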

The Hash Function

• We call h(x) a hash function.
• Any function that maps the input type to something suitable for indexing may be used.
– In Java, this means we are mapping from Object to int.
– In fact, every Java class has a built-in function called:
int hashCode()
– This function is defined in the Object class, which means every object has a default one.
– It also means you can override it in your own objects.
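As a minimal sketch of overriding hashCode(), consider a hypothetical Employee class (the class and its fields are assumptions, not part of the lecture). Java's contract requires that objects which are equal according to equals() return the same hash code, so the two methods are overridden together.

```java
import java.util.Objects;

// Hypothetical Employee class, used only to illustrate overriding hashCode().
final class Employee {
    private final String name;
    private final int id;

    Employee(String name, int id) {
        this.name = name;
        this.id = id;
    }

    // Two employees are equal when both fields match.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Employee)) return false;
        Employee e = (Employee) other;
        return id == e.id && name.equals(e.name);
    }

    // Built from the same fields as equals(), so equal objects hash alike.
    @Override
    public int hashCode() {
        return Objects.hash(name, id);
    }

    public static void main(String[] args) {
        Employee a = new Employee("Alice", 1);
        Employee b = new Employee("Alice", 1);
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode());
    }
}
```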

Good Hash Functions
1. A hash function must be deterministic: it must always return the same value for the same input.
2. Good hash functions distribute their output as uniformly as possible to minimize the number of “collisions”: two different input values that hash to the same output.
  1. If every distinct input value is mapped to a distinct output value, the function is called injective, or one-to-one. This is the ideal.
  2. If the space of possible inputs is larger than the space of possible outputs, this ideal is impossible (due to the pigeonhole principle: if you put n+1 objects in n holes, at least one hole must have more than one object in it).
3. Because the hash function is computed on every access of the hash table, good hash functions execute very quickly.

http://en.wikipedia.org/wiki/Injective_function

http://en.wikipedia.org/wiki/Pigeonhole_principle
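A concrete collision is easy to find in Java itself. String.hashCode() is specified as s[0]·31^(n-1) + … + s[n-1], and there are far more possible strings than int values, so by the pigeonhole principle some strings must share a hash code. The pair below is one small example; the class name is only illustrative.

```java
// Two different strings with the same hashCode().
final class CollisionDemo {
    public static void main(String[] args) {
        // 'A' = 65, 'a' = 97: 65 * 31 + 97 = 2112
        System.out.println("Aa".hashCode());
        // 'B' = 66:           66 * 31 + 66 = 2112
        System.out.println("BB".hashCode());
    }
}
```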

The Birthday Paradox
• If the range of possible inputs is larger than the range of possible outputs, it is impossible to obtain an ideal hash function due to the pigeonhole principle (know this principle).
• However, even if this is not the case, it is still unlikely that a uniform hash function will avoid collisions.
• This is due to the birthday paradox:
– This just refers to the counterintuitive notion that it is highly likely that two people in a relatively small group share the same birthday.
– Assuming a uniform distribution over 365 possible birthdays:
– In a group of 23 people, the probability that 2 share a birthday is about 50%.
– In a group of 50 people, the probability is about 97%.
– The probability does not reach 100% until 366 people are in the room.

• “Having the same birthday” -> “Hashing to the same value”.

http://en.wikipedia.org/wiki/Birthday_paradox

The Birthday Paradox

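$$\bar{p}(n) = \frac{365!}{365^{n}\,(365-n)!}, \qquad p(n) = 1 - \bar{p}(n)$$

Here $\bar{p}(n)$ is the probability that n people all have distinct birthdays, and $p(n)$ is the probability that at least two of them share one. (Wikipedia)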
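The figures quoted for 23 and 50 people can be checked numerically. The sketch below multiplies the factors (365 - i)/365 one at a time instead of evaluating the factorials directly, which would overflow; the class and method names are illustrative.

```java
// Numerical check of the birthday probabilities, assuming 365 equally likely birthdays.
final class BirthdayParadox {
    // Probability that at least two of n people share a birthday.
    static double shareProbability(int n) {
        if (n > 365) return 1.0; // pigeonhole: more people than days
        double allDistinct = 1.0;
        for (int i = 0; i < n; i++) {
            // The (i+1)-th person must avoid the i birthdays already taken.
            allDistinct *= (365.0 - i) / 365.0;
        }
        return 1.0 - allDistinct;
    }

    public static void main(String[] args) {
        System.out.printf("n=23: %.3f%n", shareProbability(23)); // about 0.507
        System.out.printf("n=50: %.3f%n", shareProbability(50)); // about 0.970
    }
}
```

For hashing, replace 365 with the number of slots in the table and n with the number of keys inserted.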

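Popular Hash Functions
• MD5
– MD4
• SHA1
– SHA2
– SHA3
• CRC32
• 3DES (strictly a block cipher rather than a hash function, though hashes can be built from block ciphers)
• Tiger
• (Aside: Many hash functions are used for cryptography as well. Should you use them for cryptography, for example to store passwords, make sure you combine the data with an extra random string, called a salt, before hashing, to avoid “rainbow table” attacks.)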

Hash Tables
• The hash table is the array that the hash function provides an index into.
• Like other arrays, it begins with a fixed capacity and strategies must be employed to maintain it as the hash table grows.
• Because performance degrades as the hash table begins to fill, the size of a hash table is usually increased when the fraction of occupied slots passes a certain load factor.
– For example, a table with a load factor of 0.75 would increase in size when it is 75% full.
– 0.75 is the default in Java’s Hashtable, HashSet, and HashMap classes.
• Collisions, mappings of distinct objects to the same position in the array, must also be handled.
– They become more of a problem as the hash table fills.
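The load factor rule amounts to a one-line check before or after each insertion. The helper below is a hypothetical sketch, not code from Java's library; it grows the table once the occupied fraction exceeds the limit.

```java
// Sketch of a load factor check; shouldGrow is a hypothetical helper.
final class LoadFactorDemo {
    // True when size/capacity has passed the maximum load factor.
    static boolean shouldGrow(int size, int capacity, double maxLoadFactor) {
        return (double) size / capacity > maxLoadFactor;
    }

    public static void main(String[] args) {
        // With 16 slots and a load factor of 0.75, 12 entries still fit; the 13th triggers growth.
        System.out.println(shouldGrow(12, 16, 0.75)); // false
        System.out.println(shouldGrow(13, 16, 0.75)); // true
    }
}
```

After growing, every existing element has to be rehashed, because its index depends on the table size.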

Collision Resolution

• What if element B hashes to a location already filled by element A?
• We have a collision.
• There are two strategies for handling this scenario:
– Linear Probing.
– Chaining.
• Or, to put it in intuitive terms:
– This spot’s taken. Store the new element somewhere else.
– Cram both elements into the same spot.

Linear Probing
• Let element B hash to the location h(B).
• Suppose h(B) is already filled by element A.
• A linear probing strategy simply stores B in the next available space.
– If h(B) + 1 is available, this is where it is stored.
– If not, we move to h(B) + 2 and check whether it is available.
– And so on.
– If we hit the end of the table, we wrap around to the beginning (modular arithmetic).
• It is also possible to use an arbitrary offset k.
– Then we check h(B) + k, h(B) + 2k, etc.
– Again, everything is (mod n), where n is the size of the table, so we wrap.
• The same strategy is used for access:
– If the element stored at the hashed location is not the one we’re looking up, move down the hash table and check the next element. Repeat until the elements match or an empty space is reached.
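The insertion and access rules above can be sketched as a small fixed-size table. This is an illustration under stated assumptions: it uses an offset of 1, takes h() to be hashCode() modulo the capacity, never resizes, and leaves out deletion (which needs extra care in a probing table so that later lookups do not stop early at the vacated slot).

```java
// Minimal linear probing sketch: String keys, double values, fixed capacity, no deletion.
final class LinearProbingTable {
    private final String[] keys;
    private final double[] values;

    LinearProbingTable(int capacity) {
        keys = new String[capacity];
        values = new double[capacity];
    }

    // Assumed hash function: hashCode() reduced to a valid index.
    private int h(String key) {
        return Math.floorMod(key.hashCode(), keys.length);
    }

    // Stores the pair in the first free slot at or after h(key); returns false if the table is full.
    boolean put(String key, double value) {
        int start = h(key);
        for (int step = 0; step < keys.length; step++) {
            int i = (start + step) % keys.length; // wrap around at the end of the table
            if (keys[i] == null || keys[i].equals(key)) {
                keys[i] = key;
                values[i] = value;
                return true;
            }
        }
        return false;
    }

    // Probes from h(key) until the key or an empty slot is found; returns null if absent.
    Double get(String key) {
        int start = h(key);
        for (int step = 0; step < keys.length; step++) {
            int i = (start + step) % keys.length;
            if (keys[i] == null) return null;
            if (keys[i].equals(key)) return values[i];
        }
        return null;
    }

    public static void main(String[] args) {
        LinearProbingTable table = new LinearProbingTable(8);
        table.put("John", 75000);
        table.put("Mallory", 60000);
        System.out.println(table.get("Mallory"));
    }
}
```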

Linear Probing Example

(Figure, step 1: the table holds Alice, Bob, John, Eve, and Trudy. We insert “Mallory” through h(x).)
Suppose Mallory hashes to John’s spot.
(Figure, step 2: the same table.)
We check the next spot. It’s filled.
(Figure, step 3: Mallory has been added to the table.)
When we find an empty spot, it is filled.

Advantages and Disadvantages
• Advantages:
– Very space-efficient; values are stored in the hash table itself.
– Simple; no extra structures needed.
– Works fairly well when load factor is low.
• However, a low load factor wastes space.
– Because colliding elements remain adjacent in memory, caching behavior is exceptional.
• Disadvantages:
– Performance swiftly degrades when load factor exceeds 0.8.
– Collisions may cluster, and this requires traversing the hash table one element at a time to find the next available space. This may slow insertion.

Chaining

• Let element B hash to the location h(B).
• Suppose h(B) is already filled by element A.
• A chaining strategy stores a linked list at each slot of the table and appends the new element to the list.
• When we wish to access the element again, we perform a linear search on the list.
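A chained table can be sketched with one linked list per slot. As before, this is an illustration rather than a production structure: the hash function is assumed to be hashCode() modulo the number of slots, and the table never resizes.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Minimal chaining sketch: each slot holds a linked list of (key, value) entries.
final class ChainingTable {
    private static final class Entry {
        final String key;
        double value;
        Entry(String key, double value) { this.key = key; this.value = value; }
    }

    private final List<LinkedList<Entry>> buckets = new ArrayList<>();

    ChainingTable(int capacity) {
        for (int i = 0; i < capacity; i++) buckets.add(new LinkedList<>());
    }

    // The chain is always the one at the index returned by the (assumed) hash function.
    private LinkedList<Entry> chainFor(String key) {
        return buckets.get(Math.floorMod(key.hashCode(), buckets.size()));
    }

    // Scans the chain to update an existing key; otherwise appends in constant time.
    void put(String key, double value) {
        LinkedList<Entry> chain = chainFor(key);
        for (Entry e : chain) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        chain.addLast(new Entry(key, value));
    }

    // Linear search within the chain; returns null if the key is absent.
    Double get(String key) {
        for (Entry e : chainFor(key)) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    public static void main(String[] args) {
        ChainingTable table = new ChainingTable(1); // one slot: every key collides
        table.put("John", 75000);
        table.put("Mallory", 60000);
        System.out.println(table.get("Mallory"));
    }
}
```

The one-slot table in main forces every key into the same chain, which shows that a chained table can hold more elements than it has slots.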

Chaining Example

(Figure: the table holds Alice, Bob, John, Eve, and Trudy; Mallory arrives through h(x).)
Suppose Mallory hashes to John’s spot.
We then append Mallory to a linked list in that same spot.

Advantages and Disadvantages
• Advantages:
– Intuitive; the location we hash at is always the one returned by the hash function.
– New elements can be added to the list in constant time; linear probing requires a linear scan.
– Performance degrades only linearly as the table fills.
– More elements may be stored in the table than there are available slots using this method.
– You can quickly discover the number of keys that collide with one another.
• Disadvantages:
– Storing the data in adjacent memory locations, as in linear probing, has very good caching behavior. Linked lists in general do not.

Performance

(Figure from Wikipedia, not reproduced in this transcript.)

Perfect Hashing

• If all n keys are known prior to hashing, it is possible to construct a function that maps these keys to a hash table of size n without collisions.
• This function is known as a perfect hash function.
• There is a generalized procedure for discovering perfect hash functions described at http://cmph.sourceforge.net/papers/chm92.pdf.
• But since this is a difficult paper to understand, just be aware that it is possible.
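To make the idea concrete for a tiny key set, the sketch below simply tries multipliers until one maps every key to a different slot. This brute-force search is not the procedure from the paper linked above and does not scale to large key sets; the hash family (a · hashCode() mod P, then mod n), the prime P, and all names are assumptions chosen for illustration.

```java
// Brute-force search for a collision-free hash on a small, fixed key set.
// Not the algorithm from the linked paper; a toy illustration only.
final class PerfectHashSearch {
    static final long P = 1_000_003L; // an arbitrarily chosen prime

    // Candidate hash function number a, mapping key into [0, n).
    static int slot(String key, int a, int n) {
        return (int) (Math.floorMod((long) a * key.hashCode(), P) % n);
    }

    // Returns a multiplier in [1, maxTries] that sends all keys to distinct slots, or -1 if none is found.
    static int findMultiplier(String[] keys, int maxTries) {
        int n = keys.length;
        for (int a = 1; a <= maxTries; a++) {
            boolean[] used = new boolean[n];
            boolean collisionFree = true;
            for (String key : keys) {
                int s = slot(key, a, n);
                if (used[s]) { collisionFree = false; break; }
                used[s] = true;
            }
            if (collisionFree) return a;
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] keys = {"Alice", "Bob", "John"};
        int a = findMultiplier(keys, 10_000);
        for (String key : keys) {
            System.out.println(key + " -> " + slot(key, a, keys.length));
        }
    }
}
```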


Cuckoo Hashing
• This is a strategy that uses two hash functions to insert.
• If a collision occurs using the first hash function, the existing element is pushed out of its space (replaced by the new element) and rehashed using the second function.
• This can potentially push another element out. If a loop occurs, the hash table is rebuilt using a different set of hash functions.
• However, a collision on both hash functions is unlikely until the table begins to fill.
– This begins earlier than in the other two strategies:
– Using two hash functions, an appropriate load factor is 0.5.
– However, using three, the appropriate load factor jumps to 0.91.
• This strategy was generally found superior to both chaining and probing. However, it is still not widely known.
– Fortunately for you, I have some very esoteric areas of interest.
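The eviction process can be sketched as a small set of strings. Several details here are assumptions rather than anything the slides specify: both hash functions are derived from hashCode() with arbitrary odd multipliers, a "loop" is detected by giving up after a fixed number of evictions (MAX_KICKS), and a rebuild both changes the hash functions and doubles the table. The sketch does not track the load factor, and it cannot cope with three or more keys that share the same hashCode().

```java
// Minimal cuckoo hashing sketch: a set of strings, one table, two hash functions.
final class CuckooSet {
    private static final int MAX_KICKS = 32; // assumed cutoff for detecting a loop
    private String[] slots;
    private int seed = 0; // changing the seed changes both hash functions

    CuckooSet(int capacity) {
        slots = new String[capacity];
    }

    // Multiply by an odd, seed-dependent constant, fold the high bits down, reduce to an index.
    private int mix(int hash, int multiplier) {
        int x = hash * (multiplier + 2 * seed);
        return Math.floorMod(x ^ (x >>> 16), slots.length);
    }

    private int h1(String key) { return mix(key.hashCode(), 0x9E3779B1); }
    private int h2(String key) { return mix(key.hashCode(), 0x85EBCA6B); }

    // A key can only ever live in one of its two candidate slots.
    boolean contains(String key) {
        return key.equals(slots[h1(key)]) || key.equals(slots[h2(key)]);
    }

    void add(String key) {
        if (contains(key)) return;
        String current = key;
        int pos = h1(current);
        for (int kick = 0; kick < MAX_KICKS; kick++) {
            String evicted = slots[pos];
            slots[pos] = current;          // the newcomer takes the slot
            if (evicted == null) return;   // the slot was free: done
            current = evicted;             // the old occupant moves to its other slot
            pos = (pos == h1(current)) ? h2(current) : h1(current);
        }
        rebuild(current); // probably a loop: start over with new hash functions
    }

    // Reinserts everything, plus the key left without a slot, into a fresh table.
    private void rebuild(String pending) {
        String[] old = slots;
        slots = new String[old.length * 2];
        seed++;
        for (String key : old) {
            if (key != null) add(key);
        }
        add(pending);
    }

    public static void main(String[] args) {
        CuckooSet set = new CuckooSet(8);
        for (String name : new String[] {"Alice", "Bob", "John", "Eve"}) set.add(name);
        System.out.println(set.contains("John") + " " + set.contains("Charlie"));
    }
}
```

The payoff is in contains(): a lookup inspects at most two slots, no matter how full the table is.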

Unsorted Associative Containers
• Java has excellent built-in support for hashing.
• In particular, the unsorted associative containers utilize hash tables:
– HashMap, which you have used:
• Similar functions to TreeMap.
• Usually faster for random-access queries.
• As you saw in Assignment 3, performing range queries or sequential access is a pain (you had to sort).
– HashSet.
– Hashtable (which is very much like HashMap).
• Why are they unsorted?
– The point of a hash function is to turn keys into integers. In general, sorted order cannot be maintained through this conversion.
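The difference shows up as soon as you iterate. In the sketch below (the class and helper names are illustrative), the same three entries go into a TreeMap and a HashMap: the TreeMap always iterates in sorted key order, while the HashMap's iteration order is unspecified and should not be relied on.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Same data, two containers: sorted iteration versus unspecified iteration order.
final class SortedVsUnsortedDemo {
    static void fill(Map<String, Double> salaries) {
        salaries.put("John", 75000.0);
        salaries.put("Alice", 50000.0);
        salaries.put("Bob", 25000.0);
    }

    static List<String> keysInIterationOrder(Map<String, Double> map) {
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        Map<String, Double> tree = new TreeMap<>();
        Map<String, Double> hash = new HashMap<>();
        fill(tree);
        fill(hash);
        System.out.println(keysInIterationOrder(tree)); // [Alice, Bob, John]
        System.out.println(keysInIterationOrder(hash)); // some order; not guaranteed
        System.out.println(hash.get("Bob"));            // lookup by key works either way
    }
}
```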

Hashing in Other Languages

• Java: HashMap
• C++: hash_map
• C#: Hashtable
• Perl: $var{‘key’} = “value”
• PHP: $var[‘key’] = “value”
• Ruby: v = { ‘key’ => ‘value’ }

Performance

• What is the complexity of insertion in a hash table if there are no collisions?
• What if there are collisions?
– If you choose your table size appropriately, collisions are rather rare. The average size of your chains usually ends up around 2 or 3.
• Do hash tables need to use any extra space?

CRUD: Hash Tables
• Insertion (average): O(1).
• Access (average): O(1).
• Deletion (average): O(1).
• Insertion (worst): O(n).
• Access (worst): O(n).
• Deletion (worst): O(n).
• Since collisions are not very common with a good hash function and an appropriate load factor, hash tables very often yield constant-time insertion, access, and deletion.
• The amount of space used depends on the load factor, but remains O(n).
• They are incredibly useful structures!
• They allow you to index data by a generalized key rather than a numeric ID, and are therefore used extensively in databases and distributed queries. A hash-based algorithm called MapReduce powers Google.

Access on Demand

• This was our discussion of hashing.
• Next time, we will discuss amortized analysis and Java’s “Set” classes.
• The lesson:
– An unlikely event actually has a very high probability given enough repetitions (birthday paradox).
