Fall 2007 CS 225

Sets and Maps

Chapter 7

Chapter Objectives

• To understand the Java Map and Set interfaces and how to use them

• To learn about hash coding and its use to facilitate efficient search and retrieval

• To study two forms of hash tables—open addressing and chaining—and to understand their relative benefits and performance tradeoffs

Chapter Objectives

• To learn how to implement both hash table forms

• To be introduced to the implementation of Maps and Sets

• To see how two earlier applications can be more easily implemented using Map objects for data storage

The Set Abstraction

• A set is a collection that contains no duplicate elements

• Operations on sets include:
– Testing for membership
– Adding elements
– Removing elements
– Union
– Intersection
– Difference
– Subset

Sets and the Set Interface

• The part of the Collection hierarchy that relates to sets includes three interfaces, two abstract classes, and two actual classes

Methods of the Set Interface

• Has required methods for testing set membership, testing for an empty set, determining set size, and creating an iterator over the set

• Two optional methods for adding an element and removing an element

• Constructors enforce the no duplicate members criterion

• Add method does not allow duplicate items to be inserted

The Set Interface (Set<E>)

Method                                       Behavior
boolean add(E o)                             Adds o; returns false if o is already present (optional)
boolean addAll(Collection<? extends E> c)    Set union (optional)
boolean contains(Object o)                   Set membership
boolean containsAll(Collection<?> c)         Subset test
boolean isEmpty()                            Tests for an empty set
Iterator<E> iterator()                       Creates an iterator over the set
boolean remove(Object o)                     Removes o if present (optional)
boolean removeAll(Collection<?> c)           Set difference (optional)
boolean retainAll(Collection<?> c)           Set intersection (optional)
int size()                                   Cardinality
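
The bulk methods of the Set interface correspond directly to the mathematical set operations. The following is a minimal sketch of that correspondence; the class name SetOps and its helper methods are illustrative and do not come from the slides. Each helper copies its first argument so that the original sets are left unchanged.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative helper class (not from the slides).
class SetOps {
    // Union: copy a, then add every element of b.
    static <E> Set<E> union(Set<E> a, Set<E> b) {
        Set<E> result = new HashSet<>(a);
        result.addAll(b);
        return result;
    }

    // Intersection: copy a, then keep only the elements also in b.
    static <E> Set<E> intersection(Set<E> a, Set<E> b) {
        Set<E> result = new HashSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Difference: copy a, then remove every element of b.
    static <E> Set<E> difference(Set<E> a, Set<E> b) {
        Set<E> result = new HashSet<>(a);
        result.removeAll(b);
        return result;
    }

    // Subset test: a is a subset of b if b contains every element of a.
    static <E> boolean isSubset(Set<E> a, Set<E> b) {
        return b.containsAll(a);
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> b = new HashSet<>(Arrays.asList(3, 4));
        System.out.println(union(a, b).size());        // 4
        System.out.println(intersection(a, b).size()); // 1
        System.out.println(difference(a, b).size());   // 2
        System.out.println(a.add(3));                  // false: 3 is already present
    }
}
```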

Maps and the Map Interface

• The Map is related to the Set

• Mathematically, a Map is a set of ordered pairs whose elements are known as the key and the value

• Keys are required to be unique, but values are not necessarily unique

• You can think of each key as a “mapping” to a particular value

• A map can be used to enable efficient storage and retrieval of information in a table

The Map Interface (Map<K, V>)

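Method                   Behavior
V get(Object key)        Returns the object with the given key, or null if not present
boolean isEmpty()        Returns true if the map contains no mappings
V put(K key, V value)    Associates value with key; returns the previous value, or null if not present
V remove(Object key)     Removes the mapping for key; returns the value associated with key, or null
int size()               Returns the number of key-value mappings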
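
The return values of get, put, and remove are easiest to see in a small example. The sketch below uses java.util.HashMap; the class name MapDemo and the word-counting task are assumptions chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative class (not from the slides).
class MapDemo {
    // Counts how often each word occurs: the word is the key, the count is the value.
    static Map<String, Integer> countWords(String[] words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            Integer old = counts.get(w);               // null if w has not been seen yet
            counts.put(w, old == null ? 1 : old + 1);  // put replaces any previous value
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords(new String[] {"a", "b", "a"});
        System.out.println(counts.get("a"));      // 2
        System.out.println(counts.get("z"));      // null: key not present
        System.out.println(counts.put("b", 10));  // 1: the previous value
        System.out.println(counts.remove("b"));   // 10: the value that was removed
        System.out.println(counts.size());        // 1
    }
}
```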

Hash Tables

• The goal behind the hash table is to be able to access an entry based on its key value, not its location

• We want to be able to access an element directly through its key value, rather than having to determine its location first by searching for the key value in the array

• Using a hash table enables us to retrieve an item in constant time on average, with linear time in the worst case

Hash Codes and Index Calculation

• The basis of hashing is to transform the item’s key value into an integer value which will then be transformed into a table index

Example

Getting letter frequencies to generate Huffman codes

• For text-only documents, use the ASCII code of a character as its key value
– A table size of 128 isn't too big

• If you need the Unicode value, the table would get too large (65,536 entries)
– Not all characters will occur
– Use a smaller table and use the mod operation to compute the index from the Unicode value
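
As a rough sketch of this example, the code below counts characters in a 128-element table, using the character code mod the table size as the index. The class and method names are hypothetical. Note the assumption it relies on: for ASCII text every character gets its own slot, but any two characters whose codes differ by a multiple of 128 would share a slot, which is exactly the collision problem discussed on the following slides.

```java
// Illustrative class (not from the slides).
class LetterFrequency {
    static final int TABLE_SIZE = 128; // one slot per ASCII code

    // Returns a table in which counts[i] is the number of characters
    // in text whose code is congruent to i mod TABLE_SIZE.
    static int[] countChars(String text) {
        int[] counts = new int[TABLE_SIZE];
        for (int i = 0; i < text.length(); i++) {
            int index = text.charAt(i) % TABLE_SIZE; // for ASCII, index == the code itself
            counts[index]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countChars("hello");
        System.out.println(counts['l']); // 2
        System.out.println(counts['h']); // 1
    }
}
```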

Methods for Generating Hash Codes

• In most applications, the keys will consist of strings of letters or digits rather than a single character

• The number of possible key values is much larger than the table size

• Generating good hash codes is somewhat of an experimental process

• Besides a random distribution of its values, the hash function should be relatively simple and efficient to compute

Hash Codes for Strings

• For strings, simply summing the int values of all characters will return the same hash code for “sign” and “sing”

• The Java API algorithm accounts for the position of the characters as well

String hashCode Method

• String.hashCode() returns the integer calculated by the formula $s_0 \times 31^{n-1} + s_1 \times 31^{n-2} + \dots + s_{n-1}$, where $s_i$ is the $i$th character of the string and $n$ is the length of the string

• “Cat” will have a hash code of $\text{‘C’} \times 31^{2} + \text{‘a’} \times 31 + \text{‘t’}$

• 31 is a prime number that generates relatively few collisions
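
The formula can be checked with a few lines of code. The sketch below evaluates the polynomial with Horner's rule and compares it against a plain character sum; the class and method names are illustrative, not part of the Java API. Integer overflow simply wraps around, which is also how String.hashCode() behaves.

```java
// Illustrative class (not from the slides).
class StringHash {
    // Evaluates s0*31^(n-1) + s1*31^(n-2) + ... + s(n-1) using Horner's rule.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Naive alternative: ignores character positions.
    static int sumHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h += s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("Cat"));                       // 67510
        System.out.println(polyHash("Cat") == "Cat".hashCode());   // true
        System.out.println(sumHash("sign") == sumHash("sing"));    // true: a collision
        System.out.println(polyHash("sign") == polyHash("sing"));  // false
    }
}
```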

Handling Collisions

• A collision happens when two distinct keys hash to the same table index

• Consider two ways to handle collisions
– Open addressing
  • each element of the hash table has a single entry
– Chaining
  • each element of the hash table points to a list (possibly empty) of entries

Open Addressing

• Linear probing can be used to access an item in a hash table
– Compute the index from the key's hash code and examine that table element
– If that element contains an item with a different key, increment the index by one
– Keep incrementing until you find the key or a null entry

Algorithm for Open Addressing

Compute index using hashCode % table.length
if table[index] is null
    add item at position index
else if table[index] equals item being added
    item is already in the table
else
    while table[index] is not null and does not equal the item being added
        increment index, wrapping around to 0 at the end of the table
    if table[index] is null
        add item at new value of index
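
The insertion algorithm above can be sketched in Java as follows. The class name LinearProbeTable is hypothetical, and the table stores plain strings rather than key-value pairs to keep the example short. Two details are additions beyond the slide and should be read as assumptions: a negative remainder is corrected before use, and the loop gives up after table.length probes so that a full table cannot cause an infinite loop.

```java
import java.util.Arrays;

// Illustrative class (not from the slides).
class LinearProbeTable {
    private final String[] table;

    LinearProbeTable(int capacity) {
        table = new String[capacity];
    }

    // Adds item using linear probing; returns false if it is already present.
    boolean add(String item) {
        int index = item.hashCode() % table.length;
        if (index < 0) {
            index += table.length;           // % can be negative for negative hash codes
        }
        for (int probes = 0; probes < table.length; probes++) {
            if (table[index] == null) {      // free slot: insert here
                table[index] = item;
                return true;
            }
            if (table[index].equals(item)) { // already in the table
                return false;
            }
            index = (index + 1) % table.length; // next slot, wrapping around
        }
        throw new IllegalStateException("table is full");
    }

    // Returns a copy of the underlying array, in table order.
    String[] contents() {
        return table.clone();
    }

    public static void main(String[] args) {
        LinearProbeTable t = new LinearProbeTable(5);
        for (String name : new String[] {"Tom", "Dick", "Harry", "Sam", "Pete"}) {
            t.add(name);
        }
        System.out.println(Arrays.toString(t.contents())); // [Dick, Sam, Pete, Harry, Tom]
        System.out.println(t.add("Tom"));                  // false: duplicate
    }
}
```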

Search Termination

• As you increment the table index, your table should wrap around as in a circular array
– This leads to the possibility of an infinite loop

• How do you know when to stop searching if the table is full and you have not found the correct value?
– Stop when the index value for the next probe is the same as the index originally computed from the object's hash code
– Ensure that the table is never full by increasing its size after an insertion if its occupancy rate exceeds a specified threshold

Name       hashCode()    hashCode() % 5    hashCode() % 11
"Tom"      84274         4                 3
"Dick"     2129869       4                 5
"Harry"    69496448      3                 10
"Sam"      82879         4                 5
"Pete"     2484038       3                 7

Traversing a hash table

• You can visit all elements in a hash table by getting each element of the array in turn

• The sequence of values that results is arbitrary

• For the previous example
– size = 5: "Dick", "Sam", "Pete", "Harry", "Tom"
– size = 11: "Tom", "Dick", "Sam", "Pete", "Harry"

Deleting from a Hash Table

• When an item is deleted, you cannot just set its table entry to null
– Any element that collided with the deleted item and was placed further along the probe sequence would no longer be found
– Store a dummy value instead
– Deleted items waste storage space and reduce search efficiency

Hash Table Considerations

• Using a prime number for the size of the table tends to reduce collisions

• Load factor is the ratio of filled elements to total elements: load factor = number of filled elements / table size

• A larger load factor will result in increased collisions
– Don't allow the table to get too full
– Specify a maximum load factor

• If the table gets too full, rehash
– Allocate a new table with twice the capacity
– Get each element from the old table, compute its index in the new table, and insert it
– Don't insert deleted items

Clustering Example

Consider a hash table of size 11

Add elements with hash codes 5, 6, 5, 6, 7 in that order

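Resulting table, indices 5 through 9 in order: 5(1), 6(1), 5(2), 6(2), 7(1), where 5(2) denotes the second element with hash code 5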

Reducing Collisions Using Quadratic Probing

• Linear probing tends to form clusters of keys in the table, causing longer search chains

• Quadratic probing can reduce the effect of clustering
– Increments form a quadratic series
– Disadvantage is that the next index calculation is time consuming as it involves multiplication, addition, and modulo division
– Not all table elements are examined when looking for an insertion index

Clustering Example

Consider a hash table of size 11

Add elements with hash codes 5, 6, 5, 6, 7 in that order using quadratic probing

There is less overlap in the search chains
• 5 -> 6 -> 9 -> 14
• 6 -> 7 -> 10 -> 15

Resulting table, indices 5 through 9 in order: 5(1), 6(1), 6(2), 7(1), 5(2)
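
Both clustering examples can be reproduced with a short simulation. The sketch below is an illustration under stated assumptions: the class and method names are invented, hash codes are taken to be non-negative, and the k-th quadratic probe is at (start + k*k) % tableSize, which matches the search chains 5 -> 6 -> 9 -> 14 shown above. Linear probing uses an offset of k instead of k*k.

```java
import java.util.Arrays;

// Illustrative class (not from the slides).
class ProbeDemo {
    // Returns the table index chosen for each hash code, in insertion order.
    static int[] place(int[] hashCodes, int tableSize, boolean quadratic) {
        boolean[] used = new boolean[tableSize];
        int[] positions = new int[hashCodes.length];
        for (int i = 0; i < hashCodes.length; i++) {
            int start = hashCodes[i] % tableSize;
            int index = start;
            int k = 0;
            while (used[index]) {
                k++;
                if (k >= tableSize) {
                    // Quadratic probing does not visit every slot, so give up eventually.
                    throw new IllegalStateException("no free slot found");
                }
                int offset = quadratic ? k * k : k;
                index = (start + offset) % tableSize;
            }
            used[index] = true;
            positions[i] = index;
        }
        return positions;
    }

    public static void main(String[] args) {
        int[] codes = {5, 6, 5, 6, 7};
        System.out.println(Arrays.toString(place(codes, 11, false))); // [5, 6, 7, 8, 9]
        System.out.println(Arrays.toString(place(codes, 11, true)));  // [5, 6, 9, 7, 8]
    }
}
```

With quadratic probing the second 5 lands at index 9 rather than index 7, so it no longer lengthens the search chain for elements that hash to 6 or 7.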

Hashing with Chaining

• Chaining is an alternative to open addressing

• Each table element references a linked list that contains all of the items that hash to the same table index
– The linked list is often called a bucket
– The approach is sometimes called bucket hashing

• Only items that hash to the same table index will be examined when looking for an object

Chaining Illustrated

Performance of Hash Tables

• Load factor is the number of filled cells divided by the table size

• Load factor has the greatest effect on hash table performance

• The lower the load factor, the better the performance as there is a lesser chance of collision when a table is sparsely populated

Implementing a Hash Table

• Use interface KWHashMap and implement the interface using both open addressing and chaining.

Method                   Behavior
V get(K key)             Returns the value associated with key, or null
boolean isEmpty()        Returns true if the table contains no keys
V put(K key, V value)    Inserts or replaces the object associated with key
V remove(K key)          Removes the mapping for key; returns the value or null
int size()               Returns the number of keys in the table

Entry class

• Create an inner class to store key-value pairs

• Data fields
private Object key
private Object value

• Constructor
public Entry(Object key, Object value)

• Methods
public Object getKey()
public Object getValue()
public Object setValue(Object value)

Open Hash Table

• Data Fields
private Entry table[]
private static final int START_CAPACITY
private double LOAD_THRESHOLD
private int numKeys
private int numDeletes
private static final Entry DELETED

• Private Methods
private int find(Object key)
private void rehash()

find Algorithm for Open Hash Table

1. set index to key.hashCode() % table.length
2. if index < 0 add table.length
3. while table[index] not empty and != key
4.     increment index
5.     if index >= table.length
6.         index = 0
7. return index

get Algorithm for Open Hash Table

1. find table element for key
2. if table element contains key
3.     return value at this element
4. else
5.     return null

put Algorithm for Open Hash Table

1. find table element for key
2. if table element is empty
3.     insert new item
4.     increment numKeys
5.     rehash if needed
6.     return null
7. else
8.     save old value of element
9.     replace value with new value
10.    return old value

Removing from Open Hash Table

1. find table element for key
2. if element is empty
3.     return null
4. else
5.     save value of element
6.     replace element with DELETED
7.     increment numDeletes
8.     decrement numKeys
9.     return saved value

Rehashing

1. Allocate a new table with twice as many elements
2. set numKeys to 0
3. set numDeletes to 0
4. add each undeleted element of original table to new table
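
The find, get, put, remove, and rehash steps for the open hash table fit together roughly as in the sketch below. This is not the course's actual implementation: the class name, the starting capacity of 11, the threshold of 0.75, and the rule of rehashing when (numKeys + numDeletes) / table.length exceeds the threshold are all assumptions made for illustration. Null keys are not supported.

```java
// Illustrative open-addressing map (class name and constants are assumptions).
class OpenHashMapSketch<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private static final int START_CAPACITY = 11;       // assumed value
    private static final double LOAD_THRESHOLD = 0.75;  // assumed value
    private final Entry<K, V> DELETED = new Entry<>(null, null); // marks removed slots
    private Entry<K, V>[] table = newTable(START_CAPACITY);
    private int numKeys;
    private int numDeletes;

    @SuppressWarnings("unchecked")
    private Entry<K, V>[] newTable(int capacity) {
        return (Entry<K, V>[]) new Entry[capacity];
    }

    // Returns the index of key, or of the first empty slot on its probe sequence.
    // DELETED slots have a null key, so they never match and are skipped.
    private int find(Object key) {
        int index = key.hashCode() % table.length;
        if (index < 0) index += table.length;
        while (table[index] != null && !key.equals(table[index].key)) {
            index++;
            if (index >= table.length) index = 0;
        }
        return index;
    }

    V get(Object key) {
        Entry<K, V> e = table[find(key)];
        return e == null ? null : e.value;
    }

    V put(K key, V value) {
        int index = find(key);
        if (table[index] == null) {
            table[index] = new Entry<>(key, value);
            numKeys++;
            double load = (double) (numKeys + numDeletes) / table.length;
            if (load > LOAD_THRESHOLD) rehash(); // keeps at least one slot empty
            return null;
        }
        V old = table[index].value;
        table[index].value = value;
        return old;
    }

    V remove(Object key) {
        int index = find(key);
        if (table[index] == null) return null;
        V old = table[index].value;
        table[index] = DELETED; // do not set to null: later probes must pass through
        numDeletes++;
        numKeys--;
        return old;
    }

    private void rehash() {
        Entry<K, V>[] old = table;
        table = newTable(2 * old.length);
        numKeys = 0;
        numDeletes = 0;
        for (Entry<K, V> e : old) {
            if (e != null && e != DELETED) put(e.key, e.value); // skip deleted items
        }
    }

    int size() { return numKeys; }
    boolean isEmpty() { return numKeys == 0; }
}
```

Because the table is rehashed before it fills up, find always reaches an empty slot and therefore terminates.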

Chained Hash Table

• Data Fields
private LinkedList table[]
private static final int CAPACITY
private static final int LOAD_THRESHOLD

• Rehash when numKeys=3*CAPACITY

get Algorithm for Chained Hash Table

1. set index to key.hashCode() % table.length
2. if index < 0
3.     add table.length
4. if table[index] is empty
5.     return null
6. for each element in list at table[index]
7.     if element key = search key
8.         return element's value
9. return null (key not found)

put Algorithm for Chained Hash Table

1. set index to key.hashCode() % table.length
2. if index < 0 add table.length
3. if table[index] is empty
4.     create a new linked list at table[index]
5. search list at table[index] for key
6. if successful
7.     replace value and return old value
8. else
9.     insert new key-value pair into list
10.    increment numKeys
11.    rehash if needed
12.    return null

Removing from Chained Hash Table

1. set index to key.hashCode() % table.length
2. if index < 0 add table.length
3. if table[index] is empty
4.     return null
5. search list at table[index] for key
6. if successful
7.     remove value and decrement numKeys
8.     if list at table[index] is empty
9.         set table[index] to null
10.    return value associated with key
11. return null
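
The three chained-table algorithms can be sketched together as follows. The class name, the starting capacity of 11, and the rehash rule (grow once numKeys exceeds LOAD_THRESHOLD times the table length, with LOAD_THRESHOLD assumed to be 3) are illustrative assumptions rather than the course's code. Null keys are not supported.

```java
import java.util.Iterator;
import java.util.LinkedList;

// Illustrative chained hash map (class name and constants are assumptions).
class ChainedHashMapSketch<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private static final int CAPACITY = 11;       // assumed starting size
    private static final int LOAD_THRESHOLD = 3;  // assumed: average chain length allowed
    private LinkedList<Entry<K, V>>[] table = newTable(CAPACITY);
    private int numKeys;

    @SuppressWarnings("unchecked")
    private LinkedList<Entry<K, V>>[] newTable(int capacity) {
        return (LinkedList<Entry<K, V>>[]) new LinkedList[capacity];
    }

    private int indexFor(Object key) {
        int index = key.hashCode() % table.length;
        return index < 0 ? index + table.length : index;
    }

    V get(Object key) {
        LinkedList<Entry<K, V>> bucket = table[indexFor(key)];
        if (bucket == null) return null;
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) return e.value;
        }
        return null; // key not found
    }

    V put(K key, V value) {
        int index = indexFor(key);
        if (table[index] == null) table[index] = new LinkedList<>();
        for (Entry<K, V> e : table[index]) {
            if (e.key.equals(key)) { // key exists: replace value, return old one
                V old = e.value;
                e.value = value;
                return old;
            }
        }
        table[index].addFirst(new Entry<>(key, value));
        numKeys++;
        if (numKeys > LOAD_THRESHOLD * table.length) rehash();
        return null;
    }

    V remove(Object key) {
        int index = indexFor(key);
        if (table[index] == null) return null;
        Iterator<Entry<K, V>> it = table[index].iterator();
        while (it.hasNext()) {
            Entry<K, V> e = it.next();
            if (e.key.equals(key)) {
                it.remove();
                numKeys--;
                if (table[index].isEmpty()) table[index] = null;
                return e.value;
            }
        }
        return null;
    }

    private void rehash() {
        LinkedList<Entry<K, V>>[] old = table;
        table = newTable(2 * old.length);
        numKeys = 0;
        for (LinkedList<Entry<K, V>> bucket : old) {
            if (bucket == null) continue;
            for (Entry<K, V> e : bucket) put(e.key, e.value);
        }
    }

    int size() { return numKeys; }
}
```

Unlike the open-addressing version, removal needs no DELETED marker, because each bucket is searched independently of its neighbors.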

Implementation Considerations for Maps and Sets

• Class Object implements methods hashCode and equals, so every class can access these methods unless it overrides them
– Object.equals compares two objects based on their addresses, not their contents
– Object.hashCode calculates an object’s hash code based on its address, not its contents

• Java recommends that if you override the equals method, then you should also override the hashCode method, so that objects that are equal also have equal hash codes
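
A short example shows why the two methods belong together. In the sketch below, Person is a hypothetical class invented for illustration; it overrides equals to compare contents and overrides hashCode to match, so a HashSet can find an equal object even when it is a different instance.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical class used only for illustration.
final class Person {
    private final String name;
    private final int id;

    Person(String name, int id) {
        this.name = name;
        this.id = id;
    }

    // Two Person objects are equal if their contents match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Person)) return false;
        Person other = (Person) o;
        return id == other.id && name.equals(other.name);
    }

    // Must agree with equals: equal objects produce equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(name, id);
    }

    public static void main(String[] args) {
        Set<Person> people = new HashSet<>();
        people.add(new Person("Tom", 1));
        // A different instance with the same contents is found in the set.
        System.out.println(people.contains(new Person("Tom", 1))); // true
    }
}
```

If hashCode were left as inherited from Object, the lookup would usually search the wrong table position and report that the equal object is missing.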

Implementing HashSetOpen

• HashSetOpen is similar to a hash table

• Methods have slightly different signatures

• HashSetOpen is an adapter class with an internal HashTableOpen

Map Method                              Set Method
Object get(Object key)                  boolean contains(Object key)
Object put(Object key, Object value)    boolean add(Object key)
Object remove(Object key)               boolean remove(Object key)
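
The adapter idea can be sketched in a few lines. In the code below, java.util.HashMap stands in for the course's HashTableOpen class, which is not shown in these slides, and the class name HashSetSketch and the PRESENT placeholder value are assumptions. Each set method delegates to the corresponding map method from the table above and converts the result to a boolean.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative adapter; HashMap is a stand-in for HashTableOpen.
class HashSetSketch<E> {
    // Placeholder value stored with every key; only the keys matter for a set.
    private static final Object PRESENT = new Object();
    private final Map<E, Object> map = new HashMap<>();

    // contains: the key is in the set if get finds a value for it.
    boolean contains(Object key) {
        return map.get(key) != null;
    }

    // add: put returns null when the key was not present before.
    boolean add(E key) {
        return map.put(key, PRESENT) == null;
    }

    // remove: the map's remove returns the old value if the key was present.
    boolean remove(Object key) {
        return map.remove(key) != null;
    }

    int size() {
        return map.size();
    }

    public static void main(String[] args) {
        HashSetSketch<String> set = new HashSetSketch<>();
        System.out.println(set.add("Tom"));       // true
        System.out.println(set.add("Tom"));       // false: duplicate
        System.out.println(set.contains("Tom"));  // true
        System.out.println(set.remove("Tom"));    // true
        System.out.println(set.size());           // 0
    }
}
```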

Implementing the Java Map and Set Interfaces

• The Java API classes HashMap and HashSet use a hash table to implement the Map and Set interfaces

• The task of implementing the two interfaces is simplified by the inclusion of abstract classes AbstractMap and AbstractSet in the Collection hierarchy

Nested Interface Map.Entry

• One requirement on the key-value pairs for a Map object is that they implement the interface Map.Entry<K, V>, which is a nested interface of interface Map
– An implementer of the Map interface must contain an inner class that provides code for the methods

K getKey()
V getValue()
V setValue(V value)

Additional Applications of Maps

• Can implement the phone directory using a map

PhoneDirectory Interface    Map Interface
addOrChangeEntry            put
lookUpEntry                 get
removeEntry                 remove
loadData                    none
save                        none

Additional Applications of Maps

• Huffman Coding Problem
– Use a map for creating an array of elements and replacing each input character by its bit string code in the output file
– Frequency table
  • The key will be the input character
  • The value is the character code string

Chapter Review

• The Set interface describes an abstract data type that supports the same operations as a mathematical set

• The Map interface describes an abstract data type that enables a user to access information corresponding to a specified key

• A hash table uses hashing to transform an item’s key into a table index so that insertions, retrievals, and deletions can be performed in expected O(1) time

Chapter Review

• A collision occurs when two keys map to the same table index

• In open addressing, linear probing is often used to resolve collisions

• The best way to avoid collisions is to keep the table load factor relatively low by rehashing when the load factor reaches a value such as 0.75

Chapter Review

• In open addressing, you can’t simply remove an element from the table when you delete it; instead, you must mark it as deleted

• A set view of a hash table can be obtained through method entrySet

• Two Java API implementations of the Map (Set) interface are HashMap (HashSet) and TreeMap (TreeSet)
