Lecture 18 - cseweb.ucsd.edu

Lecture 18

• Separate chaining

• Dictionary data types

• Hashtables vs. balanced search trees

• A hashtable implementation: java.util.Hashtable

Reading: Weiss, Ch 5

CSE 100, UCSD: LEC 18

Open addressing vs. separate chaining

• Linear probing, double hashing, and random hashing are appropriate if the keys are kept as entries in the hashtable itself...

doing that is called "open addressing"

it is also called "closed hashing"

• Another idea: Entries in the hashtable are just pointers to the head of a linked list (“chain”); elements of the linked list contain the keys...

this is called "separate chaining"

it is also called "open hashing"

• Collision resolution becomes easy with separate chaining: no need to probe other table locations; just insert a key in its linked list if it is not already there.

• (It is possible to use fancier data structures than linked lists for this; but linked lists work very well in the average case, as we will see)

Separate chaining: basic algorithms

• When inserting a key K in a table with hash function H(K)

1. Set indx = H(K)
2. Insert key in linked list headed at indx. (Search the list first to avoid duplicates.)

• When searching for a key K in a table with hash function H(K)

1. Set indx = H(K)
2. Search for key in linked list headed at indx, using linear search.

• When deleting a key K in a table with hash function H(K)

1. Set indx = H(K)
2. Delete key in linked list headed at indx

• Advantages: average-case performance stays good as the number of entries approaches and even exceeds the table size M; delete is easier to implement than with open addressing

• Disadvantages: requires dynamic data, requires storage for pointers in addition to data, can have poor locality which causes poor caching performance
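The three algorithms above can be sketched in Java roughly as follows. This is a minimal illustration, not code from the lecture: the class name `ChainedIntSet` is hypothetical, keys are restricted to `int`, H(K) = K mod M is assumed as the hash function, and new keys are inserted at the head of the chain (the slides do not say where in the list a key goes).

```java
import java.util.LinkedList;

// Hypothetical illustration class; not taken from the lecture or from any library.
class ChainedIntSet {
    private final LinkedList<Integer>[] table;   // one chain per table slot

    @SuppressWarnings("unchecked")
    ChainedIntSet(int m) {
        table = new LinkedList[m];
        for (int i = 0; i < m; i++) table[i] = new LinkedList<>();
    }

    // Assumed hash function H(K) = K mod M, kept non-negative so it is a valid index.
    private int hash(int key) { return Math.floorMod(key, table.length); }

    // Insert: search the chain first to avoid duplicates, then add at the head.
    boolean insert(int key) {
        LinkedList<Integer> chain = table[hash(key)];
        if (chain.contains(key)) return false;
        chain.addFirst(key);
        return true;
    }

    // Find: linear search of the one chain the key can be in.
    boolean find(int key) { return table[hash(key)].contains(key); }

    // Delete: remove the key from its chain; no other table slot is touched.
    boolean delete(int key) { return table[hash(key)].remove(Integer.valueOf(key)); }
}
```

Note that delete only edits a single linked list, which is why it is simpler here than with open addressing.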

Separate chaining, an example

M = 7, H(K) = K mod M
Insert these keys: 701, 145, 217, 19, 13, 749 in this table, using separate chaining:

index: 0 1 2 3 4 5 6
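One way to check a hand-worked answer for this exercise is a short Java program that builds the chains. The class and method names below are hypothetical, and appending each new key at the tail of its chain is an assumption; inserting at the head would give the same chains in reverse order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for checking the exercise; not part of the lecture code.
class ChainingExample {
    // Returns m chains; chain i holds the keys K with K mod m == i, in insertion order.
    static List<List<Integer>> chains(int[] keys, int m) {
        List<List<Integer>> table = new ArrayList<>();
        for (int i = 0; i < m; i++) table.add(new ArrayList<>());
        for (int k : keys) {
            List<Integer> chain = table.get(k % m);   // keys in this exercise are non-negative
            if (!chain.contains(k)) chain.add(k);      // append at the tail (an assumption)
        }
        return table;
    }

    public static void main(String[] args) {
        int[] keys = {701, 145, 217, 19, 13, 749};
        List<List<Integer>> table = chains(keys, 7);
        for (int i = 0; i < table.size(); i++) {
            System.out.println(i + ": " + table.get(i));
        }
    }
}
```

With these keys, 217 and 749 share index 0 and 145 and 19 share index 5, so two of the seven chains hold more than one key and three are empty.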

Analysis of separate-chaining hashing

• Keep in mind the load factor, a measure of how full the table is:

$\alpha = N/M$

where M is the size of the table, and N is the number of keys that have been inserted in the table

• With separate chaining, it is possible to have $\alpha > 1$

• Given a load factor $\alpha$, we would like to know the time costs, in the best, average, and worst case of

new-key insert and unsuccessful find (these are the same)

successful find

• The best case is O(1) and worst case is O(N) for all of these... let’s analyze the average case

Average case costs with separate chaining

• Assume a table with load factor $\alpha = N/M$

• There are N items total distributed over M linked lists (some of which may be empty), so the average number of items per linked list is $N/M = \alpha$

• In any unsuccessful find/insert, the hash table entry for the key is accessed; then the linked list headed there is exhaustively searched

• Therefore, assuming all table entries are equally likely to be hit by the hash function, the average number of steps for insert or unsuccessful find with separate chaining is

$U = 1 + \alpha$

• In successful find, the hash table entry for the key is accessed; then the linked list headed there is linearly searched. Therefore, (with the same probabilistic assumption) the average number of steps for successful find with separate chaining is

$S = 1 + \alpha/2$

• These are less than 2 and 1.5 respectively, when $\alpha < 1$

• And these remain O(1), independent of M, even when $\alpha$ exceeds 1.
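To get a feel for the numbers, the two average-case formulas can be evaluated directly. The helper below is only a sketch with hypothetical names; it computes the load factor N/M and plugs it into the expressions 1 + (load factor) and 1 + (load factor)/2 from the analysis above.

```java
// Hypothetical helper; evaluates the average-case step counts from the slide.
class ChainingCost {
    // Load factor: number of keys N divided by table size M.
    static double loadFactor(int n, int m) { return (double) n / m; }

    // Average steps for a new-key insert or unsuccessful find: 1 + load factor.
    static double unsuccessful(int n, int m) { return 1.0 + loadFactor(n, m); }

    // Average steps for a successful find: 1 + load factor / 2.
    static double successful(int n, int m) { return 1.0 + loadFactor(n, m) / 2.0; }
}
```

For example, with N = 14 keys in a table of size M = 7 the load factor is 2, giving 3 steps on average for an unsuccessful find and 2 for a successful one: still small constants even though the table holds twice as many keys as it has slots.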

Dictionary data types

• A data structure is intended to hold data

An insert operation inserts a data item into the structure; a find operation says whether a data item is in the structure; delete removes a data item; etc.

• A Dictionary is a specialized kind of data structure:

A Dictionary structure is intended to hold pairs: each pair consists of a key, together with some related data

An insert operation inserts a key-data pair in the table; a find operation takes a key and returns the data in the key-data pair with that key; delete takes a key and removes the key-data pair with that key; etc.

• Dictionaries are sometimes called "Table" or "Map" abstract data types, or "associative memories"

Dictionary as ADT

• Domain:

a collection of pairs; each pair consists of a key, and some additional data

• Operations (typical):

Create a table (initially empty)

Insert a new key-data pair in the table; if a key-data pair with the same key is already there, update the data part of the pair

Find the key-data pair in the table corresponding to a given key; return the data

Delete the key-data pair corresponding to a given key

Enumerate (traverse) all key-data pairs in the table
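As a rough illustration, each of these operations has a direct counterpart in Java's `java.util.Map` interface. The sketch below uses `java.util.HashMap` as the implementation; the class name `DictionaryOps` and the sample keys and values are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class showing the typical Dictionary operations on a java.util.Map.
class DictionaryOps {
    static Map<String, Integer> demo() {
        Map<String, Integer> ages = new HashMap<>();   // create a table (initially empty)
        ages.put("ada", 36);                           // insert a new key-data pair
        ages.put("ada", 37);                           // same key again: the data part is updated
        ages.put("alan", 41);
        Integer found = ages.get("alan");              // find: returns the data, or null if absent
        System.out.println("alan -> " + found);
        ages.remove("alan");                           // delete the pair with the given key
        for (Map.Entry<String, Integer> e : ages.entrySet()) {   // enumerate all pairs
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        return ages;
    }
}
```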

Implementing the Dictionary ADT

• A Dictionary can be implemented in various ways:

using a list, binary search tree, hashtable, etc., etc.

• In each case:

the implementing data structure has to be able to hold key-data pairs

the implementing data structure has to be able to do insert, find, and delete operations paying attention to the key

• This could be done in a generic data structure, where the user can specify the comparison function to be used by the insert, find, and delete functions
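Java's `java.util.TreeMap` is one example of this design: its constructor accepts a `Comparator` that insert, find, and delete all use to compare keys. The sketch below is illustrative only; the class name and the sample data are assumptions, while `String.CASE_INSENSITIVE_ORDER` is a comparator provided by the standard library.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo: a dictionary whose key comparison is supplied by the user.
class ComparatorDictionary {
    static Map<String, String> caseInsensitive() {
        Comparator<String> ignoreCase = String.CASE_INSENSITIVE_ORDER;
        Map<String, String> capitals = new TreeMap<>(ignoreCase);
        capitals.put("France", "Paris");
        // "FRANCE" compares equal to "France" under this comparator,
        // so this updates the existing pair instead of adding a second one.
        capitals.put("FRANCE", "Paris (updated)");
        return capitals;
    }
}
```

Swapping in a different comparator changes which keys count as "the same" without touching the data structure's code.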

The Dictionary ADT and search engine indexes

• The Dictionary ADT is useful in any situation where you want to store, retrieve, and manipulate data based on associated keys

• One important application is a document search engine index

• An index associates words (keys) with information (data) such as what documents a word occurs in, how many times it occurs, what its position is within the document, etc.

When a word is read for the first time, an "insert" operation is done in the index to associate that word with the document in which it occurs (and possibly other information)

When a word is encountered again, an "insert" or "update" operation is done to add or modify associations with that word (an additional document in which it occurs, an incremented count of the number of times it occurs, etc.)

If a document is no longer available, words contained in it have their associations changed, and the "delete" operation may be necessary

By doing a “find” operation in the index using a word as key, a user can find the documents that contain that word
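A toy version of such an index might look like the following. It is a sketch under several assumptions: the class name `TinyIndex` is hypothetical, the only information stored per word is a per-document occurrence count, and words are split on non-word characters and lowercased.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy index: word -> (document name -> number of occurrences).
class TinyIndex {
    private final Map<String, Map<String, Integer>> index = new HashMap<>();

    // Insert or update: the first sighting of a word creates its entry,
    // later sightings increment the count for this document.
    void addDocument(String docName, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            index.computeIfAbsent(word, w -> new HashMap<>())
                 .merge(docName, 1, Integer::sum);
        }
    }

    // A document is no longer available: drop its associations,
    // and delete any word that no longer occurs anywhere.
    void removeDocument(String docName) {
        for (Map<String, Integer> postings : index.values()) postings.remove(docName);
        index.values().removeIf(Map::isEmpty);
    }

    // Find: the documents containing the word, with occurrence counts.
    Map<String, Integer> find(String word) {
        return index.getOrDefault(word.toLowerCase(), new HashMap<>());
    }
}
```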

Hashtables vs. balanced search trees

• Hashtables and balanced search trees can both be used in applications that need fast insert and find

• What are advantages and disadvantages of each?

Balanced search trees guarantee worst-case performance O(log N), which is quite good

A well-designed hash table has typical performance O(1), which is excellent; but worst-case is O(N), which is bad

Search trees require that keys be totally ordered: for any keys K1, K2, exactly one of K1 < K2, K1 == K2, or K1 > K2 holds

Hashtables only require that keys be testable for equality, and that you can compute a hash function for them

Hashtables vs. balanced search trees, cont’d

A search tree can easily be used to return keys close in value to a given key, or to return the smallest key in the tree, or to output the keys in sorted order

A hashtable does not normally deal with ordering information efficiently

In a balanced search tree, delete is as efficient as insert

In a hashtable that uses open addressing, delete can be inefficient, and somewhat tricky to implement (easy with separate chaining though)

Overall, balanced search trees are rather difficult to implement correctly

Hash tables are relatively easy to implement
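The ordering difference is easy to see with Java's standard library, where `java.util.TreeMap` is a balanced (red-black) search tree. The sketch below uses a hypothetical class name and arbitrary sample keys; `firstKey`, `ceilingKey`, and `floorKey` are existing `TreeMap` methods with no counterpart on `HashMap` or `Hashtable`.

```java
import java.util.TreeMap;

// Hypothetical demo of order-based queries that a search tree supports directly.
class OrderedQueries {
    static TreeMap<Integer, String> sample() {
        TreeMap<Integer, String> t = new TreeMap<>();
        for (int k : new int[] {701, 145, 217, 19, 13, 749}) t.put(k, "key " + k);
        return t;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> t = sample();
        System.out.println(t.firstKey());        // smallest key in the tree
        System.out.println(t.ceilingKey(200));   // smallest key >= 200
        System.out.println(t.floorKey(200));     // largest key <= 200
        System.out.println(t.keySet());          // keys in sorted order
    }
}
```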

A look at Java’s Hashtable

• The java.util.Hashtable class has existed in the Java standard library since JDK1.0

• In JDK 1.2, Hashtable was incorporated into the “Collections Framework”, and declared to implement Map

• java.util.Hashtable is similar to java.util.HashMap

They both implement Map, so they have the same public interface, but the implementation is slightly different

One difference is that Hashtable has synchronized methods (this makes them slightly slower; if you don’t need synchronization for multithreaded programming, use HashMap)

• In JDK 1.5, Hashtable and HashMap were made generic, with type parameters for keys and values
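Another visible difference between the two classes is their treatment of null: Hashtable throws a NullPointerException for a null key or value, whereas HashMap accepts them. The following sketch probes this with a hypothetical helper method; the key `"k"` is arbitrary.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

// Hypothetical probe comparing how two Map implementations treat null values.
class NullHandling {
    // Returns true if the map accepts a null value, false if put() rejects it.
    static boolean acceptsNullValue(Map<String, String> map) {
        try {
            map.put("k", null);
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Hashtable: " + acceptsNullValue(new Hashtable<>()));
        System.out.println("HashMap:   " + acceptsNullValue(new HashMap<>()));
    }
}
```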

Hashtable.java

package java.util;
import java.io.*;

/**
 * This class implements a hashtable, which maps keys to values.
 * Any non-null object can be used as a key or as a value.
 * <p>
 * To successfully store and retrieve objects from a hashtable, the
 * objects used as keys must implement the <code>hashCode</code>
 * method and the <code>equals</code> method.
 */
public class Hashtable<K,V> extends Dictionary<K,V>
    implements Map<K,V>, Cloneable, java.io.Serializable {

Dictionary abstract class

• Dictionary is an abstract class that specifies some abstract methods. It acts like an interface specification, and probably should have been an interface instead of a class. It is very similar to the interface java.util.Map. Its methods are shown here, without comments:

public abstract class Dictionary<K,V> {
    abstract public int size();
    abstract public boolean isEmpty();
    abstract public Enumeration<K> keys();
    abstract public Enumeration<V> elements();
    abstract public V get(Object key);
    abstract public V put(K key, V value);
    abstract public V remove(Object key);
}

Instance variables

• Here are the instance variables declared in the Hashtable class:

    /**
     * The hash table data.
     */
    private transient Entry table[];

    /**
     * The total number of entries in the hash table.
     */
    private transient int count;

    /**
     * Rehashes the table when count exceeds this threshold.
     */
    private int threshold;

• What is the type of elements of the array implementing the hashtable?

Entry

• The Hashtable.java file also defines this inner class:

private static class Entry<K,V> implements Map.Entry<K,V> {
    int hash;
    K key;
    V value;
    Entry<K,V> next;
}

• Entries in a Hashtable object’s table[] array are pointers to objects of this class.

• From these declarations so far, can you tell what collision resolution strategy is used?

Hashtable methods

• We will look at these instance methods in the Hashtable class:

constructors

get()

put()

keySet()

Hashtable constructors

/**
 * Constructs a new, empty hashtable with the specified initial
 * capacity and the specified load factor.
 *
 * @param initialCapacity the initial capacity of the table
 * @param loadFactor a number between 0.0 and 1.0.
 * @exception IllegalArgumentException if the initial capacity is
 *            less than zero, or if the load factor
 *            is less than or equal to zero.
 * @since JDK1.0
 */
public Hashtable(int initialCapacity, float loadFactor) {
    if ((initialCapacity < 0) || (loadFactor <= 0.0)) {
        throw new IllegalArgumentException();
    }
    this.loadFactor = loadFactor;
    table = new Entry[initialCapacity];
    threshold = (int) (initialCapacity * loadFactor);
}

Hashtable default constructor

/**
 * Constructs a new, empty hashtable with a default capacity and
 * load factor.
 *
 * @since JDK1.0
 */
public Hashtable() {
    this(11, 0.75f);
}

• How do the default values for size and load factor compare to the hash table design principles we talked about?...

get()

/**
 * Returns the value to which the specified key is mapped in this
 * hashtable.
 *
 * @param key a key in the hashtable.
 * @return the value to which the key is mapped in this hashtable;
 *         null if the key is not mapped to any value in
 *         this hashtable.
 */
public synchronized V get(Object key) {
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % table.length;
    for (Entry<K,V> e = table[index] ; e != null ; e = e.next) {
        if ( e.hash == hash && e.key.equals(key) ) {
            return e.value;
        }
    }
    return null;
}
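The expression `(hash & 0x7FFFFFFF) % table.length` deserves a closer look: hashCode() may return a negative int, and in Java the `%` operator keeps the sign of its left operand, so a plain `hash % table.length` could produce a negative (invalid) array index. Masking with 0x7FFFFFFF clears the sign bit first. The helper below isolates that computation; the class name is hypothetical and the table length 11 in the comments is just the default capacity used as an example.

```java
// Hypothetical helper isolating the index computation used by get() and put().
class BucketIndex {
    // Clears the sign bit of the hash code, then reduces it modulo the table length,
    // so the result always lies in the range 0 .. tableLength - 1.
    static int index(int hash, int tableLength) {
        return (hash & 0x7FFFFFFF) % tableLength;
    }

    public static void main(String[] args) {
        int hash = Integer.valueOf(-7).hashCode();   // an Integer's hash code is its own value
        System.out.println(hash % 11);               // negative: not usable as an array index
        System.out.println(index(hash, 11));         // non-negative and less than 11
    }
}
```

The mask also handles `Integer.MIN_VALUE`, for which `Math.abs` would still return a negative number.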

put()

• Here are the javadoc comments:

/**
 * Maps the specified <code>key</code> to the specified
 * <code>value</code> in this hashtable. Neither the key nor the
 * value can be <code>null</code>.
 * <p>
 * The value can be retrieved by calling the <code>get</code>
 * method with a key that is equal to the original key.
 *
 * @param key the hashtable key.
 * @param value the value.
 * @return the previous value of the specified key in this
 *         hashtable, or <code>null</code> if it did not have one.
 * @exception NullPointerException if the key or value is
 *            <code>null</code>.
 * @since JDK1.0
 */

• ... and the code follows.

public synchronized V put(K key, V value) {
    // Make sure the value is not null
    if (value == null) {
        throw new NullPointerException();
    }

    // If the key is already in the hashtable, update its value
    int hash = key.hashCode();
    int index = (hash & 0x7FFFFFFF) % table.length;
    for (Entry<K,V> e = table[index] ; e != null ; e = e.next) {
        if ( e.hash == hash && e.key.equals(key) ) {
            V old = e.value;
            e.value = value;
            return old;
        }
    }

    if (count >= threshold) {
        // Rehash the table if the threshold is exceeded
        rehash(); // this enlarges the capacity of the table
        index = (hash & 0x7FFFFFFF) % table.length;
    }

    // Create and add the new entry.
    Entry<K,V> e = new Entry<K,V>();
    e.hash = hash;
    e.key = key;
    e.value = value;
    e.next = table[index];
    table[index] = e;
    count++;
    return null;
}

Rehashing

/**
 * Increases the capacity of and internally reorganizes this
 * hashtable, in order to accommodate and access its entries more
 * efficiently.
 */
protected void rehash() {
    int oldCapacity = table.length;
    Entry oldMap[] = table;
    int newCapacity = oldCapacity * 2 + 1;
    Entry newMap[] = new Entry[newCapacity];
    threshold = (int)(newCapacity * loadFactor);
    table = newMap;
    // Move every entry from the old chains to the head of its new chain
    for (int i = oldCapacity ; i-- > 0 ;) {
        for (Entry<K,V> old = oldMap[i] ; old != null ; ) {
            Entry<K,V> e = old;
            old = old.next;
            int index = (e.hash & 0x7FFFFFFF) % newCapacity;
            e.next = newMap[index];
            newMap[index] = e;
        }
    }
}
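To see how the table grows, the capacity rule from rehash() (`oldCapacity * 2 + 1`) and the threshold rule (`(int)(capacity * loadFactor)`) can be replayed in isolation. The helper below is a sketch with a hypothetical name; it only reproduces the arithmetic, not the table itself.

```java
// Hypothetical helper replaying the capacity and threshold arithmetic from the listing.
class GrowthSchedule {
    // Returns {capacity, threshold} after the given number of rehash() calls.
    static int[] after(int rehashes, int initialCapacity, float loadFactor) {
        int capacity = initialCapacity;
        for (int i = 0; i < rehashes; i++) {
            capacity = capacity * 2 + 1;          // same rule as rehash()
        }
        int threshold = (int) (capacity * loadFactor);
        return new int[] { capacity, threshold };
    }

    public static void main(String[] args) {
        for (int r = 0; r < 4; r++) {
            int[] ct = after(r, 11, 0.75f);
            System.out.println("capacity " + ct[0] + ", threshold " + ct[1]);
        }
    }
}
```

Starting from the defaults 11 and 0.75, the capacities run 11, 23, 47, 95 with thresholds 8, 17, 35, 71, so the capacity stays odd and roughly doubles at each rehash.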

keySet()

• For any key value, you can find out if that key is in the table or not: just use get()

• But how can you get a listing of all the keys in the table? There are many possible keys, and only a few of them will be in the table; it’s not feasible to check them all with get()

• The keySet() method returns a Set object that contains only the keys in the table:

    /**
     * Returns a Set view of the keys contained in this Hashtable.
     * The Set supports element removal (which removes the
     * corresponding entry from the Hashtable), but not element
     * addition.
     * @return a Set view of the keys contained in this Map.
     * @since 1.2
     */
    public Set<K> keySet() {
        //...
    }

• An Iterator for the Set can then be used to iterate efficiently over the keys in the table
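A short usage sketch: the code below iterates over the key set of a `java.util.Hashtable` and removes some keys through the iterator, which also removes the corresponding entries from the table. The class name, the method, and the "shorter than 3 characters" rule are made up for the example.

```java
import java.util.Hashtable;
import java.util.Iterator;
import java.util.Set;

// Hypothetical demo of iterating over, and removing through, the keySet() view.
class KeySetDemo {
    // Removes every entry whose key is shorter than 3 characters.
    static Hashtable<String, Integer> removeShortKeys(Hashtable<String, Integer> table) {
        Set<String> keys = table.keySet();        // a view of the table, not a copy
        for (Iterator<String> it = keys.iterator(); it.hasNext(); ) {
            String key = it.next();
            if (key.length() < 3) {
                it.remove();                      // removes the whole entry from the table
            }
        }
        return table;
    }
}
```

The iteration visits only the keys actually stored, in an order that depends on the table layout rather than on the keys' values.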

Serializable objects

• Since JDK1.1, Java has had the ability to “serialize” objects

• Serialization is the process of converting an existing object to a sequence of bytes, in order to be sent over a stream (e.g. saved to a file, or transmitted over a network connection, etc.)

serializing an object is also sometimes called ‘persisting’ or ‘pickling’ the object

• This is done in such a way that the object can be deserialized, i.e. reconstituted, later (e.g. by reading from the file, or when the serialized object is received at the other end of the network connection, etc.)

• In order for an object to be serialized, its class must be declared to implement the java.io.Serializable interface

• This interface does not specify any methods: a class that declares itself to implement it is just indicating that instances of it can be serialized

• Many Java library classes are serializable; user-defined classes can also be serializable

Serializing a serializable class

• If a class is Serializable, objects that are instances of that class or a subclass can be serialized

• To serialize an object, pass it to the writeObject() method of an appropriately created java.io.ObjectOutputStream object

• The object can be deserialized by creating a corresponding java.io.ObjectInputStream object and calling its readObject() method (you will want to downcast the returned Object reference to be of the appropriate type)
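As a minimal sketch of both directions, the code below serializes a Hashtable to an in-memory byte array and reads it back. The class and method names are hypothetical, and a `ByteArrayOutputStream` stands in for the file or network stream a real program would use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Hashtable;

// Hypothetical demo: serialize an object to bytes and reconstitute it.
class RoundTrip {
    static byte[] serialize(Object obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    // readObject() returns Object, so the result is downcast to the expected type.
    @SuppressWarnings("unchecked")
    static Hashtable<String, Integer> copy(Hashtable<String, Integer> table) {
        try {
            return (Hashtable<String, Integer>) deserialize(serialize(table));
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The deserialized table is a distinct object that holds the same key-value pairs as the original.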

Designing a serializable class

• If all the instance variables of a user-defined class are of primitive types or Serializable class types, then the class can be declared to implement the Serializable interface and instances of the class can be serialized

• If an instance variable is not of a Serializable class type, or you do not want it to be part of the serialized representation, the instance variable must be marked transient

• transient instance variables are not written to the serialized form; when the object is deserialized, they are set to their default values (null for class types, “zero” for primitive types)

to change this you can write your own serialization and deserialization methods, which can call the default methods; see online documentation for how to do this

• Classes themselves are not serialized, only objects! So, to get everything to work, the same class definition must be available in both serialization and deserialization contexts

As a corollary, static variables are never serialized: they are created and initialized when the class is loaded into the Java virtual machine, not when an instance of the class is deserialized
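The effect of `transient` and `static` can be observed with a small class. Everything in this sketch is hypothetical (the class `Session`, its fields, and the `roundTrip` helper); it serializes an instance to memory and reads it back so the restored fields can be inspected.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class, used only to illustrate transient and static fields.
class Session implements Serializable {
    private static final long serialVersionUID = 1L;

    static int constructorCalls = 0;   // static: belongs to the class, never serialized
    String user;                        // serialized normally
    transient String password;          // transient: left out of the serialized form

    Session(String user, String password) {
        this.user = user;
        this.password = password;
        constructorCalls++;
    }

    // Serializes s to an in-memory byte array and reads it back as a new object.
    static Session roundTrip(Session s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(s);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Session) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

After a round trip, `user` is restored, `password` comes back as null, and `constructorCalls` is unchanged because deserialization does not run the `Session` constructor.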

Next time

• Self-organizing data structures

• Self-organizing lists

• Splay trees

• Spatial data structures

• K-D trees

• The C++ Standard Template Library
