henry-walton -
Computer Science 112
Fundamentals of Programming II
Implementation Strategies for Unordered Collections
What They Are
• Bag - a collection of items in no particular order
• Set - a collection of unique items in no particular order
• Dictionary - a collection of values associated with unique keys
Variations
• SortedBag - a bag that allows clients to access items in sorted order
• SortedSet - a set that allows clients to access items in sorted order
• SortedDictionary - a dictionary that allows clients to access keys in sorted order
Sorted Set and Dictionary Implementations
• Array-based, using a sorted list
• Linked, using a linked binary search tree
• Must keep the tree balanced; insertions and removals will then be logarithmic as well
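The array-based strategy can be illustrated with a minimal sketch. It assumes a plain Python list kept in sorted order and the standard bisect module; the class name SortedListSet is hypothetical and is not the course's ArraySortedSet. Search is logarithmic, but insertion is still linear because items must shift to open a slot.

```python
import bisect

class SortedListSet:
    """Hypothetical array-based sorted set (illustration only)."""

    def __init__(self, items=()):
        self._items = []
        for item in items:
            self.add(item)

    def __contains__(self, item):
        # Binary search: O(log n) comparisons
        i = bisect.bisect_left(self._items, item)
        return i < len(self._items) and self._items[i] == item

    def add(self, item):
        # Find the insertion point, then shift items right: O(n)
        i = bisect.bisect_left(self._items, item)
        if i == len(self._items) or self._items[i] != item:
            self._items.insert(i, item)

    def __iter__(self):
        # Clients see the items in sorted order
        return iter(self._items)

    def __len__(self):
        return len(self._items)

s = SortedListSet([5, 1, 3, 1])
print(list(s))
```

A balanced binary search tree avoids the linear shifting cost, which is why the linked strategy needs the balance requirement mentioned above.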
d.isEmpty()
len(d)
iter(d) # Iterate through the keys
str(d)
key in d
d.get(key, defaultValue = None)
item = d[key]
d[key] = item # Add or replace
d.pop(key, defaultValue = None)
d.entries() # A set of entries
d.keys() # An iterator on the keys
d.values() # An iterator on the values
Dictionary Interface
Dictionary Implementations
• Array-based (like ArraySet and ArraySortedSet)
• Linked structure (like LinkedSet and TreeSortedSet)
• All use an Entry class to contain the key/value pair
Possible Organization I
ArraySet LinkedSet
AbstractBag
AbstractCollection
ArrayDict LinkedDict
Is a dictionary just a type of set with some additional methods?
ArrayBag LinkedBag
Possible Organization II
ArrayBag LinkedBag
AbstractBag
AbstractCollection
ArrayDict LinkedDict
AbstractDict
Which methods are implemented in AbstractDict?
ArraySet LinkedSet
class Entry(object):

    def __init__(self, key, value):
        self.key = key
        self.value = value

    def __eq__(self, other):
        if type(self) != type(other): return False
        return self.key == other.key

    def __lt__(self, other):
        if type(self) != type(other): return False
        return self.key < other.key

    def __le__(self, other):
        if type(self) != type(other): return False
        return self.key <= other.key
The Entry Class
Goes in abstractdict.py, where all dictionaries can see it
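Because entries compare by key only, a dictionary can search for a key by wrapping it in a probe entry. The following is a minimal sketch under that assumption; the list named entries and the variable probe are hypothetical stand-ins for whatever container a particular dictionary uses.

```python
class Entry(object):
    def __init__(self, key, value):
        self.key = key
        self.value = value

    def __eq__(self, other):
        if type(self) != type(other):
            return False
        return self.key == other.key

# Hypothetical container of entries (an array-based dictionary might use a list)
entries = [Entry("a", 1), Entry("b", 2)]

# A probe entry carries only the key; its value is irrelevant to ==
probe = Entry("b", None)
print(probe in entries)

# Locate the matching entry to read its value
position = entries.index(probe)
print(entries[position].value)
```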
from abstractcollection import AbstractCollection

class AbstractDict(AbstractCollection):

    def __init__(self):
        AbstractCollection.__init__(self, None)

    def __str__(self):
        return "{" + ", ".join(map(lambda entry: str(entry.key) + \
                                   ":" + str(entry.value),
                                   self.entries())) + "}"
The AbstractDict Class
{2:3, 6:7}
Can We Do Better?
• If we could associate each unordered set element or each unordered dictionary key with a unique index position in an array, we could have
– Constant-time search
– Constant-time insertion
– Constant-time removal
Hashing
• Each data element has a hash value, which is an integer; ideally, no two elements share the same one
• This value can be computed in constant time by a hash function
• This computation can be performed on each insertion, access, and removal
How Are the Elements Stored?
• The hash value is used to locate the element’s index in an array, thus preserving constant-time access
• How to compute this:
hashValue % capacity of array
Position will be >= 0 and < capacity
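The index computation can be tried directly. This is a minimal sketch; the function name array_index is hypothetical. It uses small integers because, in CPython, they hash to themselves, which makes the results easy to follow.

```python
def array_index(item, capacity):
    """Map an item's hash value to a position >= 0 and < capacity."""
    return abs(hash(item)) % capacity

capacity = 16
for item in [3, 21, -7, 100]:
    print(item, array_index(item, capacity))
```

In Python, % with a positive right operand never returns a negative result, so abs is not strictly required, but it keeps the intent explicit and matches the slides.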
def __contains__(self, item):
    index = abs(hash(item)) % len(self._array)
    return self._array[index] is not None
A Sample Access Method (Set)
• self._array is an array of items
• len(self._array) is the array’s current physical size
• hash(item) is a function that returns an item’s hash value
• Other access methods have a similar structure
A Sample Mutator Method (Set)
def add(self, item):
    if not item in self:
        index = abs(hash(item)) % len(self._array)
        self._array[index] = item
Adding Items
mySet.add("A")    index = 10
mySet.add("B")    index = 5
mySet.add("C")    index = 0
mySet.add("D")    index = 14
Add 12 more items
C E M Q B N F K T W A G L Y I D
Array is full
Resize the array and rehash all elements
Performance
• O(1) lookups, insertions, removals - wow!
• Cost of resizing the array is amortized over many insertions and removals
• Works as long as hashValue % capacity is not the same for two items
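Resizing means more than copying: each item's index depends on the capacity, so every item must be rehashed into the new array. Here is a minimal sketch, assuming a plain Python list stands in for the array and that no two items collide; the function name rehash is hypothetical.

```python
def rehash(old_array, new_capacity):
    """Return a new array with every item placed at its new index.
    Assumes no two items map to the same position."""
    new_array = [None] * new_capacity
    for item in old_array:
        if item is not None:
            new_array[abs(hash(item)) % new_capacity] = item
    return new_array

# Fill a small array of capacity 4
old = [None] * 4
for item in (1, 6, 11):
    old[abs(hash(item)) % 4] = item
print(old)

# Double the capacity; items move to new positions
new = rehash(old, 8)
print(new)
```

Each resize costs time proportional to the number of items, but if the capacity grows by a constant factor each time, that cost averages out to a constant per insertion.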
Problem: Collisions
• As more elements fill the array, the likelihood that their hash values map to the same array position increases
• A collision then occurs: that is, items compete for the same position in the array
def testHash(arrayLength = 10, numberOfItems = 5):
    print(" Item hash code array index")
    for i in range(1, numberOfItems + 1):
        item = "Item" + str(i)
        code = hash(item)
        index = abs(code) % arrayLength
        print("%7s%12d%8d" % (item, code, index))
A Tester Program
Load Factor
• An array’s load factor expresses the ratio of the number of elements to its capacity
• Example: elements(10) / length(30) = .3333
• Try to keep load factor low to minimize collisions
• Does waste some memory, though
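The load factor is a one-line computation, and an implementation might compare it against a threshold to decide when to resize. The sketch below assumes a threshold of 0.5 purely for illustration; the slides do not prescribe a value, and the function names are hypothetical.

```python
def load_factor(number_of_items, capacity):
    """Ratio of stored items to array capacity."""
    return number_of_items / capacity

def needs_resize(number_of_items, capacity, threshold=0.5):
    """threshold is an illustrative choice, not a value from the slides."""
    return load_factor(number_of_items, capacity) > threshold

print(round(load_factor(10, 30), 4))
print(needs_resize(10, 30))
print(needs_resize(20, 30))
```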
Collision Processing Strategies
• Linear collision processing - search for the next available empty slot in the array, wrapping around if the end is reached
• Can lead to clustering, where several elements that have collided now occupy consecutive positions
• Several small clusters may coalesce into a large cluster and thus degrade performance
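Linear collision processing can be sketched as follows. This is a minimal illustration, not the course's implementation: the class name LinearProbingSet is hypothetical, a Python list stands in for the array, and removal is omitted because it needs extra bookkeeping to keep later searches correct.

```python
class LinearProbingSet:
    """Hypothetical open-addressing set (illustration only)."""

    def __init__(self, capacity=8):
        self._array = [None] * capacity
        self._size = 0

    def _probe(self, item):
        """Return the index of item or of the first empty slot reached,
        or -1 if every slot was examined without finding either."""
        capacity = len(self._array)
        index = abs(hash(item)) % capacity
        for _ in range(capacity):
            slot = self._array[index]
            if slot is None or slot == item:
                return index
            index = (index + 1) % capacity   # wrap around at the end
        return -1

    def __contains__(self, item):
        index = self._probe(item)
        return index != -1 and self._array[index] is not None

    def add(self, item):
        index = self._probe(item)
        if index == -1:
            raise RuntimeError("array is full; resize and rehash first")
        if self._array[index] is None:
            self._array[index] = item
            self._size += 1

s = LinearProbingSet(8)
for item in (1, 9, 17):      # all three map to index 1 when capacity is 8
    s.add(item)
print(s._array[1:4])         # the collided items occupy consecutive slots
```

The printed slice shows a small cluster: three items that collided at index 1 now sit in positions 1, 2, and 3, so any later item that hashes into that range must walk past them.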
Collision Processing Strategies
• Rehashing - run one or more additional hash functions until a collision does not occur
• Works well when the load factor is small
• Multiple hash functions may contribute a large constant of proportionality to the running time
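Rehashing can be sketched as trying a sequence of hash functions in order until one yields a free position. Everything in this sketch is an assumption made for illustration: the three functions in HASH_FUNCTIONS are arbitrary, the function names are hypothetical, and a real implementation would resize or fall back to another strategy when every function collides.

```python
# Hypothetical hash functions, tried in order (illustration only)
HASH_FUNCTIONS = [
    lambda item: hash(item),
    lambda item: hash(item) // 7 + 1,
    lambda item: hash(item) * 31 + 17,
]

def rehash_add(array, item):
    """Place item at the first non-colliding position; return its index."""
    for hash_function in HASH_FUNCTIONS:
        index = abs(hash_function(item)) % len(array)
        if array[index] is None or array[index] == item:
            array[index] = item
            return index
    raise RuntimeError("every hash function collided; resize needed")

def rehash_contains(array, item):
    """Follow the same sequence of positions that rehash_add would use."""
    for hash_function in HASH_FUNCTIONS:
        index = abs(hash_function(item)) % len(array)
        if array[index] is None:
            return False      # add would have stopped here (no removals)
        if array[index] == item:
            return True
    return False

array = [None] * 8
for item in (1, 9, 17, 2):
    print(item, "stored at", rehash_add(array, item))
```

Each extra function call adds to the constant factor of every operation, which is the cost the last bullet above refers to.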
Collision Processing Strategies
• Quadratic collision processing - Move a considerable distance from the initial collision
• Does not require other rehashing functions
• When k is the collision position, we enter a loop that repeatedly attempts to locate an empty position
k + 1² // The first attempt to locate a position
k + 2² // The second attempt to locate a position
k + r² // The rth attempt to locate a position
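The probe sequence k + 1², k + 2², ..., k + r² can be sketched as a loop. This minimal sketch assumes that positions wrap around with % capacity, which the slide does not state, and the function name quadratic_add is hypothetical.

```python
def quadratic_add(array, item):
    """Place item using quadratic probing; return the index used."""
    capacity = len(array)
    k = abs(hash(item)) % capacity           # the collision position
    for r in range(capacity):                # r = 0 tries k itself
        index = (k + r * r) % capacity       # assumption: wrap around
        if array[index] is None or array[index] == item:
            array[index] = item
            return index
    raise RuntimeError("no empty position found; resize needed")

array = [None] * 16
for item in (1, 17, 33, 49):                 # all map to k = 1
    print(item, "stored at", quadratic_add(array, item))
```

The four colliding items land at positions 1, 2, 5, and 10, so they spread out instead of forming one consecutive cluster. Quadratic probing does not necessarily visit every slot, which is why the sketch raises an error rather than looping forever.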
Collision Processing Strategies
• Chaining
– Each hash value maps to an index, or bucket, in the array
– This bucket is at the head of a linked structure or chain of items whose hash values map to that index
Some Buckets and Chains
[Figure: an array with index positions 0 to 4; occupied positions head chains of items, shown as D5 D2, D6 D4, D8, and D3 D1 D7]
# Instance variables for locating data
self._foundEntry   # Pointer to item just located; undefined if not found
self._priorEntry   # Pointer to item prior to one just located; undefined if not found
self._index        # Index of chain in which item was located; undefined if not found

# Instance variables for data
self._array        # the array of collision lists
self._size         # number of items in the set
HashSet Data
Extra instance variables support pointer manipulations during insertions and removals
from node import Node
from arrays import Array   # assumption: module that defines the Array class used below
from abstractset import AbstractSet
from abstractcollection import AbstractCollection

class HashSet(AbstractCollection, AbstractSet):

    DEFAULT_CAPACITY = 1000

    def __init__(self, sourceCollection = None):
        self._array = Array(HashSet.DEFAULT_CAPACITY)
        self._foundEntry = self._priorEntry = None
        self._index = -1
        AbstractCollection.__init__(self, sourceCollection)
HashSet Initialization
Uses singly linked nodes for the collision lists
def __contains__(self, item):
    self._index = abs(hash(item)) % len(self._array)
    self._priorEntry = None
    self._foundEntry = self._array[self._index]
    while self._foundEntry != None:
        if self._foundEntry.data == item:
            return True
        else:
            self._priorEntry = self._foundEntry
            self._foundEntry = self._foundEntry.next
    return False
HashSet Searching
If this method returns True, the instance variables _index, _foundEntry, and _priorEntry allow other methods to locate and manipulate an item in the array’s collision list efficiently
def add(self, item):
    if not item in self:
        newEntry = Node(item, self._array[self._index])
        self._array[self._index] = newEntry
        self._size += 1
HashSet Insertion
Link to head of chain
def remove(self, item):
    if not item in self:
        raise KeyError(str(item) + " not in set")
    elif self._priorEntry is None:
        self._array[self._index] = self._foundEntry.next
    else:
        self._priorEntry.next = self._foundEntry.next
    self._size -= 1
HashSet Removal
Performance of Chaining
• If chains are evenly distributed across the array, close to O(1)
• If one or two chains get very long, processing tends to be linear
• Can use a large array but wastes memory
• On average, with a reasonable hash function, close to O(1)
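The difference between evenly distributed chains and one long chain can be seen by counting how many items land in each bucket. This is a minimal sketch with a hypothetical function name, chain_lengths; it uses integers, which hash to themselves in CPython, so the distribution is easy to predict.

```python
def chain_lengths(items, capacity):
    """Count how many items would land in each bucket's chain."""
    lengths = [0] * capacity
    for item in items:
        lengths[abs(hash(item)) % capacity] += 1
    return lengths

# 100 consecutive integers spread evenly over 10 buckets
even = chain_lengths(range(100), 10)

# 100 multiples of 10 all land in bucket 0
skewed = chain_lengths(range(0, 1000, 10), 10)

print(max(even), max(skewed))
```

In the first case every chain holds 10 items; in the second a single chain holds all 100, so a search in that set degrades to a linear scan even though the array and the number of items are the same.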
For Friday
Introduction to Graphs (Chapter 20)