Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections

Page 1: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Computer Science 112

Fundamentals of Programming II

Implementation Strategies for Unordered Collections

Page 2: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

What They Are

• Bag - a collection of items in no particular order

• Set - a collection of unique items in no particular order

• Dictionary - a collection of values associated with unique keys

Page 3: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Variations

• SortedBag - a bag that allows clients to access items in sorted order

• SortedSet - a set that allows clients to access items in sorted order

• SortedDictionary - a dictionary that allows clients to access keys in sorted order

Page 4: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Sorted Set and Dictionary Implementations

• Array-based, using a sorted list

• Linked, using a linked binary search tree

• Must keep the tree balanced; insertions and removals will then be logarithmic as well
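The array-based option can be sketched with Python's standard bisect module. This is only an illustration under assumptions: the class name SortedListSet is hypothetical and is not the course's ArraySortedSet. Searching the sorted list is logarithmic, but insertion still shifts items and so stays linear.

```python
import bisect

class SortedListSet:
    """Hypothetical array-based sorted set kept in a sorted Python list."""

    def __init__(self):
        self._items = []

    def __contains__(self, item):
        # Binary search: O(log n)
        i = bisect.bisect_left(self._items, item)
        return i < len(self._items) and self._items[i] == item

    def add(self, item):
        # Finding the slot is O(log n); the list insert shifts items, O(n)
        i = bisect.bisect_left(self._items, item)
        if i == len(self._items) or self._items[i] != item:
            self._items.insert(i, item)

    def __iter__(self):
        # Items come back in sorted order
        return iter(self._items)

s = SortedListSet()
for n in (5, 1, 3, 1):
    s.add(n)
print(list(s))   # [1, 3, 5]
```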

Page 5: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Dictionary Interface

    d.isEmpty()
    len(d)
    iter(d)                            # Iterate through the keys
    str(d)
    key in d
    d.get(key, defaultValue = None)
    item = d[key]
    d[key] = item                      # Add or replace
    d.pop(key, defaultValue = None)
    d.entries()                        # A set of entries
    d.keys()                           # An iterator on the keys
    d.values()                         # An iterator on the values

Page 6: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Dictionary Implementations

• Array-based (like ArraySet and ArraySortedSet)

• Linked structure (like LinkedSet and TreeSortedSet)

• All use an Entry class to contain the key/value pair

Page 7: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Possible Organization I

[Class diagram: AbstractCollection, AbstractBag, ArrayBag, LinkedBag, ArraySet, LinkedSet, ArrayDict, LinkedDict]

Is a dictionary just a type of set with some additional methods?

Page 8: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Possible Organization II

[Class diagram: AbstractCollection, AbstractBag, AbstractDict, ArrayBag, LinkedBag, ArraySet, LinkedSet, ArrayDict, LinkedDict]

Which methods are implemented in AbstractDict?

Page 9: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

The Entry Class

    class Entry(object):
        """Pairs a key with a value; comparisons use the key only."""

        def __init__(self, key, value):
            self.key = key
            self.value = value

        def __eq__(self, other):
            if type(self) != type(other): return False
            return self.key == other.key

        def __lt__(self, other):
            if type(self) != type(other): return False
            return self.key < other.key

        def __le__(self, other):
            if type(self) != type(other): return False
            return self.key <= other.key

Goes in abstractdict.py, where all dictionaries can see it

Page 10: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

The AbstractDict Class

    from abstractcollection import AbstractCollection

    class AbstractDict(AbstractCollection):

        def __init__(self):
            AbstractCollection.__init__(self, None)

        def __str__(self):
            # Join "key:value" strings for all entries, wrapped in braces
            return "{" + ", ".join(map(lambda entry: str(entry.key) +
                                       ":" + str(entry.value),
                                       self.entries())) + "}"

Example result: {2:3, 6:7}

Page 11: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Can We Do Better?

• If we could associate each unordered set element or each unordered dictionary key with a unique index position in an array, we could have

– Constant-time search

– Constant-time insertion

– Constant-time removal

Page 12: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Hashing

• Each data element has a hash value, which is an integer (ideally unique to that element)

• This value can be computed in constant time by a hash function

• This computation can be performed on each insertion, access, and removal

Page 13: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

How Are the Elements Stored?

• The hash value is used to locate the element’s index in an array, thus preserving constant-time access

• How to compute this:

hashValue % capacity of array

Position will be >= 0 and < capacity
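The index computation can be tried out directly with Python's built-in hash function. The helper name index_for is hypothetical, chosen for this sketch. In CPython, small non-negative integers hash to themselves, while string hashes normally differ from one run of the interpreter to the next, so the loop lists no expected output.

```python
def index_for(item, capacity):
    """Map an item's hash value to an array position in [0, capacity)."""
    return abs(hash(item)) % capacity

capacity = 16
print(index_for(42, capacity))   # 10, because hash(42) is 42 and 42 % 16 is 10

for item in ["A", "B", 3.5, ("a", 1)]:
    # Each position falls in range(capacity); the exact values may vary per run
    print(item, index_for(item, capacity))
```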

Page 14: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

A Sample Access Method (Set)

    def __contains__(self, item):
        # Assumes no two items share an index, so a non-empty slot means "found"
        index = abs(hash(item)) % len(self._array)
        return self._array[index] is not None

• self._array is an array of items

• len(self._array) is the array’s current physical size

• hash(item) is a function that returns an item’s hash value

• Other access methods have a similar structure

Page 15: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

A Sample Mutator Method (Set)

    def add(self, item):
        if item not in self:
            index = abs(hash(item)) % len(self._array)
            self._array[index] = item
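The access and mutator methods can be combined into a small runnable class. This is a sketch under assumptions: the name NaiveHashSet is hypothetical, a Python list stands in for the array, and every item is assumed to land on its own index. Unlike the sample access method, the membership test here compares the stored item with the one sought.

```python
class NaiveHashSet:
    """Sketch of the collision-free scheme; the class name is hypothetical."""

    def __init__(self, capacity=16):
        self._array = [None] * capacity   # a Python list stands in for the array

    def _index(self, item):
        return abs(hash(item)) % len(self._array)

    def __contains__(self, item):
        # Comparing with the item, rather than only testing for None,
        # keeps the answer right even if another item occupies this slot
        return self._array[self._index(item)] == item

    def add(self, item):
        if item not in self:
            self._array[self._index(item)] = item

s = NaiveHashSet()
for n in (10, 5, 0, 14):   # small ints hash to themselves: indexes 10, 5, 0, 14
    s.add(n)
print(10 in s, 7 in s)     # True False
```

Integers are used so that the indexes are the same on every run.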

Page 16: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Adding Items

A

mySet.add("A")

index = 10

Page 17: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Adding Items

B A

mySet.add("B")

index = 5

Page 18: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Adding Items

C B A

mySet.add("C")

index = 0

Page 19: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Adding Items

C B A D

index = 14

mySet.add("D")

Page 20: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Adding Items

C B A D

Add 12 more items

Page 21: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Adding Items

C E M Q B N F K T W A G L Y I D

Array is full

Resize the array and rehash all elements
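Resizing might look like the following sketch. The function name resized is hypothetical, a Python list stands in for the array, and the code assumes that no two items collide in the larger array; it raises an error rather than handling a collision.

```python
def resized(array, new_capacity):
    """Return a larger array with every item rehashed to its new position.

    Assumes that no two items map to the same index in the new array.
    """
    new_array = [None] * new_capacity
    for item in array:
        if item is not None:
            index = abs(hash(item)) % new_capacity
            if new_array[index] is not None:
                raise ValueError("collision at index " + str(index))
            new_array[index] = item
    return new_array

old = [None] * 4
for n in (4, 1, 6, 3):     # indexes 0, 1, 2, 3 in an array of length 4
    old[n % 4] = n
new = resized(old, 8)      # the same items now sit at indexes 4, 1, 6, 3
print(new)                 # [None, 1, None, 3, 4, None, 6, None]
```

An item's position depends on the capacity, which is why every item must be rehashed rather than copied to the same index.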

Page 22: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Performance

• O(1) lookups, insertions, removals - wow!

• Cost of resizing the array is amortized over many insertions and removals

• Works as long as hashValue % capacity is not the same for two items

Page 23: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Problem: Collisions

• As more elements fill the array, the likelihood that their hash values map to the same array position increases

• A collision then occurs: that is, items compete for the same position in the array

Page 24: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

A Tester Program

    def testHash(arrayLength = 10, numberOfItems = 5):
        """Prints the hash code and array index of some sample strings."""
        print(" Item hash code array index")
        for i in range(1, numberOfItems + 1):
            item = "Item" + str(i)
            code = hash(item)
            index = abs(code) % arrayLength
            print("%7s%12d%8d" % (item, code, index))

Page 25: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Load Factor

• An array’s load factor expresses the ratio of the number of elements to its capacity

• Example: elements(10) / length(30) = .3333

• Try to keep load factor low to minimize collisions

• Does waste some memory, though
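The load factor arithmetic is a single division. In this sketch the function names are hypothetical, and the 0.5 threshold is an arbitrary value chosen for illustration, not one given in the slides.

```python
def load_factor(number_of_items, capacity):
    """Ratio of stored items to array capacity."""
    return number_of_items / capacity

def needs_resize(number_of_items, capacity, threshold=0.5):
    # The 0.5 threshold is an arbitrary illustration, not a value from the slides
    return load_factor(number_of_items, capacity) > threshold

print(round(load_factor(10, 30), 4))   # 0.3333
print(needs_resize(10, 30))            # False
print(needs_resize(20, 30))            # True
```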

Page 26: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Collision Processing Strategies

• Linear collision processing - search for the next available empty slot in the array, wrapping around if the end is reached

• Can lead to clustering, where several elements that have collided now occupy consecutive positions

• Several small clusters may coalesce into a large cluster and thus degrade performance
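Linear collision processing can be sketched as follows. The function name is hypothetical, a Python list stands in for the array, and the code assumes that at least one slot is empty. Integers are used so that the indexes are the same on every run.

```python
def linear_probe_add(array, item):
    """Insert item using linear collision processing; return the index used.

    Assumes the array has at least one empty (None) slot.
    """
    capacity = len(array)
    index = abs(hash(item)) % capacity
    while array[index] is not None and array[index] != item:
        index = (index + 1) % capacity     # wrap around at the end
    array[index] = item
    return index

table = [None] * 8
# 6, 14, and 22 all start at index 6; 7 starts at index 7
print([linear_probe_add(table, n) for n in (6, 14, 22, 7)])   # [6, 7, 0, 1]
```

The four items end up in consecutive positions 6, 7, 0, and 1, which is the clustering effect described above.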

Page 27: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Collision Processing Strategies

• Rehashing - run one or more additional hash functions until a collision does not occur

• Works well when the load factor is small

• Multiple hash functions may contribute a large constant of proportionality to the running time
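Rehashing might be sketched like this. The function name and the second hash function are assumptions made for illustration; a real implementation would choose its hash functions with more care.

```python
def rehash_add(array, item, hash_functions):
    """Try each hash function in turn until one yields a usable slot."""
    capacity = len(array)
    for h in hash_functions:
        index = abs(h(item)) % capacity
        if array[index] is None or array[index] == item:
            array[index] = item
            return index
    raise RuntimeError("every hash function collided for " + repr(item))

# Two illustrative hash functions for integers; the second is an assumption
hash_functions = [hash, lambda n: 7 * n + 3]

table = [None] * 8
# 6 goes to index 6; 14 collides there, so the second function gives 101 % 8
print([rehash_add(table, n, hash_functions) for n in (6, 14)])   # [6, 5]
```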

Page 28: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Collision Processing Strategies

• Quadratic collision processing - Move a considerable distance from the initial collision

• Does not require other rehashing functions

• When k is the collision position, we enter a loop that repeatedly attempts to locate an empty position

k + 1^2 // The first attempt to locate a position

k + 2^2 // The second attempt to locate a position

k + r^2 // The rth attempt to locate a position
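The probe sequence k + 1^2, k + 2^2, and so on can be sketched as follows. The function name is hypothetical, positions wrap around with the modulus operator, and the loop gives up after as many attempts as the array has slots, because quadratic probing is not guaranteed to visit every position.

```python
def quadratic_probe_add(array, item):
    """Insert item using quadratic collision processing; return the index used."""
    capacity = len(array)
    k = abs(hash(item)) % capacity         # the collision position
    index = k
    for r in range(1, capacity + 1):
        if array[index] is None or array[index] == item:
            array[index] = item
            return index
        index = (k + r * r) % capacity     # k + r^2, wrapping around
    raise RuntimeError("no empty position found for " + repr(item))

table = [None] * 8
# All three items start at k = 6; the probes then try 7 and 10 % 8 = 2
print([quadratic_probe_add(table, n) for n in (6, 14, 22)])   # [6, 7, 2]
```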

Page 29: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Collision Processing Strategies

• Chaining

– Each hash value specifies an index or bucket in the array

– This bucket is at the head of a linked structure or chain of items with the same hash value

Page 30: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Some Buckets and Chains

[Figure: an array with indexes 0 through 4; the items D1 through D8 are stored in chains that hang off the array's buckets]

Page 31: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

HashSet Data

    # Instance variables for locating data
    self._foundEntry   # Pointer to item just located; undefined if not found
    self._priorEntry   # Pointer to item prior to one just located; undefined if not found
    self._index        # Index of chain in which item was located; undefined if not found

    # Instance variables for data
    self._array        # The array of collision lists
    self._size         # Number of items in the set

Extra instance variables support pointer manipulations during insertions and removals

Page 32: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

HashSet Initialization

    from node import Node
    from abstractset import AbstractSet
    from abstractcollection import AbstractCollection
    from arrays import Array   # assumed module name; the slide uses Array without importing it

    class HashSet(AbstractCollection, AbstractSet):

        DEFAULT_CAPACITY = 1000

        def __init__(self, sourceCollection = None):
            self._array = Array(HashSet.DEFAULT_CAPACITY)
            self._foundEntry = self._priorEntry = None
            self._index = -1
            AbstractCollection.__init__(self, sourceCollection)

Uses singly linked nodes for the collision lists

Page 33: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

HashSet Searching

    def __contains__(self, item):
        self._index = abs(hash(item)) % len(self._array)
        self._priorEntry = None
        self._foundEntry = self._array[self._index]
        # Walk the chain, remembering the node before the current one
        while self._foundEntry is not None:
            if self._foundEntry.data == item:
                return True
            else:
                self._priorEntry = self._foundEntry
                self._foundEntry = self._foundEntry.next
        return False

If this method returns True, the instance variables _index, _foundEntry, and _priorEntry allow other methods to locate and manipulate an item in the array’s collision list efficiently

Page 34: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

HashSet Insertion

    def add(self, item):
        if item not in self:
            # Link to head of chain
            newEntry = Node(item, self._array[self._index])
            self._array[self._index] = newEntry
            self._size += 1

Page 35: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

HashSet Removal

    def remove(self, item):
        if not item in self:
            raise KeyError(str(item) + " not in set")
        elif self._priorEntry is None:
            # Item is at the head of its chain
            self._array[self._index] = self._foundEntry.next
        else:
            # Unlink the item from the middle or end of its chain
            self._priorEntry.next = self._foundEntry.next
        self._size -= 1
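The searching, insertion, and removal methods can be exercised together in a self-contained sketch. The class name ChainedHashSet is hypothetical, a Python list stands in for the Array class, and a small Node class stands in for the course's node module, so the code runs without the course library.

```python
class Node:
    """Singly linked node; stands in for the course's Node class."""
    def __init__(self, data, next=None):
        self.data = data
        self.next = next

class ChainedHashSet:
    """Self-contained sketch of a chained hash set; the name is hypothetical."""

    def __init__(self, capacity=8):
        self._array = [None] * capacity
        self._size = 0
        self._foundEntry = self._priorEntry = None
        self._index = -1

    def __len__(self):
        return self._size

    def __contains__(self, item):
        # Leaves _index, _priorEntry, and _foundEntry set for add and remove
        self._index = abs(hash(item)) % len(self._array)
        self._priorEntry = None
        self._foundEntry = self._array[self._index]
        while self._foundEntry is not None:
            if self._foundEntry.data == item:
                return True
            self._priorEntry = self._foundEntry
            self._foundEntry = self._foundEntry.next
        return False

    def add(self, item):
        if item not in self:
            # Link the new node to the head of the chain
            self._array[self._index] = Node(item, self._array[self._index])
            self._size += 1

    def remove(self, item):
        if item not in self:
            raise KeyError(str(item) + " not in set")
        if self._priorEntry is None:
            self._array[self._index] = self._foundEntry.next
        else:
            self._priorEntry.next = self._foundEntry.next
        self._size -= 1

s = ChainedHashSet()
for n in (6, 14, 22):      # all three map to index 6 when the capacity is 8
    s.add(n)
s.remove(14)               # unlink from the middle of the chain
print(len(s), 6 in s, 14 in s, 22 in s)   # 2 True False True
```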

Page 36: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

Performance of Chaining

• If chains are evenly distributed across the array, close to O(1)

• If one or two chains get very long, processing tends to be linear

• Can use a large array but wastes memory

• On average, close to O(1)
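The difference between evenly distributed chains and one long chain can be seen by counting chain lengths. The helper name chain_lengths is hypothetical; integers are used because small non-negative integers hash to themselves in CPython, which makes the result repeatable.

```python
def chain_lengths(items, capacity):
    """Count how many items would land in each bucket's chain."""
    lengths = [0] * capacity
    for item in items:
        lengths[abs(hash(item)) % capacity] += 1
    return lengths

# Sixteen consecutive integers spread evenly: every chain has two items
print(chain_lengths(range(16), 8))          # [2, 2, 2, 2, 2, 2, 2, 2]

# Sixteen multiples of 8 all share bucket 0: one chain of sixteen items
print(chain_lengths(range(0, 128, 8), 8))   # [16, 0, 0, 0, 0, 0, 0, 0]
```

A search in the second case has to walk a chain holding every item, which is the linear behavior noted above.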

Page 37: Computer Science 112 Fundamentals of Programming II Implementation Strategies for Unordered Collections.

For Friday

Introduction to Graphs (Chapter 20)