Fundamentals of Python: From First Programs Through Data Structures Chapter 19 Unordered Collections: Sets and Dictionaries


Objectives

After completing this chapter, you will be able to:

• Implement a set type and a dictionary type using lists

• Explain how hashing can help a programmer achieve constant access time to unordered collections

• Explain strategies for resolving collisions during hashing, such as linear probing, quadratic probing, and bucket/chaining


• Use a hashing strategy to implement a set type and a dictionary type

• Use a binary search tree to implement a sorted set type and a sorted dictionary type

Using Sets

• A set is a collection of items in no particular order

• Most typical operations:
  – Return the number of items in the set
  – Test for the empty set (a set that contains no items)
  – Add an item to the set
  – Remove an item from the set
  – Test for set membership
  – Obtain the union of two sets
  – Obtain the intersection of two sets
  – Obtain the difference of two sets
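
The operations listed above map onto Python's built-in set type. A brief sketch follows; the variable names and values are chosen purely for illustration.

```python
# Illustrative values only; the variable names are assumptions.
a = {1, 2, 3, 4}
b = {3, 4, 5}

size = len(a)                # number of items in the set
is_empty = len(set()) == 0   # an empty set contains no items
a.add(6)                     # add an item
a.remove(6)                  # remove an item (raises KeyError if absent)
has_three = 3 in a           # membership test

union = a | b                # items in either set
intersection = a & b         # items in both sets
difference = a - b           # items in a but not in b
```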


The Python set Class


A Sample Session with Sets


Applications of Sets

• Sets have many applications in the area of data processing
  – Example: In database management, the answer to a query that contains the conjunction of two keys could be constructed from the intersection of the sets of items associated with those keys
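
The database example can be sketched in a few lines. The index contents, key strings, and the function name `conjunctive_query` are all hypothetical.

```python
# Hypothetical index: each key maps to the set of record ids associated with it.
index = {
    "city=Boston": {101, 102, 105},
    "status=active": {102, 103, 105, 107},
}

def conjunctive_query(index, key1, key2):
    """Return the ids of records associated with both keys."""
    return index.get(key1, set()) & index.get(key2, set())

matches = conjunctive_query(index, "city=Boston", "status=active")
```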


Implementations of Sets

• Arrays and lists may be used to contain the data items of a set
  – A linked list has the advantage of supporting constant-time removals of items
    • Once they are located in the structure

• Hashing attempts to approximate random access into an array for insertions, removals, and searches


Relationship Between Sets and Dictionaries

• A dictionary is an unordered collection of elements called entries
  – Each entry consists of a key and an associated value
  – A dictionary’s keys must be unique, but its values may be duplicated

• One can think of a dictionary as having a set of keys


List Implementations of Sets and Dictionaries

• The simplest implementations of sets and dictionaries use lists

• This section presents these implementations and assesses their run-time performance


Sets

• List implementation of a set
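
A minimal sketch of a list-based set, assuming a class named `ListSet` (the name and method selection are assumptions, not taken from the slides). Every membership test is a linear search of the underlying list.

```python
class ListSet:
    """A set backed by a Python list (hypothetical sketch)."""

    def __init__(self, items=()):
        self._items = []
        for item in items:
            self.add(item)

    def __len__(self):
        return len(self._items)

    def __contains__(self, item):
        # Linear search of the underlying list
        return item in self._items

    def __iter__(self):
        return iter(self._items)

    def add(self, item):
        # Insert only if the item is not already present
        if item not in self._items:
            self._items.append(item)

    def remove(self, item):
        # Mirrors the built-in set: KeyError if the item is absent
        if item not in self._items:
            raise KeyError(item)
        self._items.remove(item)

    def union(self, other):
        result = ListSet(self)
        for item in other:
            result.add(item)
        return result

    def intersection(self, other):
        return ListSet(item for item in self if item in other)
```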


Dictionaries

• Our list-based implementation of a dictionary is called ListDict
  – The entries in a dictionary consist of two parts, a key and a value

• A list implementation of a dictionary behaves in many ways like a list implementation of a set
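
One way such a ListDict might look is sketched below. Only the class name comes from the slides; the `Entry` helper, the `_find` method, and the choice of operations are assumptions.

```python
class Entry:
    """A key/value pair stored in the dictionary (hypothetical helper)."""

    def __init__(self, key, value):
        self.key = key
        self.value = value


class ListDict:
    """A dictionary backed by a Python list of entries."""

    def __init__(self):
        self._items = []

    def _find(self, key):
        # Linear search for the entry with the given key
        for entry in self._items:
            if entry.key == key:
                return entry
        return None

    def __len__(self):
        return len(self._items)

    def __contains__(self, key):
        return self._find(key) is not None

    def __getitem__(self, key):
        entry = self._find(key)
        if entry is None:
            raise KeyError(key)
        return entry.value

    def __setitem__(self, key, value):
        entry = self._find(key)
        if entry is None:
            self._items.append(Entry(key, value))   # new key
        else:
            entry.value = value                     # replace existing value

    def pop(self, key):
        for i, entry in enumerate(self._items):
            if entry.key == key:
                return self._items.pop(i).value
        raise KeyError(key)

    def keys(self):
        return [entry.key for entry in self._items]
```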


Complexity Analysis of the List Implementations of Sets and Dictionaries

• The list implementations of sets and dictionaries require little programmer effort
  – Unfortunately, they do not perform well

• Basic accessing methods must perform a linear search of the underlying list
  – Each basic accessing method is O(n)


Hashing Strategies

• Key-to-address transformation or a hashing function
  – Acts on a given key by returning its relative position in an array

• Hash table
  – An array used with a hashing strategy

• Collision
  – Placement of different keys at the same array index
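
A small sketch of these three terms, assuming integer keys and the common remainder-based hashing function (the function name `home_index`, the capacity, and the keys are illustrative choices).

```python
def home_index(key, capacity):
    """Map an integer key to a position in an array of the given capacity."""
    return key % capacity

capacity = 7
table = [None] * capacity          # the hash table
collisions = []                    # (new key, resident key, index) triples

for key in [3, 5, 8, 10]:
    index = home_index(key, capacity)
    if table[index] is None:
        table[index] = key
    else:
        # A different key already occupies this index: a collision
        collisions.append((key, table[index], index))
```

Here the keys 3 and 10 both map to index 3, so the second of them collides with the first.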


The Relationship of Collisions to Density

• Density (load factor)
  – The number of keys relative to the length of an array

• As the density decreases, so does the probability of collisions

• Keeping the load factor even lower (say, below .2) seems like a good way to avoid collisions
  – The cost of memory incurred by load factors below .5 is probably prohibitive for data sets of millions of items
  – Even load factors below .5 cannot prevent many collisions from occurring for some data sets
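
The relationship can be explored with a short experiment. This is a sketch under stated assumptions: keys are random integers, the hashing function is `key % capacity`, and `count_collisions` is a hypothetical helper. The counts depend entirely on the keys drawn, so no particular output is claimed.

```python
import random

def count_collisions(keys, capacity):
    """Count keys whose home index (key % capacity) is already taken."""
    occupied = set()
    collisions = 0
    for key in keys:
        index = key % capacity
        if index in occupied:
            collisions += 1
        else:
            occupied.add(index)
    return collisions

# 200 distinct sample keys; the seed only makes the run repeatable.
keys = random.Random(42).sample(range(1_000_000), 200)

for capacity in (250, 400, 1000, 4000):
    density = len(keys) / capacity
    print(f"density {density:.2f}: {count_collisions(keys, capacity)} collisions")
```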


Hashing with Non-Numeric Keys

• Try returning the sum of the ASCII values in the string

• This method has the effect of producing the same keys for anagrams
  – Strings that contain the same characters, but in different order

• First letters of many words in English are unevenly distributed
  – This might have the effect of weighting or biasing the sums generated
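
The anagram problem is easy to reproduce. A minimal sketch, with `ascii_sum` as an illustrative name:

```python
def ascii_sum(text):
    """Return the sum of the character codes in the string."""
    return sum(ord(ch) for ch in text)

# "listen" and "silent" contain the same characters in a different order,
# so both produce the same sum.
same = ascii_sum("listen") == ascii_sum("silent")
```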


• One solution:
  – If the length of the string is greater than a certain threshold
    • Drop the first character from the string before computing the sum
    • Can also subtract the ASCII value of the last character

• Python also includes a standard hash function for use in hashing applications
  – The function can receive any hashable Python object as an argument and returns an integer; equal objects return equal integers, but distinct objects are not guaranteed distinct integers
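
The adjustments above might be sketched as follows. The function name `string_hash` and the threshold of 4 are assumptions, and this is one reading of the rule rather than the book's listing. The last lines show the built-in `hash` reduced to an array index; note that distinct objects can share a hash value, and that string hashes vary between interpreter runs by default.

```python
def string_hash(text, threshold=4):
    """Sum of character codes with the two adjustments for long strings."""
    if len(text) > threshold:
        text = text[1:]                      # drop the first character
        total = sum(ord(ch) for ch in text)
        total -= ord(text[-1])               # subtract the last character's code
        return total
    return sum(ord(ch) for ch in text)

# The anagrams no longer produce the same value.
different = string_hash("listen") != string_hash("silent")

# Built-in hash function reduced to an index for a table of 11 cells.
capacity = 11
index = hash("listen") % capacity            # always in range(capacity)
equal_objects_agree = hash((1, 2)) == hash((1, 2))
```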


Linear Probing

• Linear probing
  – Simplest way to resolve a collision
  – Search the array, starting from the collision spot, for the first available position

• At the start of an insertion, the hashing function is run to compute the home index of the item
  – If the cell at the home index is not available, move the index to the right to probe for an available cell
  – When the search reaches the last position of the array, probing wraps around to continue from the first position

• For retrievals, stop the probing process when the current array cell is empty or it contains the target item

• For removals, probe as in retrievals
  – If the target item is found, its cell is set to DELETED

• Problem: After several insertions and removals, an item may be farther away from its home index than it needs to be
  – This increases the average overall access time

• Two ways to deal with this problem:
  – After a removal, shift the items on the cell’s right over to the cell’s left until an empty cell, a currently occupied cell, or the home indexes for each item are reached
  – Regularly rehash the table (e.g., when the load factor reaches .5)

• Clustering: Occurs when items causing a collision are relocated to the same region within the array
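
Insertion, retrieval, and removal with linear probing can be sketched as below. The function names, the use of `None` for an empty cell, and the `DELETED` sentinel object are assumptions; the sketch does not include the shifting or rehashing remedies.

```python
EMPTY = None
DELETED = object()          # sentinel marking the cell of a removed item

def probe(table, item):
    """Yield indexes from the item's home index onward, wrapping around."""
    capacity = len(table)
    home = hash(item) % capacity
    for offset in range(capacity):
        yield (home + offset) % capacity

def insert(table, item):
    """Place the item in the first available cell and return its index."""
    for index in probe(table, item):
        if table[index] is EMPTY or table[index] is DELETED:
            table[index] = item
            return index
    raise OverflowError("hash table is full")

def find(table, item):
    """Return the item's index, or None if it is not in the table."""
    for index in probe(table, item):
        cell = table[index]
        if cell is EMPTY:
            return None                 # an empty cell ends the search
        if cell is not DELETED and cell == item:
            return index
    return None

def remove(table, item):
    """Mark the item's cell as DELETED; return True if it was found."""
    index = find(table, item)
    if index is None:
        return False
    table[index] = DELETED
    return True
```

With a table of 7 cells, the integer keys 3, 10, and 17 all have home index 3 and end up in the adjacent cells 3, 4, and 5, a small instance of clustering. Marking a cell DELETED rather than EMPTY keeps later items on the same probe path reachable.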


Quadratic Probing

• To avoid the clustering associated with linear probing, we can advance the search for an empty position a considerable distance from the collision point
  – Quadratic probing: Increments the home index by the square of a distance on each attempt

• Problem: By jumping over some cells, one or more of them might be missed
  – Can lead to some wasted space

• Here is the code for insertions, updated to use quadratic probing:
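
A minimal sketch of such an insertion follows; it is not necessarily the book's own listing. The function name `insert_quadratic` and the use of `None` for an empty cell are assumptions.

```python
EMPTY = None

def insert_quadratic(table, item):
    """Insert the item using quadratic probing; return the index used."""
    capacity = len(table)
    home = hash(item) % capacity
    for distance in range(capacity):
        # Offsets from the home index grow as 0, 1, 4, 9, ...
        index = (home + distance ** 2) % capacity
        if table[index] is EMPTY:
            table[index] = item
            return index
    raise OverflowError("no available cell found on the probe sequence")
```

In a table of 7 cells, keys with home index 3 can only ever reach cells 3, 4, 0, and 5, so a fifth such key fails even though three cells remain empty. That is the wasted-space problem noted above.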


Chaining

• Items are stored in an array of linked lists (chains)
  – Each item’s key locates the bucket (index) of the chain in which the item resides or is to be inserted

• Retrieval and removal each perform these steps:
  – Compute the item’s home index in the array
  – Search the linked list at that index for the item

• To insert an item:
  – Compute the item’s home index in the array
  – If the cell is empty, create a node with the item and assign the node to the cell; else (collision), insert the item in the chain
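
The steps above can be sketched with a singly linked node. The `Node` class and the function names are assumptions; new items are placed at the head of the chain, which is one of several reasonable choices.

```python
class Node:
    """A singly linked list node (hypothetical helper)."""

    def __init__(self, data, next=None):
        self.data = data
        self.next = next

def chain_insert(table, item):
    """Insert the item at the head of the chain at its home index."""
    index = hash(item) % len(table)
    # Covers both cases: an empty cell (None) or an existing chain
    table[index] = Node(item, table[index])
    return index

def chain_find(table, item):
    """Return True if the item is in the chain at its home index."""
    node = table[hash(item) % len(table)]
    while node is not None:
        if node.data == item:
            return True
        node = node.next
    return False

def chain_remove(table, item):
    """Unlink the item's node from its chain; return True if found."""
    index = hash(item) % len(table)
    previous, node = None, table[index]
    while node is not None:
        if node.data == item:
            if previous is None:
                table[index] = node.next
            else:
                previous.next = node.next
            return True
        previous, node = node, node.next
    return False
```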


Complexity Analysis

• Linear probing: Complexity depends on the load factor (D) and the tendency of items to cluster
  – Worst case (method traverses entire array before locating item’s position): behavior is linear
  – Average behavior in searching for an item that cannot be found is $\frac{1}{2}\left[1 + \frac{1}{(1 - D)^2}\right]$

• Quadratic probing: Tends to mitigate clustering
  – Average search complexity is $1 - \log_e(1 - D) - \frac{D}{2}$ for the successful case and $\frac{1}{1 - D} - D - \log_e(1 - D)$ for the unsuccessful case
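
To get a feel for these averages, the formulas can be evaluated for a few load factors. The function names are illustrative; `math.log` is the natural logarithm.

```python
import math

def linear_unsuccessful(d):
    """Average probes for an unsuccessful search with linear probing."""
    return 0.5 * (1 + 1 / (1 - d) ** 2)

def quadratic_successful(d):
    """Average probes for a successful search with quadratic probing."""
    return 1 - math.log(1 - d) - d / 2

def quadratic_unsuccessful(d):
    """Average probes for an unsuccessful search with quadratic probing."""
    return 1 / (1 - d) - d - math.log(1 - d)

for d in (0.25, 0.5, 0.75):
    print(d, linear_unsuccessful(d), quadratic_successful(d),
          quadratic_unsuccessful(d))
```

At D = .5, for example, an unsuccessful linear-probing search averages 2.5 probes, while the unsuccessful quadratic case averages about 2.19.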


• Chaining:
  – Locating an item consists of two parts:
    • Computing the home index: constant-time behavior
    • Searching the linked list upon a collision: linear
  – Worst case (all items that have collided with each other are in one chain, which is a linked list): O(n)
  – If lists are evenly distributed in the array and the array is fairly large, the second part can be close to constant
  – Best case (a chain of length 1 occupies each array cell): O(1)


Case Study: Profiling Hashing Strategies

• Request:
  – Write a program that allows a programmer to profile different hashing strategies

• Analysis:
  – Should allow the programmer to gather statistics on the number of collisions caused by the hashing strategies
  – Other useful information:
    • Hash table’s load factor
    • Number of probes needed to resolve collisions during linear or quadratic probing


• Analysis (continued):
  – Here are the profiler’s results:

• Design:
  – Profiler class requires instance variables to track a table, number of collisions, and number of probes


• Implementation:
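
A minimal sketch of what such a profiler might look like, not the book's implementation. The design calls only for a table, a collision count, and a probe count; the method names, the `quadratic` flag, and the rule that a collision is counted once per key whose home cell is taken are assumptions.

```python
class Profiler:
    """Counts collisions and probes while loading keys into a hash table."""

    def __init__(self):
        self._table = []
        self._collisions = 0
        self._probes = 0

    def test(self, keys, capacity, quadratic=False):
        """Insert the keys into a fresh table and record statistics."""
        self._table = [None] * capacity
        self._collisions = 0
        self._probes = 0
        for key in keys:
            self._insert(key, quadratic)

    def _insert(self, key, quadratic):
        capacity = len(self._table)
        home = hash(key) % capacity
        for distance in range(capacity):
            step = distance ** 2 if quadratic else distance
            index = (home + step) % capacity
            self._probes += 1                  # every cell inspected is a probe
            if self._table[index] is None:
                self._table[index] = key
                if distance > 0:
                    self._collisions += 1      # home cell was already taken
                return
        raise OverflowError("no available cell on the probe sequence")

    def load_factor(self):
        used = sum(1 for cell in self._table if cell is not None)
        return used / len(self._table)

    def collisions(self):
        return self._collisions

    def probes(self):
        return self._probes
```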


Hashing Implementation of Dictionaries

• HashDict uses the bucket/chaining strategy
  – To manage the array, declare three instance variables: _table, _size, and _capacity
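
A sketch of a HashDict built around those three instance variables. For brevity, Python lists stand in for the linked chains; the default capacity of 29, the `_bucket` helper, and the method selection are assumptions.

```python
class HashDict:
    """Dictionary using bucket/chaining; Python lists stand in for chains."""

    def __init__(self, capacity=29):
        self._capacity = capacity
        self._table = [[] for _ in range(capacity)]   # one bucket per cell
        self._size = 0

    def _bucket(self, key):
        # Locate the bucket at the key's home index
        return self._table[hash(key) % self._capacity]

    def __len__(self):
        return self._size

    def __contains__(self, key):
        return any(k == key for k, _ in self._bucket(key))

    def __getitem__(self, key):
        for k, value in self._bucket(key):
            if k == key:
                return value
        raise KeyError(key)

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for entry in bucket:
            if entry[0] == key:
                entry[1] = value        # replace the value of an existing key
                return
        bucket.append([key, value])     # new entry
        self._size += 1

    def pop(self, key):
        bucket = self._bucket(key)
        for i, (k, value) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self._size -= 1
                return value
        raise KeyError(key)

    def keys(self):
        return [k for bucket in self._table for k, _ in bucket]
```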


Hashing Implementation of Sets

• The design of the methods for HashSet is the same as that of the methods in HashDict, except:
  – __contains__ searches for an item (not a key)
  – add inserts an item only if it is not already in the set
  – A single iterator method is included instead of separate methods that return keys and values


Sorted Sets and Dictionaries

• Each item added to a sorted set must be comparable with its other items
  – The same applies for keys added to a sorted dictionary

• The iterator for each type of collection guarantees its users access to items or keys in sorted order

• Implementation alternatives:
  – List-based: must maintain a sorted list of the items
  – Hashing implementation: not feasible
  – Binary search tree implementation: generally provides logarithmic access to data items
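
The binary search tree alternative can be sketched as follows. The class name `TreeSortedSet` and its methods are assumptions, and the tree is not balanced, so logarithmic access holds only when the tree stays reasonably bushy.

```python
class TreeSortedSet:
    """A sorted set backed by an unbalanced binary search tree (sketch)."""

    class _Node:
        def __init__(self, item):
            self.item = item
            self.left = None
            self.right = None

    def __init__(self):
        self._root = None
        self._size = 0

    def __len__(self):
        return self._size

    def __contains__(self, item):
        node = self._root
        while node is not None:
            if item == node.item:
                return True
            node = node.left if item < node.item else node.right
        return False

    def add(self, item):
        """Insert the item unless an equal item is already present."""
        if self._root is None:
            self._root = self._Node(item)
            self._size = 1
            return
        node = self._root
        while True:
            if item == node.item:
                return                          # duplicates are ignored
            branch = "left" if item < node.item else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, self._Node(item))
                self._size += 1
                return
            node = child

    def __iter__(self):
        # In-order traversal visits the items in sorted order
        stack, node = [], self._root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node.left
            node = stack.pop()
            yield node.item
            node = node.right
```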


Summary

• A set is an unordered collection of items
  – Each item is unique
  – List-based implementation: linear-time access
  – Hashing implementation: constant-time access

• Items in a sorted set can be visited in sorted order
  – A tree-based implementation of a sorted set supports logarithmic-time access

• A dictionary is an unordered collection of entries, where each entry consists of a key and a value
  – Each key is unique, but values may be duplicated


• A sorted dictionary imposes an ordering by comparison on its keys

• Implementations of both types of dictionaries are similar to those of sets

• Hashing: Technique for locating an item in constant time
  – Techniques to resolve collisions: linear collision processing, quadratic collision processing, chaining
  – The run-time and memory aspects involve the load factor of the array
