Data Structure - Yazdcs.yazd.ac.ir/farshi/Teaching/RandAlg3931/Slides/ch8_data... · 2015. 1....
Data Structure
Mohsen Arab
Yazd University
January 13, 2015
Mohsen Arab (Yazd University ) Data Structure January 13, 2015 1 / 86
Table of Content
Binary Search Tree
Treaps
Skip Lists
Hash Tables
Fundamental Data-structuring Problem
Fundamental data-structuring problem: maintain a collection S1, S2, ... of sets of items so as to efficiently support certain types of queries and operations:
MAKESET(S): create a new (empty) set S.
INSERT(i, S): insert item i into the set S.
DELETE(k, S): delete the item indexed by the key value k from the set S.
FIND(k, S): return the item indexed by the key value k in the set S.
JOIN(S1, i, S2): replace the sets S1 and S2 by the new set S = S1 ∪ {i} ∪ S2, where
1. for all items j ∈ S1, k(j) < k(i),
2. for all items j ∈ S2, k(j) > k(i).
Fundamental Data-structuring Problem (cont.)
PASTE(S1, S2): replace the sets S1 and S2 by the new set S = S1 ∪ S2, where for all items i ∈ S1 and j ∈ S2, k(i) < k(j).
SPLIT(k, S): replace the set S by the new sets S1 and S2, where
S1 = {j ∈ S | k(j) < k},
S2 = {j ∈ S | k(j) > k}.
Binary search tree
Binary search tree: a binary tree in which the keys satisfy the search tree property.
Definition
Search tree property: for every node with key value k, the left sub-tree contains only key values smaller than k and the right sub-tree contains only key values larger than k.
The key values in a binary tree are in symmetric order if they satisfy the search tree property.
We will assume that BSTs are endogenous.
Definition
Endogenous: all key values are stored at internal nodes, and all leaf nodes are empty.
This ensures that the trees are full, which means that every non-leaf (internal) node has exactly two children.
Standard implementations of operations
MakeSet(S): initialize an empty tree for the set S.
Join(S1, k, S2): create a node containing key k as the root, and make S1 and S2 its left and right sub-trees, respectively.
Search
Example: FIND(4, S)
Insert
Perform FIND(k, S) and insert k where the search fails (into the empty leaf node).
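The search and insert steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Node, find and insert are hypothetical, keys are integers, and None stands in for the empty leaf nodes of the endogenous tree.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None    # None plays the role of an empty leaf
    right: Optional["Node"] = None


def find(root: Optional[Node], key: int) -> Optional[Node]:
    """Walk down from the root; return the node holding key, or None."""
    node = root
    while node is not None and node.key != key:
        node = node.left if key < node.key else node.right
    return node


def insert(root: Optional[Node], key: int) -> Node:
    """Insert key at the empty leaf where the search fails; return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key == node.key:
            return root                       # key already present, nothing to do
        branch = "left" if key < node.key else "right"
        child = getattr(node, branch)
        if child is None:
            setattr(node, branch, Node(key))  # the search failed here
            return root
        node = child
```

Both routines follow a single root-to-leaf path, so their running time is proportional to the height of the tree.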
Implementations of operations (Delete)
Delete(k, S):
1. If the node v containing k has a leaf as one of its two children, v can be removed directly. For example, if the right child of v is a leaf, then replace v by L(v), the left child of v, as the child of P(v), the parent of v.
Implementations of operations (Delete)
2. If neither of the children of v is a leaf,
let k′ be the key value that is the predecessor of k in the set S.
Now we can delete the node containing k′, since its right child is a leaf, and replace the key value k by k′ in the node v.
Implementations of operations (cont.)
PASTE(S1, S2):
1. Delete the largest key value, say k, from S1.
2. Apply JOIN(S1, k, S2).
Note
k can be found by doing a FIND(∞, S1).
SPLIT(k, S):
If k is at the root of S, do the reverse of the steps employed in JOIN(S1, k, S2).
Otherwise, use rotations to move k to the root first.
Problem:
Each operation can be performed in time proportional to the height of the tree, but there are sequences of INSERT operations that result in a tree of height linear in n.
Solution:
Perform rotations during update operations to ensure that all leaves are at distance O(log n) from the root.
Rotations
Each type of rotation moves a node, together with one of its sub-trees, closer to the root (and some other nodes away from the root), while preserving the search tree property.
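A minimal sketch of the two rotations, assuming a hypothetical Node class with left and right child pointers (None for an empty leaf). Each function returns the new root of the rotated sub-tree, which the caller must re-attach to the parent; inorder is a helper for checking that the symmetric order is unchanged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def rotate_right(y: Node) -> Node:
    """Lift y's left child x above y; return x, the new sub-tree root."""
    x = y.left
    assert x is not None, "a right rotation needs a left child"
    y.left = x.right     # x's old right sub-tree holds keys between x and y
    x.right = y
    return x


def rotate_left(x: Node) -> Node:
    """Mirror image: lift x's right child y above x; return y."""
    y = x.right
    assert y is not None, "a left rotation needs a right child"
    x.right = y.left
    y.left = x
    return y


def inorder(node: Optional[Node]) -> list:
    """Keys in symmetric order."""
    if node is None:
        return []
    return inorder(node.left) + [node.key] + inorder(node.right)
```

A rotation changes only a constant number of pointers, and the in-order key sequence is the same before and after.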
A different strategy: splaying in self-adjusting search trees
Splaying
The splay operation moves a specified node to the root via a sequence of rotations.
Amortization
The partitioning of the total cost of a sequence of operations among the individual operations in that sequence.
Thus, an amortized time bound can be viewed as the average cost of the operations in a sequence.
Idea behind self-adjusting trees
Use a particular implementation of the splay operation to move a node accessed by a FIND operation to the root.
How it can benefit us
Nodes that are accessed often enough remain close to the root, so accessing them adds little to the total running time.
An infrequently accessed node adds little to the total running time in any case.
Note
These self-adjusting trees guarantee only amortized logarithmic time per operation.
Advantages and drawbacks of self-adjusting trees
Advantages
They are relatively simple to implement.
They do not require explicit balance information to be stored at the nodes.
Splay trees can be shown to be optimal with respect to arbitrary access frequencies for the items being stored.
Drawbacks
They restructure the entire tree during updates and even during simple search operations.
During any given operation, a splay tree may perform a logarithmic number of rotations.
We do not have the guarantee that every operation will run quickly.
Treaps
Treaps are an efficient randomized alternative to balanced trees and self-adjusting trees.
Treaps achieve essentially the same time bounds in the expected sense, with the following advantages:
1. They do not require any explicit balance information.
2. The expected number of rotations performed is small for each operation.
3. They are extremely simple to implement.
Binary search tree
A (full, endogenous) binary tree whose nodes have key values associated with them is a binary search tree if the key values are in symmetric order.
Heap
If the key values decrease monotonically along any root-leaf path, we call the structure a heap and say that the keys are stored in heap order.
Treap
Consider a binary tree where each node v contains a pair of values: a key k(v) as well as a priority p(v).
We call this structure a treap if it is a binary search tree with respect to the key values and, simultaneously, a heap with respect to the priorities.
Example of treaps
S = {(k1, p1), ..., (kn, pn)}
S = {(2, 13), (4, 26), (6, 19), (7, 30), (9, 14), (11, 27), (12, 22)}
Theorem 8.1
Let S = {(k1, p1), ..., (kn, pn)} be any set of key-priority pairs such that the keys and the priorities are distinct. Then there exists a unique treap T(S) for it.
Proof:
The theorem is obviously true for n = 0 and for n = 1.
Suppose now that n ≥ 2, and assume that (k1, p1) has the highest priority in S. Then a treap for S can be constructed by putting item 1 at the root of T(S).
A treap for the items in S of key value smaller (larger) than k1 can be constructed recursively, and this is stored as the left (right) sub-tree of item 1.
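The recursive construction in the proof translates almost directly into code. The sketch below is illustrative only; TreapNode and build_treap are hypothetical names, and it is applied to the example set S with keys 2, 4, 6, 7, 9, 11, 12 given earlier in the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TreapNode:
    key: int
    priority: int
    left: Optional["TreapNode"] = None
    right: Optional["TreapNode"] = None


def build_treap(pairs):
    """Build T(S) as in the proof: the highest priority goes to the root."""
    if not pairs:
        return None
    key, priority = max(pairs, key=lambda kp: kp[1])
    smaller = [kp for kp in pairs if kp[0] < key]   # becomes the left sub-tree
    larger = [kp for kp in pairs if kp[0] > key]    # becomes the right sub-tree
    return TreapNode(key, priority, build_treap(smaller), build_treap(larger))


S = [(2, 13), (4, 26), (6, 19), (7, 30), (9, 14), (11, 27), (12, 22)]
root = build_treap(S)
```

For this S the root is (7, 30), its children are (4, 26) and (11, 27), and the remaining four items are their children. Since each step is forced by the distinct priorities and keys, no other shape is possible.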
Implementation of operations using treaps
MAKESET(S) and FIND(k, S) are performed exactly as before.
INSERT(k, S):
Do FIND(k, S) and insert k at the empty leaf node where the search terminates with failure.
If the heap order property is violated (parent(k).p < k.p):
Repeat: decrease k's depth by performing a rotation at node w = parent(k), so that k becomes the parent of w,
until k either becomes the root or parent(k).p > k.p.
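A compact sketch of treap insertion, written recursively so that the rotations happen on the way back up from the leaf; this is equivalent to the repeat-until loop above. The names are hypothetical, priorities default to random.random() (an assumption; any source of distinct random priorities would do), and an explicit priority can be passed for testing. The helper shape returns the tree as nested tuples.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TreapNode:
    key: int
    priority: float = field(default_factory=random.random)
    left: Optional["TreapNode"] = None
    right: Optional["TreapNode"] = None


def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    return x


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    return y


def insert(root, key, priority=None):
    """BST-insert key at a leaf, then rotate it up while it beats its parent."""
    if root is None:
        return TreapNode(key) if priority is None else TreapNode(key, priority)
    if key < root.key:
        root.left = insert(root.left, key, priority)
        if root.left.priority > root.priority:     # heap order violated
            root = rotate_right(root)
    elif key > root.key:
        root.right = insert(root.right, key, priority)
        if root.right.priority > root.priority:    # heap order violated
            root = rotate_left(root)
    return root


def shape(node):
    """Nested-tuple view (key, left, right) of the tree, for comparisons."""
    return None if node is None else (node.key, shape(node.left), shape(node.right))
```

Inserting the same key-priority pairs in any order gives the same shape, as Theorem 8.1 requires.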
Implementation of operations using treaps: Add(), example [figure]
Implementation of operations using treaps: Delete(), example
DELETE(k, S): this operation is exactly the reverse of an insertion. Rotate the node containing k downward until both its children are leaves, and then simply discard the node.
Note: The choice of the rotation (left or right) at each stage depends on the relative order of the priorities of the children of the node being deleted.
Delete(), example [figure]
JOIN(S1, k, S2): performed as before; the resulting structure is a treap provided the priority of k is higher than that of any item in S1 or S2. If the new root (containing k) violates the heap order, we simply rotate that node downward until each of its two children has a smaller priority or is a leaf.
PASTE(S1, S2): as in a BST.
SPLIT(k, S):
1. Delete k from S.
2. Insert it back into S with a priority of ∞. The key k then sits at the root, and its left and right sub-trees are S1 and S2.
Left spine of a tree: the path obtained by starting at the root and repeatedly moving to the left child until a leaf is reached.
The right spine is defined similarly.
Mulmuley Games
Mulmuley games are useful abstractions of processes underlying the behavior of certain geometric algorithms.
The cast of characters in these games is:
a set of players P = {P1, ..., Pp}, a set of stoppers S = {S1, ..., Ss}, a set of triggers T = {T1, ..., Tt}, and a set of bystanders B = {B1, ..., Bb}.
The set P ∪ S is drawn from a totally ordered universe.
All players are smaller than all stoppers: for all i and j, Pi < Sj.
Exercise 8.5:
Let $H_k = \sum_{i=1}^{k} 1/i$ denote the kth Harmonic number. Show that:
$\sum_{k=1}^{n} H_k = (n + 1) H_{n+1} - (n + 1)$
Recall that $H_k = \ln k + O(1)$ (Proposition B.4).
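The identity in Exercise 8.5 is easy to check numerically with exact rational arithmetic. The snippet below is a sanity check, not a proof; harmonic and identity_holds are hypothetical helper names.

```python
from fractions import Fraction


def harmonic(k: int) -> Fraction:
    """Exact k-th Harmonic number H_k = 1 + 1/2 + ... + 1/k, with H_0 = 0."""
    return sum((Fraction(1, i) for i in range(1, k + 1)), Fraction(0))


def identity_holds(n: int) -> bool:
    """Check sum_{k=1}^{n} H_k == (n + 1) * H_{n+1} - (n + 1) exactly."""
    lhs = sum((harmonic(k) for k in range(1, n + 1)), Fraction(0))
    rhs = (n + 1) * harmonic(n + 1) - (n + 1)
    return lhs == rhs
```

Using Fraction avoids floating-point rounding, so equality can be tested directly.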
Depending upon the set of active characters, we formulate four different games, with each game being more general than the previous one.
Game A.
The initial set of characters is X = P ∪ B.
The game proceeds by repeatedly sampling from X without replacement, until the set X becomes empty.
Random variable V: the number of samples in which a player Pi is chosen such that Pi is larger than all previously chosen players.
Value of the game: $A_p = E[V]$
Lemma 8.2:
For all p ≥ 0, $A_p = H_p$.
Proof:
Assume that the set of players is ordered as P1 > P2 > ... > Pp.
In Game A, bystanders do not affect the value, so we can set b = 0.
If the first chosen player is Pi, the expected value of the game is $1 + A_{i-1}$, and each Pi is chosen first with probability 1/p. Hence
$A_p = \sum_{i=1}^{p} \frac{1 + A_{i-1}}{p} = 1 + \sum_{i=1}^{p} \frac{A_{i-1}}{p}$
Upon rearrangement, using the fact that $A_0 = 0$,
$\sum_{i=1}^{p-1} A_i = p A_p - p.$
By Exercise 8.5, the Harmonic numbers are the solution to the above equation.
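As a cross-check of the lemma, the recurrence can be unrolled exactly and compared with the Harmonic numbers. This is a small sketch with hypothetical function names; it verifies the claim for small p rather than proving it.

```python
from fractions import Fraction


def harmonic(k):
    """Exact k-th Harmonic number, with H_0 = 0."""
    return sum((Fraction(1, i) for i in range(1, k + 1)), Fraction(0))


def game_a_values(p_max):
    """Return [A_0, ..., A_{p_max}] from A_p = 1 + (1/p) * sum_{i=1}^{p} A_{i-1}."""
    values = [Fraction(0)]                    # A_0 = 0
    for p in range(1, p_max + 1):
        # At this point values holds A_0, ..., A_{p-1}.
        values.append(1 + sum(values) / p)
    return values
```

For instance, the recurrence gives A_1 = 1 and A_2 = 3/2, matching H_1 and H_2.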
Game C.
The initial set of characters is X = P ∪ B ∪ S.
The stoppers are treated as players, but the game stops when a stopper is chosen for the first time.
Value of the game: $C_p^s = E[V + 1] = E[V] + 1$
Note
Since all players are smaller than all stoppers, we will always get a contribution of 1 to the game value from the first stopper.
Lemma 8.3
For all p, s ≥ 0, $C_p^s = 1 + H_{s+p} - H_s$.
Proof
Assume that the set of players is ordered as P1 > P2 > ... > Pp.
As in Game A, bystanders are not considered, so we can set b = 0.
If the first sample is the player Pi, which happens with probability 1/(s + p), the expected game value is $1 + C_{i-1}^s$.
If the first sample is a stopper, which happens with probability s/(s + p), the game value is 1.
Proof of Lemma 8.3 (cont.)
$C_p^s = \left(\frac{s}{s+p} \times 1\right) + \left(\frac{1}{s+p} \times \sum_{i=1}^{p} (1 + C_{i-1}^s)\right).$
Upon rearrangement, using the fact that $C_0^s = 1$, we obtain
$C_p^s = \frac{s+p+1}{s+p} + \sum_{i=1}^{p-1} \frac{C_i^s}{s+p},$
which is equivalent to
$\sum_{i=1}^{p-1} C_i^s = (s + p) C_p^s - (s + p + 1).$
Once again, using Exercise 8.5, it can be verified that the solution to the recurrence is given by $C_p^s = 1 + H_{s+p} - H_s$.
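The first-sample recurrence for Game C can be evaluated exactly and compared with the closed form of Lemma 8.3. The sketch below uses hypothetical names and only checks small parameter values.

```python
from fractions import Fraction


def harmonic(k):
    """Exact k-th Harmonic number, with H_0 = 0."""
    return sum((Fraction(1, i) for i in range(1, k + 1)), Fraction(0))


def game_c_values(s, p_max):
    """Return [C_0^s, ..., C_{p_max}^s] by conditioning on the first sample."""
    values = [Fraction(1)]     # C_0^s = 1: only the first stopper contributes
    for p in range(1, p_max + 1):
        # A stopper comes first with probability s/(s+p); the game value is then 1.
        from_stopper = Fraction(s, s + p)
        # Player P_i comes first with probability 1/(s+p); the value is 1 + C_{i-1}^s.
        from_players = Fraction(1, s + p) * sum(1 + c for c in values)
        values.append(from_stopper + from_players)
    return values
```

With s = 1 and p = 1, for example, the recurrence gives 1/2 · 1 + 1/2 · 2 = 3/2, which equals 1 + H_2 − H_1.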
Games D and E.
Games D and E are similar to Games A and C, respectively, but with triggers added:
in Game D, X = P ∪ B ∪ T, and in Game E, X = P ∪ B ∪ S ∪ T.
The role of the triggers is that the counting process begins only after the first trigger has been chosen.
That is, a player or a stopper contributes to V only if it is sampled after a trigger and before any stopper (and, of course, it is larger than all previously chosen players).
Lemma 8.4: For all p, t ≥ 0, $D_p^t = H_p + H_t - H_{p+t}$.
Lemma 8.5: For all p, s, t ≥ 0, $E_p^{s,t} = \frac{t}{s+t} + (H_{s+p} - H_s) - (H_{s+p+t} - H_{s+t})$.
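For small parameters, both lemmas can be checked by brute force: enumerate every sampling order and average V. The sketch below is one reading of the rules, with stated assumptions: bystanders are dropped, the first stopper itself is counted when a trigger precedes it (as in Game C), and the function name expected_value is hypothetical. With s = 0 it computes the value of Game D.

```python
from fractions import Fraction
from itertools import permutations


def harmonic(k):
    """Exact k-th Harmonic number, with H_0 = 0."""
    return sum((Fraction(1, i) for i in range(1, k + 1)), Fraction(0))


def expected_value(p, s, t):
    """Exact E[V] over all sampling orders of p players, s stoppers, t triggers."""
    items = ([("player", v) for v in range(p)]
             + [("stopper", p + v) for v in range(s)]   # stoppers exceed all players
             + [("trigger", None)] * t)
    total, count = 0, 0
    for order in permutations(items):
        triggered, best, v = False, -1, 0
        for kind, value in order:
            if kind == "trigger":
                triggered = True
                continue
            if value > best:           # larger than everything chosen so far
                best = value
                if triggered:          # counting starts only after a trigger
                    v += 1
            if kind == "stopper":      # the game ends at the first stopper
                break
        total += v
        count += 1
    return Fraction(total, count)
```

The enumeration is factorial in p + s + t, so it is only meant for very small instances, such as expected_value(2, 0, 1), which evaluates to 2/3 = H_2 + H_1 − H_3.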
Analysis of Treaps
Memoryless property
Since the random priorities for the elements of S are chosen independently, we can assume that the priorities are chosen before the insertion process is initiated.
Once the priorities have been fixed, Theorem 8.1 implies that the treap T is uniquely determined.
This implies that the order in which the elements are inserted does not affect the structure of the tree.
Without loss of generality, we can assume that the elements of the set S are inserted into T in order of decreasing priority.
An advantage of this view is that all insertions take place at the leaves, and no rotations are required to ensure the heap order on the priorities.
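Lemma 8.6
Let T be a random treap for a set S of size n. For an element x ∈ S having rank k,
$E[\mathrm{depth}(x)] = H_k + H_{n-k+1} - 1$
Idea of proof
$S^- = \{y \in S \mid y \le x\}$, $S^+ = \{y \in S \mid y \ge x\}$.
Since x has rank k, it follows that $|S^-| = k$ and $|S^+| = n - k + 1$.
$Q_x \subseteq S$: the set of ancestors of x (x is counted as an ancestor of itself).
$Q_x^- = S^- \cap Q_x$, $Q_x^+ = S^+ \cap Q_x$.
We will establish that $E[|Q_x^-|] = H_k$. By symmetry, it follows that $E[|Q_x^+|] = H_{n-k+1}$. Since x itself is counted in both sets, subtracting 1 from the sum gives the lemma.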
Consider any ancestor $y \in Q_x^-$ of the node x.
By the memoryless assumption, y must have been inserted prior to x: $p_y > p_x$.
Since y < x, it must be the case that x lies in the right sub-tree of y.
The search for every element z whose value lies between y and x (y < z < x) must follow the path from the root to y and, in fact, go into the right sub-tree of y.
We conclude that y is an ancestor of every node containing an element of value between y and x.
By our assumption, z must have been inserted after y, and hence is of lower priority than y.
Proof (cont.)
The preceding argument establishes that an element $y \in S^-$ is an ancestor of x, that is, a member of $Q_x^-$, if and only if it was the largest element of $S^-$ in the treap at the time of its insertion.
The order of insertion is determined by the order of the priorities, and the latter is a uniformly random ordering of S.
Thus, the order of insertion can be viewed as being determined by uniform sampling without replacement from the pool S.
We can now claim that the distribution of $|Q_x^-|$ is the same as that of the value of Game A when $P = S^-$ and $B = S \setminus S^-$. Since $|S^-| = k$, the expected size of $Q_x^-$ is $H_k$.
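For small n, the depth formula of Lemma 8.6 can be confirmed by averaging over all n! relative orders of the priorities. In this sketch (hypothetical names), the keys are 1..n, the root has depth 1 so that depth equals the number of ancestors including the node itself, and the treap is built by the recursion of Theorem 8.1.

```python
from fractions import Fraction
from itertools import permutations


def harmonic(k):
    """Exact k-th Harmonic number, with H_0 = 0."""
    return sum((Fraction(1, i) for i in range(1, k + 1)), Fraction(0))


def depths(priorities):
    """Depth of each key 1..n in the treap where key i + 1 has priority priorities[i]."""
    result = {}

    def place(lo, hi, depth):
        # Keys with 0-based indices lo..hi-1 form one sub-tree;
        # the one with the highest priority is its root.
        if lo >= hi:
            return
        root = max(range(lo, hi), key=lambda i: priorities[i])
        result[root + 1] = depth
        place(lo, root, depth + 1)
        place(root + 1, hi, depth + 1)

    place(0, len(priorities), 1)
    return result


def expected_depth(n, k):
    """Average depth of the rank-k key over all n! priority orders."""
    orders = list(permutations(range(n)))
    return Fraction(sum(depths(order)[k] for order in orders), len(orders))
```

For n = 2 and k = 1 the key is the root in one of the two orders and at depth 2 in the other, giving 3/2 = H_1 + H_2 − 1.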
For any element x in a treap,
Lx: length of the left spine of the right sub-tree of x.
Rx: length of the right spine of the left sub-tree of x.
Lemma 8.7
Let T be a random treap for a set S of size n. For an element x ∈ S of rank k,
$E[R_x] = 1 - \frac{1}{k}$, $E[L_x] = 1 - \frac{1}{n-k+1}$
Proof: (1) an element z < x lies on the right spine of the left sub-tree of x if and only if (2) z is inserted after x, and all elements y whose values lie between z and x (z < y < x) are inserted after z.
Proof (2) ⇒ (1):
z is inserted after x, and all elements y whose values lie between z and x (z < y < x) are inserted after z ⇒ z lies on the right spine of the left sub-tree of x.
a. If x is an ancestor of z: if z does not lie on the right spine of the left sub-tree of x, then some ancestor u (or v) of z satisfies z < u < x (or z < v < x), and since u (or v) is an ancestor of z, it was inserted before z (contradiction).
b. If x is not an ancestor of z: let w be the lowest common ancestor of z and x. We see that z < w < x, and since w is an ancestor of z, it must have been inserted before z (contradiction).
Proof (1) ⇒ (2):
An element z < x lies on the right spine of the left sub-tree of x ⇒ z is inserted after x, and all elements y whose values lie between z and x (z < y < x) are inserted after z.
Since x is an ancestor of z, it must have been inserted before z. Also, since every element y with z < y < x must lie in the right sub-tree of z, it is inserted after z.
Search in Skip List
We search for a key x in a skip list as follows:
We start at the first position of the top list.
At the current position p, we compare x with y ← key(next(p)):
x = y: we return element(next(p))
x > y: we scan forward
x < y: we drop down
Example: search for 78
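The search rule above can be sketched as follows. Everything here is an illustrative assumption rather than the slides' implementation: the class and field names are hypothetical, each list starts with a sentinel of key −∞, keys are assumed distinct, a missing successor is treated like "x < y", and an insert routine with fair coin flips is included only so that the example is self-contained.

```python
import random
from dataclasses import dataclass
from typing import Any, Optional

NEG_INF = float("-inf")


@dataclass
class Node:
    key: float
    element: Any = None
    next: Optional["Node"] = None    # successor in the same list
    down: Optional["Node"] = None    # same key, one level lower


class SkipList:
    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.top = Node(NEG_INF)             # first position of the top list

    def search(self, x):
        p = self.top
        while p is not None:
            nxt = p.next
            if nxt is not None and nxt.key == x:
                return nxt.element           # x = y: found
            if nxt is not None and x > nxt.key:
                p = nxt                      # x > y: scan forward
            else:
                p = p.down                   # x < y (or end of list): drop down
        return None                          # dropped below the bottom list

    def insert(self, x, element):
        # Record, from top to bottom, the last position before x in each list.
        path, p = [], self.top
        while p is not None:
            while p.next is not None and p.next.key < x:
                p = p.next
            path.append(p)
            p = p.down
        below = None
        while True:
            if path:
                pred = path.pop()            # next level up, starting at the bottom
            else:                            # grow a new top list
                self.top = Node(NEG_INF, down=self.top)
                pred = self.top
            node = Node(x, element, pred.next, below)
            pred.next = node
            below = node
            if self.rng.random() >= 0.5:     # promote with probability 1/2
                break
```

A possible use: create SkipList(seed=7), insert a few keys including 78, and call search(78); the search returns the stored element, and a key that was never inserted yields None.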
Tree representation of a skip list
Analyzing Random Skip Lists
A random leveling of the set S is defined as follows:
Given the choice of level Li, the level Li+1 is defined by independently choosing to retain each element x ∈ Li with probability 1/2.
The process starts with L1 = S and terminates when a newly constructed level is empty.
Alternate view:
Let the levels l(x) for x ∈ S be independent random variables, each with the geometric distribution with parameter p = 1/2.
Let r = max_{x∈S} l(x) + 1.
Place x in each of the levels L1, ..., Ll(x).
As with random treaps, a random level is chosen for every element of S upon its insertion and remains fixed until the element is deleted.
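Lemma 8.9
The number of levels r in a random leveling of a set S of size n has expected value E[r] = O(log n). Moreover, r = O(log n) with high probability.
Proof:
$r = \max_{x \in S} l(x) + 1$.
The levels l(x) are i.i.d. random variables distributed geometrically with parameter 1/2.
For n i.i.d. geometric random variables $X_1, \ldots, X_n$ with parameter p,
$\Pr[\max_i X_i > t] \le n (1 - p)^t = \frac{n}{2^t}$ when p = 1/2.
Choosing $t = \alpha \log n$ (logarithms to base 2) and identifying r with $\max_i X_i$ (the additive 1 changes the bound only by a constant factor), we have
$\Pr[r > \alpha \log n] \le \frac{1}{n^{\alpha - 1}}$
for any α > 1.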
Lemma 8.10
Define Ij(y) as the interval at level j that contains y.
For an interval I at level i + 1, c(I) denotes the number of children it has at level i.
Hash Tables
1. Static dictionary: we are given a set of keys S and must organize it into a data structure that supports the efficient processing of FIND queries.
2. Dynamic dictionary: the set S is not provided in advance. Instead, it is constructed by a series of INSERT and DELETE operations that are intermingled with the FIND queries.
Data structuring problem
All data structures discussed earlier require Ω(log n) time to process any search or update operation.
These time bounds are optimal:
for data structures based on pointers and search trees, we are faced with a logarithmic lower bound.
These lower bounds rest on the fact that the only computation we can perform over the keys is to compare them and thereby determine their relationship in the underlying total order.
Hash Tables
Suppose:
the keys in S are chosen from a totally ordered universe M of size m; w.l.o.g., M = {0, ..., m − 1};
the keys are distinct.
The idea:
Create an array T[0..m − 1] of size m in which
T[k] = 1 if k ∈ S,
T[k] = NULL otherwise.
This is called a direct-address table.
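A direct-address table is short enough to write out in full. The class below is a minimal sketch with hypothetical names; Python's None plays the role of NULL.

```python
class DirectAddressTable:
    """Set of keys from M = {0, ..., m - 1}, with one array cell per possible key."""

    def __init__(self, m: int):
        self.cells = [None] * m       # initialisation costs time and space linear in m

    def insert(self, k: int) -> None:
        self.cells[k] = 1

    def delete(self, k: int) -> None:
        self.cells[k] = None

    def find(self, k: int) -> bool:
        return self.cells[k] is not None
```

Every operation is a single array access; the only cost that grows is the size m of the array itself.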
Operations take O(1) time.
So what's the problem?
Direct addressing works well when the range m of keys is relatively small.
But what if the keys are 32-bit integers?
Problem 1: the direct-address table will have 2^32 entries, more than 4 billion.
Problem 2: even if memory is not an issue, the time to initialize the elements to NULL may be.
We want to reduce the size of the table to a value close to |S|, while maintaining the property that a search or update can be performed in O(1) time.
A table T consisting of n cells indexed by N = {0, ..., n − 1}.
A hash function h, which is a mapping from M into N.
n < m; otherwise, use a direct-address table.
A collision occurs when two distinct keys x and y map to the same location, i.e. h(x) = h(y).
Goal: maintain a small table, and use the hash function h to map keys into this table. If h behaves randomly, we should not get too many collisions.
Hash Tables Chaining
Chaining puts elements that collide in a linked list:
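A minimal sketch of chaining, under a few assumptions: the class name is hypothetical, the hash function h is supplied by the caller, and Python lists stand in for the linked lists.

```python
class ChainedHashTable:
    def __init__(self, n, h):
        self.h = h                                # hash function into {0, ..., n - 1}
        self.chains = [[] for _ in range(n)]      # one chain per table cell

    def find(self, key):
        return key in self.chains[self.h(key)]   # scan only the chain of h(key)

    def insert(self, key):
        if not self.find(key):
            self.chains[self.h(key)].append(key)

    def delete(self, key):
        chain = self.chains[self.h(key)]
        if key in chain:
            chain.remove(key)
```

For example, with n = 4 and h(k) = k mod 4, the keys 1, 5 and 9 all land in the chain of cell 1, and the cost of an operation on any of them is proportional to the length of that chain.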
Universal Hash Families
2-universal
Let M = {0, ..., m − 1} and N = {0, ..., n − 1}, with m ≥ n.
A family H of functions from M into N is said to be 2-universal if, for all x, y ∈ M such that x ≠ y, and for h chosen uniformly at random from H,
$\Pr[h(x) = h(y)] \le \frac{1}{n}$
Define the following indicator function for a collision between the keys x and y under the hash function h:
$\delta(x, y, h) = \begin{cases} 1 & \text{if } h(x) = h(y) \text{ and } x \ne y \\ 0 & \text{otherwise} \end{cases}$
For all X, Y ⊆ M, define the following extensions of the indicator function δ:
$\delta(x, y, H) = \sum_{h \in H} \delta(x, y, h)$,
$\delta(x, Y, h) = \sum_{y \in Y} \delta(x, y, h)$,
$\delta(X, Y, h) = \sum_{x \in X} \delta(x, Y, h)$,
$\delta(x, Y, H) = \sum_{y \in Y} \delta(x, y, H)$,
$\delta(X, Y, H) = \sum_{h \in H} \delta(X, Y, h)$.
Note
For a 2-universal family H and any x ≠ y, we have $\delta(x, y, H) \le |H|/n$.
Theorem 8.12:
For any family H of functions from M to N, there exist x, y ∈ M such that
$\delta(x, y, H) > \frac{|H|}{n} - \frac{|H|}{m}$
Proof of Theorem 8.12
Fix some function h ∈ H, and for each z ∈ N define the set of elements of M mapped to z as
$A_z = \{x \in M \mid h(x) = z\}$
The sets $A_z$, for z ∈ N, form a partition of M. It is easy to verify that
$\delta(A_w, A_z, h) = \begin{cases} 0 & w \ne z \\ |A_z|(|A_z| - 1) & w = z \end{cases}$
The total number of collisions between all possible pairs of elements is minimized when these sets $A_z$ are all of the same size. We obtain
$\delta(M, M, h) = \sum_{z \in N} |A_z|(|A_z| - 1) \ge n \cdot \frac{m}{n}\left(\frac{m}{n} - 1\right) = m^2 \left(\frac{1}{n} - \frac{1}{m}\right)$
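Proof (cont.)
$\delta(M, M, H) = \sum_{h \in H} \delta(M, M, h) \ge |H| m^2 \left(\frac{1}{n} - \frac{1}{m}\right).$
By the pigeonhole principle, there exist x, y ∈ M such that
$\delta(x, y, H) \ge \frac{\delta(M, M, H)}{m^2} \ge \frac{|H| m^2 \left(\frac{1}{n} - \frac{1}{m}\right)}{m^2} = |H| \left(\frac{1}{n} - \frac{1}{m}\right)$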
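Lemma 8.13:
For all x ∈ M, S ⊆ M, and h chosen uniformly at random from a 2-universal family H,
$E[\delta(x, S, h)] \le \frac{|S|}{n}$
Proof:
$E[\delta(x, S, h)] = \sum_{h \in H} \frac{\delta(x, S, h)}{|H|}$
$= \frac{1}{|H|} \sum_{h \in H} \sum_{y \in S} \delta(x, y, h)$
$= \frac{1}{|H|} \sum_{y \in S} \sum_{h \in H} \delta(x, y, h)$
$= \frac{1}{|H|} \sum_{y \in S} \delta(x, y, H)$
$\le \frac{1}{|H|} \sum_{y \in S} \frac{|H|}{n}$
$= \frac{|S|}{n}.$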
In our dynamic dictionary scheme:
Notes
A hash function h ∈ H is chosen uniformly at random and remains fixed during the entire sequence of updates and queries.
An inserted key x is stored at the location h(x), and due to collisions there could be other keys also stored at that location.
The keys colliding at a given location are organized into a linked list.
Assuming that the set of keys currently stored in the table is S ⊆ M, the length of the linked list examined for a key x is δ(x, S, h), which has expectation at most |S|/n.
Theorem 8.14:
Consider a request sequence R = R1, R2, ..., Rr of update and search operations starting with an empty hash table.
Suppose that this sequence contains s INSERT operations.
Let ρ(h, R) denote the total cost of processing these requests using the hash function h ∈ H.
For any sequence R of length r with s INSERTs, and h chosen uniformly at random from a 2-universal family H,
$E[\rho(h, R)] \le r \left(1 + \frac{s}{n}\right)$
Constructing Universal Hash Families
Fix m and n, and choose a prime p ≥ m.
We will work over the field Z_p = {0, 1, ..., p − 1}.
Let g: Z_p → N be the function given by g(x) = x mod n.
For all a, b ∈ Z_p, define the linear function f_{a,b}: Z_p → Z_p and the hash function h_{a,b}: Z_p → N as follows:
f_{a,b}(x) = (ax + b) mod p,
h_{a,b}(x) = g(f_{a,b}(x)) = ((ax + b) mod p) mod n.
We define the family of hash functions H = {h_{a,b} | a, b ∈ Z_p with a ≠ 0}.
Lemma 8.15
For all x, y ∈ Z_p such that x ≠ y,
δ(x, y, H) = δ(Z_p, Z_p, g).
Proof
Suppose that x and y collide under a specific function h_{a,b}. Let f_{a,b}(x) = r and f_{a,b}(y) = s.
Observe that r ≠ s since a ≠ 0 and x ≠ y.
A collision takes place if and only if g(r) = g(s), or equivalently, r ≡ s (mod n).
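Now, having fixed x and y, for each such choice of r ≠ s, the values of a and b are uniquely determined as the solution of:
ax + b ≡ r (mod p)
ay + b ≡ s (mod p)
Hence the number of functions h_{a,b} under which x and y collide equals the number of pairs r ≠ s with g(r) = g(s), which is δ(Z_p, Z_p, g).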
Theorem 8.16:
The family H = {h_{a,b} | a, b ∈ Z_p with a ≠ 0} is a 2-universal family.
Proof:
For each z ∈ N, let A_z = {x ∈ Z_p | g(x) = z}; it is clear that |A_z| ≤ ⌈p/n⌉. In other words, for every r ∈ Z_p there are at most ⌈p/n⌉ different choices of s ∈ Z_p such that g(r) = g(s). Since there are p different choices of r ∈ Z_p to start with,
$\delta(Z_p, Z_p, g) \le p \left(\left\lceil \frac{p}{n} \right\rceil - 1\right) \le \frac{p(p - 1)}{n}$
By Lemma 8.15, δ(x, y, H) = δ(Z_p, Z_p, g), so δ(x, y, H) ≤ p(p − 1)/n. Since |H| = p(p − 1), it follows that δ(x, y, H) ≤ |H|/n.
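For a small prime p, both Lemma 8.15 and the 2-universality bound can be checked exhaustively by enumerating every pair (a, b). The helper names below are hypothetical, and the check is a sanity test for specific small values of p and n, not a substitute for the proof.

```python
def make_hash(a, b, p, n):
    """h_{a,b}(x) = ((a*x + b) mod p) mod n."""
    return lambda x: ((a * x + b) % p) % n


def collisions_over_family(x, y, p, n):
    """delta(x, y, H): number of h_{a,b} with a != 0 that send x and y to the same cell."""
    count = 0
    for a in range(1, p):
        for b in range(p):
            h = make_hash(a, b, p, n)
            if h(x) == h(y):
                count += 1
    return count


def collisions_under_g(p, n):
    """delta(Z_p, Z_p, g): ordered pairs r != s in Z_p with r mod n == s mod n."""
    return sum(1 for r in range(p) for s in range(p) if r != s and r % n == s % n)
```

With p = 13 and n = 4, the residue classes have sizes 4, 3, 3, 3, so δ(Z_p, Z_p, g) = 12 + 6 + 6 + 6 = 30, which is below |H|/n = 156/4 = 39.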
Definition 8.6
Let M = {0, 1, ..., m − 1} and N = {0, 1, ..., n − 1}, with m ≥ n. A family H of functions from M into N is said to be strongly 2-universal if, for all x1 ≠ x2 ∈ M, any y1, y2 ∈ N, and h chosen uniformly at random from H,
$\Pr[h(x_1) = y_1 \text{ and } h(x_2) = y_2] = \frac{1}{n^2}$
Definition 8.7
A family of hash functions H = {h : M → N} is said to be a perfect hash family if for each set S ⊂ M of size s < n there exists a hash function h ∈ H that is perfect for S, that is, one that causes no collisions among the keys of S.
Note:
It is clear that perfect hash families exist: for example, the family of all possible functions from M to N is a perfect hash family.
Given a perfect hash family H, we solve the static dictionary problem by:
1. finding h ∈ H perfect for S;
2. storing each key x ∈ S at the location T[h(x)];
3. responding to a search query for a key q by examining the contents of T[h(q)].
The preprocessing cost:
depends on the cost of identifying a perfect hash function for a specificchoice of S.
search cost:
depends on the time required to evaluate the hash function.
Since the choice of the hash function will depend on the set S, its description must also be stored in the table.
Suppose that the size of the perfect hash family H is r.
Storing the description of a hash function from H will require Ω(log r) bits.
It is essential that the description of the hash function fit into O(1) locations in the table T.
A cell in the table can be used to encode at most log m bits of information.
Note
Therefore, we will only be interested in constructing hash families whose size r is bounded by a polynomial in m.
Exercise 8.13:
Assume for simplicity that n = s. Show that for m = 2^{Ω(s)}, there exist perfect hash families of size polynomial in m.
Thus, the existence of a perfect hash family of polynomial size is guaranteed only for values of m that are extremely large relative to n.
Exercise 8.14:
Assuming that n = s, show that any perfect hash family must have size 2^{Ω(s)}.
Thus, we need to have m = 2^{Ω(s)}, or s = O(log m), to guarantee even the existence of a perfect hash family of size polynomial in m.
Unfortunately, in practice the case s = O(log m) is not very interesting for typical values of m, e.g., for m = 2^32.
Solution: use double hashing.
Definition 8.8
Let S ⊂ M and h: M → N. For each table location 0 ≤ i ≤ n − 1, we define the bin
B_i(h, S) = {x ∈ S | h(x) = i}.
The size of a bin is denoted by b_i(h, S) = |B_i(h, S)|.
Definition 8.9:
A hash function h is b-perfect for S if b_i(h, S) ≤ b for each i. A family H of hash functions h: M → N is said to be a b-perfect hash family if for each S ⊂ M of size s there exists a hash function h ∈ H that is b-perfect for S.
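Definitions 8.8 and 8.9 translate directly into a small check. The sketch below assumes integer keys; the hash function x mod 3 and the helper names `bin_sizes` and `is_b_perfect` are made up for the example.

```python
from collections import Counter
from typing import Callable, List


def bin_sizes(h: Callable[[int], int], keys: List[int], n: int) -> List[int]:
    """Return [b_0, ..., b_{n-1}], where b_i = |{x in keys : h(x) = i}|."""
    counts = Counter(h(x) for x in keys)
    return [counts.get(i, 0) for i in range(n)]


def is_b_perfect(h: Callable[[int], int], keys: List[int], n: int, b: int) -> bool:
    """h is b-perfect for keys if no bin holds more than b keys."""
    return max(bin_sizes(h, keys, n), default=0) <= b


# Hypothetical example: table of size n = 3 and h(x) = x mod 3.
S = [1, 4, 6, 9, 12]
h = lambda x: x % 3
sizes = bin_sizes(h, S, 3)
print(sizes, is_b_perfect(h, S, 3, 3), is_b_perfect(h, S, 3, 2))
```

With b = 1 the same check tests whether h is perfect for S in the earlier sense.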
Exercise 8.15:
Show that there exists a b-perfect hash family H such that b = O(log n) and |H| ≤ m, for any m ≥ n.
Double hashing:
At the first level we use a (log m)-perfect hash function h to map S into the primary table T.
Consider the bin B_i consisting of all keys from S mapped into a particular cell T[i].
The elements of the bin B_i are mapped into the secondary table T_i associated with that location using a secondary hash function h_i.
Since the size of B_i is bounded by b, we can find a hash function h_i that is perfect for B_i provided 2^b is polynomially bounded in m. For b = O(log m) this condition holds.
The double hashing scheme can be implemented with O(1) query time, for any m ≥ n.
The goal of the primary hash function should be to create bins small enough that some perfect hash functions can be used as the secondary hash functions.
Exercise 8.16:
Consider a table of size r indexed by R = {0, ..., r − 1}. Show that there exists a perfect hash family H = {h: M → R} with |H| ≤ m provided that r = Ω(s^2), for all m ≥ s.
Towards our final solution
We will use a primary table of size n = s, choosing a primary hash function that ensures that the bin sizes are small.
The perfect hash functions from Exercise 8.16 are then used to resolve the collisions by using secondary hash tables of size quadratic in the bin sizes.
Total space required by the double hashing scheme:
$s + O\left(\sum_{i=0}^{s-1} b_i^2\right)$
Achieving Bounded Query Time
Our goal now is:
1 to find primary hash functions which ensure that the sum of the squares of the bin sizes is linear.
2 to find perfect hash functions for the secondary tables, which use at most quadratic space.
Definition 8.10:
Consider any V ⊆ M with |V| = v, and let R = {0, ..., r − 1} with r ≥ v. For 1 ≤ k ≤ p − 1, define the function h_k: M → R as follows:
h_k(x) = (kx mod p) mod r.
For each i ∈ R, the bin of keys colliding at i is denoted by
B_i(k, r, V) = {x ∈ V | h_k(x) = i}
and its size is denoted by b_i(k, r, V) = |B_i(k, r, V)|.
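A short sketch of Definition 8.10 shows how the bins change with the multiplier k. The prime p = 17, the set V, and the function names are assumptions for illustration; the code only requires that p be a prime larger than every key.

```python
from typing import List


def h(k: int, x: int, p: int, r: int) -> int:
    """h_k(x) = (k * x mod p) mod r."""
    return ((k * x) % p) % r


def bins(k: int, r: int, V: List[int], p: int) -> List[List[int]]:
    """Return [B_0, ..., B_{r-1}], where B_i = {x in V : h_k(x) = i}."""
    out: List[List[int]] = [[] for _ in range(r)]
    for x in V:
        out[h(k, x, p, r)].append(x)
    return out


p = 17                  # assumed prime, larger than every key in V
V = [2, 5, 9, 14]
r = 4

for k in (1, 2):
    B = bins(k, r, V, p)
    print(k, B, [len(b) for b in B])
```

For this V, k = 1 leaves two bins of size 2, while k = 2 spreads the four keys over four distinct cells.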
Lemma 8.17:
For all V ⊆ M of size v, and all r ≥ v,
$\sum_{k=1}^{p-1} \sum_{i=0}^{r-1} \binom{b_i(k, r, V)}{2} < \frac{(p-1)v^2}{r} = \frac{mv^2}{r}.$
Proof: The left-hand side of (8.2), the inequality of the lemma, counts the number of tuples (k, x, y) such that h_k causes x and y to collide, i.e.,
1 x, y ∈ V with x ≠ y, and
2 ((kx mod p) mod r) = ((ky mod p) mod r).
The relation between k and x, y is as follows:
k(x − y) mod p ∈ {±r, ±2r, ±3r, ..., ±⌊(p − 1)/r⌋r}
Proof (cont.)
Since p is a prime and Z_p is a field, for any fixed value of x − y there is a unique solution for k satisfying the equation
k(x − y) mod p = jr
for any value of j. This immediately implies that the number of values of k that cause a collision between x and y is at most 2(p − 1)/r.
Finally, noting that the number of choices of the pair x, y is $\binom{v}{2}$, we obtain
$\sum_{k=1}^{p-1} \sum_{i=0}^{r-1} \binom{b_i(k, r, V)}{2} \le \binom{v}{2} \frac{2(p-1)}{r} < \frac{(p-1)v^2}{r}.$
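The bound of Lemma 8.17 can be checked numerically on a small instance by summing the colliding pairs over every multiplier k. The values p = 17, V, and r below are assumptions chosen for the example, and `colliding_pairs_total` is a hypothetical helper name.

```python
from math import comb
from typing import List


def colliding_pairs_total(V: List[int], r: int, p: int) -> int:
    """Sum over k = 1..p-1 and over all bins i of C(b_i(k, r, V), 2)."""
    total = 0
    for k in range(1, p):
        sizes = [0] * r
        for x in V:
            sizes[((k * x) % p) % r] += 1
        total += sum(comb(b, 2) for b in sizes)
    return total


p = 17                  # assumed prime, larger than every key in V
V = [2, 5, 9, 14]
r = 4                   # r >= v = 4, as the lemma requires

lhs = colliding_pairs_total(V, r, p)
rhs = (p - 1) * len(V) ** 2 / r        # (p - 1) v^2 / r
intermediate = comb(len(V), 2) * 2 * (p - 1) / r
print(lhs, intermediate, rhs)
```

The three printed numbers appear in the same order as the chain of inequalities that closes the proof.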
Corollary 8.18
For all V ⊆ M of size v, and all r ≥ v, there exists k ∈ {1, ..., m} such that
$\sum_{i=0}^{r-1} \binom{b_i(k, r, V)}{2} < \frac{v^2}{r}.$
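The slides state the corollary without proof. One way to see it, presumably the intended one, is an averaging argument over the p − 1 = m multipliers of Lemma 8.17, since the smallest of the p − 1 inner sums cannot exceed their average:

```latex
\min_{1 \le k \le p-1} \sum_{i=0}^{r-1} \binom{b_i(k,r,V)}{2}
\;\le\; \frac{1}{p-1} \sum_{k=1}^{p-1} \sum_{i=0}^{r-1} \binom{b_i(k,r,V)}{2}
\;<\; \frac{1}{p-1} \cdot \frac{(p-1)\,v^2}{r}
\;=\; \frac{v^2}{r}.
```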
Theorem 8.19
For any S ⊆ M with |S| = s and m ≥ s, there exists a hash table representation of S that uses space O(s) and permits the processing of a FIND operation in O(1) time.
Proof: The double hashing scheme is as described above, and all that remains to be shown is that there are choices of the primary hash function h_k and the secondary hash functions h_{k_1}, ..., h_{k_s} that ensure the promised performance bounds.
Proof (cont.)
Consider first the primary hash function h_k. The only property desired of this function is that the sum of squares of the sizes of the colliding sets (the bins) be linear in n, to ensure that the space used by the secondary hash tables is O(s).
Applying Corollary 8.18 to the case where V = S and R = T, implying that v = r = s, we obtain that there exists a k ∈ {1, ..., m} such that
$\sum_{i=0}^{s-1} \binom{b_i(k, s, S)}{2} < s,$
or that $\sum_{i=0}^{s-1} b_i(k, s, S)\,[b_i(k, s, S) - 1] < 2s$.
Since $\bigcup_{i=0}^{s-1} B_i(k, s, S) = S$ and $\sum_{i=0}^{s-1} b_i(k, s, S) = s$,
$\sum_{i=0}^{s-1} b_i(k, s, S)^2 < 2s + \sum_{i=0}^{s-1} b_i(k, s, S) = 3s.$
Consider now the secondary hash function h_{k_i} for the set S_i = B_i(k, s, S) of size s_i. Applying Corollary 8.18 to the case where V = S_i (or v = s_i) and using a secondary hash table of size r = s_i^2, it follows that there exists a k_i ∈ {1, ..., m} such that
$\sum_{j=0}^{s_i^2 - 1} \binom{b_j(k_i, s_i^2, S_i)}{2} < 1,$
where b_j(k_i, s_i^2, S_i) is the number of keys colliding at the jth location of the secondary hash table for T[i]. This can be the case only when each term of the summation is zero, implying that b_j(k_i, s_i^2, S_i) ≤ 1 for all j. Thus, it follows that there exists a perfect secondary hash function h_{k_i}.
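The proof is constructive enough to sketch the whole two-level table in Python. Several things here are assumptions rather than part of the slides: S is non-empty with distinct keys, p is a prime larger than every key, each multiplier is found by plain exhaustive search over 1..p−1, and the class and method names are hypothetical.

```python
from math import comb
from typing import List, Optional, Tuple


def h(k: int, x: int, p: int, r: int) -> int:
    """h_k(x) = (k * x mod p) mod r."""
    return ((k * x) % p) % r


def bin_lists(k: int, r: int, V: List[int], p: int) -> List[List[int]]:
    """Partition V into the r bins induced by h_k."""
    out: List[List[int]] = [[] for _ in range(r)]
    for x in V:
        out[h(k, x, p, r)].append(x)
    return out


def find_k(V: List[int], r: int, p: int, bound: int) -> int:
    """First k in 1..p-1 whose bins satisfy sum_i C(b_i, 2) < bound."""
    for k in range(1, p):
        if sum(comb(len(b), 2) for b in bin_lists(k, r, V, p)) < bound:
            return k
    raise ValueError("no suitable k; is p a prime larger than every key?")


class TwoLevelTable:
    def __init__(self, S: List[int], p: int) -> None:
        s = len(S)                          # assumed s >= 1
        self.p, self.s = p, s
        # Primary level: r = s and sum_i C(b_i, 2) < s, hence sum_i b_i^2 < 3s.
        self.k = find_k(S, s, p, bound=s)
        self.secondary: List[Tuple[int, List[Optional[int]]]] = []
        for B in bin_lists(self.k, s, S, p):
            size = len(B) ** 2              # secondary table of size s_i^2
            if size == 0:
                self.secondary.append((0, []))
                continue
            ki = find_k(B, size, p, bound=1)    # fewer than 1 colliding pair
            cells: List[Optional[int]] = [None] * size
            for x in B:
                cells[h(ki, x, p, size)] = x
            self.secondary.append((ki, cells))

    def find(self, q: int) -> bool:
        """Two hash evaluations and one probe, independent of s."""
        ki, cells = self.secondary[h(self.k, q, self.p, self.s)]
        return bool(cells) and cells[h(ki, q, self.p, len(cells))] == q


p = 101                                     # assumed prime, larger than every key
S = [3, 14, 15, 26, 53, 58, 97]
t = TwoLevelTable(S, p)
space = sum(len(cells) for _, cells in t.secondary)
print(t.find(26), t.find(27), space)
```

The exhaustive search in `find_k` stands in for whatever preprocessing one prefers; Corollary 8.18 only guarantees that a suitable multiplier exists, so the loop terminates under the stated assumptions.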