Data Structure - Yazdcs.yazd.ac.ir/farshi/Teaching/RandAlg3931/Slides/ch8_data... · 2015. 1....

Data Structure

Mohsen Arab

Yazd University

January 13, 2015


Table of Contents

Binary Search Tree

Treaps

Skip Lists

Hash Tables


Fundamental Data-structuring Problem

Fundamental data-structuring problem: maintain a collection S1, S2, ... of sets of items so as to efficiently support certain types of queries and operations:

MAKESET(S): create a new (empty) set S.

INSERT(i, S): insert item i into the set S.

DELETE(k, S): delete the item indexed by the key value k from the set S.

FIND(k, S): return the item indexed by the key value k in the set S.

JOIN(S1, i, S2): replace the sets S1 and S2 by the new set S = S1 ∪ {i} ∪ S2, where

1 for all items j ∈ S1, k(j) < k(i),
2 for all items j ∈ S2, k(j) > k(i).


Fundamental Data-structuring Problem (cont.)

PASTE(S1, S2): replace the sets S1 and S2 by the new set S = S1 ∪ S2, where for all items i ∈ S1 and j ∈ S2, k(i) < k(j).

SPLIT(k, S): replace the set S by the new sets S1 and S2, where

S1 = {j ∈ S | k(j) < k}
S2 = {j ∈ S | k(j) > k}


binary search tree

binary search tree: binary tree in which keys satisfy search tree property.

Definition

Search tree property: for all nodes with key value k, the left sub-tree contains only key values smaller than k and the right sub-tree contains only key values larger than k.

The key values in a binary tree are in symmetric order if they satisfy the search tree property.


We will assume that BSTs are endogenous.

Definition

Endogenous: all key values are stored at internal nodes, and all leaf nodes are empty.

This will ensure that the trees are full, which means that every non-leaf (internal) node has exactly two children.


standard implementations of operations

MakeSet(S): initialize an empty tree for the set S.

JOIN(S1, k, S2): create a node containing key k as the root, and make S1 and S2 its left and right sub-trees, respectively.


search

Example: FIND(4, S)


Insert

Perform FIND(k, S), then insert k where the search fails (into the empty leaf node).


Implementations of operations (Delete)

DELETE(k, S):

1 If the node v containing k has a leaf as one of its two children: for example, if the right child of v is a leaf, then replace v by its left child L(v) as the child of its parent P(v).


Implementations of operations (Delete, cont.)

2 If neither of the children is a leaf:

let k′ be the key value that is the predecessor of k in the set S.

Now we can delete the node containing k′ as in case 1, since its right child is a leaf, and replace the key value k by k′ in the node v.
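The FIND, INSERT and DELETE descriptions above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the names Node, find, insert, delete and inorder are illustrative and not from the slides, and None stands in for the empty leaf nodes of an endogenous tree.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None    # None plays the role of an empty leaf
        self.right = None


def find(root, k):
    # Walk down from the root, going left or right by comparison.
    while root is not None and root.key != k:
        root = root.left if k < root.key else root.right
    return root


def insert(root, k):
    # Insert k at the empty leaf where the search for k fails.
    if root is None:
        return Node(k)
    if k < root.key:
        root.left = insert(root.left, k)
    elif k > root.key:
        root.right = insert(root.right, k)
    return root


def delete(root, k):
    if root is None:
        return None
    if k < root.key:
        root.left = delete(root.left, k)
    elif k > root.key:
        root.right = delete(root.right, k)
    elif root.right is None:        # case 1: the right child is a leaf
        return root.left
    elif root.left is None:         # case 1: the left child is a leaf
        return root.right
    else:                           # case 2: use the predecessor k'
        pred = root.left
        while pred.right is not None:
            pred = pred.right
        root.key = pred.key
        root.left = delete(root.left, pred.key)
    return root


def inorder(root):
    if root is None:
        return []
    return inorder(root.left) + [root.key] + inorder(root.right)


root = None
for key in [6, 2, 9, 4, 7, 11, 12]:
    root = insert(root, key)
root = delete(root, 6)   # 6 has two children, so its predecessor 4 replaces it
```

Every operation follows a single root-to-leaf path, which is why the cost is proportional to the height of the tree.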


Implementations of operations (cont.)

PASTE(S1, S2):

1 delete the largest key value, say k, from S1.
2 apply JOIN(S1, k, S2).

Note

k can be found by doing a FIND(∞, S1).

SPLIT(k, S):

If k is at the root of S, do the reverse of the steps employed in JOIN(S1, k, S2); otherwise, first use rotations to move k to the root.


Problem:

Each operation can be performed in time proportional to the height of the tree. There is a sequence of INSERT operations that results in a tree of height linear in n.

Solution:

Perform rotations during update operations to ensure that all leaves are at distance O(log n) from the root.


Rotations

Each type of rotation moves a node together with one of its sub-trees closer to the root (and some others away from the root), while preserving the search tree property.
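The two rotations can be sketched as below. This is an illustrative sketch, not the slides' own code: the names Node, rotate_right, rotate_left and inorder are assumptions, and each function returns the new root of the rotated sub-tree so that the caller can re-attach it.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right


def rotate_right(v):
    # The left child u of v moves up; v becomes the right child of u.
    u = v.left
    v.left = u.right    # the middle sub-tree changes parent
    u.right = v
    return u


def rotate_left(u):
    # The right child v of u moves up; u becomes the left child of v.
    v = u.right
    u.right = v.left
    v.left = u
    return v


def inorder(t):
    return [] if t is None else inorder(t.left) + [t.key] + inorder(t.right)


tree = Node(4, Node(2, Node(1), Node(3)), Node(5))
rotated = rotate_right(tree)   # node 2 moves to the root, node 4 moves down
```

The symmetric order of the keys is the same before and after the rotation, which is exactly the search tree property being preserved.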


A different strategy: Splaying in self-adjusting search tree

Splaying

The splay operation moves a specified node to the root via a sequence of rotations.

Amortization

Partitioning of the total cost of a sequence of operations among the individual operations in that sequence.

Thus, an amortized time bound can be viewed as the average cost of the operations in a sequence.


idea behind self-adjusting trees

to use a particular implementation of the splay operation to move to the root a node accessed by a FIND operation

How it can benefit us

Nodes that are accessed often enough remain close to the root, so accessing them adds little to the total running time.

An infrequently accessed node does not add much to the total running time in any case.

Note

These self-adjusting trees guarantee only amortized logarithmic time per operation.


Advantages and drawbacks of self-adjusting trees

Advantages

They are relatively simple to implement.

They do not require explicit balance information to be stored at nodes.

Splay trees can be shown to be optimal with respect to arbitrary access frequencies for the items being stored.

Drawbacks

They restructure the entire tree during updates and even during simple search operations.

During any given operation, splay trees may perform a logarithmic number of rotations.

We do not have the guarantee that every operation will run quickly.


Treaps

Treaps are an efficient randomized alternative to balanced trees and self-adjusting trees.

Treaps achieve essentially the same time bounds in the expected sense, but with the following advantages:

1 They do not require any explicit balance information.
2 The expected number of rotations performed is small for each operation.
3 They are extremely simple to implement.


binary search tree

A (full, endogenous) binary tree whose nodes have key values associated with them is a binary search tree if the key values are in the symmetric order.

heap

If the key values decrease monotonically along any root-leaf path, we call the structure a heap and say that the keys are stored in a heap order.

treap

Consider a binary tree where each node v contains a pair of values: a key k(v) as well as a priority p(v). We call this structure a treap if it is a binary search tree with respect to the key values and, simultaneously, a heap with respect to the priorities.


example of treaps

S = {(k1, p1), ..., (kn, pn)}

S = {(2, 13), (4, 26), (6, 19), (7, 30), (9, 14), (11, 27), (12, 22)}


Theorem 8.1

Let S = {(k1, p1), ..., (kn, pn)} be any set of key-priority pairs such that the keys and the priorities are distinct. Then, there exists a unique treap T(S) for it.

Proof:

It is obvious that the theorem is true for n = 0 and for n = 1.

Suppose now that n ≥ 2, and assume that (k1, p1) has the highest priority in S. Then, a treap for S can be constructed by putting item 1 at the root of T(S).

A treap for the items in S of key value smaller (larger) than k1 can be constructed recursively, and this is stored as the left (right) sub-tree of item 1.


implementation of Operations using treap

MAKESET(S) and FIND(k, S) are implemented exactly as before.

INSERT(k, S):

Do FIND(k, S) and insert k at the empty leaf node where the search terminates with failure. If the heap order property is violated (parent(k).p < k.p):

Repeat: decrease k's depth by performing a rotation at the node w = parent(k), so that k becomes the parent of w,
until k either becomes the root or parent(k).p > k.p.
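The insertion procedure above can be sketched recursively: the rotation at parent(k) happens as the recursion unwinds. This is a hedged sketch rather than the slides' code; TreapNode, treap_insert and shape are illustrative names, and priorities are passed in explicitly (instead of being drawn at random) so that the example set S from the earlier slide can be reproduced.

```python
class TreapNode:
    def __init__(self, key, priority):
        self.key, self.priority = key, priority
        self.left = self.right = None


def rotate_right(v):
    u = v.left
    v.left, u.right = u.right, v
    return u


def rotate_left(v):
    u = v.right
    v.right, u.left = u.left, v
    return u


def treap_insert(root, key, priority):
    # Ordinary BST insertion at the empty leaf ...
    if root is None:
        return TreapNode(key, priority)
    if key < root.key:
        root.left = treap_insert(root.left, key, priority)
        # ... followed by a rotation whenever the heap order is violated.
        if root.left.priority > root.priority:
            root = rotate_right(root)
    elif key > root.key:
        root.right = treap_insert(root.right, key, priority)
        if root.right.priority > root.priority:
            root = rotate_left(root)
    return root


def shape(node):
    # Nested tuples (key, left, right) describing the tree structure.
    if node is None:
        return None
    return (node.key, shape(node.left), shape(node.right))


items = [(2, 13), (4, 26), (6, 19), (7, 30), (9, 14), (11, 27), (12, 22)]

t1 = None
for k, p in items:
    t1 = treap_insert(t1, k, p)

t2 = None
for k, p in reversed(items):
    t2 = treap_insert(t2, k, p)
```

Both insertion orders produce the same tree, in line with the uniqueness statement of Theorem 8.1.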


implementation of Operations using treap: Add(), Example


implementation of Operations using treap: Delete(),Example

DELETE(k, S): the operation is exactly the reverse of an insertion: rotate the node containing k downward until both its children are leaves, and then simply discard the node.

Note: The choice of the rotation (left or right) at each stage depends on the relative order of the priorities of the children of the node being deleted.
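A possible sketch of this deletion is given below. It is an assumption-laden illustration: the names TreapNode, treap_delete and shape are hypothetical, and the example treap is built by hand from the set S = {(2, 13), (4, 26), (6, 19), (7, 30), (9, 14), (11, 27), (12, 22)} used in the earlier example.

```python
class TreapNode:
    def __init__(self, key, priority, left=None, right=None):
        self.key, self.priority = key, priority
        self.left, self.right = left, right


def rotate_right(v):
    u = v.left
    v.left, u.right = u.right, v
    return u


def rotate_left(v):
    u = v.right
    v.right, u.left = u.left, v
    return u


def treap_delete(root, key):
    if root is None:
        return None
    if key < root.key:
        root.left = treap_delete(root.left, key)
    elif key > root.key:
        root.right = treap_delete(root.right, key)
    elif root.left is None and root.right is None:
        return None                     # both children are leaves: discard
    elif root.right is None or (
        root.left is not None and root.left.priority > root.right.priority
    ):
        root = rotate_right(root)       # the higher-priority left child moves up
        root.right = treap_delete(root.right, key)
    else:
        root = rotate_left(root)        # the higher-priority right child moves up
        root.left = treap_delete(root.left, key)
    return root


def shape(node):
    if node is None:
        return None
    return (node.key, shape(node.left), shape(node.right))


treap = TreapNode(
    7, 30,
    TreapNode(4, 26, TreapNode(2, 13), TreapNode(6, 19)),
    TreapNode(11, 27, TreapNode(9, 14), TreapNode(12, 22)),
)
treap = treap_delete(treap, 7)
```

Rotating the child with the larger priority upward keeps the heap order intact among the remaining nodes at every step.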


Delete(), Example


JOIN(S1, k, S2): operation as before, and the resulting structure is a treap provided the priority of k is higher than that of any item in S1 or S2. If the new root (containing k) violates the heap order, we simply rotate that node downward until each of the two children of the node has a smaller priority or is a leaf.

PASTE(S1, S2): as in a BST.

SPLIT(k, S):

1 delete k from S.
2 insert it into S with a priority of ∞, so that k ends up at the root and its left and right sub-trees are S1 and S2.


left spine of a tree: the path obtained by starting at the root and repeatedly moving to the left child until a leaf is reached;

the right spine is defined similarly.


Mulmuley Games

Mulmuley games are useful abstractions of processes underlying the behavior of certain geometric algorithms. The cast of characters in these games is:

a set of players P = {P1, ..., Pp}
a set of stoppers S = {S1, ..., Ss}
a set of triggers T = {T1, ..., Tt}
a set of bystanders B = {B1, ..., Bb}

The set P ∪ S is drawn from a totally ordered universe. All players are smaller than all stoppers: for all i and j, Pi < Sj.


Exercise 8.5:

Let $H_k = \sum_{i=1}^{k} 1/i$ denote the kth Harmonic number. Show that:

$\sum_{k=1}^{n} H_k = (n+1) H_{n+1} - (n+1)$

Recall that $H_k = \ln k + O(1)$ (Proposition B.4).


Depending upon the set of active characters, we formulate four different games, with each game being more general than the previous one.

Game A.

Initial set of characters: X = P ∪ B. The game proceeds by repeatedly sampling from X without replacement, until the set X becomes empty.

Random variable V: the number of samples in which a player Pi is chosen such that Pi is larger than all previously chosen players.

Value of the game: $A_p = E[V]$


Lemma 8.2:

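For all p ≥ 0, $A_p = H_p$.

Proof:

Assume that the set of players is ordered as P1 > P2 > ... > Pp.

In Game A, bystanders do not affect the value of the game, so we can set b = 0.

If the first chosen player is Pi, the expected value of the game is $1 + A_{i-1}$, since only the i − 1 players larger than Pi can still contribute. Each Pi is chosen first with probability 1/p, so

$A_p = \sum_{i=1}^{p} \frac{1 + A_{i-1}}{p} = 1 + \sum_{i=1}^{p} \frac{A_{i-1}}{p}$

Upon rearrangement, using the fact that $A_0 = 0$,

$\sum_{i=1}^{p-1} A_i = p A_p - p$

By Exercise 8.5, the Harmonic numbers are the solution to the above equation.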
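The recurrence for Game A can be checked numerically with exact rational arithmetic. The sketch below is illustrative only; harmonic and game_a_values are hypothetical helper names, and the code simply iterates the recurrence from A_0 = 0.

```python
from fractions import Fraction


def harmonic(k):
    # H_k = 1 + 1/2 + ... + 1/k, with H_0 = 0.
    return sum((Fraction(1, i) for i in range(1, k + 1)), Fraction(0))


def game_a_values(p_max):
    # a[p] = A_p, computed from A_p = 1 + (A_0 + ... + A_{p-1}) / p.
    a = [Fraction(0)]
    for p in range(1, p_max + 1):
        a.append(1 + sum(a[:p], Fraction(0)) / p)
    return a


values = game_a_values(12)
```

Comparing values[p] with harmonic(p) for each p confirms Lemma 8.2 on this range; the same harmonic helper also lets one check the identity of Exercise 8.5 for small n.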


Game C.

Initial set of characters: X = P ∪ B ∪ S.

The stoppers are treated as players, but the game stops when a stopper is chosen for the first time.

Value of the game: $C_p^s = E[V + 1] = E[V] + 1$

Note

Since all players are smaller than all stoppers, we will always get a contribution of 1 to the game value from the first stopper.


Lemma 8.3

For all p, s ≥ 0, $C_p^s = 1 + H_{s+p} - H_s$.

Proof

Assume that the set of players is ordered as P1 > P2 > ... > Pp.

As in Game A, bystanders do not affect the value of the game, so we can set b = 0.

If the first sample is the player Pi, which happens with probability 1/(s + p), the expected game value is $1 + C_{i-1}^s$.

If the first sample is a stopper, which happens with probability s/(s + p), the game value is 1.


Proof of Lemma 8.3 (cont.)

$C_p^s = \left(\frac{s}{s+p} \times 1\right) + \left(\frac{1}{s+p} \times \sum_{i=1}^{p} \left(1 + C_{i-1}^s\right)\right)$

Upon rearrangement, using the fact that $C_0^s = 1$, we obtain that

$C_p^s = \frac{s+p+1}{s+p} + \frac{\sum_{i=1}^{p-1} C_i^s}{s+p}$

which is equivalent to

$\sum_{i=1}^{p-1} C_i^s = (s+p)\, C_p^s - (s+p+1)$

Once again, using Exercise 8.5 it can be verified that the solution to the recurrence is given by $C_p^s = 1 + H_{s+p} - H_s$.
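As a sanity check, the recurrence for Game C can be iterated exactly and compared with the closed form. This is a small sketch with hypothetical helper names (harmonic, game_c_values); it conditions on the first sample being a stopper or one of the p players, as in the proof.

```python
from fractions import Fraction


def harmonic(k):
    return sum((Fraction(1, i) for i in range(1, k + 1)), Fraction(0))


def game_c_values(s, p_max):
    # c[p] = C_p^s, starting from C_0^s = 1.
    c = [Fraction(1)]
    for p in range(1, p_max + 1):
        # Sum over the players P_1..P_p of (1 + C_{i-1}^s); each is first w.p. 1/(s+p).
        players = sum((1 + c[i - 1] for i in range(1, p + 1)), Fraction(0))
        # A stopper is first w.p. s/(s+p) and then contributes exactly 1.
        c.append((s + players) / (s + p))
    return c


c_with_two_stoppers = game_c_values(2, 8)
```

For every s and p tried, the computed value agrees with 1 + H_{s+p} − H_s.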


Game D and E.

Games D and E are similar to Games A and C, but:

in Game D, X = P ∪ B ∪ T, and in Game E, X = P ∪ B ∪ S ∪ T.

The role of the triggers is that the counting process begins only after the first trigger has been chosen.

i.e.,

a player or a stopper contributes to V only if it is sampled after a trigger and before any stopper (and of course it is larger than all previously chosen players).


Lemma 8.4: For all p, t ≥ 0, $D_p^t = H_p + H_t - H_{p+t}$.

Lemma 8.5: For all p, s, t ≥ 0, $E_p^{s,t} = \frac{t}{s+t} + (H_{s+p} - H_s) - (H_{s+p+t} - H_{s+t})$.


Analysis of Treaps

Memoryless property

Since the random priorities for the elements of S are chosen independently, we can assume that the priorities are chosen before the insertion process is initiated.

Once the priorities have been fixed, Theorem 8.1 implies that the treap T is uniquely determined.

This implies that the order in which the elements are inserted does not affect the structure of the tree.

Without loss of generality, we can assume that the elements of the set S are inserted into T in the order of decreasing priority.

An advantage of this view is that it implies that all insertions take place at the leaves and no rotations are required to ensure the heap order on the priorities.


Lemma 8.6

Let T be a random treap for a set S of size n. For an element x ∈ S having rank k,

$E[\mathrm{depth}(x)] = H_k + H_{n-k+1} - 1$

Idea of proof

$S^- = \{y \in S \mid y \le x\}$, $S^+ = \{y \in S \mid y \ge x\}$. Since x has rank k, it follows that $|S^-| = k$ and $|S^+| = n - k + 1$.

$Q_x \subseteq S$: the ancestors of x

$Q_x^- = S^- \cap Q_x$, $Q_x^+ = S^+ \cap Q_x$

We will establish that $E[|Q_x^-|] = H_k$. By symmetry, it follows that $E[|Q_x^+|] = H_{n-k+1} - 1$.


Consider any ancestor $y \in Q_x^-$ of the node x.

By the memoryless assumption, y must have been inserted prior to x: $p_y > p_x$.

Since y < x, it must be the case that x lies in the right sub-tree of y.

The search for every element z whose value lies between y and x (y < z < x) must follow the path from the root to y, and in fact go into the right sub-tree of y.

We conclude that y is an ancestor of every node containing an element of value between y and x. By our assumption, z must have been inserted after y, and hence is of lower priority than y.


Proof (cont.)

The preceding argument establishes that an element $y \in S^-$ is an ancestor of x, i.e. a member of $Q_x^-$, if and only if it was the largest element of $S^-$ in the treap at the time of its insertion.

The order of insertion is determined by the order of the priorities, and the latter is uniformly distributed.

Thus, the order of insertion can be viewed as being determined by uniform sampling without replacement from the pool S.

We can now claim that the distribution of $|Q_x^-|$ is the same as that of the value of Game A when $P = S^-$ and $B = S \setminus S^-$. Since $|S^-| = k$, the expected size of $Q_x^-$ is $H_k$.
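For small n, Lemma 8.6 can be checked by brute force: enumerate every relative order of the priorities, read off the depth of the rank-k key in the resulting (unique) treap, and average. The sketch below assumes that the root has depth 1, which matches the formula for n = 1; the helper names harmonic, depth and expected_depth are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def harmonic(k):
    return sum((Fraction(1, i) for i in range(1, k + 1)), Fraction(0))


def depth(priorities, k):
    # Keys are 1..n and priorities[j - 1] is the priority of key j.
    # The root of any key range is its highest-priority key (Theorem 8.1).
    lo, hi, d = 1, len(priorities), 1
    while True:
        root = max(range(lo, hi + 1), key=lambda j: priorities[j - 1])
        if root == k:
            return d
        if k < root:
            hi = root - 1   # continue in the left sub-tree
        else:
            lo = root + 1   # continue in the right sub-tree
        d += 1


def expected_depth(n, k):
    # Exact average over all n! equally likely priority orders.
    orders = list(permutations(range(n)))
    return Fraction(sum(depth(order, k) for order in orders), len(orders))


middle_of_three = expected_depth(3, 2)
```

The enumeration is exponential in n, so it is only meant as a check on tiny instances, not as a way to analyze treaps.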


For any element x in a treap,
Lx: length of the left spine of the right sub-tree of x.
Rx: length of the right spine of the left sub-tree of x.

Lemma 8.7

Let T be a random treap for a set S of size n. For an element x ∈ S of rank k,

$E[R_x] = 1 - \frac{1}{k}$, $E[L_x] = 1 - \frac{1}{n-k+1}$


Proof: (1) an element z < x lies on the right spine of the left sub-tree of x if and only if (2) z is inserted after x, and all elements y whose values lie between z and x (z < y < x) are inserted after z.


Proof (2) ⇒ (1):

Suppose z is inserted after x, and all elements y whose values lie between z and x (z < y < x) are inserted after z. We show that z lies on the right spine of the left sub-tree of x.
a. If x is an ancestor of z: if z does not lie on the right spine of the left sub-tree of x, then z has an ancestor u (or v) with z < u < x (or z < v < x); since u (or v) is an ancestor of z, it was inserted before z (contradiction).
b. If x is not an ancestor of z: let w be the lowest common ancestor of z and x. We see that z < w < x, and since w is an ancestor of z, it must have been inserted before z (contradiction).


Proof (1) ⇒ (2):

Suppose an element z < x lies on the right spine of the left sub-tree of x. We show that z is inserted after x, and all elements y whose values lie between z and x (z < y < x) are inserted after z.

Since x is an ancestor of z, it was inserted before z. Also, since every element y with z < y < x must lie in the right sub-tree of z, each such y is inserted after z.


Search in Skip List

We search for a key x in a skip list as follows:

We start at the first position of the top list.

At the current position p, we compare x with y ← key(next(p)):

x = y: we return element(next(p))
x > y: we scan forward
x < y: we drop down

Example: search for 78
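The search rule can be sketched on a small hand-built skip list. Everything named here is an assumption made for illustration: SkipNode, build_skip_list and skip_search are hypothetical, each list is bracketed by −∞ and +∞ sentinels, and the keys and levels are made up (the figure from the slide is not available), although they do include the key 78.

```python
import math


class SkipNode:
    def __init__(self, key, element=None):
        self.key, self.element = key, element
        self.next = None    # successor in the same list
        self.down = None    # same key in the list below


def build_skip_list(levels):
    # levels: sorted key lists, bottom list first; each list is a subset
    # of the one below it. Returns the first position of the top list.
    below = {}
    head = None
    for keys in levels:
        nodes = [SkipNode(-math.inf)]
        nodes += [SkipNode(k, f"item-{k}") for k in keys]
        nodes.append(SkipNode(math.inf))
        for a, b in zip(nodes, nodes[1:]):
            a.next = b
        for node in nodes:
            node.down = below.get(node.key)
        below = {node.key: node for node in nodes}
        head = nodes[0]
    return head


def skip_search(head, x):
    p = head
    while p is not None:
        y = p.next.key
        if x == y:
            return p.next.element
        if x > y:
            p = p.next      # scan forward
        else:
            p = p.down      # drop down; None below the bottom list
    return None


head = build_skip_list([
    [12, 23, 26, 31, 34, 44, 56, 64, 78],
    [23, 31, 34, 64],
    [31, 64],
    [31],
])
found = skip_search(head, 78)
missing = skip_search(head, 50)
```

Dropping below the bottom list signals an unsuccessful search, which is why skip_search returns None for the key 50.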


Tree representation of a skip list


Analyzing Random Skip Lists

A random leveling of the set S is defined as follows:

Given the choice of level Li, the level Li+1 is defined by independently choosing to retain each element x ∈ Li with probability 1/2.

The process starts with L1 = S and terminates when a newly constructed level is empty.

Alternate view:

Let the levels l(x) for x ∈ S be independent random variables, each with the geometric distribution with parameter p = 1/2.

Let r = max_{x∈S} l(x) + 1.

Place x in each of the levels L1, ..., Ll(x).

As with random treaps, a random level is chosen for every element of S upon its insertion and remains fixed until the element is deleted.
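Both views of a random leveling can be sketched in a few lines. The function names random_leveling and geometric_level are illustrative, and the optional rng argument is an assumption added so that runs can be reproduced with a seeded generator.

```python
import random


def random_leveling(S, rng=None):
    # Level-by-level view: L_1 = S, and each element of L_i is retained
    # in L_{i+1} independently with probability 1/2.
    rng = rng if rng is not None else random.Random()
    levels = [sorted(S)]
    while levels[-1]:
        levels.append([x for x in levels[-1] if rng.random() < 0.5])
    return levels   # the last entry is the empty level L_r


def geometric_level(rng):
    # Alternate view: l(x) is geometric with parameter 1/2.
    level = 1
    while rng.random() < 0.5:
        level += 1
    return level


levels = random_leveling(range(100), random.Random(1))
r = len(levels)
```

With this representation the number of levels r is simply the length of the returned list, including the final empty level.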


Lemma 8.9

The number of levels r in a random leveling of a set S of size n has expected value E[r] = O(log n). Moreover, r = O(log n) with high probability.

Proof:

$r = \max_{x \in S} l(x) + 1$.

The levels l(x) are i.i.d. random variables distributed geometrically with parameter 1/2.

$\Pr[\max_i X_i > t] \le n(1-p)^t = \frac{n}{2^t}$,

where we have p = 1/2. Choosing $t = \alpha \log n$ and $r = \max_i X_i$, we have:

$\Pr[r > \alpha \log n] \le \frac{1}{n^{\alpha - 1}}$

for any α > 1.


Lemma 8.10

Define Ij(y) as the interval at level j that contains y. For an interval I at level i + 1, c(I) denotes the number of children it has at level i.



Hash Tables

1 static dictionary: we are given a set of keys S and must organize it into a data structure that supports the efficient processing of FIND queries.

2 dynamic dictionary: the set S is not provided in advance. Instead, it is constructed by a series of INSERT and DELETE operations that are intermingled with the FIND queries.

Data-structuring problem

All data structures discussed earlier require Ω(log n) time to process any search or update operation.

These time bounds are optimal:

for data structures based on pointers and search trees we are faced with a logarithmic lower bound. These time bounds are based on the fact that the only computation we can perform over the keys is to compare them and thereby determine their relationship in the underlying total order.


Hash Tables

Suppose:

keys in S are chosen from a totally ordered universe M of size m.
w.l.o.g., M = {0, ..., m − 1}
keys are distinct.

The idea: create an array T[0..m − 1] of size m in which

T[k] = 1 if k ∈ S
T[k] = NULL otherwise

This is called a direct-address table.

Operations take O(1) time. So what's the problem?


Direct addressing works well when the range m of keys is relatively small.

But what if the keys are 32-bit integers?

Problem 1: the direct-address table will have 2^32 entries, more than 4 billion.
Problem 2: even if memory is not an issue, the time to initialize the elements to NULL may be.

We want to reduce the size of the table to a value close to |S|, while maintaining the property that a search or update can be performed in O(1) time.


A table T consisting of n cells indexed by N = {0, ..., n − 1}.
A hash function h(), which is a mapping from M into N.

n < m; otherwise use a direct-address table.

A collision occurs when two distinct keys x and y map to the same location, i.e. h(x) = h(y).

Goal: maintain a small table, and use the hash function h to map keys into this table. If h behaves randomly, we shouldn't get too many collisions.


Hash Tables Chaining

Chaining puts elements that collide in a linked list:


Universal Hash Families

2-universal

Let M = {0, ..., m − 1} and N = {0, ..., n − 1}, with m ≥ n. A family H of functions from M into N is said to be 2-universal if, for all x, y ∈ M such that x ≠ y, and for h chosen uniformly at random from H,

$\Pr[h(x) = h(y)] \le \frac{1}{n}$


Define the following indicator function for a collision between the keys x and y under the hash function h:

$\delta(x, y, h) = \begin{cases} 1 & \text{for } h(x) = h(y) \text{ and } x \ne y \\ 0 & \text{otherwise} \end{cases}$

For all X, Y ⊆ M, define the following extensions of the indicator function δ:

$\delta(x, y, H) = \sum_{h \in H} \delta(x, y, h)$,

$\delta(x, Y, h) = \sum_{y \in Y} \delta(x, y, h)$,

$\delta(X, Y, h) = \sum_{x \in X} \delta(x, Y, h)$,

$\delta(x, Y, H) = \sum_{y \in Y} \delta(x, y, H)$,

$\delta(X, Y, H) = \sum_{h \in H} \delta(X, Y, h)$.


Note

For a 2-universal family H and any x ≠ y, we have $\delta(x, y, H) \le |H|/n$.

Theorem 8.12:

For any family H of functions from M to N, there exist x, y ∈ M such that

$\delta(x, y, H) > \frac{|H|}{n} - \frac{|H|}{m}$


Proof of Theorem 8.12

Proof

Fix some function h ∈ H, and for each z ∈ N define the set of elements of M mapped to z as

$A_z = \{x \in M \mid h(x) = z\}$

The sets $A_z$, for z ∈ N, form a partition of M. It is easy to verify that

$\delta(A_w, A_z, h) = \begin{cases} 0 & w \ne z \\ |A_z|(|A_z| - 1) & w = z \end{cases}$

The total number of collisions between all possible pairs of elements is minimized when these sets $A_z$ are all of the same size. We obtain

$\delta(M, M, h) = \sum_{z \in N} |A_z|(|A_z| - 1) \ge n \left(\frac{m}{n}\left(\frac{m}{n} - 1\right)\right) = m^2 \left(\frac{1}{n} - \frac{1}{m}\right)$


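Proof (cont.)

$\delta(M, M, H) = \sum_{h \in H} \delta(M, M, h) \ge |H|\, m^2 \left(\frac{1}{n} - \frac{1}{m}\right)$.

By the pigeonhole principle, there exist x, y ∈ M such that:

$\delta(x, y, H) \ge \frac{\delta(M, M, H)}{m^2} = \frac{\sum_{h \in H} \delta(M, M, h)}{m^2} \ge \frac{|H|\, m^2 \left(\frac{1}{n} - \frac{1}{m}\right)}{m^2} = |H| \left(\frac{1}{n} - \frac{1}{m}\right)$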


Lemma 8.13:

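For all x ∈ M, S ⊆ M, and h chosen uniformly at random from a 2-universal family H,

$E[\delta(x, S, h)] \le \frac{|S|}{n}$

Proof:

$E[\delta(x, S, h)] = \sum_{h \in H} \frac{\delta(x, S, h)}{|H|}$

$= \frac{1}{|H|} \sum_{h \in H} \sum_{y \in S} \delta(x, y, h)$

$= \frac{1}{|H|} \sum_{y \in S} \sum_{h \in H} \delta(x, y, h)$

$= \frac{1}{|H|} \sum_{y \in S} \delta(x, y, H)$

$\le \frac{1}{|H|} \sum_{y \in S} \frac{|H|}{n}$

$= \frac{|S|}{n}$.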


In our dynamic dictionary scheme:

Notes

A hash function h ∈ H is chosen uniformly at random and remains fixed during the entire sequence of updates and queries.

An inserted key x is stored at the location h(x), and due to collisions there could be other keys also stored at that location.

The keys colliding at a given location are organized into a linked list.

Assuming that the set of keys currently stored in the table is S ⊆ M,

the length of the linked list at h(x) is δ(x, S, h), which has expectation at most |S|/n.
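A dynamic dictionary along these lines might look as follows. This is a sketch under assumptions, not the slides' implementation: the class name ChainedHashTable and its methods are hypothetical, Python lists stand in for the linked lists, and the hash function is drawn from the family x ↦ ((ax + b) mod p) mod n that the text constructs in the section on universal hash families, with p a prime at least as large as the key universe.

```python
import random


class ChainedHashTable:
    def __init__(self, n, p, rng=None):
        # n: number of cells; p: a prime >= the universe size (assumed).
        rng = rng if rng is not None else random.Random()
        self.n, self.p = n, p
        # h is chosen once, uniformly at random, and then stays fixed.
        self.a = rng.randrange(1, p)
        self.b = rng.randrange(0, p)
        self.cells = [[] for _ in range(n)]   # one chain per cell

    def _h(self, x):
        return ((self.a * x + self.b) % self.p) % self.n

    def insert(self, x):
        chain = self.cells[self._h(x)]
        if x not in chain:
            chain.append(x)

    def find(self, x):
        return x in self.cells[self._h(x)]

    def delete(self, x):
        chain = self.cells[self._h(x)]
        if x in chain:
            chain.remove(x)


table = ChainedHashTable(n=8, p=101, rng=random.Random(0))
for key in [3, 17, 42, 58, 99]:
    table.insert(key)
table.delete(42)
```

Each operation scans a single chain, so its cost is governed by the chain length δ(x, S, h) discussed above.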


Theorem 8.14

Consider a request sequence R = R1, R2, ..., Rr of update and search operations starting with an empty hash table. Suppose that this sequence contains s INSERT operations. Let ρ(h, R) denote the total cost of processing these requests using the hash function h ∈ H.

Theorem 8.14:

For any sequence R of length r with s INSERTs, and h chosen uniformly at random from a 2-universal family H,

$E[\rho(h, R)] \le r\left(1 + \frac{s}{n}\right)$


Constructing Universal Hash Families

Fix m and n, and choose a prime p ≥ m. We will work over the field $Z_p = \{0, 1, \ldots, p-1\}$. Let $g : Z_p \to N$ be the function given by g(x) = x mod n.

For all a, b ∈ $Z_p$, define the linear function $f_{a,b} : Z_p \to Z_p$ and the hash function $h_{a,b} : Z_p \to N$ as follows.

$f_{a,b}(x) = (ax + b) \bmod p$
$h_{a,b}(x) = g(f_{a,b}(x)) = ((ax + b) \bmod p) \bmod n$


We consider the family of hash functions $H = \{h_{a,b} \mid a, b \in Z_p \text{ with } a \ne 0\}$.

Lemma 8.15

For all x, y ∈ $Z_p$ such that x ≠ y,

$\delta(x, y, H) = \delta(Z_p, Z_p, g)$.


Proof

Suppose that x and y collide under a specific function $h_{a,b}$. Let $f_{a,b}(x) = r$ and $f_{a,b}(y) = s$. Observe that r ≠ s since a ≠ 0 and x ≠ y. A collision takes place if and only if g(r) = g(s), or equivalently, r ≡ s (mod n).


Now, having fixed x and y, for each such choice of r ≠ s, the values of a and b are uniquely determined by the solution of:

ax + b ≡ r (mod p)
ay + b ≡ s (mod p)

Hence the number of functions $h_{a,b}$ under which x and y collide equals the number of pairs r ≠ s with g(r) = g(s), which is $\delta(Z_p, Z_p, g)$.


Theorem 8.16:

The family $H = \{h_{a,b} \mid a, b \in Z_p \text{ with } a \ne 0\}$ is a 2-universal family.

Proof: For each z ∈ N, let $A_z = \{x \in Z_p \mid g(x) = z\}$; it is clear that $|A_z| \le \lceil p/n \rceil$. In other words, for every r ∈ $Z_p$ there are at most $\lceil p/n \rceil$ different choices of s ∈ $Z_p$ such that g(r) = g(s). Since there are p different choices of r ∈ $Z_p$ to start with,

$\delta(Z_p, Z_p, g) \le p\left(\lceil p/n \rceil - 1\right) \le \frac{p(p-1)}{n}$

By Lemma 8.15, $\delta(x, y, H) = \delta(Z_p, Z_p, g)$, and we have just shown that $\delta(Z_p, Z_p, g) \le \frac{p(p-1)}{n}$,

so $\delta(x, y, H) \le \frac{p(p-1)}{n}$. Since |H| = p(p − 1), it follows that $\delta(x, y, H) \le \frac{|H|}{n}$.
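For a small prime the claims of Lemma 8.15 and Theorem 8.16 can be verified exhaustively. The sketch below uses the illustrative values p = 13 and n = 4 and the hypothetical helper names h, collisions and delta_g; it counts, for every pair x ≠ y, the number of pairs (a, b) with a ≠ 0 under which x and y collide.

```python
from itertools import combinations


def h(a, b, p, n, x):
    return ((a * x + b) % p) % n


def collisions(x, y, p, n):
    # delta(x, y, H): number of functions h_{a,b}, a != 0, with h(x) == h(y).
    return sum(
        1
        for a in range(1, p)
        for b in range(p)
        if h(a, b, p, n, x) == h(a, b, p, n, y)
    )


def delta_g(p, n):
    # delta(Z_p, Z_p, g): ordered pairs r != s with r mod n == s mod n.
    return sum(
        1
        for r in range(p)
        for s in range(p)
        if r != s and r % n == s % n
    )


p, n = 13, 4
worst = max(collisions(x, y, p, n) for x, y in combinations(range(p), 2))
family_size = p * (p - 1)
```

Dividing worst by family_size gives the largest collision probability over all pairs, which the theorem bounds by 1/n.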


Definition 8.6

Let M = {0, 1, ..., m − 1} and N = {0, 1, ..., n − 1}, with m ≥ n. A family H of functions from M into N is said to be strongly 2-universal if for all x1 ≠ x2 ∈ M, any y1, y2 ∈ N, and h chosen uniformly at random from H,

$\Pr[h(x_1) = y_1 \text{ and } h(x_2) = y_2] = \frac{1}{n^2}$.


Definition 8.7

Definition

A family of hash functions H = {h : M → N} is said to be a perfect hash family if for each set S ⊂ M of size s < n there exists a hash function h ∈ H that is perfect for S.

Note: It is clear that perfect hash families exist: for example, the family of all possible functions from M to T is a perfect hash family.
Given a perfect hash family H, we solve the static dictionary problem by:

1 finding h ∈ H perfect for S.

2 storing each key x ∈ S at the location T[h(x)].

3 responding to a search query for a key q by examining the contents of T[h(q)].


The preprocessing cost:

depends on the cost of identifying a perfect hash function for a specific choice of S.

Search cost:

depends on the time required to evaluate the hash function.


Since the choice of the hash function will depend on the set S, its description must also be stored in the table.

Suppose that the size of the perfect hash family H is r.

Storing the description of a hash function from H will require Ω(log r) bits.

It is essential that the description of the hash function should fit into O(1) locations in the table T.

A cell in the table can be used to encode at most log m bits of information.

Note

Therefore, we will only be interested in constructing hash families whose size r is bounded by a polynomial in m.


Exercise 8.13:

Assume for simplicity that n = s. Show that for $m = 2^{\Omega(s)}$, there exist perfect hash families of size polynomial in m.

Thus, the existence of a perfect hash family is guaranteed only for values of m that are extremely large relative to n.

Exercise 8.14:

Assuming that n = s, show that any perfect hash family must have size $2^{\Omega(s)}$.

Thus, we need to have $m = 2^{\Omega(s)}$, or s = O(log m), to guarantee even the existence of a perfect hash family of size polynomial in m. Unfortunately, in practice the case s = O(log m) is not very interesting for typical values of m, e.g. for $m = 2^{32}$.
Solution: using double hashing.


Definition 8.8

Let S ⊂ M and h : M → N. For each table location 0 ≤ i ≤ n − 1, we define the bin

$B_i(h, S) = \{x \in S \mid h(x) = i\}$

The size of a bin is denoted by $b_i(h, S) = |B_i(h, S)|$.

Definition 8.9:

A hash function h is b-perfect for S if $b_i(h, S) \le b$ for each i. A family of hash functions H = {h : M → N} is said to be a b-perfect hash family if for each S ⊂ M of size s there exists a hash function h ∈ H that is b-perfect for S.


Exercise 8.15:

Show that there exists a b-perfect hash family H such that b = O(log n) and |H| ≤ m, for any m ≥ n.

Double hashing:

At the first level we use a (log m)-perfect hash function h to map S into the primary table T.

Consider the bin Bi consisting of all keys from S mapped into a particular cell T[i].

The elements of the bin Bi are mapped into the secondary table Ti associated with that location using a secondary hash function hi.

Since the size of Bi is bounded by b, we can find a hash function hi that is perfect for Bi provided $2^b$ is polynomially bounded in m. For b = O(log m) this condition holds.


The double hashing scheme can be implemented with O(1) query time, for any m ≥ n.

The goal of the primary hash function should be to create bins small enough that some perfect hash functions can be used as the secondary hash functions.

Exercise 8.16:

Consider a table of size r indexed by R = {0, ..., r − 1}. Show that there exists a perfect hash family H = {h : M → R} with |H| ≤ m provided that $r = \Omega(s^2)$, for all m ≥ s.


Towards our final solution

We will use a primary table of size n = s, choosing a primary hash function that ensures that the bin sizes are small.

The perfect hash functions from Exercise 8.16 are then used to resolve the collisions by using secondary hash tables of size quadratic in the bin sizes.

Total space required by the double hashing scheme:

$s + O\left(\sum_{i=0}^{s-1} b_i^2\right)$


Achieving Bounded Query Time

Our goal now is:

1 to find primary hash functions which ensure that the sum of the squares of the bin sizes is linear.

2 to find perfect hash functions for the secondary tables, which use at most quadratic space.


Definition 8.10:

Consider any V ⊆ M with |V| = v, and let R = {0, ..., r − 1} with r ≥ v. For 1 ≤ k ≤ p − 1, define the function $h_k : M \to R$ as follows:

$h_k(x) = (kx \bmod p) \bmod r$.

For each i ∈ R, the bins corresponding to the keys colliding at i are denoted as

$B_i(k, r, V) = \{x \in V \mid h_k(x) = i\}$

and their sizes are denoted by $b_i(k, r, V) = |B_i(k, r, V)|$.


Lemma 8.17:

For all V ⊆ M of size v, and all r ≥ v,

$\sum_{k=1}^{p-1} \sum_{i=0}^{r-1} \binom{b_i(k, r, V)}{2} < \frac{(p-1)v^2}{r} = \frac{mv^2}{r}$    (8.2)

(The final equality uses p = m + 1.)

Proof: The left-hand side of (8.2) counts the number of tuples (k, x, y) such that $h_k$ causes x and y to collide, i.e.,

1 x, y ∈ V with x ≠ y, and

2 ((kx mod p) mod r) = ((ky mod p) mod r).

The relation between k and x, y is as follows:

$k(x - y) \bmod p \in \{\pm r, \pm 2r, \pm 3r, \ldots, \pm \lfloor (p-1)/r \rfloor r\}$

Proof (cont.)

Since p is a prime and $Z_p$ is a field, for any fixed value of x − y there is a unique solution for k satisfying the equation

k(x − y) mod p = jr

for any value of j. This immediately implies that the number of values of k that cause a collision between x and y is at most $\frac{2(p-1)}{r}$.

Finally, noting that the number of choices of the pair x, y is $\binom{v}{2}$, we obtain

$\sum_{k=1}^{p-1} \sum_{i=0}^{r-1} \binom{b_i(k, r, V)}{2} \le \binom{v}{2} \frac{2(p-1)}{r} < \frac{(p-1)v^2}{r}$
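The counting bound can be checked directly on a small instance. The values below (p = 31, r = 7 and the key set V) are arbitrary illustrative choices, with every key smaller than p − 1, and bin_sizes and colliding_pairs are hypothetical helper names.

```python
from math import comb


def bin_sizes(k, r, V, p):
    # b_i(k, r, V) for i = 0..r-1 under h_k(x) = (k*x mod p) mod r.
    sizes = [0] * r
    for x in V:
        sizes[(k * x % p) % r] += 1
    return sizes


def colliding_pairs(k, r, V, p):
    # Sum over the bins of C(b_i, 2): the number of colliding pairs under h_k.
    return sum(comb(b, 2) for b in bin_sizes(k, r, V, p))


p, r = 31, 7
V = [0, 3, 4, 9, 15, 22, 27]
v = len(V)

total = sum(colliding_pairs(k, r, V, p) for k in range(1, p))
best = min(colliding_pairs(k, r, V, p) for k in range(1, p))
```

Since total is a sum over p − 1 values of k, the smallest term best must lie below the average, which is how a single good k is obtained from the lemma.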


Corollary 8.18

For all V ⊆ M of size v, and all r ≥ v, there exists k ∈ {1, ..., m} such that

$\sum_{i=0}^{r-1} \binom{b_i(k, r, V)}{2} < \frac{v^2}{r}$.


Theorem 8.19

For any S ⊆ M with |S| = s and m ≥ s, there exists a hash table representation of S that uses space O(s) and permits the processing of a FIND operation in O(1) time.

Proof: The double hashing scheme is as described above, and all that remains to be shown is that there are choices of the primary hash function $h_k$ and the secondary hash functions $h_{k_1}, \ldots, h_{k_s}$ that ensure the promised performance bounds.


Proof (cont.)

Consider first the primary hash function $h_k$. The only property desired of this function is that the sum of squares of the sizes of the colliding sets (the bins) be linear in n, to ensure that the space used by the secondary hash tables is O(s). Applying Corollary 8.18 to the case where V = S and R = T, implying that v = r = s, we obtain that there exists a k ∈ {1, ..., m} such that

$\sum_{i=0}^{s-1} \binom{b_i(k, s, S)}{2} < s$

or that $\sum_{i=0}^{s-1} b_i(k, s, S)\,[b_i(k, s, S) - 1] < 2s$.

Since $\bigcup_{i=0}^{s-1} B_i(k, s, S) = S$ and $\sum_{i=0}^{s-1} b_i(k, s, S) = s$,

$\sum_{i=0}^{s-1} b_i(k, s, S)^2 < 2s + \sum_{i=0}^{s-1} b_i(k, s, S) = 3s$

Consider now the secondary hash function $h_{k_i}$ for the set $S_i = B_i(k, s, S)$ of size $s_i$. Applying Corollary 8.18 to the case where $V = S_i$ (or $v = s_i$) and using a secondary hash table of size $r = s_i^2$, it follows that there exists a $k_i \in \{1, \ldots, m\}$ such that

$\sum_{j=0}^{s_i^2 - 1} \binom{b_j(k_i, s_i^2, S_i)}{2} < 1$

where $b_j(k_i, s_i^2, S_i)$ is the number of collisions at the jth location of the secondary hash table for T[i]. This can be the case only when each term of the summation is zero, implying that $b_j(k_i, s_i^2, S_i) \le 1$ for all j. Thus, it follows that there exists a perfect secondary hash function $h_{k_i}$.
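The whole two-level construction can be sketched as below. This is an illustration under assumptions rather than the slides' implementation: the names h, bins, find_k, build and find are hypothetical, good multipliers are found by simply trying k = 1, 2, ... (the corollary only guarantees that one exists), S is assumed non-empty, and p is assumed to be a prime larger than every key.

```python
from math import comb


def h(k, x, p, r):
    return (k * x % p) % r


def bins(k, r, V, p):
    out = [[] for _ in range(r)]
    for x in V:
        out[h(k, x, p, r)].append(x)
    return out


def find_k(V, r, p, bound):
    # First k in 1..p-1 whose bins satisfy sum_i C(b_i, 2) < bound.
    for k in range(1, p):
        if sum(comb(len(b), 2) for b in bins(k, r, V, p)) < bound:
            return k
    raise ValueError("no suitable k; is p a prime larger than every key?")


def build(S, p):
    s = len(S)
    k = find_k(S, s, p, s)              # primary table: v = r = s, bound v^2/r = s
    secondary = []
    for B in bins(k, s, S, p):
        if not B:
            secondary.append(None)      # empty bin: no secondary table
            continue
        r_i = len(B) ** 2               # quadratic space for this bin
        k_i = find_k(B, r_i, p, 1)      # bound 1 forces a collision-free table
        cells = [None] * r_i
        for x in B:
            cells[h(k_i, x, p, r_i)] = x
        secondary.append((k_i, cells))
    return k, secondary


def find(table, x, p):
    k, secondary = table
    entry = secondary[h(k, x, p, len(secondary))]
    if entry is None:
        return False
    k_i, cells = entry
    return cells[h(k_i, x, p, len(cells))] == x


p = 101
S = [3, 8, 15, 22, 37, 41, 58, 64, 79, 90]
table = build(S, p)
secondary_space = sum(len(e[1]) for e in table[1] if e is not None)
```

A query evaluates two hash functions and inspects one cell, and the secondary tables together occupy fewer than 3s cells, matching the bounds derived in the proof.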

