Some Open Questions Related to Cuckoo Hashing Michael Mitzenmacher Harvard University.

The Beginnings

Cuckoo Hashing

• Basic scheme: each element gets two possible locations (uniformly at random).

• To insert x, check both locations for x. If one is empty, insert.

• If both are full, x kicks out an old element y. Then y moves to its other location.

• If that location is full, y kicks out z, and so on, until an empty slot is found.
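The insertion rule above can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the salted BLAKE2 hashes standing in for two random hash functions, the `max_kicks` cutoff, and the class name `CuckooTable` are all assumptions made for the example.

```python
import hashlib


class CuckooTable:
    """Two-choice cuckoo hashing, one element per slot (illustrative sketch)."""

    def __init__(self, size, max_kicks=100):
        self.size = size
        self.max_kicks = max_kicks  # assumed cutoff before giving up
        self.slots = [None] * size

    def _locations(self, x):
        # Two salted hashes stand in for two independent random hash functions.
        locs = []
        for salt in (b"h0", b"h1"):
            d = hashlib.blake2b(repr(x).encode(), digest_size=8, salt=salt).digest()
            locs.append(int.from_bytes(d, "big") % self.size)
        return locs

    def lookup(self, x):
        # Worst-case constant time: only two slots are ever examined.
        return any(self.slots[i] == x for i in self._locations(x))

    def insert(self, x):
        """Return None on success, or the element left without a slot."""
        if self.lookup(x):
            return None
        a, b = self._locations(x)
        for i in (a, b):
            if self.slots[i] is None:
                self.slots[i] = x
                return None
        pos = a  # both full: x kicks out the occupant of its first location
        for _ in range(self.max_kicks):
            x, self.slots[pos] = self.slots[pos], x  # x now names the evicted element
            a, b = self._locations(x)
            pos = b if pos == a else a  # the evicted element's other location
            if self.slots[pos] is None:
                self.slots[pos] = x
                return None
        return x  # failure: a caller might rehash everything or stash this element
```

Returning the unplaced element rather than a boolean keeps the failure case explicit, since the element left over after a long eviction chain is usually not the one originally inserted.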

Cuckoo Hashing Examples

[Figures: step-by-step table diagrams showing elements F and then G being inserted into a table that holds A through E; G's insertion displaces existing elements.]

Why Do We Care About Cuckoo Hashing?

• Hash tables are a fundamental data structure.

• Multiple-choice hashing yields tables with
  – High memory utilization.
  – Constant time look-ups.
  – Simplicity: easily coded, parallelized.

• Cuckoo hashing expands on this, combining multiple choices with ability to move elements.

• Practical potential, and theoretically interesting!

Good Properties of Cuckoo Hashing

• Worst case constant lookup time.

• High memory utilizations possible.

• Simple to build, design.

Cuckoo Hashing Failures

• Bad case 1: inserted element runs into cycles.

• Bad case 2: inserted element has very long path before insertion completes.
  – Could be on a long cycle.

• Bad cases occur with very small probability when load is sufficiently low.

• Theoretical solution: re-hash everything if a failure occurs.

Various Representations

[Figures: several diagrams of the same table, one showing buckets alone and three showing buckets together with elements.]

Basic Performance

• For 2 choices, load less than 50%, n elements gives failure rate of Θ(1/n); maximum insert time O(log n).

• Related to random graph representation.
  – Each element is an edge, buckets are vertices.
  – Edge corresponds to two random choices of an element.
  – Small load implies small acyclic or unicyclic components, of size at most O(log n).

Natural Extensions

• More than 2 choices per element.
  – Very different: hypergraphs instead of graphs.
  – D. Fotakis, R. Pagh, P. Sanders, and P. Spirakis.
  – Space efficient hash tables with worst case constant access time.

• More than 1 element per bucket.
  – M. Dietzfelbinger and C. Weidling.
  – Balanced allocation and dictionaries with tightly packed constant size bins.

Variations

• Online: Elements inserted and deleted as you go.
  – Constant expected time + logarithmic (or polylogarithmic) time with high probability per element.

• Offline: All elements available at start.
  – Becomes a maximum matching problem.
  – No real moving of elements; equivalent to offline version of multiple-choice hashing of Azar, Broder, Karlin, and Upfal.
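To make the offline view concrete, here is a small sketch that treats placement as bipartite matching between elements and buckets and solves it with simple augmenting paths. The function name `offline_assign` and the input format are assumptions for illustration, and the recursive search is only suitable for small instances.

```python
def offline_assign(choices, num_buckets):
    """choices[e] lists the candidate buckets of element e.

    Returns a list giving one bucket per element, or None if no placement
    with one element per bucket exists.
    """
    owner = [None] * num_buckets  # owner[b] = element currently matched to bucket b

    def try_place(e, visited):
        # Search for an augmenting path starting from element e.
        for b in choices[e]:
            if b in visited:
                continue
            visited.add(b)
            if owner[b] is None or try_place(owner[b], visited):
                owner[b] = e
                return True
        return False

    for e in range(len(choices)):
        if not try_place(e, set()):
            return None  # e cannot be matched, so no full placement exists

    placement = [None] * len(choices)
    for b, e in enumerate(owner):
        if e is not None:
            placement[e] = b
    return placement
```

For example, `offline_assign([[0, 1], [0, 1], [1, 2]], 3)` finds a placement, while three elements that all choose between buckets 0 and 1 cannot be placed.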

Open Question 1: Random Walk Cuckoo Hashing

• More than 2 choices is important.
  – Much higher memory utilizations.
  – 3 choices: 90+% in experiments.
  – 4 choices: about 97%.

• Analysis [FPSS]: Use breadth first search on bipartite graph to find an augmenting path.
  – Not practical for many implementations.

Random Walk Cuckoo Hashing

• When it is time to kick something out, choose one randomly.

• Small state, effective.

• Intuition: if fraction p of the buckets are empty, random walk “should” have probability p of finding an empty bucket at each step.
  – Clearly wrong, but nice intuition.
  – Suggests logarithmic time to find an empty slot.
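A minimal sketch of random-walk insertion with d choices follows. The salted BLAKE2 hashes, the `max_steps` cutoff, and the rule of not immediately evicting from the slot just filled are assumptions added for the example, not details taken from the talk.

```python
import hashlib
import random


def d_locations(x, size, d=3):
    # d salted hashes stand in for d independent random hash functions.
    locs = []
    for j in range(d):
        salt = f"h{j}".encode()
        digest = hashlib.blake2b(repr(x).encode(), digest_size=8, salt=salt).digest()
        locs.append(int.from_bytes(digest, "big") % size)
    return locs


def random_walk_insert(table, x, rng, d=3, max_steps=1000):
    """Insert x into table (a list, None = empty slot).

    Returns (evictions, leftover): leftover is None on success, otherwise
    the element that was still unplaced when the walk was cut off.
    """
    prev = None
    for kicks in range(max_steps):
        cands = d_locations(x, len(table), d)
        for i in cands:
            if table[i] is None:
                table[i] = x
                return kicks, None
        # All d candidates are full: evict a random occupant, avoiding the
        # slot we just came from when another option exists.
        options = [i for i in cands if i != prev] or cands
        pos = rng.choice(options)
        x, table[pos] = table[pos], x  # x now names the evicted element
        prev = pos
    return max_steps, x
```

The only state carried between steps is the current element and the previous slot, which is the "small state" appeal compared with a breadth-first search for an augmenting path.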

The Open Question

• Find tight bounds on the performance of random walk cuckoo hashing in the online setting, for d ≥ 3 choices (and possibly more than one element per bucket).

Recent Progress

• Polylogarithmic bounds on insertion time for large number of choices: RANDOM 09, Frieze, Mitzenmacher, Melsted.

• Two step argument:
  – Most buckets have an augmenting path of length O(log log n) to an empty bucket. (Reach empty bucket with inverse logarithmic probability.)
  – Expansion gives such a bucket is found after O(log n) steps with high probability.

The Open Question

• Find tight bounds on the performance of random walk cuckoo hashing in the online setting, for d ≥ 3 choices (and possibly more than one element per bucket).

• Is logarithmic insertion time the right answer? Lower bounds?

• Better understanding of graph structure with 3 or more choices.

Open Question 2: Thresholds

• How much load can cuckoo hashing handle before collisions overwhelm it?

• There appear to be asymptotic thresholds.
  – Fine below the threshold, disaster after.
  – Useful for designs for real systems.

• The case for 2 choices, 1 element per bucket well understood.

• Less so for other cases.

The Open Question

• Tight thresholds for cuckoo hashing schemes, and corresponding efficient algorithms.

What’s Known

• 2 choices, 1 element per bucket well understood.

• For 2 choices, more than 1 element per bucket:
  – Corresponds to orientability problems on random graphs: orient edges so no more than k pointing to each vertex.
  – Offline thresholds known.
  – Online (provable) thresholds weak.

• For more than 2 choices:
  – Harder orientability problems.
  – Online (provable) thresholds weak.
  – Very close lower/upper bounds for offline setting.

New Result

• Dietzfelbinger, Goerdt, Mitzenmacher, Montanari, Pagh have tight bounds on offline thresholds, more than 2 choices, 1 item per bucket.

• Extension to more than 1 item per bucket still open.

• Writeup (hopefully) coming soon…

What Was Known (Example)

• Case of 3 choices.
  – Upper bound on load of 0.9183. [Batu Berenbrink Cooper]
    • Uses differential-equation based analysis of orientability threshold.
  – Lower bound of 0.8894 (offline). [Dietzfelbinger Pagh]
    • Random maximum matching problem.
    • Use random matrices with 3 ones per column to design dictionary schemes. Bound corresponds to full-rank threshold of such matrices.

• Upper bound is tight, using better bound on full-rank threshold.

The Open Question

• Tight thresholds for cuckoo hashing schemes, and corresponding efficient algorithms.

• Offline bounds for more than 2 choices.

• Offline bounds for more than 2 choices and more than 1 item per bucket.

• Online bounds generally.
  – Specific case of d = 3 especially interesting.

Open Question 3: Stashes

• A failure is declared whenever one element can’t be placed.

• Is that really necessary?

• What if we could keep one element unplaced? Or eight? Or O(log n)?

• Goal : Reduce the failure probability.

• Second goal : Reduce moves per insert.

The Open Question

• What is the value of some extra space to stash problematic elements?

Motivation : CAMs

• CAM = content addressable memory
  – Fully associative lookup.
  – Usually expensive, so must be kept small.
  – Hardware solution, or a dedicated cache line in software.

• Not usually considered in theoretical work, but very useful in practice.

• Can we bridge this gap?
  – What can CAMs do for us?

A CAM-Stash

• Use a CAM to stash away elements that would cause failure.
  – ESA 2008, Kirsch, Mitzenmacher, Wieder.

• Intuition: if failures were independent, probability that s elements cause failures goes to O(1/n^s).
  – Failures not independent, but nearly so.
  – A stash holding a constant number of elements greatly reduces failure probability.
  – Implemented as hardware CAM or cache line.

• Lookup requires also looking at stash.
  – But generally empty.

Analysis Method

• Treat cells as vertices, elements as edges in bipartite graph.

• Count components that have excess edges to be placed in stash.

• Random graph analysis to bound excess edges.

6 vertices, 7 edges: 1 edge must go into stash.
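The counting step of this analysis can be sketched directly: a connected component with v vertices can hold at most v edges, so any edges beyond that must go to the stash. The sketch below uses union-find to group buckets into components; the function name `stash_needed` and the edge-list input format are assumptions for illustration.

```python
def stash_needed(edges, num_buckets):
    """edges: list of (bucket_a, bucket_b) pairs, one per element.

    Returns the number of elements that cannot be placed with one element
    per bucket, i.e. the total excess of edges over vertices per component.
    """
    parent = list(range(num_buckets))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    vertices = {}
    for v in range(num_buckets):
        r = find(v)
        vertices[r] = vertices.get(r, 0) + 1

    edge_count = {}
    for a, _ in edges:
        r = find(a)
        edge_count[r] = edge_count.get(r, 0) + 1

    # Acyclic and unicyclic components (edges <= vertices) need no stash.
    return sum(max(0, e - vertices[r]) for r, e in edge_count.items())
```

On a connected graph with 6 vertices and 7 edges this returns 1, matching the figure caption above.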

A Simple Experiment

• 10,000 elements, table of size 24,000, 2 choices per element, 10^7 trials.

Stash Size Needed    Trials
0                    9989861
1                    10040
2                    97
3                    2
4                    0

Generalizations

• Can similarly generalize known results for cuckoo hashing with more than 2 choices, more than 1 element per bucket.

• Stash of size s reduces failure exponent linearly in s.

• Intuition: random graph analysis exposes “bottleneck” in cuckoo hashing. Stashes relieve the bottleneck.

CAM to Improve Insertion Time

• Lots of moves per insert in worst case.
  – Average is constant.
  – But maximum is Ω(log n) with non-trivial (inverse-poly) probability.

• May want bounded number of memory accesses per insert.

• Empirical study by Kirsch/Mitzenmacher.

A CAM-Queue

• Insertion is a sequence of suboperations.
  – Of form “Move x to position Hj(x).”

• Use the CAM as a queue for pending suboperations.

• Perform suboperations from queue as available.
  – Move attempt = 1 lookup/write.
  – A suboperation may cause another suboperation to go on the queue.

• Lookup: check the hash table and the CAM-queue.

• De-amortization
  – Use queue to turn worst-case performance into average-case performance.
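The queueing idea can be sketched in software as follows. This is a simplified illustration under stated assumptions: a `deque` plays the role of the CAM, each insert always starts by moving the new element to its first location, the budget `ops_per_insert` is a hypothetical parameter, and inserting a key that is already present is not handled.

```python
import hashlib
from collections import deque


def two_locations(x, size):
    # Salted hashes stand in for the two hash functions H0 and H1.
    out = []
    for salt in (b"h0", b"h1"):
        d = hashlib.blake2b(repr(x).encode(), digest_size=8, salt=salt).digest()
        out.append(int.from_bytes(d, "big") % size)
    return out


class QueuedCuckoo:
    def __init__(self, size, ops_per_insert=4):
        self.size = size
        self.table = [None] * size
        self.queue = deque()  # pending suboperations: (element, index j of target H_j)
        self.ops_per_insert = ops_per_insert

    def lookup(self, x):
        # An element is either in the table or waiting in the queue.
        in_table = any(self.table[i] == x for i in two_locations(x, self.size))
        return in_table or any(y == x for y, _ in self.queue)

    def step(self):
        """Perform one suboperation: one slot read and one slot write."""
        if not self.queue:
            return False
        y, j = self.queue.popleft()
        pos = two_locations(y, self.size)[j]
        evicted, self.table[pos] = self.table[pos], y
        if evicted is not None:
            locs = two_locations(evicted, self.size)
            other = 1 if locs[0] == pos else 0
            self.queue.append((evicted, other))  # evicted element moves later
        return True

    def insert(self, x):
        self.queue.append((x, 0))
        for _ in range(self.ops_per_insert):  # bounded work per insert
            if not self.step():
                break
```

Every insert touches at most `ops_per_insert` slots, and any unfinished eviction chain simply waits in the queue for later calls, which is the de-amortization described above.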

Queue Sizes

• Need CAM sized to overflow with negligible probability.
  – Maximum queue size much bigger than average.
  – Experiments suggest queue sizes in the low tens are possible, with 4+ suboperations per insert, in practice.

• Recent work by Arbitman, Naor, Segev gives provable bounds for logarithmic-sized queue for 2-choice cuckoo hashing, up to 50% loads.
  – Analysis open for more than 2 choices.

The Open Question

• What is the value of some extra space to stash problematic elements?

• Can these uses of stashes be similarly useful for other data structures?

• Is there a general theory telling us the value of constant/logarithmic/linear sized stashes?

Open Question 4: Randomness

• Analysis always easier when assuming hash functions are perfectly random.

• But perfectly random hash functions are unrealistic.

• What about real hash functions on real data?

The Open Question

• How much randomness is needed for cuckoo hashing to be effective?

Universal Hash Families

• Defined by Carter/Wegman

• Family of hash functions of form H : [N] → [M] is k-wise independent if, when H is chosen randomly, for any distinct x_1, x_2, …, x_k and any a_1, a_2, …, a_k,

$$\Pr\Big[\bigwedge_{i=1}^{k} H(x_i) = a_i\Big] = 1/M^{k}$$

• Family is k-wise universal if

$$\Pr\big[H(x_1) = H(x_2) = \dots = H(x_k)\big] \le 1/M^{k-1}$$
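One standard way to obtain k-wise independence is a random polynomial of degree less than k over a prime field. The sketch below assumes keys are integers smaller than the prime P = 2^61 − 1; the final reduction modulo M makes the output only approximately uniform over [M] (the bias is small when M is much smaller than P). The function name `sample_kwise_hash` is hypothetical.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; keys are assumed to lie in [0, P)


def sample_kwise_hash(k, M, rng):
    """Draw h(x) = (c_0 + c_1 x + ... + c_{k-1} x^{k-1} mod P) mod M."""
    coeffs = [rng.randrange(P) for _ in range(k)]  # c_0 .. c_{k-1}

    def h(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule, highest degree first
            acc = (acc * x + c) % P
        return acc % M

    return h
```

Evaluation costs k multiplications per key, which is one reason O(log n)-wise independence is considered expensive in practice.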

Recent Results

• For 2 choices, O(log n)-wise independence is sufficient; [PR] show hash functions of Siegel suffice.

• Queueing result of [ANS] uses new technology of Braverman to show polylogarithmic-wise independence suffices.

• Cohen and Kane show 5-independence not enough; also show that one O(log n)-wise independent and one pairwise independent hash function suffice.

Another Approach : Random Data

• Previous analysis for worst-case data. What about random data?

• Analysis usually trivial if data is independently, uniformly chosen over large universe.– Then all hashes appear “perfectly random”.

• Not a good model for real data.

• Need intermediate model between worst-case, average case. [Mitzenmacher Vadhan]

A Model for Data

• Based on models of semi-random sources.

• Data is a finite stream, modeled by a sequence of random variables X_1, X_2, …, X_T.

• Range of each variable is [N].

• Each stream element has some entropy, conditioned on values of previous elements.
  – Correlations possible.
  – But each new element has some unpredictability.

Applications

• Potentially, wherever hashing is used
  – Bloom Filters
  – Power of Two Choices
  – Linear Probing
  – Cuckoo Hashing
  – Many Others…

Intuition

• If each element has entropy, then extract the entropy to hash each element to near-uniform location.

• Extractors should provide near-uniform behavior.

Notions of Entropy

• max probability: $\mathrm{mp}(X) = \max_x \Pr[X = x]$
  – min-entropy: $H_\infty(X) = \log(1/\mathrm{mp}(X))$
  – block source with max probability p per block: $\mathrm{mp}(X_i \mid X_1 = x_1, \dots, X_{i-1} = x_{i-1}) \le p$

• collision probability: $\mathrm{cp}(X) = \sum_x \Pr[X = x]^2$
  – Renyi entropy: $H_2(X) = \log(1/\mathrm{cp}(X))$
  – block source with coll probability p per block: $\mathrm{cp}(X_i \mid X_1 = x_1, \dots, X_{i-1} = x_{i-1}) \le p$

• These “entropies” within a factor of 2.

• We use collision probability/Renyi entropy.
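The two notions are easy to compute for a small explicit distribution, which also shows the factor-of-2 relationship: since mp(X)^2 ≤ cp(X) ≤ mp(X), min-entropy is at most Renyi entropy, which is at most twice min-entropy. The helper names and the example distribution below are illustrative assumptions.

```python
import math


def max_prob(dist):
    """mp(X): probability of the most likely outcome."""
    return max(dist.values())


def collision_prob(dist):
    """cp(X): probability that two independent samples of X coincide."""
    return sum(p * p for p in dist.values())


def min_entropy(dist):
    return math.log2(1 / max_prob(dist))


def renyi_entropy(dist):
    return math.log2(1 / collision_prob(dist))


# A small skewed distribution over three outcomes (made up for the example).
example = {"a": 0.5, "b": 0.25, "c": 0.25}
h_inf = min_entropy(example)
h_2 = renyi_entropy(example)
print(h_inf, h_2)  # Renyi entropy lies between h_inf and 2 * h_inf
```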

Leftover Hash Lemma

• A “classical” result (from 1989) [ILL].

• Intuitive statement: If H : [N] → [M] is chosen from a pairwise independent hash family, and X is a random variable with small collision probability, H(X) will be close to uniform.

Leftover Hash Lemma

• Specific statements for current setting.
  – For 2-universal hash families.

• Let $H : [N] \to [M]$ be a random hash function from a 2-universal hash family. If $\mathrm{cp}(X) < 1/K$, then $(H, H(X))$ is $(1/2)\sqrt{M/K}$-close to $(H, U[M])$.
  – Equivalently, if X has Renyi entropy at least $\log M + 2\log(1/\varepsilon)$, then $(H, H(X))$ is $\varepsilon$-close to uniform.

• Let $H : [N] \to [M]$ be a random hash function from a 2-universal hash family. Given a block-source with coll prob 1/K per block, $(H, H(X_1), \dots, H(X_T))$ is $(T/2)\sqrt{M/K}$-close to $(H, U[M]^T)$.
  – Equivalently, if X has Renyi entropy at least $\log M + 2\log(T/\varepsilon)$ per block, then $(H, H(X_1), \dots, H(X_T))$ is $\varepsilon$-close to uniform.
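As a worked check of the "equivalently" statements, the following assumes the single-source distance bound reads (1/2)√(M/K) and the block-source bound reads (T/2)√(M/K), writes ε for the target distance, and takes logarithms base 2. The numbers in the final example are hypothetical.

```latex
\begin{align*}
\tfrac{1}{2}\sqrt{M/K} \le \varepsilon
  &\iff K \ge \frac{M}{4\varepsilon^{2}}
   \iff \log K \ge \log M + 2\log(1/\varepsilon) - 2, \\
\tfrac{T}{2}\sqrt{M/K} \le \varepsilon
  &\iff K \ge \frac{M\,T^{2}}{4\varepsilon^{2}}
   \iff \log K \ge \log M + 2\log(T/\varepsilon) - 2 .
\end{align*}
% Example (hypothetical values): M = 2^{20}, T = 2^{16}, \varepsilon = 2^{-10}
% gives \log M + 2\log(T/\varepsilon) = 20 + 2 \cdot 26 = 72 bits per block.
```

So Renyi entropy log K of at least log M + 2 log(1/ε), or log M + 2 log(T/ε) per block in the stream setting, is sufficient, with two bits to spare in each case.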

Further Improvements

• Additional improvements over Leftover Hash Lemma in paper [MV08].

• Chung and Vadhan [CV08] further improve analysis.

• Dietzfelbinger and Schellbach show you have to be careful: pairwise independence not enough even for random data sets from small universe. [DS09]
  – Not enough entropy when data large compared to universe.

The Open Question

• How much randomness is needed for cuckoo hashing to be effective?

• Tighten bound on independence needed in worst case, and provide efficient hash function families.

• What better results are possible with reasonable assumptions on the data?

Open Question 5: Parallel Architectures

• Multicores, Graphics Processor Units (GPUs), other parallel architectures possibly the next wave.

• Multiple-choice hashing and cuckoo hashing seem naturally parallelizable.

• Theory and practice?

The Open Question

• Design and analyze efficient schemes for constructing and maintaining hash tables in modern parallel architectures.

Related Work

• Plenty on parallel hashing/load balancing schemes.
  – PRAM emulation, related work in the 1990s.

• Technical improvements of last decade suggest more is possible.

• In Amenta et al., we designed new implementation for GPUs based on cuckoo hashing.
  – To appear in SIGGRAPH 09.
  – New theory, practical implementations possible?

The Open Question

• Design and analyze efficient schemes for constructing and maintaining hash tables in modern parallel architectures.

• How can cuckoo hashing be helpful?

• Practical implementations, with strong theoretical backing?

Open Questions

• Tight bounds on insertion times for random walk cuckoo hashing for d > 2 choices.

• Tight bounds on load capacity thresholds for cuckoo hashing for d > 2 choices (and more than one element per bucket).

• Stashes: where to use them, and a general framework for them?

• Randomness: how much is really needed in the worst case? On suitably random data?

• Parallelizable instantiations of cuckoo hashing?

• Real-world applications for cuckoo hashing.

• Your question here…

Thanks

Much thanks to Martin Dietzfelbinger and Rasmus Pagh for comments, suggestions, references.

Thanks to my co-authors for the results.

THANK YOU.
