Summer School on Hashing’14: Locality Sensitive Hashing
Transcript of the Summer School on Hashing’14 lecture on Locality Sensitive Hashing.
Alex Andoni (Microsoft Research)
Nearest Neighbor Search (NNS)
• Preprocess: a set P of points
• Query: given a query point q, report a point p in P with the smallest distance to q
Approximate NNS
• c-approximate r-near neighbor: given a query point q, report a point p s.t. dist(p, q) ≤ cr, provided there exists a point at distance ≤ r from q
• Randomized: such a point is returned with 90% probability
Heuristic for Exact NNS
• r-near neighbor: given a query point q, report a set with
  • all points p s.t. dist(p, q) ≤ r (each with 90% probability)
  • the set may contain some approximate neighbors p s.t. dist(p, q) ≤ cr
• Can filter out bad answers by computing the actual distances
Locality-Sensitive Hashing [Indyk-Motwani’98]
• Random hash function g s.t. for any points p, q, the collision probability Pr[g(p) = g(q)] depends on dist(p, q):
  • Close: when dist(p, q) ≤ r, the collision probability P1 is “high” (“not-so-small”)
  • Far: when dist(p, q) > cr, the collision probability P2 is “small”
• Use several hash tables
• Quality is measured by the exponent ρ = log(1/P1) / log(1/P2)
Locality sensitive hash functions
• Hash function g is usually a concatenation of “primitive” functions: g(p) = ⟨h1(p), h2(p), …, hk(p)⟩
• Example: Hamming space
  • hi(p) = pj, i.e., choose the j-th bit for a random j
  • g chooses k bits at random
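A minimal Python sketch of this bit-sampling construction (function names are illustrative, not from the talk): each primitive hi samples one random coordinate, and g concatenates k of them.

```python
import random

def make_g(dim, k, rng):
    """g(p) = <h1(p), ..., hk(p)>, where each hi(p) = p[j]
    for an independently chosen random coordinate j."""
    coords = [rng.randrange(dim) for _ in range(k)]
    def g(p):  # p is a 0/1 vector of length dim
        return tuple(p[j] for j in coords)
    return g

rng = random.Random(0)
g = make_g(dim=8, k=3, rng=rng)
p = (0, 1, 1, 0, 1, 0, 0, 1)
q = (0, 1, 1, 0, 1, 0, 1, 1)  # Hamming distance 1 from p
# g(p) == g(q) unless a sampled coordinate is the differing one,
# so a fresh g collides on this pair with probability (7/8)**3
```

Note how concatenation drives the far-pair collision probability down exponentially in k, which is exactly the lever used in the parameter analysis below.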
Full algorithm
• Data structure is just L hash tables:
  • Each hash table uses a fresh random function g(p) = ⟨h1(p), …, hk(p)⟩
  • Hash all dataset points into the table
• Query:
  • Check for collisions in each of the L hash tables
  • until we encounter a point within distance cr of q
• Guarantees:
  • Space: O(nL), plus the space to store the points
  • Query time: O(L · (time to hash + time for distance computations)) in expectation
  • 50% probability of success
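The full L-table scheme above can be sketched for Hamming space as follows (a toy illustration; the class and method names are my own, not from the talk):

```python
import random
from collections import defaultdict

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

class LSHIndex:
    """L hash tables; table i uses a fresh g_i made of k random coordinates."""
    def __init__(self, dim, k, L, seed=0):
        rng = random.Random(seed)
        self.gs = [[rng.randrange(dim) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, g, p):
        return tuple(p[j] for j in g)

    def insert(self, p):
        for g, table in zip(self.gs, self.tables):
            table[self._key(g, p)].append(p)

    def query(self, q, cr):
        # scan colliding points table by table until one within cr is found
        for g, table in zip(self.gs, self.tables):
            for p in table.get(self._key(g, q), ()):
                if hamming(p, q) <= cr:
                    return p
        return None
```

A usage example: build the index over a dataset with `insert`, then call `query(q, cr)`; the early return on the first point within cr is what keeps the expected query time bounded.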
Analysis of LSH Scheme
• Choice of parameters k, L?
  • L hash tables, each with a hash function g(p) = ⟨h1(p), …, hk(p)⟩
  • Set k s.t. Pr[collision of far pair] = P2^k = 1/n
  • Then Pr[collision of close pair] = P1^k = (P2^k)^ρ = n^(−ρ)
  • Hence L = n^ρ “repetitions” (tables) suffice!
(Figure: collision probability as a function of distance, dropping from ≈ P1 below r to ≈ P2 beyond cr; concatenation sharpens the curve from P1, P2 at k = 1 to P1², P2² at k = 2, and so on.)
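Numerically, this parameter setting works out as follows (the P1, P2 values are illustrative, not from the talk):

```python
import math

def lsh_params(n, P1, P2):
    """Choose k so that P2**k = 1/n; then close pairs collide with
    probability P1**k = n**(-rho), so L ≈ n**rho tables suffice."""
    k = math.ceil(math.log(n) / math.log(1 / P2))
    rho = math.log(1 / P1) / math.log(1 / P2)
    L = math.ceil(1 / P1 ** k)  # ≈ n**rho
    return k, L, rho

# Example: n = 10**6 points, P1 = 0.9, P2 = 0.5
k, L, rho = lsh_params(10**6, 0.9, 0.5)
# k = 20 (since 2**20 > 10**6), rho ≈ 0.152, L = 9 tables
```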
Analysis: Correctness
• Let p* be an r-near neighbor of q
  • If p* does not exist, the algorithm can output anything
• Algorithm fails when the near neighbor p* is not in the searched buckets
• Probability of failure:
  • Probability q and p* do not collide in one hash table: at most 1 − P1^k
  • Probability they do not collide in any of the L hash tables: at most (1 − P1^k)^L = (1 − n^(−ρ))^(n^ρ) ≤ 1/e
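This final bound can be checked numerically, continuing the illustrative values from the earlier parameter example:

```python
import math

P1, k = 0.9, 20
p_collide = P1 ** k            # close pair collides in one table: P1**k
L = math.ceil(1 / p_collide)   # number of tables, ≈ n**rho
p_fail = (1 - p_collide) ** L  # no collision in any of the L tables
# p_fail <= 1/e ≈ 0.368, i.e., success probability is at least 63%
```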
Analysis: Runtime
• Runtime dominated by:
  • Hash function evaluations: O(L · k) time
  • Distance computations to points in the scanned buckets
• Distance computations:
  • Care only about far points, at distance > cr
  • In one hash table, the probability that a fixed far point collides with q is at most P2^k = 1/n
  • Expected number of far points in q’s bucket: n · 1/n = 1
  • Over L hash tables, the expected number of far points is L
• Total: O(L · (k + d)) in expectation
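Continuing the same illustrative numbers, the far-point accounting behind this expectation bound:

```python
P2, n = 0.5, 10**6
k = 20                    # chosen so that P2**k ≈ 1/n
per_table = n * P2 ** k   # expected far points colliding with q in one table
L = 9
total_far = L * per_table # expected far points scanned over all tables
# per_table ≈ 0.95 <= 1, so distance computations stay O(L) in expectation
```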
NNS for Euclidean space [Datar-Immorlica-Indyk-Mirrokni’04]
• Hash function g is a concatenation of “primitive” functions: g(p) = ⟨h1(p), …, hk(p)⟩
• LSH function h:
  • pick a random line ℓ, project point p onto ℓ, and quantize: h(p) = ⌊(⟨a, p⟩ + b) / w⌋
  • a is a random Gaussian vector
  • b is random in [0, w]
  • w is a parameter (e.g., 4)

Optimal* LSH [A-Indyk’06]
• Regular grid → grid of balls
  • p can hit empty space, so take more such grids until p is in a ball
• Need (too) many grids of balls
  • Start by projecting into a reduced dimension t
• Analysis gives ρ = 1/c² + o(1)
• Choice of the reduced dimension t?
  • Tradeoff between the number of hash tables and the time to hash, t^O(t)
• Total query time: d · n^(1/c² + o(1))
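The project-and-quantize primitive of [DIIM’04] can be sketched as follows (the grid variant, not the ball variant; names are illustrative):

```python
import math
import random

def make_h(dim, w, rng):
    """h(p) = floor((<a, p> + b) / w): project p onto a random Gaussian
    direction a, shift by b uniform in [0, w), quantize into width-w cells."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, w)
    def h(p):
        return math.floor((sum(ai * pi for ai, pi in zip(a, p)) + b) / w)
    return h

rng = random.Random(0)
h = make_h(dim=3, w=4.0, rng=rng)
# nearby points usually land in the same cell; far points rarely do
```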
Proof idea
• Claim: ρ → 1/c², i.e., P(r) ≈ exp(−A·r²)
  • P(r) = probability of collision when ‖p − q‖ = r
• Intuitive proof:
  • The projection approximately preserves distances [JL]
  • P(r) = intersection / union of the two balls around the projections of p and q
  • P(r) ≈ probability that a random point u in one ball lies beyond the dashed line separating the caps
  • Fact (high dimensions): the x-coordinate of u has a nearly Gaussian distribution
  • → P(r) ≈ exp(−A·r²), and hence ρ = log(1/P(1)) / log(1/P(c)) → 1/c²
LSH Zoo
• Hamming distance:
  • h: pick a random coordinate (or several) [IM98]
• Manhattan distance:
  • h: random grid [AI’06]
• Jaccard distance between sets:
  • h: pick a random permutation on the universe (min-wise hashing) [Bro’97, BGMZ’97]
• Angle: sim-hash [Cha’02, …]
(Example: “To be or not to be” and “To Simons or not to Simons” yield the word sets {be, not, or, to} and {not, or, to, Simons} over the universe {be, to, Simons, or, not}; min-wise hashing maps each set to its minimum element under a random permutation of the universe.)
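A sketch of min-wise hashing on the slide’s example sets (the estimation loop is mine; the key fact is that a single min-hash collides with probability equal to the Jaccard similarity |A ∩ B| / |A ∪ B|):

```python
import random

def minhash_collision_rate(universe, A, B, trials, seed=0):
    """Estimate Pr[h(A) == h(B)] where h(S) = min over s in S of pi(s)
    for a uniformly random permutation pi of the universe."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        order = list(universe)
        rng.shuffle(order)                      # random permutation pi
        rank = {x: i for i, x in enumerate(order)}
        if min(rank[x] for x in A) == min(rank[x] for x in B):
            hits += 1
    return hits / trials

universe = ["be", "to", "Simons", "or", "not"]
A = {"be", "not", "or", "to"}
B = {"not", "or", "to", "Simons"}
# Jaccard similarity = |A ∩ B| / |A ∪ B| = 3/5 = 0.6
est = minhash_collision_rate(universe, A, B, trials=20000)
```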
LSH in the wild
• If we want exact NNS, what are k, L?
  • Can choose any parameters k, L (safety not guaranteed)
  • Correct as long as all colliding points are checked
  • Performance: trade-off between the number of tables and false positives (larger k: fewer false positives; smaller k: fewer tables)
  • The right setting will depend on the dataset “quality”
  • Can tune k, L to optimize for a given dataset
Time-Space Trade-offs
• low space, high query time: [Ind’01, Pan’06]
• medium space, medium query time: [IM’98], [DIIM’04, AI’06]
• high space, low query time (1 memory lookup): [KOR’98, IM’98, Pan’06]
• Lower bounds:
  • ω(1) memory lookups [AIP’06]
  • ω(1) memory lookups [PTW’08, PTW’10]
  • the LSH exponent is tight [MNP’06, OWZ’11]
LSH is tight…
leave the rest to cell-probe lower bounds?
Data-dependent Hashing! [A-Indyk-Nguyen-Razenshteyn’14]
• NNS in Hamming space with n^ρ query time and n^(1+ρ) space and preprocessing, for ρ = 7/(8c) + O(1/c^(3/2)) + o(1)
  • optimal LSH: ρ = 1/c [IM’98]
• NNS in Euclidean space with ρ = 7/(8c²) + O(1/c³) + o(1)
  • optimal LSH: ρ = 1/c² [AI’06]
A look at LSH lower bounds
• LSH lower bounds in Hamming space
• Fourier analytic [Motwani-Naor-Panigrahy’06]
• [O’Donnell-Wu-Zhou’11]:
  • fix any distribution over hash functions
  • Far pair: random points at distance εd
  • Close pair: random pair at distance εd/c
  • Get ρ ≥ 1/c
Why not an NNS lower bound?
• Suppose we try to generalize [OWZ’11] to NNS
• Pick a random dataset
• All the “far points” are concentrated in a small ball
• Easy to see at preprocessing: the actual near neighbor is close to the center of the minimum enclosing ball
Intuition
• Data-dependent LSH:
  • hashing that depends on the entire given dataset!
• Two components:
  • a “nice” geometric configuration with a better exponent ρ
  • a reduction from the general case to this “nice” geometric configuration
Nice Configuration: “sparsity”
• All points are on a sphere of radius R
  • Random pairs of points are at distance ≈ R√2
• Lemma 1: spherical LSH achieves a better collision exponent in this configuration
  • “Proof”: obtained via “cap carving”, similar to “ball carving” [KMS’98, AI’06]
• Lemma 1’: the analogous statement for a general radius
Reduction: into spherical LSH
• Idea: apply a few rounds of “regular” LSH first
  • Ball carving [AI’06], with parameters set as if the target approximation were 5c, so this step is cheap at query time
• Intuitively:
  • far points are unlikely to collide
  • this partitions the data into buckets of small diameter
  • find the minimum enclosing ball of each bucket
  • finally, apply spherical LSH on this ball!
Two-level algorithm
• L hash tables, each with:
  • hash function g(p) = ⟨h1(p), …, hl(p), s1(p), …, sm(p)⟩
  • the hi’s are “ball carving LSH” (data independent)
  • the sj’s are “spherical LSH” (data dependent)
• The final exponent ρ is an “average” of the exponents from levels 1 and 2
Details
• Inside a bucket, need to ensure the “sparse” case:
  • 1) drop all “far pairs”
  • 2) find the minimum enclosing ball (MEB)
  • 3) partition by “sparsity” (distance from the center)
1) Far points
• In a level-1 bucket:
  • Set parameters as if looking for a weaker approximation
  • The probability that a pair survives decreases with its distance (evaluated at the near, far, and very far scales)
  • The expected number of collisions among points at the largest distance scale is constant
  • Throw out all pairs that violate this
2) Minimum Enclosing Ball
• Use the Jung theorem:
  • a pointset with diameter Δ has an MEB of radius at most Δ/√2
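Jung’s bound (radius ≤ Δ·√(d/(2(d+1))) in R^d, approaching Δ/√2 in high dimension) can be sanity-checked on the tight 2-D case, an equilateral triangle:

```python
import math

# Equilateral triangle with side length (= diameter) 1 in the plane:
# its minimum enclosing ball is the circumcircle, of radius 1/sqrt(3).
diameter = 1.0
circumradius = 1.0 / math.sqrt(3)

# Jung's theorem in R^d: MEB radius <= diameter * sqrt(d / (2(d+1)))
jung_2d = diameter * math.sqrt(2 / (2 * 3))  # d = 2: exactly 1/sqrt(3)
jung_high_dim = diameter / math.sqrt(2)      # limit as d -> infinity
```

The triangle attains the d = 2 bound exactly, and both bounds sit below the high-dimensional constant Δ/√2 used in the reduction.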
3) Partition by “sparsity”
• Inside a bucket, points are at a bounded distance from the center
• “Sparse” LSH does not work directly in this case
• Need to partition by the distance to the center
• Partition into spherical shells of width 1
Practice of NNS
• Data-dependent partitions…
• Practice:
  • trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees…
  • often no guarantees
• Theory?
  • assuming more about the data, PCA-like algorithms “work” [Abdullah-A-Kannan-Krauthgamer’14]
Finale
• LSH: geometric hashing
• Data-dependent hashing can do better!
  • Better upper bound? Multi-level improves a bit, but not too much
  • What is the best possible ρ for a given c?
• Best partitions in theory and practice?
• LSH for other metrics:
  • edit distance, Earth-Mover Distance, etc.
• Lower bounds (cell probe?)
Open question:
• Practical variant of ball-carving hashing?
• Design a space partition of R^t that is
  • efficient: point location in poly(t) time
  • qualitative: regions are “sphere-like”, i.e.,
    Pr[needle of length 1 is not cut] ≥ (Pr[needle of length c is not cut])^(1/c²)