LSH

Locality Sensitive Hashing with Application to Near Neighbor Reporting Ken Liu 2013.12.06

Description: Introduction to Locality Sensitive Hashing

Transcript of LSH

Page 1: LSH

Locality Sensitive Hashing with Application to Near Neighbor Reporting

Ken Liu, 2013.12.06

Page 2: LSH

Origination

• Nearest Neighbor Search
  – Given a set P of points in a metric space and an allowable deviation ε > 0
  – For any query q, returns a point p ∈ P s.t. d(p,q) < d(p*,q) + ε W.H.P., where p* is the nearest neighbor of q in P
• Near Neighbor Search
  – An additional parameter, a radius r > 0, is given
  – For any query q, returns a point p ∈ P s.t. d(p,q) < r + ε W.H.P. if d(p*,q) ≤ r
• Indyk and Motwani, STOC'98
  – First introduced the concept of LSH
  – Nearest Neighbor Search can be reduced to Near Neighbor Search with little overhead
  – Near Neighbor Search can be solved efficiently using LSH

Page 3: LSH

Near Neighbor Reporting by LSH

• Near Neighbor Reporting
  – Given a set P of points in a metric space, a radius r > 0, an allowable deviation ε > 0, a recall rate P1, and a collision error rate P2
    • Recall rate: # of reported near neighbors / # of near neighbors
    • Collision error rate: # of reported distant neighbors / # of distant neighbors
  – For any query q,
    • Each point p ∈ P with d(p,q) ≤ r is reported with probability at least P1
    • Each point p ∈ P with d(p,q) ≥ r + ε is reported with probability at most P2
    • Each point p ∈ P with r + ε > d(p,q) > r is free to be reported or not
  – Thus, for any sequence of queries, the expected average recall rate is at least P1 and the expected collision error rate is at most P2

Page 4: LSH

Applications

• The same technique can be applied to Near Neighbor Search, Quasi-Biclique Mining, Near-Duplicate Document Detection, Clustering, …

Page 5: LSH

LSH

• An LSH family H is a set of hash functions over a metric space
• H is (r, r+ε, P1, P2)-sensitive if
  – P1 > P2
  – If d(p,q) ≤ r, then Pr_{h∈H}(h(p) = h(q)) ≥ P1
  – If d(p,q) ≥ r+ε, then Pr_{h∈H}(h(p) = h(q)) ≤ P2
• P1: recall rate
• P2: collision error rate (see the sketch below)
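To make the definition concrete, here is a minimal sketch (not from the slides) of near neighbor reporting with a single hash function h drawn from an (r, r+ε, P1, P2)-sensitive family; the names build_index, report, and points are illustrative assumptions.

```python
from collections import defaultdict

# Minimal sketch, assuming `h` is one hash function drawn from an
# (r, r+eps, P1, P2)-sensitive family; all names here are illustrative.
def build_index(points, h):
    buckets = defaultdict(list)
    for p in points:
        buckets[h(p)].append(p)   # points that collide under h share a bucket
    return buckets

def report(buckets, h, q):
    # A near neighbor (d(p,q) <= r) lands in q's bucket with prob. >= P1;
    # a distant one (d(p,q) >= r + eps) with prob. <= P2.
    return buckets.get(h(q), [])
```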

Page 6: LSH

LSH for angular distance

• Distance function: d(u, v) = arccos(⟨u, v⟩ / (‖u‖‖v‖)) = θ(u, v)
• Random Projection (see the sketch below):
  – Choose a random unit vector w
  – h(u) = sgn(u · w)
  – Pr(h(u) = h(v)) = 1 - θ(u, v)/π
(Figure: two vectors u and v separated by the angle θ)
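A minimal sketch of the random projection hash described above; the class name RandomProjectionHash and the choice of a Gaussian random direction are assumptions not stated on the slide.

```python
import numpy as np

class RandomProjectionHash:
    """One hash function h(u) = sgn(u . w) for a random unit vector w."""
    def __init__(self, dim, seed=None):
        rng = np.random.default_rng(seed)
        w = rng.normal(size=dim)           # random direction
        self.w = w / np.linalg.norm(w)     # normalize to a unit vector

    def __call__(self, u):
        # Two vectors collide when they fall on the same side of the random
        # hyperplane, which happens with probability 1 - theta(u, v)/pi.
        return 1 if np.dot(u, self.w) >= 0 else -1
```

Averaging collisions over many independently drawn w estimates 1 - θ(u, v)/π, and hence the angle itself.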

Page 7: LSH

LSH for Jaccard distance

• Distance function: d(A, B) = 1 - |A ∩ B| / |A ∪ B|
• MinHash (see the sketch below):
  – Pick a random permutation π on the universe U
  – h(A) = argmin_{a∈A} π(a)
  – Pr(h(A) = h(B)) = |A ∩ B| / |A ∪ B| = 1 - d(A, B)
• Note:
  – Finding the Jaccard median
    • Very easy to understand, very hard to compute
    • Studied since 1981
    • Chierichetti, Kumar, Pandey, Vassilvitskii, SODA 2010
      – NP-hard
      – PTAS
      – Time complexity is still very high

(Figure: two overlapping sets A and B)
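Below is a minimal MinHash sketch matching the bullets above; representing the random permutation π by a shuffled ranking of the universe U is an implementation assumption, as are the names minhash_family and rank.

```python
import random

def minhash_family(universe, seed=None):
    """Return one MinHash function h(A) = argmin_{a in A} pi(a)."""
    rng = random.Random(seed)
    order = list(universe)
    rng.shuffle(order)                  # a uniformly random permutation pi of U
    rank = {item: i for i, item in enumerate(order)}
    return lambda A: min(A, key=lambda a: rank[a])

# Over the random choice of pi, Pr(h(A) = h(B)) = |A ∩ B| / |A ∪ B| = 1 - d(A, B).
```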

Page 8: LSH

What if the boss is not satisfied…

(Figure: three plots of collision probability vs. Jaccard distance)
• You have: collision probability P1 = 0.8 at distance 0.2 and P2 = 0.6 at distance 0.4
• Gap amplification gets you what the boss wants: P1 = 0.9 at distance 0.2 and P2 = 0.1 at distance 0.4
• Sales promotion gets you what the customers heard: P1 = 1 at distance 0.2

Page 9: LSH

Gap Amplification

• Construct a new LSH H' from the old LSH H (see the sketch below)
  – B = {b = (h1, h2, …, hk): hi ∈ H}
    • b(p) = b(q) iff AND_{i=1,…,k} hi(p) = hi(q)
  – H' = {h' = (b1, b2, …, bL): bi ∈ B}
    • h'(p) = h'(q) iff OR_{i=1,…,L} bi(p) = bi(q)
• Intuition:
  – AND increases the gap
    • Collision probabilities of distant points decrease more than those of near points
  – OR increases the collision probability
• Let P = Pr_{h∈H}(h(p) = h(q))
  – Pr_{b∈B}(b(p) = b(q)) = P^k
  – Pr_{b∈B}(b(p) ≠ b(q)) = 1 - P^k
  – Pr_{h'∈H'}(h'(p) ≠ h'(q)) = (1 - P^k)^L
  – Pr_{h'∈H'}(h'(p) = h'(q)) = 1 - (1 - P^k)^L
(Figure: L buckets, each composed of k hashes, combined with OR)
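A minimal sketch of the AND/OR construction on this slide; it assumes a callable family() that returns a fresh hash function drawn from the base LSH H, and the name make_amplified_hash is illustrative.

```python
def make_amplified_hash(family, k, L):
    """Build h' = (b1, ..., bL), each band bi = (h1, ..., hk) with hi drawn from H."""
    bands = [[family() for _ in range(k)] for _ in range(L)]

    def collides(p, q):
        # h'(p) = h'(q) iff OR over the L bands of (AND over the k hashes in the band)
        return any(all(h(p) == h(q) for h in band) for band in bands)

    return collides
```

With P = Pr(h(p) = h(q)), each band agrees with probability P^k, so collides(p, q) holds with probability 1 - (1 - P^k)^L, exactly the amplified curve on the slide.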

Page 10: LSH

S-curve

(Figure: plot of 1 - (1 - P^k)^L vs. P, with (P1, P'1) and (P2, P'2) marked and curves for k = 3 and k = 4)

• 1 - (1 - P^k)^L is an s-curve (see the sketch below)
• Consider the s-curves passing through (P1, P'1)
• The larger the k, the steeper the s-curve
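A quick numerical illustration (not from the slide) of how the curve 1 - (1 - P^k)^L steepens as k grows; the particular (k, L) pairs below are arbitrary examples.

```python
# Evaluate the s-curve f(P) = 1 - (1 - P**k)**L at a few collision probabilities.
for k, L in [(3, 10), (4, 20)]:
    values = {P: round(1 - (1 - P**k) ** L, 3) for P in (0.2, 0.4, 0.6, 0.8)}
    print(f"k={k}, L={L}: {values}")
```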

Page 11: LSH

How to choose k and L

• Suppose that we have P1 = 0.8 and P2 = 0.6

• Target P’1 = 0.9 and P’2 = 0.1

• Let P'1 = 1 - (1 - P1^k)^L
  – L = log(1 - P'1) / log(1 - P1^k)
• Keep increasing k until 1 - (1 - P2^k)^L ≤ P'2, where L = ⌈log(1 - P'1) / log(1 - P1^k)⌉ (see the sketch below)
• For example
  – k = 1: L = ⌈log(1 - P'1) / log(1 - P1^1)⌉ = 2, and 1 - (1 - P2^1)^2 ≈ 0.8
  – k = 5: L = ⌈log(1 - P'1) / log(1 - P1^5)⌉ = 6, and 1 - (1 - P2^5)^6 ≈ 0.4
  – k = 10: L = ⌈log(1 - P'1) / log(1 - P1^10)⌉ ≈ 20, and 1 - (1 - P2^10)^20 ≈ 0.1
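The slide's search for k and L can be written as a short loop; this is a sketch under the stated inputs, and because L is rounded up with a ceiling, the exact stopping point may differ slightly from the rounded numbers on the slide.

```python
import math

P1, P2 = 0.8, 0.6                 # base recall and collision error rates
P1_target, P2_target = 0.9, 0.1   # desired rates after amplification

k = 1
while True:
    # Smallest L with 1 - (1 - P1**k)**L >= P1_target
    L = math.ceil(math.log(1 - P1_target) / math.log(1 - P1**k))
    amplified_P2 = 1 - (1 - P2**k) ** L
    print(f"k={k:2d}  L={L:3d}  amplified P2 = {amplified_P2:.3f}")
    if amplified_P2 <= P2_target:
        break
    k += 1
```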

Page 12: LSH

Increasing rate of bucket number

• Let

Page 13: LSH

Decreasing rate of collision error

• … (1)
• … (2)
• By (1) and (2), we have …