LSH
Transcript of LSH
![Page 1: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/1.jpg)
Locality Sensitive Hashing with Application to Near Neighbor Reporting
Ken Liu, 2013.12.06
![Page 2: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/2.jpg)
Origination
• Nearest Neighbor Search
  – Given a set P of points in a metric space and an allowable deviation ε > 0
  – For any query q, return a point p ∈ P s.t. d(p,q) < d(p*,q) + ε w.h.p., where p* is the nearest neighbor of q in P
• Near Neighbor Search
  – An additional parameter, a radius r > 0, is given
  – For any query q, return a point p ∈ P s.t. d(p,q) < r + ε w.h.p. if d(p*,q) ≤ r
• Indyk and Motwani, STOC '98
  – First introduced the concept of LSH
  – Nearest Neighbor Search can be reduced to Near Neighbor Search with little overhead
  – Near Neighbor Search can be solved efficiently using LSH
![Page 3: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/3.jpg)
Near Neighbor Reporting by LSH
• Near Neighbor Reporting
  – Given a set P of points in a metric space, a radius r > 0, an allowable deviation ε > 0, a recall rate P1, and a collision error rate P2
    • Recall rate: # of reported near neighbors / # of near neighbors
    • Collision error rate: # of reported distant neighbors / # of distant neighbors
  – For any query q,
    • Each point p ∈ P with d(p,q) ≤ r is reported with probability at least P1
    • Each point p ∈ P with d(p,q) ≥ r + ε is reported with probability at most P2
    • Each point p ∈ P with r < d(p,q) < r + ε is free to be reported or not
  – Thus, for any sequence of queries, the expected average recall rate is at least P1 and the expected collision error rate is at most P2
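The guarantees above can be grounded against the exact baseline that LSH approximates: report exactly the points within distance r, which achieves recall rate 1 and collision error rate 0, at the cost of scanning all of P. A minimal sketch (helper names are illustrative, not from the deck):

```python
def report_near_neighbors(P, q, r, dist):
    """Exact near neighbor reporting: every point within r of q, nothing else."""
    return [p for p in P if dist(p, q) <= r]

def euclid(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

points = [(0, 0), (1, 0), (3, 4)]
print(report_near_neighbors(points, (0, 0), r=2, dist=euclid))  # [(0, 0), (1, 0)]
```

LSH trades this exactness for sublinear query time: points in the gray zone between r and r + ε may or may not be reported.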
![Page 4: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/4.jpg)
Applications
• The same technique can be applied to Near Neighbor Search, Quasi-Biclique Mining, Near-Duplicate Document Detection, Clustering, …
![Page 5: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/5.jpg)
LSH
• An LSH family H is a set of hash functions over a metric space
• H is (r, r+ε, P1, P2)-sensitive if
  – P1 > P2
  – If d(p,q) ≤ r, then Pr_{h∈H}(h(p) = h(q)) ≥ P1
  – If d(p,q) ≥ r+ε, then Pr_{h∈H}(h(p) = h(q)) ≤ P2
• P1: recall rate
• P2: collision error rate
![Page 6: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/6.jpg)
LSH for angular distance
• Distance function: d(u, v) = arccos(⟨u, v⟩) = θ(u, v)
• Random Projection:
  – Choose a random unit vector w
  – h(u) = sgn(u · w)
  – Pr(h(u) = h(v)) = 1 − θ(u, v)/π
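The random-projection collision probability can be checked empirically. A minimal sketch with NumPy (function name is illustrative); the sign of u · w is unchanged by normalizing w, so a Gaussian direction suffices:

```python
import numpy as np

rng = np.random.default_rng(0)

def angular_collision_rate(u, v, n_hashes, rng):
    """Empirical Pr(h(u) = h(v)) for h(x) = sgn(x . w), w a random direction."""
    W = rng.normal(size=(n_hashes, len(u)))  # each row is one random projection
    return np.mean(np.sign(W @ u) == np.sign(W @ v))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0]) / np.sqrt(2)        # angle pi/4 between u and v
theta = np.arccos(np.dot(u, v))
print(angular_collision_rate(u, v, 100_000, rng), 1 - theta / np.pi)
```

The empirical rate should land near 1 − (π/4)/π = 0.75.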
![Page 7: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/7.jpg)
LSH for Jaccard distance
• Distance function: d(A, B) = 1 − |A ∩ B| / |A ∪ B|
• MinHash:
  – Pick a random permutation π on the universe U
  – h(A) = argmin_{a∈A} π(a)
  – Pr(h(A) = h(B)) = |A ∩ B| / |A ∪ B| = 1 − d(A, B)
• Note: finding the Jaccard median
  – Very easy to understand, very hard to compute
  – Studied since 1981
  – Chierichetti, Kumar, Pandey, Vassilvitskii, SODA 2010: NP-hard, but admits a PTAS; time complexity is still very high
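The MinHash collision property can be checked on a small integer universe; a minimal sketch (names are illustrative), drawing one random permutation per trial:

```python
import random

def jaccard_distance(A, B):
    return 1 - len(A & B) / len(A | B)

def minhash(universe, seed):
    """One MinHash function: h(A) = the element of A that comes first
    under a seeded random permutation of the universe."""
    perm = list(universe)
    random.Random(seed).shuffle(perm)
    rank = {x: i for i, x in enumerate(perm)}
    return lambda A: min(A, key=rank.__getitem__)

universe = range(100)
A = set(range(0, 60))
B = set(range(30, 90))
# |A ∩ B| / |A ∪ B| = 30 / 90 = 1/3, so d(A, B) = 2/3
trials = 10_000
collisions = sum(minhash(universe, s)(A) == minhash(universe, s)(B)
                 for s in range(trials))
print(collisions / trials, 1 - jaccard_distance(A, B))
```

The empirical collision rate should land near 1/3, matching 1 − d(A, B).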
![Page 8: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/8.jpg)
What if the boss is not satisfied…
[Figure: three collision-probability vs. Jaccard-distance curves. What you have: P1 = 0.8 at distance 0.2 and P2 = 0.6 at distance 0.4. What the boss wants: P1 = 0.9 and P2 = 0.1. What customers heard: P1 = 1. Labels: gap amplification / sales promotion.]
![Page 9: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/9.jpg)
Gap Amplification
• Construct a new LSH H′ from the old LSH H
  – B = {b = (h1, h2, …, hk): hi ∈ H}
    • b(p) = b(q) iff AND_{i=1,…,k} hi(p) = hi(q)
  – H′ = {h′ = (b1, b2, …, bL): bi ∈ B}
    • h′(p) = h′(q) iff OR_{i=1,…,L} bi(p) = bi(q)
• Intuition:
  – AND increases the gap
    • Collision probabilities of distant points decrease more than those of near points
  – OR increases the collision probability
• Let P = Pr_{h∈H}(h(p) = h(q))
  – Pr_{b∈B}(b(p) = b(q)) = P^k
  – Pr_{b∈B}(b(p) ≠ b(q)) = 1 − P^k
  – Pr_{h′∈H′}(h′(p) ≠ h′(q)) = (1 − P^k)^L
  – Pr_{h′∈H′}(h′(p) = h′(q)) = 1 − (1 − P^k)^L
[Figure: L buckets of k hashes each.]
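The amplified collision probability is easy to evaluate numerically; a minimal sketch (function name is illustrative), using the deck's example rates P1 = 0.8 and P2 = 0.6:

```python
def amplify(P, k, L):
    """Collision probability after AND over k hashes and OR over L bands:
    Pr(h'(p) = h'(q)) = 1 - (1 - P**k)**L."""
    return 1 - (1 - P ** k) ** L

# AND-ing k hashes shrinks distant-pair collisions much faster than
# near-pair collisions; OR-ing L bands then lifts the near-pair rate back up.
P1, P2 = 0.8, 0.6
print(round(amplify(P1, 5, 6), 3))   # near pairs: collision rate rises to ~0.908
print(round(amplify(P2, 5, 6), 3))   # distant pairs: collision rate falls to ~0.385
```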
![Page 10: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/10.jpg)
S-curve
[Figure: the curves 1 − (1 − P^k)^L over P, with (P1, P′1) and (P2, P′2) marked, for k = 3 and k = 4.]
• 1 − (1 − P^k)^L is an s-curve
• Consider the s-curves passing through (P1, P′1)
• The larger k, the steeper the s-curve
![Page 11: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/11.jpg)
How to choose k and L
• Suppose that we have P1 = 0.8 and P2 = 0.6
• Target P′1 = 0.9 and P′2 = 0.1
• Let P′1 = 1 − (1 − P1^k)^L
  – L = ⌈log(1 − P′1) / log(1 − P1^k)⌉
• Keep increasing k until 1 − (1 − P2^k)^L ≤ P′2, where L = ⌈log(1 − P′1) / log(1 − P1^k)⌉
• For example
  – k = 1 ⇒ L ≈ 2, and 1 − (1 − P2^1)^2 ≈ 0.8
  – k = 5 ⇒ L ≈ 6, and 1 − (1 − P2^5)^6 ≈ 0.4
  – k = 10 ⇒ L ≈ 20, and 1 − (1 − P2^10)^20 ≈ 0.1
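The search above can be sketched directly (function name is illustrative). Note that with exact ceilings the loop runs one step past the rounded example above: k = 10, L ≈ 20 leaves the collision error at about 0.11, just over the 0.1 target, so the exact search stops at k = 11:

```python
import math

def choose_k_L(P1, P2, target_P1, target_P2, max_k=50):
    """Increase k until the amplified collision error meets the target.
    For each k, L is the smallest integer with 1-(1-P1**k)**L >= target_P1."""
    for k in range(1, max_k + 1):
        L = math.ceil(math.log(1 - target_P1) / math.log(1 - P1 ** k))
        if 1 - (1 - P2 ** k) ** L <= target_P2:
            return k, L
    raise ValueError("no feasible (k, L) found")

print(choose_k_L(0.8, 0.6, 0.9, 0.1))  # (11, 26)
```

Larger k always works eventually because P1^k / P2^k grows without bound, but k and L also multiply the storage and hashing cost, so the smallest feasible k is preferred.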
![Page 12: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/12.jpg)
Increasing rate of bucket number
[Derivation lost in transcription: how the number of buckets L grows as k increases.]
![Page 13: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/13.jpg)
Decreasing rate of collision error
[Derivation lost in transcription: equations (1) and (2), combined to bound the collision error.]
![Page 14: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/14.jpg)
Reference
1. P. Indyk and R. Motwani, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality," STOC 1998.
2. T-61_5060_locality sensitive_hashing.pdf (course lecture slides).
3. http://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf