LSH
Transcript of LSH
![Page 1: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/1.jpg)
Locality Sensitive Hashing with Application to Near Neighbor Reporting
Ken Liu, 2013.12.06
![Page 2: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/2.jpg)
Origination
• Nearest Neighbor Search
  – Given a set P of points in a metric space and an allowable deviation ε > 0
  – For any query q, return a point p ∈ P s.t. d(p,q) < d(p*,q) + ε w.h.p., where p* is the nearest neighbor of q in P
• Near Neighbor Search
  – An additional parameter, a radius r > 0, is given
  – For any query q, return a point p ∈ P s.t. d(p,q) < r + ε w.h.p. if d(p*,q) ≤ r
• Indyk and Motwani, STOC '98
  – First introduced the concept of LSH
  – Nearest Neighbor Search can be reduced to Near Neighbor Search with little overhead
  – Near Neighbor Search can be solved efficiently using LSH
![Page 3: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/3.jpg)
Near Neighbor Reporting by LSH
• Near Neighbor Reporting
  – Given a set P of points in a metric space, a radius r > 0, an allowable deviation ε > 0, a recall rate P1, and a collision error rate P2
    • Recall rate: # of reported near neighbors / # of near neighbors
    • Collision error rate: # of reported distant neighbors / # of distant neighbors
  – For any query q,
    • Each point p ∈ P with d(p,q) ≤ r is reported with probability at least P1
    • Each point p ∈ P with d(p,q) ≥ r + ε is reported with probability at most P2
    • Each point p ∈ P with r < d(p,q) < r + ε is free to be reported or not
  – Thus, for any sequence of queries, the expected average recall rate is at least P1 and the expected collision error rate is at most P2
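The guarantees above can be grounded against the exact baseline that LSH approximates: report exactly the points within distance r, which achieves recall rate 1 and collision error rate 0, at the cost of scanning all of P. A minimal sketch (helper names are illustrative, not from the deck):

```python
def report_near_neighbors(P, q, r, dist):
    """Exact near neighbor reporting: every point within r of q, nothing else."""
    return [p for p in P if dist(p, q) <= r]

def euclid(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

points = [(0, 0), (1, 0), (3, 4)]
print(report_near_neighbors(points, (0, 0), r=2, dist=euclid))  # [(0, 0), (1, 0)]
```

LSH trades this exactness for sublinear query time: points in the gray zone between r and r + ε may or may not be reported.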
![Page 4: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/4.jpg)
Applications
• The same technique can be applied to Near Neighbor Search, Quasi-Biclique Mining, Near-Duplicate Document Detection, Clustering, …
![Page 5: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/5.jpg)
LSH
• An LSH family H is a set of hash functions over a metric space
• H is (r, r+ε, P1, P2)-sensitive if
  – P1 > P2
  – If d(p,q) ≤ r, then Pr_{h∈H}(h(p) = h(q)) ≥ P1
  – If d(p,q) ≥ r+ε, then Pr_{h∈H}(h(p) = h(q)) ≤ P2
• P1: recall rate
• P2: collision error rate
![Page 6: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/6.jpg)
LSH for angular distance
• Distance function: d(u, v) = arccos(⟨u, v⟩) = θ(u, v)
• Random Projection:
  – Choose a random unit vector w
  – h(u) = sgn(u · w)
  – Pr(h(u) = h(v)) = 1 − θ(u, v)/π
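The random-projection collision probability can be checked empirically. A minimal sketch with NumPy (function name is illustrative); the sign of u · w is unchanged by normalizing w, so a Gaussian direction suffices:

```python
import numpy as np

rng = np.random.default_rng(0)

def angular_collision_rate(u, v, n_hashes, rng):
    """Empirical Pr(h(u) = h(v)) for h(x) = sgn(x . w), w a random direction."""
    W = rng.normal(size=(n_hashes, len(u)))  # each row is one random projection
    return np.mean(np.sign(W @ u) == np.sign(W @ v))

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0]) / np.sqrt(2)        # angle pi/4 between u and v
theta = np.arccos(np.dot(u, v))
print(angular_collision_rate(u, v, 100_000, rng), 1 - theta / np.pi)
```

The empirical rate should land near 1 − (π/4)/π = 0.75.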
![Page 7: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/7.jpg)
LSH for Jaccard distance
• Distance function: d(A, B) = 1 − |A ∩ B| / |A ∪ B|
• MinHash:
  – Pick a random permutation π on the universe U
  – h(A) = argmin_{a∈A} π(a)
  – Pr(h(A) = h(B)) = |A ∩ B| / |A ∪ B| = 1 − d(A, B)
• Note: finding the Jaccard median
  – Very easy to understand, very hard to compute
  – Studied since 1981
  – Chierichetti, Kumar, Pandey, Vassilvitskii, SODA 2010: NP-hard, but admits a PTAS; time complexity is still very high
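The MinHash collision property can be checked on a small integer universe; a minimal sketch (names are illustrative), drawing one random permutation per trial:

```python
import random

def jaccard_distance(A, B):
    return 1 - len(A & B) / len(A | B)

def minhash(universe, seed):
    """One MinHash function: h(A) = the element of A that comes first
    under a seeded random permutation of the universe."""
    perm = list(universe)
    random.Random(seed).shuffle(perm)
    rank = {x: i for i, x in enumerate(perm)}
    return lambda A: min(A, key=rank.__getitem__)

universe = range(100)
A = set(range(0, 60))
B = set(range(30, 90))
# |A ∩ B| / |A ∪ B| = 30 / 90 = 1/3, so d(A, B) = 2/3
trials = 10_000
collisions = sum(minhash(universe, s)(A) == minhash(universe, s)(B)
                 for s in range(trials))
print(collisions / trials, 1 - jaccard_distance(A, B))
```

The empirical collision rate should land near 1/3, matching 1 − d(A, B).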
![Page 8: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/8.jpg)
What if the boss is not satisfied…
[Figure: three collision-probability vs. Jaccard-distance curves. What you have: P1 = 0.8 at distance 0.2 and P2 = 0.6 at distance 0.4. What the boss wants: P1 = 0.9 and P2 = 0.1. What customers heard: P1 = 1. Labels: gap amplification / sales promotion.]
![Page 9: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/9.jpg)
Gap Amplification
• Construct a new LSH H′ from the old LSH H
  – B = {b = (h1, h2, …, hk): hi ∈ H}
    • b(p) = b(q) iff AND_{i=1,…,k} hi(p) = hi(q)
  – H′ = {h′ = (b1, b2, …, bL): bi ∈ B}
    • h′(p) = h′(q) iff OR_{i=1,…,L} bi(p) = bi(q)
• Intuition:
  – AND increases the gap
    • Collision probabilities of distant points decrease more than those of near points
  – OR increases the collision probability
• Let P = Pr_{h∈H}(h(p) = h(q))
  – Pr_{b∈B}(b(p) = b(q)) = P^k
  – Pr_{b∈B}(b(p) ≠ b(q)) = 1 − P^k
  – Pr_{h′∈H′}(h′(p) ≠ h′(q)) = (1 − P^k)^L
  – Pr_{h′∈H′}(h′(p) = h′(q)) = 1 − (1 − P^k)^L
[Figure: L buckets of k hashes each.]
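The amplified collision probability is easy to evaluate numerically; a minimal sketch (function name is illustrative), using the deck's example rates P1 = 0.8 and P2 = 0.6:

```python
def amplify(P, k, L):
    """Collision probability after AND over k hashes and OR over L bands:
    Pr(h'(p) = h'(q)) = 1 - (1 - P**k)**L."""
    return 1 - (1 - P ** k) ** L

# AND-ing k hashes shrinks distant-pair collisions much faster than
# near-pair collisions; OR-ing L bands then lifts the near-pair rate back up.
P1, P2 = 0.8, 0.6
print(round(amplify(P1, 5, 6), 3))   # near pairs: collision rate rises to ~0.908
print(round(amplify(P2, 5, 6), 3))   # distant pairs: collision rate falls to ~0.385
```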
![Page 10: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/10.jpg)
S-curve
[Figure: the curves 1 − (1 − P^k)^L over P, with (P1, P′1) and (P2, P′2) marked, for k = 3 and k = 4.]
• 1 − (1 − P^k)^L is an s-curve
• Consider the s-curves passing through (P1, P′1)
• The larger k, the steeper the s-curve
![Page 11: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/11.jpg)
How to choose k and L
• Suppose that we have P1 = 0.8 and P2 = 0.6
• Target P′1 = 0.9 and P′2 = 0.1
• Let P′1 = 1 − (1 − P1^k)^L
  – L = ⌈log(1 − P′1) / log(1 − P1^k)⌉
• Keep increasing k until 1 − (1 − P2^k)^L ≤ P′2, where L = ⌈log(1 − P′1) / log(1 − P1^k)⌉
• For example
  – k = 1 ⇒ L ≈ 2, and 1 − (1 − P2^1)^2 ≈ 0.8
  – k = 5 ⇒ L ≈ 6, and 1 − (1 − P2^5)^6 ≈ 0.4
  – k = 10 ⇒ L ≈ 20, and 1 − (1 − P2^10)^20 ≈ 0.1
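The search above can be sketched directly (function name is illustrative). Note that with exact ceilings the loop runs one step past the rounded example above: k = 10, L ≈ 20 leaves the collision error at about 0.11, just over the 0.1 target, so the exact search stops at k = 11:

```python
import math

def choose_k_L(P1, P2, target_P1, target_P2, max_k=50):
    """Increase k until the amplified collision error meets the target.
    For each k, L is the smallest integer with 1-(1-P1**k)**L >= target_P1."""
    for k in range(1, max_k + 1):
        L = math.ceil(math.log(1 - target_P1) / math.log(1 - P1 ** k))
        if 1 - (1 - P2 ** k) ** L <= target_P2:
            return k, L
    raise ValueError("no feasible (k, L) found")

print(choose_k_L(0.8, 0.6, 0.9, 0.1))  # (11, 26)
```

Larger k always works eventually because P1^k / P2^k grows without bound, but k and L also multiply the storage and hashing cost, so the smallest feasible k is preferred.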
![Page 12: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/12.jpg)
Increasing rate of bucket number
[Derivation lost in transcription: how the number of buckets L grows as k increases.]
![Page 13: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/13.jpg)
Decreasing rate of collision error
[Derivation lost in transcription: equations (1) and (2), combined to bound the collision error.]
![Page 14: LSH](https://reader031.fdocuments.net/reader031/viewer/2022020723/549e7431b4795988208b4761/html5/thumbnails/14.jpg)
Reference
1. P. Indyk and R. Motwani, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality," STOC 1998.
2. T-61_5060_locality sensitive_hashing.pdf (course lecture slides).
3. http://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf