A Gentle Introduction to Locality Sensitive Hashing with Apache Spark
francois-garillot
A GENTLE INTRODUCTION TO APACHE SPARK AND LOCALITY-SENSITIVE HASHING
LOCALITY-SENSITIVE HASHING
▸ A story: why LSH
▸ How it works & hash families
▸ LSH distribution
▸ Beware: WIP
SPARK TENETS
▸ broadcast variables
▸ per-partition commands
▸ shuffle sparsely
SEGMENTATION
▸ small sample: 289421 users
▸ larger sample: 5684403 users
46K websites, ultimately users
4 personal laptops, 4 provided laptops
K-MEANS COMPLEXITY
Find $k$ with the 'elbow method' on the within-cluster sum of squares. Then
EM - GAUSSIAN MIXTURE
With dimensions, mixtures,
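LOCALITY-SENSITIVE HASHING FUNCTIONS
A family $H$ of hashing functions is $(d_1, d_2, p_1, p_2)$-sensitive if, for $h$ drawn at random from $H$:
▸ if $d(x, y) \le d_1$ then $P[h(x) = h(y)] \ge p_1$
▸ if $d(x, y) \ge d_2$ then $P[h(x) = h(y)] \le p_2$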
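DISTANCES! (THOSE AND MANY OTHERS)
▸ Hamming distance: $h_i(x) = x_i$, where $i$ is a randomly chosen index
▸ Jaccard: $h_\pi(S) = \min_{x \in S} \pi(x)$, where $\pi$ is a random permutation (MinHash)
▸ Cosine distance: $h_r(x) = \operatorname{sign}(r \cdot x)$, where $r$ is a random vector (random hyperplane)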
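The three families named on this slide can be sketched in plain Scala as follows. This is a minimal illustration, not the talk's implementation: the object and function names are hypothetical, and the MinHash variant uses a random affine map modulo a prime as a stand-in for a true random permutation.

```scala
import scala.util.Random

object HashFamilies {
  // Hamming distance: bit sampling, h_i(x) = x(i) for a randomly chosen index i.
  def bitSampling(dim: Int, rng: Random): Array[Boolean] => Boolean = {
    val i = rng.nextInt(dim)
    x => x(i)
  }

  // Jaccard: MinHash. Assumption of this sketch: the random permutation is
  // approximated by x => (a * x + b) mod prime. The input set must be non-empty.
  def minHash(rng: Random): Set[Int] => Long = {
    val prime = 2147483647L // 2^31 - 1
    val a = 1 + rng.nextInt(Int.MaxValue - 1).toLong
    val b = rng.nextInt(Int.MaxValue).toLong
    s => s.iterator.map(x => (a * (x.toLong & 0x7fffffffL) + b) % prime).min
  }

  // Cosine distance: random hyperplane, h_r(x) = sign(r . x) with Gaussian r.
  def randomHyperplane(dim: Int, rng: Random): Array[Double] => Boolean = {
    val r = Array.fill(dim)(rng.nextGaussian())
    x => x.zip(r).map { case (xi, ri) => xi * ri }.sum >= 0.0
  }
}
```

Each function draws its randomness once and returns a closure, so one member of the family is fixed and then applied to every record.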
EARTH MOVER'S DISTANCE
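Find the optimal flow $F = (f_{ij})$ minimizing: $\sum_{i,j} f_{ij} d_{ij}$
Then: $\mathrm{EMD}(P, Q) = \frac{\sum_{i,j} f_{ij} d_{ij}}{\sum_{i,j} f_{ij}}$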
A WORD ON MODULARITY
LSH for EMD introduced by Charikar in the Simhash paper (2002).
Yet there is no place to plug in your own LSH family in existing implementations (e.g. scikit, mrsqueeze)!
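LSH AMPLIFICATION: CONCATENATIONS AND PARALLEL
▸ basic LSH: $H$ is $(d_1, d_2, p_1, p_2)$-sensitive
▸ AND (series) construction of $r$ hashes: $(d_1, d_2, p_1^r, p_2^r)$-sensitive
▸ OR (parallel) construction of $b$ hashes: $(d_1, d_2, 1 - (1 - p_1)^b, 1 - (1 - p_2)^b)$-sensitive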
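The effect of the two constructions on a collision probability can be checked numerically. The sketch below assumes independent hash functions that each collide with probability `p`; the object and function names are illustrative only.

```scala
object Amplification {
  // AND (series): r concatenated hashes must all agree.
  def and(p: Double, r: Int): Double = math.pow(p, r)

  // OR (parallel): at least one of b hashes must agree.
  def or(p: Double, b: Int): Double = 1.0 - math.pow(1.0 - p, b)

  // Banding: b parallel bands, each an AND of r hashes.
  def banded(p: Double, r: Int, b: Int): Double = or(and(p, r), b)
}
```

Calling `Amplification.banded(p, 100, 10)` for a few values of `p` shows how 10 parallel bands of 100 concatenated hashes sharpen the gap between near and far pairs.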
BASIC LSH
// Hash every record, keyed by its id.
val hashCollection = records.map(s => (getId(s), s)).
  mapValues(s => getHash(s, hashers))
// Cut each hash into bands and key every band by its index.
val subArray = hashCollection.flatMap { case (recordId, hash) =>
  hash.grouped(hashLength / numberBands).zipWithIndex.map {
    case (band, bandIndex) => (bandIndex, (band, recordId))
  }
}
LOOKUP
def findCandidates(record: Iterable[String], hashers: Array[Int => Int], mBands: BandType) = {
  val hash = getHash(record, hashers)
  val subArrays = partitionArray(hash).zipWithIndex
  subArrays.flatMap { case (band, bandIndex) =>
    val hashedBucket = mBands.lookup(bandIndex).
      headOption.
      flatMap { _.get(band) }
    hashedBucket
  }.flatten.toSet
}
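The banding and lookup steps can be tried without a cluster. The following is a sketch on in-memory Scala collections rather than RDDs: `Banding`, `bands` and `buildIndex` are hypothetical names, signatures are assumed to be precomputed, and the signature length is assumed to be a multiple of the number of bands.

```scala
object Banding {
  type Band = Seq[Int]

  // Cut a signature into numberBands consecutive bands, each tagged with its index.
  def bands(signature: Seq[Int], numberBands: Int): Seq[(Band, Int)] =
    signature.grouped(signature.length / numberBands).toSeq.zipWithIndex

  // Index: (bandIndex, band) -> ids of all records sharing that band.
  def buildIndex(signatures: Map[String, Seq[Int]],
                 numberBands: Int): Map[(Int, Band), Set[String]] =
    signatures.toSeq
      .flatMap { case (id, sig) =>
        bands(sig, numberBands).map { case (band, i) => ((i, band), id) }
      }
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2).toSet }

  // Candidates: every record that shares at least one band with the query.
  def findCandidates(signature: Seq[Int],
                     index: Map[(Int, Band), Set[String]],
                     numberBands: Int): Set[String] =
    bands(signature, numberBands)
      .flatMap { case (band, i) => index.getOrElse((i, band), Set.empty[String]) }
      .toSet
}
```

A record is returned as soon as one of its bands matches the query's band at the same position, which is the OR construction applied to bands.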
getHash(record, hashers)
DISTRIBUTE RANDOM SEEDS, NOT PERMUTATION FUNCTIONS
records.mapPartitions { iter =>
  // One generator per partition, not per record.
  val rng = new scala.util.Random()
  iter.map(x => hashers.flatMap { h => getHashFunction(rng, h)(x) })
}
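One way to read this slide: ship only integer seeds and rebuild each hash function where it is needed. The sketch below assumes records are collections of integer ids and uses a seeded affine map as the hash; `SeededHashers`, `fromSeed` and `minHashSignature` are hypothetical names, not the talk's `getHashFunction`.

```scala
import scala.util.Random

object SeededHashers {
  private val prime = 2147483647L // 2^31 - 1

  // Rebuild a hash function from its seed: same seed, same function, on any machine.
  def fromSeed(seed: Int): Int => Int = {
    val rng = new Random(seed)
    val a = 1 + rng.nextInt(Int.MaxValue - 1).toLong
    val b = rng.nextInt(Int.MaxValue).toLong
    x => ((a * (x.toLong & 0x7fffffffL) + b) % prime).toInt
  }

  // MinHash signature of a non-empty record: one minimum per seed.
  def minHashSignature(record: Iterable[Int], seeds: Array[Int]): Array[Int] =
    seeds.map { seed =>
      val h = fromSeed(seed)
      record.iterator.map(h).min
    }
}
```

Under this reading, an `Array[Int]` of seeds is all that has to travel to the executors, which is far smaller than materialized permutations.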
AND YET, OOM
BASIC LSH
WITH A 2-STABLE GAUSSIAN DISTRIBUTION
With data points, choose and , to solve the problem
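For reference, a single hash function built from a 2-stable (Gaussian) distribution is usually written $h(v) = \lfloor (a \cdot v + b) / w \rfloor$, with $a$ a vector of independent standard Gaussians and $b$ uniform in $[0, w)$. A minimal sketch, in which the bucket width `w` and all names are assumptions of this example:

```scala
import scala.util.Random

object StableHash {
  // h(v) = floor((a . v + b) / w), a ~ N(0, I), b ~ U[0, w).
  def gaussianHash(dim: Int, w: Double, rng: Random): Array[Double] => Int = {
    val a = Array.fill(dim)(rng.nextGaussian())
    val b = rng.nextDouble() * w
    v => {
      val dot = v.zip(a).map { case (vi, ai) => vi * ai }.sum
      math.floor((dot + b) / w).toInt
    }
  }
}
```

Concatenating several such functions gives the AND construction, and repeating the whole table gives the OR construction.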
WEB LOGS ARE SPARSE
Input : hits per user, over 6 months, 2x50-ish integers/user (4GB)
Output of length 1000 integers per user : 10 (parallel) bands, 100 (concatenated) hashes
64-bit integers : 40 GB
Yet!
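The sizes on this slide can be approximated with back-of-the-envelope arithmetic. The sketch assumes the figures refer to the larger sample of 5684403 users and to 8-byte integers for both input and output; neither assumption is stated on the slide.

```scala
object SignatureSize {
  val users = 5684403L   // assumption: the larger sample from the segmentation slide
  val bytesPerInt = 8L   // 64-bit integers

  // Total size in decimal gigabytes for a given number of integers per user.
  def gigabytes(intsPerUser: Long): Double =
    users * intsPerUser * bytesPerInt / 1e9
}
```

Under these assumptions `SignatureSize.gigabytes(100)` is about 4.5 and `SignatureSize.gigabytes(1000)` about 45, the same order of magnitude as the 4 GB and 40 GB quoted above.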
ENTROPY LSH (PANIGRAHY 2006)
REPLACE TABLES BY OFFSETS
Offsets $q + \delta_i$, $1 \le i \le L$, chosen randomly from the surface of $B(q, r)$, the sphere of radius $r$ centered at $q$
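Drawing a point uniformly from the surface of a sphere is commonly done by normalizing a Gaussian vector. The sketch below uses that technique to generate query offsets; `EntropyOffsets` and its parameter names are hypothetical.

```scala
import scala.util.Random

object EntropyOffsets {
  // One point on the sphere of radius r centered at q:
  // normalize a Gaussian vector, scale it by r, shift it by q.
  def offset(q: Array[Double], r: Double, rng: Random): Array[Double] = {
    val g = Array.fill(q.length)(rng.nextGaussian())
    val norm = math.sqrt(g.map(x => x * x).sum)
    q.zip(g).map { case (qi, gi) => qi + r * gi / norm }
  }

  // Several independent offsets for the same query point.
  def offsets(q: Array[Double], r: Double, count: Int, rng: Random): Seq[Array[Double]] =
    Seq.fill(count)(offset(q, r, rng))
}
```

Each offset is then hashed and looked up in the same table, which trades extra tables for extra lookups.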
ENTROPY LSH
WITH A 2-STABLE GAUSSIAN DISTRIBUTION
With data points, choose and , to solve the problem with as few as hash tables
BUT ... NETWORK COSTS
▸ Basic LSH : look up buckets,
▸ Entropy LSH : search for offsets
LAYERED LSH (BAHMANI ET AL. 2012)
Output of your LSH family is in , with e.g. a cosine norm.
For closer points, the chance of hashes hashing to the same bucket is high!
LAYERED LSH
Have an LSH family for your norm on
Likely that for all offsets
LAYERED LSH
Output of hash generation is (GH(p), (H(p), p)) for all p.
In Spark, group, or custom partitioner for (H(p), p) RDD.
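The keying scheme can be sketched on plain Scala collections. Here the second-layer hash G is assumed to be a Gaussian projection of the bucket id H(p) cut into cells of width `d`, in the spirit of Layered LSH; the names `layerHash` and `assign` are illustrative, and the `groupBy` stands in for the shuffle a Spark grouping or custom partitioner would perform.

```scala
import scala.util.Random

object LayeredLsh {
  // Second-layer hash G over bucket ids H(p): nearby bucket ids tend to
  // fall in the same cell, hence on the same machine.
  def layerHash(k: Int, d: Double, rng: Random): Seq[Int] => Int = {
    val alpha = Array.fill(k)(rng.nextGaussian())
    val beta = rng.nextDouble() * d
    h => {
      val dot = h.zip(alpha).map { case (hi, ai) => hi * ai }.sum
      math.floor((dot + beta) / d).toInt
    }
  }

  // Emit (G(H(p)), (H(p), p)) for every point, then group by G(H(p)).
  def assign[P](points: Seq[P],
                h: P => Seq[Int],
                g: Seq[Int] => Int): Map[Int, Seq[(Seq[Int], P)]] =
    points
      .map { p => val hp = h(p); (g(hp), (hp, p)) }
      .groupBy(_._1)
      .map { case (machine, rows) => machine -> rows.map(_._2) }
}
```

Points with the same H(p) always share a key, so a bucket is never split across machines.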
Network cost :
PERFORMANCE
FUTURE WORK
HAVE A (BIG) WEBLOG?
▸ Weve
▸ Yandex
FUTURE WORK
LOCALITY-SENSITIVE HASHING FORESTS!
RELEASE
github.com/huitseeker/spark-lsh
1 SEPT 2015