CS60021: Scalable Data Mining, Similarity Search and Hashing, by Sourangshu Bhattacharya


Page 1: New CS60021: Scalable Data Mining Similarity Search and Hashingcse.iitkgp.ac.in/~sourangshu/coursefiles/SDM17A/lsh1.pdf · 2017. 9. 4. · Documents as High-Dim Data ... Mining of

CS60021: Scalable Data Mining

Similarity Search and Hashing

Sourangshu Bhattacharya

Finding Similar Items

Distance Measures

•  Goal: Find near-neighbors in high-dim. space
   –  We formally define "near neighbors" as points that are a "small distance" apart
•  For each application, we first need to define what "distance" means
•  Today: Jaccard distance/similarity
   –  The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
   –  Jaccard distance: d(C1, C2) = 1 - |C1 ∩ C2| / |C1 ∪ C2|
•  Example (figure): 3 in intersection, 8 in union, so Jaccard similarity = 3/8 and Jaccard distance = 5/8

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
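
The Jaccard definitions can be written directly with Python sets. This is a minimal sketch; the function names and the two example sets are assumptions, chosen only so that the counts match the slide (3 shared elements, 8 in the union).

```python
def jaccard_similarity(c1: set, c2: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = c1 | c2
    if not union:
        # Assumption: two empty sets are treated as identical (similarity 1).
        return 1.0
    return len(c1 & c2) / len(union)


def jaccard_distance(c1: set, c2: set) -> float:
    """One minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(c1, c2)


# Hypothetical sets: 3 elements in the intersection, 8 in the union.
c1 = {1, 2, 3, 4, 5}
c2 = {3, 4, 5, 6, 7, 8}

print(jaccard_similarity(c1, c2))  # 3/8
print(jaccard_distance(c1, c2))    # 5/8
```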

Task: Finding Similar Documents

3 Essential Steps for Similar Docs

1.  Shingling: Convert documents to sets
2.  Min-Hashing: Convert large sets to short signatures, while preserving similarity
3.  Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
    –  Candidate pairs!

The Big Picture

Document
   → Shingling → The set of strings of length k that appear in the document
   → Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity
   → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

Shingling

Step 1: Shingling: Convert documents to sets

Document
   → Shingling → The set of strings of length k that appear in the document

Documents as High-Dim Data

•  Step 1: Shingling: Convert documents to sets
•  Simple approaches:
   –  Document = set of words appearing in document
   –  Document = set of "important" words
   –  Don't work well for this application. Why?
•  Need to account for ordering of words!
•  A different way: Shingles!

Define: Shingles

•  A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the doc
   –  Tokens can be characters, words or something else, depending on the application
   –  Assume tokens = characters for examples
•  Example: k = 2; document D1 = abcab
   Set of 2-shingles: S(D1) = {ab, bc, ca}
   –  Option: Shingles as a bag (multiset), count ab twice: S'(D1) = {ab, bc, ca, ab}
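
Character shingling as defined on this slide can be sketched in a few lines. The function names are assumptions for illustration; the bag variant uses `collections.Counter` to keep multiplicities.

```python
from collections import Counter


def shingle_set(doc: str, k: int) -> set:
    """Set of all length-k character substrings of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def shingle_bag(doc: str, k: int) -> Counter:
    """Bag (multiset) of k-shingles: repeated shingles are counted."""
    return Counter(doc[i:i + k] for i in range(len(doc) - k + 1))


print(shingle_set("abcab", 2))  # the three shingles ab, bc, ca (set order may vary)
print(shingle_bag("abcab", 2))  # ab is counted twice
```

A document shorter than k simply produces an empty set in this sketch.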

Compressing Shingles

•  To compress long shingles, we can hash them to (say) 4 bytes
•  Represent a document by the set of hash values of its k-shingles
   –  Idea: Two documents could (rarely) appear to have shingles in common, when in fact only the hash-values were shared
•  Example: k = 2; document D1 = abcab
   Set of 2-shingles: S(D1) = {ab, bc, ca}
   Hash the shingles: h(D1) = {1, 5, 7}
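
One way to hash shingles to 4 bytes is a 32-bit checksum. The sketch below uses `zlib.crc32` from the standard library as a stand-in hash; that choice and the function name are assumptions, not the method used in the slides, and the resulting integers will not be the illustrative values {1, 5, 7}.

```python
import zlib


def hashed_shingles(doc: str, k: int) -> set:
    """Map each k-shingle of doc to a 32-bit (4-byte) unsigned integer."""
    return {
        zlib.crc32(doc[i:i + k].encode("utf-8"))
        for i in range(len(doc) - k + 1)
    }


h_d1 = hashed_shingles("abcab", 2)
print(len(h_d1))  # one hash value per distinct shingle here
```

Any hash with a fixed 32-bit output would serve the same purpose; a larger k makes the space saving more significant.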

Similarity Metric for Shingles

•  Document D1 is a set of its k-shingles C1 = S(D1)
•  Equivalently, each document is a 0/1 vector in the space of k-shingles
   –  Each unique shingle is a dimension
   –  Vectors are very sparse
•  A natural similarity measure is the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|

Working Assumption

•  Documents that have lots of shingles in common have similar text, even if the text appears in different order
•  Caveat: You must pick k large enough, or most documents will have most shingles
   –  k = 5 is OK for short documents
   –  k = 10 is better for long documents

Motivation for Minhash/LSH

MinHashing

Step 2: Min-Hashing: Convert large sets to short signatures, while preserving similarity

Document
   → Shingling → The set of strings of length k that appear in the document
   → Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity

Encoding Sets as Bit Vectors

•  Many similarity problems can be formalized as finding subsets that have significant intersection
•  Encode sets using 0/1 (bit, boolean) vectors
   –  One dimension per element in the universal set
•  Interpret set intersection as bitwise AND, and set union as bitwise OR
•  Example: C1 = 10111; C2 = 10011
   –  Size of intersection = 3; size of union = 4
   –  Jaccard similarity (not distance) = 3/4
   –  Distance: d(C1, C2) = 1 - (Jaccard similarity) = 1/4
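
The bit-vector example can be checked with integer bitwise operators. This is a small sketch that stores each set as a Python integer; the variable names are assumptions.

```python
# The two example bit vectors, written as binary literals.
c1 = 0b10111
c2 = 0b10011

# Intersection is bitwise AND, union is bitwise OR; count the 1 bits of each.
intersection = bin(c1 & c2).count("1")
union = bin(c1 | c2).count("1")

similarity = intersection / union
distance = 1 - similarity

print(intersection, union)   # 3 4
print(similarity, distance)  # 0.75 0.25
```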

From Sets to Boolean Matrices

•  Rows = elements (shingles)
•  Columns = sets (documents)
   –  1 in row e and column s if and only if e is a member of s
   –  Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
   –  Typical matrix is sparse!
•  Each document is a column:
   –  Example: sim(C1, C2) = ?
      •  Size of intersection = 3; size of union = 6, Jaccard similarity (not distance) = 3/6
      •  d(C1, C2) = 1 - (Jaccard similarity) = 3/6

Example matrix (rows = shingles, columns = documents):

   C1  C2  C3  C4
    1   1   1   0
    1   1   0   1
    0   1   0   1
    0   0   0   1
    1   0   0   1
    1   1   1   0
    1   0   1   0
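
Column similarity in a boolean matrix can be computed by counting rows. The sketch below assumes a 7 x 4 matrix whose first two columns give the slide's counts (intersection 3, union 6); the function name is hypothetical.

```python
# Rows = shingles, columns = documents C1..C4 (values assumed for illustration).
matrix = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
]


def column_similarity(m, i, j):
    """Jaccard similarity of columns i and j (0-based indices)."""
    both = sum(1 for row in m if row[i] and row[j])     # rows with 1 in both columns
    either = sum(1 for row in m if row[i] or row[j])    # rows with 1 in at least one
    return both / either if either else 1.0


print(column_similarity(matrix, 0, 1))  # 3/6 = 0.5
```

Rows where both columns are 0 never enter the count, which is why a sparse representation (only the row indices holding a 1) is enough in practice.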

Outline: Finding Similar Columns

•  So far:
   –  Documents → Sets of shingles
   –  Represent sets as boolean vectors in a matrix
•  Next goal: Find similar columns while computing small signatures
   –  Similarity of columns == similarity of signatures

Outline: Finding Similar Columns

•  Next goal: Find similar columns, small signatures
•  Naïve approach:
   –  1) Signatures of columns: small summaries of columns
   –  2) Examine pairs of signatures to find similar columns
      •  Essential: Similarities of signatures and columns are related
   –  3) Optional: Check that columns with similar signatures are really similar
•  Warnings:
   –  Comparing all pairs may take too much time: Job for LSH
      •  These methods can produce false negatives, and even false positives (if the optional check is not made)

Hashing Columns (Signatures)

•  Key idea: "hash" each column C to a small signature h(C), such that:
   –  (1) h(C) is small enough that the signature fits in RAM
   –  (2) sim(C1, C2) is the same as the "similarity" of signatures h(C1) and h(C2)
•  Goal: Find a hash function h(·) such that:
   –  If sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
   –  If sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)
•  Hash docs into buckets. Expect that "most" pairs of near duplicate docs hash into the same bucket!

Min-Hashing

•  Goal: Find a hash function h(·) such that:
   –  if sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
   –  if sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)
•  Clearly, the hash function depends on the similarity metric:
   –  Not all similarity metrics have a suitable hash function
•  There is a suitable hash function for the Jaccard similarity: It is called Min-Hashing

Min-Hashing

•  Imagine the rows of the boolean matrix permuted under random permutation π
•  Define a "hash" function h_π(C) = the index of the first (in the permuted order π) row in which column C has value 1:
   h_π(C) = min_π π(C)
•  Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column

Min-Hashing Example

Permutations π (one per column) and input matrix (Shingles x Documents):

   π1  π2  π3      C1  C2  C3  C4
    2   4   3       1   0   1   0
    3   2   4       1   0   0   1
    7   1   7       0   1   0   1
    6   3   2       0   1   0   1
    1   6   6       0   1   0   1
    5   7   1       1   0   1   0
    4   5   5       1   0   1   0

Signature matrix M (one row per permutation):

        C1  C2  C3  C4
   π1    2   1   2   1
   π2    2   1   4   1
   π3    1   2   1   2

–  2nd element of the permutation is the first to map to a 1
–  4th element of the permutation is the first to map to a 1
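
Min-hashing with explicit permutations can be sketched as follows. The numbers are those of the slide's example; their arrangement into rows and columns is an assumption (the slide is a figure), and `permutations[p][row]` is taken to mean the position of that row in the permuted order.

```python
# Rows = shingles, columns = documents (arrangement assumed from the example figure).
input_matrix = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]

# permutations[p][row] = 1-based position of `row` in permuted order p.
permutations = [
    [2, 3, 7, 6, 1, 5, 4],
    [4, 2, 1, 3, 6, 7, 5],
    [3, 4, 7, 2, 6, 1, 5],
]


def minhash_signature(matrix, perms):
    """For each permutation and column, take the smallest position of a row holding a 1."""
    num_cols = len(matrix[0])
    signature = []
    for perm in perms:
        sig_row = [
            min(perm[r] for r, row in enumerate(matrix) if row[col] == 1)
            for col in range(num_cols)
        ]
        signature.append(sig_row)
    return signature


signature = minhash_signature(input_matrix, permutations)
for sig_row in signature:
    print(sig_row)
```

Each printed row is one min-hash function applied to all four columns, so the output is the signature matrix M with one row per permutation.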

The Min-Hash Property

•  Choose a random permutation π
•  Claim: Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)
•  Why?
   –  Let X be a doc (set of shingles), y ∈ X is a shingle
   –  Then: Pr[π(y) = min(π(X))] = 1/|X|
      •  It is equally likely that any y ∈ X is mapped to the min element
   –  Let y be s.t. π(y) = min(π(C1 ∪ C2))
   –  Then either: π(y) = min(π(C1)) if y ∈ C1, or π(y) = min(π(C2)) if y ∈ C2
      •  One of the two cols had to have 1 at position y
   –  So the prob. that both are true is the prob. y ∈ C1 ∩ C2
   –  Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)

Four Types of Rows

•  Given cols C1 and C2, rows may be classified as:

        C1  C2
   A     1   1
   B     1   0
   C     0   1
   D     0   0

   –  a = # rows of type A, etc.
•  Note: sim(C1, C2) = a/(a + b + c)
•  Then: Pr[h(C1) = h(C2)] = sim(C1, C2)
   –  Look down the cols C1 and C2 until we see a 1
   –  If it's a type-A row, then h(C1) = h(C2). If a type-B or type-C row, then not
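
The min-hash property can be checked exactly on a tiny example by enumerating every permutation of the rows. The two sets and the universe size below are assumptions picked to keep the enumeration small (6! = 720 permutations); this is a sanity check, not a proof.

```python
from fractions import Fraction
from itertools import permutations


def minhash(column: set, perm: tuple) -> int:
    """perm[row] is the position of `row` in the permuted order."""
    return min(perm[row] for row in column)


# Hypothetical columns over rows 0..5: a = 2 type-A rows, b = 2 type-B,
# c = 1 type-C, and row 5 is a type-D row (in neither set).
c1 = {0, 1, 2, 3}
c2 = {2, 3, 4}
num_rows = 6

agree = 0
total = 0
for perm in permutations(range(num_rows)):
    total += 1
    if minhash(c1, perm) == minhash(c2, perm):
        agree += 1

probability = Fraction(agree, total)
jaccard = Fraction(len(c1 & c2), len(c1 | c2))
print(probability, jaccard)  # both equal a / (a + b + c) = 2/5
```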

Similarity for Signatures

•  We know: Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)
•  Now generalize to multiple hash functions
•  The similarity of two signatures is the fraction of the hash functions in which they agree
•  Note: Because of the Min-Hash property, the similarity of columns is the same as the expected similarity of their signatures

Min-Hashing Example

(Figure: the permutations π, input matrix (Shingles x Documents) and signature matrix M of the earlier Min-Hashing Example, repeated.)

Similarities:

              1-3    2-4    1-2    3-4
   Col/Col    0.75   0.75   0      0
   Sig/Sig    0.67   1.00   0      0
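
The Sig/Sig row of the table is the fraction of signature rows on which two columns agree. A minimal sketch, assuming the signature matrix is stored with one row per permutation and one column per document (the function name is hypothetical):

```python
# Signature matrix from the example: 3 min-hash functions x 4 documents.
signature = [
    [2, 1, 2, 1],
    [2, 1, 4, 1],
    [1, 2, 1, 2],
]


def signature_similarity(sig, i, j):
    """Fraction of hash functions on which columns i and j (0-based) agree."""
    agree = sum(1 for row in sig if row[i] == row[j])
    return agree / len(sig)


# Document pairs 1-3, 2-4, 1-2, 3-4 are index pairs (0, 2), (1, 3), (0, 1), (2, 3).
for i, j in [(0, 2), (1, 3), (0, 1), (2, 3)]:
    print(f"{i + 1}-{j + 1}: {signature_similarity(signature, i, j):.2f}")
```

With only three hash functions the estimate is coarse (2/3 against a true column similarity of 0.75 for pair 1-3); more hash functions tighten it.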

Min-Hash Signatures

Implementation Trick

•  Permuting rows even once is prohibitive
•  Row hashing!
   –  Pick K = 100 hash functions k_i
   –  Ordering under k_i gives a random row permutation!
•  One-pass implementation
   –  For each column C and hash-func. k_i keep a "slot" for the min-hash value
   –  Initialize all sig(C)[i] = ∞
   –  Scan rows looking for 1s
      •  Suppose row j has 1 in column C
      •  Then for each k_i:
         –  If k_i(j) < sig(C)[i], then sig(C)[i] ← k_i(j)

How to pick a random hash function h(x)?
Universal hashing: h_{a,b}(x) = ((a·x + b) mod p) mod N
where:
   a, b … random integers
   p … prime number (p > N)
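
The one-pass procedure with universal hashing might look like the sketch below. The prime, the seed, the function names and the small input matrix are all assumptions for illustration; note that with `mod N` two rows can collide, so each k_i only approximates a permutation.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, a prime; assumed to exceed the number of rows N


def make_hash_funcs(num_funcs, num_rows, seed=0):
    """Build num_funcs functions k_i(x) = ((a*x + b) mod p) mod N with random a, b."""
    rng = random.Random(seed)
    funcs = []
    for _ in range(num_funcs):
        a = rng.randrange(1, PRIME)
        b = rng.randrange(0, PRIME)
        funcs.append(lambda x, a=a, b=b: ((a * x + b) % PRIME) % num_rows)
    return funcs


def minhash_signatures(rows, num_cols, hash_funcs):
    """One pass over the rows; sig[i][c] is the slot for hash function i and column c."""
    sig = [[float("inf")] * num_cols for _ in hash_funcs]
    for j, row in enumerate(rows):
        hashed = [k(j) for k in hash_funcs]  # k_i(j) for every i, computed once per row
        for c in range(num_cols):
            if row[c] == 1:
                for i, value in enumerate(hashed):
                    if value < sig[i][c]:
                        sig[i][c] = value
    return sig


# Hypothetical 7 x 4 boolean matrix (rows = shingles, columns = documents).
rows = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]
funcs = make_hash_funcs(100, len(rows))
sig = minhash_signatures(rows, 4, funcs)
print(len(sig), len(sig[0]))  # 100 hash functions, 4 columns
```

Only rows that contain a 1 touch the slots, so a sparse matrix can be scanned by iterating over its nonzero entries instead of full rows.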

Locality Sensitive Hashing

Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents

Document
   → Shingling → The set of strings of length k that appear in the document
   → Min-Hashing → Signatures: short integer vectors that represent the sets, and reflect their similarity
   → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity

LSH: First Cut

•  Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s = 0.8)
•  LSH general idea: Use a function f(x, y) that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated
•  For Min-Hash matrices:
   –  Hash columns of signature matrix M to many buckets
   –  Each pair of documents that hashes into the same bucket is a candidate pair

Candidates from Min-Hash

•  Pick a similarity threshold s (0 < s < 1)
•  Columns x and y of M are a candidate pair if their signatures agree on at least fraction s of their rows: M(i, x) = M(i, y) for at least frac. s values of i
   –  We expect documents x and y to have the same (Jaccard) similarity as their signatures

LSH for Min-Hash

•  Big idea: Hash columns of signature matrix M several times
•  Arrange that (only) similar columns are likely to hash to the same bucket, with high probability
•  Candidate pairs are those that hash to the same bucket

Partition M into b Bands

(Figure: signature matrix M divided into b bands of r rows per band; each column is one signature.)

Partition M into Bands

•  Divide matrix M into b bands of r rows
•  For each band, hash its portion of each column to a hash table with k buckets
   –  Make k as large as possible
•  Candidate column pairs are those that hash to the same bucket for ≥ 1 band
•  Tune b and r to catch most similar pairs, but few non-similar pairs

Hashing Bands

(Figure: matrix M with b bands of r rows; each band's portion of every column is hashed to buckets.)
–  Columns 2 and 6 are probably identical (candidate pair)
–  Columns 6 and 7 are surely different.
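
The banding step can be sketched with one dictionary per band. In this sketch the bucket key is the band's tuple of values itself, so "same bucket" means "identical in that band"; the function name, the input layout (a dict from document id to signature) and the toy signatures are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, b, r):
    """signatures: dict mapping doc id -> list of b*r min-hash values.

    Returns the set of pairs that share a bucket in at least one band.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # This band's portion of the column acts as the bucket key.
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for x, y in combinations(sorted(docs), 2):
                candidates.add((x, y))
    return candidates


# Hypothetical signatures of length 4, split into b = 2 bands of r = 2 rows.
signatures = {
    "d1": [1, 2, 3, 4],
    "d2": [1, 2, 9, 9],
    "d3": [7, 7, 7, 7],
}
print(lsh_candidate_pairs(signatures, b=2, r=2))  # d1 and d2 agree on the first band
```

A real implementation would hash the band tuple into a fixed number of buckets; using the tuple as the key simply avoids accidental collisions in this small example.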

Simplifying Assumption

•  There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band
•  Hereafter, we assume that "same bucket" means "identical in that band"
•  Assumption needed only to simplify analysis, not for correctness of algorithm

Example of Bands

Assume the following case:
•  Suppose 100,000 columns of M (100k docs)
•  Signatures of 100 integers (rows)
•  Therefore, signatures take 40 MB (100,000 signatures x 100 integers x 4 bytes)
•  Choose b = 20 bands of r = 5 integers/band
•  Goal: Find pairs of documents that are at least s = 0.8 similar

C1, C2 are 80% Similar

•  Find pairs of ≥ s = 0.8 similarity, set b = 20, r = 5
•  Assume: sim(C1, C2) = 0.8
   –  Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: We want them to hash to at least 1 common bucket (at least one band is identical)
•  Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328
•  Probability C1, C2 are not identical in any of the 20 bands: (1 - 0.328)^20 = 0.00035
   –  i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
   –  We would find 99.965% pairs of truly similar documents

C1, C2 are 30% Similar

•  Find pairs of ≥ s = 0.8 similarity, set b = 20, r = 5
•  Assume: sim(C1, C2) = 0.3
   –  Since sim(C1, C2) < s we want C1, C2 to hash to NO common buckets (all bands should be different)
•  Probability C1, C2 identical in one particular band: (0.3)^5 = 0.00243
•  Probability C1, C2 identical in at least 1 of 20 bands: 1 - (1 - 0.00243)^20 = 0.0474
   –  In other words, approximately 4.74% pairs of docs with similarity 0.3 end up becoming candidate pairs
      •  They are false positives since we will have to examine them (they are candidate pairs) but then it will turn out their similarity is below threshold s
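
The two worked cases (80% and 30% similar, with b = 20 and r = 5) follow from one formula and can be recomputed directly. A minimal sketch; the function and variable names are assumptions.

```python
def prob_candidate(t: float, r: int, b: int) -> float:
    """Probability that two columns with similarity t are identical in at least one band."""
    return 1 - (1 - t ** r) ** b


b, r = 20, 5

# 80%-similar pair: chance of being missed (no band identical).
false_negative_rate = 1 - prob_candidate(0.8, r, b)

# 30%-similar pair: chance of still becoming a candidate pair.
false_positive_rate = prob_candidate(0.3, r, b)

print(f"one band identical at t=0.8: {0.8 ** r:.3f}")
print(f"missed at t=0.8:             {false_negative_rate:.5f}")
print(f"candidate at t=0.3:          {false_positive_rate:.4f}")
```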

LSH Involves a Tradeoff

•  Pick:
   –  The number of Min-Hashes (rows of M)
   –  The number of bands b, and
   –  The number of rows r per band
   to balance false positives/negatives
•  Example: If we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up

Analysis of LSH – What We Want

(Figure: ideal step function. x-axis: similarity t = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket; the step is at the similarity threshold s.)
–  No chance if t < s
–  Probability = 1 if t > s

What 1 Band of 1 Row Gives You

(Figure: x-axis: similarity t = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket.)
–  Remember: Probability of equal hash-values = similarity

b bands, r rows/band

•  Columns C1 and C2 have similarity t
•  Pick any band (r rows)
   –  Prob. that all rows in band equal = t^r
   –  Prob. that some row in band unequal = 1 - t^r
•  Prob. that no band identical = (1 - t^r)^b
•  Prob. that at least 1 band identical = 1 - (1 - t^r)^b

What b Bands of r Rows Gives You

(Figure: S-curve. x-axis: similarity t = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket.)
–  The curve 1 - (1 - t^r)^b is built up as:
   •  t^r: all rows of a band are equal
   •  1 - t^r: some row of a band unequal
   •  (1 - t^r)^b: no bands identical
   •  1 - (1 - t^r)^b: at least one band identical
–  Threshold: s ~ (1/b)^(1/r)
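
The approximate threshold s ~ (1/b)^(1/r) is the similarity at which t^r = 1/b, so the probability of sharing a bucket there is 1 - (1 - 1/b)^b, roughly 1 - 1/e for large b. A short sketch with hypothetical function names, evaluated for b = 20, r = 5:

```python
def approx_threshold(b: int, r: int) -> float:
    """Approximate location of the steep part of the S-curve."""
    return (1 / b) ** (1 / r)


def prob_candidate(t: float, r: int, b: int) -> float:
    """Probability that at least one band is identical."""
    return 1 - (1 - t ** r) ** b


s = approx_threshold(20, 5)
p = prob_candidate(s, 5, 20)
print(f"threshold s ~ {s:.3f}")
print(f"probability of sharing a bucket at s: {p:.3f}")
```

Increasing r pushes the threshold toward 1, while increasing b pulls it toward 0, which is the lever used when tuning b and r.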

Example: b = 20; r = 5

•  Similarity threshold s
•  Prob. that at least 1 band is identical:

   s      1 - (1 - s^r)^b
   0.2    0.006
   0.3    0.047
   0.4    0.186
   0.5    0.470
   0.6    0.802
   0.7    0.975
   0.8    0.9996
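
The table can be regenerated by evaluating 1 - (1 - s^r)^b for each s. A minimal sketch with r = 5 and b = 20 as defaults; the names are assumptions.

```python
def prob_candidate(s: float, r: int = 5, b: int = 20) -> float:
    """Probability that at least one of b bands of r rows is identical."""
    return 1 - (1 - s ** r) ** b


# Similarities 0.2, 0.3, ..., 0.8 paired with their candidate probabilities.
table = [(s / 10, prob_candidate(s / 10)) for s in range(2, 9)]

for s, p in table:
    print(f"{s:.1f}  {p:.4f}")
```

Changing the defaults (for example r = 5, b = 10 for 50 hash functions) shows how the same formula shifts the steep part of the curve.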

Picking r and b: The S-curve

•  Picking r and b to get the best S-curve
   –  50 hash-functions (r = 5, b = 10)

(Figure: S-curve. x-axis: similarity, 0 to 1; y-axis: prob. sharing a bucket, 0 to 1. Blue area: false negative rate. Green area: false positive rate.)

LSH Summary

•  Tune M, b, r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
•  Check in main memory that candidate pairs really do have similar signatures
•  Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents

Summary: 3 Steps

•  Shingling: Convert documents to sets
   –  We used hashing to assign each shingle an ID
•  Min-Hashing: Convert large sets to short signatures, while preserving similarity
   –  We used similarity preserving hashing to generate signatures with property Pr[h_π(C1) = h_π(C2)] = sim(C1, C2)
   –  We used hashing to get around generating random permutations
•  Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
   –  We used hashing to find candidate pairs of similarity ≥ s

GENERALIZATION OF LSH
