CS60021: Scalable Data Mining
Similarity Search and Hashing
Sourangshu Bhattacharya
Finding Similar Items
Distance Measures
• Goal: Find near-neighbors in high-dim. space
  – We formally define "near neighbors" as points that are a "small distance" apart
• For each application, we first need to define what "distance" means
• Today: Jaccard distance/similarity
  – The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
  – Jaccard distance: d(C1, C2) = 1 - |C1 ∩ C2| / |C1 ∪ C2|
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
Example: 3 in intersection, 8 in union; Jaccard similarity = 3/8, Jaccard distance = 5/8
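The two definitions can be sketched in a few lines of Python. The function names and the sample sets are illustrative assumptions; the sets are chosen only so that they share 3 elements out of 8, as in the example.

```python
def jaccard_similarity(c1: set, c2: set) -> float:
    """|C1 ∩ C2| / |C1 ∪ C2|. Two empty sets are treated as identical (an assumption)."""
    union = c1 | c2
    if not union:
        return 1.0
    return len(c1 & c2) / len(union)


def jaccard_distance(c1: set, c2: set) -> float:
    """d(C1, C2) = 1 - sim(C1, C2)."""
    return 1.0 - jaccard_similarity(c1, c2)


# Hypothetical sets: 3 shared elements, 8 elements in the union.
c1 = {1, 2, 3, 4, 5}
c2 = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(c1, c2))  # 0.375
print(jaccard_distance(c1, c2))    # 0.625
```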
Task: Finding Similar Documents
3 Essential Steps for Similar Docs
1. Shingling: Convert documents to sets
2. Min-Hashing: Convert large sets to short signatures, while preserving similarity
3. Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
  – Candidate pairs!
The Big Picture
• Document → Shingling → the set of strings of length k that appear in the document
• → Min-Hashing → signatures: short integer vectors that represent the sets, and reflect their similarity
• → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity
Shingling
Step 1: Shingling: Convert documents to sets
• Document → Shingling → the set of strings of length k that appear in the document
Documents as High-Dim Data
• Step 1: Shingling: Convert documents to sets
• Simple approaches:
  – Document = set of words appearing in document
  – Document = set of "important" words
  – Don't work well for this application. Why?
• Need to account for ordering of words!
• A different way: Shingles!
Define: Shingles
• A k-shingle (or k-gram) for a document is a sequence of k tokens that appears in the doc
  – Tokens can be characters, words or something else, depending on the application
  – Assume tokens = characters for examples
• Example: k = 2; document D1 = abcab
  Set of 2-shingles: S(D1) = {ab, bc, ca}
  – Option: Shingles as a bag (multiset), count ab twice: S'(D1) = {ab, bc, ca, ab}
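A minimal sketch of character-level shingling, covering both the set and the bag (multiset) variants. The function names are illustrative assumptions, and tokens are taken to be characters as in the example.

```python
from collections import Counter


def shingle_set(doc: str, k: int) -> set:
    """Set of all length-k character substrings of doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}


def shingle_bag(doc: str, k: int) -> Counter:
    """Multiset variant: each k-shingle is counted as often as it occurs."""
    return Counter(doc[i:i + k] for i in range(len(doc) - k + 1))


print(sorted(shingle_set("abcab", 2)))  # ['ab', 'bc', 'ca']
print(shingle_bag("abcab", 2)["ab"])    # 2
```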
Compressing Shingles
• To compress long shingles, we can hash them to (say) 4 bytes
• Represent a document by the set of hash values of its k-shingles
  – Idea: Two documents could (rarely) appear to have shingles in common, when in fact only the hash-values were shared
• Example: k = 2; document D1 = abcab
  Set of 2-shingles: S(D1) = {ab, bc, ca}
  Hash the shingles: h(D1) = {1, 5, 7}
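As a rough illustration, the sketch below hashes each shingle to a 4-byte value. The choice of CRC-32 (`zlib.crc32`) is an assumption made for the example; the slides do not name a hash function, and the values {1, 5, 7} above are illustrative, not CRC-32 outputs.

```python
import zlib


def hashed_shingles(doc: str, k: int) -> set:
    """Represent doc by the set of 32-bit hash values of its k-shingles."""
    shingles = {doc[i:i + k] for i in range(len(doc) - k + 1)}
    # zlib.crc32 returns an unsigned 32-bit integer, i.e. it fits in 4 bytes.
    return {zlib.crc32(s.encode("utf-8")) for s in shingles}


h = hashed_shingles("abcab", 2)
print(len(h))  # 3 hash values, one per distinct shingle
```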
Similarity Metric for Shingles
• Document D1 is a set of its k-shingles C1 = S(D1)
• Equivalently, each document is a 0/1 vector in the space of k-shingles
  – Each unique shingle is a dimension
  – Vectors are very sparse
• A natural similarity measure is the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
Working Assumption
• Documents that have lots of shingles in common have similar text, even if the text appears in different order
• Caveat: You must pick k large enough, or most documents will have most shingles
  – k = 5 is OK for short documents
  – k = 10 is better for long documents
Motivation for Minhash/LSH
MinHashing
Step 2: Min-Hashing: Convert large sets to short signatures, while preserving similarity
• Document → Shingling → the set of strings of length k that appear in the document
• → Min-Hashing → signatures: short integer vectors that represent the sets, and reflect their similarity
Encoding Sets as Bit Vectors
• Many similarity problems can be formalized as finding subsets that have significant intersection
• Encode sets using 0/1 (bit, boolean) vectors
  – One dimension per element in the universal set
• Interpret set intersection as bitwise AND, and set union as bitwise OR
• Example: C1 = 10111; C2 = 10011
  – Size of intersection = 3; size of union = 4
  – Jaccard similarity (not distance) = 3/4
  – Distance: d(C1, C2) = 1 - (Jaccard similarity) = 1/4
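The AND/OR interpretation can be checked directly on the example. This is a small sketch in which each set is stored as a Python integer used as a bit vector; the function name is an assumption.

```python
def jaccard_bits(c1: int, c2: int) -> float:
    """Jaccard similarity of two sets encoded as bit vectors (Python ints)."""
    intersection = bin(c1 & c2).count("1")  # bitwise AND = set intersection
    union = bin(c1 | c2).count("1")         # bitwise OR  = set union
    return intersection / union if union else 1.0


c1 = 0b10111
c2 = 0b10011
print(jaccard_bits(c1, c2))      # 0.75
print(1 - jaccard_bits(c1, c2))  # 0.25
```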
From Sets to Boolean Matrices
• Rows = elements (shingles)
• Columns = sets (documents)
  – 1 in row e and column s if and only if e is a member of s
  – Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
  – Typical matrix is sparse!
• Each document is a column:
  – Example: sim(C1, C2) = ?
    • Size of intersection = 3; size of union = 6, Jaccard similarity (not distance) = 3/6
    • d(C1, C2) = 1 - (Jaccard similarity) = 3/6
Example matrix (rows = shingles, columns = documents):
  C1 C2 C3 C4
   1  1  1  0
   1  1  0  1
   0  1  0  1
   0  0  0  1
   1  0  0  1
   1  1  1  0
   1  0  1  0
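The column similarity can be computed straight from the matrix. In the sketch below the matrix is one reading of the example figure (an assumption, chosen so that the first two columns have intersection 3 and union 6); the function name is illustrative.

```python
def column_similarity(matrix, i, j):
    """Jaccard similarity of columns i and j of a 0/1 matrix given as a list of rows."""
    both = sum(1 for row in matrix if row[i] and row[j])    # rows with 1 in both columns
    either = sum(1 for row in matrix if row[i] or row[j])   # rows with 1 in at least one
    return both / either if either else 1.0


# Rows = shingles, columns = documents (assumed reading of the example).
M = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
]
print(column_similarity(M, 0, 1))  # 0.5, i.e. 3/6
```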
Outline: Finding Similar Columns
• So far:
  – Documents → Sets of shingles
  – Represent sets as boolean vectors in a matrix
• Next goal: Find similar columns while computing small signatures
  – Similarity of columns == similarity of signatures
Outline: Finding Similar Columns
• Next goal: Find similar columns, small signatures
• Naïve approach:
  – 1) Signatures of columns: small summaries of columns
  – 2) Examine pairs of signatures to find similar columns
    • Essential: Similarities of signatures and columns are related
  – 3) Optional: Check that columns with similar signatures are really similar
• Warnings:
  – Comparing all pairs may take too much time: Job for LSH
    • These methods can produce false negatives, and even false positives (if the optional check is not made)
Hashing Columns (Signatures)
• Key idea: "hash" each column C to a small signature h(C), such that:
  – (1) h(C) is small enough that the signature fits in RAM
  – (2) sim(C1, C2) is the same as the "similarity" of signatures h(C1) and h(C2)
• Goal: Find a hash function h(·) such that:
  – If sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
  – If sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)
• Hash docs into buckets. Expect that "most" pairs of near duplicate docs hash into the same bucket!
Min-Hashing
• Goal: Find a hash function h(·) such that:
  – if sim(C1, C2) is high, then with high prob. h(C1) = h(C2)
  – if sim(C1, C2) is low, then with high prob. h(C1) ≠ h(C2)
• Clearly, the hash function depends on the similarity metric:
  – Not all similarity metrics have a suitable hash function
• There is a suitable hash function for the Jaccard similarity: It is called Min-Hashing
Min-Hashing
• Imagine the rows of the boolean matrix permuted under random permutation π
• Define a "hash" function hπ(C) = the index of the first (in the permuted order π) row in which column C has value 1: hπ(C) = min(π(C))
• Use several (e.g., 100) independent hash functions (that is, permutations) to create a signature of a column
Min-Hashing Example
Permutations π (position of each row in the permuted order) and input matrix (Shingles x Documents):
  π1 π2 π3 | C1 C2 C3 C4
   2  4  3 |  1  0  1  0
   3  2  4 |  1  0  0  1
   7  1  7 |  0  1  0  1
   6  3  2 |  0  1  0  1
   1  6  6 |  0  1  0  1
   5  7  1 |  1  0  1  0
   4  5  5 |  1  0  1  0
Signature matrix M:
     | C1 C2 C3 C4
  π1 |  2  1  2  1
  π2 |  2  1  4  1
  π3 |  1  2  1  2
– Entry 2 (e.g., π1, C1): the 2nd element of the permutation is the first to map to a 1
– Entry 4 (π2, C3): the 4th element of the permutation is the first to map to a 1
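The definition hπ(C) = min(π(C)) can be sketched directly with explicit permutations. The matrix and the three permutations below are one reading of the example figure (an assumption); `perm[i]` is taken to be the position of row i in the permuted order, and the function name is illustrative.

```python
def minhash(column_rows, perm):
    """Smallest permuted position among the rows in which the column has a 1."""
    return min(perm[i] for i in column_rows)


# Rows = shingles, columns = documents (assumed reading of the example).
matrix = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]
perms = [
    [2, 3, 7, 6, 1, 5, 4],
    [4, 2, 1, 3, 6, 7, 5],
    [3, 4, 7, 2, 6, 1, 5],
]

# Each column as the set of row indices that contain a 1.
columns = [{i for i, row in enumerate(matrix) if row[j]} for j in range(4)]
signature = [[minhash(col, perm) for col in columns] for perm in perms]
for row in signature:
    print(row)
# [2, 1, 2, 1]
# [2, 1, 4, 1]
# [1, 2, 1, 2]
```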
The Min-Hash Property
• Choose a random permutation π
• Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
• Why?
  – Let X be a doc (set of shingles), y ∈ X is a shingle
  – Then: Pr[π(y) = min(π(X))] = 1/|X|
    • It is equally likely that any y ∈ X is mapped to the min element
  – Let y be s.t. π(y) = min(π(C1 ∪ C2))
  – Then either: π(y) = min(π(C1)) if y ∈ C1, or π(y) = min(π(C2)) if y ∈ C2
    • One of the two cols had to have 1 at position y
  – So the prob. that both are true is the prob. y ∈ C1 ∩ C2
  – Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)
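For a small universe the claim can be checked exactly by enumerating every permutation rather than sampling. The sketch below does this for two hypothetical columns over 5 rows (the bit vectors 10111 and 10011, similarity 3/4); the function name is an assumption.

```python
from fractions import Fraction
from itertools import permutations


def agreement_probability(c1, c2, n_rows):
    """Exact Pr[h_pi(C1) = h_pi(C2)] over all permutations pi of n_rows rows."""
    agree = total = 0
    for perm in permutations(range(n_rows)):  # perm[r] = position of row r
        total += 1
        if min(perm[r] for r in c1) == min(perm[r] for r in c2):
            agree += 1
    return Fraction(agree, total)


c1 = {0, 2, 3, 4}  # rows with a 1 in column 10111
c2 = {0, 3, 4}     # rows with a 1 in column 10011
print(agreement_probability(c1, c2, 5))         # 3/4
print(Fraction(len(c1 & c2), len(c1 | c2)))     # 3/4, the Jaccard similarity
```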
Four Types of Rows
• Given cols C1 and C2, rows may be classified as:
  Type  C1  C2
  A     1   1
  B     1   0
  C     0   1
  D     0   0
  – a = # rows of type A, etc.
• Note: sim(C1, C2) = a/(a + b + c)
• Then: Pr[h(C1) = h(C2)] = sim(C1, C2)
  – Look down the cols C1 and C2 until we see a 1
  – If it's a type-A row, then h(C1) = h(C2). If a type-B or type-C row, then not
Similarity for Signatures
• We know: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
• Now generalize to multiple hash functions
• The similarity of two signatures is the fraction of the hash functions in which they agree
• Note: Because of the Min-Hash property, the similarity of columns is the same as the expected similarity of their signatures
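Signature similarity is simply the fraction of positions at which two signatures agree. A minimal sketch, with a hypothetical function name and two hypothetical 3-entry signatures:

```python
def signature_similarity(sig1, sig2):
    """Fraction of hash functions (signature positions) on which two signatures agree."""
    if len(sig1) != len(sig2) or not sig1:
        raise ValueError("signatures must be non-empty and of equal length")
    agree = sum(1 for a, b in zip(sig1, sig2) if a == b)
    return agree / len(sig1)


# Hypothetical signatures built from 3 hash functions.
sig_x = [2, 2, 1]
sig_y = [2, 4, 1]
print(round(signature_similarity(sig_x, sig_y), 2))  # 0.67
```

With only a handful of hash functions the estimate is coarse; using on the order of 100, as suggested earlier, narrows the gap between signature similarity and Jaccard similarity.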
Min-Hashing Example
Similarities of column pairs (same input matrix and permutations π as in the Min-Hashing example above):
  Pair      1-3    2-4    1-2    3-4
  Col/Col   0.75   0.75   0      0
  Sig/Sig   0.67   1.00   0      0
Signature matrix M (rows = permutations, columns = documents):
  2 1 2 1
  2 1 4 1
  1 2 1 2
Min-Hash Signatures
Implementation Trick
• Permuting rows even once is prohibitive
• Row hashing!
  – Pick K = 100 hash functions ki
  – Ordering under ki gives a random row permutation!
• One-pass implementation
  – For each column C and hash-func. ki keep a "slot" for the min-hash value
  – Initialize all sig(C)[i] = ∞
  – Scan rows looking for 1s
    • Suppose row j has 1 in column C
    • Then for each ki:
      – If ki(j) < sig(C)[i], then sig(C)[i] ← ki(j)
• How to pick a random hash function h(x)?
  – Universal hashing: ha,b(x) = ((a·x + b) mod p) mod N, where:
    • a, b … random integers
    • p … prime number (p > N)
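The one-pass procedure above can be sketched as follows. The prime P, the seed, the sparse row format, and all function names are assumptions made for the example. The code stores the slot sig(C)[i] as `sig[i][c]`. Note that hashing mod N can map two rows to the same value, so each ki only approximates a true permutation.

```python
import random

P = 2_147_483_647  # prime 2^31 - 1; assumed to exceed the number of rows N


def make_hash_functions(num_hashes, n_rows, seed=0):
    """Return num_hashes functions h_{a,b}(x) = ((a*x + b) mod P) mod n_rows."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(num_hashes)]
    return [lambda x, a=a, b=b: ((a * x + b) % P) % n_rows for a, b in params]


def minhash_signatures(rows, n_cols, hash_funcs):
    """rows yields (row index j, columns with a 1 in row j); returns sig[i][c]."""
    sig = [[float("inf")] * n_cols for _ in hash_funcs]  # initialize all slots to infinity
    for j, cols in rows:
        hashed = [k(j) for k in hash_funcs]  # k_i(j) for every hash function
        for c in cols:
            for i, value in enumerate(hashed):
                if value < sig[i][c]:
                    sig[i][c] = value
    return sig


# Hypothetical sparse input: (row index, columns containing a 1 in that row).
rows = [(0, [0, 2]), (1, [0, 3]), (2, [1, 3]), (3, [1, 3]),
        (4, [1, 3]), (5, [0, 2]), (6, [0, 2])]
hash_funcs = make_hash_functions(num_hashes=100, n_rows=7)
sig = minhash_signatures(rows, n_cols=4, hash_funcs=hash_funcs)
print([sig[i][0] for i in range(5)])  # first five signature entries of column 0
```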
Locality Sensitive Hashing
Step 3: Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
• Document → Shingling → the set of strings of length k that appear in the document
• → Min-Hashing → signatures: short integer vectors that represent the sets, and reflect their similarity
• → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity
LSH: First Cut
• Goal: Find documents with Jaccard similarity at least s (for some similarity threshold, e.g., s = 0.8)
• LSH – General idea: Use a function f(x, y) that tells whether x and y are a candidate pair: a pair of elements whose similarity must be evaluated
• For Min-Hash matrices:
  – Hash columns of signature matrix M to many buckets
  – Each pair of documents that hashes into the same bucket is a candidate pair
Candidates from Min-Hash
• Pick a similarity threshold s (0 < s < 1)
• Columns x and y of M are a candidate pair if their signatures agree on at least fraction s of their rows: M(i, x) = M(i, y) for at least frac. s values of i
  – We expect documents x and y to have the same (Jaccard) similarity as their signatures
LSH for Min-Hash
• Big idea: Hash columns of signature matrix M several times
• Arrange that (only) similar columns are likely to hash to the same bucket, with high probability
• Candidate pairs are those that hash to the same bucket
Partition M into b Bands
[Figure: signature matrix M divided into b bands of r rows per band; each column is one signature]
Partition M into Bands
• Divide matrix M into b bands of r rows
• For each band, hash its portion of each column to a hash table with k buckets
  – Make k as large as possible
• Candidate column pairs are those that hash to the same bucket for ≥ 1 band
• Tune b and r to catch most similar pairs, but few non-similar pairs
Hashing Bands
[Figure: matrix M split into b bands of r rows; each band of each column is hashed into buckets]
– Columns 2 and 6 are probably identical (candidate pair)
– Columns 6 and 7 are surely different.
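The banding scheme can be sketched with one dictionary per band. Using the band's tuple of values as the bucket key is an assumption that corresponds to having enough buckets that only identical bands collide; the function name and the sample signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidates(signatures, b, r):
    """signatures: dict doc_id -> list of b*r min-hash values. Returns candidate pairs."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            if len(sig) != b * r:
                raise ValueError("each signature must have b*r entries")
            key = tuple(sig[band * r:(band + 1) * r])  # this band's portion of the column
            buckets[key].append(doc_id)
        for docs in buckets.values():
            # Every pair sharing a bucket in at least one band is a candidate.
            candidates.update(combinations(sorted(docs), 2))
    return candidates


# Hypothetical signatures of length 4, split into b = 2 bands of r = 2 rows.
signatures = {
    "d1": [1, 2, 3, 4],
    "d2": [1, 2, 9, 9],  # agrees with d1 in the first band only
    "d3": [7, 7, 7, 7],
}
print(lsh_candidates(signatures, b=2, r=2))  # {('d1', 'd2')}
```

Candidate pairs would still need to be checked against the similarity threshold afterwards, since sharing one band does not guarantee high similarity.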
Simplifying Assumption
• There are enough buckets that columns are unlikely to hash to the same bucket unless they are identical in a particular band
• Hereafter, we assume that "same bucket" means "identical in that band"
• Assumption needed only to simplify analysis, not for correctness of algorithm
Example of Bands
Assume the following case:
• Suppose 100,000 columns of M (100k docs)
• Signatures of 100 integers (rows)
• Therefore, signatures take 40 MB (assuming 4-byte integers: 100,000 × 100 × 4 bytes)
• Choose b = 20 bands of r = 5 integers/band
• Goal: Find pairs of documents that are at least s = 0.8 similar
C1, C2 are 80% Similar
• Find pairs of ≥ s = 0.8 similarity, set b = 20, r = 5
• Assume: sim(C1, C2) = 0.8
  – Since sim(C1, C2) ≥ s, we want C1, C2 to be a candidate pair: We want them to hash to at least 1 common bucket (at least one band is identical)
• Probability C1, C2 identical in one particular band: (0.8)^5 = 0.328
• Probability C1, C2 are not identical in any of the 20 bands: (1 - 0.328)^20 = 0.00035
  – i.e., about 1/3000th of the 80%-similar column pairs are false negatives (we miss them)
  – We would find 99.965% pairs of truly similar documents
C1, C2 are 30% Similar
• Find pairs of ≥ s = 0.8 similarity, set b = 20, r = 5
• Assume: sim(C1, C2) = 0.3
  – Since sim(C1, C2) < s we want C1, C2 to hash to NO common buckets (all bands should be different)
• Probability C1, C2 identical in one particular band: (0.3)^5 = 0.00243
• Probability C1, C2 identical in at least 1 of 20 bands: 1 - (1 - 0.00243)^20 = 0.0474
  – In other words, approximately 4.74% of pairs of docs with similarity 0.3 end up becoming candidate pairs
    • They are false positives since we will have to examine them (they are candidate pairs) but then it will turn out their similarity is below threshold s
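The two worked cases (80% and 30% similar) follow from one formula, the probability that at least one of b bands of r rows is identical. A small sketch, with an assumed function name, that reproduces the arithmetic; results may differ from the quoted figures in the last digit because of rounding.

```python
def candidate_probability(t, r, b):
    """Probability that two columns of similarity t agree in at least one of b bands of r rows."""
    return 1 - (1 - t ** r) ** b


b, r = 20, 5

# 80%-similar pair: chance of being missed (false negative).
print(round(0.8 ** r, 3))                    # 0.328
print(1 - candidate_probability(0.8, r, b))  # roughly 0.00035

# 30%-similar pair: chance of becoming a candidate (false positive).
print(round(0.3 ** r, 5))                    # 0.00243
print(round(candidate_probability(0.3, r, b), 3))  # 0.047
```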
LSH Involves a Tradeoff
• Pick:
  – The number of Min-Hashes (rows of M)
  – The number of bands b, and
  – The number of rows r per band
  to balance false positives/negatives
• Example: If we had only 15 bands of 5 rows, the number of false positives would go down, but the number of false negatives would go up
Analysis of LSH – What We Want
[Figure: ideal step function. x-axis: similarity t = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket; step at the similarity threshold s]
– No chance if t < s
– Probability = 1 if t > s
What 1 Band of 1 Row Gives You
[Figure: x-axis: similarity t = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket]
– Remember: Probability of equal hash-values = similarity
b bands, r rows/band
• Columns C1 and C2 have similarity t
• Pick any band (r rows)
  – Prob. that all rows in band equal = t^r
  – Prob. that some row in band unequal = 1 - t^r
• Prob. that no band identical = (1 - t^r)^b
• Prob. that at least 1 band identical = 1 - (1 - t^r)^b
What b Bands of r Rows Gives You
[Figure: S-curve. x-axis: similarity t = sim(C1, C2) of two sets; y-axis: probability of sharing a bucket]
– t^r: all rows of a band are equal
– 1 - t^r: some row of a band unequal
– (1 - t^r)^b: no bands identical
– 1 - (1 - t^r)^b: at least one band identical
– Threshold: s ~ (1/b)^(1/r)
Example: b = 20; r = 5
• Similarity threshold s
• Prob. that at least 1 band is identical:
  s     1-(1-s^r)^b
  .2    .006
  .3    .047
  .4    .186
  .5    .470
  .6    .802
  .7    .975
  .8    .9996
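The table can be regenerated from the formula 1 - (1 - s^r)^b, together with the approximate threshold (1/b)^(1/r) at which the S-curve rises most steeply. The function names are assumptions made for this sketch.

```python
def candidate_probability(s, r, b):
    """Probability that at least one of b bands of r rows is identical at similarity s."""
    return 1 - (1 - s ** r) ** b


def approx_threshold(r, b):
    """Approximate similarity at which the S-curve rises: (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)


r, b = 5, 20
for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {candidate_probability(s, r, b):.4f}")

print(round(approx_threshold(r, b), 2))  # 0.55
```

For b = 20 and r = 5 the approximate threshold is about 0.55, which matches the table: the probability crosses one half between s = 0.5 and s = 0.6.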
Picking r and b: The S-curve
• Picking r and b to get the best S-curve
  – 50 hash-functions (r = 5, b = 10)
[Figure: S-curve. x-axis: similarity (0 to 1); y-axis: prob. sharing a bucket (0 to 1). Blue area: false negative rate; green area: false positive rate]
LSH Summary
• Tune M, b, r to get almost all pairs with similar signatures, but eliminate most pairs that do not have similar signatures
• Check in main memory that candidate pairs really do have similar signatures
• Optional: In another pass through data, check that the remaining candidate pairs really represent similar documents
Summary: 3 Steps
• Shingling: Convert documents to sets
  – We used hashing to assign each shingle an ID
• Min-Hashing: Convert large sets to short signatures, while preserving similarity
  – We used similarity preserving hashing to generate signatures with property Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
  – We used hashing to get around generating random permutations
• Locality-Sensitive Hashing: Focus on pairs of signatures likely to be from similar documents
  – We used hashing to find candidate pairs of similarity ≥ s
GENERALIZATION OF LSH