New CS60021: Scalable Data Mining Similarity Search and...

CS60021:ScalableDataMining

SimilaritySearchandHashing

SourangshuBha>acharya

FindingSimilarItems

DistanceMeasures¡  Goal:Findnear-neighborsinhigh-dim.space

–  Weformallydefine“nearneighbors”aspointsthatarea“smalldistance”apart

•  ForeachapplicaGon,wefirstneedtodefinewhat“distance”means

•  Today:Jaccarddistance/similarity–  TheJaccardsimilarityoftwosetsisthesizeoftheirintersecGon

dividedbythesizeoftheirunion:sim(C1,C2)=|C1∩C2|/|C1∪C2|

–  Jaccarddistance:d(C1,C2)=1-|C1∩C2|/|C1∪C2|

3J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,h>p://

www.mmds.org

3 in intersection 8 in union Jaccard similarity= 3/8 Jaccard distance = 5/8

Task:FindingSimilarDocuments

J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,h>p://

www.mmds.org4

3EssenGalStepsforSimilarDocs

1.   Shingling:Convertdocumentstosets

2.   Min-Hashing:Convertlargesetstoshortsignatures,whilepreservingsimilarity

3.   Locality-Sensi:veHashing:Focusonpairsofsignatureslikelytobefromsimilardocuments

–  Candidatepairs!


www.mmds.org5

TheBigPicture

6

Shingling Docu- ment

The set of strings of length k that appear in the document

Min Hashing

Signatures: short integer vectors that represent the sets, and reflect their similarity

Locality- Sensitive Hashing

Candidate pairs: those pairs of signatures that we need to test for similarity


www.mmds.org

Shingling

Step1:Shingling:Convertdocumentstosets

ShinglingDocu-ment

Thesetofstringsoflengthkthatappearinthedoc-ument

DocumentsasHigh-DimData

•  Step1:Shingling:Convertdocumentstosets

•  Simpleapproaches:– Document=setofwordsappearingindocument– Document=setof“important”words– Don’tworkwellforthisapplicaGon.Why?

•  Needtoaccountfororderingofwords!•  Adifferentway:Shingles!


www.mmds.org8

Define:Shingles•  Ak-shingle(ork-gram)foradocumentisasequenceofktokensthatappearsinthedoc–  Tokenscanbecharacters,wordsorsomethingelse,dependingontheapplicaGon

–  Assumetokens=charactersforexamples

•  Example:k=2;documentD1=abcabSetof2-shingles:S(D1)={ab,bc,ca}–  OpOon:Shinglesasabag(mulGset),countabtwice:S’(D1)={ab,bc,ca, ab}


www.mmds.org9

CompressingShingles•  Tocompresslongshingles,wecanhashthemto(say)4bytes•  Representadocumentbythesetofhashvaluesofitsk-

shingles–  Idea:Twodocumentscould(rarely)appeartohaveshinglesin

common,wheninfactonlythehash-valueswereshared

•  Example:k=2;documentD1=abcabSetof2-shingles:S(D1)={ab,bc,ca}Hashthesingles:h(D1)={1,5,7}


www.mmds.org10

SimilarityMetricforShingles•  DocumentD1isasetofitsk-shinglesC1=S(D1)•  Equivalently,eachdocumentisa

0/1vectorinthespaceofk-shingles–  Eachuniqueshingleisadimension–  Vectorsareverysparse

•  AnaturalsimilaritymeasureistheJaccardsimilarity: sim(D1,D2)=|C1∩C2|/|C1∪C2|

11

WorkingAssumpGon

•  Documentsthathavelotsofshinglesincommonhavesimilartext,evenifthetextappearsindifferentorder

•  Caveat:Youmustpickklargeenough,ormostdocumentswillhavemostshingles–  k=5isOKforshortdocuments–  k=10isbe>erforlongdocuments


www.mmds.org12

MoGvaGonforMinhash/LSH


www.mmds.org13

MinHashing

Step2:Minhashing:Convertlargesetstoshortsignatures,whilepreservingsimilarity

ShinglingDocu-ment


Min-Hash-ing

Signatures:shortintegervectorsthatrepresentthesets,andreflecttheirsimilarity

EncodingSetsasBitVectors•  Manysimilarityproblemscanbe

formalizedasfindingsubsetsthathavesignificantintersecOon

•  Encodesetsusing0/1(bit,boolean)vectors–  Onedimensionperelementintheuniversalset

•  InterpretsetintersecGonasbitwiseAND,andsetunionasbitwiseOR

•  Example:C1=10111;C2=10011–  SizeofintersecGon=3;sizeofunion=4,–  Jaccardsimilarity(notdistance)=3/4–  Distance:d(C1,C2)=1–(Jaccardsimilarity)=1/4


www.mmds.org15

FromSetstoBooleanMatrices•  Rows=elements(shingles)•  Columns=sets(documents)

–  1inroweandcolumnsifandonlyifeisamemberofs

–  ColumnsimilarityistheJaccardsimilarityofthecorrespondingsets(rowswithvalue1)

–  Typicalmatrixissparse!

•  Eachdocumentisacolumn:–  Example:sim(C1,C2)=?

•  SizeofintersecGon=3;sizeofunion=6,Jaccardsimilarity(notdistance)=3/6

•  d(C1,C2)=1–(Jaccardsimilarity)=3/6


www.mmds.org

01010111

1001

1000

10101011

0111Documents

Shi

ngle

s

Outline:FindingSimilarColumns•  Sofar:

– Documents→Setsofshingles– Representsetsasbooleanvectorsinamatrix

•  Nextgoal:FindsimilarcolumnswhilecompuOngsmallsignatures– Similarityofcolumns==similarityofsignatures


www.mmds.org17

Outline:FindingSimilarColumns•  NextGoal:Findsimilarcolumns,Smallsignatures•  Naïveapproach:

–  1)Signaturesofcolumns:smallsummariesofcolumns–  2)Examinepairsofsignaturestofindsimilarcolumns

•  EssenOal:SimilariGesofsignaturesandcolumnsarerelated

–  3)OpOonal:Checkthatcolumnswithsimilarsignaturesarereallysimilar

•  Warnings:–  ComparingallpairsmaytaketoomuchGme:JobforLSH

•  ThesemethodscanproducefalsenegaGves,andevenfalseposiGves(iftheopGonalcheckisnotmade)


www.mmds.org18

HashingColumns(Signatures)•  Keyidea:“hash”eachcolumnCtoasmallsignatureh(C),such

that:–  (1)h(C)issmallenoughthatthesignaturefitsinRAM–  (2)sim(C1,C2)isthesameasthe“similarity”ofsignaturesh(C1)andh(C2)

• 

•  Goal:FindahashfuncOonh(·)suchthat:–  Ifsim(C1,C2)ishigh,thenwithhighprob.h(C1)=h(C2)–  Ifsim(C1,C2)islow,thenwithhighprob.h(C1)≠h(C2)

• 

•  Hashdocsintobuckets.Expectthat“most”pairsofnearduplicatedocshashintothesamebucket!


www.mmds.org19

Min-Hashing•  Goal:FindahashfuncOonh(·)suchthat:

–  ifsim(C1,C2)ishigh,thenwithhighprob.h(C1)=h(C2)–  ifsim(C1,C2)islow,thenwithhighprob.h(C1)≠h(C2)

•  Clearly,thehashfuncOondependsonthesimilaritymetric:–  NotallsimilaritymetricshaveasuitablehashfuncGon

•  ThereisasuitablehashfuncOonfortheJaccardsimilarity:ItiscalledMin-Hashing


www.mmds.org20

21

Min-Hashing•  Imaginetherowsofthebooleanmatrixpermutedunder

randompermutaOonπ

•  Definea“hash”funcOonhπ(C)=theindexofthefirst(inthepermutedorderπ)rowinwhichcolumnChasvalue1: hπ (C) = minπ π(C)

•  Useseveral(e.g.,100)independenthashfuncGons(thatis,permutaGons)tocreateasignatureofacolumn


www.mmds.org

22

Min-HashingExample

3

4

7

2

6

1

5

SignaturematrixM

1212

5

7

6

3

1

2

4

1412

4

5

1

6

7

3

2

2121


www.mmds.org

2nd element of the permutation is the first to map to a 1

4th element of the permutation is the first to map to a 1

0101

0101

1010

1010

10101001

0101

Inputmatrix(ShinglesxDocuments)PermutaOonπ

TheMin-HashProperty•  ChoosearandompermutaOonπ•  Claim:Pr[hπ(C1)=hπ(C2)]=sim(C1,C2)•  Why?

–  LetXbeadoc(setofshingles),y∈Xisashingle–  Then:Pr[π(y)=min(π(X))]=1/|X|

•  Itisequallylikelythatanyy∈Xismappedtotheminelement

–  Letybes.t.π(y)=min(π(C1∪C2))–  Theneither: π(y)=min(π(C1))ify∈C1,or π(y)=min(π(C2))ify∈C2

–  Sotheprob.thatbotharetrueistheprob.y∈C1∩C2–  Pr[min(π(C1))=min(π(C2))]=|C1∩C2|/|C1∪C2|=sim(C1,C2)


www.mmds.org23

01

10

00

1100

00

One of the two cols had to have 1 at position y

FourTypesofRows•  GivencolsC1andC2,rowsmaybeclassifiedas:

C1 C2 A 1 1 B 1 0 C 0 1 D 0 0

–  a=#rowsoftypeA,etc.•  Note:sim(C1,C2)=a/(a+b+c)•  Then:Pr[h(C1)=h(C2)]=Sim(C1,C2)

–  LookdownthecolsC1andC2unGlweseea1–  Ifit’satype-Arow,thenh(C1)=h(C2)

Ifatype-Bortype-Crow,thennot


www.mmds.org

25

SimilarityforSignatures

•  Weknow:Pr[hπ(C1)=hπ(C2)]=sim(C1,C2)•  NowgeneralizetomulGplehashfuncGons

•  ThesimilarityoftwosignaturesisthefracOonofthehashfuncOonsinwhichtheyagree

•  Note:BecauseoftheMin-Hashproperty,thesimilarityofcolumnsisthesameastheexpectedsimilarityoftheirsignatures


www.mmds.org

26

Min-HashingExample


www.mmds.org

SimilariOes:1-32-41-23-4Col/Col0.750.7500Sig/Sig0.671.0000

SignaturematrixM

1212

5

7

6

3

1

2

4

1412

4

5

1

6

7

3

2

2121

0101

0101

1010

1010

10101001

0101

Inputmatrix(ShinglesxDocuments)

3

4

7

2

6

1

5

PermutaOonπ

Min-HashSignatures


www.mmds.org27

ImplementaGonTrick•  PermuOngrowsevenonceisprohibiOve•  Rowhashing!

–  PickK=100hashfuncGonski–  OrderingunderkigivesarandomrowpermutaGon!

•  One-passimplementaOon–  ForeachcolumnCandhash-func.kikeepa“slot”forthemin-hashvalue

–  IniGalizeallsig(C)[i]=∞–  Scanrowslookingfor1s

•  Supposerowjhas1incolumnC•  Thenforeachki:

–  Ifki(j)<sig(C)[i],thensig(C)[i]←ki(j)J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,h>p://

www.mmds.org28

How to pick a random hash function h(x)? Universal hashing: ha,b(x)=((a·x+b) mod p) mod N where: a,b … random integers p … prime number (p > N)

LocalitySensiGveHashingStep3:Locality-Sensi:veHashing:

Focusonpairsofsignatureslikelytobefromsimilardocuments

ShinglingDocu-ment


Min-Hash-ing

Signatures:shortintegervectorsthatrepresentthesets,andreflecttheirsimilarity

Locality-SensiGveHashing

Candidatepairs:thosepairsofsignaturesthatweneedtotestforsimilarity

LSH:FirstCut•  Goal:FinddocumentswithJaccardsimilarityatleasts(forsomesimilaritythreshold,e.g.,s=0.8)

•  LSH–Generalidea:UseafuncGonf(x,y)thattellswhetherxandyisacandidatepair:apairofelementswhosesimilaritymustbeevaluated

•  ForMin-Hashmatrices:– HashcolumnsofsignaturematrixMtomanybuckets– Eachpairofdocumentsthathashesintothesamebucketisacandidatepair


www.mmds.org

CandidatesfromMin-Hash

•  Pickasimilaritythresholds(0<s<1)

•  ColumnsxandyofMareacandidatepairiftheirsignaturesagreeonatleastfracGonsoftheirrows:M(i,x)=M(i,y)foratleastfrac.svaluesofi– Weexpectdocumentsxandytohavethesame(Jaccard)similarityastheirsignatures


www.mmds.org31

LSHforMin-Hash

•  Bigidea:HashcolumnsofsignaturematrixMseveralOmes

•  Arrangethat(only)similarcolumnsarelikelytohashtothesamebucket,withhighprobability

•  Candidatepairsarethosethathashtothesamebucket


www.mmds.org32

ParGGonMintobBands


www.mmds.org33

SignaturematrixM

rrowsperband

bbands

Onesignature

ParGGonMintoBands•  DividematrixMintobbandsofrrows

•  Foreachband,hashitsporGonofeachcolumntoahashtablewithkbuckets– Makekaslargeaspossible

•  Candidatecolumnpairsarethosethathashtothesamebucketfor≥1band

•  Tunebandrtocatchmostsimilarpairs,butfewnon-similarpairs


www.mmds.org

MatrixM

rrows bbands

BucketsColumns 2 and 6 are probably identical (candidate pair)

Columns 6 and 7 are surely different.

HashingBands


www.mmds.org35

SimplifyingAssumpGon

•  ThereareenoughbucketsthatcolumnsareunlikelytohashtothesamebucketunlesstheyareidenOcalinaparGcularband

•  Hereauer,weassumethat“samebucket”means“idenOcalinthatband”

•  AssumpGonneededonlytosimplifyanalysis,notforcorrectnessofalgorithm


www.mmds.org

ExampleofBands

Assumethefollowingcase:•  Suppose100,000columnsofM(100kdocs)•  Signaturesof100integers(rows)•  Therefore,signaturestake40Mb•  Chooseb=20bandsofr=5integers/band

•  Goal:Findpairsofdocumentsthatareatleasts=0.8similar


www.mmds.org37

C1,C2are80%Similar•  Findpairsof≥s=0.8similarity,setb=20,r=5•  Assume:sim(C1,C2)=0.8

–  Sincesim(C1,C2)≥s,wewantC1,C2tobeacandidatepair:Wewantthemtohashtoatleast1commonbucket(atleastonebandisidenGcal)

•  ProbabilityC1,C2idenOcalinoneparOcularband:(0.8)5=0.328

•  ProbabilityC1,C2arenotsimilarinallofthe20bands:(1-0.328)20=0.00035–  i.e.,about1/3000thofthe80%-similarcolumnpairsarefalsenegaOves(wemissthem)

– Wewouldfind99.965%pairsoftrulysimilardocuments38


www.mmds.org

C1,C2are30%Similar•  Findpairsof≥s=0.8similarity,setb=20,r=5•  Assume:sim(C1,C2)=0.3

–  Sincesim(C1,C2)<swewantC1,C2tohashtoNOcommonbuckets(allbandsshouldbedifferent)

•  ProbabilityC1,C2idenOcalinoneparOcularband:(0.3)5=0.00243

•  ProbabilityC1,C2idenGcalinatleast1of20bands:1-(1-0.00243)20=0.0474–  Inotherwords,approximately4.74%pairsofdocswithsimilarity0.3%endupbecomingcandidatepairs

•  TheyarefalseposiOvessincewewillhavetoexaminethem(theyarecandidatepairs)butthenitwillturnouttheirsimilarityisbelowthresholdsJ.Leskovec,A.Rajaraman,J.Ullman:

MiningofMassiveDatasets,h>p://www.mmds.org

39

LSHInvolvesaTradeoff•  Pick:

– ThenumberofMin-Hashes(rowsofM)– Thenumberofbandsb,and– Thenumberofrowsrperband

tobalancefalseposiGves/negaGves

•  Example:Ifwehadonly15bandsof5rows,thenumberoffalseposiGveswouldgodown,butthenumberoffalsenegaGveswouldgoup


www.mmds.org

AnalysisofLSH–WhatWeWant

Similarity t =sim(C1, C2) of two sets

Probability of sharing a bucket

Sim

ilarit

y th

resh

old

s

No chance if t < s

Probability = 1 if t > s


www.mmds.org

What1Bandof1RowGivesYou


www.mmds.org42

Remember: Probability of equal hash-values = similarity

Similarity t =sim(C1, C2) of two sets


bbands,rrows/band

•  ColumnsC1andC2havesimilarityt•  Pickanyband(rrows)

– Prob.thatallrowsinbandequal=tr– Prob.thatsomerowinbandunequal=1-tr

•  Prob.thatnobandidenGcal=(1-tr)b

•  Prob.thatatleast1bandidenGcal= 1-(1-tr)b


www.mmds.org

WhatbBandsofrRowsGivesYou

t r

All rows of a band are equal

1 -

Some row of a band unequal

( )b

No bands identical

1 -

At least one band identical

s ~ (1/b)1/r


www.mmds.org

Similarity t=sim(C1, C2) of two sets


Example:b=20;r=5•  Similaritythresholds•  Prob.thatatleast1bandisidenOcal:


www.mmds.org45

s 1-(1-sr)b

.2 .006

.3 .047

.4 .186

.5 .470

.6 .802

.7 .975

.8 .9996

Pickingrandb:TheS-curve•  PickingrandbtogetthebestS-curve

– 50hash-funcGons(r=5,b=10)


www.mmds.org46

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Bluearea:FalseNegaGverateGreenarea:FalsePosiGverate

Similarity

Prob

.sharin

gabu

cket

LSHSummary•  TuneM,b,rtogetalmostallpairswithsimilarsignatures,buteliminatemostpairsthatdonothavesimilarsignatures

•  Checkinmainmemorythatcandidatepairsreallydohavesimilarsignatures

•  OpOonal:Inanotherpassthroughdata,checkthattheremainingcandidatepairsreallyrepresentsimilardocuments


www.mmds.org

Summary:3Steps•  Shingling:Convertdocumentstosets

–  WeusedhashingtoassigneachshingleanID

•  Min-Hashing:Convertlargesetstoshortsignatures,whilepreservingsimilarity–  Weusedsimilaritypreservinghashingtogeneratesignatureswith

propertyPr[hπ(C1)=hπ(C2)]=sim(C1,C2)–  WeusedhashingtogetaroundgeneraGngrandompermutaGons

•  Locality-SensiOveHashing:Focusonpairsofsignatureslikelytobefromsimilardocuments–  Weusedhashingtofindcandidatepairsofsimilarity≥s


www.mmds.org48

GENERALIZATIONOFLSH

New CS60021: Scalable Data Mining Similarity Search and...

Documents

Transcript of New CS60021: Scalable Data Mining Similarity Search and...