Efficient Exact Set-Similarity Joins

Arvind Arasu, Venkatesh Ganti, Raghav Kaushik
DMX Group, Microsoft Research


Transcript of Efficient Exact Set-Similarity Joins

Page 1: Efficient Exact Set-Similarity Joins

Efficient Exact Set-Similarity Joins

Arvind Arasu, Venkatesh Ganti, Raghav Kaushik

DMX Group, Microsoft Research

Page 2: Efficient Exact Set-Similarity Joins

Sept. 15, 2006    Set-Similarity Joins    2

Data Cleaning

Name          Street               City       State  Zip
INGRAM MICRO  1600 ST ANDREWS PL   SANTA ANA  CA     92799
GTE CORP      1 STAMFORD FORUM     STAMFORD   CT
LOGISOFT      274 GOODMAN ST N     ROCHESTER         14607
CIEDC         1800 5TH ST          LINCONL    IL     92799
INGRAM MCRO   1600 ST ANDREW’S PL  SANTA ANA  CA     92799

Page 6: Efficient Exact Set-Similarity Joins

Data Cleaning

Name          Street              City       State  Zip
INGRAM MICRO  1600 ST ANDREWS PL  SANTA ANA  CA     92799
GTE CORP      1 STAMFORD FORUM    STAMFORD   CT     06901
LOGISOFT      274 GOODMAN ST N    ROCHESTER  NY     14607
CIEDC         1800 5TH ST         LINCOLN    IL     92799

Page 7: Efficient Exact Set-Similarity Joins

String Similarity Join

Reference Table

CITY
ALABASTER
ALBERTVILLE
…
LINCOLN
…
YUCAIPA

Input table

…  …  City     …  …
…  …  LINCONL  …  …
…  …  …        …  …

Page 8: Efficient Exact Set-Similarity Joins

String Similarity (Self) Join

Name          Street               City       State  Zip
INGRAM MICRO  1600 ST ANDREWS PL   SANTA ANA  CA     92799
GTE CORP      1 STAMFORD FORUM     STAMFORD   CT
LOGISOFT      274 GOODMAN ST N     ROCHESTER         14607
CIEDC         1800 5TH ST          LINCONL    IL     92799
INGRAM MCRO   1600 ST ANDREW’S PL  SANTA ANA  CA     92799

Page 9: Efficient Exact Set-Similarity Joins

Strings → Sets [CGK ’06]

microsoft → { mi, ic, cr, ro, os, so, of, ft }
mcrosoft → { mc, cr, ro, os, so, of, ft }

2-grams

(edit distance ≤ 1) ----> (Δ ≤ 4)
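
The slide's example can be checked directly by building the 2-gram sets and taking their symmetric difference. This is only a minimal sketch; the helper name `qgrams` is hypothetical and does not come from the talk.

```python
def qgrams(s, q=2):
    """Return the set of all length-q substrings (q-grams) of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

a = qgrams("microsoft")
b = qgrams("mcrosoft")

# Elements that occur in exactly one of the two sets.
delta = a ^ b
print(sorted(delta), len(delta) <= 4)
```

A single character edit touches at most two 2-grams in each string, which is where the bound of 4 comes from.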

Page 10: Efficient Exact Set-Similarity Joins

[Figure: string tables R and S, containing mcrosoft and microsoft, combined by a String Sim Join with predicate edit distance ≤ 1. The figure is divided into a Strings level and a Sets level.]

Page 11: Efficient Exact Set-Similarity Joins

[Figure: the same join routed through the Sets level. String tables R and S (containing mcrosoft and microsoft) → Tokenize → Set Sim Join with Δ ≤ 4 → Post-Process.]
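
The tokenize, set-join, post-process pipeline in this figure could look roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the set join is a plain nested loop rather than the signature-based algorithms of the talk, and the threshold 2 * q * max_edits generalizes the slide's "edit distance ≤ 1 gives Δ ≤ 4" for 2-grams.

```python
from itertools import product

def qgrams(s, q=2):
    """Tokenize: the set of length-q substrings of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def string_sim_join(R, S, max_edits=1, q=2):
    # Each edit changes at most q q-grams in each string.
    k = 2 * q * max_edits
    # Set similarity join (naive stand-in): keep pairs with small symmetric difference.
    candidates = [(r, s) for r, s in product(R, S)
                  if len(qgrams(r, q) ^ qgrams(s, q)) <= k]
    # Post-process: verify the real string predicate on the candidates.
    return [(r, s) for r, s in candidates if edit_distance(r, s) <= max_edits]

print(string_sim_join(["mcrosoft", "logisoft"], ["microsoft", "lgisoft"]))
```

The set predicate is only a necessary condition for the string predicate, so the post-processing step is what makes the final answer exact.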

Page 12: Efficient Exact Set-Similarity Joins

String → Set: Advantages

Generalizes to many string similarity funcs
    Powerful primitive
Sets ≈ Relations
    Leverage relational data processing
    [CGK ’06]

Page 13: Efficient Exact Set-Similarity Joins

Contributions

New algorithms for set-similarity joins
    Exact answers
    Performance guarantees
    Outperform previous exact algorithms
        Orders of magnitude
Exact answers are important for operators

Page 14: Efficient Exact Set-Similarity Joins

Outline

Introduction
Algorithms
Experiments
Conclusion

Page 15: Efficient Exact Set-Similarity Joins

[Figure: two collections of 2-gram sets, labeled R and S]

{ mi, ic, cr, ro, os, so, of, ft }
{ lo, og, gi, is, so, of, ft }
{ … }
{ bo, oe, ei, in, ng }

{ mc, cr, ro, os, so, of, ft }
{ lg, gi, is, so, of, ft }
{ … }

Page 16: Efficient Exact Set-Similarity Joins

[Figure: the collections R and S from the previous slide, with the join condition below]

Intersection size ≥ 5

Page 19: Efficient Exact Set-Similarity Joins

[Figure: the collections R and S, with one matching pair highlighted]

Intersection size ≥ 5

{ mc, cr, ro, os, so, of, ft }
{ mi, ic, cr, ro, os, so, of, ft }

Page 20: Efficient Exact Set-Similarity Joins

[Figure: the collections R and S, with two matching pairs highlighted]

Intersection size ≥ 5

{ mc, cr, ro, os, so, of, ft }
{ mi, ic, cr, ro, os, so, of, ft }

{ lg, gi, is, so, of, ft }
{ lo, og, gi, is, so, of, ft }

Page 21: Efficient Exact Set-Similarity Joins

[Figure: collection R with sets r1, r2, r3, …, rn and collection S with sets s1, s2, s3, …, sm; the matching pairs are highlighted]

Sim (r_i, s_j) ≥ θ

{ mc, cr, ro, os, so, of, ft }
{ mi, ic, cr, ro, os, so, of, ft }

{ lg, gi, is, so, of, ft }
{ lo, og, gi, is, so, of, ft }

Page 22: Efficient Exact Set-Similarity Joins

[Figure: as on the previous slide, with the added annotation "Large"]

Sim (r_i, s_j) ≥ θ

Page 23: Efficient Exact Set-Similarity Joins

Set-Similarity Join: Symmetric Difference

Input:
    R: r_1, r_2, …, r_n (n sets)
    S: s_1, s_2, …, s_m (m sets)
Output: All pairs (r_i, s_j) such that |r_i Δ s_j| ≤ k

Running example: k = 4
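
As a reference point for the definition above, a brute-force join simply tests every pair. This is a sketch, not the talk's algorithm; `naive_join` is a hypothetical name, and the example sets are the 2-gram sets shown on the earlier slides.

```python
def naive_join(R, S, k):
    """Return index pairs (i, j) with |R[i] symmetric-difference S[j]| <= k."""
    return [(i, j)
            for i, r in enumerate(R)
            for j, s in enumerate(S)
            if len(r ^ s) <= k]

R = [{"mi", "ic", "cr", "ro", "os", "so", "of", "ft"},
     {"lo", "og", "gi", "is", "so", "of", "ft"},
     {"bo", "oe", "ei", "in", "ng"}]
S = [{"mc", "cr", "ro", "os", "so", "of", "ft"},
     {"lg", "gi", "is", "so", "of", "ft"}]

print(naive_join(R, S, k=4))
```

It compares all n × m pairs, which is the cost the signature-based approach tries to avoid.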

Page 24: Efficient Exact Set-Similarity Joins

Alternate Set Representation

s = { 4, 10, 13, 24, 29, 35, 41, 46, 48 }

Page 25: Efficient Exact Set-Similarity Joins

Alternate Set Representation

s = { 4, 10, 13, 24, 29, 35, 41, 46, 48 }

[Figure: the elements of s marked along an axis from 1 to 50 (ticks at 1, 25, 50)]
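
One way to read this representation is as a 0/1 vector over the domain 1 to 50, in which case the symmetric difference of two sets is the Hamming distance of their vectors. The sketch below assumes that reading; `to_bits` and `hamming` are hypothetical helper names, and r is an invented example set.

```python
def to_bits(s, n):
    """Length-n 0/1 list; entry e-1 is 1 exactly when element e is in s."""
    return [1 if e in s else 0 for e in range(1, n + 1)]

def hamming(u, v):
    """Number of positions where two equal-length vectors differ."""
    return sum(a != b for a, b in zip(u, v))

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = (s - {13}) | {14}   # hypothetical second set: 13 replaced by 14

print(hamming(to_bits(r, 50), to_bits(s, 50)), len(r ^ s))
```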

Page 29: Efficient Exact Set-Similarity Joins

Enumeration

[Figure: two sets, r and s, shown one above the other]

|r Δ s| ≤ 4

Page 31: Efficient Exact Set-Similarity Joins

Enumeration

[Figure: sets r and s, with the positions where they differ labeled "Errors"]

|r Δ s| ≤ 4

Page 32: Efficient Exact Set-Similarity Joins

Enumeration

[Figure: sets r and s, with the domain divided into partitions numbered 1 to 5]

|r Δ s| ≤ 4

Page 33: Efficient Exact Set-Similarity Joins

Enumeration: Signature Generation

[Figure: set s and its signature set Sig(s), drawn as a set of five entries]

Page 34: Efficient Exact Set-Similarity Joins

Enumeration: Signature Generation

[Figure: set s and its signature set Sig(s), drawn as a set of five entries, each passed through Hash32()]

Sig(s) = { 0x4f72ba91, 0x29c8af10, 0x594b2c17, 0xa3b0e20f, 0xdd21f32a }
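
A signature generator in this spirit might produce one hashed entry per partition. The sketch below rests on several assumptions that the transcript does not confirm: partitions are contiguous ranges of the domain 1..n, each entry is the pair (partition index, elements of s in that partition), and `zlib.crc32` stands in for the slide's Hash32().

```python
import zlib

def partition_signatures(s, n, parts):
    """One 32-bit signature per partition of the domain 1..n."""
    projections = [[] for _ in range(parts)]
    for e in sorted(s):
        # Contiguous, near-equal partitions (an assumption of this sketch).
        projections[(e - 1) * parts // n].append(e)
    return {zlib.crc32(repr((p, proj)).encode())
            for p, proj in enumerate(projections)}

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
print([hex(sig) for sig in sorted(partition_signatures(s, n=50, parts=5))])
```

Including the partition index in the hashed value keeps identical-looking projections from different partitions apart.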

Page 35: Efficient Exact Set-Similarity Joins

Property of Signatures

|r Δ s| ≤ 4 ⇒ Sig(r) ∩ Sig(s) ≠ Φ

[Figure: sets r and s over partitions 1 to 5]

Page 36: Efficient Exact Set-Similarity Joins

Enumeration: Algorithm

Generate signatures for each r_i, s_j
Enumerate (r_i, s_j) s.t. Sig(r_i) ∩ Sig(s_j) ≠ Φ
Output those satisfying |r_i Δ s_j| ≤ 4
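
The three steps can be sketched as a filter-and-verify join. As before, this is an illustration under assumptions rather than the authors' implementation: partitions are contiguous ranges of 1..n, there are k + 1 of them, `zlib.crc32` stands in for the hash function, and all function names are hypothetical.

```python
import zlib
from collections import defaultdict

def signatures(s, n, parts):
    """One hashed (partition index, projection) entry per partition of 1..n."""
    projections = [[] for _ in range(parts)]
    for e in sorted(s):
        projections[(e - 1) * parts // n].append(e)
    return {zlib.crc32(repr((p, proj)).encode())
            for p, proj in enumerate(projections)}

def signature_join(R, S, n, k):
    parts = k + 1  # with at most k differences, some partition is untouched
    # Step 1: generate signatures and index S by signature.
    index = defaultdict(set)
    for j, s in enumerate(S):
        for sig in signatures(s, n, parts):
            index[sig].add(j)
    # Step 2: enumerate pairs that share at least one signature.
    candidates = {(i, j)
                  for i, r in enumerate(R)
                  for sig in signatures(r, n, parts)
                  for j in index.get(sig, ())}
    # Step 3: verify the actual predicate, discarding false positives.
    return sorted((i, j) for i, j in candidates if len(R[i] ^ S[j]) <= k)

R = [{4, 10, 13, 24, 29, 35, 41, 46, 48}, {1, 2, 3}]
S = [{4, 10, 14, 24, 29, 35, 41, 46, 48}, {20, 30, 40}]
print(signature_join(R, S, n=50, k=4))
```

In this toy input, {1, 2, 3} and {20, 30, 40} are both empty on the last partition, so they share a signature and become a candidate pair that the verification step then rejects.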

Page 37: Efficient Exact Set-Similarity Joins

Enumeration

[Figure: sets r1 to r5 with their signature sets Sig(r1) to Sig(r5), and sets s1 to s5 with Sig(s1) to Sig(s5); signature sets are compared by intersection]

Sig(r2) ∩ Sig(s1) ≠ Φ

Page 39: Efficient Exact Set-Similarity Joins

Enumeration

[Figure: sets r1 to r5 with their signature sets Sig(r1) to Sig(r5), and sets s1 to s5 with Sig(s1) to Sig(s5); pairs with intersecting signature sets are marked either "Output" or "False positive candidate pairs"]

Sig(r2) ∩ Sig(s1) ≠ Φ

Page 40: Efficient Exact Set-Similarity Joins

[Figure: query plan, bottom to top]

R (Id, Elem), S (Id, Elem)
Gen Signatures → R’ (Id, Sig), S’ (Id, Sig)
Join on R.Sig = S.Sig
δ R.Id, S.Id
Post-Process each R.Id, S.Id
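
The middle of this plan is an ordinary equi-join followed by what appears to be duplicate elimination (reading δ in its usual relational-algebra sense). A small sketch using Python's built-in sqlite3 module shows the idea; the table names Rp and Sp (for R’ and S’) and the function name are hypothetical.

```python
import sqlite3

def candidate_pairs(r_sigs, s_sigs):
    """r_sigs, s_sigs: lists of (Id, Sig) rows for the signature tables."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Rp (Id INTEGER, Sig INTEGER)")
    con.execute("CREATE TABLE Sp (Id INTEGER, Sig INTEGER)")
    con.executemany("INSERT INTO Rp VALUES (?, ?)", r_sigs)
    con.executemany("INSERT INTO Sp VALUES (?, ?)", s_sigs)
    # Equi-join on the signature, then keep each (R.Id, S.Id) pair once.
    rows = con.execute(
        "SELECT DISTINCT Rp.Id, Sp.Id FROM Rp JOIN Sp ON Rp.Sig = Sp.Sig "
        "ORDER BY Rp.Id, Sp.Id").fetchall()
    con.close()
    return rows

# Set 1 of R shares two signatures with set 7 of S; the pair is reported once.
print(candidate_pairs([(1, 10), (1, 11), (2, 12)], [(7, 10), (7, 11), (8, 99)]))
```

Each surviving pair would still need the post-processing check against the original sets.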

Page 41: Efficient Exact Set-Similarity Joins

No False Positive Candidate Pair

[Figure: sets r and s over partitions 1 to 5]

|r Δ s| = 5

Page 42: Efficient Exact Set-Similarity Joins

False Positive Candidate Pair

[Figure: sets s1 and s2 over partitions 1 to 5]

|r Δ s| = 5

Page 43: Efficient Exact Set-Similarity Joins

Enumeration: Performance

[Figure: plot for k = 4. X axis: Symmetric Difference (1 to 20). Y axis: Probability of Common Signature (0 to 1).]

Page 44: Efficient Exact Set-Similarity Joins

Enumeration: Performance

[Figure: plot for k = 4, with an added curve labeled "Ideal Performance". X axis: Symmetric Difference (1 to 20). Y axis: Probability of Common Signature (0 to 1).]

Page 45: Efficient Exact Set-Similarity Joins

Enumeration

[Figure: two sets, r and s, shown one above the other]

|r Δ s| ≤ 4

Page 46: Efficient Exact Set-Similarity Joins

Enumeration

[Figure: sets r and s, with the domain divided into partitions numbered 1 to 6]

|r Δ s| ≤ 4

Page 47: Efficient Exact Set-Similarity Joins

Enumeration: Signature Generation

[Figure: set s1 over partitions 1 to 6]

Page 51: Efficient Exact Set-Similarity Joins

Enumeration: Signature Generation

[Figure: set s1 over partitions 1 to 6]

(6 choose 2) = 15

Page 52: Efficient Exact Set-Similarity Joins

Algorithm

Generate signatures for each r_i, s_j
Enumerate (r_i, s_j) s.t. Sig(r_i) ∩ Sig(s_j) ≠ Φ
Output those satisfying |r_i Δ s_j| ≤ 4

Only the signature function changes
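
A signature function matching the "(6 choose 2) = 15" count could emit one signature for every way of choosing n1 − k partitions out of n1. The sketch below assumes that reading, along with contiguous partitions of 1..n and `zlib.crc32` as a stand-in hash; `enum_signatures` is a hypothetical name.

```python
import zlib
from itertools import combinations

def enum_signatures(s, n, n1, k):
    """One signature per subset of n1 - k partitions (requires n1 > k)."""
    projections = [[] for _ in range(n1)]
    for e in sorted(s):
        projections[(e - 1) * n1 // n].append(e)
    sigs = set()
    for chosen in combinations(range(n1), n1 - k):
        # The signature covers the chosen partitions and what s contains in them.
        key = tuple((p, tuple(projections[p])) for p in chosen)
        sigs.add(zlib.crc32(repr(key).encode()))
    return sigs

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = (s - {4, 13}) | {5, 14}   # hypothetical set with |r Δ s| = 4
print(len(enum_signatures(s, 50, 6, 4)), bool(enum_signatures(r, 50, 6, 4) & enum_signatures(s, 50, 6, 4)))
```

With at most k differing elements, at least n1 − k partitions are identical in both sets, so the subset made of identical partitions yields a shared signature. Plugging this function into the three-step algorithm leaves the rest unchanged.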

Page 53: Efficient Exact Set-Similarity Joins

Enumeration: Performance

[Figure: plot for k = 4. X axis: Symmetric Difference (1 to 20). Y axis: Prob. of Common Signature (0 to 1). Curves: n1 = 5, n1 = 6.]

Page 54: Efficient Exact Set-Similarity Joins

False Positive Candidate Pair

[Figure: sets r and s over partitions 1 to 6]

|r Δ s| = 5

Page 55: Efficient Exact Set-Similarity Joins

Enumeration: Performance

[Figure: plot for k = 4. X axis: Symmetric Difference (1 to 20). Y axis: Prob. of Common Signature (0 to 1). Curves: n1 = 5, n1 = 6, n1 = 7, n1 = 20.]

Page 56: Efficient Exact Set-Similarity Joins

Enumeration: Performance

[Figure: plot for k = 4. X axis: Symmetric Difference (1 to 20). Y axis: Prob. of Common Signature (0 to 1). Curves: n1 = 5, n1 = 6, n1 = 7, n1 = 20, annotated with the numbers 5, 15, 35 and 4845.]
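
The numbers 5, 15, 35 and 4845 on this slide agree with the binomial coefficients obtained when one signature is generated per choice of n1 − k partitions out of n1, which suggests they are signature counts per set. The check below assumes that interpretation; `signatures_per_set` is a hypothetical name.

```python
from math import comb

def signatures_per_set(n1, k):
    """Number of ways to choose n1 - k partitions out of n1."""
    return comb(n1, n1 - k)

for n1 in (5, 6, 7, 20):
    print(n1, signatures_per_set(n1, k=4))
```

More partitions sharpen the filter but increase the number of signatures quickly, which is the trade-off the curves illustrate.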

Page 57: Efficient Exact Set-Similarity Joins

PartEnum: Divide and Conquer

[Figure: set s1 divided into two partitions, numbered 1 and 2]

k = 4
k1 = 2, k2 = 1

Generate signatures using Enumeration
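
One plausible reading of this slide, which the transcript does not spell out, is that each first-level partition i gets its own error budget k_i, and Enumeration is applied inside each partition with that budget. Since (2 + 1) + (1 + 1) > 4, two sets with at most 4 differences must stay within budget in at least one partition. The sketch below follows that interpretation; the contiguous first level, the modulo-based second level, `zlib.crc32`, the parameter n2 and the function name are all assumptions.

```python
import zlib
from itertools import combinations

def partenum_signatures(s, n, budgets, n2):
    """budgets[i] is the error budget k_i of partition i; needs n2 > max(budgets)."""
    n1 = len(budgets)
    sigs = set()
    for i, ki in enumerate(budgets):
        # First level: elements of s in contiguous partition i of 1..n.
        part = sorted(e for e in s if (e - 1) * n1 // n == i)
        # Second level: split the partition into n2 sub-partitions.
        sub = [[] for _ in range(n2)]
        for e in part:
            sub[e % n2].append(e)
        # Enumeration inside the partition: every choice of n2 - k_i sub-partitions.
        for chosen in combinations(range(n2), n2 - ki):
            key = (i, tuple((j, tuple(sub[j])) for j in chosen))
            sigs.add(zlib.crc32(repr(key).encode()))
    return sigs

s = {4, 10, 13, 24, 29, 35, 41, 46, 48}
r = (s - {4, 29}) | {5, 30}   # hypothetical set: two differences in each half
shared = partenum_signatures(r, 50, [2, 1], 3) & partenum_signatures(s, 50, [2, 1], 3)
print(len(partenum_signatures(s, 50, [2, 1], 3)), bool(shared))
```

In this configuration each set gets at most (3 choose 1) + (3 choose 2) = 6 signatures.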

Page 58: Efficient Exact Set-Similarity Joins

PartEnum: Asymptotic Performance

Theorem: There is an instance of PartEnum such that:
    If |r Δ s| > 7.5 k, then r and s do not share a signature with probability 1 – o(1)
    The number of signatures per set: O(k^2)

Page 59: Efficient Exact Set-Similarity Joins

PartEnum: Summary

Set-Similarity Joins with predicate |r Δ s| ≤ k
Theoretical guarantees
    First exact algorithm

Page 60: Efficient Exact Set-Similarity Joins

Other results

PartEnum extensions:
    Larger class of set-similarity join predicates
        Jaccard
    Basic idea: reduce to symmetric set difference
WtEnum class of signature functions:
    Use frequency of elements
    Weighted set-similarity joins
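
To illustrate the "reduce to symmetric set difference" idea for Jaccard, the following standard bound shows how a Jaccard threshold θ implies a symmetric-difference threshold once the set sizes are known. It is a worked inequality under the usual definition of Jaccard similarity, not necessarily the exact reduction used by the authors.

```latex
J(r, s) = \frac{|r \cap s|}{|r \cup s|} \ge \theta
\;\Longrightarrow\;
|r \,\Delta\, s| = |r \cup s| - |r \cap s| \le (1 - \theta)\,|r \cup s|
\le \frac{1 - \theta}{1 + \theta}\,\bigl(|r| + |s|\bigr)
```

The last step uses |r| + |s| = |r ∪ s| + |r ∩ s| ≥ (1 + θ) |r ∪ s|.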

Page 61: Efficient Exact Set-Similarity Joins

Outline

Introduction
Algorithms
Experiments
Conclusion

Page 62: Efficient Exact Set-Similarity Joins

Implementation

[Figure: query plan, bottom to top, with each stage annotated by where it runs. Annotations in the figure: DBMS, Client + DBMS, DBMS, Client.]

R (Id, Elem), S (Id, Elem)
Gen Signatures
Join on R.Sig = S.Sig
δ R.Id, S.Id
Post-Process each R.Id, S.Id

Page 63: Efficient Exact Set-Similarity Joins

Previous Work

Prefix Filtering [CGK ’06]
    Exact
Locality Sensitive Hashing [IM ’98]
    Approximate
    False negative rate: 5%

Page 64: Efficient Exact Set-Similarity Joins

Data Sets

Organization addresses [MS Sales]
    Concatenation: Org name, street, city, zip
    Input size: 1 million
    Avg. length: 11 words, 58 chars
    Tokenization: Words, n-grams

Page 65: Efficient Exact Set-Similarity Joins

Jaccard, 1M, MS Sales

[Figure: bar chart. Y axis: Seconds (0 to 4000). Bars: PEN, LSH, PF, repeated for 0.8, 0.85 and 0.9. Legend: SigGen, CandPair, PostFilter.]

Page 66: Efficient Exact Set-Similarity Joins

Evaluation

[Figure: query plan, bottom to top, with each stage annotated by where it runs. Annotations in the figure: DBMS, DBMS, Intermediate Result size, Client + DBMS, Client.]

R (Id, Elem), S (Id, Elem)
Gen Signatures
Join on R.Sig = S.Sig
δ R.Id, S.Id
Post-Process each R.Id, S.Id

Page 67: Efficient Exact Set-Similarity Joins

Jaccard, 1M, MS Sales

[Figure: bar chart. Y axis: Intermediate Result Size (0.00E+00 to 2.50E+08). Bars: PEN, LSH, PF, repeated for 0.8, 0.85 and 0.9.]

Page 68: Efficient Exact Set-Similarity Joins

Jaccard, Synthetic

[Figure: plot with logarithmic axes. X axis: Input Size (1.0E+03 to 1.0E+09). Y axis: Intermediate Result Size (1.0E+03 to 1.0E+11). Series: LSH(0.95), PEN, PF.]

Page 69: Efficient Exact Set-Similarity Joins

Similar Results for …

Other data sets
    DBLP, Synthetic data sets
Other similarity functions
    Weighted Jaccard
    Edit distance

Page 70: Efficient Exact Set-Similarity Joins

Conclusion

New algorithms for set-similarity joins
    Exact
    Performance guarantees
    Outperform previous exact algorithms

Search: “data cleaning project”