Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares...

30
Efficient Parallel Set- Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica

Transcript of Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares...

Page 1: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Efficient Parallel Set-Similarity Joins Using Hadoop

Chen Li

Joint work with

Michael Carey and Rares Vernica

Page 2: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

2

Motivation: Data Cleaning

Star Title Year GenreKeanu Reeves The Matrix 1999 Sci-FiTom Hanks Toy Story 3 2010 AnimationSchwarzenegger The Terminator 1984 Sci-FiSamuel Jackson The man 2006 Crime

Find movies starring Tom Hanks

Page 3: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

3

Movies starring S..warz…ne…ger?

Star Title Year GenreKeanu Reeves The Matrix 1999 Sci-FiTom Hanks Toy Story 3 2010 AnimationSchwarzenegger The Terminator 1984 Sci-FiSamuel Jackson The man 2006 Crime

Page 4: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

4

Similarity Search

Star Title Year GenreKeanu Reeves The Matrix 1999 Sci-Fi

Samuel Jackson Iron man 2008 Sci-Fi

Schwarzenegger The Terminator 1984 Sci-Fi

Samuel Jackson The man 2006 Crime

Find movies with a star “similar to” Schwarrzenger.

Page 5: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

5

Record linkage

Star

Keanu Reeves

Samuel Jackson

Schwarzenegger

Table R Table S

Star

Keanu Reeves

Samuel L. Jackson

Schwarzenegger

Page 6: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Step 2: Verification

6

Two-step solution

Star

Table R Table S

Star

Step 1:

Similarity Join

Page 7: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Similarity join for large data sets

Techniques applicable to other domains, e.g.:› Finding similar documents

› Finding customers with similar patterns

7

Focus of this talk

Page 8: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Formulation: set-similarity join

Hadoop-based solutions

Experiments

More results: see SIGMOD2010 paper

8

Talk Outline

Page 9: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

9

Set-Similarity Join

Finding pairs of records with a similarity on their join attributes > t

Page 10: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

10

Why this formulation?

“Samuel L. Jackson” {Samuel, L., Jackson}“Samuel Jackson” {Samuel, Jackson}

Word tokens:

Gram tokens:S c h w a r z e n e g g e r

Page 11: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Set-similarity functions

Jaccard Dice Cosine Hamming …

All solvable in this framework

11

tBA

BABAJaccard

),(

Page 12: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Formulation of set-similarity join

Hadoop-based solutions

Experiments

12

Talk Outline

Page 13: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Large amounts of data

Data or processing does not fit in one machine

Assumptions: › Self join: R = S

› Two similar sets share at least 1 token

13

Why Hadoop?

Page 14: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Map: <23, (a,b,c)> (a, 23), (b, 23), (c, 23)

14

A naïve solution

Too much data to transfer Too many pairs to verify .

Reduce:(a,23),(a,29),(a,50), … Verify each pair

Page 15: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Solving frequency skew: prefix filtering

prefix

r1

r2

15

Prefixes of similar sets should share tokens

Sort tokens by frequency (ascending)

Prefix of a set: least frequent tokens

Sorted by frequency

Chaudhuri, Ganti, Kaushik: A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006: 5

Page 16: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Prefix filtering: example

16

Each set has 5 tokens

“Similar”: they share at least 4 tokens

Prefix length: 2

Record 1

Record 2

Page 17: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Stage 1: Order tokens by frequency

Stage 2: Finding “similar” id pairs

Stage 3: id pairs record paris

17

Hadoop Solution: Overview

Page 18: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

18

Stage 1: Sort tokens by frequency

Compute token frequencies Sort them

MapReduce phase 1 MapReduce phase 2

Page 19: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

19

Stage 2: Find “similar” id pairs

Partition using prefixes Verify similarity

Page 20: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

20

Stage 3: id pairs record pairs (phase 1)

Bring records for each id in each pair

Page 21: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

21

Stage 3: id pairs record pairs (phase 2)

Join two half filled records

Page 22: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Formulation of set-similarity join

Hadoop-based solutions

Experiments

22

Talk Outline

Page 23: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Hardware

› 10-node IBM x3650 cluster

› Intel Xeon processor E5520 2.26GHz with four cores

› Four 300GB hard disks

› 12GB RAM

Software

› Ubuntu 9.06, 64-bit, server edition OS

› Java 1.6, 64-bit, server

› Hadoop 0.20.1

Datasets: publications (DBLP and CITESEERX)

23

Experimental Setting

Page 24: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

24

Running time

Stage 2

Stage 1

Stage 3

Page 25: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

25

Speedup

Page 26: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

26

Speedup Breakdown

Stage 2 has good speedup

Page 27: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

27

Scaleup

Good scaleup

Page 28: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Other methods for the 3 stages

Case: R <> S

Dealing with limited memory

28

Additional results

Page 29: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Set-similarity joins in Hadoop:

Three-stage approach using Hadoop

Experimental study

29

Summary

Page 30: Efficient Parallel Set-Similarity Joins Using Hadoop Chen Li Joint work with Michael Carey and Rares Vernica.

Thank you

Chen Li @ UC Irvine

Source code available at:

http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

Acknowledgements:

NSF, Google, IBM.