A Comparison of Join Algorithms for Log Processing in MapReduce


Spyros Blanas, Jignesh M. Patel (University of Wisconsin-Madison)
Eugene J. Shekita, Yuanyuan Tian (IBM Almaden Research Center)

SIGMOD 2010

August 1, 2010
Presented by Hyojin Song

Contents

Introduction

Join Algorithms In MapReduce

Experimental Evaluation

Discussion

Conclusion


Introduction (1/3): Log Processing

– An important type of data analysis commonly done with MapReduce

– A log of events, such as
   • a click-stream
   • a log of phone call records
   • a sequence of transactions

– Logs are processed to compute various statistics for business insight
   • filtered
   • aggregated
   • mined for patterns

– Log data often needs to be joined with reference data (e.g., user information)

Log Table

Call record             Number
2010.09.24.14:20.30     01191655603
2010.09.24.14:30.45     01046841397
2010.09.25.19:11.118    01926540846
2010.09.28.06:40.97     01098446512
2010.09.29.08:44.08     01013461655
……                      ……

Reference Table

Number         Name
01191655603    송효진
01046841397    안철수
01926540846    한효주
01098446512    안인석
01013461655    마음이
……             ……

Introduction (2/3): MapReduce Framework

– Used to analyze large volumes of data

– Reasons for the success of MapReduce
   • A simple programming framework
   • The framework manages parallelization, fault tolerance, and load balancing

– Criticisms of MapReduce
   • Lack of a schema
   • Lack of a declarative query language
   • Lack of indexes

– Joins are difficult
   • MapReduce was not originally designed to combine information from several data sources
   • Users fall back on simple but inefficient algorithms to perform joins

Introduction (3/3): The Benefits of MapReduce for Log Processing

– Scalability
   • China Mobile gathers 5-8TB of phone call records per day
   • Facebook collects almost 6TB of new log data every day, with 1.7PB in total

– Schema-free flexibility
   • The format of a log record may change over time

– Simple scans are preferable (in contrast to index scans)

– Time-consuming jobs get graceful fault tolerance support (in contrast to parallel RDBMSs)

The goal of this paper
– The implementation of several well-known join strategies in MapReduce
– Comprehensive experiments to compare these join techniques


Join Algorithms in MapReduce
Problem Statement
1. Repartition Join
2. Improved Repartition Join
3. Directed Join
4. Broadcast Join
5. Semi-Join
6. Per-Split Semi-Join

Join Algorithms in MR: Problem Statement

An equi-join between a log table L and a reference table R on a single column, with |L| >> |R|

Join performance can be further improved with some preprocessing techniques
– The join strategies are well known in the RDBMS literature
– Adapting them to MapReduce is not always straightforward
– The implementation details of these join algorithms are crucial

Two additional functions are implemented: init() and close()
– These are called before and after each map or reduce task
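
The role of init() and close() can be pictured with a small Python sketch. This is not Hadoop code: `CountingTask` and `run_task` are hypothetical names, used only to show that init() runs once before a task processes its records and close() runs once afterwards.

```python
class CountingTask:
    """Hypothetical task with an init() / map() / close() life cycle."""

    def init(self):
        # Called once before the first record, e.g. to load a hash table.
        self.seen = 0
        self.events = ["init"]

    def map(self, record):
        # Called once per input record.
        self.seen += 1
        self.events.append("map")
        return record

    def close(self):
        # Called once after the last record, e.g. to flush buffered output.
        self.events.append("close")


def run_task(task, records):
    """Drive one task over its input split (stand-in for the framework)."""
    task.init()
    output = [task.map(record) for record in records]
    task.close()
    return output
```

The join algorithms that follow rely on this hook: a map task can build its hash table in init() instead of rebuilding it for every record.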

Join Algorithms in MR: 1. Repartition Join

The most commonly used join strategy in the MapReduce framework
– L and R are dynamically partitioned on the join key
– The corresponding pairs of partitions are joined
– Similar to the partitioned sort-merge join in parallel RDBMSs

Example Tables (Log table & User table)
– Log table
   • 500,000 records
   • Each log entry has a lecture name and a grade
– User table
   • 10,000 records
– The join key is the student ID

Log Table

Log (lecture, grade)    Student ID
DB   B+                 2008-2424
KRR  A                  2010-8281
Opt  A-                 2005-3682
ML   C0                 2009-0078
OS   A+                 2010-1004
NL   D-                 2008-0909
…                       …

User Table

Student ID    Name
2008-0909     Ahn Jaemin
2010-1004     Kim Somin
2009-0078     Song Hyo-jin
2005-3682     Lee Taewhi
2010-8281     An Inseok
…             …
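
A minimal single-process sketch of the standard repartition join, using the visible rows of the example tables. It only simulates the map, shuffle, and reduce steps in plain Python; the function names are illustrative assumptions, not a framework API. Student 2008-2424 is missing from the visible part of the user table, so that log record produces no output here.

```python
from collections import defaultdict

LOG = [  # (lecture and grade, student ID)
    ("DB B+", "2008-2424"), ("KRR A", "2010-8281"), ("Opt A-", "2005-3682"),
    ("ML C0", "2009-0078"), ("OS A+", "2010-1004"), ("NL D-", "2008-0909"),
]
USERS = [  # (student ID, name)
    ("2008-0909", "Ahn Jaemin"), ("2010-1004", "Kim Somin"),
    ("2009-0078", "Song Hyo-jin"), ("2005-3682", "Lee Taewhi"),
    ("2010-8281", "An Inseok"),
]


def map_phase(log, users):
    """Emit (join key, (table tag, payload)) for every input record."""
    for entry, sid in log:
        yield sid, ("L", entry)
    for sid, name in users:
        yield sid, ("R", name)


def shuffle(pairs, num_reducers):
    """Partition on the join key and group the values of each key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[hash(key) % num_reducers][key].append(value)
    return partitions


def reduce_phase(key, values):
    """Buffer both sides in full, then output their cross product."""
    buf_r = [payload for tag, payload in values if tag == "R"]
    buf_l = [payload for tag, payload in values if tag == "L"]
    for name in buf_r:
        for entry in buf_l:
            yield key, name, entry


def repartition_join(log, users, num_reducers=2):
    result = []
    for part in shuffle(map_phase(log, users), num_reducers):
        for key, values in part.items():
            result.extend(reduce_phase(key, values))
    return sorted(result)
```

Note that `reduce_phase` holds every record of a key in `buf_r` and `buf_l` before it emits anything.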


[Figure: standard repartition join, map phase. Each map task reads a split of R or L from the distributed file system and emits every record keyed by its join key (the student ID) and tagged with its source table (L or R). The intermediate results are written to local disk and then fetched into the buffer of the reduce tasks.]

[Figure: standard repartition join, reduce phase. The reducer reads the intermediate results from local disk into a buffer, separates the records of each join key into BR (records from R) and BL (records from L), and writes the joined rows (Student ID, Name, Log) to the output file in the distributed file system, e.g. (2010-8281, An In Seok, KRR A) and (2009-0078, Song Hyo Jin, ML C).]


Standard Repartition Join
– Potential problem: all records for a given join key have to be buffered
– The records may not fit in memory when
   • the data is highly skewed
   • the key cardinality is small
– Variants of the standard repartition join are used in Pig, Hive, and Jaql today
   • They all suffer from the buffering problem

Join Algorithms in MR: 2. Improved Repartition Join

Improved Repartition Join
– The output key is changed to a composite of the join key and the table tag
– The partitioning and grouping functions are customized
– Records from the smaller table R are buffered, and L records are streamed to generate the join output
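
The three changes of the improved repartition join can be sketched in plain Python as follows. This is a simulation under assumptions, not framework code: the tag values `TAG_R` and `TAG_L` are chosen so that R sorts before L within a join key, and `partition` and the `groupby` call stand in for the customized partitioning and grouping functions.

```python
from itertools import groupby

TAG_R, TAG_L = 0, 1  # assumed tag values: R sorts before L for the same key


def map_phase(log, users):
    """Emit ((join key, table tag), payload) with a composite output key."""
    for entry, sid in log:
        yield (sid, TAG_L), entry
    for sid, name in users:
        yield (sid, TAG_R), name


def partition(composite_key, num_reducers):
    """Custom partitioner: hash the join key only, ignoring the tag."""
    return hash(composite_key[0]) % num_reducers


def reduce_phase(sorted_pairs):
    """Custom grouping on the join key only; buffer R, stream L."""
    for sid, group in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        buf_r = []
        for (_, tag), payload in group:
            if tag == TAG_R:
                buf_r.append(payload)     # R records arrive first
            else:
                for name in buf_r:        # each L record is joined, not kept
                    yield sid, name, payload


def improved_repartition_join(log, users, num_reducers=2):
    parts = [[] for _ in range(num_reducers)]
    for key, payload in map_phase(log, users):
        parts[partition(key, num_reducers)].append((key, payload))
    result = []
    for part in parts:
        part.sort(key=lambda kv: kv[0])   # framework sort on the composite key
        result.extend(reduce_phase(part))
    return sorted(result)
```

Only `buf_r` is held in memory, so the reducer's footprint depends on the number of R records per key rather than on the number of L records.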

[Figure: improved repartition join, map phase. Each map task reads a split of R or L from the distributed file system and emits every record under a composite key of the join key and the table tag (e.g. "2010-8281 L", "2010-8281 R"). The intermediate results are written to local disk and then fetched into the buffer of the reduce tasks.]

[Figure: improved repartition join, reduce phase. For each join key only the R records are buffered (BR); the L records are streamed against that buffer, and the joined rows (Student ID, Name, Log) are written to the output file in the distributed file system.]

Join Algorithms in MR: 3. Directed Join

Preprocessing for Repartition Join (Directed Join)
– Applies when both L and R have already been partitioned on the join key
   • L is pre-partitioned on the join key
   • At query time, matching partitions from L and R can then be joined directly
– A map-only MapReduce job
   • During the init phase, Ri is retrieved from the DFS if it is not already in local storage
   • A main-memory hash table is built on Ri
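
A minimal sketch of the directed join, assuming both tables are pre-partitioned with the same function. `partition_id`, `prepartition`, and `DirectedJoinTask` are hypothetical names; retrieving Ri from the DFS is replaced by passing the partition in directly.

```python
def partition_id(key, num_partitions):
    """Deterministic partitioning function shared by L and R (an assumption)."""
    return sum(key.encode()) % num_partitions


def prepartition(records, key_of, num_partitions):
    """Preprocessing step: split a table into partitions on the join key."""
    parts = [[] for _ in range(num_partitions)]
    for rec in records:
        parts[partition_id(key_of(rec), num_partitions)].append(rec)
    return parts


class DirectedJoinTask:
    """Map-only task that joins partition Li with the matching Ri."""

    def __init__(self, r_partition):
        self.r_partition = r_partition  # stands in for Ri stored in the DFS

    def init(self):
        # Build the main-memory hash table on Ri before any L record is read.
        self.table = {}
        for sid, name in self.r_partition:
            self.table.setdefault(sid, []).append(name)

    def map(self, l_record):
        entry, sid = l_record
        return [(sid, name, entry) for name in self.table.get(sid, [])]


def directed_join(log, users, num_partitions=3):
    l_parts = prepartition(log, lambda rec: rec[1], num_partitions)
    r_parts = prepartition(users, lambda rec: rec[0], num_partitions)
    result = []
    for l_i, r_i in zip(l_parts, r_parts):  # one map task per partition pair
        task = DirectedJoinTask(r_i)
        task.init()
        for rec in l_i:
            result.extend(task.map(rec))
    return sorted(result)
```

Because matching keys always land in the same partition index, no shuffle or reduce phase is needed.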

Join Algorithms in MR: 4. Broadcast Join

Broadcast Join
– In most applications, |R| << |L|
– Instead of moving both R and L across the network, the smaller table R is broadcast, which avoids the network overhead of moving L
– A map-only job
– Each map task uses a main-memory hash table for either L or R
   • If R is smaller than a split of L, the hash table is built on R
   • If R is larger than a split of L, the hash table is built on the split of L
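
The choice of build side can be sketched as below. This is an illustrative simulation only: sizes are compared by record count (an assumption, the slides compare table and split sizes), and `join_split` and `broadcast_join` are hypothetical names.

```python
def join_split(l_split, users):
    """Join one split of L with the broadcast table R inside a map task."""
    if len(users) <= len(l_split):
        # R is the smaller side: build the hash table on R, probe with L.
        table = {}
        for sid, name in users:
            table.setdefault(sid, []).append(name)
        return [(sid, name, entry)
                for entry, sid in l_split
                for name in table.get(sid, [])]
    # The split of L is smaller: build the hash table on it, probe with R.
    table = {}
    for entry, sid in l_split:
        table.setdefault(sid, []).append(entry)
    return [(sid, name, entry)
            for sid, name in users
            for entry in table.get(sid, [])]


def broadcast_join(l_splits, users):
    result = []
    for split in l_splits:  # one map task per split, no reduce phase
        result.extend(join_split(split, users))
    return sorted(result)
```

Both branches produce the same joined rows; only the table that is hashed and the table that is scanned change.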

Preprocessing for Broadcast Join
– Most nodes in the cluster have a local copy of R in advance
– This avoids retrieving R from the DFS in the init() function

Join Algorithms in MR: 5. Semi-Join

Semi-Join
– In some applications, R is large but most of its records are never referenced by L
   • At Facebook, the user table has hundreds of millions of records
   • Only a few million unique users are active per hour
– The semi-join avoids sending the records in R that will not join with L over the network

Preprocessing for Semi-Join
– The first two phases of the semi-join can be moved to a preprocessing step
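
The slide mentions phases without listing them. The sketch below assumes the usual three-phase structure: (1) extract the unique join keys of L, (2) keep only the R records whose key appears in that set, (3) join the filtered R with L using a hash table, as in the broadcast join. All function names are illustrative assumptions.

```python
def extract_unique_keys(log):
    """Phase 1: collect the distinct join keys that occur in L."""
    return {sid for _, sid in log}


def filter_reference(users, keys):
    """Phase 2: keep only the R records that are referenced by L."""
    return [(sid, name) for sid, name in users if sid in keys]


def join_with_filtered(log, filtered_users):
    """Phase 3: join L with the filtered R through a hash table."""
    table = {}
    for sid, name in filtered_users:
        table.setdefault(sid, []).append(name)
    return sorted((sid, name, entry)
                  for entry, sid in log
                  for name in table.get(sid, []))


def semi_join(log, users):
    """Return the join result and the filtered R that had to be shipped."""
    keys = extract_unique_keys(log)
    filtered = filter_reference(users, keys)
    return join_with_filtered(log, filtered), filtered
```

Only `filtered` would cross the network in the last phase, which is the saving the semi-join is after when most of R is inactive.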

Join Algorithms in MR: 6. Per-Split Semi-Join

Per-Split Semi-Join
– The problem with the semi-join: not every record of the extracted R will join with a particular split Li of L
– Instead, an Ri is extracted for each split Li, so that Li can be joined with Ri directly

Preprocessing for Per-Split Semi-Join
– Also benefits from moving its first two phases to a preprocessing step
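
A sketch of the per-split variant, under the same assumed three-phase structure as the semi-join but applied to each split Li separately. `per_split_semi_join` is a hypothetical name, and scanning all of R once per split is a simplification made for brevity.

```python
def per_split_semi_join(l_splits, users):
    """Return the join result and the per-split extracts Ri."""
    result, extracts = [], []
    for split in l_splits:
        # Phase 1: unique join keys of this split only.
        keys_i = {sid for _, sid in split}
        # Phase 2: Ri, the R records referenced by this split.
        r_i = [(sid, name) for sid, name in users if sid in keys_i]
        extracts.append(r_i)
        # Phase 3: join Li directly with Ri through a hash table.
        table = {}
        for sid, name in r_i:
            table.setdefault(sid, []).append(name)
        for entry, sid in split:
            for name in table.get(sid, []):
                result.append((sid, name, entry))
    return sorted(result), extracts
```

Each Ri contains only records that join with its own Li, at the cost of the same R record appearing in several Ri when several splits reference it.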


Experimental Evaluation
1. Environment
2. Datasets
3. MapReduce Time Breakdown
4. Experimental Results

Experimental Evaluation: 1. Environment

System Specification
– All experiments run on a 100-node cluster
– Each node has a single 2.4GHz Intel Core 2 Duo processor
– 4GB of DRAM and two SATA disks
– Red Hat Enterprise Server 5.2 running Linux 2.6.18

Network Specification
– The 100 nodes were spread across two racks
– Each node can execute two map and two reduce tasks concurrently
– Each rack had its own gigabit Ethernet switch
– The rack-level bandwidth is 32Gb/s
– Under full load, 35MB/s cross-rack node-to-node bandwidth

Hadoop version 0.19.0, HDFS (128MB block size)

Experimental Evaluation: 2. Datasets

Datasets

                    Event Log (L)          User Info (R)
Join column size    10 bytes               5 bytes
Record size         100 bytes (average)    100 bytes (exactly)
Total size          500GB                  10MB~100GB

• Join result is a 10-byte join key
• n-to-1 join
• Many users are inactive
• All the records in L always appear in the result
• The fraction of R referenced by L is fixed at 0.1%, 1%, or 10%
• A Zipf distribution was used to simulate some users being more active than others

Experimental Evaluation: 3. MapReduce Time Breakdown

MapReduce Time Breakdown
– What transpires during the execution of a MapReduce job
– The overhead of the various execution components of MapReduce
– System environment
   • The standard repartition join algorithm
   • 500GB log table and 30MB reference table
   • 1% of the reference table actually referenced by the log records
   • 4000 map tasks and 200 reduce tasks
   • Each node was assigned 40 map and 2 reduce tasks
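
The task counts are consistent with the cluster description, as a quick back-of-the-envelope check shows. The check assumes 1GB = 1024MB, one map task per 128MB HDFS block of the log table (the 30MB reference table is ignored), and an even spread of tasks over the 100 nodes; the "waves" figures are derived values, not numbers reported on the slides.

```python
LOG_TABLE_MB = 500 * 1024    # 500GB log table, assuming 1GB = 1024MB
BLOCK_MB = 128               # HDFS block size
NODES = 100
MAP_TASKS = 4000
REDUCE_TASKS = 200
MAP_SLOTS_PER_NODE = 2       # concurrent map tasks per node
REDUCE_SLOTS_PER_NODE = 2    # concurrent reduce tasks per node

splits = LOG_TABLE_MB // BLOCK_MB            # blocks of the log table
maps_per_node = MAP_TASKS // NODES           # map tasks assigned per node
reduces_per_node = REDUCE_TASKS // NODES     # reduce tasks assigned per node
map_waves = maps_per_node // MAP_SLOTS_PER_NODE
reduce_waves = reduces_per_node // REDUCE_SLOTS_PER_NODE

print(splits, maps_per_node, reduces_per_node, map_waves, reduce_waves)
```

Under these assumptions each node works through its map tasks in 20 rounds of two, while all reduce tasks fit in a single round.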

Interesting Observations on MapReduce
– The map phase was clearly CPU-bound
– The reduce phase was limited by the network bandwidth
   • Writing the three copies of the join result to HDFS
– The disk and the network activities were moderate and periodic during the map phase
   • The peaks were related to the output generation in the map task and the shuffle phase in the reduce task
– The cluster was almost idle for about 30 seconds between the 9 min and 10 min marks
   • Waiting for the slowest map task
– By enabling independent and concurrent map tasks, almost all CPU, disk, and network activities can be overlapped

Experimental Evaluation: 4. Experimental Results

[Figures: experimental results for the join algorithms, shown for two settings: no preprocessing and preprocessing.]


Discussion: Choosing the Right Strategy

– Determine the right join strategy for a given circumstance

– Provide an important first step for query optimization


Conclusion

Joining log data with reference data in MapReduce has emerged as an important part of
– Analytic operations for enterprise customers
– Web 2.0 companies

A series of join algorithms was designed on top of MapReduce
– Without requiring any modification to the actual framework
– Many details for an efficient implementation were proposed
   • Two additional functions: init() and close()
   • Practical preprocessing techniques

Future work
– Multi-way joins
– Indexing methods to speed up join queries
– An optimization module (selecting appropriate join algorithms)
– New programming models to extend the MapReduce framework
