A Comparison of Join Algorithms for Log Processing in MapReduce
Spyros Blanas, Jignesh M. Patel (University of Wisconsin-Madison)
Eugene J. Shekita, Yuanyuan Tian (IBM Almaden Research Center)
SIGMOD 2010
August 1, 2010
Presented by Hyojin Song
Contents
Introduction
Join Algorithms In MapReduce
Experimental Evaluation
Discussion
Conclusion
Introduction (1/3): Log Processing
– An important type of data analysis commonly done with MapReduce
– A log of events: a click-stream, a log of phone call records, or a sequence of transactions
– Logs are filtered, aggregated, and mined for patterns to compute various statistics for business insight
– Log data often needs to be joined with reference data (e.g., user information)
Log Table
Call record             Number
2010.09.24.14:20.30     01191655603
2010.09.24.14:30.45     01046841397
2010.09.25.19:11.118    01926540846
2010.09.28.06:40.97     01098446512
2010.09.29.08:44.08     01013461655
……                      ……

Reference Table
Number         Name
01191655603    송효진
01046841397    안철수
01926540846    한효주
01098446512    안인석
01013461655    마음이
……             ……
Introduction (2/3): MapReduce Framework
– Used to analyze large volumes of data
– Reasons for the success of MapReduce
    – A simple programming framework
    – The framework manages parallelization, fault tolerance, and load balancing
– Criticisms of MapReduce
    – Lack of a schema
    – Lack of a declarative query language
    – Lack of indexes
– Joins are difficult
    – MapReduce was not originally designed to combine information from several data sources
    – Users end up using simple but inefficient algorithms to perform joins
Introduction (3/3): The Benefits of MapReduce for Log Processing
– Scalability
    – China Mobile gathers 5-8TB of phone call records per day
    – Facebook collects almost 6TB of new log data every day, with 1.7PB in total
– Schema-free flexibility
    – The format of a log record may change over time
– Simple scans are preferable (as opposed to index scans)
– Time-consuming work
    – Graceful fault-tolerance support (as opposed to a parallel RDBMS)
The goal of this paper
– Implement several well-known join strategies in MapReduce
– Run comprehensive experiments to compare these join techniques
Problem Statement
1. Repartition Join
2. Improved Repartition Join
3. Directed Join
4. Broadcast Join
5. Semi-Join
6. Per-split Semi-Join
Join Algorithms in MR: Problem Statement
An equi-join between a log table L and a reference table R on a single column, with |L| >> |R|
The paper proposes further improving join performance with some preprocessing techniques
– These techniques are well known in the RDBMS literature
– Adapting them to MapReduce is not always straightforward
– The paper gives crucial implementation details of these join algorithms
Two additional functions are implemented: init() and close()
– These are called before and after each map or reduce task
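The task lifecycle with init() and close() can be illustrated with a minimal sketch. It is written in plain Python rather than against a real MapReduce API; the class name `MapTask`, the driver `run_task`, and the record-counting example are hypothetical. Only the idea that init() runs once before a task's records and close() runs once after them comes from the slide.

```python
class MapTask:
    """Hypothetical map task with init() and close() hooks around map()."""

    def init(self):
        # Called once, before the first record: set up per-task state.
        self.seen = 0
        self.output = []

    def map(self, key, value):
        # Called once per input record.
        self.seen += 1
        self.output.append((key, value.upper()))

    def close(self):
        # Called once, after the last record: emit a per-task summary.
        self.output.append(("__records__", self.seen))


def run_task(task, records):
    """Drive one task over its input split: init, map per record, close."""
    task.init()
    for key, value in records:
        task.map(key, value)
    task.close()
    return task.output


result = run_task(MapTask(), [("k1", "a"), ("k2", "b")])
```

The join algorithms that follow use init() to load reference data into memory and, in some variants, close() to finish work that cannot be done record by record.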
Join Algorithms in MR: 1. Repartition Join
The most commonly used join strategy in the MapReduce framework
– L and R are dynamically partitioned on the join key
– The corresponding pairs of partitions are joined
– Similar to a partitioned sort-merge join in a parallel RDBMS
Example Tables (Log table & User table)
– Log table: 500,000 records; each log record has a lecture name and a grade
– User table: 10,000 records
– The join key is the student ID

Log Table
Log       Student ID
DB B+     2008-2424
KRR A     2010-8281
Opt A-    2005-3682
ML C0     2009-0078
OS A+     2010-1004
NL D-     2008-0909
…         …

User Table
Student ID    Name
2008-0909     Ahn Jaemin
2010-1004     Kim Somin
2009-0078     Song Hyo-jin
2005-3682     Lee Taewhi
2010-8281     An Inseok
…             …
[Figure: Repartition join, map phase. Each map task reads a split of R or L from the distributed file system, tags every record with its source table, and emits (join key, tagged record) pairs such as (2010-8281, L: KRR A) and (2010-8281, R: An). The intermediate results are written to local disk and then shuffled into the buffers of the reduce tasks.]
[Figure: Repartition join, reduce phase. For each join key, the reducer separates the records into the buffers BL and BR, joins them, and writes the output (Student ID, Name, Log) to the distributed file system, e.g. (2010-8281, An Inseok, KRR A) and (2009-0078, Song Hyo-jin, ML C).]
Standard Repartition Join
– Potential problem: all records for a given join key have to be buffered
– The buffered records may not fit in memory when
    – the data is highly skewed
    – the key cardinality is small
– Variants of the standard repartition join are used in Pig, Hive, and Jaql today; they all suffer from the buffering problem
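The standard repartition join can be sketched as a single-process simulation. This is an illustration in plain Python, not the paper's Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and the shuffle is simulated with a dictionary. The sample records are taken from the example tables.

```python
from collections import defaultdict


def map_phase(records, tag):
    """Tag each record with its source table and emit (join_key, (tag, record))."""
    for join_key, payload in records:
        yield join_key, (tag, payload)


def shuffle(pairs):
    """Group map output by join key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(join_key, tagged_records):
    """Buffer ALL records of one key into BR and BL, then emit their cross product."""
    buf_r, buf_l = [], []
    for tag, payload in tagged_records:
        (buf_r if tag == "R" else buf_l).append(payload)
    for r in buf_r:
        for l in buf_l:
            yield join_key, r, l


R = [("2010-8281", "An Inseok"), ("2009-0078", "Song Hyo-jin")]
L = [("2008-2424", "DB B+"), ("2010-8281", "KRR A"), ("2009-0078", "ML C0")]

pairs = list(map_phase(R, "R")) + list(map_phase(L, "L"))
joined = sorted(
    row
    for key, values in shuffle(pairs).items()
    for row in reduce_phase(key, values)
)
```

Both `buf_r` and `buf_l` hold every record of a key before any output is produced, which is the buffering problem named on the slide.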
Join Algorithms in MR: 2. Improved Repartition Join
– The output key is changed to a composite of the join key and the table tag
– The partitioning and grouping functions are customized to use only the join key, so records with the same join key still reach the same reduce call, with R records ordered before L records
– Records from the smaller table R are buffered, and L records are streamed to generate the join output
[Figure: Improved repartition join, map phase. As in the standard repartition join, each map task reads a split of R or L from the distributed file system, but the emitted key is now the composite (join key, table tag), e.g. (2010-8281 L, L: KRR A) and (2010-8281 R, R: An). The intermediate results are written to local disk and shuffled to the reduce tasks.]
[Figure: Improved repartition join, reduce phase. Only the R records of each join key are kept in the buffer BR; the L records are streamed past it to produce the output (Student ID, Name, Log) in the distributed file system.]
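A minimal sketch of the improved repartition join follows, again as a single-process simulation in plain Python. The names `TAG_R`, `TAG_L`, `partition`, and `reduce_partition` are hypothetical, and sorting plus `itertools.groupby` stands in for the framework's sort and custom grouping function. The tag values are chosen, as an assumption of this sketch, so that R records sort before L records of the same key.

```python
from itertools import groupby

TAG_R, TAG_L = 0, 1  # R sorts before L within the same join key


def map_phase(records, tag):
    """Emit ((join_key, tag), payload): the composite output key."""
    for join_key, payload in records:
        yield (join_key, tag), payload


def partition(composite_key, num_reducers):
    """Custom partitioner: route on the join key only, ignoring the tag."""
    join_key, _tag = composite_key
    return hash(join_key) % num_reducers


def reduce_partition(pairs):
    """Sort by composite key, group by join key, buffer R, stream L."""
    pairs = sorted(pairs, key=lambda kv: kv[0])
    for join_key, group in groupby(pairs, key=lambda kv: kv[0][0]):
        buf_r = []
        for (_key, tag), payload in group:
            if tag == TAG_R:
                buf_r.append(payload)  # only R records are buffered
            else:
                for r in buf_r:  # each L record is joined and then dropped
                    yield join_key, r, payload


R = [("2010-8281", "An Inseok"), ("2009-0078", "Song Hyo-jin")]
L = [("2008-2424", "DB B+"), ("2010-8281", "KRR A"), ("2009-0078", "ML C0")]

num_reducers = 2
partitions = [[] for _ in range(num_reducers)]
for pair in list(map_phase(R, TAG_R)) + list(map_phase(L, TAG_L)):
    partitions[partition(pair[0], num_reducers)].append(pair)
joined = sorted(row for part in partitions for row in reduce_partition(part))
```

Because all R records of a key arrive first, the reducer never needs to hold L records in memory, which removes the buffering problem when |L| >> |R|.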
Join Algorithms in MR: 3. Directed Join
Preprocessing for Repartition Join (Directed Join)
– Applies when both L and R have already been partitioned on the join key
    – L is pre-partitioned on the join key
    – At query time, matching partitions from L and R can be joined directly
– A map-only MapReduce job
    – During the init phase, Ri is retrieved from the DFS if it is not already in local storage
    – A main-memory hash table is built on Ri and probed with the records of Li
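The directed join can be sketched as follows, assuming L and R were pre-partitioned with the same function. The partitioning function `part_of`, the class `DirectedJoinTask`, and the in-memory dictionaries standing in for DFS partitions are all hypothetical choices for this illustration.

```python
from collections import defaultdict

NUM_PARTITIONS = 4


def part_of(join_key):
    """Hypothetical partitioning function shared by L and R."""
    return sum(join_key.encode()) % NUM_PARTITIONS


def pre_partition(records):
    """Preprocessing step: split a table into partitions on the join key."""
    parts = defaultdict(list)
    for join_key, payload in records:
        parts[part_of(join_key)].append((join_key, payload))
    return parts


class DirectedJoinTask:
    """Map-only task that joins partition L_i against the matching R_i."""

    def __init__(self, r_parts, i):
        self.r_parts, self.i = r_parts, i

    def init(self):
        # Load R_i (here: from a dict standing in for the DFS) into a hash table.
        self.table = defaultdict(list)
        for join_key, payload in self.r_parts.get(self.i, []):
            self.table[join_key].append(payload)

    def map(self, join_key, l_payload):
        # Probe the hash table with each record of L_i.
        for r_payload in self.table.get(join_key, []):
            yield join_key, r_payload, l_payload


R = [("2010-8281", "An Inseok"), ("2009-0078", "Song Hyo-jin")]
L = [("2008-2424", "DB B+"), ("2010-8281", "KRR A"), ("2009-0078", "ML C0")]

r_parts, l_parts = pre_partition(R), pre_partition(L)
joined = []
for i, l_i in l_parts.items():
    task = DirectedJoinTask(r_parts, i)
    task.init()
    for join_key, payload in l_i:
        joined.extend(task.map(join_key, payload))
joined.sort()
```

No shuffle or reduce phase is needed, because a record of Li can only match records in Ri.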
Join Algorithms in MR: 4. Broadcast Join
Broadcast Join
– In most applications, |R| << |L|
– Instead of moving both R and L across the network, broadcast the smaller table R to avoid the network overhead
– A map-only job
– Each map task uses a main-memory hash table for either L or R
    – If R is smaller than a split of L, build the hash table on R
    – If R is larger than a split of L, build the hash table on the split of L
Preprocessing for Broadcast Join
– Most nodes in the cluster have a local copy of R in advance
– This avoids retrieving R from the DFS in the init() function
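The choice of which side to hash can be sketched as below. This is a simplified illustration: table size is measured in record counts rather than bytes, the tie case is resolved in favor of R, and the names `build_hash_table` and `broadcast_join_task` are hypothetical.

```python
from collections import defaultdict


def build_hash_table(records):
    """Build a main-memory hash table from join key to payloads."""
    table = defaultdict(list)
    for join_key, payload in records:
        table[join_key].append(payload)
    return table


def broadcast_join_task(r_records, l_split):
    """Join one split of L against the full broadcast copy of R."""
    if len(r_records) <= len(l_split):
        # R is the smaller side: hash R, probe with the records of the split.
        table = build_hash_table(r_records)
        for key, l_payload in l_split:
            for r_payload in table.get(key, []):
                yield key, r_payload, l_payload
    else:
        # The split is the smaller side: hash the split, probe with R.
        table = build_hash_table(l_split)
        for key, r_payload in r_records:
            for l_payload in table.get(key, []):
                yield key, r_payload, l_payload


R = [
    ("2010-8281", "An Inseok"),
    ("2009-0078", "Song Hyo-jin"),
    ("2010-1004", "Kim Somin"),
]
l_splits = [
    [("2008-2424", "DB B+"), ("2010-8281", "KRR A"),
     ("2005-3682", "Opt A-"), ("2009-0078", "ML C0")],  # larger than R
    [("2010-1004", "OS A+"), ("2008-0909", "NL D-")],   # smaller than R
]

joined = sorted(row for split in l_splits for row in broadcast_join_task(R, split))
```

Each call handles one split independently, so the job needs no shuffle; only R travels over the network.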
Join Algorithms in MR: 5. Semi-Join
Semi-Join
– In some applications, only a small fraction of R is referenced by L
    – At Facebook, the user table has hundreds of millions of records
    – Only a few million unique users are active per hour
– Goal: avoid sending over the network the records in R that will not join with L
Preprocessing for Semi-Join
– The first two phases of the semi-join can be moved to a preprocessing step
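The slide does not list the phases of the semi-join. The sketch below assumes the usual three-phase formulation: extract the unique join keys of L, filter R with those keys, then join the filtered R with L in broadcast fashion. The function names are hypothetical, and each phase, which would be a separate job in MapReduce, is a plain function here.

```python
def unique_join_keys(l_records):
    """Phase 1 (assumed): collect the distinct join keys that occur in L."""
    return {join_key for join_key, _ in l_records}


def filter_reference(r_records, keys):
    """Phase 2 (assumed): keep only the R records referenced by L."""
    return [(k, payload) for k, payload in r_records if k in keys]


def join_with_filtered(filtered_r, l_records):
    """Phase 3 (assumed): join L with the much smaller filtered R."""
    table = {}
    for k, payload in filtered_r:
        table.setdefault(k, []).append(payload)
    for k, l_payload in l_records:
        for r_payload in table.get(k, []):
            yield k, r_payload, l_payload


R = [
    ("2010-8281", "An Inseok"),
    ("2009-0078", "Song Hyo-jin"),
    ("2010-1004", "Kim Somin"),
    ("2008-0909", "Ahn Jaemin"),
]
L = [("2010-8281", "KRR A"), ("2009-0078", "ML C0"), ("2010-8281", "OS A+")]

keys = unique_join_keys(L)
filtered_r = filter_reference(R, keys)
joined = sorted(join_with_filtered(filtered_r, L))
```

In this toy input two of the four users are inactive, so only half of R would be shipped to the tasks that scan L.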
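Join Algorithms in MR: 6. Per-Split Semi-Join
Per-Split Semi-Join
– Problem with the semi-join: not every record in the filtered R will join with a particular split Li of L
– Per-split semi-join: extract Ri, the records of R that join with Li, so that Li can be joined with Ri directly
Preprocessing for Per-Split Semi-Join
– Also benefits from moving its first two phases to a preprocessing step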
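A minimal sketch of the per-split variant, under the same assumptions as a three-phase semi-join but applied to each split Li separately. The function name `per_split_semi_join` is hypothetical, and the sizes of the Ri are returned only to make the effect visible.

```python
def per_split_semi_join(r_records, l_splits):
    """For each split L_i, build R_i from the keys of L_i and join L_i with R_i."""
    rows, r_i_sizes = [], []
    for l_i in l_splits:
        keys_i = {k for k, _ in l_i}                          # keys used by L_i
        r_i = [(k, p) for k, p in r_records if k in keys_i]   # R restricted to L_i
        r_i_sizes.append(len(r_i))
        table = {}
        for k, payload in r_i:
            table.setdefault(k, []).append(payload)
        for k, l_payload in l_i:
            for r_payload in table.get(k, []):
                rows.append((k, r_payload, l_payload))
    return rows, r_i_sizes


R = [
    ("2010-8281", "An Inseok"),
    ("2009-0078", "Song Hyo-jin"),
    ("2010-1004", "Kim Somin"),
    ("2008-0909", "Ahn Jaemin"),
]
l_splits = [
    [("2010-8281", "KRR A"), ("2009-0078", "ML C0")],
    [("2010-1004", "OS A+")],
]

rows, r_i_sizes = per_split_semi_join(R, l_splits)
```

Each task now loads only the R records its own split needs, at the cost of filtering R once per split instead of once overall.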
1. Environment
2. Datasets
3. MapReduce Time Breakdown
4. Experimental Results
Experimental Evaluation: 1. Environment
System Specification
– All experiments ran on a 100-node cluster
– Each node had a single 2.4GHz Intel Core 2 Duo processor
– 4GB of DRAM and two SATA disks
– Red Hat Enterprise Server 5.2 running Linux 2.6.18
Network Specification
– The 100 nodes were spread across two racks
– Each node could execute two map and two reduce tasks concurrently
– Each rack had its own gigabit Ethernet switch
– The rack-level bandwidth was 32Gb/s
– Under full load, cross-rack node-to-node bandwidth was 35MB/s
Hadoop version 0.19.0, HDFS (128MB block size)
Experimental Evaluation: 2. Datasets
Datasets
                    Event Log (L)          User Info (R)
Join column size    10 bytes               5 bytes
Record size         100 bytes (average)    100 bytes (exactly)
Total size          500GB                  10MB to 100GB
• The join result is a 10-byte join key
• The join is n-to-1
• Many users are inactive
• All the records in L always appear in the result
• The fraction of R referenced by L was fixed at 0.1%, 1%, or 10%
• A Zipf distribution was used to simulate some active users
Experimental Evaluation: 3. MapReduce Time Breakdown
MapReduce Time Breakdown
– What transpires during the execution of a MapReduce job
– The overhead of the various execution components of MapReduce
– System environment
    – The standard repartition join algorithm
    – 500GB log table and 30MB reference table
    – 1% of the reference table actually referenced by the log records
    – 4000 map tasks and 200 reduce tasks
    – Each node was assigned 40 map and 2 reduce tasks
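The task counts above are consistent with the cluster setup, as the following arithmetic shows. It assumes one map task per 128MB HDFS block of the log table, binary units (1GB = 1024MB), and an even spread of tasks over the 100 nodes; the variable names are illustrative.

```python
MB = 1024 ** 2
GB = 1024 ** 3

log_table_size = 500 * GB   # size of L
block_size = 128 * MB       # HDFS block size
nodes = 100
reduce_tasks = 200
map_slots_per_node = 2      # concurrent map tasks per node

# One map task per block of the log table (assumption).
map_tasks = log_table_size // block_size

maps_per_node = map_tasks // nodes
reduces_per_node = reduce_tasks // nodes

# With two map slots, a node works through its map tasks in waves.
map_waves_per_node = maps_per_node // map_slots_per_node
```

Under these assumptions each node runs its 40 map tasks in 20 waves of two, while its two reduce tasks fit into the two reduce slots at once.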
Interesting Observations on MapReduce
– The map phase was clearly CPU-bound
– The reduce phase was limited by the network bandwidth
    – Writing the three copies of the join result to HDFS
– The disk and network activities were moderate and periodic during the map phase
    – The peaks were related to output generation in the map tasks and to the shuffle phase in the reduce tasks
– The cluster was almost idle for about 30 seconds between the 9-minute and 10-minute marks
    – Waiting for the slowest map task
– By enabling independent and concurrent map tasks, almost all CPU, disk, and network activities can be overlapped
Experimental Evaluation: 4. Experimental Results
[Figures: experimental results. Legend: ▣ no preprocessing, ▣ preprocessing]
Discussion: Choosing the Right Strategy
– Determine the right join strategy for a given circumstance
– Provide an important first step for query optimization
Conclusion
Joining log data with reference data in MapReduce has emerged as an important part of
– Analytic operations for enterprise customers
– Web 2.0 companies
The paper designs a series of join algorithms on top of MapReduce
– Without requiring any modification to the actual framework
– With many details proposed for efficient implementation
    – Two additional functions: init() and close()
    – Practical preprocessing techniques
Future work
– Multi-way joins
– Indexing methods to speed up join queries
– An optimization module (selecting appropriate join algorithms)
– New programming models to extend the MapReduce framework