MapReduce
Zuhair Khayyat, 3/11/2012
Introduction to MapReduce
CS245 - 2012 Introduction to MapReduce 2
What is MapReduce?
● A programming model introduced by Google at OSDI '04 for processing large datasets efficiently.
● Features:
– Automatic parallelization, no parallel-programming experience required.
– Data and process redundancy for failure recovery.
– Automatic scheduling and load balancing.
– Easy to program, based on two simple functions:
  ● Map.
  ● Reduce.
Why MapReduce?
● For a cluster of:
– 2000 machines.
– Total 16 TB RAM (≈ 8 GB each).
– Total 2 PB disk space (≈ 1 TB each).
● Use the maximum capacity of the cluster to:
– Implement a parallel word count for an input of size 100 TB.
– Implement a parallel sort for the same input file.
● Can you use the same code for both applications?
How Fast is MapReduce (Hadoop)
● Sort Benchmark competition (http://sortbenchmark.org/):
– 2009: 100 TB in 173 minutes using 3452 nodes:
  ● 2 x quad-core Xeons @ 2.5 GHz.
  ● 8 GB RAM.
– 2008: 1 TB in 3.48 minutes using 910 nodes:
  ● 4 x dual-core Xeons @ 2.0 GHz.
  ● 8 GB RAM.
Who uses MapReduce?
Map & Reduce functions
● The Mapper (picks a key):
– Input: reads its input partition from disk.
– Output: creates <key, value> pairs, known as intermediate pairs.
– More input partitions == more parallel Mappers.
● The Reducer (processes values):
– Input: the list of intermediate <key, value> pairs that share one unique key.
– Output: one or more <key, value> pairs.
– More unique keys == more parallel Reducers.
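As a rough illustration of the two functions, here is a minimal Python sketch for word count. The function names (map_word_count, reduce_word_count, group_by_key) are made up for this example and are not part of any MapReduce library; the grouping step stands in for what the framework would do between Map and Reduce.

```python
from collections import defaultdict

def map_word_count(line):
    """Mapper: pick each word as the key and emit the intermediate pair <word, 1>."""
    return [(word, 1) for word in line.split()]

def reduce_word_count(key, values):
    """Reducer: receive all values for one unique key and emit <key, total>."""
    return (key, sum(values))

def group_by_key(pairs):
    """Stand-in for the framework: collect the values of each unique key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

intermediate = map_word_count("snow cat cat")
groups = group_by_key(intermediate)
counts = [reduce_word_count(key, groups[key]) for key in sorted(groups)]
print(counts)  # [('cat', 2), ('snow', 1)]
```

Each call to reduce_word_count depends on one key only, which is why the reducers could run in parallel.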
How MapReduce Works
1) Partition the input file into M partitions.
2) Create M Map tasks that read the M partitions in parallel and emit intermediate <key, value> pairs, which are stored in local storage.
3) Wait for all Map workers to finish, then sort and partition the intermediate <key, value> pairs into R regions.
4) Start R Reduce workers; each reads, from remote disks, the list of intermediate pairs that share a unique key.
5) Write the output of the Reduce workers to file(s).
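The five steps above can be simulated in a single Python process. This is only a sketch under several assumptions: the driver name run_mapreduce is hypothetical, input lines are assigned to partitions round-robin, regions are chosen by a CRC32 hash of the key (a real system may use range partitioning instead), and "files" are plain Python lists.

```python
import zlib
from collections import defaultdict

def map_fn(line):
    # Mapper: emit <word, 1> for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reducer: sum the values of one unique key.
    return (key, sum(values))

def run_mapreduce(lines, map_fn, reduce_fn, M, R):
    # Step 1: partition the input into M partitions (round-robin over lines).
    partitions = [lines[i::M] for i in range(M)]
    # Step 2: one Map task per partition; each keeps its own intermediate pairs.
    map_outputs = [[pair for line in part for pair in map_fn(line)]
                   for part in partitions]
    # Step 3: sort the intermediate pairs and partition them into R regions.
    regions = [defaultdict(list) for _ in range(R)]
    for output in map_outputs:
        for key, value in sorted(output):
            regions[zlib.crc32(key.encode()) % R][key].append(value)
    # Step 4: reduce each unique key of each region.
    # Step 5: "write" one output list per region.
    return [[reduce_fn(key, region[key]) for key in sorted(region)]
            for region in regions]

lines = ["cat flower picture", "snow cat cat",
         "prince flower sun", "king queen AC"]
output_files = run_mapreduce(lines, map_fn, reduce_fn, M=4, R=3)
counts = dict(pair for region in output_files for pair in region)
print(counts)
```

Only map_fn and reduce_fn are specific to word count; the driver itself does not depend on the application.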
Example – Word count
● Assume the following input:
cat flower picture snow cat cat
prince flower sun king queen AC
Example – Word count
● Step 1: Partition the input file into M partitions.
Input:
cat flower picture snow cat cat
prince flower sun king queen AC
Partitions:
cat flower picture
snow cat cat
prince flower sun
king queen AC
Example – Word count
● Step 2: Create M Map tasks that read the M partitions in parallel and emit intermediate <key, value> pairs, which are stored in local storage.
Mapper 1: cat flower picture → <cat,1> <flower,1> <picture,1>
Mapper 2: snow cat cat → <snow,1> <cat,1> <cat,1>
Mapper 3: prince flower sun → <prince,1> <flower,1> <sun,1>
Mapper 4: king queen AC → <king,1> <queen,1> <AC,1>
Example – Word count
● Step 3: Wait for all Map workers to finish, then sort and partition the intermediate <key, value> pairs into R regions.
Mapper outputs:
<cat,1> <flower,1> <picture,1>
<snow,1> <cat,1> <cat,1>
<prince,1> <flower,1> <sun,1>
<king,1> <queen,1> <AC,1>
Mapper outputs, sorted:
<cat,1> <flower,1> <picture,1>
<cat,1> <cat,1> <snow,1>
<flower,1> <prince,1> <sun,1>
<AC,1> <king,1> <queen,1>
Regions:
<AC,1> <cat,1> <cat,1> <cat,1>
<flower,1> <flower,1> <king,1>
<picture,1> <prince,1> <queen,1> <snow,1> <sun,1>
Example – Word count
● Step 4: Start R Reduce workers; each reads, from remote disks, the list of intermediate pairs that share a unique key.
Regions:
<AC,1> <cat,1> <cat,1> <cat,1>
<flower,1> <flower,1> <king,1>
<picture,1> <prince,1> <queen,1> <snow,1> <sun,1>
Reducers:
Reducer 1 → <AC,1>
Reducer 2 → <cat,3>
Reducer 3 → <flower,2>
...
Reducer 9 → <sun,1>
Example – Word count
● Step 5: Write the output of the Reduce workers to file(s).
Reducer outputs:
<AC,1> <cat,3> <flower,2> <king,1> <picture,1> ... <sun,1>
Output:
<AC,1> <cat,3>
<flower,2> <king,1>
<picture,1> <prince,1> <queen,1> <snow,1> <sun,1>
MapReduce framework
MapReduce Failure Recovery
● The framework follows the master-worker paradigm.
● The master keeps a record of the work done on each worker.
● If a worker fails, the master assigns the same work to another worker.
● If a worker is slow, the master assigns a second copy of the same work to another worker.
● If the master fails, a backup copy of the master can take over and continue execution from the last checkpoint.
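The reassignment rule for failed workers can be sketched as a small master loop. This is a toy model, not how any particular framework implements it: the function run_with_recovery, the worker and task names, and the fails_on set that simulates crashes are all hypothetical, and slow workers and master failover are not modelled.

```python
def run_with_recovery(tasks, workers, fails_on):
    """Toy master: give each task to a worker; on failure, try the next worker.

    fails_on is a set of (worker, task) pairs that simulate a worker crash.
    Returns the master's record of which worker completed each task.
    """
    record = {}
    for task in tasks:
        for worker in workers:
            if (worker, task) in fails_on:
                continue  # this worker failed; reassign the same task
            record[task] = worker
            break
        else:
            raise RuntimeError(f"no worker could complete {task}")
    return record

record = run_with_recovery(
    tasks=["map-0", "map-1", "map-2"],
    workers=["w1", "w2", "w3"],
    fails_on={("w1", "map-1"), ("w2", "map-1")},
)
print(record)  # {'map-0': 'w1', 'map-1': 'w3', 'map-2': 'w1'}
```

Reassignment is safe here because a task depends only on its own input partition, so any worker can redo it from scratch.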
Advantages of MapReduce
● Parallel IO: hides disk latency.
● Parallel Processing:
– Map functions work independently in parallel, each processing one unique partition.
– Reduce functions work independently in parallel, each on a unique intermediate key.
● Using large clusters of commodity machines gives results comparable to those of small expensive clusters.
Hadoop vs. others
● Algorithm: Sorting 100 TB of data.
              Hadoop               DEMSort              TritonSort
Node count    3452                 195                  47
Processor     2x quad-core Xeons   2x quad-core Xeons   2x quad-core Xeons
              @ 2.5 GHz            @ 2.6 GHz            @ 2.27 GHz
Memory        8 GB                 16 GB                24 GB
Network       1 Gigabit Ethernet   InfiniBand           10 Gigabit Fiber
Throughput    0.578 TB/min         0.564 TB/min         0.582 TB/min
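One way to read the table is to divide each throughput by the node count. The short calculation below does that with the figures from the table, assuming 1 TB = 1000 GB; the per-node numbers are derived here and do not appear in the original slides.

```python
# (node count, throughput in TB/min), copied from the table above.
systems = {
    "Hadoop": (3452, 0.578),
    "DEMSort": (195, 0.564),
    "TritonSort": (47, 0.582),
}

# Per-node throughput in GB/min, assuming 1 TB = 1000 GB.
per_node = {name: 1000 * tb_per_min / nodes
            for name, (nodes, tb_per_min) in systems.items()}

for name, gb_per_min in per_node.items():
    print(f"{name}: {gb_per_min:.2f} GB/min per node")
# Hadoop: 0.17 GB/min per node
# DEMSort: 2.89 GB/min per node
# TritonSort: 12.38 GB/min per node
```

The total throughput is comparable across the three systems, but the work done per node differs by more than an order of magnitude.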
MapReduce weak points
● The overhead of MapReduce is huge.
● Data-dependent applications may need multiple iterations of MapReduce, for example:
– K-means.
– PageRank.
● Complex algorithms can be very hard to implement, for example:
– Range queries.
● Sensitive to skewed distributions of <key, value> pairs.
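The sensitivity to skew can be seen by counting how many intermediate pairs each reducer would receive. The sketch below assumes keys are assigned to reducers by a CRC32 hash, which is only one possible partitioning choice; the function reducer_load and the key list are made up for illustration.

```python
import zlib
from collections import Counter

def reducer_load(keys, R):
    """Count how many intermediate pairs land on each of R reducers."""
    load = Counter()
    for key in keys:
        load[zlib.crc32(key.encode()) % R] += 1
    return [load[r] for r in range(R)]

# A skewed key distribution: one key accounts for 90% of the pairs.
keys = ["the"] * 90 + ["cat", "flower", "picture", "snow", "prince",
                        "sun", "king", "queen", "AC", "sky"]
load = reducer_load(keys, R=4)
print(load)
```

All pairs with the same key must go to the same reducer, so whichever reducer receives "the" handles at least 90 of the 100 pairs while the others are nearly idle.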
Implementations of MapReduce
● Hadoop in Java.
● Mars in C++ & CUDA.
● Skynet in Ruby.
● Phoenix in C++.
● Microsoft Dryad:
– Schedules multiple levels of MapReduce-like operations.
MapReduce in Database
MapReduce in Database - Ex1
● Select Name from Students where age = 23;
Students:
Name   ID    Age
Ahmed  1177  23
Bob    1131  20
Sara   1197  22
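One possible answer to this exercise is a map-only job: the mapper applies the filter and projects the Name column, and no reducer is needed. The sketch below is an assumption about how the query could be expressed, with map_select as a hypothetical name and the Students rows held in a Python list.

```python
students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]

def map_select(row):
    """Mapper: emit <Name, None> only for rows that satisfy age = 23."""
    name, _student_id, age = row
    return [(name, None)] if age == 23 else []

# Map-only job: the mapper output is already the query result.
result = [key for row in students for key, _value in map_select(row)]
print(result)  # ['Ahmed']
```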
MapReduce in Database - Ex2
● Select COUNT(Name) from Students where age > 20 group by Name;
Students:
Name   ID    Age
Ahmed  1177  23
Bob    1131  20
Sara   1197  22
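A grouped count has the same shape as word count: the mapper filters and emits the grouping column as the key, and the reducer sums. The following is a sketch of one way to write it, with hypothetical function names and an in-memory dictionary standing in for the framework's grouping step.

```python
from collections import defaultdict

students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]

def map_count(row):
    """Mapper: apply the filter age > 20 and emit <Name, 1>."""
    name, _student_id, age = row
    return [(name, 1)] if age > 20 else []

def reduce_count(name, ones):
    """Reducer: COUNT(Name) for one group."""
    return (name, sum(ones))

# Group the intermediate pairs by key, as the framework would.
groups = defaultdict(list)
for row in students:
    for key, value in map_count(row):
        groups[key].append(value)

result = [reduce_count(name, groups[name]) for name in sorted(groups)]
print(result)  # [('Ahmed', 1), ('Sara', 1)]
```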
MapReduce in Database - Ex3
● Select Name, Term from Students, Enrolment where ID = SID and age != 20;
Students:
Name   ID    Age
Ahmed  1177  23
Bob    1131  20
Sara   1197  22
Enrolment:
CID      SID   Term
CS290    1177  042
CS260    1177  052
ME222    1131  051
AMCS220  1197  051
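An equality join like this one is commonly expressed as a reduce-side join: both tables are mapped to the join key, each value is tagged with its source table, and the reducer pairs the two sides. The sketch below is one such formulation under that assumption; the tags "S" and "E" and all function names are hypothetical.

```python
from collections import defaultdict

students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
enrolment = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]

def map_student(row):
    """Mapper for Students: filter age != 20, key by ID, tag the value with 'S'."""
    name, student_id, age = row
    return [(student_id, ("S", name))] if age != 20 else []

def map_enrolment(row):
    """Mapper for Enrolment: key by SID, tag the value with 'E'."""
    _cid, sid, term = row
    return [(sid, ("E", term))]

def reduce_join(_key, tagged_values):
    """Reducer: pair every student name with every term that has the same key."""
    names = [value for tag, value in tagged_values if tag == "S"]
    terms = [value for tag, value in tagged_values if tag == "E"]
    return [(name, term) for name in names for term in terms]

# Group the tagged pairs from both tables by join key.
groups = defaultdict(list)
for row in students:
    for key, value in map_student(row):
        groups[key].append(value)
for row in enrolment:
    for key, value in map_enrolment(row):
        groups[key].append(value)

result = [pair for key in sorted(groups) for pair in reduce_join(key, groups[key])]
print(result)  # [('Ahmed', '042'), ('Ahmed', '052'), ('Sara', '051')]
```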
MapReduce in Database - Ex4
● Select Name, Term from Students, Enrolment where ID != SID;
● What if the condition is ID > SID?
Students:
Name   ID    Age
Ahmed  1177  23
Bob    1131  20
Sara   1197  22
Enrolment:
CID      SID   Term
CS290    1177  042
CS260    1177  052
ME222    1131  051
AMCS220  1197  051
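With a condition such as ID != SID or ID > SID there is no shared key to group on, so the reduce-side grouping does not apply directly. One possible workaround, assumed here, is to give every mapper a full copy of the smaller table (Students) and evaluate the condition in the mapper; make_join_mapper is a hypothetical helper, and this only works when one table is small enough to replicate.

```python
import operator

students = [("Ahmed", 1177, 23), ("Bob", 1131, 20), ("Sara", 1197, 22)]
enrolment = [("CS290", 1177, "042"), ("CS260", 1177, "052"),
             ("ME222", 1131, "051"), ("AMCS220", 1197, "051")]

def make_join_mapper(small_table, condition):
    """Build a mapper that holds a full copy of the small table."""
    def map_join(row):
        _cid, sid, term = row
        # Compare this Enrolment row against every Students row.
        return [(name, term)
                for name, student_id, _age in small_table
                if condition(student_id, sid)]
    return map_join

not_equal = make_join_mapper(students, operator.ne)  # ID != SID
greater = make_join_mapper(students, operator.gt)    # ID > SID

result_ne = [pair for row in enrolment for pair in not_equal(row)]
result_gt = [pair for row in enrolment for pair in greater(row)]
print(len(result_ne))  # 8
print(result_gt)  # [('Sara', '042'), ('Sara', '052'), ('Ahmed', '051'), ('Sara', '051')]
```

The mapper does work proportional to the size of the replicated table for every input row, which is part of why such queries are expensive in this model.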
MapReduce in Database - Ex5
● Select Name, Term from Students, Enrolment where ID = SID and Admission != Term;
Students:
Name   ID    Age  Admission
Ahmed  1177  23   042
Bob    1131  20   051
Sara   1197  22   042
Enrolment:
CID      SID   Term
CS290    1177  042
CS260    1177  052
ME222    1131  051
AMCS220  1197  051
MapReduce in Database - Ex6
● Select y from R, S, T where R.x = S.x and T.a = S.a;
R:
x  y  z
S:
a  b  x
T:
m  n  a
MapReduce in Academic Papers
● NIPS '07: Map-Reduce for Machine Learning on Multicore.
● Escience '08: CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications.
● KDD '09: Large-scale behavioral targeting.
● GCC '09: Spatial Queries Evaluation with MapReduce.
● SIGIR '09: On single-pass indexing with MapReduce.
● MDAC '10: A novel approach to multiple sequence alignment using hadoop data grids.
● VLDB Endowment '11: Social Content Matching in MapReduce.
● VLDB '12: Building Wavelet Histograms on Large Data in MapReduce.
Links
● http://code.google.com/edu/parallel/mapreduce-tutorial.html
● http://hadoop.apache.org/mapreduce/
● http://www.cse.ust.hk/gpuqp/Mars.html
● http://skynet.rubyforge.org/
● http://mapreduce.stanford.edu/
● http://wiki.apache.org/hadoop/PoweredBy
● http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/