iMapReduce : A Distributed Computing Framework for Iterative Computation
description
Transcript of iMapReduce : A Distributed Computing Framework for Iterative Computation
![Page 1: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/1.jpg)
iMapReduce: A Distributed Computing Framework for
Iterative Computation
Yanfeng Zhang, Northeastern University, ChinaQixin Gao, Northeastern University, China
Lixin Gao, UMass AmherstCuirong Wang, Northeastern University, China
![Page 2: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/2.jpg)
Iterative Computation Everywhere
PageRank
Clustering
BFSVideo suggestion
Pattern Recognition
![Page 3: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/3.jpg)
Iterative Computation on Massive Data
PageRank
Clustering
BFSVideo suggestion
Pattern Recognition
100 million videos
700PB human genomics
29.7 billion pages
500TB/yr hospital
Google maps 70.5TB
![Page 4: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/4.jpg)
Large-Scale Solution MapReduce
• Programming model– Map: distribute computation workload– Reduce: aggregate the partial results
• But, no support of iterative processing– Batched processing– Must design a series of jobs, do the same
thing many times• Task reassignment, data loading, data dumping
![Page 5: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/5.jpg)
Outline
• Introduction• Example: PageRank in MapReduce• System Design: iMapReduce• Evaluation• Conclusions
![Page 6: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/6.jpg)
PageRank• Each node associated with a ranking score R
– Evenly to its neighbors with portion d– Retain portion (1-d)
![Page 7: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/7.jpg)
PageRank in MapReduce
0 1:2,31 1:32 1:0,3
3 1:44 1:1,2,3
map
map
2 1/23 1/2
reduce
reduce4 1 3 11 1/3
2 1/3
3 1/3
0 1/2
3 1/2
1 0.333 2.33
0 0.52 0.834 1
0 p:2,31 p:32 p:0,3
3 p:44 p:1,2,3
0 0.5:2,32 0.83:0,34 1:1,2,3
1 0.33:33 2.33:4
for the next MapReduce job
![Page 8: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/8.jpg)
Problems of Iterative Implementation
• 1. Multiple times job/task initializations
• 2. Re-shuffling the static data
• 3. Synchronization barrier between jobs
![Page 9: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/9.jpg)
Outline
• Introduction• Example: PageRank• System Design: iMapReduce• Evaluation• Conclusions
![Page 10: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/10.jpg)
iMapReduce Solutions
• 1. Multiple job/task initializations– Iterative processing
• 2. Re-shuffling the static data– Maintain static data locally
• 3. Synchronization barrier between jobs– Asynchronous map execution
![Page 11: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/11.jpg)
Iterative Processing
• Reduce Map– Load/dump once– Persistent map/reduce
task– Each reduce task has a
correspondent map task– Locally connected
![Page 12: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/12.jpg)
Maintain Static Data Locally
• Iterate state data– Update iteratively
• Maintain static data– Join with state data at
map phase
![Page 13: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/13.jpg)
State Data Static Data
0 12 14 1
1 13 1
map 0
map 1
2 1/23 1/2
reduce 0
reduce 14 1 3 11 1/3
2 1/3
3 1/3
0 1/2
3 1/2
1 0.333 2.33
0 0.52 0.834 1
for the next iteration
0 2,32 0,34 1,2,3
1 33 4
![Page 14: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/14.jpg)
Asynchronous Map Execution
• Persistent socket reduce -> map• Reduce completion directly trigger map• Map tasks no need to wait
MapReduce iMapReduce
time
![Page 15: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/15.jpg)
Outline
• Introduction• Example: PageRank• System Design: iMapReduce• Evaluation• Conclusions
![Page 16: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/16.jpg)
Experiment Setup • Cluster
– Local cluster: 4 nodes– Amazon EC2 cluster: 80 small instances
• Implemented algorithms– Shortest path, PageRank, K-Means
• Data sets
SSSP PageRank
![Page 17: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/17.jpg)
Results on Local Cluster
PageRank: google webgraph
iMapReduce
Hadoop
Sync. iMapReduce
Hadoop ex. initialization
![Page 18: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/18.jpg)
Results cont.
PageRank: google webgraph PageRank: Berk-Stan webgraph
SSSP: Facebook datasetSSSP: DBLP dataset
![Page 19: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/19.jpg)
Scale to Large Data Sets
• On Amazon EC2 cluster• Various-size graphs
PageRankSSSP
![Page 20: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/20.jpg)
Different Factors on Running Time Reduction
iMapReduce
Shuffling
Sync. mapsInit time
Hadoop
![Page 21: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/21.jpg)
Conclusion
• iMapReduce reduces running time by– Reducing the overhead of job/task
initialization– Eliminating the shuffling of static data– Allowing asynchronous map execution
• Achieve a factor of 1.2x to 5x speedup• Suitable for most iterative computations
![Page 22: iMapReduce : A Distributed Computing Framework for Iterative Computation](https://reader035.fdocuments.net/reader035/viewer/2022062501/568165d2550346895dd8e113/html5/thumbnails/22.jpg)
Q & A
• Thank you!