VLDB, August 2012 (to appear) Avi Shinnar, David Cunningham, Ben Herta, Vijay Saraswat.
M3R: INCREASED PERFORMANCE FOR IN-MEMORY HADOOP JOBS
VLDB, August 2012 (to appear)
Avi Shinnar, David Cunningham, Ben Herta, Vijay Saraswat
BACKGROUND
The Hadoop Map Reduce engine has had a transformational effect on the practice of Big Data computing.
It is based on HDFS (a resilient distributed filesystem): data is automatically partitioned across nodes and operations are applied in parallel.
Its remarkable properties: simple, widely applicable, and a parallelizable, scalable, resilient framework.
DISADVANTAGES
Design point: offline, long-lived, resilient computations.
The HMR API supports only single-job execution.
Each job incurs I/O and (de-)serialization costs: mappers and reducers for each job are started in new JVMs, and JVMs typically have high startup cost.
An out-of-core shuffle implementation is used.
These choices have a substantial effect on performance, yet we need interactive analytics.
M3R: A NEW DESIGN POINT
M3R (Main Memory Map Reduce) is a new implementation of the HMR API.
M3R/Hadoop: an implementation of the HMR API in managed X10.
Existing Hadoop applications just work; HDFS (and some other parts of Hadoop) are reused.
In-memory: the problem size must fit in cluster RAM.
Not resilient: if any node goes down, the job fails.
But considerably faster (closer to HPC speeds).
X10
A type-safe, object-oriented, multi-threaded, multi-node, garbage-collected programming language.
X10 is built on two fundamental notions: places and asynchrony.
Place: also called a process; supplies memory and worker threads; a collection of resident mutable data objects and the activities that operate on them.
Asynchrony: used both within a place and for communication across places.
ADVANTAGES IN M3R
Reducing Disk I/O
Reducing network communication
Reducing serialization/deserialization cost.
M3R affords significant benefits for job pipelines.
OUTLINE
HMR engine execution flow
M3R engine execution flow
Evaluation
Conclusions
Future work
BASIC FLOW FOR A HADOOP JOB
[Figure: Input (InputFormat/RecordReader/InputSplit) reads from the File System (HDFS) and feeds Map (Mapper); map output goes through a file-system-backed Shuffle to Reduce (Reducer); Output (OutputFormat/RecordWriter/OutputCommitter) writes results back to the file system. Every edge carries costs: disk I/O and (de-)serialization at input and output, and network I/O plus disk I/O and (de-)serialization through the shuffle.]
How can we eliminate these I/Os? M3R.
M3R EXECUTION FLOW
The general flow of M3R is similar to that of the HMR engine, but an M3R instance is associated with a fixed set of JVMs.
This yields significant benefits in avoiding network, file I/O, and (de-)serialization costs, especially for job pipelines, via four techniques: an input/output cache, co-location, partition stability, and de-duplication.
INPUT/OUTPUT CACHE
M3R introduces an in-memory key/value cache.
On input, M3R caches key/value pairs in memory before passing them to the mapper; on output, it caches them before serializing them and writing them to disk.
A job that needs the same key/value sequence can bypass the file system and read it directly from the cache. As the data is stored in memory, there are no attendant (de-)serialization costs or disk/network I/O activity.
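A minimal sketch of the idea in plain Java (hypothetical names, not the actual M3R implementation): a cache keyed by input split that returns the in-memory key/value pairs when present, and invokes the loader (which would read and deserialize from HDFS) only on a miss.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of an input/output cache: key/value pairs are kept
// in memory per input split, so later jobs in a pipeline can bypass
// file-system reads and deserialization entirely.
class KeyValueCache<K, V> {
    private final Map<String, List<Map.Entry<K, V>>> cache = new LinkedHashMap<>();

    // Return cached pairs for this split, or load (deserialize) them once.
    List<Map.Entry<K, V>> getOrLoad(String splitId,
                                    Supplier<List<Map.Entry<K, V>>> loader) {
        return cache.computeIfAbsent(splitId, id -> loader.get());
    }
}
```

Only the first request for a split pays the load cost; every later request is served from memory.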
BASIC FLOW FOR AN M3R/HADOOP JOB
[Figure: the same stages as the Hadoop flow: Input (InputFormat/RecordReader/InputSplit) over the File System (HDFS), Map (Mapper), Shuffle, Reduce (Reducer), Output (OutputFormat/RecordWriter/OutputCommitter), but with an in-memory cache in front of the file system and an in-memory shuffle, eliminating disk and network I/O and (de-)serialization costs, especially for the shuffle.]
Single job: disk I/O is eliminated; the file-system backing for the two sides of the shuffle is removed.
Job pipelines: no network or disk I/O, and no (de-)serialization costs.
SHUFFLE
Shuffle describes the path data takes from map task output to reduce task input.
Most map tasks and reduce tasks execute on different nodes, so a reducer must pull the results of map tasks from other nodes across the network (network I/O).
Goals of the shuffle:
Pull data completely from the map side to the reduce side.
When pulling data across nodes, minimize unnecessary bandwidth consumption.
Reduce the impact of disk I/O on task execution.
The main optimization opportunities are to reduce the amount of data pulled and to use memory rather than disk wherever possible.
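As an illustration (plain Java, not engine code), the logical effect of a shuffle is simply to regroup mapper output by key, so each reducer sees every value for the keys it owns:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration of what a shuffle computes logically: group the
// (key, value) pairs emitted by all mappers by key, so each reducer
// receives all values for the keys routed to it.
class ShuffleSketch {
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapOutput) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }
}
```

In HMR this regrouping is backed by disk and the network; M3R's contribution is doing it in memory wherever possible.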
ELIMINATE NETWORK I/O AND DISK I/O
Co-location: start multiple mappers and reducers in each place.
Some of the data a mapper sends is destined for a reducer running in the same JVM.
For that data, the M3R engine guarantees that no network or disk I/O is involved.
MINIMIZE THE AMOUNT THAT NEEDS TO BE COMMUNICATED
We cannot entirely avoid the time and space overhead of (de-)serialization in the shuffle: the nodes need to communicate.
But we can reduce the amount that needs to be communicated.
MAPPERS/SHUFFLE/REDUCERS
[Figure: six mappers (Mapper1-Mapper6) connected through the Shuffle to six reducers (Reducer1-Reducer6).]
Through the shuffle, the mappers send data to various reducers.
M3R: PARTITION STABILITY
M3R provides a partition stability guarantee: the mapping from partitions to places is deterministic.
This allows job sequences to use a consistent partitioner to route data locally.
The reducer associated with a given partition number will always run at the same place; same place means same memory, so existing data structures can be reused.
This avoids a significant amount of communication.
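A hedged sketch of the guarantee (a simple modulo scheme is assumed here for illustration; the paper does not commit to this exact function): because the partition-to-place mapping depends only on the partition number and the number of places, the reducer for a partition lands in the same JVM in every job of a pipeline.

```java
// Sketch of a deterministic partition-to-place mapping (assumption:
// a plain modulo scheme, not necessarily M3R's actual function).
// Determinism is what matters: the same partition always maps to the
// same place, so reducer state in that JVM's memory can be reused
// across the jobs of a pipeline.
class PartitionStability {
    static int placeFor(int partition, int numPlaces) {
        return partition % numPlaces;
    }
}
```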
PARTITIONER: CONNECTING MAPPERS AND REDUCERS
[Figure: the Partitioner sits between the six mappers (Mapper1-Mapper6) and the six reducers (Reducer1-Reducer6) in the shuffle; for each pair it computes int partitionNumber = getPartition(key, value); to choose the destination reducer.]
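For concreteness, a minimal partitioner sketch modeled on Hadoop's default hash partitioner (simplified signature, not the exact Hadoop API):

```java
// Sketch of a hash partitioner in the style of Hadoop's default:
// mask off the sign bit of the key's hash, then take it modulo the
// number of reducers, so every key maps to a valid partition number
// in [0, numPartitions).
class HashPartitionerSketch {
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Under M3R's partition stability guarantee, this partition number also determines, deterministically, the place where the owning reducer runs.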
DE-DUPLICATION
Because M3R co-locates reducers, it can coalesce duplicate keys and duplicate values and send only one copy.
On deserialization at the destination, the duplicates become aliases to that single copy.
This also works when multiple mappers at a single place send the same data.
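A rough sketch of the sender side in plain Java (hypothetical, not M3R code): interning each value before sending means equal values are shipped, and later aliased, only once.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of sender-side de-duplication: equal values are replaced by
// one canonical instance, so only one copy needs to be serialized and
// shipped; receivers end up holding aliases to that single copy.
class DeDup<V> {
    private final Map<V, V> canonical = new HashMap<>();

    V intern(V value) {
        V first = canonical.putIfAbsent(value, value);
        return first == null ? value : first;
    }
}
```

Broadcast falls out as a special case: when every mapper sends the same value to every reducer, only one copy per place actually travels.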
HADOOP BROADCAST
[Figure: in Hadoop, a broadcast sends a separate copy of the data from each mapper (Mapper1-Mapper6) through the shuffle to every reducer (Reducer1-Reducer6).]
M3R BROADCAST VIA DE-DUPLICATION
[Figure, two slides: with de-duplication, only one copy of the broadcast data is sent from each place through the shuffle; the co-located reducers (Reducer1-Reducer6) receive aliases to that single copy.]
EXAMPLE: ITERATED MATRIX VECTOR MULTIPLICATION IN HADOOP
[Figure: each iteration runs two jobs over the File System (HDFS). Job 1: Input (G) feeds Map/Pass (G) and Input (V) feeds Map/Bcast (V); both shuffle into Reducer (*), which writes Output V#. Job 2: Input (V#) feeds Map/Pass (V#), which shuffles into Reducer (+) and writes Output V'. G and the intermediate V# cross the file system and the shuffle on every iteration.]
ITERATED MATRIX VECTOR MULTIPLICATION IN M3R
[Figure: the same pipeline with the cache and the in-memory shuffle. G is read from HDFS once and not re-communicated across iterations ("do not communicate G"), and the Map/Pass (V#) to Reducer (+) stage involves no communication at all.]
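For reference (plain Java, not MapReduce code), the computation being iterated is just a matrix-vector product; the point of the M3R flow is that G stays resident in memory across iterations instead of being re-read and re-shuffled every time.

```java
// The computation each iteration performs: V' = G * V
// (shown dense here for simplicity; the MapReduce version partitions
// G and V across places and reuses G in memory across iterations).
class MatVec {
    static double[] multiply(double[][] g, double[] v) {
        double[] out = new double[g.length];
        for (int i = 0; i < g.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < v.length; j++) {
                sum += g[i][j] * v[j];
            }
            out[i] = sum;
        }
        return out;
    }
}
```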
EVALUATION
A 20-node cluster of IBM LS-22 blades connected by Gigabit Ethernet.
Each node has two quad-core 2.3 GHz AMD Opteron processors and 16 GB of memory, and runs Red Hat Enterprise Linux 6.2.
The JVM used is IBM J9 1.6.0. When running M3R on this cluster, we used one process per host, with 8 worker threads to exploit the 8 cores.
Without partition stability and without the cache, every iteration takes the same amount of time.
Performance changes drastically according to the amount of remote shuffling.
CONCLUSIONS
We sacrifice resilience and out-of-core execution, and gain performance.
We used X10 to build a fast map/reduce engine, and used X10 features to implement the distributed cache.
This avoids serialization, disk, and network I/O costs.
Up to 50x faster for a Hadoop app designed for M3R.
Thank you for your time!