An Enhanced MapReduce Model (on BSP)


Transcript of An Enhanced MapReduce Model (on BSP)

Page 1: An Enhanced MapReduce Model (on BSP)

Weekly Seminar

An Enhanced MapReduce Model

Yu LIU@NII, 2013-02-04

Page 2: An Enhanced MapReduce Model (on BSP)

Architecture of MapReduce

This is the standard MapReduce processing flow (a code sketch follows below):
1. MAP
2. Shuffle (sort omitted)
3. REDUCE

Suppose we have a 3-node cluster. Inside the cluster, there is a file which is split into 6 splits

The total number of slots for parallel MAP tasks is 3 (one per node)

When MAP task Tm1 finishes, Tm4 will be spawned on node 1

batch-oriented
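
To make the flow concrete, below is a minimal sketch of the classic word-count job on the standard Hadoop MapReduce API (the job driver and configuration are omitted):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word count on the standard Hadoop MapReduce API (job driver omitted).
public class WordCount {

    // MAP: emit (word, 1) for every token in the task's input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // REDUCE: after the shuffle groups the pairs by word, sum the counts.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }
}
```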

Page 3: An Enhanced MapReduce Model (on BSP)

Architecture of MapReduce

Batch execution model: the entire output of each map and reduce task is materialized to a local file before it can be consumed by the next stage.

Such materialization is often argued to be inefficient, but it is an important part of MapReduce's fault-tolerance (FT) strategy.

Page 4: An Enhanced MapReduce Model (on BSP)

Architecture of MapReduce

Want to make some changes? If we introduce barriers, or functions that keep previously spawned tasks running, some MAP tasks might be blocked.

Long-running MAP tasks actually change the whole system's behavior: scheduling, fault tolerance, and so on. This cannot be implemented simply.

MapReduce Online [NSDI'10], Pregel [SIGMOD'10], MapReduce vs. BSP [ICCS'12]

Page 5: An Enhanced MapReduce Model (on BSP)

Modifications and Alternatives

MapReduce Online (HOP)
Google's Pregel (BSP)
Hama [CloudCom'10] (BSP, on Hadoop)

Page 6: An Enhanced MapReduce Model (on BSP)

Long Running Jobs

HOP (Hadoop Online Prototype): long-running jobs
- Data are pipelined between tasks and between jobs
- Approximate results are available before jobs finish
- Retains the fault-tolerance properties of Hadoop
- Programming interfaces are almost the same

Page 7: An Enhanced MapReduce Model (on BSP)

HOP Details (inside a job)

MAP and REDUCE tasks exist simultaneously

Pipelines between MAP and REDUCE (sketched below)
- Results are sent from a MAP process to a REDUCE process
- The output of a MAP process is buffered in memory

Scheduling of MAP and REDUCE tasks
- Resolves the blocking problems (free slots and so on); omitted here
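
A minimal, runnable illustration of the pipelining idea (not HOP's actual code): map output is buffered in memory and consumed by a concurrently running reduce task, so partial results are available before the map side finishes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustration (not HOP's code): map output is buffered in memory and
// pipelined to a concurrently running reduce task, instead of being
// materialized to a local file first.
public class PipelineSketch {
    private static final Entry<String, Integer> EOF = new SimpleEntry<>(null, 0);

    public static void main(String[] args) throws Exception {
        BlockingQueue<Entry<String, Integer>> buffer = new ArrayBlockingQueue<>(1024);

        Thread mapTask = new Thread(() -> {
            try {
                for (String token : "a b a c b a".split(" "))
                    buffer.put(new SimpleEntry<>(token, 1)); // in-memory map output
                buffer.put(EOF);                             // end-of-stream marker
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread reduceTask = new Thread(() -> {
            try {
                Map<String, Integer> counts = new HashMap<>();
                for (Entry<String, Integer> e = buffer.take(); e != EOF; e = buffer.take())
                    counts.merge(e.getKey(), e.getValue(), Integer::sum);
                System.out.println(counts);                  // {a=3, b=2, c=1}
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        mapTask.start();   // MAP and REDUCE tasks run simultaneously
        reduceTask.start();
        mapTask.join();
        reduceTask.join();
    }
}
```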

Page 8: An Enhanced MapReduce Model (on BSP)

HOP Details (between jobs)

The reduce tasks of one job can optionally pipeline their output directly to the map tasks of the next job, sidestepping the need for expensive fault-tolerant storage in HDFS

In some sense, this “overlaps” the first job's REDUCE step with the second job's MAP step (they are not truly overlapped)

Page 9: An Enhanced MapReduce Model (on BSP)

HOP Functionality

Online aggregation
- Single-job online aggregation (SQL queries, ...)
- Multi-job online aggregation

Continuous queries
- Process stream data (MapReduce jobs that run continuously, accepting new data as it becomes available and analyzing it immediately)

Monitoring …

Page 10: An Enhanced MapReduce Model (on BSP)

Evaluation

Omitted here; in general, HOP is much faster for some problems

Paper: MapReduce Online [NSDI'10]

Page 11: An Enhanced MapReduce Model (on BSP)

BSP-Style Frameworks

Pregel and Hama
- Different programming interfaces (PIs)
- Long-running services (tasks)
- Prefer in-memory processing (Pregel)

Page 12: An Enhanced MapReduce Model (on BSP)

Hama Examples

Unlike MapReduce, the main programming interface (PI) is a compute function (for a vertex), sketched below.
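
A hypothetical sketch of such a vertex-centric compute function; the Vertex base class and its method names are illustrative assumptions in the spirit of Pregel/Hama, not Hama's exact API.

```java
import java.util.List;

// Hypothetical sketch of a vertex-centric compute function in the spirit of
// Pregel/Hama. The Vertex base class and method names below are illustrative
// assumptions, not Hama's exact API.
abstract class Vertex {
    double value;                                  // the vertex's current value
    abstract void compute(List<Double> messages);  // called once per superstep
    void sendMessageToNeighbors(double msg) { /* delivered next superstep */ }
    long getSuperstep() { return 0; }              // current superstep number
    void voteToHalt() { /* deactivate this vertex */ }
}

// PageRank-style example: each vertex sums its incoming messages, updates its
// rank, and forwards rank/outDegree to its neighbors.
class PageRankVertex extends Vertex {
    int outDegree = 1;

    @Override
    void compute(List<Double> messages) {
        if (getSuperstep() > 0) {
            double sum = 0;
            for (double m : messages) sum += m;
            value = 0.15 + 0.85 * sum;             // damping factor 0.85
        }
        if (getSuperstep() < 30) {
            sendMessageToNeighbors(value / outDegree);
        } else {
            voteToHalt();                          // stop after a fixed iteration count
        }
    }
}
```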

Page 13: An Enhanced MapReduce Model (on BSP)

Hama Examples

Or a bsp function (for iterative computation), sketched below.
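
A hypothetical sketch of one superstep written against a bsp function; the Peer interface and its method names are illustrative assumptions in the spirit of Hama's BSP API, not its exact signatures.

```java
// Hypothetical sketch of a bsp function in the spirit of Hama's BSP API.
// The Peer interface and its method names are illustrative assumptions,
// not Hama's exact signatures.
interface Peer {
    void send(String peerName, double msg);  // asynchronous message send
    Double getCurrentMessage();              // null when the inbox is empty
    String[] getAllPeerNames();
    void sync() throws InterruptedException; // superstep barrier
}

class AverageBSP {
    // One superstep: broadcast a local value, sync, then combine all messages.
    void bsp(Peer peer, double localValue) throws InterruptedException {
        for (String name : peer.getAllPeerNames())
            peer.send(name, localValue);     // send phase
        peer.sync();                         // barrier: all messages delivered
        double sum = 0;                      // receive phase
        int n = 0;
        Double m;
        while ((m = peer.getCurrentMessage()) != null) { sum += m; n++; }
        System.out.println("global average = " + (n > 0 ? sum / n : 0.0));
    }
}
```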

Page 14: An Enhanced MapReduce Model (on BSP)

A Summary

HOP changes the behavior of Hadoop tasks but keeps almost the same programming interfaces and programming patterns:

- map and reduce functions
- MAP* + REDUCE pattern

BSP provides a different style of PIs and also different programming patterns:

- compute and bsp functions (sync, sendMessage, ...)
- Superstep pattern

Page 15: An Enhanced MapReduce Model (on BSP)

My Proposal

More flexible MapReduce
- Combines the advantages of both MapReduce and BSP
- A small step from the work of HOP
- A small step from the work of BSP

Page 16: An Enhanced MapReduce Model (on BSP)

MapReduce (+BSP)

New patterns
- MAP* + REDUCEG*
- REDUCEL* + MAP* + REDUCEG*

MAP* = receiveMsg + MAP + sendMsg + sync
REDUCE* = receiveMsg + REDUCE + sendMsg + sync
(REDUCEG/REDUCEL denote a global/local REDUCE; see Page 19)

MapReduce-style batch processing

BSP (Hama) style: receiveMsg/sendMsg + sync

Long-running tasks

Executor::map/reduce; the Executor holds the map and reduce functions

Indexed Executors (each has an ID and a name); a sketch follows
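
A hypothetical sketch of this Executor abstraction; all names here are illustrative assumptions, not the prototype's actual API.

```java
import java.util.List;

// Hypothetical sketch of the proposed Executor abstraction: a long-running,
// indexed worker that holds the map and reduce functions and exposes
// BSP-style messaging hooks. All names are illustrative assumptions,
// not the prototype's actual API.
abstract class Executor<K, V> {
    final int id;          // indexed executors carry an ID ...
    final String name;     // ... and a name

    Executor(int id, String name) { this.id = id; this.name = name; }

    // MAP* = receiveMsg + MAP + sendMsg + sync
    abstract void map(K key, V value, Context<K, V> ctx);

    // REDUCE* = receiveMsg + REDUCE + sendMsg + sync
    abstract void reduce(K key, List<V> values, Context<K, V> ctx);

    // Runtime-provided context for output and BSP-style messaging.
    interface Context<K, V> {
        void write(K key, V value);               // emit output (may be empty)
        void addMsg(int executorId, V msg);       // put a message in the message box
        List<V> receiveMsgs();                    // messages from the previous superstep
        void sync() throws InterruptedException;  // BSP-style barrier
    }
}
```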

Page 17: An Enhanced MapReduce Model (on BSP)

Architecture of MapReduce*

Executors are long-running processes

In MAP phases, each executor invokes its map method on each input item

While map processing is in progress, “messages” can be added to the “message box”

The messages are sent asynchronously, and a BSP-style barrier ensures that all messages are delivered and received before output is generated (note that the output could be empty)

Page 18: An Enhanced MapReduce Model (on BSP)

Architecture of MapReduce*

Similar to the MAP phase, in the REDUCE phase each executor invokes the reduce function on its input list

Still, messages can be sent and received

Page 19: An Enhanced MapReduce Model (on BSP)

Architecture of MapReduce*

Programming patterns (need more analysis)
- Not necessarily always MAP → REDUCE, but also REDUCE → MAP
- This REDUCE is a local REDUCE; in Hadoop we usually use map to implement it, but it is actually a local REDUCE
- MAP and REDUCE phases should actually be free: with MapReduce*, logical MAP and REDUCE do not cause heavy memory-to-disk synchronization, so we can arrange MAP and REDUCE freely

Page 20: An Enhanced MapReduce Model (on BSP)

Lightweight MAP/REDUCE Phases

For example, scan: Hadoop needs a two-phase MAP
- The 1st MAP tasks compute (local) sums of each split
- The 2nd MAP tasks compute the final result (the 1st REDUCE is omitted)

With MapReduce*, these two phases are computed by the same Executors (sketched below)
- No need to spawn new MAP tasks
- No need even to re-read the input file (but if we don't have enough memory, we can still simply re-open the input splits)

[Usually, writing to disk or transferring through the network is more costly than reading from the local file system]
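
A minimal, runnable illustration of this scan (prefix-sum) pattern with long-running executors, using plain Java threads and a barrier; this sketches the idea, not the prototype's code.

```java
import java.util.concurrent.CyclicBarrier;

// Runnable illustration (not the prototype's code) of computing a scan
// (prefix sums) in two lightweight phases executed by the SAME long-running
// executors: local sums -> message exchange + barrier -> final pass.
public class ScanSketch {
    static final int P = 3;                        // number of executors
    static final long[] localSums = new long[P];   // stands in for the message box
    static final CyclicBarrier barrier = new CyclicBarrier(P);

    public static void main(String[] args) throws Exception {
        int[][] splits = { {1, 2, 3}, {4, 5}, {6, 7, 8, 9} }; // the input splits
        long[][] result = new long[P][];
        Thread[] executors = new Thread[P];
        for (int p = 0; p < P; p++) {
            final int id = p;
            executors[p] = new Thread(() -> {
                try {
                    // Phase 1 (1st MAP): local sum of this executor's split.
                    long sum = 0;
                    for (int x : splits[id]) sum += x;
                    localSums[id] = sum;           // "sendMsg" to all peers
                    barrier.await();               // BSP-style sync
                    // Phase 2 (2nd MAP): the SAME executor computes its prefix
                    // offset and the final prefix sums; no new task is spawned.
                    long acc = 0;
                    for (int q = 0; q < id; q++) acc += localSums[q];
                    long[] out = new long[splits[id].length];
                    for (int i = 0; i < out.length; i++) {
                        acc += splits[id][i];
                        out[i] = acc;
                    }
                    result[id] = out;
                } catch (Exception e) { throw new RuntimeException(e); }
            });
            executors[p].start();
        }
        for (Thread t : executors) t.join();
        for (long[] part : result)                 // prints 1 3 6 10 15 21 28 36 45
            for (long v : part) System.out.print(v + " ");
    }
}
```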

Page 21: An Enhanced MapReduce Model (on BSP)

Architecture of MapReduce*

Makes model/program transformations much easier

… this needs to be proven. Currently, I have implemented Scan/Accumulation, and the results look good.

Lower cost than the original Hadoop/MapReduce

Compatible with original Hadoop/MapReduce programs
- At which level to keep the compatibility still needs to be considered (future work)

Page 22: An Enhanced MapReduce Model (on BSP)

Examples

A MAP task programming interface: map and addMsg, together with a Context object (the current implementation is just a prototype and still uses some Hama APIs underneath); a sketch follows.
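
A hypothetical, standalone sketch of such an interface; MsgContext and all method names here are illustrative assumptions, not the prototype's real API.

```java
// Hypothetical, standalone sketch of the prototype's MAP-task interface:
// a map function plus addMsg on a context object. MsgContext and the method
// names are illustrative assumptions; the real prototype reportedly wraps
// Hama APIs underneath.
interface MsgContext<K, V> {
    void write(K key, V value);          // emit MAP output (may be empty)
    void addMsg(int executorId, V msg);  // add a message to the message box
}

// Example MAP task: accumulate a local sum while scanning the split, then
// ship the partial sum to executor 0 as a message before the barrier.
class SumMapTask {
    private long localSum = 0;

    // Invoked once per input item during the MAP phase.
    void map(Long key, Long value, MsgContext<Long, Long> ctx) {
        localSum += value;
    }

    // Invoked when the split is exhausted.
    void close(MsgContext<Long, Long> ctx) {
        ctx.addMsg(0, localSum);  // hypothetical: send the local sum to executor 0
    }
}
```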

Page 23: An Enhanced MapReduce Model (on BSP)

A Summary

Combines the advantages of both HOP and BSP
- Avoids the heavy “materialization” between MAP and REDUCE
- Efficient communication between MAP tasks, between REDUCE tasks, and from MAP tasks to REDUCE tasks
- Intermediate state can be inherited from the MAP phase to the REDUCE phase (through long-running Executors)
- Messages are also materialized (for fault tolerance); they need not stay in memory (saves memory and is good for FT)

Page 24: An Enhanced MapReduce Model (on BSP)

A Summary(continue)

No harm to fault tolerance (as currently understood)

Keeps the programming interfaces of MapReduce (almost the same)

More flexible style than Hadoop/HOP

Compatible with original Hadoop/MapReduce programs (depending on the implementation)

Page 25: An Enhanced MapReduce Model (on BSP)

Current Status

Have a simplified prototype
- Implemented using Hama (messages, sync) and Hadoop (HDFS)
- Workable (tested with some examples, with good performance)

Further work
- Theoretical analysis of the programming patterns
- Implementation (1 month)

Page 26: An Enhanced MapReduce Model (on BSP)

Performance

100 × 2^20 items (200 MB)
- 2-pass MR (Liu's impl.): 23 s + 24 s
- 1-pass MR (Tung's impl.): 3-4 min (the job failed due to the input data)
- MapReduce*: 22 s

I have test results from bigger data sets; omitted here