MapReduce


Transcript of MapReduce

Page 1: MapReduce

MapReduce
Ahmed Elmorsy

Page 2: MapReduce

What is MapReduce?

● MapReduce is a programming model for processing and generating large data sets.

● Inspired by the map and reduce primitives present in Lisp and many other functional languages.

● Use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily.

Page 3: MapReduce

Map function

Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key I and passes them to the Reduce function.

map (k1,v1) → list(k2,v2)

Page 4: MapReduce

Reduce function

The Reduce function, also written by the user, accepts an intermediate key I and a set of values for that key. It merges together these values to form a possibly smaller set of values. Typically just zero or one output value is produced per Reduce invocation.

reduce (k2,list(v2)) → list(v2)

Page 5: MapReduce

Example (Word Count)

Problem

Counting the number of occurrences of each word in a large collection of documents.

Page 6: MapReduce

Example (Word Count)

Map function:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

Page 7: MapReduce

Example (Word Count)

Reduce function:

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
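The same example in runnable Java, as a sketch against the classic Hadoop mapred API used by the template program later in these slides. The names WordCount, Map, and Reduce are illustrative, and the mapper assumes TextInputFormat, where the input key is the byte offset of the line:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Map: for every word in the input line, emit the pair (word, 1).
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          output.collect(word, ONE);
        }
      }
    }
  }

  // Reduce: sum the emitted counts for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
}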

Page 8: MapReduce
Page 9: MapReduce

Execution Overview

● The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.

● Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function.

● The number of partitions (R) and the partitioning function are specified by the user (a common default is sketched below).
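The paper's default for this partitioning function is hashing, i.e. hash(key) mod R; Hadoop ships the equivalent as HashPartitioner. A minimal Java sketch of that default (the method name is illustrative):

// Route an intermediate key to one of R reduce partitions.
// Masking the sign bit keeps the index non-negative.
static int partitionFor(String key, int R) {
  return (key.hashCode() & Integer.MAX_VALUE) % R;
}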

Page 10: MapReduce

How does the Master work?

● The master picks idle workers and assigns each one a map task or a reduce task.

● For each map task and reduce task, it stores the state (idle, in-progress, or completed), and the identity of the worker machine (for non-idle tasks); a minimal sketch of this bookkeeping follows.
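A minimal Java sketch of that per-task bookkeeping, assuming nothing beyond what the slide states; the type and field names are hypothetical, as the paper describes the state but not a concrete data structure:

// Per-task record kept by the master (names are illustrative).
enum TaskState { IDLE, IN_PROGRESS, COMPLETED }

class TaskInfo {
  boolean isMapTask;                 // map task or reduce task
  TaskState state = TaskState.IDLE;
  String workerId;                   // worker machine identity; null while idle
}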

Page 11: MapReduce

Fault Tolerance

● Worker Failure

● Master Failure

Page 12: MapReduce

Worker Failure

● The master pings every worker periodically.

● If no response is received from a worker, the master marks the worker as failed.

● Any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.
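Stitched together as code, this detection logic is just a timeout check. The sketch below reuses the TaskState/TaskInfo types from the master sketch above, with a hypothetical Worker record and timeout; none of this is from the slides or the Hadoop codebase:

import java.util.ArrayList;
import java.util.List;

// Hypothetical per-worker record; lastPingReplyMillis is updated
// whenever the worker answers a ping.
class Worker {
  long lastPingReplyMillis;
  List<TaskInfo> tasks = new ArrayList<>();
}

class FailureDetector {
  static final long TIMEOUT_MILLIS = 10_000;  // hypothetical timeout

  // Mark silent workers as failed and reset their tasks for rescheduling.
  void checkWorkers(List<Worker> workers) {
    long now = System.currentTimeMillis();
    for (Worker w : workers) {
      if (now - w.lastPingReplyMillis > TIMEOUT_MILLIS) {
        for (TaskInfo t : w.tasks) {
          // In-progress tasks, and completed *map* tasks (see the next
          // slide), are reset; completed reduce output survives in the
          // global file system.
          if (t.state == TaskState.IN_PROGRESS
              || (t.isMapTask && t.state == TaskState.COMPLETED)) {
            t.state = TaskState.IDLE;
            t.workerId = null;
          }
        }
      }
    }
  }
}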

Page 13: MapReduce

Worker Failure

● Any map tasks completed by the worker are reset back to their initial idle state, and therefore become eligible for scheduling on other workers (WHY?! Because their output is stored on the local disk of the failed machine and is therefore inaccessible).

● Completed reduce tasks do not need to be re-executed (WHY?! Because their output is stored in a global file system).

Page 14: MapReduce

Master Failure

There are two options:

1. Make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state.

2. Abort the MapReduce computation if the master fails.

Page 15: MapReduce

Backup Tasks

● One of the common causes that lengthens the total time taken for a MapReduce operation is a “straggler”: a machine that takes an unusually long time to complete one of the last few map or reduce tasks.

● When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.

● The task is marked as completed whenever either the primary or the backup execution completes.

Page 16: MapReduce

Refinements

1. Partitioning Function
2. Combiner Function (illustrated below)
3. Input and Output Types
4. Skipping Bad Records
5. Status Information
6. Counters
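Of these, the combiner function is easy to demonstrate with word count: since its reduce function is associative and commutative, the same class can pre-aggregate map output on each machine before it crosses the network. With the classic Hadoop API used in the template program later in these slides, that is a one-line addition to the job setup (job is the JobConf configured there):

// Partial, local merging of map output before the shuffle.
job.setCombinerClass(Reduce.class);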

Page 17: MapReduce

More Examples

● Distributed Grep (sketched after this list)

● Count of URL Access Frequency

● Reverse Web-Link Graph

● Inverted Index

● Distributed Sort
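As a taste of how these fit the model, here is a sketch of distributed grep in the same classic Hadoop API. Per the paper, the map function emits a line if it matches a supplied pattern, and the reduce is an identity that copies intermediate data to the output; the Java names and the pattern below are illustrative:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Distributed grep: the map emits matching lines; the reduce is the
// identity (in Hadoop, configure an identity or no-op reducer).
public class GrepMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, NullWritable> {

  private static final String PATTERN = "ERROR";  // illustrative pattern

  public void map(LongWritable key, Text value,
      OutputCollector<Text, NullWritable> output, Reporter reporter)
      throws IOException {
    if (value.toString().contains(PATTERN)) {
      output.collect(value, NullWritable.get());
    }
  }
}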

Page 18: MapReduce

Apache Hadoop

An open-source implementation of MapReduce.

Page 19: MapReduce

Hadoop Modules

● Hadoop Common

● Hadoop Distributed File System (HDFS™)

● Hadoop YARN

● Hadoop MapReduce

Page 20: MapReduce

Projects based on Hadoop

● Apache Hive: developed by Facebook and used by Netflix.

● Apache Pig: developed at Yahoo! and used by Twitter.

● Apache Cassandra: developed by Facebook.

Page 21: MapReduce

Template Hadoop Program

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

  public static class MapClass extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {

    public void map(Text key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Map function body goes here
    }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      // Reduce function body goes here
    }
  }

Page 22: MapReduce

  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    JobConf job = new JobConf(conf, MyJob.class);

    Path in = new Path(args[0]);
    Path out = new Path(args[1]);
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);

    job.setJobName("MyJob");
    job.setMapperClass(MapClass.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormat(KeyValueTextInputFormat.class);
    job.setOutputFormat(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.set("key.value.separator.in.input.line", ",");

    JobClient.runJob(job);
    return 0;
  }

Page 23: MapReduce

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new MyJob(), args);
    System.exit(res);
  }
}

To run it, generate the JAR file, then use the command:

bin/hadoop jar playground/MyJob.jar MyJob input/cite75_99.txt output

Page 24: MapReduce

Readings

Chapter 4 in Lam, Chuck. Hadoop in Action. Manning Publications Co., 2010.

Page 25: MapReduce

References

[1] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.

[2] Lam, Chuck. Hadoop in Action. Manning Publications Co., 2010.

[3] http://hadoop.apache.org/