Data Engineering
How MapReduce Works
Shivnath Babu
Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a MapReduce job
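The map and reduce functions above are usually introduced with word count. As a rough sketch of what the framework does with them, here is a simulation in plain Python (no Hadoop involved; `map_fn`, `reduce_fn`, and `run_job` are illustrative names, not Hadoop API):

```python
from collections import defaultdict

# Map function: emit (word, 1) for every word in an input line.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce function: sum all counts for one key.
def reduce_fn(key, values):
    return (key, sum(values))

# Simulate the framework: run map over all lines, group values by key
# (the shuffle), then run reduce once per key.
def run_job(lines):
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(run_job(["the quick fox", "the lazy dog"]))
# {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The grouping step in the middle is what the real framework implements as the shuffle between map and reduce waves.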
Lifecycle of a MapReduce Job
[Timeline diagram: input splits feed Map Wave 1 and Map Wave 2; Reduce Wave 1 and Reduce Wave 2 follow along the time axis.]
Job Submission → Initialization → Scheduling → Execution
[Diagram: map tasks and reduce tasks read from and write to the Hadoop Distributed File-System (HDFS).]
Lifecycle of a MapReduce Job
How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?
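For splits specifically, Hadoop's FileInputFormat derives a split size from the configured minimum, the configured maximum, and the HDFS block size, and roughly one map task is created per split (the number of reduce tasks is a separate job setting). A sketch of that calculation, with defaults shown as assumptions:

```python
import math

# Sketch of Hadoop's FileInputFormat split-size rule:
#   splitSize = max(minSize, min(maxSize, blockSize))
# Defaults below (minSize=1, maxSize=Long.MAX_VALUE, 128 MB blocks)
# are typical, but configurable per cluster and per job.
def split_size(min_size, max_size, block_size):
    return max(min_size, min(max_size, block_size))

def num_splits(file_length, min_size=1, max_size=2**63 - 1,
               block_size=128 * 1024 * 1024):
    # One split (and hence roughly one map task) per split_size bytes.
    return math.ceil(file_length / split_size(min_size, max_size, block_size))

# A 1 GB file with 128 MB blocks yields 8 splits / map tasks.
print(num_splits(1024 * 1024 * 1024))  # 8
```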
[Diagram: Map Waves 1-3 followed by a shuffle into Reduce Waves 1-2.]
What if the number of reduces is increased to 9?
[Diagram: the same map waves now feed three reduce waves (Reduce Waves 1-3).]
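Behind the shuffle, each map output key is routed to exactly one reduce task; with Hadoop's default HashPartitioner this is essentially hash(key) mod #reduces, so raising the reduce count from 2 to 9 only changes the modulus (and the number of reduce waves needed when tasks outnumber slots). A toy sketch, where Python's `hash` stands in for Java's `hashCode` (note it is randomized across Python runs):

```python
# Route a key to a reduce task, in the style of Hadoop's HashPartitioner.
def partition(key, num_reduces):
    return hash(key) % num_reduces

keys = ["apple", "banana", "cherry", "date"]
for n in (2, 9):
    # Same keys, different modulus: changing #reduces re-routes keys
    # but every key still lands on exactly one reduce task.
    print(n, {k: partition(k, n) for k in keys})
```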
Spark Programming Model
14
User (Developer) writes:

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map

[Diagram: the Driver Program (SparkContext) talks to a Cluster Manager, which schedules work onto Worker Nodes; each Worker Node runs an Executor with a Cache and a set of Tasks, reading data from HDFS Datanodes.]
Taken from http://www.slideshare.net/fabiofumarola1/11-from-hadoop-to-spark-12
Spark Programming Model
15
User (Developer) writes:

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map

RDD (Resilient Distributed Dataset)
• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators
RDD
16
Programming interface: the programmer can perform 3 types of operations.

Transformations
• Create a new dataset from an existing one.
• Lazy in nature: they are executed only when some action is performed.
• Examples: map(func), filter(func), distinct()

Actions
• Return a value to the driver program, or export data to a storage system, after performing a computation.
• Examples: count(), reduce(func), collect(), take()

Persistence
• For caching datasets in memory for future operations.
• Option to store on disk, in RAM, or mixed (Storage Level).
• Examples: persist(), cache()
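The lazy-transformation versus eager-action split can be mimicked with a small toy class (`MiniRDD` is hypothetical, for illustration only, not Spark API): transformations merely record an operator on an immutable source, and nothing executes until an action is called.

```python
# Toy illustration of lazy transformations vs eager actions.
class MiniRDD:
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops          # recorded transformations, not yet run

    # Transformations: return a NEW MiniRDD, evaluate nothing (lazy).
    def map(self, f):
        return MiniRDD(self._data, self._ops + (("map", f),))

    def filter(self, p):
        return MiniRDD(self._data, self._ops + (("filter", p),))

    # Actions: run the recorded pipeline and return a value to the driver.
    def collect(self):
        out = list(self._data)
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.count())      # 5 -> the pipeline runs only now
print(rdd.collect())    # [0, 4, 16, 36, 64]
```

Building `rdd` executes nothing; each transformation returns a new object, mirroring RDD immutability.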
How Spark works
17
• RDD: a parallel collection with partitions
• A user application can create RDDs, transform them, and run actions.
• This results in a DAG (Directed Acyclic Graph) of operators.
• The DAG is compiled into stages.
• Each stage is executed as a collection of tasks (one task for each partition).
Summary of Components
18
• Task: the fundamental unit of execution in Spark
• Stage: a collection of tasks that run in parallel
• DAG: the logical graph of RDD operations
• RDD: a parallel dataset of objects with partitions
Example
19
sc.textFile("/wiki/pagecounts")
[DAG so far: textFile → RDD[String]]
Example
20
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
[DAG so far: textFile → RDD[String], map → RDD[List[String]]]
Example
21
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(R => (R(0), R(1).toInt))
[DAG so far: textFile → RDD[String], map → RDD[List[String]], map → RDD[(String, Int)]]
Example
22
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(R => (R(0), R(1).toInt))
  .reduceByKey(_ + _)
[DAG so far: textFile → RDD[String], map → RDD[List[String]], map → RDD[(String, Int)], reduceByKey → RDD[(String, Int)]]
Example
23
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(R => (R(0), R(1).toInt))
  .reduceByKey(_ + _, 3)
  .collect()
[DAG: textFile → RDD[String], map → RDD[List[String]], map → RDD[(String, Int)], reduceByKey → RDD[(String, Int)], collect → Array[(String, Int)]]
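To see what each RDD in the pipeline holds, the same steps can be simulated over plain Python lists (the input lines below are made up; no Spark required):

```python
from collections import defaultdict

# Stand-in for the lines read from /wiki/pagecounts.
lines = ["en\t12", "fr\t3", "en\t5", "de\t7"]

records = [line.split("\t") for line in lines]   # like map(line => line.split("\t"))
pairs = [(r[0], int(r[1])) for r in records]     # like map(R => (R(0), R(1).toInt))

sums = defaultdict(int)                          # like reduceByKey(_ + _)
for page, count in pairs:
    sums[page] += count

result = sorted(sums.items())                    # like collect() on the driver
print(result)  # [('de', 7), ('en', 17), ('fr', 3)]
```

Each intermediate list corresponds to one of the RDD types in the DAG above; the per-key summation is what `reduceByKey` distributes across 3 partitions in the real job.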
Execution Plan
24
Stages are sequences of RDDs that don't have a shuffle in between.
[Diagram: textFile → map → map → reduceByKey → collect, split into Stage 1 and Stage 2 at the shuffle.]
Execution Plan
25
[Diagram: textFile → map → map → reduceByKey (Stage 1) | collect (Stage 2)]
Stage 1:
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data
Stage 2:
1. Read shuffle data
2. Final reduce
3. Send result to driver program
Stage Execution
26
• Create a task for each partition in the input RDD
• Serialize the task
• Schedule and ship tasks to slaves
And all this happens internally (you don't have to do anything)
[Diagram: tasks (Task 1, Task 2, …) being shipped to a slave.]
Spark Executor (Slave)
27
[Diagram: three cores (Core 1, Core 2, Core 3), each repeatedly cycling through Fetch Input → Execute Task → Write Output.]
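The per-core loop on this slide can be sketched with a shared task queue and three worker threads (a simulation of the idea, not Spark's actual executor code):

```python
import queue
import threading

# Each "core" repeatedly fetches a task, executes it, and writes its
# output, until the queue is drained.
task_queue = queue.Queue()
results = []
lock = threading.Lock()

def core_loop(core_id):
    while True:
        try:
            task = task_queue.get_nowait()   # Fetch Input
        except queue.Empty:
            return                           # no tasks left for this core
        output = task()                      # Execute Task
        with lock:
            results.append(output)           # Write Output

for n in range(7):                           # seven tasks to spread over cores
    task_queue.put(lambda n=n: n * n)

cores = [threading.Thread(target=core_loop, args=(i,)) for i in range(3)]
for c in cores:
    c.start()
for c in cores:
    c.join()

print(sorted(results))  # [0, 1, 4, 9, 16, 25, 36]
```

With 7 tasks and 3 cores, some cores process multiple tasks in sequence, which is exactly the repeated fetch/execute/write cycle the diagram shows.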