Transcript of "Data Engineering: How MapReduce Works" by Shivnath Babu (27 slides).

Page 1:

Data Engineering

How MapReduce Works

Shivnath Babu

Page 2:

Lifecycle of a MapReduce Job

Map function

Reduce function

Run this program as a MapReduce job
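The transcript does not preserve the slide's code boxes. As a minimal sketch of what such a Map/Reduce function pair looks like, here is the classic word count in Scala (my illustration; the slide's actual program is not recoverable from the transcript):

  object WordCount {
    // map: called once per input record (offset, line);
    // emits one (word, 1) pair per word in the line.
    def map(offset: Long, line: String): Seq[(String, Int)] =
      line.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

    // reduce: called once per distinct key with all of its values;
    // here it simply sums the counts for the word.
    def reduce(word: String, counts: Seq[Int]): (String, Int) =
      (word, counts.sum)
  }

The framework runs many map tasks in parallel over input splits, groups the emitted pairs by key during the shuffle, and then runs reduce tasks over the grouped values.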


Page 4:

Lifecycle of a MapReduce Job

[Diagram: a job timeline. Input Splits feed Map Wave 1 and then Map Wave 2; their output feeds Reduce Wave 1 and then Reduce Wave 2, laid out along a time axis.]

Page 5:

Job Submission

Page 6:

Initialization

Page 7:

Scheduling

Page 8:

Execution

Page 9:

Map Task

Page 10:

Reduce Tasks

Page 11:

Hadoop Distributed File-System (HDFS)

Page 12:

Lifecycle of a MapReduce Job

[Diagram: the same job timeline as Page 4: input splits, map waves, then reduce waves along a time axis.]

How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc., determined?
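In Hadoop these knobs come from the job configuration. A hedged sketch in Scala (the property names are real Hadoop 2.x keys; the values are illustrative):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.mapreduce.Job

  val conf = new Configuration()
  // Caps how much input one map task reads, which in turn
  // determines how many splits (and map tasks) there are.
  conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
               128L * 1024 * 1024)
  // Per-task container memory, in MB.
  conf.setInt("mapreduce.map.memory.mb", 2048)
  conf.setInt("mapreduce.reduce.memory.mb", 4096)
  val job = Job.getInstance(conf, "my-job") // job name is illustrative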

Page 13:

[Diagram: two job timelines compared. Top: Map Waves 1-3 feed a Shuffle into Reduce Waves 1 and 2. Bottom: the same Map Waves 1-3 feed Reduce Waves 1, 2, and 3.]

What if the number of reduce tasks is increased to 9?
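The reducer count is a per-job setting. A minimal sketch using Hadoop's real Job.setNumReduceTasks API (job construction as in the earlier sketch):

  val job = Job.getInstance(new Configuration(), "my-job")
  // With 9 reduce tasks, the reduces run in as many waves as
  // the cluster's reduce slots allow.
  job.setNumReduceTasks(9)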

Page 14:

Spark Programming Model


User (Developer) writes:

  sc = new SparkContext
  rDD = sc.textFile("hdfs://…")
  rDD.filter(…)
  rDD.cache
  rDD.count
  rDD.map

[Diagram: the Driver Program's SparkContext connects to a Cluster Manager, which schedules Tasks on Executors (each with a Cache) running on Worker Nodes; input data is read from HDFS Datanodes.]

Taken from http://www.slideshare.net/fabiofumarola1/11-from-hadoop-to-spark-12
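As a runnable counterpart to the fragment above (the app name and input path are placeholders), a minimal Scala driver might look like:

  import org.apache.spark.{SparkConf, SparkContext}

  object Driver {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("example"))
      val rdd = sc.textFile("hdfs://namenode:8020/path/to/input") // placeholder
        .filter(_.nonEmpty)
      rdd.cache()                  // keep in executor memory for reuse
      println(rdd.count())         // action: triggers execution
      val upper = rdd.map(_.toUpperCase) // transformation: lazy until an action
      println(upper.first())       // another action, served from the cached data
      sc.stop()
    }
  }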

Page 15:

Spark Programming Model


User (Developer) writes:

  sc = new SparkContext
  rDD = sc.textFile("hdfs://…")
  rDD.filter(…)
  rDD.cache
  rDD.count
  rDD.map

Driver Program

RDD (Resilient Distributed Dataset)

• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators
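A small sketch of these properties in the Spark shell (the numbers and partition count are illustrative):

  // Parallel data structure with controlled partitioning: 4 partitions.
  val nums = sc.parallelize(1 to 100, numSlices = 4)
  nums.partitions.length        // 4

  // Immutable: map returns a new RDD; nums itself never changes.
  val doubled = nums.map(_ * 2)

  // Explicitly in-memory: mark the derived RDD for caching.
  doubled.cache()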

Page 16:

RDD

Programming interface: the programmer can perform three types of operations (a sketch of each follows the list).

Transformations
• Create a new dataset from an existing one.
• Lazy in nature: they are executed only when some action is performed.
• Examples: map(func), filter(func), distinct()

Actions
• Return a value to the driver program, or export data to a storage system, after performing a computation.
• Examples: count(), reduce(func), collect(), take()

Persistence
• For caching datasets in memory for future operations.
• Option to store on disk, in RAM, or mixed (Storage Level).
• Examples: persist(), cache()
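A hedged sketch showing all three kinds of operation in one pipeline (the path and filter condition are illustrative):

  val lines = sc.textFile("hdfs://namenode:8020/logs/events.log") // placeholder

  // Transformations: lazy, nothing executes yet.
  val errors = lines.filter(_.contains("ERROR"))
  val codes  = errors.map(_.split(" ")(0)).distinct()

  // Persistence: mark errors for in-memory reuse across actions.
  errors.persist()

  // Actions: these trigger actual execution.
  val total = errors.count()
  val first = errors.take(10)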

Page 17:

How Spark works

• RDD: a parallel collection with partitions.
• A user application can create RDDs, transform them, and run actions.
• This results in a DAG (Directed Acyclic Graph) of operators.
• The DAG is compiled into stages.
• Each stage is executed as a collection of Tasks, one Task per Partition (see the lineage sketch below).
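Spark can print this DAG directly: toDebugString is a real RDD method, while the pipeline below is illustrative (it is the deck's own pagecounts example):

  val counts = sc.textFile("/wiki/pagecounts")
    .map(_.split("\t"))
    .map(r => (r(0), r(1).toInt))
    .reduceByKey(_ + _)

  // Prints the lineage; indentation levels mark shuffle (stage) boundaries.
  println(counts.toDebugString)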


Page 18:

Summary of Components

• Task: the fundamental unit of execution in Spark

• Stage: a collection of Tasks that run in parallel

• DAG: the logical graph of RDD operations

• RDD: a parallel dataset of objects with partitions


Page 19:

Example


sc.textFile("/wiki/pagecounts")

Lineage so far: textFile → RDD[String]

Page 20:

Example


sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))

Lineage so far: textFile → map, producing RDD[String] → RDD[Array[String]]

Page 21:

Example


sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))

Lineage so far: textFile → map → map, producing RDD[String] → RDD[Array[String]] → RDD[(String, Int)]

Page 22:

Example


sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))
  .reduceByKey(_ + _)

Lineage so far: textFile → map → map → reduceByKey, producing RDD[String] → RDD[Array[String]] → RDD[(String, Int)] → RDD[(String, Int)]

Page 23:

Example


sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))
  .reduceByKey(_ + _, 3)
  .collect()

Full lineage: textFile → map → map → reduceByKey → collect, ending as an Array[(String, Int)] at the driver; reduceByKey's second argument sets 3 output partitions.
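Put together as a self-contained program (the object name is mine; the pipeline is the deck's own):

  import org.apache.spark.{SparkConf, SparkContext}

  object PageCounts {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("pagecounts"))
      val counts = sc.textFile("/wiki/pagecounts")
        .map(line => line.split("\t"))
        .map(r => (r(0), r(1).toInt))
        .reduceByKey(_ + _, 3)   // 3 output partitions
        .collect()               // Array[(String, Int)] at the driver
      counts.take(5).foreach(println)
      sc.stop()
    }
  }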

Page 24:

Execution Plan

Stages are sequences of RDDs that don't have a shuffle in between.


[Diagram: textFile → map → map → reduceByKey → collect; the operators up to the shuffle form Stage 1, the final reduce and collect form Stage 2.]

Page 25:

Execution Plan


[Diagram: the same two-stage plan: textFile → map → map → reduceByKey → collect, split at the shuffle.]

Stage 1:
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data

Stage 2:
1. Read shuffle data
2. Final reduce
3. Send result to driver program

Page 26:

Stage Execution

• Create a task for each Partition in the input RDD

• Serialize the Task

• Schedule and ship Tasks to Slaves

And all this happens internally (you don’t have to do anything)


[Diagram: one serialized Task per partition (Task 1, Task 2, …), shipped out to the slaves.]

Page 27:

Spark Executor (Slave)


[Diagram: inside an executor, each core runs a pipeline of Fetch Input → Execute Task → Write Output, repeated task after task; Cores 1, 2, and 3 run these pipelines concurrently.]