Data Engineering
How MapReduce Works
Shivnath Babu
Lifecycle of a MapReduce Job
Map function
Reduce function
Run this program as a MapReduce job
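The map and reduce functions above are usually introduced with word count. As a rough sketch of what the framework does with them, here is a simulation in plain Python (no Hadoop involved; `map_fn`, `reduce_fn`, and `run_job` are illustrative names, not Hadoop API):

```python
from collections import defaultdict

# Map function: emit (word, 1) for every word in an input line.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

# Reduce function: sum all counts for one key.
def reduce_fn(key, values):
    return (key, sum(values))

# Simulate the framework: run map over all lines, group values by key
# (the shuffle), then run reduce once per key.
def run_job(lines):
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(run_job(["the quick fox", "the lazy dog"]))
# {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

The grouping step in the middle is what the real framework implements as the shuffle between map and reduce waves.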
Lifecycle of a MapReduce Job
[Timeline diagram: input splits feed Map Wave 1 and Map Wave 2; Reduce Wave 1 and Reduce Wave 2 follow along the time axis.]
Job Submission → Initialization → Scheduling → Execution
[Diagram: map tasks and reduce tasks read from and write to the Hadoop Distributed File-System (HDFS).]
Lifecycle of a MapReduce Job
How are the number of splits, the number of map and reduce tasks, memory allocation to tasks, etc., determined?
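For splits specifically, Hadoop's FileInputFormat derives a split size from the configured minimum, the configured maximum, and the HDFS block size, and roughly one map task is created per split (the number of reduce tasks is a separate job setting). A sketch of that calculation, with defaults shown as assumptions:

```python
import math

# Sketch of Hadoop's FileInputFormat split-size rule:
#   splitSize = max(minSize, min(maxSize, blockSize))
# Defaults below (minSize=1, maxSize=Long.MAX_VALUE, 128 MB blocks)
# are typical, but configurable per cluster and per job.
def split_size(min_size, max_size, block_size):
    return max(min_size, min(max_size, block_size))

def num_splits(file_length, min_size=1, max_size=2**63 - 1,
               block_size=128 * 1024 * 1024):
    # One split (and hence roughly one map task) per split_size bytes.
    return math.ceil(file_length / split_size(min_size, max_size, block_size))

# A 1 GB file with 128 MB blocks yields 8 splits / map tasks.
print(num_splits(1024 * 1024 * 1024))  # 8
```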
[Diagram: Map Waves 1-3 followed by a shuffle into Reduce Waves 1-2.]
What if the number of reduces is increased to 9?
[Diagram: the same map waves now feed three reduce waves (Reduce Waves 1-3).]
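Behind the shuffle, each map output key is routed to exactly one reduce task; with Hadoop's default HashPartitioner this is essentially hash(key) mod #reduces, so raising the reduce count from 2 to 9 only changes the modulus (and the number of reduce waves needed when tasks outnumber slots). A toy sketch, where Python's `hash` stands in for Java's `hashCode` (note it is randomized across Python runs):

```python
# Route a key to a reduce task, in the style of Hadoop's HashPartitioner.
def partition(key, num_reduces):
    return hash(key) % num_reduces

keys = ["apple", "banana", "cherry", "date"]
for n in (2, 9):
    # Same keys, different modulus: changing #reduces re-routes keys
    # but every key still lands on exactly one reduce task.
    print(n, {k: partition(k, n) for k in keys})
```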
Spark Programming Model
14
User (Developer) writes:

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map

[Diagram: the Driver Program (SparkContext) talks to a Cluster Manager, which schedules work onto Worker Nodes; each Worker Node runs an Executor with a Cache and a set of Tasks, reading data from HDFS Datanodes.]
Taken from http://www.slideshare.net/fabiofumarola1/11-from-hadoop-to-spark-12
Spark Programming Model
15
User (Developer) writes:

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map

RDD (Resilient Distributed Dataset)
• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators
RDD
16
Programming interface: the programmer can perform 3 types of operations.

Transformations
• Create a new dataset from an existing one.
• Lazy in nature: they are executed only when some action is performed.
• Examples: map(func), filter(func), distinct()

Actions
• Return a value to the driver program, or export data to a storage system, after performing a computation.
• Examples: count(), reduce(func), collect(), take()

Persistence
• For caching datasets in memory for future operations.
• Option to store on disk, in RAM, or mixed (Storage Level).
• Examples: persist(), cache()
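The lazy-transformation versus eager-action split can be mimicked with a small toy class (`MiniRDD` is hypothetical, for illustration only, not Spark API): transformations merely record an operator on an immutable source, and nothing executes until an action is called.

```python
# Toy illustration of lazy transformations vs eager actions.
class MiniRDD:
    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops          # recorded transformations, not yet run

    # Transformations: return a NEW MiniRDD, evaluate nothing (lazy).
    def map(self, f):
        return MiniRDD(self._data, self._ops + (("map", f),))

    def filter(self, p):
        return MiniRDD(self._data, self._ops + (("filter", p),))

    # Actions: run the recorded pipeline and return a value to the driver.
    def collect(self):
        out = list(self._data)
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):
        return len(self.collect())

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.count())      # 5 -> the pipeline runs only now
print(rdd.collect())    # [0, 4, 16, 36, 64]
```

Building `rdd` executes nothing; each transformation returns a new object, mirroring RDD immutability.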
How Spark works
17
• RDD: a parallel collection with partitions
• A user application can create RDDs, transform them, and run actions.
• This results in a DAG (Directed Acyclic Graph) of operators.
• The DAG is compiled into stages.
• Each stage is executed as a collection of tasks (one task for each partition).
Summary of Components
18
• Task: the fundamental unit of execution in Spark
• Stage: a collection of tasks that run in parallel
• DAG: the logical graph of RDD operations
• RDD: a parallel dataset of objects with partitions
Example
19
sc.textFile("/wiki/pagecounts")
[DAG so far: textFile → RDD[String]]
Example
20
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
[DAG so far: textFile → RDD[String], map → RDD[List[String]]]
Example
21
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(R => (R(0), R(1).toInt))
[DAG so far: textFile → RDD[String], map → RDD[List[String]], map → RDD[(String, Int)]]
Example
22
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(R => (R(0), R(1).toInt))
  .reduceByKey(_ + _)
[DAG so far: textFile → RDD[String], map → RDD[List[String]], map → RDD[(String, Int)], reduceByKey → RDD[(String, Int)]]
Example
23
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(R => (R(0), R(1).toInt))
  .reduceByKey(_ + _, 3)
  .collect()
[DAG: textFile → RDD[String], map → RDD[List[String]], map → RDD[(String, Int)], reduceByKey → RDD[(String, Int)], collect → Array[(String, Int)]]
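To see what each RDD in the pipeline holds, the same steps can be simulated over plain Python lists (the input lines below are made up; no Spark required):

```python
from collections import defaultdict

# Stand-in for the lines read from /wiki/pagecounts.
lines = ["en\t12", "fr\t3", "en\t5", "de\t7"]

records = [line.split("\t") for line in lines]   # like map(line => line.split("\t"))
pairs = [(r[0], int(r[1])) for r in records]     # like map(R => (R(0), R(1).toInt))

sums = defaultdict(int)                          # like reduceByKey(_ + _)
for page, count in pairs:
    sums[page] += count

result = sorted(sums.items())                    # like collect() on the driver
print(result)  # [('de', 7), ('en', 17), ('fr', 3)]
```

Each intermediate list corresponds to one of the RDD types in the DAG above; the per-key summation is what `reduceByKey` distributes across 3 partitions in the real job.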
Execution Plan
24
Stages are sequences of RDDs that don't have a shuffle in between.
[Diagram: textFile → map → map → reduceByKey → collect, split into Stage 1 and Stage 2 at the shuffle.]
Execution Plan
25
[Diagram: textFile → map → map → reduceByKey (Stage 1) | collect (Stage 2)]
Stage 1:
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data
Stage 2:
1. Read shuffle data
2. Final reduce
3. Send result to driver program
Stage Execution
26
• Create a task for each partition in the input RDD
• Serialize the task
• Schedule and ship tasks to slaves
And all this happens internally (you don't have to do anything)
[Diagram: tasks (Task 1, Task 2, …) being shipped to a slave.]
Spark Executor (Slave)
27
[Diagram: three cores (Core 1, Core 2, Core 3), each repeatedly cycling through Fetch Input → Execute Task → Write Output.]
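The per-core loop on this slide can be sketched with a shared task queue and three worker threads (a simulation of the idea, not Spark's actual executor code):

```python
import queue
import threading

# Each "core" repeatedly fetches a task, executes it, and writes its
# output, until the queue is drained.
task_queue = queue.Queue()
results = []
lock = threading.Lock()

def core_loop(core_id):
    while True:
        try:
            task = task_queue.get_nowait()   # Fetch Input
        except queue.Empty:
            return                           # no tasks left for this core
        output = task()                      # Execute Task
        with lock:
            results.append(output)           # Write Output

for n in range(7):                           # seven tasks to spread over cores
    task_queue.put(lambda n=n: n * n)

cores = [threading.Thread(target=core_loop, args=(i,)) for i in range(3)]
for c in cores:
    c.start()
for c in cores:
    c.join()

print(sorted(results))  # [0, 1, 4, 9, 16, 25, 36]
```

With 7 tasks and 3 cores, some cores process multiple tasks in sequence, which is exactly the repeated fetch/execute/write cycle the diagram shows.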