How does the MapReduce part of Hadoop work? (A system's view)

Hadoop MapReduce - System's View
By Niketan Pansare ([email protected]), Rice University

Transcript of "How does the MapReduce part of Hadoop work? (A system's view)":

Job Submission at the Client's Side

(Nodes involved: the client node, the job tracker node, and the task tracker nodes.)

On the client node, the client program creates a Job object and calls job.submit(). The Job delegates to a JobClient, which does the real work in jobClient.submitJobInternal().

Step 1: Get a new job ID. Through the client-side stub to the JobTracker, the JobClient calls jobSubmissionClient.getNewJobID() - an RPC call to the JobTracker node.
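For concreteness, here is a minimal client program (a sketch, not from the original slides) that triggers exactly this path: new Job -> job.submit()/waitForCompletion() -> JobClient.submitJobInternal(). It uses the identity Mapper/Reducer as placeholder logic; class and property names follow the Hadoop 1.x API that this deck describes.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "submit-example");      // the "client pgm" creates the Job
        job.setJarByClass(SubmitExample.class);         // this jar is what gets copied to HDFS
        job.setMapperClass(Mapper.class);               // identity mapper (placeholder logic)
        job.setReducerClass(Reducer.class);             // identity reducer (placeholder logic)
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // job.submit() kicks off everything described in this section;
        // waitForCompletion() additionally polls the JobTracker for progress.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }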

Step 2: Check the job's output specification: jobConf.getOutputFormat().checkOutputSpecs(). (For the default FileOutputFormat this fails fast if, for example, the output directory already exists.)

Step 3: Copy the job resources.

To copy the job resources, the JobClient first asks JobSubmissionFiles for the destination paths:
1. Job staging area (getStagingArea())
2. Job submission area
3. Job config file path (getJobConfPath())
4. Job jar file path (getJobJar())
5. Information about splits: (a) split meta file (getJobSplitMetaFile()), (b) split file (getJobSplitFile())

Copy job resources (jar): the job's jar file is copied to the shared file system (HDFS) with a replication factor given by mapred.submit.replication (default: 10), so that many TaskTrackers can fetch it without hammering a few DataNodes.

Copy job resources (splits/config) to the shared file system (HDFS), under the JobSubmissionFiles paths above:
a. Compute the splits: jobConf.getInputFormat().getSplits()
b. Sort the splits by size, biggest first. (To randomize the order instead, modify the Arrays.sort() call in writeSplits().)
c. Copy the split "meta" file to the JobTracker's path; its entries are JobSplit.SplitMetaInfo records.
d. Copy the split file itself to HDFS (replication = 10); each task later locates its own split through a JobSplit.TaskSplitIndex.
e. Copy the job config file to the JobTracker's path.
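Step (a) can be reproduced in isolation. The sketch below (not part of the slides; the input path is hypothetical) uses the old mapred API the deck refers to; getSplits() typically cuts each input file along HDFS block boundaries and treats the numSplits argument only as a hint.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    public class ComputeSplitsSketch {
      public static void main(String[] args) throws Exception {
        JobConf jobConf = new JobConf();
        FileInputFormat.setInputPaths(jobConf, new Path("/data/input")); // hypothetical input dir
        TextInputFormat inputFormat = new TextInputFormat();
        inputFormat.configure(jobConf);
        InputSplit[] splits = inputFormat.getSplits(jobConf, 4); // 4 is only a hint
        // (the submission code then sorts these by length, biggest first)
        for (InputSplit split : splits) {
          System.out.println(split + "  length=" + split.getLength());
        }
      }
    }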

Step 4: After copying the job resources (jar, split files, config), the client stub issues the RPC call submitJob() to the JobTracker.

Done with job submission at the client side... Now let's look at the JobTracker's side.

Job Submission at the JobTracker Node

The submitJob() RPC arrives at the JobTracker on the job tracker node. The JobTracker reads the job config file and creates a JobInProgress object (job) for the newly submitted job.

The JobTracker then calls job.initTasks() on the JobInProgress, which builds the job's bookkeeping structures:

- createSplits(): reads the split meta file (JobSplit.SplitMetaInfo) and returns JobSplit.TaskSplitMetaInfo[] splits.
- TaskInProgress[] maps: one map task per split.
- Map<Node, List<TIP>> nonRunningMapCache: for each node that holds a split's data, the list of not-yet-running map TaskInProgress (TIP) objects that could run locally there.
- TaskInProgress[] reduces: the number of reduce tasks comes from mapred.reduce.tasks.
- Set<TaskInProgress> nonRunningReduces
- Other bookkeeping structures: runningMapCache, nonLocalMaps, failedMaps, ..., plus a JobProfile and a JobStatus.
- TaskInProgress[2] setup and TaskInProgress[2] cleanup: run by a TaskTracker to perform setup and cleanup; the 2 entries are one for a map task and one for a reduce task.
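To make the shape of these structures concrete, here is a simplified, self-contained model (not the real o.a.h.mapred.JobInProgress code; split locations and the reduce count are made up) of what initTasks() builds: one map TIP per split, mapred.reduce.tasks reduce TIPs, and the node-to-pending-maps index later used for locality-aware scheduling.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class InitTasksSketch {
      static class TIP {
        final int id; final boolean isMap;
        TIP(int id, boolean isMap) { this.id = id; this.isMap = isMap; }
      }

      public static void main(String[] args) {
        // Each row lists the nodes holding one split's data (hypothetical).
        String[][] splitLocations = { {"node1", "node2"}, {"node2", "node3"}, {"node1"} };
        int numReduces = 2;                              // would come from mapred.reduce.tasks

        List<TIP> maps = new ArrayList<>();
        Map<String, List<TIP>> nonRunningMapCache = new HashMap<>();
        for (int i = 0; i < splitLocations.length; i++) { // 1 map per split
          TIP tip = new TIP(i, true);
          maps.add(tip);
          for (String node : splitLocations[i]) {         // index by data-local node
            nonRunningMapCache.computeIfAbsent(node, n -> new ArrayList<>()).add(tip);
          }
        }
        List<TIP> reduces = new ArrayList<>();
        for (int i = 0; i < numReduces; i++) reduces.add(new TIP(i, false));

        System.out.println("maps=" + maps.size() + " reduces=" + reduces.size()
            + " nodesWithLocalWork=" + nonRunningMapCache.keySet());
      }
    }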

What code does a TaskInProgress run? It is user-defined: the map and reduce TIPs run the job's mapper and reducer, while for the setup and cleanup TIPs the code is specified by mapred.output.committer.class (default: FileOutputCommitter).
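As an illustration (not from the slides), that committer can be swapped by subclassing the default. LoggingCommitter below is a hypothetical example; FileOutputCommitter itself creates the output directory in setupJob() and promotes each task's temporary output in commitTask().

    import java.io.IOException;
    import org.apache.hadoop.mapred.FileOutputCommitter;
    import org.apache.hadoop.mapred.JobContext;
    import org.apache.hadoop.mapred.TaskAttemptContext;

    public class LoggingCommitter extends FileOutputCommitter {
      @Override
      public void setupJob(JobContext context) throws IOException {
        System.out.println("job setup task running");   // executed by the setup TIP
        super.setupJob(context);                         // creates the output directory
      }
      @Override
      public void commitTask(TaskAttemptContext context) throws IOException {
        System.out.println("committing " + context.getTaskAttemptID());
        super.commitTask(context);                       // promotes the task's temporary output
      }
    }
    // Wired in with: jobConf.setOutputCommitter(LoggingCommitter.class);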

Done initializing: at this point the JobInProgress and all of the structures above are in place.

Back in submitJob(), the JobTracker consults its QueueManager: does the target queue exist, and does the user have permission to submit to it? If so, the JobTracker registers the job via addJob() and notifies the listeners of that queue (sketched below). Done submitting the job!
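A simplified, self-contained model of those last steps (not the real JobTracker code; the queue names, ACL table, and job ID are made up):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class SubmitJobTailSketch {
      interface JobInProgressListener { void jobAdded(String jobId); }

      static final Set<String> QUEUES = Set.of("default", "research");   // QueueManager's queues
      static final Map<String, Set<String>> SUBMIT_ACL =
          Map.of("default", Set.of("*"), "research", Set.of("niketan"));

      static void submitJobTail(String jobId, String queue, String user,
                                Map<String, String> jobs, List<JobInProgressListener> listeners) {
        if (!QUEUES.contains(queue))
          throw new IllegalArgumentException("Queue does not exist: " + queue);
        Set<String> allowed = SUBMIT_ACL.get(queue);
        if (!allowed.contains("*") && !allowed.contains(user))
          throw new SecurityException(user + " may not submit to " + queue);
        jobs.put(jobId, queue);                                          // addJob()
        for (JobInProgressListener l : listeners) l.jobAdded(jobId);     // notify listeners of the queue
      }

      public static void main(String[] args) {
        submitJobTail("job_201303270001_0001", "default", "niketan",
            new HashMap<>(), List.of(id -> System.out.println("scheduler saw " + id)));
      }
    }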

The TaskScheduler Class
• Used by the JobTracker to schedule Tasks on TaskTrackers.
• Uses one or more JobInProgressListeners to receive notifications about jobs.
• Uses ClusterStatus to get information about the state of the cluster.
• Methods:
  • start(), terminate(), refresh()
  • Collection<JobInProgress> getJobs(String queueName)
  • List<Task> assignTasks(TaskTracker)
• Implementations:
  • Specified by mapred.jobtracker.taskScheduler
  • Default: the FIFO scheduler (o.a.h.mapred.JobQueueTaskScheduler)
    - Multiple queues, each with a different priority (VERY_HIGH, HIGH, ...)
    - The user specifies the job's priority via mapred.job.priority
    - Logic: first select the queue with the highest priority, then FIFO within that queue
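The two knobs mentioned above can be set as follows (a sketch using the Hadoop 1.x property names; the values are illustrative):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobPriority;

    public class SchedulerConfigSketch {
      public static void main(String[] args) {
        // Cluster side (jobtracker): which TaskScheduler implementation to load.
        JobConf clusterConf = new JobConf();
        clusterConf.set("mapred.jobtracker.taskScheduler",
            "org.apache.hadoop.mapred.JobQueueTaskScheduler"); // the FIFO default
        // Job side: the priority the FIFO scheduler sorts on within a queue.
        JobConf jobConf = new JobConf();
        jobConf.setJobPriority(JobPriority.HIGH);              // i.e. mapred.job.priority=HIGH
        System.out.println(jobConf.get("mapred.job.priority"));
      }
    }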

Task Scheduling

On the job tracker node, the JobQueueTaskScheduler registers a JobInProgressListener (JIPListener) with the JobTracker. When a JobInProgress is registered through addJob(), the scheduler learns about it via the callback jobAdded(JIP).

When a TaskTracker reports in, the scheduler's List<Task> assignTasks(TaskTracker) is called.

1. Calculate availableMapSlots for that tracker, using the TaskTrackerStatus, the ClusterStatus, and the JIPListener's view of the JobInProgress (JIP) objects:

   availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps
                     = min( ceil(mapLoadFactor * trackerMapCapacity), trackerMapCapacity ) - trackerRunningMaps

   where
     trackerMapCapacity = taskTrackerStatus.getMaxMapSlots()
     trackerRunningMaps = taskTrackerStatus.countMapTasks()
     mapLoadFactor      = [ sum over all jobs of (the JIP's numMapTasks - finishedMapTasks) ] / clusterStatus.getMaxMapTasks()
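Plugging in some assumed numbers makes the formula concrete (the values below are invented for illustration):

    public class MapSlotsSketch {
      public static void main(String[] args) {
        int trackerMapCapacity = 4;        // taskTrackerStatus.getMaxMapSlots()
        int trackerRunningMaps = 1;        // taskTrackerStatus.countMapTasks()
        int clusterMaxMapTasks = 100;      // clusterStatus.getMaxMapTasks()
        int remainingMapTasksAllJobs = 60; // sum over jobs of (numMapTasks - finishedMapTasks)

        double mapLoadFactor = (double) remainingMapTasksAllJobs / clusterMaxMapTasks; // 0.6
        int trackerCurrentMapCapacity =
            Math.min((int) Math.ceil(mapLoadFactor * trackerMapCapacity), trackerMapCapacity); // min(3, 4) = 3
        int availableMapSlots = trackerCurrentMapCapacity - trackerRunningMaps;                // 3 - 1 = 2
        System.out.println("availableMapSlots = " + availableMapSlots);
      }
    }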

2. For each available map slot, walk the job queue and pick a task:

   for (i = 1 to availableMapSlots) {
     for (JIP job : JIPListener.getJobQueue()) {
       ...
     }
   }

   The JIPListener's getJobQueue() is backed by a Map<JobSchedulingInfo, JIP> ordered by the FIFO_JOB_QUEUE comparator, so jobs in the higher-priority queue are processed first (and FIFO within a queue).

3. Inside the loop, ask the job for a map task: Task t = job.findNewMapTask(). The selection order is:
   - Return a task with the most failures that has not already failed on the given machine, ignoring locality (from the JIP's failedMaps).
   - Otherwise, return a non-running task using locality info (from the JIP's nonRunningMapCache).
   - Otherwise, return a speculative task.

4. Add the chosen task: assignedTasks.add(t). (The scheduler also makes sure there are free slots left in the cluster for speculative tasks before filling every slot.) The same procedure is then repeated for the reduce slots, and finally assignTasks() returns assignedTasks.
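Putting steps 1-4 together, here is a simplified, self-contained model of the assignTasks() loop (not the real JobQueueTaskScheduler code; the tracker name and task ID are made up):

    import java.util.ArrayList;
    import java.util.List;

    public class AssignTasksSketch {
      interface JobInProgressModel { String findNewMapTask(String trackerHost); } // task id or null

      static List<String> assignTasks(String trackerHost, int availableMapSlots,
                                      List<JobInProgressModel> jobQueue) {
        List<String> assignedTasks = new ArrayList<>();
        for (int slot = 0; slot < availableMapSlots; slot++) {
          for (JobInProgressModel job : jobQueue) {        // highest-priority queue first, FIFO within
            String task = job.findNewMapTask(trackerHost); // failed > non-running (local) > speculative
            if (task != null) { assignedTasks.add(task); break; }
          }
        }
        // ...the same loop runs again for reduce slots (omitted here)...
        return assignedTasks;
      }

      public static void main(String[] args) {
        JobInProgressModel job = host -> "attempt_0001_m_000001_0"; // hypothetical single job
        System.out.println(assignTasks("tt-node-7", 2, List.of(job)));
      }
    }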

TaskScheduler implementations, continued. Besides the default FIFO scheduler, mapred.jobtracker.taskScheduler can point to:
• Facebook's FairScheduler
• Yahoo!'s CapacityScheduler

The default FIFO scheduler:
- Does not support preemption.
- Is a poor fit for a production cluster (a high priority can be misused).

Facebook's FairScheduler
Goal: provide fast response times for small jobs and guaranteed service levels for production jobs.
Jobs are grouped into pools, and each pool can be given a minimum share of slots. Worked example from the slides: the cluster has 100 slots to allocate, and two pools have min shares of 30 and 40 slots; the slides allocate 40, 30, and 30 slots across the pools, with one pool's 30 slots split 15/15 between its two jobs.
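The slides' numbers fall out of a max-min style allocation: every pool is guaranteed its min share, and whatever is left is shared as evenly as possible. A toy illustration (my simplification, not the real FairScheduler code):

    import java.util.Arrays;

    public class FairShareSketch {
      public static void main(String[] args) {
        double totalSlots = 100;
        double[] minShare = {30, 40, 0};            // pool A, pool B, pool C (no min share)
        double[] alloc = new double[minShare.length];
        boolean[] pinned = new boolean[minShare.length];
        double remaining = totalSlots;
        boolean done = false;
        while (!done) {
          int free = 0;
          for (boolean p : pinned) if (!p) free++;
          double equalCut = remaining / free;       // even split of what is left
          boolean pinnedSomething = false;
          for (int i = 0; i < minShare.length; i++) {
            if (!pinned[i] && minShare[i] > equalCut) {
              alloc[i] = minShare[i];               // this pool's min share beats the even split
              pinned[i] = true;
              remaining -= minShare[i];
              pinnedSomething = true;
            }
          }
          if (!pinnedSomething) {                   // everyone left is happy with the even split
            for (int i = 0; i < minShare.length; i++) if (!pinned[i]) alloc[i] = equalCut;
            done = true;
          }
        }
        System.out.println(Arrays.toString(alloc)); // [30.0, 40.0, 30.0]
      }
    }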

FairScheduler, additional features:
- Job weights for unequal sharing (based on priority or size)
- Limits on the number of running jobs per user and per pool

FairScheduler usage:
- cp build/contrib/fairscheduler/*.jar lib
- set mapred.jobtracker.taskScheduler to o.a.h.mapred.FairScheduler
- set mapred.fairscheduler.allocation.file to /path/pool.xml
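For reference, an allocation file can look roughly like this (a sketch based on the Hadoop 1.x fair-scheduler documentation, not on the slides; pool and user names are made up, and the min shares are expressed separately for map and reduce slots - check the element names against your Hadoop version):

    <?xml version="1.0"?>
    <allocations>
      <pool name="production">
        <minMaps>40</minMaps>          <!-- guaranteed map slots -->
        <minReduces>20</minReduces>    <!-- guaranteed reduce slots -->
        <weight>2.0</weight>           <!-- twice the share of a default pool -->
      </pool>
      <pool name="research">
        <minMaps>30</minMaps>
        <minReduces>10</minReduces>
        <maxRunningJobs>5</maxRunningJobs>
      </pool>
      <user name="niketan">
        <maxRunningJobs>3</maxRunningJobs>
      </user>
    </allocations>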

Yahoo!'s CapacityScheduler
Similar to the FairScheduler, but with queues instead of pools. Each queue is given a share (a percentage) of the cluster, and a queue can hold jobs of different priorities. Scheduling within each queue is FIFO, which makes the overall scheduling more deterministic than the FairScheduler's. Also, unlike the other two schedulers, it provides support for memory-based scheduling and preemption.

Task Creation

(Job tracker node: JobTracker + TaskScheduler. Task tracker node: TaskTracker.)

The heartbeat protocol between the TaskTracker and the JobTracker:
- Periodic.
- Indicates the health of the TaskTracker, and so doubles as failure detection.
- Implemented as a remote procedure call.
- Directives are piggybacked on the response: launch a task, perform cleanup/commit.

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTrackerjobClient

this.jobClient = (InterTrackerProtocol) UserGroupInformation.getLoginUser().doAs( new PrivilegedExceptionAction<Object>() { public Object run() throws IOException { return RPC.waitForProxy(InterTrackerProtocol.class, InterTrackerProtocol.versionID, jobTrackAddr, fConf); } });

TaskScheduler

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTrackerjobClient

jobClient.heartbeat(…);

this.jobClient = (InterTrackerProtocol) UserGroupInformation.getLoginUser().doAs( new PrivilegedExceptionAction<Object>() { public Object run() throws IOException { return RPC.waitForProxy(InterTrackerProtocol.class, InterTrackerProtocol.versionID, jobTrackAddr, fConf); } });

TaskScheduler

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTrackerjobClient

jobClient.heartbeat(…);

this.jobClient = (InterTrackerProtocol) UserGroupInformation.getLoginUser().doAs( new PrivilegedExceptionAction<Object>() { public Object run() throws IOException { return RPC.waitForProxy(InterTrackerProtocol.class, InterTrackerProtocol.versionID, jobTrackAddr, fConf); } });

TaskScheduler

List<Task> assignTasks(TaskTracker)

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTrackerjobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

this.jobClient = (InterTrackerProtocol) UserGroupInformation.getLoginUser().doAs( new PrivilegedExceptionAction<Object>() { public Object run() throws IOException { return RPC.waitForProxy(InterTrackerProtocol.class, InterTrackerProtocol.versionID, jobTrackAddr, fConf); } });

TaskScheduler

List<Task> assignTasks(TaskTracker)

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

List<Task> assignTasks(TaskTracker)

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

void run() { offerService();}

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

void run() { offerService();}

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() {

void run() { offerService();}

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) {

void run() { offerService();}

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now);

void run() { offerService();}

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions();

void run() { offerService();}

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction

void run() { offerService();}

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive

void run() { offerService();}

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive markUnresponsiveTasks();

void run() { offerService();}

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive markUnresponsiveTasks(); killOverflowingTasks(); // if low disk space: reduce first, then least progress

void run() { offerService();}

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive markUnresponsiveTasks(); killOverflowingTasks(); // if low disk space: reduce first, then least progress }}

void run() { offerService();}

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive markUnresponsiveTasks(); killOverflowingTasks(); // if low disk space: reduce first, then least progress }}

void run() { offerService();}

Heartbeat protocol:- Periodic- Indicate health of TaskTracker

- Failure detection- Remote Procedure Call- Piggyback directives

- Launch a task- Perform cleanup/commit

Wednesday, March 27, 13
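
To make the heartbeat exchange above concrete, here is a small self-contained sketch of the same pattern: a periodic RPC that reports health and carries directives back in the response. The types below (SimpleHeartbeatClient, Directive, Coordinator) are invented for illustration and are not Hadoop APIs; the real logic lives in TaskTracker.offerService() and the JobTracker's heartbeat() implementation.

import java.util.List;

// Invented types for illustration -- not Hadoop classes.
public class SimpleHeartbeatClient {
  enum DirectiveType { LAUNCH_TASK, COMMIT_TASK, KILL_TASK, REINIT_TRACKER }

  static class Directive {
    final DirectiveType type;
    Directive(DirectiveType type) { this.type = type; }
  }

  // Stand-in for the RPC stub to the JobTracker.
  interface Coordinator {
    List<Directive> heartbeat(String trackerName, boolean askForNewTask);
  }

  private volatile boolean running = true;

  void offerService(Coordinator coordinator, long intervalMs) throws InterruptedException {
    while (running) {
      // 1. Report health and (optionally) ask for work; directives ride back
      //    on the response ("piggyback directives").
      List<Directive> actions = coordinator.heartbeat("tracker_host:50060", true);

      // 2. Dispatch each directive, mirroring LaunchTaskAction / CommitTaskAction.
      for (Directive d : actions) {
        switch (d.type) {
          case LAUNCH_TASK:    /* hand to a task launcher */     break;
          case COMMIT_TASK:    /* promote task output */         break;
          case KILL_TASK:      /* tear the task down */          break;
          case REINIT_TRACKER: /* reinitialize local state */    break;
        }
      }
      Thread.sleep(intervalMs); // 3. Periodic: wait before the next heartbeat.
    }
  }
}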

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive markUnresponsiveTasks(); killOverflowingTasks(); // if low disk space: reduce first, then least progress }}

void run() { offerService();}

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive markUnresponsiveTasks(); killOverflowingTasks(); // if low disk space: reduce first, then least progress }}

void run() { offerService();}

TaskTracker uses 2 internal

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive markUnresponsiveTasks(); killOverflowingTasks(); // if low disk space: reduce first, then least progress }}

void run() { offerService();}

TaskTracker uses 2 internal classes:

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive markUnresponsiveTasks(); killOverflowingTasks(); // if low disk space: reduce first, then least progress }}

void run() { offerService();}

TaskTracker uses 2 internal classes: - TaskLauncher

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive markUnresponsiveTasks(); killOverflowingTasks(); // if low disk space: reduce first, then least progress }}

void run() { offerService();}

TaskTracker uses 2 internal classes: - TaskLauncher

mapLauncher,reduceLauncher

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive markUnresponsiveTasks(); killOverflowingTasks(); // if low disk space: reduce first, then least progress }}

void run() { offerService();}

TaskTracker uses 2 internal classes: - TaskLauncher

mapLauncher,reduceLauncher- TaskInProgress’s launchTask()

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive markUnresponsiveTasks(); killOverflowingTasks(); // if low disk space: reduce first, then least progress }}

void run() { offerService();}

TaskTracker uses 2 internal classes: - TaskLauncher

mapLauncher,reduceLauncher- TaskInProgress’s launchTask()

Calls TaskRunner

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

HeartbeatResponse heartbeatResponse = jobClient.heartbeat(…);

List<Task> assignTasks(TaskTracker)

offerService() { while(is task tracker running flags) { HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); TaskTrackerAction[] actions = heartbeatResponse.getActions(); // type: LaunchTaskAction, CommitTaskAction // or explicit cleanup directive markUnresponsiveTasks(); killOverflowingTasks(); // if low disk space: reduce first, then least progress }}

void run() { offerService();}

TaskTracker uses 2 internal classes: - TaskLauncher

mapLauncher,reduceLauncher- TaskInProgress’s launchTask()

Calls TaskRunner

TaskRunner

start()

Wednesday, March 27, 13
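
The TaskLauncher described above is essentially a producer/consumer: the heartbeat path enqueues launch actions, and the mapLauncher/reduceLauncher threads dequeue them and start tasks when slots are free. Below is a minimal generic sketch of that pattern; the class and method names are invented for illustration, not the actual TaskTracker.TaskLauncher code.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustration of the launcher pattern only; not the Hadoop implementation.
public class LauncherSketch extends Thread {
  static class WorkItem {
    final String taskId;
    WorkItem(String id) { this.taskId = id; }
  }

  private final BlockingQueue<WorkItem> queue = new LinkedBlockingQueue<WorkItem>();

  // Called from the heartbeat path when a launch directive arrives.
  public void addToTaskQueue(WorkItem item) { queue.add(item); }

  @Override
  public void run() {
    try {
      while (!isInterrupted()) {
        WorkItem item = queue.take(); // block until a task is assigned
        // Here the real code would check free slots, localize the job, and
        // finally call TaskInProgress.launchTask(), which creates a TaskRunner.
        System.out.println("launching " + item.taskId);
      }
    } catch (InterruptedException e) {
      // shutting down
    }
  }
}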

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

List<Task> assignTasks(TaskTracker)

void run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

List<Task> assignTasks(TaskTracker)

void run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

List<Task> assignTasks(TaskTracker)

void run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using class JvmManager.

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

List<Task> assignTasks(TaskTracker)

void run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

List<Task> assignTasks(TaskTracker)

void run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

List<Task> assignTasks(TaskTracker)

void run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

List<Task> assignTasks(TaskTracker)

void run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

Wednesday, March 27, 13

Task creation

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

List<Task> assignTasks(TaskTracker)

void run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

Wednesday, March 27, 13
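
A short job-side sketch of the child-JVM knobs named on this slide; property names follow the slide (with the heap/GC options under mapred.child.java.opts), and the example values are illustrative rather than recommendations.

import org.apache.hadoop.mapred.JobConf;

public class ChildJvmConfigSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    // Heap / GC options for each task's child JVM (default: -Xmx200m).
    conf.set("mapred.child.java.opts", "-Xmx512m -XX:+UseParallelGC");

    // Virtual-memory limit for the child JVM and any processes it spawns,
    // e.g. Hadoop Streaming binaries (value assumed to be in kilobytes here).
    conf.set("mapred.child.ulimit", "1048576");

    // Reuse the JVM for multiple short-lived tasks of the same job
    // (-1 means "no limit"; the default is 1, i.e. one task per JVM).
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
  }
}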

Task creation in little more detail

Job tracker Node Task tracker Node

JobTracker TaskTracker

TaskScheduler

jobClient

List<Task> assignTasks(TaskTracker)

void run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

- TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController)

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

- TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController)- Creates directories for task (attempt, working, log)

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

- TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController)- Creates directories for task (attempt, working, log)- Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder.

Wednesday, March 27, 13
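
Like the scheduler, the TaskController is chosen on the tasktracker side; the sketch below only makes the property and the two implementations named on the slide concrete. The fully-qualified class names are assumptions (both controllers normally live in org.apache.hadoop.mapred).

import org.apache.hadoop.conf.Configuration;

public class TaskControllerConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();

    // Default: runs child JVMs as the tasktracker user.
    conf.set("mapred.task.tracker.task-controller",
             "org.apache.hadoop.mapred.DefaultTaskController");

    // Alternative (secure clusters): a setuid helper runs tasks as the job owner.
    // conf.set("mapred.task.tracker.task-controller",
    //          "org.apache.hadoop.mapred.LinuxTaskController");
  }
}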

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

- TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController)- Creates directories for task (attempt, working, log)- Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder.

Note, args for JVM already set by TaskRunner’s getJVMArgs(...)

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

- TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController)- Creates directories for task (attempt, working, log)- Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder.

Note, args for JVM already set by TaskRunner’s getJVMArgs(...)- Default main class: Child.java

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

- TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController)- Creates directories for task (attempt, working, log)- Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder.

Note, args for JVM already set by TaskRunner’s getJVMArgs(...)- Default main class: Child.java

Different JVM

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

- TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController)- Creates directories for task (attempt, working, log)- Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder.

Note, args for JVM already set by TaskRunner’s getJVMArgs(...)- Default main class: Child.java

Different JVM

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

- TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController)- Creates directories for task (attempt, working, log)- Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder.

Note, args for JVM already set by TaskRunner’s getJVMArgs(...)- Default main class: Child.java

Different JVM

Childvoid main(..) { .... }

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

- TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController)- Creates directories for task (attempt, working, log)- Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder.

Note, args for JVM already set by TaskRunner’s getJVMArgs(...)- Default main class: Child.java

Different JVM

umbilicalChildvoid main(..) { .... }

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

- Launches a new “child” JVM per task using the JvmManager class. - Why? So that a bug in map/reduce code doesn't affect the TaskTracker.

- Builds child JVM options using the property mapred.child.java.opts (heap size (max/initial), garbage collection options). Default: -Xmx200m

- To control additional processes spawned by the child JVM (e.g. Hadoop Streaming), use the property mapred.child.ulimit (virtual-memory limit)

- For short-lived tasks, reuse JVMs using mapred.job.reuse.jvm.num.tasks (default 1)

- Tasks within a given JVM run sequentially; tasks across JVMs run in parallel.

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

- TaskController pluggable through mapred.task.tracker.task-controller (DefaultTaskController or LinuxTaskController)- Creates directories for task (attempt, working, log)- Pass JVM args and OS specific manipulations to TaskLog and then to o.a.h.util.Shell, which invokes JVM through java’s ProcessBuilder.

Note, args for JVM already set by TaskRunner’s getJVMArgs(...)- Default main class: Child.java

Different JVM

umbilicalChildvoid main(..) { .... }

MapTask or Reduce Taskrun(job, umbilical) {

}

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

Different JVM

umbilicalChildvoid main(..) { .... }

MapTask or Reduce Taskrun(job, umbilical) {

}

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

Different JVM

umbilicalChildvoid main(..) { .... }

MapTask or Reduce Taskrun(job, umbilical) {

}

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

Different JVM

umbilicalChildvoid main(..) { .... }

MapTask or Reduce Taskrun(job, umbilical) {

}

TaskReporter

- Create TaskReporter that also uses umbilical object.

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

Different JVM

umbilicalChildvoid main(..) { .... }

MapTask or Reduce Taskrun(job, umbilical) {

}

TaskReporter

- Create TaskReporter that also uses umbilical object.- Check if it is job/task setup/cleanup task.

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

Different JVM

umbilicalChildvoid main(..) { .... }

MapTask or Reduce Taskrun(job, umbilical) {

}

TaskReporter

- Create TaskReporter that also uses umbilical object.- Check if it is job/task setup/cleanup task.

- If so, run their respective method and return.

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

Different JVM

umbilicalChildvoid main(..) { .... }

MapTask or Reduce Taskrun(job, umbilical) {

}

TaskReporter

- Create TaskReporter that also uses umbilical object.- Check if it is job/task setup/cleanup task.

- If so, run their respective method and return.- Else, do Map/Reduce specific actions !!!

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

Different JVM

umbilicalChildvoid main(..) { .... }

MapTask or Reduce Taskrun(job, umbilical) {

}

TaskReporter

- Create TaskReporter that also uses umbilical object.- Check if it is job/task setup/cleanup task.

- If so, run their respective method and return.- Else, do Map/Reduce specific actions !!!

- Perform commit operation if it is required.

Wednesday, March 27, 13

Task creation in little more detail

Task tracker Node

TaskTrackerjobClientvoid run() { offerService();}

TaskRunner

start()

LaunchTaskAction

void run() {

}

JvmManager

JvmRunnerrunChild() {..tracker.getTaskController().launchTask(...)..}

Different JVM

umbilicalChildvoid main(..) { .... }

MapTask or Reduce Taskrun(job, umbilical) {

}

TaskReporter

- Create TaskReporter that also uses umbilical object.- Check if it is job/task setup/cleanup task.

- If so, run their respective method and return.- Else, do Map/Reduce specific actions !!!

- Perform the commit operation if it is required. - If it is a speculative task, ensure that only one of the duplicate tasks is committed.

Wednesday, March 27, 13
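
Pulling the last few slides together, the child JVM's entry point does roughly the following. This is a commented outline of the steps the slides list, not the real Child.java source.

// Commented outline only -- the real logic is in the Hadoop Child class
// (the default main class launched in the task JVM, as noted above).
public class ChildOutlineSketch {
  public static void main(String[] args) throws Exception {
    // 1. Open the "umbilical" RPC proxy back to the TaskTracker (host/port and
    //    the task attempt id arrive as command-line arguments).
    // 2. Ask the TaskTracker, via the umbilical, for the task to run in this
    //    JVM; with JVM reuse, several tasks may be handed out one after another.
    // 3. Create a TaskReporter that sends progress/status over the same umbilical.
    // 4. If this is a job/task setup or cleanup task, run that method and return.
    // 5. Otherwise call MapTask.run(job, umbilical) or ReduceTask.run(job, umbilical).
    // 6. If the output needs committing (e.g. a speculative attempt), coordinate
    //    through the umbilical so that only one duplicate attempt commits.
  }
}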

Map-specific actions:

Wednesday, March 27, 13

Map-specific actions:

Wednesday, March 27, 13

Map-specific actions:

map

map

map

MapperInputFormat

mapper & input using ReflectionUtils.newInstance(...)

Wednesday, March 27, 13

Map-specific actions:

map

map

map

MapperInputFormat

mapper & input using ReflectionUtils.newInstance(...)

split 1

split 2

split 3

split 4

split 5

Build split using MapTask’s getSplitDetails(splitIndex, ...) + Use FileSystem/Deserializer from JobConf

Wednesday, March 27, 13

Map-specific actions:

map

map

map

MapperInputFormat

mapper & input using ReflectionUtils.newInstance(...)

split 1

split 2

split 3

split 4

split 5

Build split using MapTask’s getSplitDetails(splitIndex, ...) + Use FileSystem/Deserializer from JobConf

For each key-value read from the split (through context.nextKeyValue()), call user-defined map

Wednesday, March 27, 13
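
The "for each key-value, call user-defined map" step is exactly what the new-API Mapper's run() loop does. The sketch below shows a trivial mapper together with that loop shape; it is written from memory of the org.apache.hadoop.mapreduce API, so treat the run() body as an approximation of the framework code rather than a verbatim copy.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PassThroughMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  // User-defined map: called once per record that the framework reads.
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(key, value);
  }

  // The framework's driver loop looks essentially like this:
  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) { // read the next record from the split
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }
}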

Map-specific actions:

map

map

map

MapperInputFormat

mapper & input using ReflectionUtils.newInstance(...)

split 1

split 2

split 3

split 4

split 5

Build split using MapTask’s getSplitDetails(splitIndex, ...) + Use FileSystem/Deserializer from JobConf

For each key-value read from the split (through context.nextKeyValue()), call user-defined map

Sort/Spill

Wednesday, March 27, 13

Map-specific actions:

map

map

map

MapperInputFormat

mapper & input using ReflectionUtils.newInstance(...)

split 1

split 2

split 3

split 4

split 5

Build split using MapTask’s getSplitDetails(splitIndex, ...) + Use FileSystem/Deserializer from JobConf

For each key-value read from the split (through context.nextKeyValue()), call user-defined map

Store output of map into in-memory circular buffer (MapOutputBuffer)

Sort/Spill

Wednesday, March 27, 13

Map-specific actions:

map

map

map

MapperInputFormat

mapper & input using ReflectionUtils.newInstance(...)

split 1

split 2

split 3

split 4

split 5

Build split using MapTask’s getSplitDetails(splitIndex, ...) + Use FileSystem/Deserializer from JobConf

For each key-value read from the split (through context.nextKeyValue()), call user-defined map

Store the output of map into an in-memory circular buffer (MapOutputBuffer). - If there is no reducer, DirectMapOutputCollector is used instead, which writes immediately to disk.

Sort/Spill

Wednesday, March 27, 13

Map-specific actions:

map

map

map

MapperInputFormat

mapper & input using ReflectionUtils.newInstance(...)

split 1

split 2

split 3

split 4

split 5

Build split using MapTask’s getSplitDetails(splitIndex, ...) + Use FileSystem/Deserializer from JobConf

For each key-value read from the split (through context.nextKeyValue()), call user-defined map

Store the output of map into an in-memory circular buffer (MapOutputBuffer). - If there is no reducer, DirectMapOutputCollector is used instead, which writes immediately to disk. - When the buffer reaches a certain threshold, a background thread (MapOutputBuffer's inner class SpillThread) starts spilling the buffer to disk (mapred.local.dir).

Sort/Spill

Wednesday, March 27, 13

Map-specific actions:

map

map

map

MapperInputFormat

mapper & input using ReflectionUtils.newInstance(...)

split 1

split 2

split 3

split 4

split 5

Build split using MapTask’s getSplitDetails(splitIndex, ...) + Use FileSystem/Deserializer from JobConf

For each key-value read from the split (through context.nextKeyValue()), call user-defined map

Store the output of map into an in-memory circular buffer (MapOutputBuffer). - If there is no reducer, DirectMapOutputCollector is used instead, which writes immediately to disk. - When the buffer reaches a certain threshold, a background thread (MapOutputBuffer's inner class SpillThread) starts spilling the buffer to disk (mapred.local.dir).

- If a combiner is specified, run it when there are at least 3 spill files (min.num.spills.for.combine)

Sort/Spill

Wednesday, March 27, 13

Map-specific actions:

map

map

map

MapperInputFormat

mapper & input using ReflectionUtils.newInstance(...)

split 1

split 2

split 3

split 4

split 5

Build split using MapTask’s getSplitDetails(splitIndex, ...) + Use FileSystem/Deserializer from JobConf

For each key-value read from the split (through context.nextKeyValue()), call user-defined map

Store the output of map into an in-memory circular buffer (MapOutputBuffer). - If there is no reducer, DirectMapOutputCollector is used instead, which writes immediately to disk. - When the buffer reaches a certain threshold, a background thread (MapOutputBuffer's inner class SpillThread) starts spilling the buffer to disk (mapred.local.dir).

- If a combiner is specified, run it when there are at least 3 spill files (min.num.spills.for.combine) - Before writing to disk, compress if mapred.compress.map.output is true.

Sort/Spill

Wednesday, March 27, 13

Map-specific actions:

map

map

map

MapperInputFormat

mapper & input using ReflectionUtils.newInstance(...)

split 1

split 2

split 3

split 4

split 5

Build split using MapTask’s getSplitDetails(splitIndex, ...) + Use FileSystem/Deserializer from JobConf

For each key-value read from the split (through context.nextKeyValue()), call user-defined map

Store the output of map into an in-memory circular buffer (MapOutputBuffer). - If there is no reducer, DirectMapOutputCollector is used instead, which writes immediately to disk. - When the buffer reaches a certain threshold, a background thread (MapOutputBuffer's inner class SpillThread) starts spilling the buffer to disk (mapred.local.dir).

- If a combiner is specified, run it when there are at least 3 spill files (min.num.spills.for.combine) - Before writing to disk, compress if mapred.compress.map.output is true. - Sort uses the user-defined Comparator and Partitioner.

Sort/Spill

Wednesday, March 27, 13

Map-specific actions:

map

map

map

MapperInputFormat

mapper & input using ReflectionUtils.newInstance(...)

split 1

split 2

split 3

split 4

split 5

Build split using MapTask’s getSplitDetails(splitIndex, ...) + Use FileSystem/Deserializer from JobConf

For each key-value read from the split (through context.nextKeyValue()), call user-defined map

Store the output of map into an in-memory circular buffer (MapOutputBuffer). - If there is no reducer, DirectMapOutputCollector is used instead, which writes immediately to disk. - When the buffer reaches a certain threshold, a background thread (MapOutputBuffer's inner class SpillThread) starts spilling the buffer to disk (mapred.local.dir).

- If a combiner is specified, run it when there are at least 3 spill files (min.num.spills.for.combine) - Before writing to disk, compress if mapred.compress.map.output is true. - Sort uses the user-defined Comparator and Partitioner.

Sort/Spill

Final output: one sorted, partitioned file

Wednesday, March 27, 13
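
A job-side sketch of the sort/spill knobs named on the map-side slides; the property names are the old (1.x-style) ones used throughout this deck, and the values are illustrative.

import org.apache.hadoop.mapred.JobConf;

public class SpillConfigSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    conf.setInt("io.sort.mb", 200);                       // in-memory sort buffer size (MB)
    conf.setFloat("io.sort.spill.percent", 0.90f);        // spill threshold on the data buffer
    conf.setInt("min.num.spills.for.combine", 3);         // run the combiner only if >= 3 spills
    conf.setBoolean("mapred.compress.map.output", true);  // compress map output before writing
  }
}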

In-memory circular buffer

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

<Partition, Key offset, Value offset>

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes) = 327680

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680

Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes) = 327680

Wednesday, March 27, 13
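
The numbers in the MapTask log lines above fall straight out of the three properties. A small self-contained computation, reproducing 79691776/99614720 and 262144/327680 for the defaults:

public class SortBufferMath {
  public static void main(String[] args) {
    long ioSortBytes = 100L * 1024 * 1024;     // io.sort.mb = 100 MB = 104857600 bytes
    double recordPercent = 0.05;               // io.sort.record.percent
    double spillPercent  = 0.8;                // io.sort.spill.percent
    int bytesPerRecordEntry = 4 * 4;           // 4 ints of accounting metadata per record

    long dataBuffer    = (long) (ioSortBytes * (1 - recordPercent));                  // 99614720
    long spillTrigger  = (long) (dataBuffer * spillPercent);                          // 79691776
    long maxRecords    = (long) (ioSortBytes * recordPercent) / bytesPerRecordEntry;  // 327680
    long recordTrigger = (long) (maxRecords * spillPercent);                          // 262144

    System.out.println("data buffer   = " + spillTrigger + "/" + dataBuffer);
    System.out.println("record buffer = " + recordTrigger + "/" + maxRecords);
  }
}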

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680

Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes) = 327680

2 common cases for spilling:

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680

Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes) = 327680

2 common cases for spilling:1. Lot of small records filling up the record buffer

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680

Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes) = 327680

2 common cases for spilling:1. Lot of small records filling up the record buffer

- Spill before the data buffer is full. Tweak io.sort.record.percent using heuristic:

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680

Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes) = 327680

2 common cases for spilling:1. Lot of small records filling up the record buffer

- Spill before the data buffer is full. Tweak io.sort.record.percent using heuristic:= 16 / (16 + avgRecordSize) ... (0.05 optimal if avgRecordSize ~ 300 byte)

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680

Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes) = 327680

2 common cases for spilling:1. Lot of small records filling up the record buffer

- Spill before the data buffer is full. Tweak io.sort.record.percent using heuristic:= 16 / (16 + avgRecordSize) ... (0.05 optimal if avgRecordSize ~ 300 byte)

- See https://issues.apache.org/jira/browse/MAPREDUCE-64

Wednesday, March 27, 13
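
The heuristic above is easy to sanity-check: for an average record of roughly 300 bytes it returns approximately the 0.05 default, and for much smaller records it suggests a larger io.sort.record.percent.

public class RecordPercentHeuristic {
  public static void main(String[] args) {
    int avgRecordSize = 300;                              // bytes, example value
    double recordPercent = 16.0 / (16 + avgRecordSize);   // heuristic from the slide
    System.out.println(recordPercent);                    // ~0.0506, close to the 0.05 default
  }
}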

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680

Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes) = 327680

2 common cases for spilling:1. Lot of small records filling up the record buffer

- Spill before the data buffer is full. Tweak io.sort.record.percent using heuristic:= 16 / (16 + avgRecordSize) ... (0.05 optimal if avgRecordSize ~ 300 byte)

- See https://issues.apache.org/jira/browse/MAPREDUCE-64
INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680

Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes) = 327680

2 common cases for spilling:1. Lot of small records filling up the record buffer

- Spill before the data buffer is full. Tweak io.sort.record.percent using heuristic:= 16 / (16 + avgRecordSize) ... (0.05 optimal if avgRecordSize ~ 300 byte)

- See https://issues.apache.org/jira/browse/MAPREDUCE-64
INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true

2. Few but very large records filling up the data buffer

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680

Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes) = 327680

2 common cases for spilling:1. Lot of small records filling up the record buffer

- Spill before the data buffer is full. Tweak io.sort.record.percent using heuristic:= 16 / (16 + avgRecordSize) ... (0.05 optimal if avgRecordSize ~ 300 byte)

- See https://issues.apache.org/jira/browse/MAPREDUCE-64
INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true

2. Few but very large records filling up the data buffer- Increase buffer size and also spill percent (~ 1). Key: Try to spill only once.

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers

kvindices(1 int)

kvoffsets (3 ints)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680

Max #records w/o spill: $1 * 0.05 / (4 ints * 4 bytes) = 327680

2 common cases for spilling:1. Lot of small records filling up the record buffer

- Spill before the data buffer is full. Tweak io.sort.record.percent using heuristic:= 16 / (16 + avgRecordSize) ... (0.05 optimal if avgRecordSize ~ 300 byte)

- See https://issues.apache.org/jira/browse/MAPREDUCE-64
INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true

2. Few but very large records filling up the data buffer- Increase buffer size and also spill percent (~ 1). Key: Try to spill only once.- Tradeoff: Buffer takes memory from JVM (i.e. from mapred.child.java.opts). Therefore, if Max JVM =1GB and $1=128MB, then user code gets only 896MB.

Wednesday, March 27, 13

In-memory circular buffer

io.sort.mb (Default: 100MB = 104857600 bytes) = $1

$1 * io.sort.spill.percent (Default: 0.8)

$1 * io.sort.record.percent (Default: 0.05)

Record pointers:

kvoffsets (1 int per record)

kvindices (3 ints per record)

Index buffer:

Partition buffer:

Avail data buffer: $1 * (1 - 0.05) * 0.8 = 79691776

INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720

INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680

Record buffer capacity: $1 * 0.05 / (4 ints * 4 bytes) = 327680 records (spill triggers at 0.8 * 327680 = 262144)

2 common cases for spilling:

1. Lots of small records filling up the record buffer

- Spill happens before the data buffer is full. Tweak io.sort.record.percent using the heuristic 16 / (16 + avgRecordSize); the default 0.05 is optimal if avgRecordSize is ~300 bytes.

- See https://issues.apache.org/jira/browse/MAPREDUCE-64

INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true

2. A few very large records filling up the data buffer

- Increase the buffer size and also the spill percent (to ~1). Key: try to spill only once. (A configuration sketch follows this slide.)

- Tradeoff: the buffer takes memory from the task JVM (i.e. from mapred.child.java.opts). So if the max JVM heap is 1GB and $1 = 128MB, user code gets only 896MB.

INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full = true

Wednesday, March 27, 13
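These map-side sort-buffer knobs can be set per job. Below is a minimal configuration sketch, assuming the classic Hadoop 1.x JobConf API; the avgRecordSize of 300 bytes is an illustrative assumption, and the heuristic line mirrors the io.sort.record.percent rule from this slide.

```java
// Minimal sketch (Hadoop 1.x JobConf): map-side sort buffer tuning.
// avgRecordSize = 300 bytes is an illustrative assumption, not a measured value.
import org.apache.hadoop.mapred.JobConf;

public class SortBufferTuning {
  public static void configure(JobConf conf) {
    int avgRecordSize = 300;                                 // assumed avg map output record size (bytes)
    float recordPercent = 16f / (16 + avgRecordSize);        // slide heuristic: 16 accounting bytes per record (~0.05)

    conf.setInt("io.sort.mb", 100);                          // $1: total in-memory sort buffer, in MB
    conf.setFloat("io.sort.spill.percent", 0.80f);           // start spilling when the buffer is 80% full
    conf.setFloat("io.sort.record.percent", recordPercent);  // fraction reserved for record pointers
  }
}
```

For the "few very large records" case, the same sketch would instead raise io.sort.mb and push io.sort.spill.percent toward 1.0, so that the map output spills only once.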

Sort/Spill

Reduce-specific actions:

map

map

map

Mapper

InputFormat

split 1

split 2

split 3

split 4

split 5

TaskTracker (map-side)

mapping info

TaskTracker (reduce-side)

TaskTracker (reduce-side)

JobTracker

thru heartbeat

Reducers know which machines to fetch data from.

Wednesday, March 27, 13
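The flow on this slide: map-side TaskTrackers report completed map tasks to the JobTracker through their heartbeats, and reduce-side TaskTrackers learn the map-output locations from the heartbeat responses. A purely conceptual sketch is below; the type and method names (JobTrackerView, fetchCompletedMapLocations) are hypothetical, not the real Hadoop internals API.

```java
// Conceptual sketch only: the names below are hypothetical, not Hadoop's actual classes.
import java.util.List;

interface JobTrackerView {
  // Locations (host:port) of finished map tasks for a job, as piggybacked
  // on TaskTracker heartbeat responses.
  List<String> fetchCompletedMapLocations(String jobId);
}

class ReduceSideFetcherPlanner {
  private final JobTrackerView jobTracker;

  ReduceSideFetcherPlanner(JobTrackerView jobTracker) {
    this.jobTracker = jobTracker;
  }

  // A reduce task repeatedly asks which map outputs are ready, then fetches
  // them from the reported map-side TaskTrackers.
  List<String> hostsToFetchFrom(String jobId) {
    return jobTracker.fetchCompletedMapLocations(jobId);
  }
}
```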

reduce

reduce

Sort/Spill

Reduce-specific actions:

map

map

map

Mapper

InputFormat

split 1

split 2

split 3

split 4

split 5

TaskStatus.Phase.SHUFFLE

Fetch

ReduceTask

if (mapred.job.tracker != local)

TaskTracker (map-side)

mapping info

ReduceCopier

fetchOutput() { ... }

MapOutputCopier

HttpServer

MapOutputServlet

- Get output using HTTP

- mapred.reduce.parallel.copies: #MapOutputCopier threads (i.e. # fetches in parallel on each reduce task)

- Default: 5

- tasktracker.http.threads: # clients the HttpServer will service

- Default: 40

- MapReduce 2 will use Netty (2x # processors)

Wednesday, March 27, 13
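A short configuration sketch for the two knobs quoted above, again assuming the Hadoop 1.x JobConf API; the values shown are simply the defaults from this slide.

```java
// Minimal sketch (Hadoop 1.x): shuffle-fetch parallelism settings from this slide.
import org.apache.hadoop.mapred.JobConf;

public class ShuffleFetchTuning {
  public static void configure(JobConf conf) {
    // Number of MapOutputCopier threads per reduce task (parallel HTTP fetches).
    conf.setInt("mapred.reduce.parallel.copies", 5);
    // Number of worker threads the TaskTracker's HttpServer uses to serve map
    // outputs; normally a cluster-wide TaskTracker setting, shown here only to
    // illustrate the property name.
    conf.setInt("tasktracker.http.threads", 40);
  }
}
```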

reduce

reduce

Sort/Spill

Reduce-specific actions:

map

map

map

Mapper

InputFormat

split 1

split 2

split 3

split 4

split 5

TaskStatus.Phase.SHUFFLE

Fetch

ReduceTask

if (mapred.job.tracker != local)

ReduceCopier

fetchOutput() { ... }

MapOutputCopier

Is map output size < ShuffleRamManager’s MaxSingleShuffleLimit?

- Yes: keep the output in memory

- No: write it to disk

MaxSingleShuffleLimit = (-Xmx from mapred.child.java.opts) * mapred.job.shuffle.input.buffer.percent (default: 0.7) * 0.25f (a worked example follows this slide)

INFO org.apache.hadoop.mapred.ReduceTask: Shuffling ? bytes (? raw bytes) into (RAM/Local-FS) from attempt_?

LocalFSMerger

InMemFSMergeThread

Perform (interleaved) “on-disk merge” if:

- # files on disk > 2*io.sort.factor - 1 (fairly rare)

- E.g., 50 files and io.sort.factor = 10: 5 rounds of merging, 10 files at a time*

Merge

SORT

Perform “in-memory merge” if:

- Used memory > (-Xmx * 0.7) * mapred.job.shuffle.merge.percent (default: 0.66)

- Or # map outputs > mapred.inmem.merge.threshold (default: 1000)

Finally, the remaining in-memory data is spilled to disk. Why?

- The framework assumes the user’s reduce() needs all the RAM.

- Can be tweaked via mapred.job.reduce.input.buffer.percent (default: 0) up to ~0.7 if the reducer is simple.

Wednesday, March 27, 13
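A worked example of the limits above, as a small sketch assuming a 1 GB reduce-task heap (-Xmx1024m) and the default percentages quoted on this slide.

```java
// Back-of-the-envelope sketch of the reduce-side shuffle memory limits.
// Assumes -Xmx1024m; the 0.25f factor mirrors the MaxSingleShuffleLimit rule above.
public class ShuffleMemoryMath {
  public static void main(String[] args) {
    long heapBytes = 1024L * 1024 * 1024;              // -Xmx from mapred.child.java.opts
    float inputBufferPercent = 0.70f;                  // mapred.job.shuffle.input.buffer.percent
    float mergePercent = 0.66f;                        // mapred.job.shuffle.merge.percent

    long shuffleBuffer     = (long) (heapBytes * inputBufferPercent);  // RAM for in-memory map outputs
    long maxSingleShuffle  = (long) (shuffleBuffer * 0.25f);           // largest single output kept in RAM
    long inMemMergeTrigger = (long) (shuffleBuffer * mergePercent);    // in-memory merge threshold

    System.out.println("shuffle buffer       = " + shuffleBuffer);     // ~751.6e6 bytes
    System.out.println("max single shuffle   = " + maxSingleShuffle);  // ~187.9e6 bytes
    System.out.println("in-mem merge trigger = " + inMemMergeTrigger); // ~496.1e6 bytes
  }
}
```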

reduce

reduce

Sort/Spill

Reduce-specific actions:

map

map

map

Mapper

InputFormat

split 1

split 2

split 3

split 4

split 5

TaskStatus.Phase.SHUFFLE

Fetch

ReduceTask

Merge

SORT

Use RawKeyValueIterator and call the user-defined Reducer class.

part-0

part-1

Reducer

OutputFormat

REDUCE

Wednesday, March 27, 13
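For concreteness, here is a minimal user-defined Reducer of the kind invoked in the REDUCE phase above (a word-count style sum), written against the classic org.apache.hadoop.mapred API used throughout this deck; it is an illustration, not code from the deck.

```java
// Illustrative sketch: a minimal Reducer invoked in the REDUCE phase.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();              // values come from the merged, sorted map outputs
    }
    output.collect(key, new IntWritable(sum)); // written via the job's OutputFormat to part-N
  }
}
```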

References

- Hadoop: The Definitive Guide, 3rd edition, by Tom White.
- Hadoop Operations by Eric Sammer.
- Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer.
- Mining of Massive Datasets by Rajaraman et al.
- Online Aggregation for Large MapReduce Jobs by Pansare et al.
- Distributed and Cloud Computing by Hwang et al.
- http://developer.yahoo.com/hadoop/tutorial/
- http://www.slideshare.net/cloudera/mr-perf
- http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html
- http://www.cs.rice.edu/~fd2/pdf/hpdc106-dinu.pdf

Wednesday, March 27, 13