Hadoop job chaining
Training (Day – 2)
Chaining
• The simple idea is a sequential call through JobClient.
• Not every problem can be solved with a MapReduce program, but fewer still are those which can be solved with a single MapReduce job.
• Many problems can be solved with MapReduce by writing several MapReduce steps which run in series to accomplish a goal:
– Map1 -> Reduce1 -> Map2 -> Reduce2 -> Map3 ...
• You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job.
JobClient in action
• Call the first driver method, which uses JobClient.runJob() to run the job and wait for it to complete.
• When that job has completed, call the next driver method, which creates a new JobConf object referring to different instances of Mapper and Reducer, etc.
• The first job in the chain should write its output to a path which is then used as the input path for the second job.
• This process can be repeated for as many jobs as are necessary to arrive at a complete solution to the problem. A sketch of this pattern follows below.
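As a concrete illustration, here is a minimal sketch of two jobs chained through JobClient.runJob(), assuming the old org.apache.hadoop.mapred API; the class name ChainedDriver is hypothetical, and the pass-through IdentityMapper/IdentityReducer stand in for real job logic:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainedDriver {

  // Driver helper: builds a trivial pass-through job; a real driver would
  // configure its own Mapper, Reducer, and key/value classes here.
  private static JobConf makeJob(String name, Path in, Path out) {
    JobConf conf = new JobConf(ChainedDriver.class);
    conf.setJobName(name);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, in);
    FileOutputFormat.setOutputPath(conf, out);
    return conf;
  }

  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);  // output of job 1, input of job 2
    Path output = new Path(args[2]);

    // runJob() blocks until the job completes and throws on failure,
    // so the second job is only submitted after the first has finished.
    JobClient.runJob(makeJob("step-1", input, intermediate));
    JobClient.runJob(makeJob("step-2", intermediate, output));
  }
}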
JobControl
• Hadoop provides another mechanism for managing batches of jobs with dependencies between them.
• Rather than submitting a JobConf to the JobClient's
– runJob() or
– submitJob() methods,
• org.apache.hadoop.mapred.jobcontrol.Job objects can be created to represent each job.
• A Job takes a JobConf object as its constructor argument.
• Jobs can depend on one another through the use of the addDependingJob() method.
• The code x.addDependingJob(y) says that Job x cannot start until y has successfully completed.
Creating dependencies using JobControl
• Dependency information cannot be added to a job after it has already been started.
• Given a set of jobs, these can be passed to an instance of the JobControl class.
• JobControl can receive individual jobs via the addJob() method, or a collection of jobs via addJobs().
• The JobControl object will spawn a thread in the client to launch the jobs (see the sketch below).
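A minimal sketch of building such a dependency graph, assuming the old mapred API; the class name DependencyDriver is hypothetical, and confA/confB are hypothetical JobConf objects configured as in the earlier example:

import java.io.IOException;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class DependencyDriver {

  // Wraps two configured jobs so that jobB runs only after jobA succeeds.
  public static JobControl buildControl(JobConf confA, JobConf confB)
      throws IOException {
    Job jobA = new Job(confA);
    Job jobB = new Job(confB);
    jobB.addDependingJob(jobA);   // jobB cannot start until jobA completes

    JobControl control = new JobControl("day2-chain");
    control.addJob(jobA);         // jobs can be added one at a time...
    control.addJob(jobB);         // ...or in bulk via addJobs(collection)
    return control;
  }
}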
JobControl special features
• Individual jobs will be launched when their dependencies have all successfully completed and when the MapReduce system as a whole has resources to execute the jobs.
• The JobControl interface allows you to query it to retrieve the state of individual jobs, as well as the lists of jobs waiting, ready, running, and finished.
• The job submission process does not begin until the run() method of the JobControl object is called (see the sketch below).
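Continuing the sketch above, a hypothetical method for the DependencyDriver class that drives the JobControl from a separate thread and polls its state; the method names used here are from the old mapred jobcontrol API:

// JobControl implements Runnable; submission begins when run() executes.
public static void runControl(JobControl control) throws InterruptedException {
  Thread runner = new Thread(control);
  runner.start();                    // job submission begins only now

  while (!control.allFinished()) {   // query the state of the job group
    System.out.println("waiting: " + control.getWaitingJobs().size()
        + ", ready: " + control.getReadyJobs().size()
        + ", running: " + control.getRunningJobs().size()
        + ", failed: " + control.getFailedJobs().size());
    Thread.sleep(5000);
  }
  control.stop();                    // shut down the managing thread
}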
ChainMapper
• The ChainMapper class allows the use of multiple Mapper classes within a single Map task.
• The Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last Mapper, whose output is written to the task's output.
• The key functionality of this feature is that the Mappers in the chain do not need to be aware that they are executed in a chain.
• This enables having reusable specialized Mappers that can be combined to perform composite operations within a single task.
ChainMapper
• Special care has to be taken when creating chains: the key/values output by a Mapper must be valid for the following Mapper in the chain.
• It is assumed that all Mappers and the Reducer in the chain use matching output and input key and value classes, as no conversion is done by the chaining code.
• Using the ChainMapper and ChainReducer classes, it is possible to compose MapReduce jobs that look like [MAP+ / REDUCE MAP*]: one or more Mappers, then a Reducer, then zero or more Mappers.
• An immediate benefit of this pattern is a dramatic reduction in disk IO, since the chained stages run within a single task rather than as separate jobs writing intermediate data to HDFS.
ChainMapper usage pattern (new org.apache.hadoop.mapreduce API):

...
Job job = new Job(conf);

Configuration mapAConf = new Configuration(false);
...
ChainMapper.addMapper(job, AMap.class, LongWritable.class, Text.class,
    Text.class, Text.class, mapAConf);

Configuration mapBConf = new Configuration(false);
...
ChainMapper.addMapper(job, BMap.class, Text.class, Text.class,
    LongWritable.class, Text.class, mapBConf);

...
job.waitForCompletion(true);
...
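For a self-contained illustration, here is a sketch of a complete [MAP+ / REDUCE MAP*]-style job, again assuming the new org.apache.hadoop.mapreduce API; the classes LowerCaseMap, LengthMap, and SumReduce are hypothetical, chosen so that each stage's output key/value types match the next stage's input types:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainExample {

  // Stage 1: (LongWritable, Text) -> (Text, Text); lower-cases each line.
  public static class LowerCaseMap extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text("line"), new Text(value.toString().toLowerCase()));
    }
  }

  // Stage 2: (Text, Text) -> (Text, LongWritable); emits each line's byte length.
  public static class LengthMap extends Mapper<Text, Text, Text, LongWritable> {
    protected void map(Text key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(key, new LongWritable(value.getLength()));
    }
  }

  // Reduce stage: sums the lengths per key.
  public static class SumReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    protected void reduce(Text key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "chain");
    job.setJarByClass(ChainExample.class);

    // Both mappers and the reducer run inside one MapReduce job, so the
    // records passed between chained mappers never hit HDFS.
    ChainMapper.addMapper(job, LowerCaseMap.class, LongWritable.class, Text.class,
        Text.class, Text.class, new Configuration(false));
    ChainMapper.addMapper(job, LengthMap.class, Text.class, Text.class,
        Text.class, LongWritable.class, new Configuration(false));
    ChainReducer.setReducer(job, SumReduce.class, Text.class, LongWritable.class,
        Text.class, LongWritable.class, new Configuration(false));

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}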
End of session
Day – 2: Chaining