Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by...
![Page 1: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/1.jpg)
Scheduling in MapReduce using Machine Learning Techniques
IIIT Hyderabad
Vasudeva Varma [email protected] Nanduri [email protected]
Cloud Computing Group
Search and Information Extraction Lab
http://search.iiit.ac.in
![Page 2: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/2.jpg)
Agenda
• Cloud Computing Group @ IIIT Hyderabad
• Admission Control
• Task Assignment
• Conclusion
![Page 3: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/3.jpg)
Cloud Computing Group @ IIIT Hyderabad
• Search and Information Extraction
  – Large datasets
  – Clusters of machines
  – Web crawling
  – Data intensive applications
• MapReduce
  – Apache Hadoop
![Page 4: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/4.jpg)
Research Areas
• Resource management for MapReduce
  – Scheduling
  – Data Placement
• Power aware resource management
• Data management in cloud
• Virtualization
![Page 5: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/5.jpg)
Teaching
• Cloud Computing course
  – Monsoon semester (2008 onwards)
  – Special focus on Apache Hadoop
    • MapReduce and HDFS
    • Mahout
  – Virtualization
  – NoSQL databases
  – Guest lectures from industry experts
![Page 6: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/6.jpg)
Learning Based Admission Control and Task Assignment in MapReduce
• Learning based approach
• Admission Control
  – Should we accept a job for execution in the cluster?
• Task Assignment
  – Which task to choose for running on a given node?
![Page 7: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/7.jpg)
Admission Control
Deciding if and which request to accept from a set of incoming requests
• Critical in achieving better QoS
• Important to prevent over committing
• Needed to maximize the utility from the perspective of a service provider
![Page 8: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/8.jpg)
MapReduce as a Service
• Web services interface for MR jobs
• Users search jobs through repositories
• Select one that matches their criteria
• Launch it on clusters managed by service provider
• Service providers rent infrastructure from IaaS provider
![Page 9: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/9.jpg)
Utility Functions
• Three phase
• Soft and hard deadlines
• Decay parameters
• Provision for service provider penalty
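The slide names the ingredients of the utility function (three phases, soft and hard deadlines, decay parameters, a provider penalty) but not its formula. The following is a minimal sketch of one shape that fits that description, assuming linear decay between the two deadlines and a flat penalty afterwards. The function name, parameter names, and default values are hypothetical and are not taken from the talk.

```python
def job_utility(finish_time, soft_deadline, hard_deadline,
                max_utility=100.0, decay_rate=None, penalty=50.0):
    """Illustrative three-phase utility; the exact shape is an assumption."""
    if finish_time <= soft_deadline:
        # Phase 1: job finished by the soft deadline, full utility is earned.
        return max_utility
    if finish_time <= hard_deadline:
        # Phase 2: utility decays linearly between the soft and hard deadlines.
        if decay_rate is None:
            # Default decay reaches zero utility exactly at the hard deadline.
            decay_rate = max_utility / (hard_deadline - soft_deadline)
        return max_utility - decay_rate * (finish_time - soft_deadline)
    # Phase 3: hard deadline missed, the service provider pays a penalty.
    return -penalty
```

With a soft deadline of 10 and a hard deadline of 20, a job finishing at time 15 earns half of the maximum utility under the default decay rate.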
![Page 10: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/10.jpg)
Our Approach
• Based on Expected Utility Hypothesis from decision theory
• Accept a job that maximizes the expected utility
• Use pattern classifier to classify incoming jobs
  – Two classes
• Utility functions for prioritizing
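The expected-utility rule can be illustrated with a short sketch. It assumes the expected utility of a job is the classifier's probability that admitting it succeeds multiplied by the job's utility, which is the usual reading of the Expected Utility Hypothesis; the talk does not spell out the formula. `choose_job`, `p_success`, and `utility` are hypothetical names.

```python
def choose_job(candidates, p_success, utility):
    """Return the candidate with the highest expected utility.

    p_success(job) is assumed to be the classifier's probability that
    admitting the job turns out well; utility(job) is its utility value.
    Returns None when no candidate has a positive expected utility.
    """
    best_job, best_value = None, 0.0
    for job in candidates:
        expected = p_success(job) * utility(job)
        if expected > best_value:
            best_job, best_value = job, expected
    return best_job
```

A job with a large utility but a low chance of success can lose to a more modest job that the classifier is confident about, which is the difference from the myopic rule of always taking the maximum utility.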
![Page 11: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/11.jpg)
Feature Vector
• Given as input to the classifier
• Contains job specific and cluster specific parameters
• Includes variables that might affect admission decision
• Cluster Specific
  – Used map slots
  – Used reduce slots
  – Pending maps
  – Pending reduces
  – Finishing jobs
  – Map time average
  – Reduce time average
• Job Specific
  – Number of maps
  – Number of reduces
  – Mean map task time
  – Mean reduce task time
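As an illustration of how the parameters listed above might be packed into a single classifier input, here is a small sketch. The field names mirror the slide, but the class names, types, and the ordering of the resulting tuple are assumptions.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class ClusterState:
    """Cluster-specific parameters from the slide (types are assumed)."""
    used_map_slots: int
    used_reduce_slots: int
    pending_maps: int
    pending_reduces: int
    finishing_jobs: int
    map_time_average: float
    reduce_time_average: float


@dataclass(frozen=True)
class JobSpec:
    """Job-specific parameters from the slide (types are assumed)."""
    num_maps: int
    num_reduces: int
    mean_map_task_time: float
    mean_reduce_task_time: float


def feature_vector(cluster, job):
    """Concatenate cluster-specific and job-specific values into one tuple."""
    return astuple(cluster) + astuple(job)
```

A count-based classifier would normally bucket the continuous values (for example the time averages) into a few discrete levels before use; that step is omitted here.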
![Page 12: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/12.jpg)
Bayesian Classifier
• Naive Bayes Assumption
  – Conditionally independent parameters
  – Works well in practice
• Use past events to predict future outcomes
• Application of Bayes theorem while computing probabilities
• Incremental Learning – efficient w.r.t. memory usage
• Simple to implement
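The points above can be made concrete with a minimal sketch of a count-based naive Bayes classifier. It assumes discrete feature values and Laplace (add-one) smoothing, neither of which is stated in the talk, and the class and method names are hypothetical. Only running counts are kept, which is what makes incremental learning cheap in memory.

```python
from collections import defaultdict


class IncrementalNaiveBayes:
    """Count-based naive Bayes over discrete features (illustrative only)."""

    def __init__(self, n_values=10):
        # Assumed number of distinct values each feature can take;
        # used as the denominator term of Laplace smoothing.
        self.n_values = n_values
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)  # (label, index, value) -> count

    def update(self, features, label):
        """Learn from one observed outcome by incrementing counts."""
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.feature_counts[(label, i, value)] += 1

    def posterior(self, features):
        """Return P(label | features) for every label seen so far."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = count / total  # prior P(label)
            for i, value in enumerate(features):
                seen = self.feature_counts.get((label, i, value), 0)
                # Naive assumption: features are independent given the label.
                score *= (seen + 1) / (count + self.n_values)
            scores[label] = score
        norm = sum(scores.values())
        return {label: s / norm for label, s in scores.items()}
```

Each call to `update` touches one counter per feature, so the model can be refreshed after every scheduling decision without storing past samples.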
![Page 13: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/13.jpg)
Evaluation
• Success/Failure criteria: Load management
• Simulation
• Baseline
  – Myopic: immediately select the job that has maximum utility
  – Random: randomly select one job from the candidate jobs
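The two baselines are simple enough to state directly in code. This is a sketch of how they might look; the function names and the `utility` callback are hypothetical.

```python
import random


def myopic_select(candidates, utility):
    """Myopic baseline: immediately pick the candidate with maximum utility."""
    return max(candidates, key=utility) if candidates else None


def random_select(candidates, rng=random):
    """Random baseline: pick one candidate uniformly at random."""
    return rng.choice(candidates) if candidates else None
```

Neither baseline looks at the state of the cluster, which is the information the learning-based approach adds.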
![Page 14: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/14.jpg)
Algorithm Accuracy
![Page 15: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/15.jpg)
Comparison with baseline
| Algorithm | Achieved Load Average |
|---|---|
| Random | 42.11 |
| Myopic | 42.09 |
| Our algorithm | 0.97 |
![Page 16: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/16.jpg)
Meeting Deadlines
![Page 17: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/17.jpg)
Task Assignment
• Deciding if a task can be assigned on a node
• Learning based technique
• Extension of the work presented before
![Page 18: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/18.jpg)
Learning Scheduler
![Page 19: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/19.jpg)
Features of Learning Scheduler
• Flexible task assignment – based on state of resources
• Consider job profile while allocating
• Tries to avoid overloading task trackers
• Allow users to control assignment by specifying priority functions
• Incremental learning
![Page 20: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/20.jpg)
Using Classifier
• Use a pattern classifier to classify candidate jobs
• Two classes: good and bad
• Good tasks don't overload task trackers
• Overload: A limit set on system load average by the admin
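The good/bad labelling rule lends itself to a short sketch. It assumes feedback is collected as the load average observed on a task tracker after an assignment and compared with the admin's limit; how and when the scheduler actually samples the load is not described in the slides. All names here are hypothetical.

```python
def label_assignment(load_average_after, overload_limit):
    """Label a past assignment: 'bad' if it pushed the node over the limit."""
    return "bad" if load_average_after > overload_limit else "good"


def training_examples(history, overload_limit):
    """Turn (features, load_average_after) records into (features, label) pairs."""
    return [
        (features, label_assignment(load, overload_limit))
        for features, load in history
    ]
```

The resulting pairs are the kind of feedback an incrementally trained classifier would consume after each assignment.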
![Page 21: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/21.jpg)
Feature Vector
• Job features
  – CPU, memory, network and disk usage of a job
• Node properties
  – Static: Number of processors, maximum physical and virtual memory, CPU Frequency
  – Dynamic: State of resources, Number of running map tasks, Number of running reduce tasks
![Page 22: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/22.jpg)
Job Selection
• From the candidates labelled as good, select the one with maximum priority
• Create a task of the selected job
![Page 23: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/23.jpg)
Priority (Utility) Functions
• Policy enforcement
  – FIFO: U(J) = J.age
  – Revenue oriented
• If the priority of all jobs is equal, the scheduler will always assign the task that has the maximum likelihood of being labelled good.
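Job selection and the priority functions can be combined in one small sketch. It assumes a job counts as good when the classifier's probability of "good" exceeds 0.5, and it breaks priority ties by that probability so that equal priorities fall back to the most likely good job, as the slide describes. The dictionary-based job representation, the `p_good` callback, and the threshold are assumptions.

```python
def fifo_priority(job):
    """FIFO policy from the slide: U(J) = J.age."""
    return job["age"]


def select_job(candidates, p_good, priority=fifo_priority, threshold=0.5):
    """Pick the highest-priority job among those classified as good.

    p_good(job) is assumed to be the classifier's probability that running
    a task of this job on the node will not overload it.
    """
    good = [job for job in candidates if p_good(job) > threshold]
    if not good:
        return None  # nothing is safe to assign on this node right now
    # Compare by priority first; equal priorities are decided by p_good.
    return max(good, key=lambda job: (priority(job), p_good(job)))
```

Swapping `fifo_priority` for a revenue-based function changes the policy without touching the classifier.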
![Page 24: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/24.jpg)
Job Profile
• Users submit 'hints' about job performance
• Estimate job's resource consumption on a scale of 10, 10 being the highest.
• This data is passed at job submission time through job parameters:
  – learnsched.jobstat.map - "1:2:3:4"
• The scheduler is open source at http://code.google.com/p/learnsched/
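A hint string such as "1:2:3:4" could be parsed along the following lines. This is a sketch only: the slide does not say which position maps to which resource, so the order CPU, memory, network, disk is an assumption borrowed from the job features listed on the Feature Vector slide, and accepting values from 0 to 10 is also an assumption. `parse_job_hint` is a hypothetical helper, not part of learnsched.

```python
# Assumed field order; the slides do not specify it.
RESOURCES = ("cpu", "memory", "network", "disk")


def parse_job_hint(value):
    """Parse a colon-separated hint like '1:2:3:4' into per-resource levels."""
    parts = value.split(":")
    if len(parts) != len(RESOURCES):
        raise ValueError(f"expected {len(RESOURCES)} fields, got {len(parts)}")
    levels = [int(part) for part in parts]
    if any(not 0 <= level <= 10 for level in levels):
        raise ValueError("each level must be on a scale of 10")
    return dict(zip(RESOURCES, levels))
```

The parsed levels would then form the job-feature part of the classifier input.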
![Page 25: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/25.jpg)
Evaluation
• Evaluation workload
  – TextWriter
  – WordCount
  – WordCount + 10ms delay
  – URLGet
  – URLToDisk
  – CPU Activity
![Page 26: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/26.jpg)
Learning Behaviour
![Page 27: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/27.jpg)
Classifier Accuracy
![Page 28: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/28.jpg)
Conclusions
• Feedback informed classifiers can be used effectively
• Better QoS than naive approaches
• Less runtime, happy users, more revenue for the service provider
![Page 29: Apache Hadoop India Summit 2011 talk "Scheduling in MapReduce using Machine Learning Techniques" by Vasudeva Varma](https://reader035.fdocuments.net/reader035/viewer/2022081413/54582519b1af9fba5d8b4b7d/html5/thumbnails/29.jpg)
Thank you
IIIT Hyderabad
Questions/Suggestions/Comments?
Vasudeva Varma [email protected] Nanduri [email protected]
Cloud Computing Group
Search and Information Extraction Lab
http://search.iiit.ac.in