Predictions for Parallel Applications and Systems
![Page 1: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/1.jpg)
SERC Research Seminar Day, August 18, 2007
Predictions for Parallel Applications and Systems
Sathish Vadhiyar
Grid Applications Research Laboratory (GARL)
![Page 2: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/2.jpg)
GARL Research
• Grid Applications
– Climate Modeling
– Gene Mutations
• Performance Modeling
• Rescheduling
• Others
– Prediction of queue wait times
![Page 4: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/4.jpg)
Rescheduling
• The base is a parallel checkpointing library called SRS
• Checkpointing: storing an application’s state so that it can continue from the saved state after an interruption
• Interruptions are caused either by a scheduler or by system faults
• SRS allows processor reconfiguration
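The checkpoint-and-reconfigure idea can be illustrated with a minimal sketch. This is not the SRS API: the function names, the JSON file, and the block distribution are all assumptions made for illustration. The state is stored independently of the number of processes, so it can be reloaded and redistributed over a different number of them.

```python
import json
import os
import tempfile

def save_checkpoint(path, step, chunks):
    """Store the iteration counter and the global data, independent of process count."""
    state = {"step": step, "data": [x for chunk in chunks for x in chunk]}
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename, so a crash never leaves a partial file

def load_checkpoint(path, nprocs):
    """Reload the state and redistribute the data in blocks over `nprocs` processes."""
    with open(path) as f:
        state = json.load(f)
    data = state["data"]
    size = -(-len(data) // nprocs)  # ceiling division: block size per process
    chunks = [data[i * size:(i + 1) * size] for i in range(nprocs)]
    return state["step"], chunks

# Hypothetical run: checkpoint written by 2 processes, continued on 3.
path = os.path.join(tempfile.gettempdir(), "srs_sketch_checkpoint.json")
save_checkpoint(path, step=42, chunks=[[1, 2, 3], [4, 5, 6]])
step, chunks = load_checkpoint(path, nprocs=3)
print(step, chunks)
```

A real parallel library would write and read the data collectively rather than through one file; the sketch only shows why storing the global view makes reconfiguration possible.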
![Page 5: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/5.jpg)
Application Progress
[Figure: diagram with the labels System 1, Storage, and System 2]
![Page 6: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/6.jpg)
Optimal Checkpoint Interval
• Storing checkpoints periodically will help in fault-tolerance
• How periodic?
• What is the optimal checkpoint interval?
– More frequent checkpointing will lead to increased checkpoint overhead
– Less frequent checkpointing will lead to increased times for recovery from failures
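The trade-off can be made concrete with a well-known first-order model (Young's approximation), which is not necessarily the method used in this work. Assuming a fixed checkpoint cost C, a mean time between failures M, and on average half an interval of lost work per failure, the fraction of time lost for an interval τ is roughly C/τ + τ/(2M), which is minimized at τ = sqrt(2CM). The cost and failure figures below are hypothetical.

```python
import math

def overhead_fraction(interval, checkpoint_cost, mtbf):
    """First-order fraction of time lost: checkpoint cost plus expected rework."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)

def young_interval(checkpoint_cost, mtbf):
    """Interval that minimizes overhead_fraction (Young's first-order approximation)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Hypothetical numbers: 60 s per checkpoint, one failure per day on average.
cost, mtbf = 60.0, 24 * 3600.0
best = young_interval(cost, mtbf)
for interval in (best / 4, best, best * 4):
    lost = overhead_fraction(interval, cost, mtbf)
    print(f"interval={interval:8.0f} s  overhead={lost:.4f}")
```

Checkpointing four times more often or four times less often than the computed interval both give a higher overhead in this model, which matches the two sub-bullets above.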
![Page 7: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/7.jpg)
Illustration
![Page 8: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/8.jpg)
Dynamic Determination of Optimal Checkpointing Intervals
• Start the application on a set of resources
• Predict the next failure on the set of resources
• Checkpoint “just before” the next failure
• The prediction has to be really accurate
• But no prediction can be 100% accurate
![Page 9: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/9.jpg)
Probability Distribution of Failures
• Use a probability distribution of failures on the resources
• Need to know: The next time of failure with x% certainty
• But more certainty is also not good
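One possible reading of these bullets, sketched under the assumption that failures follow an exponential distribution with a known mean time between failures (the slides do not say which distribution is used): the time by which no failure has occurred with probability x is t = -M ln(x). Asking for more certainty shrinks t, which would mean checkpointing more often.

```python
import math

def failure_free_horizon(certainty, mtbf):
    """Largest t with P(no failure before t) >= certainty, for exponential failures."""
    if not 0.0 < certainty < 1.0:
        raise ValueError("certainty must be strictly between 0 and 1")
    return -mtbf * math.log(certainty)

mtbf = 24.0  # hypothetical mean time between failures, in hours
for certainty in (0.50, 0.90, 0.99):
    horizon = failure_free_horizon(certainty, mtbf)
    print(f"{certainty:.0%} certain of no failure within {horizon:.2f} h")
```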
![Page 10: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/10.jpg)
Markov Chains
For parallel M-M checkpointing
In SRS, there is almost no system down phase
For sequential applications
In SRS, transition from state 0 can lead to many states
![Page 12: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/12.jpg)
Motivation for Queue Wait Times
• A Grid consisting of a number of batch queues
• A meta system that will:
– Predict the wait times and execution times of jobs
– Decide which queue is “most suitable” for the job
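A minimal sketch of such a decision, assuming that “most suitable” means the smallest predicted turnaround (wait time plus execution time); the slides do not define the criterion. The queue names and predicted values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class QueueEstimate:
    name: str
    predicted_wait: float       # seconds
    predicted_execution: float  # seconds

    @property
    def turnaround(self):
        """Predicted time from submission to completion."""
        return self.predicted_wait + self.predicted_execution

def most_suitable(estimates):
    """Pick the queue with the smallest predicted turnaround."""
    return min(estimates, key=lambda q: q.turnaround)

# Hypothetical predictions for one job on three batch queues.
estimates = [
    QueueEstimate("cluster-a", 3600.0, 1800.0),
    QueueEstimate("cluster-b", 600.0, 4000.0),
    QueueEstimate("cluster-c", 7200.0, 900.0),
]
choice = most_suitable(estimates)
print(choice.name, choice.turnaround)
```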
![Page 13: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/13.jpg)
What is a good predictor?
• There are a number of prediction strategies
• Evaluating a predictor’s goodness:
1. Mean Absolute Percentage Error (MAPE)
2. Upper bound for actual/predicted
3. Average of (actual-predicted) [absolute error]
4. Absolute error/actual wait time [relative error]
5. Average error/average queue wait time
6. Coefficient of correlation
• Each of these metrics has flaws
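The six metrics can be computed as in the sketch below. The exact definitions are assumptions based on the short names in the list (for example, metric 2 is taken as the maximum of actual/predicted over all jobs), and the wait times are hypothetical. The sketch requires positive wait times and non-constant inputs.

```python
import math

def evaluate(actual, predicted):
    """Compute the six listed metrics for paired actual and predicted wait times."""
    n = len(actual)
    abs_err = [abs(a - p) for a, p in zip(actual, predicted)]
    rel_err = [e / a for e, a in zip(abs_err, actual)]
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "mape": 100.0 * sum(rel_err) / n,                             # metric 1
        "max_ratio": max(a / p for a, p in zip(actual, predicted)),   # metric 2
        "mean_abs_error": sum(abs_err) / n,                           # metric 3
        "relative_errors": rel_err,                                   # metric 4, per job
        "normalized_error": (sum(abs_err) / n) / mean_a,              # metric 5
        "correlation": cov / math.sqrt(var_a * var_p),                # metric 6
    }

# Hypothetical wait times in seconds.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
metrics = evaluate(actual, predicted)
for name, value in metrics.items():
    print(name, value)
```

Running several predictors through the same function makes it easy to see cases where the metrics disagree about which predictor is better.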
![Page 14: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/14.jpg)
Illustration
Method 1 Method 2
Metric 3 value of Method 1 < Metric 3 value of Method 2
i.e. Method 1 is better
![Page 15: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/15.jpg)
Our goals
• To define useful metrics that can clearly say whether a method is “good” or “bad”
• Goodness of predictors
– In terms of absolute wait times
– In terms of execution times
– In terms of resource demand
![Page 16: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/16.jpg)
Illustration: Prediction errors versus absolute wait times
[Figure: percentage error (A-P)/A plotted against wait times, with a curve f(x) and the points (x1, y1) and (x2, y2)]
![Page 17: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/17.jpg)
Reality??
![Page 18: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/18.jpg)
What we want to do…
• Define metrics that can evaluate a method in the “absolute” sense, not the “comparative” sense
– As much as possible, look at a single graph and ask “Is this graph good?”
• In some cases, it may just not be possible
– Use comparisons
• Evaluate the existing methods on these sets of metrics
• Come up with a method that performs the best in terms of all of the defined metrics
![Page 20: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/20.jpg)
Motivation
• Certain large computational phases of climate modeling (CCSM) are done only by some processors
• Load balancing – offload work from these processors to other processors
– Increased processor utilization
– Decreased execution time
• How much offloading?
– Need to predict workload based on previous computations
![Page 21: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/21.jpg)
What is happening…
[Figure: Proc 0 to Proc 4 across Phase 1 and Phase 2]
![Page 22: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/22.jpg)
What should happen…
[Figure: Proc 0 to Proc 4 across Phase 1 and Phase 2]
For this, we need to know the workload in phase 1
We predict the workload based on previous time steps
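A minimal sketch of this step, under stated assumptions: the predictor is a simple moving average over the last few time steps (the slides do not name the prediction method), all processors are equally fast, and the cost of moving work is ignored. The function names and workload values are hypothetical.

```python
def predict_workload(history, window=3):
    """Predict the next time step's workload as the mean of the last `window` steps."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def work_to_offload(predicted, busy_procs, total_procs):
    """Work that must leave the busy processors so every processor gets an equal share."""
    return predicted * (total_procs - busy_procs) / total_procs

# Hypothetical phase-1 workloads observed in earlier time steps.
history = [90.0, 100.0, 110.0, 120.0]
predicted = predict_workload(history)
offload = work_to_offload(predicted, busy_procs=2, total_procs=5)
print(predicted, offload)
```

With 2 of 5 processors doing the phase, an even split moves three fifths of the predicted work to the other three processors.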
![Page 23: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/23.jpg)
Advantages
![Page 24: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/24.jpg)
GARLians
• Yadnyesh Joshi (M.Sc)
• Karthikeyan Raman (M.Tech, jointly with Prof. Govindarajan)
• H.A. Sanjay (Ph.D, jointly with Prof. Ravi Nanjundiah, CAOS)
• Sivagama Sundari (Ph.D)
• Ashish Srivatsava (Project Assistant)
• Alumni
– 1 student intern from INSA, Lyon, France
– Summer interns
– Project assistants
– 2 M.Scs
![Page 25: Predictions for Parallel Applications and Systems](https://reader035.fdocuments.net/reader035/viewer/2022070413/56814d75550346895dbad2dc/html5/thumbnails/25.jpg)
Questions?
http://garl.serc.iisc.ernet.in