Fault-Tolerant Workflow Scheduling Using Spot Instances on Clouds Deepak Poola, Kotagiri Ramamohanarao, and Rajkumar Buyya Cloud Computing and Distributed Systems (CLOUDS) Laboratory Department of Computing and Information Systems, The University of Melbourne, Email: [email protected], {kotagiri,rbuyya}@unimelb.edu.au ICCS-2014, Cairns, Australia


Cloud Computing

• Offers resources as a subscription-based service

• Highly scalable

• Highly available

• Driven by market principles

• Dynamically configured and delivered on demand

• Different pricing models


Benefits of Cloud Computing

• Scalability or elasticity

• On-Demand resource provisioning

• Wide range of resource types

• Pay-as-you-go model

• Attractive cost models

• Illusion of unlimited resources

• Cheap and fast storage facilities

• Plethora of tools for ease of use
  – Content delivery
  – Monitoring
  – Networking
  – Deployment and management


Spot Instances

• Started by Amazon around December 2009

• Idle or unused datacenter capacity

• Spot price is decided in an Auction-like mechanism

• Varies with time and instance type

• Varies between regions and availability zones

• The bid should be higher than or equal to the spot price

• Offers up to 60% cost reductions


Workflows

• Scientific workflow systems aim to automate large, complex data analyses to make them easier for scientists.

• Workflows are collections of tasks that are data dependent or control dependent. A workflow can be represented as a directed acyclic graph (DAG).

• Workflow scheduling maps tasks to resources whilst maintaining dependencies

• Terminology
  – Makespan
  – Cost
  – Deadline
  – Budget

[Figure: Sample workflow]
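As a minimal sketch of the DAG representation described above, the following stores a workflow as a mapping from each task to the tasks it depends on and derives one execution order that respects every dependency. The task names and the helper `dependency_respecting_order` are hypothetical and are not taken from the slides.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "stage_in": set(),
    "filter_a": {"stage_in"},
    "filter_b": {"stage_in"},
    "merge": {"filter_a", "filter_b"},
}


def dependency_respecting_order(dag):
    """Return one task order in which every task follows all of its parents."""
    return list(TopologicalSorter(dag).static_order())


order = dependency_respecting_order(workflow)
# Position of each task in the order, useful for checking dependencies.
position = {task: index for index, task in enumerate(order)}
```

A scheduler would walk such an order (or the set of currently ready tasks) and assign each task to a resource only after all of its parents have finished.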


Research overview

• Just-in-time and adaptive scheduling heuristic

• Using spot and on-demand instances

• An intelligent bidding strategy

• Minimizes the execution cost

• Providing a robust schedule

• Satisfying the deadline constraint


Background

• A workflow is represented as a DAG

• Makespan is the total elapsed time required to execute the entire workflow

• Pricing models
  – On-Demand
  – Spot

• The critical path (CP) is the longest path from the start node to the exit node


Latest Time to On-Demand (LTO)

• It is the latest time the algorithm has to switch to on-demand instances to satisfy the deadline constraint

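[Figure: timeline from Start to Deadline with the LTO marked in between; spot instances are used before the LTO and on-demand instances from the LTO to the deadline]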
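The relationship between the critical path and the LTO can be sketched as follows. This is an illustration under a stated assumption, not the formula from the talk: it assumes the LTO is the deadline minus the critical path length estimated on on-demand instances, with time measured from the workflow start. The function names and the example runtimes are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path_length(parents, runtime):
    """Length of the longest start-to-exit path, summing task runtimes."""
    finish = {}
    for task in TopologicalSorter(parents).static_order():
        # A task can start only after its slowest parent has finished.
        start = max((finish[p] for p in parents.get(task, ())), default=0.0)
        finish[task] = start + runtime[task]
    return max(finish.values(), default=0.0)


def latest_time_to_on_demand(deadline, parents, runtime_on_demand):
    """Assumed form: LTO = deadline - critical path length on on-demand."""
    return deadline - critical_path_length(parents, runtime_on_demand)


# Hypothetical diamond-shaped workflow with runtimes in minutes.
parents = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
runtime = {"a": 10.0, "b": 20.0, "c": 5.0, "d": 15.0}

cp = critical_path_length(parents, runtime)              # path a -> b -> d
lto = latest_time_to_on_demand(100.0, parents, runtime)
```

With these numbers the critical path is 45 minutes, so under the assumed form the scheduler could stay on spot instances until minute 55 of a 100-minute deadline.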


System Model


Runtime Estimation

• We use Downey’s analytical model

• Downey’s model requires:
  – the task’s average parallelism, A
  – the coefficient of variance of parallelism, σ
  – the task length
  – the number of cores

• The model of Cirne et al. is used to generate A and σ
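A minimal sketch of how a runtime estimate could be obtained from these inputs is given below. The piecewise speedup expression is written from Downey's published speedup model as commonly stated and should be checked against the original before use. The function names, and the assumption that runtime equals task length divided by speedup, are illustrative and not taken from the slides.

```python
def downey_speedup(n, A, sigma):
    """Speedup on n cores for average parallelism A and variance sigma."""
    if n < 1 or A < 1 or sigma < 0:
        raise ValueError("require n >= 1, A >= 1 and sigma >= 0")
    if sigma <= 1:
        # Low-variance branch of the model.
        if n <= A:
            return A * n / (A + sigma * (n - 1) / 2)
        if n <= 2 * A - 1:
            return A * n / (sigma * (A - 0.5) + n * (1 - sigma / 2))
        return float(A)
    # High-variance branch of the model.
    if n <= A + A * sigma - sigma:
        return n * A * (sigma + 1) / (sigma * (n + A - 1) + A)
    return float(A)


def estimated_runtime(task_length, n, A, sigma):
    """Assumption: runtime = sequential task length / speedup on n cores."""
    return task_length / downey_speedup(n, A, sigma)
```

For example, a task with A = 8 and σ = 0.5 cannot be sped up beyond a factor of 8, so adding cores past 2A − 1 = 15 leaves the estimate unchanged.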


Failure Estimator

• Estimates the failure probability of a particular bid price

• Based on the spot price history

• The spot price history of the preceding month is considered

• The total time of the spot price history, HT, is measured

• The total out-of-bid time for the bid, OBT_{bid_t}, is measured
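The two measured quantities suggest a simple ratio. The sketch below assumes that the failure probability of a bid is the out-of-bid time divided by the total history time (OBT / HT), and that an instance is out of bid whenever the spot price exceeds the bid. The function name, the data layout, and the sample prices are hypothetical.

```python
def failure_probability(history, end_time, bid):
    """Fraction of the history window during which the spot price exceeds bid.

    history: time-sorted list of (timestamp, spot_price) pairs; each price is
             assumed to hold until the next timestamp (or until end_time).
    """
    if not history:
        raise ValueError("price history must not be empty")
    total_time = end_time - history[0][0]          # HT
    if total_time <= 0:
        raise ValueError("end_time must be after the first timestamp")
    out_of_bid_time = 0.0                          # OBT for this bid
    interval_ends = [t for t, _ in history[1:]] + [end_time]
    for (start, price), end in zip(history, interval_ends):
        if price > bid:
            out_of_bid_time += end - start
    return out_of_bid_time / total_time


# Hypothetical price trace: (hour, price in $/hour), observed for 30 hours.
trace = [(0, 0.10), (10, 0.30), (15, 0.12), (20, 0.50)]
fp_mid = failure_probability(trace, 30, 0.20)
```

Here a bid of 0.20 is out of bid during hours 10 to 15 and 20 to 30, which is 15 of the 30 hours observed.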


Scheduling Algorithm


Scheduling Algorithm (Contd..)


Scheduling Algorithm (Contd..)


Two Types of Scheduling Algorithms

• Conservative: CP and LTO are estimated on the lowest-cost instance.
  – CP is longest, hence less slack time
  – Uses spot instances cautiously under relaxed deadlines

• Aggressive: CP and LTO are estimated on the highest-cost instance.
  – CP is shortest, hence more slack time
  – Opts for expensive on-demand instances under failures
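A small worked example of the difference in slack is sketched below. It assumes slack is the deadline minus the critical path length estimated on the chosen instance type; the numbers are invented for illustration only.

```python
def slack_time(deadline, critical_path_length):
    """Assumed definition: time left over once the critical path is covered."""
    return deadline - critical_path_length


deadline = 120.0                    # hypothetical deadline, in minutes
cp_on_lowest_cost = 90.0            # hypothetical CP on the slowest, cheapest VM
cp_on_highest_cost = 40.0           # hypothetical CP on the fastest, priciest VM

conservative_slack = slack_time(deadline, cp_on_lowest_cost)
aggressive_slack = slack_time(deadline, cp_on_highest_cost)
```

The conservative estimate leaves 30 minutes of slack and the aggressive one 80 minutes, so the aggressive variant can tolerate more spot failures before switching, at the price of needing faster and more expensive on-demand instances when it does.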


Bidding Strategy

Intelligent Bidding Strategy

• Current spot price (p_spot)

• On-demand price (p_OD)

• Failure probability (FP) of the previous bid price

• LTO

• Current time (CT)

• α

• β


Intelligent Bidding Strategy

• α : dictates how much higher the bid value must be above the current spot price

• β : determines how fast the bid value reaches the on-demand price

• FP of the previous bid is used as a feedback to the current bid price
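The bid formula itself is not reproduced in this transcript. The sketch below is therefore not the authors' formula; it is one hypothetical function that has the properties listed above: the bid starts a margin α above the spot price, the margin grows with the failure probability of the previous bid, and the bid rises to the on-demand price as the current time approaches the LTO at a rate controlled by β. The functional form and all names are assumptions, and time is assumed to be measured from the workflow start.

```python
def illustrative_bid(p_spot, p_od, fp_prev, lto, ct, alpha, beta):
    """Hypothetical bid with the qualitative behaviour described above.

    alpha: margin above the current spot price (fraction of p_spot)
    beta:  positive rate; larger values approach p_od sooner
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if lto <= 0 or ct >= lto:
        return p_od                        # no slack left: bid on-demand price
    progress = max(ct, 0.0) / lto          # 0 at the start, 1 at the LTO
    # Starting bid: spot price plus a margin that grows with past failures.
    floor = min(p_od, p_spot * (1.0 + alpha * (1.0 + fp_prev)))
    weight = progress ** (1.0 / beta)      # assumed shape of the ramp
    return floor + (p_od - floor) * weight


early = illustrative_bid(0.10, 0.40, 0.0, 100.0, 0.0, 0.2, 1.0)
halfway = illustrative_bid(0.10, 0.40, 0.0, 100.0, 50.0, 0.2, 1.0)
at_lto = illustrative_bid(0.10, 0.40, 0.0, 100.0, 100.0, 0.2, 1.0)
```

In this sketch the bid never exceeds the on-demand price, which matches the intuition that bidding above it brings no cost benefit over simply using on-demand instances.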


Intelligent Bidding Strategy


Other Bidding Strategies

• On-Demand Bidding Strategy: uses the on-demand price as the bid price.

• Naive Bidding Strategy: uses the current spot price as the bid price for the instance.


Simulation Setup

• CloudSim was used for simulation

• A LIGO workflow with 1000 tasks was considered

• For on-demand, nine different VM types were considered

• For spot, one VM type was used


Results: Comparison between algorithms

Mean execution cost of algorithms with varying deadline (with 95% confidence interval)


Results: Comparison between bidding strategies

Mean execution cost of bidding strategies with varying deadline (with 95% confidence interval)


Results: Task Failures

Mean number of task failures due to bidding strategies


Results: Checkpointing


Conclusion

• Two scheduling heuristics that map workflow tasks onto spot and on-demand instances are presented

• They minimize the execution cost

• They are robust and fault-tolerant towards out-of-bid failures and performance variations

• A bidding strategy that bids intelligently to minimize the cost is presented

• The use of checkpointing is demonstrated, offering cost savings of up to 14%


© Copyright The University of Melbourne 2009