Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Page 1: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Service Level Agreement Based Scheduling Heuristics

Rizos Sakellariou, Djamila Ouelhadj

Page 2: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Motivation – is this a good state of affairs?

• Scheduling jobs onto (high-performance) compute resources is traditionally queue based (and has been since time immemorial).

• Two basic levels of service are provided:

– “Run this when it gets to the head of the queue” (in other words, whenever!)

– “Run this at a precise time” (advance reservation)

Even sophisticated systems, such as Condor, are still queue-based…

Page 3: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Scheduling workflows

• DAG scheduling heuristics do exist

…but…

• In a queue based system:

– To maintain dependences, each component is scheduled after all parents have finished: the penalty is that each component pays the cost of the batch queue latency!

– Assurances about the start time and completion time of each component are desirable!

The best that one can aim for at the moment is advance reservation: too restrictive!

Page 4: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Advance Reservation

• Setting times precisely is not what the user really wants. Often users are only interested in the bounds (e.g., latest end time). This information is neither captured nor used!

• Doesn’t fit well into the batch processing model.

– Utilisation (hence income) decreases rapidly as the number of advance reservation (AR) jobs increases (gaps can’t be effectively plugged – checkpointing and/or suspend/resume costs!)

Page 5: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

But it’s not only about workflows…

Renegotiation of resources:

• A long-term goal of the Reality Grid project

• Experiments may need to be extended in time (at short notice) – (the discovery of the century is around the corner)

• Resources may need to be changed – in which case checkpointing/restart is needed: state may be of the order of 1 TB!

Could it also be about expanding the user base?

Page 6: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

A novel approach to scheduling?

• There is no queue; jobs do not have a priority

• The schedule is based on satisfying constraints.

• These constraints are expressed in a Service Level Agreement: a contract between users and brokers, between brokers and local schedulers, etc.

What to optimise for? (objective function)

• Resource utilisation (income)

• If someone comes with lots of cash the scheduler may want to break some smaller agreements (money rules?) – reliability?

Page 7: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

[Figure: architecture overview showing Users, Super Schedulers 1–3, Local Schedulers 1…n, and Compute Resources (Clusters 1–3). Diagram labels: MetaSLA, subSLA, Resource Record, and “Jobs to finish ‘anytime’ (no guarantee required)”.]

Page 8: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Key components

• Users: they negotiate and agree an SLA with a broker (or superscheduler)

• Brokers: based on SLAs agreed with users, they negotiate and agree SLAs with local schedulers (and possibly other brokers)

• Local Schedulers: they schedule the work that corresponds to an SLA agreed with a broker.

• Two types (?) of SLA:

– Meta-SLA (between user and broker)

– Sub-SLA (between broker and local scheduler)

Page 9: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Issues

• Definition of SLAs

– Resources, start/finish time, how long, cost, guarantee, penalty for failure

– Meta-SLAs are negotiated first, sub-SLAs come later

• Negotiation protocols

– Based on availability (needs behaviour model)

• Scheduling

– Jobs onto resources (local)

• Renegotiation

• Economy (selfish entities…), metrics

Page 10: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

The Research Challenges

[Figure: two overlapping research areas. “AI Planning & Scheduling”: fuzzy logic, multicriteria scheduling, AI constraint satisfaction. “Scheduling for the Grid”: SLAs, negotiation, scheduling heuristics, economic considerations.]

Page 11: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

SLA Contents

metaSLA

• Info: Name, ID (book keeping info)

• Resource: list of resources (H/W, response time)

– Hardware: Number of Nodes

– Estimated Response Time

• Time Period: Date, Start Time, End Time (task execution time)

• Deadline: execution results by this time

• Guarantee Level

• Cost: payment for task execution

• Budget Constraint: max cost specified by the user

• Execution Host: preference in a specific machine

subSLA

• Info: ID (book keeping info)

• Resource: compute node definition; list of resources (H/W, S/W)

– Hardware: Arch, Mem, Disk, CPU, b/w

– Software: OS name and version

• Time: Date, Start Time, End Time (resource reservation time)

• Client: Owner, Remote Machine (book keeping info)
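
The SLA fields listed above could be captured in a pair of record types. The sketch below is only illustrative: the class names, field names, types and units (`NodeSpec`, `mem_gb`, a numeric `guarantee_level`, and so on) are assumptions made for the example and are not taken from the slides.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class NodeSpec:
    """Compute node definition (sub-SLA 'Resource'); units are assumed."""
    arch: str
    mem_gb: float
    disk_gb: float
    cpus: int
    bandwidth_mbps: float
    os_name: str
    os_version: str


@dataclass
class SubSLA:
    """Agreement between a broker and a local scheduler."""
    sla_id: str
    resources: List[NodeSpec]
    start_time: datetime      # resource reservation window
    end_time: datetime
    owner: str                # client book-keeping info
    remote_machine: str


@dataclass
class MetaSLA:
    """Agreement between a user and a broker (super scheduler)."""
    sla_id: str
    name: str
    number_of_nodes: int
    start_time: datetime      # task execution window
    end_time: datetime
    deadline: datetime        # execution results expected by this time
    guarantee_level: float    # assumed here to be a value in [0, 1]
    cost: float               # payment for task execution
    budget_constraint: float  # maximum cost specified by the user
    execution_host: Optional[str] = None  # preferred machine, if any
    sub_slas: List[SubSLA] = field(default_factory=list)

    def within_budget(self) -> bool:
        """True if the agreed cost respects the user's budget constraint."""
        return self.cost <= self.budget_constraint
```

Keeping the sub-SLAs as a list inside the meta-SLA reflects the statement that meta-SLAs are negotiated first and sub-SLAs come later; other layouts are equally possible.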

Page 12: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Negotiation

• Meta-SLA

– User requests an SLA

– Based on a (high-level view of) availability, the broker suggests an SLA

– If the user accepts, a contract is in place

• Sub-SLAs

– Broker has agreed a meta-SLA

– Usage of resources needs to be agreed – a sub-SLA is requested

– Bids are made, based on availability

– Sub-SLA is agreed
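
The sub-SLA steps (request, bids based on availability, agreement) can be sketched as a small bidding round. Everything concrete here is an assumption for illustration: the `Request`, `Bid` and `LocalScheduler` types, the pricing formula, and the rule that the broker accepts the cheapest bid that meets the latest end time. The slides do not specify how bids are generated or selected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    nodes: int
    duration: float     # expected run time
    latest_end: float   # bound agreed in the meta-SLA


@dataclass
class Bid:
    scheduler: str
    start: float
    end: float
    price: float


class LocalScheduler:
    """Hypothetical local scheduler that bids from its own availability."""

    def __init__(self, name: str, free_nodes: int, free_from: float,
                 price_per_node_hour: float) -> None:
        self.name = name
        self.free_nodes = free_nodes
        self.free_from = free_from
        self.price = price_per_node_hour

    def bid(self, req: Request) -> Optional[Bid]:
        """Return a bid, or None if the request cannot be met."""
        if req.nodes > self.free_nodes:
            return None
        end = self.free_from + req.duration
        if end > req.latest_end:
            return None
        return Bid(self.name, self.free_from, end,
                   req.nodes * req.duration * self.price)


def negotiate_sub_sla(req: Request,
                      schedulers: List[LocalScheduler]) -> Optional[Bid]:
    """Collect bids and accept the cheapest one (assumed selection rule)."""
    bids = [s.bid(req) for s in schedulers]
    valid = [b for b in bids if b is not None]
    return min(valid, key=lambda b: b.price, default=None)
```

A `None` result means no local scheduler could bid, which is the point at which a broker would have to revisit the meta-SLA.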

Page 13: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

[Figure: sequence diagram of the negotiation between the Job Client, Super Scheduler 1 (SS1) and Local Scheduler 1 (LS1), for the case when the Super Scheduler is unable to check availability locally. Labelled steps include: submit job execution request; authorise client; parse request; check local resource availability; request (execution host info) and report (LS state info), both marked as optional actions; update storage of state info about LS; metaSLA negotiation takes place; cost calculation; assign an SLA id; agree metaSLA; create and store metaSLA; send a metaSLA with the id; check resource availability and response; verify resources; submit subSLA(s); parse subSLA(s) and verify; agree subSLA(s); create and store subSLA(s); reservation and set deadline; update storage of state info about LS and the SLA store; task initiation notification (email); initiate task execution; task execution period; status information request and response <job mini-report>; completion report; update LS.]

Page 14: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Local Scheduling

EPSRC e-Science Meeting 2005

The scheduling problem is defined as the allocation of a set of independent SLAs, $S = \{SLA_1, \dots, SLA_s\}$, to a network of hosts $H = \{H_1, \dots, H_n\}$. $E_{ij}$ is the expected execution time of $SLA_i$ on host $H_j$. The earliest possible start time $ST_i$ of $SLA_i$ on a selection of hosts is the latest free time of all the selected hosts. The expected completion time of $SLA_i$ on host $H_j$ is $C_i = ST_i + E_{ij}$. The makespan is defined as:

$$C_{\max} = \max_{1 \le i \le s} C_i$$

The objective function is to minimise the makespan:

$$\min C_{\max}$$
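
As a concrete reading of these definitions, the sketch below computes the makespan of a given allocation. It makes one simplifying assumption that the slide does not: each SLA runs on a single host, so its start time is simply the time at which that host becomes free. The function and parameter names are illustrative.

```python
from typing import Dict, List, Sequence


def makespan(order: Sequence[int],
             assignment: Dict[int, int],
             exec_time: List[List[float]],
             n_hosts: int) -> float:
    """Return C_max for SLAs processed in `order`.

    assignment[i] is the host index j chosen for SLA i, and
    exec_time[i][j] is the expected execution time E_ij.
    """
    free_at = [0.0] * n_hosts          # time at which each host becomes free
    c_max = 0.0
    for i in order:
        j = assignment[i]
        start = free_at[j]             # ST_i (single-host simplification)
        completion = start + exec_time[i][j]   # C_i = ST_i + E_ij
        free_at[j] = completion
        c_max = max(c_max, completion)
    return c_max
```

For example, with `exec_time = [[3, 5], [2, 4], [6, 1]]`, placing SLAs 0 and 1 on host 0 and SLA 2 on host 1 gives completion times 3, 5 and 1, so the makespan is 5.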

Page 15: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Tabu Search for Local Scheduling

• To solve the problem we propose to investigate the use of advanced search techniques: tabu search, genetic algorithms, simulated annealing, etc.

• Tabu search is a high-level iterative procedure that uses memory structures, and exploration strategies based on the information stored in that memory, to search beyond local optima. The search starts from a feasible solution and iteratively moves from the current solution to its best neighbouring solution, even if that move worsens the objective function value (Glover, 1997).

Page 16: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Tabu Search for Local Scheduling

Tabu search for local scheduling:

• Initial solution: FCFS, Min-min, Max-min, sufferage, and backfilling.

• The solution is improved by using two moves: SLA-swap and SLA-transfer moves.

• SLA-swap move swaps two SLAs performed by different processors.

• SLA-transfer move shifts the SLA to another processor.

• Composite neighbourhood.
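
A minimal sketch of how such a tabu search might look is given below. It is not the authors' implementation. It assumes that each SLA runs on exactly one host, that the objective is the makespan, that the initial solution is a simple FCFS-style placement on the earliest-free host, and that the tabu memory is a short list of recently visited allocations. The tenure and iteration count are arbitrary.

```python
import itertools
from collections import deque
from typing import Iterator, List, Optional, Tuple


def makespan(assign: List[int], exec_time: List[List[float]], n_hosts: int) -> float:
    """Largest total load over all hosts; assign[i] is the host of SLA i."""
    loads = [0.0] * n_hosts
    for i, j in enumerate(assign):
        loads[j] += exec_time[i][j]
    return max(loads)


def fcfs_initial(exec_time: List[List[float]], n_hosts: int) -> List[int]:
    """FCFS-style start: each SLA in turn goes to the earliest-free host."""
    free_at = [0.0] * n_hosts
    assign = []
    for times in exec_time:
        j = min(range(n_hosts), key=lambda h: free_at[h])
        assign.append(j)
        free_at[j] += times[j]
    return assign


def neighbours(assign: List[int], n_hosts: int) -> Iterator[List[int]]:
    """Composite neighbourhood: all SLA-transfer and SLA-swap moves."""
    for i in range(len(assign)):                 # SLA-transfer
        for j in range(n_hosts):
            if j != assign[i]:
                new = list(assign)
                new[i] = j
                yield new
    for a, b in itertools.combinations(range(len(assign)), 2):  # SLA-swap
        if assign[a] != assign[b]:
            new = list(assign)
            new[a], new[b] = new[b], new[a]
            yield new


def tabu_search(exec_time: List[List[float]], n_hosts: int,
                iterations: int = 100, tenure: int = 5,
                initial: Optional[List[int]] = None) -> Tuple[List[int], float]:
    current = list(initial) if initial is not None else fcfs_initial(exec_time, n_hosts)
    best, best_cost = list(current), makespan(current, exec_time, n_hosts)
    tabu = deque([tuple(current)], maxlen=tenure)   # recently visited allocations
    for _ in range(iterations):
        candidates = [(makespan(n, exec_time, n_hosts), n)
                      for n in neighbours(current, n_hosts)
                      if tuple(n) not in tabu]
        if not candidates:
            break
        # Move to the best non-tabu neighbour, even if it is worse.
        cost, current = min(candidates, key=lambda c: c[0])
        tabu.append(tuple(current))
        if cost < best_cost:
            best, best_cost = list(current), cost
    return best, best_cost
```

With `exec_time = [[2, 10], [2, 10], [10, 2]]` and two hosts, the FCFS-style start `[0, 1, 0]` has makespan 12; a single SLA-swap of SLAs 1 and 2 reaches `[0, 0, 1]` with makespan 4.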

Page 17: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Other objective functions for Local Scheduling

• Other objective functions: minimising maximum lateness, minimising cost to the user, maximising profit (to the supplier), maximising personal / general utility, maximising resource utilisation, etc.

Page 18: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Fuzzy Scheduling

• Uncertainty handling using fuzzy models: fuzzy due dates, fuzzy execution time.

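[Figure: two membership-function plots, each with values between 0 and 1 on the vertical axis and x on the horizontal axis. μ(P) for the fuzzy execution time P̃ is defined by the points p1, p2, p3; μ(D) for the fuzzy due date D̃ is defined by the points d1, d2.]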
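
To make the fuzzy quantities concrete, the sketch below evaluates two membership functions. The shapes are assumptions, chosen because they are common in fuzzy scheduling and match the number of points named on the slide: a triangular execution time with support [p1, p3] and peak at p2, and a due date that is fully satisfied up to d1 and decreases linearly to zero at d2.

```python
def triangular(x: float, p1: float, p2: float, p3: float) -> float:
    """Membership of x in a triangular fuzzy number (assumes p1 < p2 < p3)."""
    if x <= p1 or x >= p3:
        return 0.0
    if x <= p2:
        return (x - p1) / (p2 - p1)   # rising edge
    return (p3 - x) / (p3 - p2)       # falling edge


def due_date_satisfaction(c: float, d1: float, d2: float) -> float:
    """Degree to which completion time c meets a fuzzy due date (d1 < d2)."""
    if c <= d1:
        return 1.0                    # on time: fully satisfied
    if c >= d2:
        return 0.0                    # too late: not satisfied at all
    return (d2 - c) / (d2 - d1)       # linearly decreasing in between
```

For instance, `triangular(3.5, 2, 5, 9)` is 0.5 and `due_date_satisfaction(12, 10, 14)` is 0.5.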

Page 19: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Fuzzy objective function

The objective is to minimise the maximum fuzzy completion time:

$$\text{minimise } \tilde{C}_{\max} = \max_{i = 1, \dots, n} \tilde{C}_i$$
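
One way to evaluate this objective, assuming execution times are triangular fuzzy numbers written as triples (p1, p2, p3), is sketched below. Addition of triangular numbers is component-wise; the component-wise maximum used here is a common approximation of the fuzzy maximum rather than an exact operation, and the slides do not say which operators the authors use. The function names are illustrative.

```python
from typing import List, Tuple

Tri = Tuple[float, float, float]   # triangular fuzzy number (p1, p2, p3)


def fuzzy_add(a: Tri, b: Tri) -> Tri:
    """Sum of two triangular fuzzy numbers."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def fuzzy_max(a: Tri, b: Tri) -> Tri:
    """Component-wise approximation of the fuzzy maximum."""
    return (max(a[0], b[0]), max(a[1], b[1]), max(a[2], b[2]))


def fuzzy_makespan(host_queues: List[List[Tri]]) -> Tri:
    """Approximate maximum fuzzy completion time over all hosts.

    host_queues[j] lists the fuzzy execution times of the SLAs run,
    in order, on host j.
    """
    c_max: Tri = (0.0, 0.0, 0.0)
    for queue in host_queues:
        completion: Tri = (0.0, 0.0, 0.0)
        for p in queue:
            completion = fuzzy_add(completion, p)   # fuzzy completion time
            c_max = fuzzy_max(c_max, completion)
    return c_max
```

Comparing two candidate schedules then requires a ranking rule for the resulting triples, which is a further design choice not covered here.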

Page 20: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Re-negotiation in the Presence of Uncertainties

Dynamic nature of Grid computing: resources may fail, high-priority jobs may be submitted, new resources can be added, etc.

In the presence of real-time events that leave the LS agents unable to execute their SLAs, the SS agents re-negotiate the failed SLAs at the local and global levels of the Grid in order to find alternative LS agents to execute them.

Page 21: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Renegotiation in the Presence of Uncertainties

[Figure: re-negotiation example with a User, Super Schedulers SS1, SS2, …, SSn, and their Local Schedulers (LS11, LS21, …, LSm1 under SS1; LS12, LS22, …, LSw2 under SS2). Arrows are labelled “Sub-SLA re-negotiation” and “Meta-SLA re-negotiation”.]

SS1 detects the resource failure.

SS1 re-negotiates the failed sub-SLAs to find alternative Local Schedulers locally, within the same cluster, by initiating a sub-SLA negotiation session with the suitable Local Schedulers.

If it cannot manage to do so, SS1 re-negotiates the meta-SLAs with the neighbouring SSs by initiating a meta-SLA negotiation session.

SS2 re-negotiates the failed sub-SLAs to find alternative Local Schedulers. In this example SS2 locates LS22 to execute the failed job. At the end of task execution, LS22 sends a final report, including the output file details, to the user.

If SS1 cannot find alternative Local Schedulers at either the local or the global level, it sends an alert message to the user to inform them that the meta-SLA cannot be fulfilled.
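
The escalation described above (local sub-SLA re-negotiation first, then meta-SLA re-negotiation with neighbouring SSs, then an alert) can be sketched as follows. The `SuperScheduler` class, its availability flags and the returned tuples are hypothetical and stand in for real negotiation sessions.

```python
from typing import Dict, List, Optional, Tuple


class SuperScheduler:
    """Hypothetical SS agent holding the availability of its local schedulers."""

    def __init__(self, name: str, local_schedulers: Dict[str, bool],
                 neighbours: Optional[List["SuperScheduler"]] = None) -> None:
        self.name = name
        self.local_schedulers = local_schedulers   # LS name -> available?
        self.neighbours = neighbours or []

    def renegotiate_sub_sla(self, failed_ls: str) -> Optional[str]:
        """Find an available local scheduler other than the failed one."""
        for ls, available in self.local_schedulers.items():
            if ls != failed_ls and available:
                return ls
        return None

    def handle_failure(self, failed_ls: str) -> Tuple[str, Optional[str], Optional[str]]:
        """Return (level, super scheduler, local scheduler) for the failed SLA."""
        self.local_schedulers[failed_ls] = False
        # 1. Local level: sub-SLA re-negotiation within this SS.
        ls = self.renegotiate_sub_sla(failed_ls)
        if ls is not None:
            return ("local", self.name, ls)
        # 2. Global level: meta-SLA re-negotiation with neighbouring SSs.
        for ss in self.neighbours:
            ls = ss.renegotiate_sub_sla(failed_ls)
            if ls is not None:
                return ("global", ss.name, ls)
        # 3. Nothing found: the user must be alerted.
        return ("alert", None, None)
```

In the scenario from the slide, SS1 has no spare local scheduler, so the call escalates to SS2, which locates LS22.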

Page 22: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Methodology

• Simulation based approach

– Need to evaluate different approaches for agreeing SLAs (e.g., conservative vs overbooking), generating bids, pricing/penalties, scheduling, …

– Need to model user behaviour with SLAs

• Evaluation metrics:

– Resource utilisation, jobs completed / SLAs broken

• Difficult to do a fair comparison with a batch-queuing system!

– If job waiting time was the issue, it would translate to comparing FCFS with soft real-time scheduling!

Page 23: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

Conclusions

• SLAs have the potential to change the way that jobs are assigned onto compute resources.

• Increased flexibility appears to be the main advantage.

• Long-term risk: batch systems have shown a remarkable resistance to change!

http://www.gridscheduling.org

Page 24: Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

The people

• Manchester:

– Viktor Yarmolenko

– Rizos Sakellariou

– Jon MacLaren (now at Louisiana State University)

• Nottingham:

– Djamila Ouelhadj

– Jon Garibaldi