Service Level Agreement Based
Scheduling Heuristics
Rizos Sakellariou, Djamila Ouelhadj
Motivation – is this a good state of affairs?
• Scheduling jobs onto (high-performance) compute resources is traditionally queue based (has been since time immemorial)
• Two basic levels of service are provided:
  – “Run this when it gets to the head of the queue” (in other words, whenever!)
  – “Run this at a precise time” (advance reservation)
Even sophisticated systems, such as Condor, are still queue-based…
Scheduling workflows
• DAG scheduling heuristics do exist …but…
• In a queue-based system:
  – To maintain dependences, each component is scheduled after all its parents have finished: the penalty is that each component pays the cost of the batch-queue latency!
  – Assurances about the start time and completion time of each component are desirable!
The best that one can aim for at the moment is advance reservation: too restrictive!
Advance Reservation
• Setting times precisely is not what the user really wants. Often users are only interested in bounds (e.g., latest end time). This information is neither captured nor used!
• It doesn’t fit well into the batch-processing model.
  – Utilisation (hence income) decreases rapidly as the number of AR jobs increases (gaps can’t be effectively plugged – checkpointing and/or suspend/resume costs!)
But it’s not only about workflows…
Renegotiation of resources:
• A long-term goal of the Reality Grid project
• Experiments may need to be extended in time (at short notice) – (the discovery of the century is around the corner!)
• Resources may need to be changed – in which case checkpointing/restart is needed: state may be in the order of 1 TB!
Could it also be about expanding the user base?
A novel approach to scheduling?
• There is no queue; jobs do not have a priority
• The schedule is based on satisfying constraints
• These constraints are expressed in a Service Level Agreement: a contract between users and brokers, brokers and local schedulers, etc.
What to optimise for? (objective function)
• Resource utilisation (income)
• If someone comes with lots of cash, the scheduler may want to break some smaller agreements (money rules?) – reliability?
[Architecture diagram: users submit jobs (some to finish “anytime”, with no guarantee required) via super schedulers (SuperScheduler1–SuperScheduler3); super schedulers negotiate with local schedulers (LocalScheduler1–LocalSchedulern), which manage the compute resources (Cluster1–Cluster3). Meta-SLAs sit between users and super schedulers; sub-SLAs between super schedulers and local schedulers; each local scheduler keeps a resource record.]
Key components• Users: they negotiate and agree an SLA with
a broker (or superscheduler)
• Brokers: based on SLAs agreed with users, they negotiate and agree SLAs with local schedulers (and possibly other brokers)
• Local Schedulers: they schedule the work that corresponds to an SLA agreed with a broker.
• Two types (?) of SLA:
  – Meta-SLA (between user and broker)
  – Sub-SLA (between broker and local scheduler)
Issues• Definition of SLAs
– Resources, start/finish time, how long, cost, guarantee, penalty for failure
– Meta-SLAs are negotiated first, sub-SLAs come later
• Negotiation protocols
  – Based on availability (needs a behaviour model)
• Scheduling
  – Jobs onto resources (local)
• Renegotiation
• Economy (selfish entities…), metrics
The Research Challenges
[Diagram: the work sits at the intersection of AI Planning & Scheduling (fuzzy logic, multicriteria scheduling, AI constraint satisfaction) and Scheduling for the Grid (SLAs, negotiation, scheduling heuristics, economic considerations).]
SLA Contents

• Meta-SLA:
  – Info: name, ID
  – Resource: list of resources (hardware, estimated response time); number of nodes; execution host (preference for a specific machine)
  – Time: date; deadline (execution results by this time); time period for task execution (start time, end time)
  – Guarantee level
  – Cost: payment for task execution; budget constraint (max cost specified by the user)

• Sub-SLA:
  – Info: ID; book-keeping info; client (owner, remote machine)
  – Resource: list of resources (H/W, S/W); compute node definition – hardware (arch, mem, disk, CPU, b/w) and software (OS name and version)
  – Time: date; resource reservation time (start time, end time)
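The two SLA record types above can be sketched as simple data structures. This is a hedged illustration only: the field names and types here are assumptions drawn from the field lists on this slide, not a fixed schema from the project.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetaSLA:
    """User <-> broker agreement (field names are illustrative)."""
    sla_id: str
    user: str                      # name, ID
    num_nodes: int
    deadline: str                  # execution results by this time
    start_time: str
    end_time: str
    guarantee_level: float
    cost: float                    # payment for task execution
    budget_constraint: float       # max cost specified by the user
    execution_host: Optional[str] = None  # preference for a specific machine

@dataclass
class SubSLA:
    """Broker <-> local scheduler agreement (field names are illustrative)."""
    sla_id: str
    client: str                    # owner / remote machine
    arch: str                      # compute node definition: hardware
    mem_gb: int
    disk_gb: int
    cpu_count: int
    os: str                        # OS name and version
    start_time: str                # resource reservation time
    end_time: str
```

Keeping the two levels as distinct types mirrors the negotiation chain: a broker derives one or more sub-SLAs from each meta-SLA it has agreed.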
Negotiation
• Meta-SLA
  – User requests an SLA
  – Based on a (high-level view of) availability, the broker suggests an SLA
  – If the user accepts, a contract is in place
• Sub-SLAs
  – The broker has agreed a meta-SLA
  – Usage of resources needs to be agreed: a sub-SLA is requested
  – Bids are made, based on availability
  – The sub-SLA is agreed
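A minimal sketch of one meta-SLA negotiation round as described above. The availability representation and the counter-offer rule are illustrative assumptions, not the project's protocol.

```python
# Hypothetical negotiation round: the broker checks a coarse availability
# view and either agrees to the requested window or counter-offers.
def negotiate_meta_sla(request, availability):
    """request: (nodes, start, end).
    availability: list of (win_start, win_end, free_nodes) windows."""
    nodes, start, end = request
    # Agree if some window covers the requested interval with enough nodes.
    for win_start, win_end, free in availability:
        if free >= nodes and win_start <= start and end <= win_end:
            return ("agree", request)
    # Otherwise counter-offer: earliest window long enough for the job.
    for win_start, win_end, free in sorted(availability):
        if free >= nodes and (win_end - win_start) >= (end - start):
            return ("counter", (nodes, win_start, win_start + (end - start)))
    return ("reject", None)
```

The user would then accept, reject, or re-request with relaxed bounds, which is exactly the information (e.g., latest end time) that advance reservation fails to capture.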
[Sequence diagram: SLA negotiation and execution between a Job Client, Super Scheduler 1 (SS1) and Local Scheduler 1 (LS1):
 1. The client submits a job execution request; SS1 authorizes the client, parses the request, assigns an SLA id, and the meta-SLA negotiation takes place.
 2. SS1 checks local resource availability (when it is unable to check locally, it requests a report of LS state info and updates its stored state information about the LS), calculates the cost, agrees the meta-SLA with the client, and creates and stores it.
 3. SS1 creates, stores and submits sub-SLA(s); LS1 parses and verifies them, checks resource availability, makes a reservation, sets a deadline, and agrees the sub-SLA(s); SS1 updates its state information about the LS and the SLA store.
 4. LS1 initiates task execution and sends a task initiation notification (email). During the task execution period the client may optionally request status information (receiving a job mini-report) or execution host information.
 5. On completion, LS1 sends a completion report.]
Local Scheduling
EPSRC e-Science Meeting 2005
The scheduling problem is defined by the allocation of a set of independent SLAs, S = {SLA_1, …, SLA_s}, to a network of hosts H = {H_1, …, H_n}:
• E_ij is the expected execution time of SLA_i on host H_j.
• The earliest possible start time ST_i of SLA_i on a selection of hosts is the latest free time of all the selected hosts.
• The expected completion time of SLA_i on host H_j is C_i = ST_i + E_ij.
The makespan is defined as

  C_max = max_{1 ≤ i ≤ s} C_i

and the objective function is to minimise the makespan: min C_max.
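The model above can be evaluated directly. The sketch below assumes each host runs its assigned SLAs back to back, so a host's latest free time is the completion time of its previous SLA; the function name and list encoding are illustrative.

```python
# Evaluate the makespan of an SLA-to-host assignment under the model above.
def makespan(assignment, exec_time, num_hosts):
    """assignment[i] = host index for SLA_i; exec_time[i][j] = E_ij."""
    free = [0.0] * num_hosts            # latest free time of each host
    completions = []
    for i, host in enumerate(assignment):
        start = free[host]              # earliest possible start time ST_i
        c = start + exec_time[i][host]  # C_i = ST_i + E_ij
        free[host] = c
        completions.append(c)
    return max(completions)             # C_max = max over all SLAs
```

For example, with E = [[3, 5], [2, 4], [6, 1]] and assignment [0, 1, 0], host 0 runs SLA_1 then SLA_3 (completing at 3 and 9), host 1 runs SLA_2 (completing at 4), so C_max = 9.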
Tabu Search for Local Scheduling
• To solve the problem we propose to investigate the use of advanced search techniques: tabu search, genetic algorithms, simulated annealing, etc.
• Tabu search is a high-level iterative procedure that uses memory structures, and exploration strategies based on the information stored in memory, to search beyond local optima. The search process starts from a feasible solution and iteratively moves from the current solution to its best neighbouring solution, even if that move worsens the objective function value (Glover, 1997).
Tabu Search for Local Scheduling
Tabu search for local scheduling:
• Initial solution: FCFS, Min-min, Max-min, sufferage, and backfilling.
• The solution is improved by using two moves: SLA-swap and SLA-transfer.
• The SLA-swap move swaps two SLAs performed by different processors.
• The SLA-transfer move shifts an SLA to another processor.
• Composite neighbourhood.
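The tabu search just outlined can be sketched as follows, using the SLA-transfer and SLA-swap moves over a composite neighbourhood. The tabu policy here (forbidding a recently applied move for a fixed tenure) and the parameter defaults are common simplifications, not necessarily the authors' exact design; `evaluate` is any makespan function for an assignment.

```python
def tabu_search(assignment, evaluate, num_hosts, iters=100, tenure=7):
    """Improve an SLA-to-host assignment; evaluate() returns its makespan."""
    current = list(assignment)
    best, best_cost = list(current), evaluate(current)
    tabu = {}                                # move -> iteration it is forbidden until
    for it in range(iters):
        # Composite neighbourhood: all SLA-transfer and SLA-swap moves.
        candidates = []
        for i in range(len(current)):
            for h in range(num_hosts):       # SLA-transfer: shift SLA i to host h
                if h != current[i]:
                    candidates.append(("transfer", i, h))
            for k in range(i + 1, len(current)):
                if current[i] != current[k]:  # SLA-swap: SLAs on different hosts
                    candidates.append(("swap", i, k))
        # Pick the best non-tabu move, even if it worsens the makespan.
        best_move, best_move_cost = None, float("inf")
        for move in candidates:
            if tabu.get(move, -1) >= it:
                continue
            trial = list(current)
            if move[0] == "transfer":
                trial[move[1]] = move[2]
            else:
                trial[move[1]], trial[move[2]] = trial[move[2]], trial[move[1]]
            cost = evaluate(trial)
            if cost < best_move_cost:
                best_move, best_move_cost = move, cost
        if best_move is None:
            break                            # every move is currently tabu
        if best_move[0] == "transfer":
            current[best_move[1]] = best_move[2]
        else:
            i, k = best_move[1], best_move[2]
            current[i], current[k] = current[k], current[i]
        tabu[best_move] = it + tenure        # forbid repeating this move for a while
        if best_move_cost < best_cost:       # keep the incumbent best solution
            best, best_cost = list(current), best_move_cost
    return best, best_cost
```

Accepting the best neighbour even when it is worse, while an incumbent records the best solution seen, is what lets the search escape local optima.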
Other objective functions for Local Scheduling
• Other objective functions: minimising maximum lateness, minimising cost to the user, maximising profit (to the supplier), maximising personal/general utility, maximising resource utilisation, etc.
Fuzzy Scheduling
• Uncertainty handling using fuzzy models: fuzzy due dates, fuzzy execution time.
[Figure: membership functions – μ(P̃) over x for the fuzzy processing time P̃ (triangular, with breakpoints p1, p2, p3) and μ(D̃) over x for the fuzzy due date D̃ (with breakpoints d1, d2).]
Fuzzy objective function
The objective is to minimise the maximum fuzzy completion time:

  minimise C̃_max = max_{i = 1, …, n} C̃_i
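One common way to make this objective computable, offered here as an illustrative assumption rather than the authors' exact formulation, is to represent times as triangular fuzzy numbers (a, b, c) and approximate fuzzy addition and maximum component-wise:

```python
# Triangular fuzzy numbers as (a, b, c) tuples: membership rises from a
# to the peak b and falls back to zero at c.
def fuzzy_add(p, q):
    """Fuzzy sum: add the two triangles component-wise."""
    return tuple(x + y for x, y in zip(p, q))

def fuzzy_max(p, q):
    """Approximate fuzzy max: component-wise maximum of the triangles."""
    return tuple(max(x, y) for x, y in zip(p, q))

def defuzzify(p):
    """Centroid of a triangular fuzzy number, used to rank solutions."""
    return sum(p) / 3.0
```

A fuzzy completion time C̃_i is then built by fuzzy-adding fuzzy execution times along a host's schedule, C̃_max by fuzzy-maxing over SLAs, and schedules are compared via a defuzzified (e.g., centroid) value.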
Re-negotiation in the Presence of Uncertainties
Dynamic nature of Grid computing: resources may fail, high-priority jobs may be submitted, new resources can be added, etc.
In the presence of real-time events that leave the LS agents unable to execute their SLAs, the SS agents re-negotiate the failed SLAs at the local and global levels of the Grid in order to find alternative LS agents to execute them.
Renegotiation in the Presence of Uncertainties
EPSRC e-Science Meeting 2005
[Diagram: the user holds a meta-SLA with SS1; SS1 holds sub-SLAs with the Local Schedulers in its own cluster (LocalSched11–LocalSchedm1) and can re-negotiate meta-SLAs with neighbouring super schedulers (SS2–SSn), which in turn re-negotiate sub-SLAs with their own Local Schedulers (LocalSched12–LocalSchedw2).]
• SS1 detects the resource failure.
• SS1 re-negotiates the failed sub-SLAs to find alternative Local Schedulers locally, within the same cluster, by initiating a sub-SLA negotiation session with the suitable Local Schedulers.
• If it cannot manage to do so, SS1 re-negotiates the meta-SLAs with the neighbouring SSs by initiating a meta-SLA negotiation session.
• SS2 re-negotiates the failed sub-SLAs to find alternative Local Schedulers; here, SS2 locates LS22 to execute the failed job. At the end of task execution, LS22 sends a final report, including the output file details, to the user.
• If SS1 cannot find alternative Local Schedulers at either the local or the global level, it sends an alert message informing the user that the meta-SLA cannot be fulfilled.
Methodology
• Simulation-based approach
  – Need to evaluate different approaches for agreeing SLAs (e.g., conservative vs overbooking), generating bids, pricing/penalties, scheduling, …
  – Need to model users’ behaviour with SLAs
• Evaluation metrics:
  – Resource utilisation, jobs completed / SLAs broken
• Difficult to do a fair comparison with a batch-queuing system!
  – If job waiting time were the issue, it would translate to comparing FCFS with soft real-time scheduling!
Conclusions
• SLAs have the potential of changing the way that jobs are assigned onto compute resources.
• Increased flexibility appears to be the main advantage
• Long-term risk: batch systems have shown a remarkable resistance to change!
http://www.gridscheduling.org
The people
• Manchester:
  – Viktor Yarmolenko
  – Rizos Sakellariou
  – Jon MacLaren (now at Louisiana State University)
• Nottingham:
  – Djamila Ouelhadj
  – Jon Garibaldi