Unobtrusive power proportionality for Torque: Design and Implementation


Page 1: Unobtrusive power proportionality for Torque:  Design and Implementation

+Unobtrusive power proportionality for Torque: Design and Implementation

Arka Bhattacharya

Acknowledgements: Jeff Anderson Lee, Andrew Krioukov, Albert Goto

Page 2: Unobtrusive power proportionality for Torque:  Design and Implementation

+Introduction

What is power proportionality?

The performance-to-power ratio at every performance level is equivalent to that at the maximum performance level.

Servers consume a high percentage of their maximum power even when idle.

Hence, power proportionality => switch off idle servers.

Page 3: Unobtrusive power proportionality for Torque:  Design and Implementation

+NapSAC – Krioukov et al.

[Figure: NapSAC overview. Labels: IPS, Requests, Power, Computational “Spinning Reserve”, Load Distribution, Scheduling, Power Management, Wikipedia Request Rate.]

4/12/11 CPS 2011

Page 4: Unobtrusive power proportionality for Torque:  Design and Implementation

+The need for power proportionality of IT equipment in Soda Hall

Soda Hall power: 450-500 kW

Cluster room power: 120-130 kW (~25%)

Total HVAC for cluster rooms: 75-85 kW (~15%)

Page 5: Unobtrusive power proportionality for Torque:  Design and Implementation

+PSI Cluster

Cluster room power: 120-130 kW (~25% of Soda)

PSI cluster: 20-25 kW (~5% of Soda)

Total HVAC for cluster rooms: 75-85 kW (~15% of Soda)

Total HVAC for PSI cluster room: 20-25 kW (~5% of Soda)
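
The rounded percentages on this slide can be checked with a few lines of arithmetic. The sketch below takes the midpoint of each quoted range, which is an assumption made only for illustration; the slides give ranges, not point values.

```python
# Midpoints of the ranges quoted on the slides, in kW (midpoints are an assumption).
SODA_HALL_KW = (450 + 500) / 2

loads_kw = {
    "cluster rooms (IT)": (120 + 130) / 2,
    "cluster rooms (HVAC)": (75 + 85) / 2,
    "PSI cluster (IT)": (20 + 25) / 2,
    "PSI cluster room (HVAC)": (20 + 25) / 2,
}


def share_of_building(load_kw, building_kw=SODA_HALL_KW):
    """Return a load as a percentage of total building power."""
    return 100.0 * load_kw / building_kw


shares = {name: share_of_building(kw) for name, kw in loads_kw.items()}
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of Soda Hall")
```

With midpoints, the results land slightly above the rounded figures on the slide, which is consistent with the slide using approximate values.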

Page 6: Unobtrusive power proportionality for Torque:  Design and Implementation

+The PSI Cluster

The PSI cluster consumes ~20-25 kW of power irrespective of workload. It contains about 110 servers.

Recently, server faults have reduced the size of the cluster to 78 servers. (The faulty servers are mostly still powered on all the time.)

Used mainly by NLP, Vision, AI and ML graduate students.

It is an HPC cluster running Torque.

Page 7: Unobtrusive power proportionality for Torque:  Design and Implementation

+PSI Cluster

Page 8: Unobtrusive power proportionality for Torque:  Design and Implementation

+Possible Energy savings

Can save ~ 50% of the energy

Page 9: Unobtrusive power proportionality for Torque:  Design and Implementation

+Current state :

Page 10: Unobtrusive power proportionality for Torque:  Design and Implementation

+ Result:

10 kW

We save 49% of the energy

Page 11: Unobtrusive power proportionality for Torque:  Design and Implementation

+What is Torque?

Terascale Open-source Resource and QUEue Manager

Built upon the original Portable Batch System (PBS) project

Resource manager: manages the availability of, and requests for, compute node resources

Used by most academic institutions throughout the world for batch processing

Page 12: Unobtrusive power proportionality for Torque:  Design and Implementation

+Maui Scheduler

Job scheduler

Implements and manages:
• Scheduling policies
• Dynamic priorities
• Reservations
• Fairshare

Page 13: Unobtrusive power proportionality for Torque:  Design and Implementation

+Sample Job Flow

Script submitted to TORQUE specifying required resources

Maui periodically retrieves from TORQUE list of potential jobs, available node resources, etc.

When resources become available, Maui tells TORQUE to execute certain jobs on particular nodes

TORQUE dispatches jobs to the PBS MOMs (Machine Oriented Mini-servers) running on the compute nodes; pbs_mom is the process that starts the job script

Job status changes reported back to Maui, information updated

Page 14: Unobtrusive power proportionality for Torque:  Design and Implementation

+Why are we building power-proportional Torque ?

To shed load in Soda Hall

To investigate why production clusters don’t implement power proportionality

To integrate power proportionality into software used in many clusters throughout the world

Page 15: Unobtrusive power proportionality for Torque:  Design and Implementation

+Desirables from an unobtrusive power proportionality feature

• Avoid modifications to Torque source code
• Only use existing Torque interfaces
• Make the feature completely transparent to end users
• Maintain system responsiveness
• Centralized
• No dependence on the resource manager/scheduler version

Page 16: Unobtrusive power proportionality for Torque:  Design and Implementation

+Analysis of the psi cluster

Logs:
• Active and idle queue log
• Job placement statistics

Logs exist for 68 days in Feb-April 2011

Logs were recorded once every minute

Logs contain information on ~169k jobs and ~40 users

Page 17: Unobtrusive power proportionality for Torque:  Design and Implementation

+Type of servers in the psi cluster

Server make   Number of cores   Memory   Count
Dell          2                 3GB      64
Dell          8                 16GB     21
Intel         8                 48GB     28
Intel         24                256GB    4
Total                                    117

• Each server class is further divided according to various features
• Not all servers listed above are switched on all the time

Page 18: Unobtrusive power proportionality for Torque:  Design and Implementation

+CDF of server idle duration

TAKEAWAY 1: Most idle periods are small

Page 19: Unobtrusive power proportionality for Torque:  Design and Implementation

+Contribution of server idle period to total

TAKEAWAY 2: To save energy, tackle the large idle periods
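
Takeaways 1 and 2 describe the same data counted two ways: by number of idle periods and by the time those periods add up to. A minimal sketch of the distinction follows; the idle durations are made up for illustration and are not taken from the psi cluster logs.

```python
def fraction_of_periods_below(durations, threshold):
    """Share of idle periods (by count) no longer than the threshold."""
    return sum(1 for d in durations if d <= threshold) / len(durations)


def fraction_of_idle_time_below(durations, threshold):
    """Share of total idle time contributed by periods no longer than the threshold."""
    return sum(d for d in durations if d <= threshold) / sum(durations)


# Hypothetical idle periods in minutes: many short gaps and a few long ones.
idle_minutes = [2] * 8 + [600, 900]

by_count = fraction_of_periods_below(idle_minutes, 10)
by_time = fraction_of_idle_time_below(idle_minutes, 10)
print(f"periods <= 10 min: {by_count:.0%} of periods, {by_time:.1%} of idle time")
```

In this toy data the short periods dominate the count but contribute almost nothing to total idle time, which is the pattern the two takeaways point at.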

Page 20: Unobtrusive power proportionality for Torque:  Design and Implementation

+CDF of job durations

[Figure: CDF of job durations, with INTERACTIVE and BATCH regions and a marked point at (50, 500s).]

TAKEAWAY 3: Most jobs are long. Hence a slight increase in queuing time won’t hurt

Page 21: Unobtrusive power proportionality for Torque:  Design and Implementation

+Summary of takeaways

Short server idle periods, though numerous, contribute very little to total server idle time.

The power proportionality algorithm need not be aggressive in switching off servers.

Waking a server takes 5 min. Compared to the running time of a job, this is negligible.

Page 22: Unobtrusive power proportionality for Torque:  Design and Implementation

+Loiter Time vs Energy Savings
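
The trade-off on this slide can be approximated with a simple model: a server stays on for the loiter time after it goes idle and is off for the rest of that idle period. The sketch below makes that assumption and ignores the offline loiter time and the wake-up delay; the idle durations are hypothetical, not log data.

```python
def idle_time_saved(durations, loiter):
    """Fraction of total idle time spent powered off for a given loiter time.

    Assumes a server is switched off exactly `loiter` minutes into each idle
    period and stays off until the period ends.
    """
    off = sum(max(0, d - loiter) for d in durations)
    return off / sum(durations)


# Hypothetical idle periods in minutes.
idle_minutes = [2] * 8 + [600, 900]

for loiter in (0, 7, 30, 120):
    saved = idle_time_saved(idle_minutes, loiter)
    print(f"loiter = {loiter:3d} min -> {saved:.1%} of idle time powered off")
```

Because long idle periods carry most of the idle time, a moderate loiter time gives up little of the possible saving while avoiding power cycles during short gaps.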

Page 23: Unobtrusive power proportionality for Torque:  Design and Implementation

+

Design of unobtrusive Power Proportionality for Torque

Page 24: Unobtrusive power proportionality for Torque:  Design and Implementation

+Using Torque interfaces

What useful state information does Torque/Maui maintain?

• The state (active/offline/down) of each server, and the jobs running on it. Obtained through the “pbsnodes” command.

• A list of running and queued jobs. Obtained through the “qstat” command.

• Job constraints and scheduling details of each job. Obtained through the “checkjob” command.
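
As an illustration of using these interfaces from a script, the sketch below parses text laid out the way `pbsnodes` typically prints node records: an unindented node name followed by indented `key = value` lines. The sample text, node names, and property names are made up; real output carries more fields and may differ between Torque versions, so this is a starting point rather than a complete parser.

```python
def parse_pbsnodes(text):
    """Parse pbsnodes-style text into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new node record.
            current = line.strip()
            nodes[current] = {}
        elif current is not None and "=" in line:
            # Split on the first "=" only; values may contain "=" themselves.
            key, _, value = line.partition("=")
            nodes[current][key.strip()] = value.strip()
    return nodes


# Illustrative sample only; names and values are hypothetical.
SAMPLE = """\
node01
     state = free
     np = 8
     properties = intel,mem48g

node02
     state = down,offline
     np = 2
     properties = dell,mem3g
"""

nodes = parse_pbsnodes(SAMPLE)
# A node with state "free" and no "jobs" attribute is treated as idle here.
idle = [name for name, attrs in nodes.items()
        if attrs["state"] == "free" and "jobs" not in attrs]
print(idle)
```

In a live deployment the text would come from running the command on the Torque master, for example with `subprocess.run(["pbsnodes"], capture_output=True, text=True)`.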

Page 25: Unobtrusive power proportionality for Torque:  Design and Implementation

+First implementation: State machine for each server

States: Active, Offline, Down, Waking, Problematic Server

Transition conditions shown in the diagram:
• Server_idle_time > LOITER_TIME
• Server_offline_time > OFFLINE_LOITER_TIME
• No job has been scheduled on server
• Idle job exists
• Server has woken up
• Server not waking
• If idle job can be scheduled on server
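
A minimal sketch of this first state machine as a pure transition function follows. The flattened diagram does not preserve which arrow carries which condition, so the edge assignments are one plausible reading rather than the confirmed design. The two loiter values are the ones quoted on the deployment slide; `WAKE_TIMEOUT` and the function signature are hypothetical.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    OFFLINE = "offline"
    DOWN = "down"
    WAKING = "waking"
    PROBLEMATIC = "problematic"


LOITER_TIME = 7 * 60          # seconds; value quoted on the deployment slide
OFFLINE_LOITER_TIME = 3 * 60  # seconds; value quoted on the deployment slide
WAKE_TIMEOUT = 10 * 60        # seconds; hypothetical, not given in the slides


def next_state(state, *, idle_time=0, offline_time=0, has_job=False,
               idle_job_exists=False, is_up=False, waking_time=0):
    """Return the next state of one server (assumed reading of the diagram)."""
    if state is State.ACTIVE:
        if not has_job and idle_time > LOITER_TIME:
            return State.OFFLINE   # stop new jobs from landing here
    elif state is State.OFFLINE:
        if has_job or idle_job_exists:
            return State.ACTIVE    # work arrived before the server was halted
        if offline_time > OFFLINE_LOITER_TIME:
            return State.DOWN      # safe to halt: nothing was scheduled
    elif state is State.DOWN:
        if idle_job_exists:
            return State.WAKING
    elif state is State.WAKING:
        if is_up:
            return State.ACTIVE
        if waking_time > WAKE_TIMEOUT:
            return State.PROBLEMATIC
    return state


print(next_state(State.ACTIVE, idle_time=8 * 60))
```

Keeping the transition logic free of side effects makes it easy to test separately from the commands that actually halt or wake machines.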

Page 26: Unobtrusive power proportionality for Torque:  Design and Implementation

+Does not work !

Each job is submitted to a specific queue, so we must ensure that the right server wakes up.

Page 27: Unobtrusive power proportionality for Torque:  Design and Implementation

+Next implementation: State machine for each server

States: Active, Offline, Down, Waking, Problematic Server

Transition conditions shown in the diagram:
• Server_idle_time > LOITER_TIME
• Server_offline_time > OFFLINE_LOITER_TIME
• No job has been scheduled on server
• Idle job exists
• Server belongs to desired queue
• Server has woken up
• Server not waking
• If idle job can be scheduled on server

Page 28: Unobtrusive power proportionality for Torque:  Design and Implementation

+Still did not work !

Each job has specific constraints which torque takes into account while scheduling

Job constraints can be obtained through “checkjob” command.
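
The queue and constraint checks can be pictured as a filter over powered-down servers. The sketch below assumes the job's requirements have already been extracted from `checkjob` output into a small record; the field names, server names, and feature strings are hypothetical, and real jobs can carry constraints beyond the ones modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    queue: str
    cores: int
    memory_gb: int
    features: frozenset = field(default_factory=frozenset)


@dataclass
class JobRequest:
    queue: str
    cores: int = 1
    memory_gb: int = 0
    features: frozenset = field(default_factory=frozenset)


def satisfies(server, job):
    """True if the server is in the job's queue and meets its constraints."""
    return (server.queue == job.queue
            and server.cores >= job.cores
            and server.memory_gb >= job.memory_gb
            and job.features <= server.features)  # every requested feature is present


def candidates_to_wake(down_servers, job):
    """Powered-down servers that could run the job if woken."""
    return [s for s in down_servers if satisfies(s, job)]


# Hypothetical inventory loosely based on the server classes listed earlier.
down = [
    Server("d1", "batch", 2, 3, frozenset({"dell"})),
    Server("i1", "batch", 8, 48, frozenset({"intel", "bigmem"})),
    Server("i2", "interactive", 8, 48, frozenset({"intel", "bigmem"})),
]
job = JobRequest(queue="batch", cores=4, memory_gb=16, features=frozenset({"intel"}))
print([s.name for s in candidates_to_wake(down, job)])
```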

Page 29: Unobtrusive power proportionality for Torque:  Design and Implementation

+Next implementation: State machine for each server

States: Active, Offline, Down, Waking, Problematic Server

Transition conditions shown in the diagram:
• Server_idle_time > LOITER_TIME
• Server_offline_time > OFFLINE_LOITER_TIME
• No job has been scheduled on server
• Idle job exists
• Server belongs to desired queue
• Server satisfies job constraints
• Server has woken up
• Server not waking
• If idle job can be scheduled on server

Page 30: Unobtrusive power proportionality for Torque:  Design and Implementation

+Scheduling problem: Job submission characteristics

Users tend to submit multiple jobs at a time (often >20).

Torque has its own fairness mechanisms, which won’t schedule all the jobs even if there are free servers.

To accurately predict which jobs Torque will schedule, and avoid switching on extra servers, we would have to emulate the Torque scheduling logic!

That ties the power proportionality feature to a specific Torque policy.

Solution: Switch on only a few servers at a time to check if Torque schedules the idle job.
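
A minimal sketch of this throttling follows. It assumes the limit applies to servers that are still in the Waking state, which is one reading of "a few servers at a time"; the limit value is the one quoted on the deployment slide, and the function name is hypothetical.

```python
MAX_SERVERS_TO_WAKE_AT_A_TIME = 5  # value quoted on the deployment slide


def servers_to_wake(candidates, currently_waking,
                    limit=MAX_SERVERS_TO_WAKE_AT_A_TIME):
    """Pick candidates to wake without exceeding the in-flight limit.

    The remaining candidates are reconsidered on the next polling cycle,
    after Torque has had a chance to place jobs on the newly woken servers.
    """
    budget = max(0, limit - len(currently_waking))
    return candidates[:budget]


candidates = [f"node{i:02d}" for i in range(20)]
print(servers_to_wake(candidates, currently_waking=["node90", "node91"]))
```

If the idle jobs are still queued on the next cycle, more servers are woken; if Torque's fairness policy holds the jobs back, no further servers are switched on for them.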

Page 31: Unobtrusive power proportionality for Torque:  Design and Implementation

+Next implementation: State machine for each server

States: Active, Offline, Down, Waking, Problematic Server

Transition conditions shown in the diagram:
• Server_idle_time > LOITER_TIME
• Server_offline_time > OFFLINE_LOITER_TIME
• No job has been scheduled on server
• Idle job exists
• Server belongs to desired queue
• Server satisfies job constraints
• Switch on only a few servers at a time
• Server has woken up
• Server not waking
• If idle job can be scheduled on server

Page 32: Unobtrusive power proportionality for Torque:  Design and Implementation

+Maintain responsiveness/headroom

The debug cycle usually consists of users running short jobs and validating the output.

If no server satisfying the job constraints is switched on, a user might have to wait a long time to validate that the job is running.

If jobs throw errors, the user might have to wait for an entire server power cycle to run the modified job.

Solution: Group servers according to features. In each group, keep a limited number of servers as spinning reserve all the time.
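
The headroom rule can be sketched per feature group as a comparison between the number of idle, powered-on servers and a target. The sketch assumes spinning reserve means idle servers that are switched on; the target value is the one quoted on the deployment slide, and the function name is hypothetical.

```python
HEADROOM_PER_GROUP = 3  # value quoted on the deployment slide


def headroom_actions(idle_on, down, headroom=HEADROOM_PER_GROUP):
    """Return (servers_to_wake, servers_that_may_switch_off) for one group.

    idle_on: powered-on servers in the group with no job running
    down:    powered-off servers in the group that are available to wake
    """
    deficit = headroom - idle_on
    if deficit > 0:
        # Below target: wake as many as needed, limited by what is available.
        return min(deficit, down), 0
    # At or above target: only the surplus may be switched off.
    return 0, -deficit


for idle_on, down in [(1, 4), (5, 0), (3, 2), (0, 1)]:
    print((idle_on, down), "->", headroom_actions(idle_on, down))
```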

Page 33: Unobtrusive power proportionality for Torque:  Design and Implementation

+Final implementation: State machine for each server

States: Active, Offline, Down, Waking, Problematic Server

Transition conditions shown in the diagram:
• Server_idle_time > LOITER_TIME
• Server_offline_time > OFFLINE_LOITER_TIME
• No job has been scheduled on server
• Switching off servers leaves no headroom
• Idle job exists
• Server belongs to desired queue
• Server satisfies job constraints
• Switch on only MAX_SERVERS at a time
• Switch on server to maintain headroom
• Server has woken up
• Server not waking
• If idle job can be scheduled on server

Page 34: Unobtrusive power proportionality for Torque:  Design and Implementation

+But the servers don’t wake up !!!

Each server has to bootstrap a list of services, such as network file systems, work directories, portmapper, etc.

Often these bootstraps fail, and hence servers are left in an undesired state (e.g. with no home directories mounted to write user output to!).

Solution: Have a health-check script on each server. Check for proper configuration of the required services, and make the server available for scheduling only if the health check succeeds.
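
A minimal sketch of such a health check follows. The paths are placeholders, and using `pbsnodes -c` to clear a node's offline flag is one way to gate scheduling on the result, assuming the node was previously marked offline; the slides do not say how the actual script releases a node.

```python
import os


def failed_checks(required_mounts, writable_dirs,
                  ismount=os.path.ismount, access=os.access):
    """Return a list of human-readable failures; an empty list means healthy.

    ismount and access are injectable so the logic can be tested without
    touching the real file system.
    """
    failures = []
    for path in required_mounts:
        if not ismount(path):
            failures.append(f"not mounted: {path}")
    for path in writable_dirs:
        if not access(path, os.W_OK):
            failures.append(f"not writable: {path}")
    return failures


def release_command(node, failures):
    """Command that would clear the node's offline flag, or None if unhealthy."""
    if failures:
        return None
    return ["pbsnodes", "-c", node]


# Hypothetical paths; a real cluster would list its own mounts and work directories.
problems = failed_checks(required_mounts=["/home"], writable_dirs=["/tmp"])
print(problems or "healthy")
```

The command list is only built here, not executed; a deployment would pass it to `subprocess.run` on a host allowed to administer Torque nodes.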

Page 35: Unobtrusive power proportionality for Torque:  Design and Implementation

+Power Proportional Torque at a glance:

• Completely transparent to the user
• Did not modify Torque source code
• 1000-line Python script which runs only on the Torque master server
• Halts servers through ssh
• Wakes servers through wake-on-LAN
• Separates scheduling policy from mechanism. It allows Torque to dictate the scheduling policy.
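
The two power-control mechanisms named above can be sketched with the standard library alone. A Wake-on-LAN magic packet is six 0xFF bytes followed by the target MAC address repeated sixteen times, usually sent as a UDP broadcast. The MAC address, host name, UDP port, and the exact ssh command are assumptions for illustration; the slides do not show what the real script runs.

```python
import socket


def magic_packet(mac):
    """Build a Wake-on-LAN magic packet for a MAC such as 'aa:bb:cc:dd:ee:ff'."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    return b"\xff" * 6 + raw * 16


def wake(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP (port 9 is a common choice)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))


def halt_command(host):
    """Argument list for halting a node over ssh (assumes key-based root login)."""
    return ["ssh", f"root@{host}", "shutdown", "-h", "now"]


packet = magic_packet("00:11:22:33:44:55")  # hypothetical MAC address
print(len(packet), "bytes;", " ".join(halt_command("node01")))
```

Wake-on-LAN also has to be enabled in each server's firmware and network interface settings, which is outside what a script on the master can guarantee.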

Page 36: Unobtrusive power proportionality for Torque:  Design and Implementation

+Deployment

Deployed on 57 of the 78 active nodes in the psi cluster. Total number of cores = 150

Servers were classified into 5 groups based on features.

• HEADROOM_PER_GROUP = 3
• MAX_SERVERS_TO_WAKE_AT_A_TIME = 5
• LOITER_TIME = 7 minutes
• OFFLINE_LOITER_TIME = 3 minutes

Page 37: Unobtrusive power proportionality for Torque:  Design and Implementation

+Average Statistics

• Deployed since last week
• ~800 jobs analyzed
• Average utilization of cluster = 40%
• Energy saved = 49%

Page 38: Unobtrusive power proportionality for Torque:  Design and Implementation

+

Page 39: Unobtrusive power proportionality for Torque:  Design and Implementation

+Results:

Page 40: Unobtrusive power proportionality for Torque:  Design and Implementation

+HVAC power savings

Page 41: Unobtrusive power proportionality for Torque:  Design and Implementation

+Number of servers powered on at a time:

Headroom

Page 42: Unobtrusive power proportionality for Torque:  Design and Implementation

+Expected vs Actual savings

Page 43: Unobtrusive power proportionality for Torque:  Design and Implementation

+Submission vs Execution profile

[Figure: Submission profile and execution profile. Y-axis: number of active cores (0 to 120). X-axis: time, from 5/17/12 12:00 to 5/21/12 0:00.]

Page 44: Unobtrusive power proportionality for Torque:  Design and Implementation

+CDF of job queue time as a percentage of job length

Page 45: Unobtrusive power proportionality for Torque:  Design and Implementation

+Conclusions: what we achieved

Power proportionality is easy to achieve for Torque without changing any source code at all.

The script could be run on any standard Torque cluster to save energy.

Switching servers back on in a consistent state is the single biggest roadblock to deployment of the script.

We saved a maximum of ~17 kW of power in Soda Hall (~3%). This was with only half the psi cluster!