Ensieea Rizwani An energy-efficient management mechanism for large-scale server clusters By:...

1

Ensieea Rizwani

An energy-efficient management mechanism for large-scale server clusters

By: Zhenghua Xue, Dong, Ma, Fan, Mei

2

- Most data centers, including the University at Buffalo’s Center for Computational Research (CCR), keep their resources running 24 hours a day, 365 days a year, regardless of workload or utilization. This increases power consumption and lowers resource utilization. Energy-efficient data centers matter both financially and technically.

3

Outline
- Introduction
- Overview of the Architecture
- Power Conservation Mechanism
- Adaptive Pool Mechanism
- Simulation and Measurement
- Conclusion
- Future Work

4

- Power equipment, cooling equipment, and electricity together represent a significant portion of a data center’s cost.

- Any guesses for the percentage?

- The cost is up to 63 percent of the total cost of ownership of the data center’s physical IT infrastructure.

5

How to make Data Centers more cost efficient?

- At the hardware component level, a general approach is to reduce the power consumed by components not currently in use. Some examples are:
  - Placing the CPU in a “halted” state when there are no runnable tasks
  - Turning off the hard drive motor or memory device after some period of inactivity
  - Resizing the cache by powering down unused cache lines

6

Approach taken by this article:
- This paper proposes an adaptive pool-based resource management (APRM) mechanism to provide computing capacity on demand.
- APRM saves power by terminating some of the idle nodes and guarantees QoS by keeping some idle nodes in reserve.
- By obtaining load information from the management system, APRM can predict the load.

7

Management System of HPC
- The management system of HPC consists of two components:
  - Job management subsystem
  - Resource management subsystem

8

Overview of an extensible cluster management architecture

9

Job Management System
- Job Controller
  - Executing entity that dispatches jobs and controls their lifetime by starting, suspending, or canceling them.
- Job Supervisor
  - Responsible for monitoring job status and reporting that information to the queue manager.
- Queue Manager
  - Queues the jobs in the waiting queue
  - Updates the queue upon receiving information from the job supervisor
  - Makes job scheduling decisions in accordance with the scheduling algorithm and the available resources
  - Informs the job controller to execute jobs

10

Resource Manager
- Executor
  - Dedicated to executing the instructions
- Resource Monitor
  - Monitors and collects the status information of resources
- Statistics Analyzer
  - Auxiliary component that supports automatic and intelligent resource management
- Policy Decisioner
  - Maintains a collection of policies that are triggered by predefined events.
  - The energy-efficient resource management method is kept in the policy decisioner.

11

Demand fluctuations:
- Many studies have shown that demand for high-performance scientific computing varies with time. Job arrivals are expected to follow cycles at three levels:
  - Daily (working hours are the peak hours)
  - Weekly (weekends have the lowest job arrivals)
  - Yearly

12

Server States

13

Power Model of Servers
- Busy, Idle, Shutdown
- Upon completion of all the jobs on a computing node, its power state transitions from busy to idle.
- Once a new job arrives at a computing node, its power state transitions from idle to busy.
- If a computing node stays idle for a long time, it is terminated and its power state transitions from idle to shutdown.
- When the workload becomes heavy, additional computing capacity is needed. Some computing nodes are woken up to take part, and their state transitions from shutdown to idle and then to busy.
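The four transitions of the power model can be written down as a small state machine. This is only a sketch: the event names (`all_jobs_done`, `job_arrived`, `idle_timeout`, `wake_up`) are hypothetical labels chosen for illustration and do not come from the paper.

```python
from enum import Enum


class PowerState(Enum):
    BUSY = "busy"
    IDLE = "idle"
    SHUTDOWN = "shutdown"


# Allowed transitions, keyed by (current state, event).
# The event names are hypothetical labels for the four cases on the slide.
TRANSITIONS = {
    (PowerState.BUSY, "all_jobs_done"): PowerState.IDLE,
    (PowerState.IDLE, "job_arrived"): PowerState.BUSY,
    (PowerState.IDLE, "idle_timeout"): PowerState.SHUTDOWN,
    (PowerState.SHUTDOWN, "wake_up"): PowerState.IDLE,
}


def next_state(state, event):
    """Return the state reached by `event`, or raise if the move is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.value} on {event!r}") from None


# A shut-down node has to pass through idle before it can become busy.
state = PowerState.SHUTDOWN
for event in ("wake_up", "job_arrived", "all_jobs_done", "idle_timeout"):
    state = next_state(state, event)
print(state.value)  # shutdown
```

Keeping the transitions in one table makes it easy to see that there is no direct path from shutdown to busy, which is where the start-up delay comes from.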

14

Adaptive Pool Mechanism
- A resource pool is a collection of computing nodes offering shared access to computing capacity. The automation and virtualization capabilities of a resource pool promise lower costs of ownership.

15

Mechanism of APRM
- corePoolSize: the number of nodes to keep in the pool. It is the sum of the number of working nodes and the number of idle nodes.
- maxPoolSize: the maximum number of nodes allowed in the pool. It equals the total number of nodes in the cluster.
- maxIdleNodes: the maximum number of idle nodes to keep in the pool.
- keepAliveTime: when the number of idle nodes in the pool is greater than maxIdleNodes, this is the maximum time that excess computing nodes wait for new jobs before terminating.

16

Termination Conditions
- The idle time of an idle node exceeds keepAliveTime.
- This first condition prevents a computing node from frequently terminating and launching when computing demand fluctuates in short cycles.

17

Termination Conditions
- The number of idle nodes in the pool is larger than maxIdleNodes.
- This second condition aims to reduce the number of unneeded computing nodes to save power.

18

Termination Conditions
- If more than one idle node simultaneously meets the two conditions above, nodes with longer runtime have priority to terminate.
- This third condition balances the utilization of nodes. After some idle nodes are terminated, the number of idle nodes in the pool stays at maxIdleNodes.
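The three termination conditions can be combined into one selection step. The sketch below is an illustration, not the paper's implementation: the `IdleNode` record, the function name `select_for_termination`, and the time values are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class IdleNode:
    name: str
    idle_time: float  # how long the node has been idle
    runtime: float    # accumulated runtime of the node


def select_for_termination(idle_nodes, max_idle_nodes, keep_alive_time):
    """Choose which idle nodes to shut down under the three conditions."""
    # Condition 2: only the excess above maxIdleNodes may be terminated.
    excess = len(idle_nodes) - max_idle_nodes
    if excess <= 0:
        return []
    # Condition 1: a node must have been idle for longer than keepAliveTime.
    expired = [n for n in idle_nodes if n.idle_time > keep_alive_time]
    # Condition 3: among the candidates, longer-running nodes go first.
    expired.sort(key=lambda n: n.runtime, reverse=True)
    return expired[:excess]


# Hypothetical pool of four idle nodes; times are in arbitrary units.
pool = [
    IdleNode("a", idle_time=50, runtime=100),
    IdleNode("b", idle_time=70, runtime=300),
    IdleNode("c", idle_time=10, runtime=500),
    IdleNode("d", idle_time=90, runtime=200),
]
chosen = select_for_termination(pool, max_idle_nodes=2, keep_alive_time=30)
print([n.name for n in chosen])  # ['b', 'd']
```

Node `c` has the longest runtime but stays up because it has not yet been idle longer than keepAliveTime, and node `a` stays up because only two nodes exceed the maxIdleNodes reserve.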

19

APRM
APRM saves power by terminating some of the idle nodes and guarantees QoS by reserving idle nodes, whose number is held at maxIdleNodes. The working parameter maxIdleNodes plays an important role in APRM. If it is set too high, computing capacity is over-provisioned. If it is set too low, the reserved idle nodes may be insufficient for newly arriving jobs, and spare nodes have to be woken up to take part in computing, which adds a start-up delay.

20

The ratio of the run time of all the computing nodes with APRM to that without APRM serves as the metric for power saving.

21

Average response time: the time between job arrival and completion, averaged over all jobs.

22

Average slowdown: the ratio of the response time of a job to the time it requires on a dedicated system, averaged over all jobs.

23

Average frequency of shutdown serves as a metric to measure whether computing nodes frequently terminate and launch.
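The four metrics above can be computed from simulation output along the following lines. This is a sketch under stated assumptions: the `Job` record and all function names are hypothetical, and since the slides do not give the exact formula for shutdown frequency, it is assumed here to be shutdowns per node per unit of observed time.

```python
from dataclasses import dataclass


@dataclass
class Job:
    arrival: float
    completion: float
    dedicated_time: float  # time the job needs on a dedicated system


def power_saving_ratio(runtimes_with_aprm, runtimes_without_aprm):
    """Total node run time with APRM divided by total run time without it."""
    return sum(runtimes_with_aprm) / sum(runtimes_without_aprm)


def average_response_time(jobs):
    """Time between arrival and completion, averaged over all jobs."""
    return sum(j.completion - j.arrival for j in jobs) / len(jobs)


def average_slowdown(jobs):
    """Response time relative to dedicated time, averaged over all jobs."""
    return sum((j.completion - j.arrival) / j.dedicated_time for j in jobs) / len(jobs)


def average_shutdown_frequency(shutdown_counts, observed_time):
    """Assumed definition: shutdowns per node per unit of observed time."""
    return sum(shutdown_counts) / (len(shutdown_counts) * observed_time)


# Made-up numbers, only to show how the functions are called.
jobs = [Job(arrival=0, completion=10, dedicated_time=5),
        Job(arrival=2, completion=8, dedicated_time=6)]
print(power_saving_ratio([60, 40], [100, 100]))
print(average_response_time(jobs))
print(average_slowdown(jobs))
print(average_shutdown_frequency([2, 4], observed_time=10))
```

A power saving ratio below 1 means the nodes ran for less total time with APRM than without it.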

24

Simulation Model
- Workload generator
- Job scheduler
- Resource manager

27

Summary
- The difference in average job response time is no more than 1.8701 minutes, and the difference in average job slowdown is no more than 0.0085. This suggests that APRM has little impact on QoS while delivering significant power savings.
- Future Work:
  - Researching traces
  - Concluding with better predictive methods

28

Thank You
