Ensieea Rizwani
An energy-efficient management mechanism for large-scale server clusters
By: Zhenghua Xue, Dong, Ma, Fan, Mei

O Most data centers, including the University at Buffalo's Center for Computational Research (CCR), keep their resources running 24 hours a day, 365 days a year, regardless of workload or utilization. This increases power consumption and lowers resource utilization. Energy-efficient data centers matter because they bring substantial financial and technical benefits.

Outline

O Introduction
O Overview of the Architecture
O Power Conservation Mechanism
O Adaptive Pool Mechanism
O Simulation and Measurement
O Conclusion
O Future Work

O Power equipment, cooling equipment, and electricity together represent a significant portion of a data center's cost.

O Any guesses for the percentage?

O Together they account for up to 63 percent of the total cost of ownership of its physical IT infrastructure.

How to make Data Centers more cost efficient?

O At the hardware component level, a general approach is to reduce the power consumed by components not currently in use. Some examples are:
  O Placing the CPU in a "halted" state when there are no runnable tasks
  O Turning off the hard drive motor or memory device after some period of inactivity
  O Resizing the cache by powering down unused cache lines

Approach taken by this article:

O The paper proposes an adaptive pool-based resource management (APRM) mechanism to provide computing capacity on demand.

O APRM saves power by terminating some of the idle nodes and guarantees QoS by keeping some idle nodes in reserve.

O By obtaining load information from the management system, APRM can predict the load amount.

Management System of HPC

O The management system of an HPC cluster consists of two components:
  O Job management subsystem
  O Resource management subsystem

Overview of an extensible cluster management architecture

Job Management System

O Job Controller
  O Executing entity that dispatches jobs and controls their lifetime by starting, suspending, or canceling them
O Job Supervisor
  O Responsible for monitoring job status and reporting that information to the queue manager
O Queue Manager
  O Queues the jobs in the waiting queue
  O Updates the queue upon receiving information from the job supervisor
  O Makes job scheduling decisions in accordance with the scheduling algorithm and the available resources
  O Informs the job controller to execute

Resource Manager

O Executor
  O Dedicated to executing the instructions
O Resource Monitor
  O Concentrates on monitoring and collecting the status information of resources
O Statistics Analyzer
  O Auxiliary component for supporting automatic and intelligent resource management
O Policy Decisioner
  O Maintains a collection of policies that are triggered by predefined events
  O The energy-efficient resource management method is kept in the policy decisioner

Demand fluctuations:

O Many studies have shown that demand for high-performance scientific computing varies with time. Job arrivals are expected to follow cycles at three levels:

O Daily (working hours are the peak hours)

O Weekly (weekends have the lowest job arrivals)

O Yearly ()
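
As a rough illustration of the daily and weekly cycles listed above, the arrival rate could be modelled as a base rate scaled by time-of-day and day-of-week multipliers. This is only a sketch: the function name `arrival_rate`, the base rate, the working-hour window, and every multiplier are hypothetical values chosen for the example, not figures from the paper, and the yearly cycle is left out.

```python
from datetime import datetime

# All numbers below are illustrative assumptions, not values from the paper.
BASE_JOBS_PER_HOUR = 10.0
WORKING_HOURS = range(9, 18)   # assumed daily peak: 09:00 to 17:59
PEAK_MULTIPLIER = 2.0          # assumed boost during working hours
WEEKEND_MULTIPLIER = 0.3       # assumed drop on Saturday and Sunday


def arrival_rate(when: datetime) -> float:
    """Return an assumed expected number of job arrivals per hour at `when`."""
    rate = BASE_JOBS_PER_HOUR
    if when.hour in WORKING_HOURS:
        rate *= PEAK_MULTIPLIER
    if when.weekday() >= 5:    # 5 = Saturday, 6 = Sunday
        rate *= WEEKEND_MULTIPLIER
    return rate
```

A workload generator for a simulation could sample job arrivals from a process whose rate follows such a function.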

Server States

Power Model of Servers

O Three power states: busy, idle, shutdown

O When all the jobs on a computing node complete, its power state changes from busy to idle.

O Once a new job arrives at an idle computing node, its power state changes from idle to busy.

O If a computing node stays idle for a long time, it is terminated and its power state changes from idle to shutdown.

O When the workload becomes heavy, additional computing capacity is needed. Some computing nodes are woken up, and their state changes from shutdown to idle and then to busy.
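
The transitions described on this slide can be written down as a small state machine. The sketch below is one possible encoding, assuming that only the four transitions named above are legal; the names `PowerState`, `ALLOWED`, and `transition` are hypothetical and do not come from the paper.

```python
from enum import Enum


class PowerState(Enum):
    BUSY = "busy"
    IDLE = "idle"
    SHUTDOWN = "shutdown"


# Transitions named on the slide; anything else is rejected (an assumption).
ALLOWED = {
    (PowerState.BUSY, PowerState.IDLE),      # all jobs on the node completed
    (PowerState.IDLE, PowerState.BUSY),      # a new job arrived
    (PowerState.IDLE, PowerState.SHUTDOWN),  # idle for too long
    (PowerState.SHUTDOWN, PowerState.IDLE),  # woken up under heavy load
}


def transition(current: PowerState, target: PowerState) -> PowerState:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if (current, target) not in ALLOWED:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Under this encoding a shut-down node cannot become busy directly; it has to pass through idle first, matching the "shutdown to idle and then to busy" wording.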

Adaptive Pool Mechanism

O A resource pool is a collection of computing nodes offering shared access to computing capacity. The automation and virtualization capabilities of a resource pool promise lower costs of ownership.

Mechanism of APRM

O corePoolSize: the number of nodes to keep in the pool; it is the sum of the numbers of working nodes and idle nodes.

O maxPoolSize: the maximum number of nodes allowed in the pool; it equals the total number of nodes in the cluster.

O maxIdleNodes: the maximum number of idle nodes to keep in the pool.

O keepAliveTime: when the number of idle nodes in the pool is greater than maxIdleNodes, this is the maximum time that excess computing nodes will wait for new jobs before terminating.
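
One way to picture these four parameters is as a small configuration record. The sketch below is an assumption-laden illustration: the class name `PoolParams`, the snake_case field names, the choice of seconds for keepAliveTime, and the validation rules are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolParams:
    max_pool_size: int      # maxPoolSize: total number of nodes in the cluster
    max_idle_nodes: int     # maxIdleNodes: idle nodes kept in reserve
    keep_alive_time: float  # keepAliveTime, in seconds (unit is an assumption)

    def __post_init__(self) -> None:
        # Sanity checks assumed for this sketch, not stated in the paper.
        if not 0 <= self.max_idle_nodes <= self.max_pool_size:
            raise ValueError("max_idle_nodes must be between 0 and max_pool_size")
        if self.keep_alive_time < 0:
            raise ValueError("keep_alive_time must be non-negative")


def core_pool_size(working_nodes: int, idle_nodes: int) -> int:
    """corePoolSize is the sum of working and idle nodes currently in the pool."""
    return working_nodes + idle_nodes
```

For example, `PoolParams(max_pool_size=64, max_idle_nodes=4, keep_alive_time=600.0)` would describe a 64-node cluster that keeps at most four idle nodes in reserve.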

Termination Conditions

O The idle time of an idle node exceeds keepAliveTime.

O This first condition prevents a computing node from being terminated and launched repeatedly when computing demand fluctuates in short cycles.

Termination Conditions

O The number of idle nodes in the pool is larger than maxIdleNodes.

O This second condition aims to shut down unneeded computing nodes to save power.

Termination Conditions

O If more than one idle node meets the two conditions above at the same time, nodes with longer runtime are terminated first.

O This third condition balances the utilization of nodes. After some idle nodes are terminated, the number of idle nodes in the pool stays at maxIdleNodes.
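
Taken together, the three termination conditions could be sketched as a single selection function. This is a minimal illustration rather than the paper's implementation: the `IdleNode` record, the function name `nodes_to_terminate`, and the reading of "runtime" as accumulated run time per node are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IdleNode:
    name: str
    idle_time: float  # time spent idle so far
    runtime: float    # accumulated run time (interpretation is an assumption)


def nodes_to_terminate(idle_nodes: List[IdleNode],
                       max_idle_nodes: int,
                       keep_alive_time: float) -> List[IdleNode]:
    """Pick idle nodes to shut down according to the three conditions."""
    # Condition 2: only the surplus above max_idle_nodes may be terminated.
    surplus = len(idle_nodes) - max_idle_nodes
    if surplus <= 0:
        return []
    # Condition 1: a node must have been idle longer than keep_alive_time.
    expired = [n for n in idle_nodes if n.idle_time > keep_alive_time]
    # Condition 3: nodes with longer runtime are terminated first.
    expired.sort(key=lambda n: n.runtime, reverse=True)
    return expired[:surplus]
```

Capping the result at the surplus means the pool never drops below maxIdleNodes idle nodes, even when more nodes have exceeded keepAliveTime.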

APRM

APRM saves power by terminating some of the idle nodes and guarantees QoS by keeping a reserve of idle nodes whose number is held at maxIdleNodes. The working parameter maxIdleNodes plays an important role in APRM. If it is set too high, computing capacity is over-provisioned. If it is set too low, the reserved idle nodes may be insufficient for newly arriving jobs, and spare nodes must be woken up to take part in computing, which adds a start-up delay.

The ratio of the run time of all computing nodes with APRM to that without APRM is used as the metric for power saving.

Average job response time: the time between job arrival and completion, averaged over all jobs.

Average job slowdown: the ratio of the response time of a job to the time it requires on a dedicated system, averaged over all jobs.

The average frequency of shutdown is used as a metric to measure whether computing nodes terminate and launch frequently.
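
The four metrics defined on these slides could be computed along the following lines. This is a sketch under stated assumptions: the `Job` record and all function names are hypothetical, and the shutdown frequency is normalized per node and per unit time, which the slides do not specify.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Job:
    arrival: float         # time the job arrived
    completion: float      # time the job finished
    dedicated_time: float  # time the job would need on a dedicated system


def power_saving_ratio(runtime_with: Sequence[float],
                       runtime_without: Sequence[float]) -> float:
    """Total node run time with APRM divided by total run time without it."""
    return sum(runtime_with) / sum(runtime_without)


def average_response_time(jobs: Sequence[Job]) -> float:
    """Mean time between arrival and completion over all jobs."""
    return sum(j.completion - j.arrival for j in jobs) / len(jobs)


def average_slowdown(jobs: Sequence[Job]) -> float:
    """Mean ratio of response time to dedicated-system time over all jobs."""
    return sum((j.completion - j.arrival) / j.dedicated_time for j in jobs) / len(jobs)


def average_shutdown_frequency(shutdown_counts: Sequence[int], period: float) -> float:
    """Shutdowns per node per unit time (this normalization is an assumption)."""
    return sum(shutdown_counts) / (len(shutdown_counts) * period)
```

A lower power saving ratio means more run time was avoided, while response time and slowdown indicate how much QoS was given up in exchange.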

Simulation Model

O Workload generator
O Job scheduler
O Resource manager

Summary

O The difference in average job response time is no more than 1.8701 minutes, and the difference in average job slowdown is no more than 0.0085. This suggests that APRM achieves significant power savings with little impact on QoS.

O Future work:
  O Researching traces
  O Concluding with better predictive methods

Thank You