Ppt4 London - Michael Rudgyard (Concurrent Thinking): Driving efficiencies through measuring and monitoring in the data centre

Measuring and monitoring to support the EU Code of Conduct

Michael Rudgyard (CTO)

Concurrent Thinking Ltd

• The participant commitments define minimum obligations (roughly):

– Provision of monthly DCiE / PUE measurements

– Provision of IT rated electrical load capacity of the DC

– Target inlet temperature for IT equipment (optional)

– External monthly average ambient temperature (optional)

– External monthly average dew point temperature (optional)

• It also requires the DC to commit to an energy-saving action plan:

– A number of potential ways to save energy are suggested

– Most (all ?) involve some level of monitoring

EU Code of Conduct – Participant Commitments

• It is simple (but neither cost-effective nor sensible) to monitor your data centre using the ‘man and a clip-board’ technique

• Sadly, this is the ‘state of the art’ for a lot of data centres, each housing many millions of pounds of high-tech IT equipment

• But information is power, and power is money…

monitoring vs. Monitoring (1)

• It is much more effective to Monitor at as fine-grained a level as possible

– To truly understand where energy savings can be made

– To understand how factors vary over time / with load etc

– To give ample warning of potential (often critical) issues

– To report factual information to management

– To drive continuous iterative improvement over time

• Real energy and productivity savings require a ‘joined-up’ approach

– Managing buildings, data-centre facilities and IT in a unified manner

– … opening the door to the possibility of orchestration of the data centre

monitoring vs. Monitoring (2)

• First step is to monitor power; then understand where the power is going.

• Next step is to measure PUE

– Most new data centres are being designed against PUE targets

– Many existing data centres are looking to improve their PUE

– Aim to reduce energy utilisation through incremental improvements to PUE

– The average data centre has a PUE of 1.9 (Koomey, 2010), but most should be able to achieve a figure below 1.5 (??)

• Caveats:

– Officially, PUE needs to be an annualised average, not a ‘snap-shot’

– However, continuous PUE ‘snap-shots’ are useful to help drive improvement

Monitoring Energy and PUE (or DCiE)
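The difference between continuous PUE ‘snap-shots’ and the annualised figure can be illustrated with a minimal Python sketch. It uses the standard definitions (PUE is total facility energy divided by IT energy; DCiE is its reciprocal as a percentage). The `EnergyReading` structure, the function names and the kWh figures are assumptions made for illustration, not measurements or part of the original slides.

```python
from dataclasses import dataclass


@dataclass
class EnergyReading:
    """One metering interval, e.g. one month (hypothetical structure)."""
    facility_kwh: float  # total energy drawn by the whole data centre
    it_kwh: float        # energy delivered to the IT equipment only


def pue(facility_kwh: float, it_kwh: float) -> float:
    """PUE = total facility energy / IT equipment energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return facility_kwh / it_kwh


def dcie(facility_kwh: float, it_kwh: float) -> float:
    """DCiE is the reciprocal of PUE, expressed as a percentage."""
    if facility_kwh <= 0:
        raise ValueError("facility energy must be positive")
    return 100.0 * it_kwh / facility_kwh


def annualised_pue(readings: list[EnergyReading]) -> float:
    """Energy-weighted PUE over the whole period: total energy / total IT energy."""
    total_facility = sum(r.facility_kwh for r in readings)
    total_it = sum(r.it_kwh for r in readings)
    return pue(total_facility, total_it)


# Illustrative figures only, not measurements.
readings = [
    EnergyReading(facility_kwh=190_000, it_kwh=100_000),
    EnergyReading(facility_kwh=150_000, it_kwh=100_000),
]
snapshots = [pue(r.facility_kwh, r.it_kwh) for r in readings]
print("per-interval PUE:", snapshots)
print("PUE over the whole period:", annualised_pue(readings))
```

Summing the energy before dividing weights each interval by how much energy it actually used. The per-interval values remain useful as a trend line for driving improvement.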

• Cooling the data centre is the key overhead that is measured by PUE

– But many do not continuously monitor the effectiveness of cooling equipment

– Basic assumption: “if the air is cool enough, then the aircon is working…”

• But cooling infrastructure is generally depreciated over several years

– Despite expensive support contracts, its efficiency may diminish significantly

– Its efficiency may also be influenced by other changes in the data centre

– When should cooling systems be replaced (OPEX vs. CAPEX) ????

• Need to track fine-grain power utilisation to really understand issues

Monitoring key infrastructure

• There are significant opportunities for improvements in most data centres

– The majority operate at temperatures more than 3-4°C below (old) ASHRAE recommendations (Paterson et al, 2009)

– A 1°C increase in temperature equates to a 2-4% reduction in energy (California Energy Commission, 2007; UK financial institution, 2011)

• It is critical to monitor temperature at as fine-grained a level as possible

– To understand where hot-spots are, and how these change over time

– To give ample warning of cooling failure with a smaller thermal ‘buffer’

– Relating temperatures to energy use helps drive iterative improvement

• The more real-time measurements, the better

– Ideally at the rack, sub-rack, server

– … or even processor level!

Environmental Monitoring
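As a rough worked example of the 2-4% per 1°C figure quoted above, the sketch below estimates a savings range for a given temperature increase. It assumes the per-degree reduction compounds from one degree to the next, and that `baseline_kwh` is whichever energy figure the quoted percentage applies to; the slides specify neither. The function name and the numbers are illustrative only.

```python
def savings_range(baseline_kwh: float, degrees_raised: int,
                  low_pct: float = 2.0, high_pct: float = 4.0) -> tuple[float, float]:
    """Return (low, high) estimated kWh saved for a temperature increase.

    Assumption: each 1 degree C step reduces the remaining energy by a fixed
    percentage, so the reductions compound rather than add.
    """
    if degrees_raised < 0:
        raise ValueError("degrees_raised must be non-negative")

    def saved(pct: float) -> float:
        remaining_fraction = (1.0 - pct / 100.0) ** degrees_raised
        return baseline_kwh * (1.0 - remaining_fraction)

    return saved(low_pct), saved(high_pct)


# Illustrative baseline of 1,000,000 kWh per year and a 3 degree C increase.
low, high = savings_range(1_000_000, 3)
print(f"estimated saving: {low:,.0f} to {high:,.0f} kWh per year")
```

An estimate like this is only a planning aid. The fine-grained temperature and energy measurements described above are what would confirm whether the saving materialises.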

• Should monitor IT hardware (eg. IPMI) to fully optimise environmentals

– Understand the effect of power used by (inefficient) server fans

– To identify faulty equipment that we might be overcompensating for…

Environmental Monitoring (cont…)

• With few exceptions, the most successful methodology for improving energy conservation across all sectors is:

– Step 1: Identify who/what is responsible for significant energy waste

– Step 2: Drive behaviour to ‘encourage’ change

• What is the implication for the Data Centre ?

• Need to report (charge ?) IT power by customer, department or end-user

– Track energy (& energy efficiency) to the server, VM or even application level

– Who or what applications/services are the worst offenders ?

– Management can use data to drive better practice

Driving End-User Behaviour
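Reporting IT power by customer or department amounts to attributing per-server energy readings to an owner and ranking the totals. The sketch below shows one minimal way to do that. The server names, the owner mapping and the "unassigned" bucket are hypothetical, and real readings would come from whatever metering is in place.

```python
from collections import defaultdict


def energy_by_owner(samples, owner_of):
    """Sum energy per owner.

    samples:  iterable of (server_name, kwh) readings
    owner_of: dict mapping server_name -> customer or department
    Servers with no known owner are grouped under "unassigned".
    """
    totals = defaultdict(float)
    for server, kwh in samples:
        totals[owner_of.get(server, "unassigned")] += kwh
    return dict(totals)


def worst_offenders(totals, top=3):
    """Return the `top` owners with the highest energy use, largest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical readings and ownership, for illustration only.
samples = [("web01", 12.0), ("web02", 8.0), ("db01", 30.0), ("old07", 5.0)]
owner_of = {"web01": "marketing", "web02": "marketing", "db01": "finance"}

totals = energy_by_owner(samples, owner_of)
for owner, kwh in worst_offenders(totals):
    print(f"{owner}: {kwh} kWh")
```

Keeping an explicit "unassigned" bucket makes equipment that nobody claims visible, which is often a first pointer to servers that are no longer needed.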

• Most new data centres are being designed against PUE targets

– For a given IT hardware capacity, PUE is a good planning metric

– However, it is often a poor operational metric

• Most importantly: what if the servers are not doing any useful work ??

– The data centre may still have a ‘good’ PUE, but it would be very inefficient by any business metric

• We really need to monitor IT utilisation:

– Surveys imply that IT utilisation is between 5 and 10% for an un-virtualised DC, rising to between 10 and 20% for a fully virtualised DC

– In a typical DC, 10% of running servers are not in use at all (Green Grid Survey, 2010)

Next steps: DC design vs. operational efficiency

• Some simple ITUE metrics may be derived, e.g.:

– Normalised CPU utilisation/watt – for compute-bound tasks

– IOPS/watt – when I/O is predominant

– Bytes/watt – for network utilisation

– All three !

• Some end-users may be interested in application-related metrics:

– Database transactions/watt

– Page refresh/watt

– Search/watt

‘ITUE’ – A better class of efficiency metrics ?
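The per-watt metrics listed above could be derived from a single monitoring sample roughly as follows. This is a sketch under assumptions: the `ServerSample` fields and the choice of units (utilisation normalised to 0..1, network traffic in bytes per second) are not defined in the slides.

```python
from dataclasses import dataclass


@dataclass
class ServerSample:
    """One monitoring sample for a server (hypothetical structure)."""
    cpu_utilisation: float      # normalised to the range 0..1
    iops: float                 # storage I/O operations per second
    network_bytes_per_s: float  # network throughput in bytes per second
    watts: float                # electrical power drawn at sample time


def per_watt_metrics(sample: ServerSample) -> dict[str, float]:
    """Divide each utilisation figure by the power drawn."""
    if sample.watts <= 0:
        raise ValueError("power draw must be positive")
    return {
        "cpu_utilisation_per_watt": sample.cpu_utilisation / sample.watts,
        "iops_per_watt": sample.iops / sample.watts,
        # Named as on the slide; with throughput in bytes/s this is bytes per joule.
        "bytes_per_watt": sample.network_bytes_per_s / sample.watts,
    }


# Illustrative sample, not a measurement.
sample = ServerSample(cpu_utilisation=0.2, iops=500.0,
                      network_bytes_per_s=1_000_000.0, watts=250.0)
for name, value in per_watt_metrics(sample).items():
    print(f"{name}: {value}")
```

Which of the three figures matters most depends on whether the workload is compute-bound, I/O-bound or network-bound, so reporting all three side by side is a reasonable default.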

[Chart: Compute Utilisation Effectiveness, Storage Utilisation Effectiveness and Network Utilisation Effectiveness, each on a scale from 0 to 1]

• Understanding IT utilisation and ITUE metrics can help reduce overall power utilisation very significantly

– Remembering that PUE is relative to IT power !!

• In particular, it can also help us to identify:

– Who is using the power they are assigned in an efficient way

– Which servers/VM/applications are delivering best ‘value’

• In particular, ‘sweating’ the IT assets may not be smart after all!

– What is the efficiency of service delivery on individual platforms

– When do running costs exceed depreciation costs

– What replacement platform should be procured etc ??

Understanding IT utilisation
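The question of when running costs exceed depreciation costs can be framed as simple arithmetic. The sketch below compares a server's annual energy cost, scaled by PUE because facility overhead is proportional to IT power, with straight-line depreciation. The power draw, PUE, electricity price, purchase price and lifetime are all assumed values chosen for illustration, and straight-line depreciation is itself an assumption.

```python
HOURS_PER_YEAR = 8760


def annual_energy_cost(server_watts: float, pue: float, price_per_kwh: float) -> float:
    """Yearly cost of powering one server, including facility overhead via PUE."""
    it_kwh = server_watts / 1000.0 * HOURS_PER_YEAR
    return it_kwh * pue * price_per_kwh


def annual_depreciation(purchase_price: float, lifetime_years: float) -> float:
    """Straight-line depreciation (an assumption; other schedules exist)."""
    if lifetime_years <= 0:
        raise ValueError("lifetime must be positive")
    return purchase_price / lifetime_years


# Assumed figures: 400 W server, PUE of 1.9, 0.10 per kWh, 1800 purchase price, 3 years.
energy = annual_energy_cost(server_watts=400, pue=1.9, price_per_kwh=0.10)
depreciation = annual_depreciation(purchase_price=1800, lifetime_years=3)
print(f"energy cost per year:  {energy:.2f}")
print(f"depreciation per year: {depreciation:.2f}")
print("running cost exceeds depreciation:", energy > depreciation)
```

With these assumed numbers the energy bill is already larger than the yearly write-down, which is the situation in which keeping old hardware running may not be the cheaper option.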

• A: It is (an important) part of the answer

• Typically human behaviour is:

– A customer replaces a 3-year-old (then state-of-the-art) server with a new state-of-the-art server

– He puts a number of VMs on his new (much faster) server rather than the single OS instance on his much slower server

– He typically doubles his IT efficiency (from 10% to 20%)

• This demonstrates the need to spec new equipment based on historical application and user requirements

• As with hardware, some VMs may not be used at all over time…

Q: Isn’t Virtualisation the answer ?

• Monitoring and Reporting alone do not produce savings

• Use data to agree, plan & make iterative improvements:

– E.g. make incremental changes to data centre environmentals; raise CRAC temperatures; find hotspots; move equipment; improve airflow

– E.g. identify unused and underused servers and decommission them; identify servers that are not used at night, weekends etc. and employ active power management; define virtualisation strategy based on real data etc.

• This is not without its complexities

– Requires cross-cultural change (IT, Facilities, Building Management)

– Requires openness and end-user targeting (no-one is an angel…)

– Requires detailed planning and (often) down-time

• Rewards can be significant, even by focussing on simple changes

– >25% energy savings in 1st year ?

Continuous Iterative Improvement
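Identifying unused servers, and servers that are idle at night and at weekends, can start from timestamped utilisation samples. The following sketch classifies a server on that basis. The 2% idle threshold and the 08:00 to 18:00 weekday window are assumptions chosen for illustration; a real deployment would tune both from its own data.

```python
from datetime import datetime


def is_off_hours(ts: datetime) -> bool:
    """Weekends, or weekdays outside 08:00-18:00 (assumed office hours)."""
    return ts.weekday() >= 5 or ts.hour < 8 or ts.hour >= 18


def classify(samples, idle_threshold=0.02):
    """Classify a server from (timestamp, utilisation 0..1) samples.

    Returns "unused" if it never rises above the idle threshold,
    "office-hours only" if all activity falls within office hours
    (a candidate for active power management), otherwise "in use".
    """
    if not samples:
        raise ValueError("no samples to classify")
    busy_times = [ts for ts, utilisation in samples if utilisation > idle_threshold]
    if not busy_times:
        return "unused"
    if all(not is_off_hours(ts) for ts in busy_times):
        return "office-hours only"
    return "in use"


# Hypothetical samples: busy on a Monday morning, idle at night and on Saturday.
samples = [
    (datetime(2024, 1, 1, 10, 0), 0.50),  # Monday 10:00
    (datetime(2024, 1, 1, 23, 0), 0.00),  # Monday 23:00
    (datetime(2024, 1, 6, 12, 0), 0.01),  # Saturday 12:00
]
print(classify(samples))
```

A classification like this only flags candidates. Decommissioning or powering down still needs the agreement and planning described above.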

• Efficient DCs should monitor & manage both IT and Facilities systems in a coherent manner:

– Environmental systems (temperature, humidity, air-conditioning..)

– Power (at the distribution board, rack PDU and server PSU level …)

– IT equipment (using standard protocols such as IPMI and SNMP…)

– Operating systems & Virtual Machines (integrating with IT systems)

– ..and perhaps applications themselves

• In the future, we will move to the autonomous data centre

– Emphasis moves from monitoring to active management

– Potential for very significant energy savings…

Conclusions