Measuring and monitoring to support the EU
code of conduct
Michael Rudgyard (CTO)
Concurrent Thinking Ltd
• The participant commitments define minimum obligations (roughly):
– Provision of monthly DCiE / PUE measurements
– Provision of IT rated electrical load capacity of the DC
– Target inlet temperature for IT equipment (optional)
– External monthly average ambient temperature (optional)
– External monthly average dew point temperature (optional)
• It also requires the DC to commit to an energy-saving action plan:
– A number of potential ways to save energy are suggested
– Most (all ?) involve some level of monitoring
EU Code of Conduct – Participant Commitments
• It is simple (but neither cost-effective nor sensible) to monitor your data
centre using the ‘man and a clip-board’ technique
• Sadly, this is the ‘state of the art’ for a lot of data centres, each housing
many millions of pounds of high-tech IT equipment
• But information is power, and power is money….
monitoring vs. Monitoring (1)
• Much more effective to Monitor at as fine-grained a level as possible
– To truly understand where energy savings can be made
– To understand how factors vary over time / with load etc
– To give ample warning of potential (often critical) issues
– To report factual information to management
– To drive continuous iterative improvement over time
• Real energy and productivity savings require a ‘joined-up’ approach
– Managing buildings, data-centre facilities and IT in a unified manner
– .. opening the door to the possibility of orchestration of the data-centre
monitoring vs. Monitoring (2)
• First step is to monitor power; then understand where the power is going.
• Next step is to measure PUE
– Most new data centres are being designed against PUE targets
– Many existing data centres are looking to improve their PUE
– Aim to reduce energy utilisation through incremental improvements to PUE
– The average data centre has a PUE of 1.9 (Koomey, 2010), but most should be able to achieve a figure below 1.5 (??)
• Caveats:
– Officially, PUE needs to be an annualised average … not a ‘snap-shot’
– However, continuous PUE ‘snap-shots’ are useful to help drive improvement
Monitoring Energy and PUE (or DCiE)
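The distinction between an annualised PUE and a PUE ‘snap-shot’ can be made concrete with a minimal sketch. It assumes metered energy is available per interval as a pair of kWh readings (total facility and IT); the `EnergySample` type and the sample values are hypothetical and for illustration only. PUE is total facility energy divided by IT energy, and DCiE is its reciprocal expressed as a percentage.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EnergySample:
    """Energy used over one metering interval, in kWh (hypothetical record)."""
    facility_kwh: float  # everything drawn by the data centre
    it_kwh: float        # the portion delivered to IT equipment


def pue(facility_kwh: float, it_kwh: float) -> float:
    """PUE = total facility energy / IT energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return facility_kwh / it_kwh


def dcie_percent(facility_kwh: float, it_kwh: float) -> float:
    """DCiE = IT energy / total facility energy, as a percentage."""
    if facility_kwh <= 0:
        raise ValueError("facility energy must be positive")
    return 100.0 * it_kwh / facility_kwh


def annualised_pue(samples: list[EnergySample]) -> float:
    """Energy-weighted PUE over the whole period (sum first, then divide)."""
    total_facility = sum(s.facility_kwh for s in samples)
    total_it = sum(s.it_kwh for s in samples)
    return pue(total_facility, total_it)


if __name__ == "__main__":
    # Made-up readings for three intervals.
    samples = [
        EnergySample(190.0, 100.0),
        EnergySample(150.0, 100.0),
        EnergySample(320.0, 200.0),
    ]
    snapshots = [pue(s.facility_kwh, s.it_kwh) for s in samples]
    print("snap-shots:", [round(p, 2) for p in snapshots])
    print("mean of snap-shots:", round(mean(snapshots), 3))
    print("energy-weighted PUE:", round(annualised_pue(samples), 3))
```

With these made-up readings the energy-weighted figure is 1.65, while the plain average of the three snap-shots is roughly 1.67. The snap-shots remain useful for spotting trends, but the reported figure should come from summed energy over the period.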
• Cooling the data centre is the key overhead that is measured by PUE
– But many do not continuously monitor the effectiveness of cooling equipment
– Basic assumption: “if the air is cool enough, then the aircon is working… “
• But cooling infrastructure is generally depreciated over several years
– Despite expensive support contracts, its efficiency may diminish significantly..
– Its efficiency may also be influenced by other changes in the data centre
– When should cooling systems be replaced (OPEX vs. CAPEX) ????
• Need to track fine-grain power utilisation to really understand issues
Monitoring key infrastructure
• There are significant opportunities for improvements in most data centres
– The majority operate at temperatures >3-4°C below (old) ASHRAE recommendations (Paterson et al, 2009)
– A 1°C increase in temperature equates to a 2-4% reduction in energy (California Energy Commission, 2007; UK financial institution, 2011)
• It is critical to monitor temperature at as fine-grained a level as possible
– To understand where hot-spots are, and how these change over time
– To give ample warning of cooling failure with a smaller thermal ‘buffer’
– Relating temperatures to energy use helps drive iterative improvement
• The more real-time measurements, the better
– Ideally at the rack, sub-rack, server
– ……..or even processor level !!
Environmental Monitoring
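The 2-4% per degree rule of thumb quoted above can be turned into a rough savings range. The sketch below assumes the percentage scales linearly with each degree and applies to whatever baseline energy figure is supplied; neither assumption comes from the cited sources, so it is only plausible for small temperature increases. The function name and the example baseline are hypothetical.

```python
def saving_range_kwh(
    baseline_kwh: float,
    delta_c: float,
    low_per_c: float = 0.02,   # 2% per degree C (lower bound quoted above)
    high_per_c: float = 0.04,  # 4% per degree C (upper bound quoted above)
) -> tuple[float, float]:
    """Return (low, high) estimated energy saving in kWh.

    Assumption: savings scale linearly with the temperature increase,
    capped at 100% of the baseline.
    """
    if baseline_kwh < 0 or delta_c < 0:
        raise ValueError("baseline and temperature increase must be non-negative")
    low = baseline_kwh * min(1.0, low_per_c * delta_c)
    high = baseline_kwh * min(1.0, high_per_c * delta_c)
    return low, high


if __name__ == "__main__":
    # Hypothetical site: 1,000,000 kWh per year, set-point raised by 3 degrees C.
    low, high = saving_range_kwh(1_000_000, 3)
    print(f"estimated saving: {low:,.0f} to {high:,.0f} kWh per year")
```

An estimate like this is only a starting point. The measured change in energy use after each incremental set-point change is what should drive the next step.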
• Should monitor IT hardware (eg. IPMI) to fully optimise environmentals
– Understand the effect of power used by (inefficient) server fans
– To identify faulty equipment that we might be overcompensating for…
Environmental Monitoring (cont…)
• With few exceptions, the most successful methodology for improving energy conservation across all sectors is:
– Step 1: Identify who/what is responsible for significant energy waste
– Step 2: Drive behaviour to ‘encourage’ change
• What is the implication for the Data Centre ?
• Need to report (charge ?) IT power by customer, department or end-user
– Track energy (& energy efficiency) to the server, VM or even application level
– Which users or applications/services are the worst offenders ?
– Management can use data to drive better practice
Driving End-User Behaviour
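Reporting IT power by customer or department amounts to a simple aggregation once each metered server is tagged with an owner. The sketch below is a minimal illustration: the `(owner, server, it_kwh)` record layout, the owner names, and the choice to scale IT energy by PUE to apportion facility overhead are all assumptions, not something the Code of Conduct prescribes.

```python
from collections import defaultdict
from typing import Iterable


def energy_by_owner(
    readings: Iterable[tuple[str, str, float]],
    pue: float = 1.0,
) -> dict[str, float]:
    """Sum metered IT energy (kWh) per owner, largest consumer first.

    readings: (owner, server, it_kwh) tuples (hypothetical layout).
    pue: multiplier used to apportion cooling and other overheads
         in proportion to IT energy (an assumed allocation rule).
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    totals: dict[str, float] = defaultdict(float)
    for owner, _server, it_kwh in readings:
        totals[owner] += it_kwh * pue
    # Sort so the 'worst offenders' appear first in the report.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    readings = [
        ("finance", "srv-01", 120.0),
        ("web", "srv-02", 300.0),
        ("finance", "srv-03", 80.0),
    ]
    for owner, kwh in energy_by_owner(readings, pue=1.5).items():
        print(f"{owner}: {kwh:.1f} kWh")
```

The same aggregation could be keyed by VM or application instead of owner, provided the per-VM or per-application energy can be measured or estimated.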
• Most new data centres are being designed against PUE targets
– For a given IT hardware capacity, PUE is a good planning metric
– However, it is often a poor operational metric
• Most importantly: what if the servers are not doing any useful work ??
– The data centre may still have a ‘good’ PUE, but it would be very inefficient by any business metric
• We really need to monitor IT utilisation:
– Surveys imply that IT utilisation is between 5 & 10% for an un-virtualised DC, rising to between 10 & 20% for a fully virtualised DC
– In a typical DC, 10% of running servers are not in use at all (Green Grid Survey, 2010)
Next steps:
DC design vs. operational efficiency
• Some simple ITUE metrics may be derived, eg:
– Normalised CPU Utilisation/watt – for compute bound tasks
– IOPS/watt – when I/O is predominant
– Bytes/watt – for network utilisation
– All three !
• Some end-users may be interested in application-related metrics:
– Database transactions/watt
– Page refresh/watt
– Search/watt
‘ITUE’ – A better class of efficiency metrics ?
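The three simple per-watt metrics listed above could be computed from a single monitoring sample along the following lines. This is only a sketch: the slides do not define how CPU utilisation is normalised, so it is taken here as a fraction between 0 and 1, and the `ServerSample` fields and example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServerSample:
    """One monitoring sample for a server (hypothetical record layout)."""
    cpu_utilisation: float      # fraction of capacity in use, 0.0 to 1.0 (assumed normalisation)
    iops: float                 # storage I/O operations per second
    network_bytes_per_s: float  # network throughput
    power_watts: float          # measured power draw


def per_watt_metrics(sample: ServerSample) -> dict[str, float]:
    """Divide each utilisation figure by the power drawn at the same time."""
    if sample.power_watts <= 0:
        raise ValueError("power draw must be positive")
    return {
        "cpu_utilisation_per_watt": sample.cpu_utilisation / sample.power_watts,
        "iops_per_watt": sample.iops / sample.power_watts,
        "bytes_per_s_per_watt": sample.network_bytes_per_s / sample.power_watts,
    }


if __name__ == "__main__":
    sample = ServerSample(
        cpu_utilisation=0.15,
        iops=2000.0,
        network_bytes_per_s=5_000_000.0,
        power_watts=250.0,
    )
    for name, value in per_watt_metrics(sample).items():
        print(f"{name}: {value:g}")
```

Which of the three figures matters most depends on whether the workload is compute bound, I/O bound or network bound, so in practice all three would be tracked side by side.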
[Figure: Compute, Storage and Network Utilisation Effectiveness, each plotted on a scale from 0 to 1]
• Understanding IT utilisation and ITUE metrics can help reduce overall power utilisation very significantly
– Remembering that PUE is relative to IT power !!
• In particular, it can also help us to identify
– Who is using the power they are assigned in an efficient way
– Which servers/VM/applications are delivering best ‘value’
• In particular, ‘sweating’ the IT assets may not be smart after all !
– What is the efficiency of service delivery on individual platforms
– When do running costs exceed depreciation costs
– What replacement platform should be procured etc ??
Understanding IT utilisation
• A: It is (an important) part of the answer
• Typically human behaviour is:
– A customer replaces a 3 year old (then state-of-the-art) server with a new state-of-the-art server
– He puts a number of VMs on his new (much faster) server rather than the single OS instance on his much slower server
– He typically doubles his IT efficiency (from 10% to 20%)
• This demonstrates the need to spec new equipment based on historical application and user requirements
• As with hardware, some VMs may not be used at all over time…
Q: Isn’t Virtualisation the answer ?
• Monitoring and Reporting alone do not produce savings
• Use data to agree, plan & make iterative improvements:
– Eg. Make incremental changes to data centre environmentals; raise CRAC temperatures; find hotspots; move equipment; improve airflow
– Eg. Identify unused servers, underused servers and decommission; identify servers that are not used at night, weekends etc and employ active power management; define virtualisation strategy based on real data etc.
• This is not without its complexities
– Requires cross-cultural change (IT, Facilities, Building Management)
– Requires openness and end-user targeting (no-one is an angel…)
– Requires detailed planning and (often) down-time
• Rewards can be significant, even by focussing on simple changes
– >25% energy savings in 1st year ?
Continuous Iterative Improvement
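Identifying unused and underused servers from monitoring data can be sketched as a simple classification over historical utilisation samples. The thresholds below (peak below 1% for ‘unused’, average below 10% for ‘underused’) are illustrative assumptions only, as are the server names; a real policy would need agreeing between IT and Facilities.

```python
from statistics import mean


def classify_servers(
    utilisation_by_server: dict[str, list[float]],
    unused_below: float = 0.01,     # assumed threshold on peak utilisation
    underused_below: float = 0.10,  # assumed threshold on average utilisation
) -> dict[str, list[str]]:
    """Group servers into 'unused', 'underused' and 'in_use'.

    utilisation_by_server maps a server name to utilisation samples,
    each a fraction between 0.0 and 1.0.
    """
    result: dict[str, list[str]] = {"unused": [], "underused": [], "in_use": []}
    for name, samples in sorted(utilisation_by_server.items()):
        if not samples:
            continue  # no data: do not guess
        if max(samples) < unused_below:
            result["unused"].append(name)      # candidate for decommissioning
        elif mean(samples) < underused_below:
            result["underused"].append(name)   # candidate for consolidation
        else:
            result["in_use"].append(name)
    return result


if __name__ == "__main__":
    history = {
        "srv-01": [0.0, 0.0, 0.005],
        "srv-02": [0.02, 0.05, 0.30],
        "srv-03": [0.40, 0.55, 0.35],
    }
    print(classify_servers(history))
```

The same history, split by time of day, would show which servers are idle at night or at weekends and are therefore candidates for active power management.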
• Efficient DCs should monitor & manage both IT and Facilities systems in a coherent manner:
– Environmental systems (temperature, humidity, air-conditioning..)
– Power (at the distribution board, rack PDU and server PSU level …)
– IT equipment (using standard protocols such as IPMI and SNMP…)
– Operating systems & Virtual Machines (integrating with IT systems)
– ..and perhaps applications themselves
• In the future, we will move to the autonomous data centre
– Emphasis moves from monitoring to active management
– Potential for very significant energy savings…
Conclusions