Optimizing High Performance Computing Applications for Energy
Transcript of Optimizing High Performance Computing Applications for Energy
![Page 1: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/1.jpg)
Optimizing Energy for High Performance Applications
Discovering when to Compute Green
![Page 2: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/2.jpg)
What is HPC? Welcome to our world
Aerospace and Space, Automotive, Oil and Gas, EDA, Weather and climate, Financial, Defence, Government Labs, Life sciences, Academic
![Page 3: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/3.jpg)
Energy in HPC
The world’s top 500 supercomputers cost 400M€ annually in energy alone
If software reduces its energy footprint … payback could be enormous
Solution
Enable developers and users to improve application energy consumption
![Page 4: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/4.jpg)
Our tools
Debug Tune Profile
Develop
![Page 5: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/5.jpg)
Two Key Questions
• Can developers optimize code for energy?
• Can owners and users tune applications for energy?
![Page 6: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/6.jpg)
What is energy?
Approximations for Energy
Heuristics
• Floating point, vector operations, memory access
• L1 or L2 misses vs main memory: orders of magnitude difference in energy
Low level measurement
• Real data from some processor, memory subsystems, accelerators
• Available in kernel: Intel RAPL
Server level monitoring
• PDU and server level readings
• Real data, real energy
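For the low level measurement tier, here is a minimal sketch of reading the RAPL package energy counter through the Linux powercap sysfs interface. It assumes the `intel_rapl` driver is loaded, that package 0 is exposed as `intel-rapl:0`, and that the process is allowed to read `energy_uj` (often root only). The helper names are illustrative and are not part of any Allinea tool.

```python
from pathlib import Path
import time

# Assumed location of CPU package 0 in the Linux powercap sysfs tree.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def energy_delta_uj(start_uj, end_uj, max_range_uj):
    """Microjoules used between two counter readings, allowing one wraparound."""
    if end_uj >= start_uj:
        return end_uj - start_uj
    # The counter passed max_range_uj and restarted from zero.
    return (max_range_uj - start_uj) + end_uj


def read_uj(name, rapl_dir=RAPL_DIR):
    """Read one integer counter file (microjoules) from the RAPL directory."""
    return int((rapl_dir / name).read_text().strip())


def measure_joules(work, rapl_dir=RAPL_DIR):
    """Run work() and return (package energy in joules, elapsed seconds)."""
    max_range = read_uj("max_energy_range_uj", rapl_dir)
    start_e, start_t = read_uj("energy_uj", rapl_dir), time.perf_counter()
    work()
    end_e, end_t = read_uj("energy_uj", rapl_dir), time.perf_counter()
    return energy_delta_uj(start_e, end_e, max_range) / 1e6, end_t - start_t
```

This only covers what the processor package reports; PDU or server level readings are still needed to see the whole node.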
![Page 7: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/7.jpg)
Optimizing Time
Capture performance
• Profiler creates application profile
• Allinea MAP records multiple processes
Find bottlenecks
• Source code viewer pinpoints key consumers
• Timelines find unusual patterns
Optimize
• Rewrite key loops
• Reorganize memory access patterns
• Change algorithms
![Page 8: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/8.jpg)
CPU Package and System Metrics
Whole System Power Usage
CPU Package Power Usage
![Page 9: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/9.jpg)
Coprocessor Metrics
• Coprocessors and accelerators
  – NVIDIA CUDA GPU
  – Intel Xeon Phi
• Devices provide kernel access to power
  – High power consumption when active
  – Low power consumption when idle
  – Very efficient in flops per watt
• System now has variable energy usage to consider
  – Optimization for time: is the GPU route quicker?
  – Optimization for energy: which is most efficient?
    • (GPU + server power) × GPU time
    • Or server power × CPU time?
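The comparison above can be sketched as energy = average power × runtime for each route. The power and runtime figures below are hypothetical, chosen only to illustrate the arithmetic, and the model assumes the server draws the same average power on both routes.

```python
def energy_joules(avg_power_watts, runtime_seconds):
    """Energy in joules from average power and runtime."""
    return avg_power_watts * runtime_seconds


def compare_routes(server_w, gpu_w, cpu_time_s, gpu_time_s):
    """Compare a CPU-only run with a run that also powers the GPU."""
    cpu_route = energy_joules(server_w, cpu_time_s)
    gpu_route = energy_joules(server_w + gpu_w, gpu_time_s)
    # GPU runtime at which both routes would use the same energy.
    break_even_s = server_w * cpu_time_s / (server_w + gpu_w)
    return {
        "cpu_route_j": cpu_route,
        "gpu_route_j": gpu_route,
        "gpu_break_even_s": break_even_s,
        "gpu_uses_less_energy": gpu_route < cpu_route,
    }


# Hypothetical figures: 400 W server, 250 W GPU, 100 s on CPU, 40 s on GPU.
result = compare_routes(400.0, 250.0, 100.0, 40.0)
print(result)
```

With these made-up numbers the GPU route wins on energy as long as it finishes in under about 61.5 seconds.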
![Page 10: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/10.jpg)
Two Key Questions
• Can developers optimize code for energy? YES
• Can owners and users tune applications for energy?
![Page 11: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/11.jpg)
Tuning Time
No instrumentation needed
No source code needed
No recompilation needed
Less than 5% runtime overhead
Fully scalable
Explicit and usable output
![Page 12: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/12.jpg)
Allinea Performance Reports: Example Report
Run details
Visual breakdown chart
Clear categorization
Explanation of figures and advice for follow-up
Breakdown of resource usage across CPU, MPI, I/O
![Page 13: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/13.jpg)
Integrated Energy Information
![Page 14: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/14.jpg)
Key Observation: In a Nutshell
• For many HPC workloads
  – The faster an application completes, the lower its energy consumption
  – Or … optimize for speed and you are (usually) already optimizing for energy
• But for some HPC and non-HPC cases
  – Frequency scaling saves energy
![Page 15: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/15.jpg)
Two Key Questions
• Can developers optimize code for energy? YES
• Can owners and users tune applications for energy? YES
… But should they?
• Are we counting all energy?
• Are we considering all costs?
![Page 16: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/16.jpg)
What is energy?
Approximations for Energy
Heuristics
• Floating point, vector operations, memory access
• L1 or L2 misses vs main memory: orders of magnitude difference in energy
Low level measurement
• Real data from some processor, memory subsystems
• Available in kernel: Intel RAPL
Server level monitoring
• PDU and server level readings
• Real data, real energy
Full system monitoring
• Air-con
• Servers, switches, storage…
![Page 17: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/17.jpg)
Two Key Questions
• When should developers optimize code for energy?
• When should owners and users tune applications for energy?
![Page 18: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/18.jpg)
Frequency Scaling
Some workloads have low compute requirement, but high data volume
Data crunching vs number crunching
Processor is over-powered for the speed of memory, disk or network
CPU frequency can be scaled down in software
Providing information to developer, user and system owner
Allinea MAP
Allinea Performance Reports
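As a rough illustration of scaling CPU frequency down in software, the sketch below caps a core's maximum frequency through the Linux cpufreq sysfs files (values are in kHz). It assumes a cpufreq driver that exposes `scaling_max_freq`, and writing to it normally requires root. The function names are hypothetical, and many clusters expose frequency control through the job scheduler instead.

```python
from pathlib import Path

# Standard Linux sysfs root for per-CPU controls.
CPU_ROOT = Path("/sys/devices/system/cpu")


def cpufreq_file(cpu, name, root=CPU_ROOT):
    """Path to one cpufreq attribute of one logical CPU."""
    return root / f"cpu{cpu}" / "cpufreq" / name


def get_khz(cpu, name="scaling_cur_freq", root=CPU_ROOT):
    """Read a cpufreq attribute; the kernel reports frequencies in kHz."""
    return int(cpufreq_file(cpu, name, root).read_text().strip())


def cap_frequency(cpu, ghz, root=CPU_ROOT):
    """Write an upper frequency limit for one CPU and return the kHz written."""
    khz = int(round(ghz * 1_000_000))
    cpufreq_file(cpu, "scaling_max_freq", root).write_text(f"{khz}\n")
    return khz
```

For example, `cap_frequency(0, 1.7)` would ask the kernel to keep logical CPU 0 at or below 1.7 GHz; a full-node change would loop over every CPU.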
![Page 19: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/19.jpg)
A lot of codes are memory-bound
![Page 20: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/20.jpg)
Multiple cores share bandwidth
[Diagram labels: Core 1, Core 2, Core 3, Core 4, … | Lots of clever technology | Main memory]
![Page 21: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/21.jpg)
Can we tune them for energy efficiency?
[Diagram labels: Core 1, Core 2, Core 3, Core 4, … | Lots of clever technology | Main memory]
![Page 22: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/22.jpg)
How can we improve energy efficiency?
Buy a new cluster with ambient warm water cooling and an integrated espresso machine
Reduce CPU frequency
Run on fewer cores per node
![Page 23: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/23.jpg)
How can we improve energy efficiency?
Buy a new cluster with ambient warm water cooling and an integrated espresso machine
Reduce CPU frequency?
Run on fewer cores per node?
![Page 24: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/24.jpg)
The Experiment
One simple code
A well-understood wave equation solver
One compute node
Minimize effect of MPI communications
Change CPU frequency and number of cores
Measure the results with Allinea Performance Reports
![Page 25: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/25.jpg)
4 PPN @ 2.1 GHz, 30 seconds
![Page 26: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/26.jpg)
4 PPN @ 2.1 GHz, 30 seconds
4 PPN @ 1.3 GHz, 34 seconds
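A short worked calculation on these two runs shows why the code looks memory-bound: the clock drops by well over a third, yet the runtime grows far less. Only the four figures from the slide are used; the function name is illustrative.

```python
def relative_change(new, baseline):
    """Fractional change of `new` relative to `baseline`."""
    return (new - baseline) / baseline


# Figures from the two runs: 30 s at 2.1 GHz, 34 s at 1.3 GHz (4 PPN each).
slowdown = relative_change(34.0, 30.0)
clock_change = relative_change(1.3, 2.1)

print(f"runtime increase: {slowdown:.1%}")      # 13.3%
print(f"clock reduction:  {-clock_change:.1%}")  # 38.1%
```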
![Page 27: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/27.jpg)
[Chart: Slowdown relative to 4 PPN @ 2.1 GHz. Horizontal axis: 2, 4, 6, 8; vertical axis: 0% to 70%; series: 1.3 GHz, 1.7 GHz, 2.1 GHz. Data gathered with Performance Reports’ CSV export.]
1.7 GHz run completes as quickly as at 2.1 GHz
![Page 28: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/28.jpg)
[Chart: Energy savings relative to 4 PPN @ 2.1 GHz. Horizontal axis: 2, 4, 6, 8; vertical axis: -10% to 20%; series: 1.3 GHz, 1.7 GHz, 2.1 GHz. Data gathered with Performance Reports’ CSV export.]
5-10% energy savings with zero performance impact
![Page 29: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/29.jpg)
[Chart: Slowdown relative to 4 PPN @ 2.1 GHz. Horizontal axis: 2, 4, 6, 8; vertical axis: 0% to 70%; series: 1.3 GHz, 1.7 GHz, 2.1 GHz. Data gathered with Performance Reports’ CSV export.]
15% energy savings with 20% performance impact
![Page 30: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/30.jpg)
The Results
[Chart: horizontal axis: 2, 4, 6, 8; vertical axis: -10% to 20%; series: 1.3 GHz, 1.7 GHz, 2.1 GHz]
2 PPN: 15% energy savings, 20% increased runtime
[Chart: horizontal axis: 2, 4, 6, 8; vertical axis: 0% to 70%; series: 1.3 GHz, 1.7 GHz, 2.1 GHz]
1.7 GHz: 6% energy savings for free
So… should we run every job at a reduced clock speed?
Or only ever use half the cores on each node?
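Since energy is average power multiplied by runtime, the two headline results imply how much the average power draw changed. The sketch below derives that ratio from the percentages on the slide; it is plain arithmetic, and the function name is illustrative.

```python
def implied_power_ratio(energy_savings, runtime_increase):
    """Average power relative to baseline, given fractional energy savings
    and fractional runtime increase (energy = average power * runtime)."""
    return (1.0 - energy_savings) / (1.0 + runtime_increase)


# 2 PPN result: 15% energy savings, 20% increased runtime.
two_ppn = implied_power_ratio(0.15, 0.20)
# 1.7 GHz result: 6% energy savings with no runtime change.
mid_clock = implied_power_ratio(0.06, 0.0)

print(f"2 PPN average power:   {two_ppn:.1%} of baseline")
print(f"1.7 GHz average power: {mid_clock:.1%} of baseline")
```

In other words, the 2 PPN run has to draw roughly 29% less average power just to deliver a 15% energy saving, because it runs 20% longer.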
![Page 31: Optimizing High Performance Computing Applications for Energy](https://reader035.fdocuments.net/reader035/viewer/2022070521/58ef44ea1a28ab601a8b46ab/html5/thumbnails/31.jpg)
Improving energy efficiency
• Each application and system has different characteristics
  – Tools can show if the application wastes power unnecessarily
  – Developers can see where to optimize and change code
  – Users can improve efficiency without changing code
• Don’t forget the opportunity cost
  – In HPC, slowing down applications costs science
  – Machines and PhDs have finite lifetimes, and their cost dominates
• Time and energy are not the same
  – Optimize for time before optimizing for energy
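The opportunity cost point can be sketched with a simple cost model: each job pays for its energy and for the machine time it occupies. All prices and job sizes below are hypothetical, picked only to show how a 15% energy saving can be outweighed by a 20% longer runtime when machine cost dominates.

```python
def job_cost(runtime_h, energy_kwh, price_per_kwh, machine_cost_per_h):
    """Total cost of one job: energy bill plus amortized machine time."""
    return energy_kwh * price_per_kwh + runtime_h * machine_cost_per_h


# Hypothetical baseline: 10 h job using 4 kWh, 0.20 per kWh, 2.00 per node-hour.
baseline = job_cost(10.0, 4.0, 0.20, 2.0)
# Same job tuned for energy: 15% less energy, 20% longer runtime.
tuned = job_cost(10.0 * 1.20, 4.0 * 0.85, 0.20, 2.0)

print(f"baseline: {baseline:.2f}, tuned for energy: {tuned:.2f}")
```

With these assumed figures the energy-tuned job costs more overall; with cheap hardware and expensive energy the balance could tip the other way.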