![Page 1: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/1.jpg)
Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications
Jeff Diamond (1), Martin Burtscher (2), John D. McCalpin (3), Byoung-Do Kim (3), Stephen W. Keckler (1, 4), James C. Browne (1)
(1) University of Texas, (2) Texas State, (3) Texas Advanced Computing Center, (4) NVIDIA
![Page 2: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/2.jpg)
Trends In Supercomputers
![Page 3: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/3.jpg)
Is multicore an issue?
![Page 4: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/4.jpg)
The Problem: Multicore Scalability
![Page 5: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/5.jpg)
The Problem: Multicore Scalability
![Page 6: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/6.jpg)
Optimizations Differ in Multicore
Base code vs Multicore Optimized code
![Page 7: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/7.jpg)
Paper Contributions
• Studies multicore related bottlenecks
• Identifies performance measurement challenges unique to multicore systems
• Presents systematic approach to multicore performance analysis
• Demonstrates principles of optimization
![Page 8: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/8.jpg)
Talk Outline
• Introduction
• Approach: An HPC Case Study
• Multicore Measurement Issues
• Optimization Example
• Conclusion
![Page 9: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/9.jpg)
Approach: An HPC Case Study
• Examine a real HPC application
  • Major functions add variety
• What is a typical HPC application?
  • Many exhibit low arithmetic intensity
  • Typical of explicit / iterative solvers, stencils
  • Finite volume / elements / differences
  • Molecular dynamics, particle simulations, graph search, sparse MM, etc.
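To make "low arithmetic intensity" concrete, the sketch below estimates flops per byte for one sweep of a 5-point 2-D stencil. The stencil itself, the double-precision element size, and the idealized traffic model (each grid point read once and written once, ignoring write-allocate traffic) are illustrative assumptions, not figures from the slides.

```python
def arithmetic_intensity(flops_per_point, bytes_per_point):
    """Floating-point operations performed per byte moved to or from memory."""
    return flops_per_point / bytes_per_point

# Hypothetical 5-point stencil:
#   out[i][j] = c * (in[i-1][j] + in[i+1][j] + in[i][j-1] + in[i][j+1] + in[i][j])
STENCIL_FLOPS = 5          # 4 additions + 1 multiplication per grid point
BYTES_PER_DOUBLE = 8

# Assumption: neighbours are reused from cache, so each point costs
# one 8-byte read of the input and one 8-byte write of the output.
STENCIL_BYTES = 2 * BYTES_PER_DOUBLE

stencil_ai = arithmetic_intensity(STENCIL_FLOPS, STENCIL_BYTES)
print(f"5-point stencil: {stencil_ai:.4f} flops/byte")
```

Under these assumptions the kernel performs well under one flop per byte moved, which is why codes of this kind tend to be limited by the memory system rather than by the floating-point units.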
![Page 10: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/10.jpg)
Approach: An HPC Case Study
Application: HOMME
• High Order Method Modeling Environment
• 3-D atmospheric simulation from NCAR
• Required for NSF acceptance testing
• Excellent scaling, highly optimized
• Arithmetic intensity typical of stencil codes
Supercomputers:
• Ranger – 62,976 cores, 579 Teraflops
  • 2.3 GHz quad core AMD Barcelona chips
• Longhorn – 2,048 cores + 512 GPUs
  • 2.5 GHz quad core Intel Nehalem-EP chips
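As a quick sanity check, the Ranger figures quoted on this slide are consistent with each other if each core retires 4 double-precision flops per cycle. That per-cycle rate is an assumption added here; the slide states only the core count, clock speed, and total.

```python
RANGER_CORES = 62_976
CLOCK_HZ = 2.3e9

# Assumption (not stated on the slide): 4 double-precision flops
# per cycle per Barcelona core.
FLOPS_PER_CYCLE = 4

peak_tflops = RANGER_CORES * CLOCK_HZ * FLOPS_PER_CYCLE / 1e12
print(f"Estimated peak: {peak_tflops:.1f} Teraflops")
```

The result rounds to the 579 Teraflops listed for Ranger.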
![Page 11: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/11.jpg)
![Page 12: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/12.jpg)
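Multicore Performance Bottlenecks
[Diagram: memory hierarchy of one node. Labels: node, single chip, single DIMM, local DRAM. The chip holds four L1 caches, four L2 caches, and one L3 cache.]
• Private L1/L2 cache
• Shared L3 cache
• Shared off-chip BW
• Shared DRAM page caches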
![Page 13: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/13.jpg)
Disturbances Persist Longer
![Page 14: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/14.jpg)
Measurement Implications
![Page 15: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/15.jpg)
Measurements Must Be Lightweight

| Action | Cycles |
| --- | --- |
| Read Counter | 9 |
| Read Four Counters | 30 |
| Call Function | 40 |
| PAPI READ | 400 |
| System Call | 5,000 |
| TLB Page Initialization | 25,000 |

Duration of major HOMME functions

| Function Duration | Calls Per Second | % Exec Time |
| --- | --- | --- |
| 2,000 cycles or less | 100,000 | 20% |
| 2,000 to 10,000 cycles | 20,000 | 10% |
| 10K to 200K cycles | 1,600 | 15% |
| 200K to 1M cycles | 200 | 15% |
| 1M to 10M cycles | - | 0% |
| 10M or more cycles | 4 | 35% |
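The two tables can be combined to see why measurement cost matters for short functions. The sketch below uses the cycle costs listed for each action and assumes, as an illustration only, that a function is bracketed by one measurement before it and one after it.

```python
# Cycle costs copied from the "Action" table.
COST_CYCLES = {
    "read counter": 9,
    "read four counters": 30,
    "call function": 40,
    "PAPI read": 400,
    "system call": 5_000,
    "TLB page initialization": 25_000,
}

def bracketing_overhead(action, function_cycles):
    """Relative cost of measuring once before and once after a function."""
    return 2 * COST_CYCLES[action] / function_cycles

# A function at the upper edge of the shortest duration bucket.
SHORT_FUNCTION_CYCLES = 2_000

for action in ("read counter", "PAPI read", "system call"):
    overhead = bracketing_overhead(action, SHORT_FUNCTION_CYCLES)
    print(f"{action:>12}: {overhead:.1%} overhead")
```

With these inputs, a direct counter read adds under 1% to a 2,000-cycle function, a pair of PAPI reads adds 40%, and a pair of system calls costs several times the function itself.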
![Page 16: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/16.jpg)
Multicore Measurement Issues
• Performance issues in shared memory system
  • Context sensitive
  • Nondeterministic
  • Highly non-local
• Measurement disturbance is significant
  • Accessing memory or delaying core
  • Hard to “bracket” measurement effects
  • Disturbances can last billions of cycles
  • Bottlenecks can be “bursty”
• Conclusion – need multiple tools
![Page 17: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/17.jpg)
![Page 18: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/18.jpg)
Multicore Performance Bottlenecks
[Diagram: memory hierarchy of one node. Labels: node, single chip, single DIMM, local DRAM. The chip holds four L1 caches, four L2 caches, and one L3 cache.]
• Shared L3 cache
• Shared off-chip BW
• Shared DRAM page caches
![Page 19: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/19.jpg)
Measurement Approach
• Find important functions
• Compare performance counters at min/max core density
• Identify key multicore bottleneck:
  • L3 capacity – L3 miss rates increase with density
  • Off-chip BW – BW usage at min density greater than share
  • DRAM contention – DRAM page miss rates increase with density
• For small and medium functions, follow up with lightweight / temporal measurements
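The three criteria can be expressed as a small classifier over counter summaries taken at minimum and maximum core density. This is a minimal sketch: the `Counters` fields, the `rate_increase` threshold, and the sample numbers are all hypothetical, and "share" is interpreted here as total chip bandwidth divided by the number of cores.

```python
from dataclasses import dataclass

@dataclass
class Counters:
    """Per-function counter summary at one core density (hypothetical fields)."""
    l3_miss_rate: float         # L3 misses / L3 accesses
    dram_page_miss_rate: float  # DRAM page misses / DRAM accesses
    bw_per_core_gbs: float      # off-chip bandwidth used by one core, GB/s

def identify_bottlenecks(min_density, max_density, chip_bw_gbs, cores_per_chip,
                         rate_increase=1.2):
    """Apply the three slide criteria; rate_increase is an assumed threshold."""
    found = []
    if max_density.l3_miss_rate > rate_increase * min_density.l3_miss_rate:
        found.append("L3 capacity")
    # A core's fair share of off-chip bandwidth once every core is active.
    fair_share = chip_bw_gbs / cores_per_chip
    if min_density.bw_per_core_gbs > fair_share:
        found.append("off-chip BW")
    if max_density.dram_page_miss_rate > rate_increase * min_density.dram_page_miss_rate:
        found.append("DRAM contention")
    return found

# Made-up numbers for illustration only.
one_core = Counters(l3_miss_rate=0.10, dram_page_miss_rate=0.20, bw_per_core_gbs=4.0)
four_cores = Counters(l3_miss_rate=0.25, dram_page_miss_rate=0.21, bw_per_core_gbs=2.4)
print(identify_bottlenecks(one_core, four_cores, chip_bw_gbs=10.0, cores_per_chip=4))
```

In this made-up case the function is flagged for L3 capacity and off-chip bandwidth, but not for DRAM contention, because its page miss rate barely changes with density.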
![Page 20: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/20.jpg)
Typical HOMME Loop
![Page 21: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/21.jpg)
Apply “Microfission” (First Line)
![Page 22: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/22.jpg)
“Loop Microfission”
• Local, context free optimization
  • Each array processed independently
  • Add high-level blocking to fit cache
• Reduces total DRAM banks
  • Statistically reduces DRAM page miss rate
• Reduces instantaneous working set size
  • Helps with L3 capacity and off-chip BW
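The structure of the transformation can be sketched as follows. The HOMME loop itself is not reproduced in this transcript, so the loop body, array names, and block size below are hypothetical. Python will not show the cache or DRAM effects; the sketch only illustrates how one fused loop becomes per-array loops inside an outer blocking loop.

```python
def fused(a, b, c, s):
    """Original form: one loop updates every output array in each iteration."""
    n = len(a)
    x, y, z = [0.0] * n, [0.0] * n, [0.0] * n
    for i in range(n):
        x[i] = s * a[i] + b[i]
        y[i] = a[i] * c[i]
        z[i] = b[i] - c[i]
    return x, y, z

def microfissioned(a, b, c, s, block=4):
    """Fissioned form: within each block, each output array gets its own loop."""
    n = len(a)
    x, y, z = [0.0] * n, [0.0] * n, [0.0] * n
    # Outer blocking loop: the block size would be chosen so that the
    # inputs for one block stay in cache across the three inner loops.
    for start in range(0, n, block):
        stop = min(start + block, n)
        for i in range(start, stop):   # only x is written here
            x[i] = s * a[i] + b[i]
        for i in range(start, stop):   # only y is written here
            y[i] = a[i] * c[i]
        for i in range(start, stop):   # only z is written here
            z[i] = b[i] - c[i]
    return x, y, z

a = [float(i) for i in range(10)]
b = [2.0 * i for i in range(10)]
c = [0.5 * i for i in range(10)]
print(fused(a, b, c, 3.0) == microfissioned(a, b, c, 3.0))
```

Both versions perform the same arithmetic in the same order per element, so the results match exactly. The fissioned version touches fewer distinct arrays at any one moment, which is the property the slide credits for the lower DRAM page miss rate and smaller instantaneous working set.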
![Page 23: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/23.jpg)
Microfission Results
![Page 24: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/24.jpg)
![Page 25: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/25.jpg)
Summary and Conclusions
• HPC scalability must include multicore
  • Not well understood
  • Requires new analysis and measurement techniques
  • Optimizations differ from single-core
• Microfission is just one example
  • Multicore locality optimization for shared caches
  • Improves performance by 35%
![Page 26: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/26.jpg)
Future Work
• Expect multicore observations to apply to other HPC applications with low arithmetic intensity
  • Irregular parallel applications: adaptive meshes, heterogeneous workloads
  • Irregular blocking applications: graph traversal
• Wider range of multicore (memory-focused) optimizations
  • Recomputation
  • Relocating data
  • Temporary storage reduction
  • Structural changes
![Page 27: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/27.jpg)
Thank You
Any Questions?
![Page 28: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/28.jpg)
BACKUP SLIDES…
![Page 29: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/29.jpg)
Less DRAM Contention
![Page 30: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/30.jpg)
Multicore Optimized, Low Density
![Page 31: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/31.jpg)
Most important functions
![Page 32: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/32.jpg)
L1 & L2 Miss Rates Less Relevant
![Page 33: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/33.jpg)
TEST
![Page 34: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/34.jpg)
HPC Applications Have Low Intensity
![Page 35: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/35.jpg)
Loads Per Cycle vs Intrachip Scaling
![Page 36: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/36.jpg)
TEST
![Page 37: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/37.jpg)
TEST
![Page 38: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/38.jpg)
Oscillations Affect L2 Miss Rate
![Page 39: 1 Evaluation and Optimization of Multicore Performance Bottlenecks in Supercomputing Applications Jeff Diamond 1, Martin Burtscher 2, John D. McCalpin.](https://reader036.fdocuments.net/reader036/viewer/2022062314/56649ec05503460f94bcb7c8/html5/thumbnails/39.jpg)
Oscillations Affect L2 Miss Rate