Language-Centric Performance Analysis of OpenMP Programs with Aftermath
Andi Drebes
The University of Manchester, School of Computer Science
Advanced Processor [email protected]
Joint work with: Jean-Baptiste Brejon, Antoniu Pop, Karine Heydemann, Albert Cohen
IWOMP 2016
Andi Drebes – Aftermath: Language-Centric Performance Analysis 1 / 10
Analysis of OpenMP Programs
Layers: Programming model, Application, Run-time, OS, Hardware
Programming model examples:
#pragma omp task depend(...)
{ ... }
#pragma omp parallel for
for(int i = 0; i < N; i++)
{ ... }
New Tools for Performance Analysis
Frequent topics for performance analysis:
- Amount of parallelism and load balancing
- Duration of execution phases
- Synchronization overhead (e.g., barriers)
- Choice of an appropriate loop schedule
- Data distribution on NUMA systems
- Relate hardware events to loops / tasks
Our tools: Aftermath & Aftermath-OpenMP
- Aftermath: Graphical tool for performance analysis
- Aftermath-OpenMP: Instrumented LLVM/clang run-time
Outline
1. Overview of Trace-based Analysis
2. Overview of Aftermath’s GUI
3. Demo
4. Overhead of Tracing
5. Summary & Conclusion
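Trace-based Analysis with Aftermath
[Diagram: the application runs on the Aftermath-OpenMP run-time (on top of the OS and the hardware), which produces a trace file; Aftermath reads the trace file.]
Aftermath provides:
- Visualizations & exploration
- Statistics & accurate numbers
- Programming model-centric analysis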
Terminology
Loop construct:
#pragma omp parallel for schedule(static, 10)
for(int i = 0; i < 100; i++)
{ ... }
Loop: iteration space 0 to 99, divided into chunks C0 to C9
Chunks per worker (each bracketed range is an iteration set):
- Worker 0: C0 [0-9], C3 [30-39], C6 [60-69], C9 [90-99]
- Worker 1: C1 [10-19], C4 [40-49], C7 [70-79]
- Worker 2: C2 [20-29], C5 [50-59], C8 [80-89]
[Time line for Worker 0, Worker 1 and Worker 2, with two intervals labelled "Iteration period"]
Aftermath: Overview of the GUI
[Screenshots of the Aftermath GUI; captions and annotations recovered from the slides:]
- Main panels: time line, detailed text view, filters, statistics
- Time line: axes are time and processors; shows the activity during execution
- Time line: Run-time states
  - Sequential execution (orange)
  - Parallel loop (green)
  - Barrier synchronization (dark red)
  - No activity (background visible)
- Time line: Loop constructs
- State statistics
- Histogram showing duration of iteration periods
- Detailed text view for parallel loops
- Filter for loop constructs
Demo: NPB’s MG benchmark
Benchmark: NPB MG
- NPB 2.3 C implementation from the Omni Compiler Project
- C input class (512 × 512 elements)
Test platform
- SGI UV 2000 (Xeon E5-4640)
- 192 cores (Hyperthreading disabled)
- 24 NUMA nodes, 756 GiB RAM
- LLVM/clang 3.8.0
- Aftermath-OpenMP for trace generation
DEMO
Demo: Summary
Execution phases
- Parallel initializations + main computation
- Sequential execution in between
Time spent in barriers
- States on time line / statistics panel
Load imbalance
- Sufficient parallelism
- High load imbalance, but not due to partitioning / schedule
- Same NUMA node → approx. same execution time
Solution
- Change allocation scheme: one big allocation
- Reduce number of workers: #iters = n × #workers
- Result: 35× speedup
Overhead of Tracing
[Figure: bar chart of the relative increase of execution time [%] per benchmark (mean for 50 runs; error bars: standard deviation). Benchmarks: CG, EP, FT, LU, MG (NPB-2.3, loop-based); sparselu, strassen, alignment, fft, sort (BOTS 1.1.2, task-based); followed by the geometric mean (abs.). Bar labels in extraction order: 0.88, 0.60, -0.66, 0.35, 1.77, 0.01, 5.79, -0.03, 0.24, 4.07, 0.46.]
Test system
- SGI UV 2000 (192 cores, 24 NUMA nodes)
Missing benchmarks
- Outlier: floorplan (+380% execution time; very small tasks)
- Segfaults (BT, nqueens, uts) / excessive execution time (IS) / verification failure (health)
Using Aftermath & Aftermath-OpenMP
Drop-in replacement for libomp with wrapper script:
$ aftermath-openmp-trace -o events.ost -- <program> <args>
$ aftermath events.ost
Source code and tutorial: http://www.openstream.info/aftermath
Virtual Machine (Aftermath + Aftermath-OpenMP + sample traces + documentation): http://www.openstream.info/vm
Summary
Aftermath
- Reactive graphical user interface for trace analysis
- Programming model-centric analysis: Loops and tasks
Aftermath-OpenMP
- Instrumented LLVM/clang OpenMP run-time
- Low tracing overhead
Future work
- Dependent tasks
- Automate recurring analyses
On-line resources
- http://www.openstream.info/aftermath (Main website)
- http://www.openstream.info/vm (VM image)