Open-Source Profiling and Analysis Tools for Aurora
Performance Studies on Knights Landing
1 August 2016
Rashawn L. Knapp, Supada Laosooksathit, Preeti Suman, Tatyana Mineeva
Intel, Software and Service Group (SSG)
Systems Engineering, Architecture & [email protected], [email protected]
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice
Open-Source Tools Team: Executive Summary
Introduce team and goals
Tools
Benchmark Suite
Performance Studies on Xeon® Phi™ Coprocessor Knights Landing (KNL)
‐ Description of architecture and study platform
‐ Greater Chicago Area Systems Research 2016 – attended and discussed work
‐ Open|SpeedShop support for Intel compilers and study
‐ CAM-SE with Open|SpeedShop and HPCToolkit
Aurora Preparation
Summary and Next Steps
Open-Source Tools Team: Goals and Purpose
Our Role
‐ Collaborations with tool owners
‐ Enable open-source HPC analyzers, ensuring these performance tools run well on Intel’s current and upcoming Xeon Phi platforms
CORAL
‐ Theta: Knights Landing (KNL), 8.5 petaflops (PFLOPS)
‐ Aurora: Knights Hill (KNH), peak 180 PFLOPS, >50,000 compute nodes, >7 PB of DRAM and persistent memory
1 Aug. 2016 Intel Confidential
Open-Source Tools Team: Tools and Status
Tool | Description | Status
Low-level foundation
Dyninst (UMD, UW) | Dynamic binary instrumentation tool | HSW EP, KNL, Intel/GCC compilers, Intel MPI; several HSW patches. Versions 8.2.1 to 9.2.0; verified: test suite
PAPI (UTK) | Interface to count CPU and off-core performance events | HSW EP, KNL (with Intel patch), Intel and GCC compilers; test suite completed; patch to enable off-core HSW events in component tests (Jul. ’15). Versions 5.4.1 to 5.4.3; verified: test suite
High-level tools
TAU (UO) | Profiling and tracing tool for parallel applications; supports MPI and OpenMP; incorporates Dyninst and PAPI | HSW EP: built with Intel and GCC C/C++/Fortran compilers, Intel MPI, MPICH, Dyninst, PAPI. Version 2.24.1 (HSW); KNL in progress (v2.25.1)
Score-P (VI-HPS) | Provides a common interface for high-level tools; incorporates Dyninst and PAPI | HSW EP: Intel/GCC compilers, Dyninst, PAPI, Intel MPI/MPICH. Version 3.0; compiled but not yet tested (goal is to enable it with TAU)
Open|SpeedShop (Krell Institute) | Dynamic instrumentation tool for Linux: profiling and event tracing for MPI and OpenMP programs; incorporates Dyninst and PAPI | HSW EP, KNL, Intel/GCC compilers, Intel MPI, Dyninst, PAPI; patch to enable O|SS installation with Intel compilers (Q1 ’16). Version 2.2.*; verified: benchmark suite
HPCToolkit (Rice) | Lightweight sampling measurement tool for HPC; incorporates Dyninst* and PAPI | HSW EP, KNL, Intel/GCC compilers, Intel MPI. Versions 5.4.*; verified: benchmark suite
Darshan (ALCF) | I/O monitoring tool | HSW EP, KNL, Intel/GCC compilers, Intel MPI. Versions 2.3.1, 3.0.1; verified: benchmark suite
Independent
Valgrind | Base framework for constructing dynamic analysis tools; includes a suite of tools, including a debugger and error detection for memory and pthreads | HSW EP, KNL, Intel/GCC compilers. Version 3.10.1; verified: test suite
memcheck | Detects memory errors (stack, heap, memory leaks) and MPI distributed-memory errors; C and C++ | HSW EP, KNL, Intel/GCC compilers. Version 3.10.1; verified: test suite
helgrind | Pthreads error detection: synchronization errors, incorrect use of the pthreads API, potential deadlocks, data races; C, C++, Fortran | Enabled on HSW EP and KNL with Intel/GCC compilers. Version 3.10.1; verified: test suite
Open-Source Tools Team: Benchmark Suite
‐ CORAL benchmarks
‐ Well-known HPC benchmarks (WKHPCB)
Name | Category | Type | Priority | Notes
LSMS | Scalable Science | Computation, Process communication, System scalability | 1 |
CAM-SE | Throughput | Computation, Process communication | 1 |
AMG2013 | Throughput | Computation, Process communication, Memory-access bound | 1 |
UMT2013 | Throughput | Computation, Process communication, Memory-access bound | 1 |
IOR | Skeleton | Process communication, IO | 1 |
STRIDE | Skeleton | Computation, Memory | 1 |
FTQ | Skeleton | Computation | 2 |
HPL | WKHPCB | Computation, Process communication | 1 |
STREAM | WKHPCB | Memory bandwidth | 1 |
HPCG | WKHPCB | Computation, Memory, Process communication | 1 |
NPB | WKHPCB | Computation, Memory, Process communication | 1 | Serial, MPI, OpenMP, hybrid
HPCC | WKHPCB | Computation, Memory, Process communication | 1 | HPL, STREAM
Open-Source Tools Team: KNL Performance Studies
The Knights Landing Processor
Knights Landing (KNL) Highlights
‐ Can serve either as the main processor or as a coprocessor
‐ Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
‐ 14-nanometer processor
‐ The chip contains 36 tiles, each with 2 cores, 2 Vector Processing Units (VPUs) per core, and 1 MB of L2 cache; the tiles are interconnected by a 2D mesh.
‐ 16 GB of high-bandwidth Multi-Channel DRAM (MCDRAM) and 6 channels of DDR4
‐ Intel Omni-Path controller supporting the Intel Omni-Path Architecture (OPA)
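As a worked example of what the tile description implies, the derivation below computes the core count and the per-cycle double-precision (DP) peak. It assumes all 36 tiles are active (shipping parts may enable fewer cores) and that each VPU completes one 512-bit fused multiply-add (FMA) per cycle, counted as 2 FLOPs per 64-bit lane; the clock frequency f is left symbolic because it is not stated here.

```latex
\begin{aligned}
N_{\text{cores}} &= 36~\text{tiles} \times 2~\tfrac{\text{cores}}{\text{tile}} = 72 \\
\text{lanes}_{\text{DP}} &= \tfrac{512~\text{bits}}{64~\text{bits}} = 8 \\
\text{FLOP}_{\text{core/cycle}} &= 2~\text{VPUs} \times 8~\text{lanes} \times 2~(\text{FMA}) = 32 \\
P_{\text{peak}} &= N_{\text{cores}} \times 32 \times f = 2304\,f~\text{FLOP/s}
\end{aligned}
```

This is a theoretical upper bound; measured FLOP counts require instrumentation such as the Intel SDE approach mentioned in the summary.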
Open-Source Tools Team: KNL Performance Studies
Open|SpeedShop Enabling
‐ Enabled compilation with the Intel compiler
‐ The official release now includes an Intel-specific compilation option
‐ No critical bugs were found on KNL
‐ On serial benchmarks, the pcsamp (hotspot) overhead is below 5% in 90% of cases
[Figure: Open|SpeedShop pcsamp overhead (%) and benchmark standalone time (s) per benchmark, for four builds: Intel O|SS with Intel-built benchmark; GCC O|SS with Intel-built benchmark; Intel O|SS with GCC-built benchmark; GCC O|SS with GCC-built benchmark. Benchmarks: aobench (cilk, ser), NPB W (is, cg, ep, mg, bt, sp, dc, ft, lu), collision (tbb, cilk, ser), umt, rt ser, mandelbrot (tbb, ser), wave2d ser, hpcg, rtm_stencil (tbb, ser), perlin (tbb, ser), amg, fwq, lsms, volrend ser.]
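The overhead figure compares each benchmark's run time under the tool with its standalone run time. A minimal sketch of that calculation, assuming overhead is defined as the relative slowdown (t_tool - t_standalone) / t_standalone × 100; the function names and the sample timings are hypothetical placeholders, not measurements from this study.

```python
"""Sketch: relative profiling overhead and the share of runs under a threshold.

Assumption: overhead = (tool time - standalone time) / standalone time * 100.
All timings below are hypothetical placeholders, not measured data.
"""


def overhead_percent(standalone_s: float, profiled_s: float) -> float:
    """Relative slowdown (in percent) of a profiled run versus a standalone run."""
    if standalone_s <= 0:
        raise ValueError("standalone time must be positive")
    return (profiled_s - standalone_s) / standalone_s * 100.0


def fraction_below(overheads: list[float], threshold_pct: float) -> float:
    """Fraction of runs whose overhead is strictly below threshold_pct."""
    if not overheads:
        raise ValueError("no overheads given")
    return sum(1 for o in overheads if o < threshold_pct) / len(overheads)


if __name__ == "__main__":
    # Hypothetical (standalone, profiled) timings in seconds.
    runs = {
        "bench_a": (100.0, 102.0),
        "bench_b": (50.0, 51.0),
        "bench_c": (10.0, 11.0),
        "bench_d": (200.0, 204.0),
    }
    ovh = {name: overhead_percent(t0, t1) for name, (t0, t1) in runs.items()}
    for name, pct in ovh.items():
        print(f"{name}: {pct:.1f}%")
    share = fraction_below(list(ovh.values()), 5.0)
    print(f"share below 5%: {share:.0%}")
```

Relative rather than absolute overhead is used so that short and long benchmarks can be compared on the same scale.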
Open-Source Tools Team: KNL Performance Studies
CAM-SE: HPCToolkit and Open|SpeedShop Profiling
Experiment setup
‐ Purpose: compare the profiling capabilities of open-source tools on CAM-SE
‐ Approach:
  ‐ Open|SpeedShop: pcsamp trials
  ‐ HPCToolkit: hpcrun trials
  ‐ Compare the top-function reports from the tools
‐ Benchmark: CORAL CAM-SE
‐ Hardware and software environments:
  ‐ Single-node KNL machine
  ‐ Intel- and GCC-compiled versions
  ‐ Various combinations of MPI process and OpenMP thread counts
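The comparison step can be made concrete by treating each tool's report as a ranked list of function names and measuring how much the top-N entries agree. A minimal sketch, assuming the reports have already been exported to simple (function, percent of samples) lists; the function names and percentages below are hypothetical, and no real Open|SpeedShop or HPCToolkit output is parsed here.

```python
"""Sketch: compare the top-N hotspot functions reported by two profilers.

Assumption: each report is already a list of (function_name, percent_of_samples)
pairs; parsing real Open|SpeedShop or HPCToolkit output is out of scope.
"""

Report = list[tuple[str, float]]


def top_n(report: Report, n: int) -> list[str]:
    """Names of the n functions with the largest sample share, highest first."""
    ranked = sorted(report, key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


def compare_top(report_a: Report, report_b: Report, n: int) -> dict:
    """Overlap statistics between the top-n function sets of two reports."""
    a, b = set(top_n(report_a, n)), set(top_n(report_b, n))
    union = a | b
    return {
        "common": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        # Jaccard index: shared functions over all functions seen in either top-n.
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }


if __name__ == "__main__":
    # Hypothetical reports; names and percentages are placeholders.
    oss_pcsamp = [("func_x", 30.0), ("func_y", 20.0), ("func_z", 10.0), ("func_w", 5.0)]
    hpcrun = [("func_y", 28.0), ("func_x", 25.0), ("func_v", 12.0), ("func_z", 4.0)]
    print(compare_top(oss_pcsamp, hpcrun, n=3))
```

With the placeholder data, the two reports share func_x and func_y in their top three, giving a Jaccard index of 0.5. A set-based measure is used because the tools may attribute samples differently, so exact percentages are not expected to match; a rank-correlation measure could be substituted if ordering matters.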
Open-Source Tools Team: Aurora Preparation
‐ Enable tools in preparation for Aurora
‐ Optimizations to enable better tool performance
Open-Source Tools Team: Summary and Next Steps
Summary
‐ Our KNL enabling work is mostly complete
Next Steps
‐ Complete the in-progress KNL performance studies
‐ Complete tool enabling (Score-P with TAU)
‐ Transition to Aurora
Other resources from Intel:
‐ FLOPS calculation for KNL with Intel® Software Development Emulator (Intel® SDE): https://software.intel.com/en-us/articles/calculating-flop-using-intel-software-development-emulator-intel-sde
‐ OpenHPC: aggregation of common ingredients required for Linux HPC cluster deployment
‐ http://www.openhpc.community/
Questions
Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © 2014, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804