Titan - Early Experience with the Titan System
Buddy Bland, Oak Ridge Leadership Computing Facility
Oak Ridge National Laboratory
Office of Science, U.S. Department of Energy
SC12, November 2012
ORNL's Titan Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors

Footprint: 4,352 ft2 (404 m2)

SYSTEM SPECIFICATIONS:
Peak performance of 27.1 PF (24.5 PF GPU + 2.6 PF CPU)
18,688 Compute Nodes, each with a 16-core AMD Opteron CPU and an NVIDIA Tesla K20x GPU
512 Service and I/O nodes
200 Cabinets
710 TB total system memory
Cray Gemini 3D Torus Interconnect
8.9 MW peak power, 8.3 MW average
The x86 processor provides fast, single-thread performance for control & communications.
AMD Opteron 6274 CPU: 16 cores, 141 GFLOPs peak
GPUs are designed for extreme parallelism, performance & power efficiency.
NVIDIA Tesla K20x GPU:
14 Streaming Multiprocessors (SMX)
2,688 CUDA cores
1.31 TFLOPs peak (double precision)
6 GB GDDR5 memory
HPL: > 2.0 GFLOPS/W (Titan full-system measurement)
Titan: Cray XK7 System

Compute Node: 1.45 TF, 38 GB
Board: 4 Compute Nodes, 5.8 TF, 152 GB
Cabinet: 24 Boards / 96 Nodes, 139 TF, 3.6 TB
Why GPUs? High performance and power efficiency on a path to exascale.

Hierarchical parallelism: Improves scalability of applications by exposing more parallelism through code refactoring and source-code directives.
Heterogeneous multi-core processor architecture: Use the right processor for each task.
Data locality: Keep the data near the processing. The GPU has high bandwidth to its local memory for rapid access.
Explicit data management: Explicitly manage data movement between CPU and GPU memories (a sketch of this follows below).
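As an illustration of the explicit data management point above, here is a minimal OpenACC-style sketch in C: a data region copies the arrays to GPU memory, keeps them resident across the offloaded loop, and copies the result back. The array names, sizes, and the saxpy-like kernel are illustrative assumptions, not code from the talk.

#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];
    double a = 2.0;

    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Explicit data management: copy x and y to GPU memory once,
       keep them resident across the kernel, copy y back at the end. */
    #pragma acc data copyin(x[0:N]) copy(y[0:N])
    {
        /* Offload the loop to the GPU; the compiler generates the kernel. */
        #pragma acc parallel loop
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];
    }

    printf("y[0] = %f\n", y[0]);
    return 0;
}

Without an OpenACC compiler the pragmas are ignored and the code still runs correctly on the CPU, which is part of the appeal of the directive approach.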
Hybrid Programming Model

On Jaguar, with 299,008 cores, we were seeing the limits of a single level of MPI scaling for most applications.
To take advantage of the vastly larger parallelism in Titan, users need to use hierarchical parallelism in their codes (see the sketch after this list):
Distributed memory: MPI, SHMEM, PGAS
Node local: OpenMP, Pthreads, local MPI communicators
Within threads: vector constructs on the GPU, libraries, OpenACC
These are the same types of constructs needed on any multi-PFLOPS computer to scale to the full size of the system!
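A minimal sketch of how those three levels combine in one code: MPI across nodes for distributed memory, OpenMP for node-local threading, and (as a comment) the within-thread GPU level via OpenACC. The chunk size, array, and reduction are illustrative assumptions rather than an actual Titan application.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N_LOCAL 4096   /* assumed per-rank chunk size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double u[N_LOCAL];
    double local_sum = 0.0, global_sum = 0.0;

    /* Level 2: node-local threading with OpenMP. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < N_LOCAL; i++) {
        u[i] = (double)(rank * N_LOCAL + i);
        local_sum += u[i];
    }

    /* Level 3 (within a thread) would offload inner kernels to the GPU,
       e.g. with "#pragma acc parallel loop" around the hot loop. */

    /* Level 1: distributed memory across nodes with MPI. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}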
How do you program these nodes?

Compilers
OpenACC is a set of compiler directives that allows the user to express hierarchical parallelism in the source code so that the compiler can generate parallel code for the target device: GPU, MIC, or vector SIMD on a CPU (an example directive is sketched after this list).
The Cray compiler supports XK7 nodes and is OpenACC compatible.
The CAPS HMPP compiler supports C, C++ and Fortran compilation for heterogeneous nodes with OpenACC support.
The PGI compiler supports OpenACC and CUDA Fortran.

Tools
The Allinea DDT debugger scales to full system size and, with ORNL support, will debug heterogeneous (x86/GPU) apps.
ORNL has worked with the Vampir team at TU Dresden to add support for profiling heterogeneous nodes.
CrayPAT and Cray Apprentice support XK6 programming.
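To make the directive approach concrete, here is a minimal sketch of a loop nest annotated with OpenACC gang/vector clauses so the compiler can map the expressed hierarchy onto the GPU (or ignore the pragmas and run serially). The matrix-vector kernel and its dimensions are assumptions for illustration only.

#include <stdio.h>

#define ROWS 2048
#define COLS 2048

static double A[ROWS][COLS], x[COLS], y[ROWS];

int main(void) {
    for (int j = 0; j < COLS; j++) x[j] = 1.0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            A[i][j] = 0.001;

    /* Outer loop mapped to gangs (thread blocks), inner loop to vector lanes;
       the directives express the hierarchy, the compiler generates GPU code. */
    #pragma acc parallel loop gang copyin(A, x) copyout(y)
    for (int i = 0; i < ROWS; i++) {
        double sum = 0.0;
        #pragma acc loop vector reduction(+:sum)
        for (int j = 0; j < COLS; j++)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }

    printf("y[0] = %f\n", y[0]);
    return 0;
}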
Early Science Applications on Titan

Material Science (WL-LSMS): Role of material disorder, statistics, and fluctuations in nanoscale materials and systems.
Combustion (S3D): Combustion simulations to enable the next generation of diesel/bio-fuels to burn more efficiently.
Climate Change (CAM-SE): Answer questions about specific climate change adaptation and mitigation scenarios; realistically represent features like precipitation patterns/statistics and tropical storms.
Nuclear Energy (Denovo): Unprecedented high-fidelity radiation transport calculations that can be used in a variety of nuclear energy and technology applications.
Biofuels (LAMMPS): A multiple-capability molecular dynamics code.
Astrophysics (NRDF): Radiation transport critical to astrophysics, laser fusion, combustion, atmospheric dynamics, and medical imaging.
How Effective are GPUs on Scalable Applications?
OLCF-3 Early Science Codes: very early performance measurements on Titan

Comparison: XK7 (with K20x) vs. XE6
Cray XK7: K20x GPU plus AMD Opteron 6274 CPU
Cray XE6: dual AMD Opteron 6274 CPUs
Cray XK6 without GPU: single AMD Opteron 6274 CPU

Application | Performance ratio (XK7 vs. XE6) | Comments
S3D | 1.8 | Turbulent combustion; 6% of Jaguar workload
Denovo sweep | 3.8 | Sweep kernel of 3D neutron transport for nuclear reactors; 2% of Jaguar workload
LAMMPS | 7.4* (mixed precision) | High-performance molecular dynamics; 1% of Jaguar workload
WL-LSMS | 3.8 | Statistical mechanics of magnetic materials; 2% of Jaguar workload; 2009 Gordon Bell winner
CAM-SE | 1.8* (estimate) | Community atmosphere model; 1% of Jaguar workload
Questions?
The research and activities described in this presentation were performed using the resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
Want to join our team? ORNL is hiring. See http://jobs.ornl.gov