Programming Heterogeneous (GPU) Systems
Jeffrey Vetter
http://ft.ornl.gov
Presented to Extreme Scale Computing Training Program
ANL: St. Charles, IL
2 August 2013
TH-2 System
• Compute nodes have 3.432 Tflop/s per node
  – 16,000 nodes
  – 32,000 Intel Xeon CPUs
  – 48,000 Intel Xeon Phis
• Operations nodes
  – 4,096 FT CPUs as operations nodes
• Proprietary interconnect: TH2 Express
• 1 PB memory (host memory only)
• Global shared parallel storage is 12.4 PB
• Cabinets: 125 + 13 + 24 = 162 compute/communication/storage cabinets
  – ~750 m²
• NUDT and Inspur
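The per-node figure and node count above imply a system peak. The short sketch below only multiplies the numbers quoted on the slide; the variable names are illustrative and nothing here is measured data.

```python
# Figures quoted on the TH-2 slide.
NODES = 16_000
TFLOPS_PER_NODE = 3.432
XEON_CPUS = 32_000
XEON_PHIS = 48_000

# System peak in petaflop/s (1 PF = 1000 TF).
peak_pflops = NODES * TFLOPS_PER_NODE / 1000.0

# Processor counts per node implied by the totals.
xeons_per_node = XEON_CPUS // NODES
phis_per_node = XEON_PHIS // NODES

print(f"peak: {peak_pflops:.1f} PF")
print(f"per node: {xeons_per_node} Xeon CPUs + {phis_per_node} Xeon Phis")
```

The result, about 54.9 PF, is in line with the 54 PF listed for Tianhe-2 in the Contemporary HPC Architectures table.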
ORNL’s “Titan” Hybrid System: Cray XK7 with AMD Opteron and NVIDIA Tesla processors
SYSTEM SPECIFICATIONS:
• Peak performance of 27.1 PF
  – 24.5 GPU + 2.6 CPU
• 18,688 compute nodes, each with:
  – 16-core AMD Opteron CPU
  – NVIDIA Tesla “K20x” GPU
  – 32 + 6 GB memory
• 512 service and I/O nodes
• 200 cabinets
• 710 TB total system memory
• Cray Gemini 3D Torus Interconnect
• 8.9 MW peak power
• 4,352 ft²
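The Titan figures are internally consistent, which a few lines of arithmetic can confirm. This is a sketch using only the slide's numbers; the GFLOPS-per-watt value is derived here from peak performance and peak power and is not stated on the slide.

```python
# Figures quoted on the Titan slide.
GPU_PF, CPU_PF = 24.5, 2.6
NODES = 18_688
HOST_GB, GPU_GB = 32, 6
PEAK_POWER_MW = 8.9

# Peak performance is the sum of the GPU and CPU contributions.
peak_pf = GPU_PF + CPU_PF

# Total memory: every node carries host memory plus GPU memory.
total_memory_tb = NODES * (HOST_GB + GPU_GB) / 1000.0

# Derived efficiency: 1 PF per MW equals 1 GFLOPS per watt.
gflops_per_watt = peak_pf / PEAK_POWER_MW

print(f"peak: {peak_pf:.1f} PF")
print(f"memory: {total_memory_tb:.0f} TB")
print(f"efficiency at peak: {gflops_per_watt:.2f} GFLOPS/W")
```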
Keeneland – Full Scale System
Keeneland Full Scale (KFS) system installed in October 2012
• 264 HP SL250 G8 nodes in 11 compute racks
• 792 M2090 GPUs contribute to aggregate system peak of ~615 TF
• Over 200 users, 150 projects using KFS
• Mellanox 384p FDR InfiniBand Switch
• Integrated with NICS Datacenter Lustre and XSEDE
• Full PCIe G3 x16 bandwidth to all GPUs

[Figure: KFS system hierarchy with peak performance at each level]
  Xeon E5-2670: 166 GFLOPS
  M2090: 665 GFLOPS
  ProLiant SL250 G8 (2 CPUs, 3 GPUs): 2327 GFLOPS, 32/18 GB
  S6500 Chassis (4 nodes): 9308 GFLOPS
  Rack (6 chassis): 55848 GFLOPS
  Keeneland System (11 compute racks): 614450 GFLOPS

http://keeneland.gatech.edu
J.S. Vetter, R. Glassbrook et al., “Keeneland: Bringing heterogeneous GPU computing to the computational science community,” IEEE Computing in Science and Engineering, 13(5):90-5, 2011, http://dx.doi.org/10.1109/MCSE.2011.83.
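The GFLOPS figures in the Keeneland diagram can be rebuilt level by level from the two component values. The sketch below assumes that 166 GFLOPS belongs to the Xeon E5-2670 and 665 GFLOPS to the M2090, which is the assignment that makes the node, chassis, and rack figures come out exactly.

```python
# Assumed per-component peaks (GFLOPS), as read from the diagram.
CPU_GFLOPS = 166   # Xeon E5-2670
GPU_GFLOPS = 665   # M2090

# Packaging hierarchy from the slide.
CPUS_PER_NODE, GPUS_PER_NODE = 2, 3
NODES_PER_CHASSIS = 4
CHASSIS_PER_RACK = 6
RACKS = 11

node = CPUS_PER_NODE * CPU_GFLOPS + GPUS_PER_NODE * GPU_GFLOPS
chassis = NODES_PER_CHASSIS * node
rack = CHASSIS_PER_RACK * chassis
system = RACKS * rack

total_nodes = RACKS * CHASSIS_PER_RACK * NODES_PER_CHASSIS
total_gpus = total_nodes * GPUS_PER_NODE

print(f"node {node}, chassis {chassis}, rack {rack}, system {system} GFLOPS")
print(f"{total_nodes} nodes, {total_gpus} GPUs")
```

The node, chassis, and rack values match the slide (2327, 9308, 55848), as do the 264 nodes and 792 GPUs. The computed system total of 614328 GFLOPS is slightly below the 614450 shown on the slide, presumably because the component values here are rounded; both round to the ~615 TF quoted above.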
Contemporary HPC Architectures

Date  System              Location             Comp                    Comm         Peak (PF)  Power (MW)
2009  Jaguar; Cray XT5    ORNL                 AMD 6c                  Seastar2     2.3        7.0
2010  Tianhe-1A           NSC Tianjin          Intel + NVIDIA          Proprietary  4.7        4.0
2010  Nebulae             NSCS Shenzhen        Intel + NVIDIA          IB           2.9        2.6
2010  Tsubame 2           TiTech               Intel + NVIDIA          IB           2.4        1.4
2011  K Computer          RIKEN/Kobe           SPARC64 VIIIfx          Tofu         10.5       12.7
2012  Titan; Cray XK7     ORNL                 AMD + NVIDIA            Gemini       27         9
2012  Mira; BlueGeneQ     ANL                  SoC                     Proprietary  10         3.9
2012  Sequoia; BlueGeneQ  LLNL                 SoC                     Proprietary  20         7.9
2012  Blue Waters; Cray   NCSA/UIUC            AMD + (partial) NVIDIA  Gemini       11.6
2013  Stampede            TACC                 Intel + MIC             IB           9.5        5
2013  Tianhe-2            NSCC-GZ (Guangzhou)  Intel + MIC             Proprietary  54         ~20
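One way to read the table is as peak performance per unit of power. The sketch below ranks the systems by PF per MW, which is numerically the same as GFLOPS per watt. The ratio is derived here and is not a column in the original table; Blue Waters is left out because no power figure is listed, and the Tianhe-2 power is the approximate ~20 MW value.

```python
# (system, peak in PF, power in MW), copied from the table.
SYSTEMS = [
    ("Jaguar", 2.3, 7.0),
    ("Tianhe-1A", 4.7, 4.0),
    ("Nebulae", 2.9, 2.6),
    ("Tsubame 2", 2.4, 1.4),
    ("K Computer", 10.5, 12.7),
    ("Titan", 27.0, 9.0),
    ("Mira", 10.0, 3.9),
    ("Sequoia", 20.0, 7.9),
    ("Stampede", 9.5, 5.0),
    ("Tianhe-2", 54.0, 20.0),  # power is approximate in the table
]

def gflops_per_watt(entry):
    """Peak PF divided by MW; 1 PF/MW equals 1 GFLOPS/W."""
    _, peak_pf, power_mw = entry
    return peak_pf / power_mw

ranked = sorted(SYSTEMS, key=gflops_per_watt, reverse=True)
for entry in ranked:
    print(f"{entry[0]:12s} {gflops_per_watt(entry):5.2f} GFLOPS/W")
```

These are peak-over-peak ratios, so they say nothing about sustained efficiency on real applications.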
Emerging Computing Architectures
• Heterogeneous processing
  – Many cores
  – Fused, configurable memory
• Memory
  – 3D stacking
  – New devices (PCRAM, ReRAM)
• Interconnects
  – Collective offload
  – Scalable topologies
• Storage
  – Active storage
  – Non-traditional storage architectures (key-value stores)
• Improving performance and programmability in face of increasing complexity
  – Power, resilience
HPC (all) computer design is more fluid now than in the past two decades.
Managed by UT-Battelle for the U.S. Department of Energy
AMD Llano’s fused memory hierarchy
K. Spafford, J.S. Meredith, S. Lee, D. Li, P.C. Roth, and J.S. Vetter, “The Tradeoffs of Fused Memory Hierarchies in Heterogeneous Architectures,” in ACM Computing Frontiers (CF). Cagliari, Italy: ACM, 2012.
Note: Both SB and Llano are consumer parts, not server parts.
Future Directions in Heterogeneous Computing
• Over the next decade: Heterogeneous computing will continue to increase in importance
• Manycore
• Hardware features
  – Transactional memory
  – Random number generators
  – Scatter/Gather
  – Wider SIMD/AVX
• Synergies with BIGDATA, mobile markets, graphics
• Top 10 list of features to include from application perspective. Now is the time!
Applications must use a mix of programming models
• MPI
  – Low overhead
  – Resource contention
  – Locality
• OpenMP, Pthreads
  – SIMD
  – NUMA
• OpenACC, CUDA, OpenCL
  – Memory use, coalescing
  – Data orchestration
  – Fine grained parallelism
  – Hardware features
Communication
MPI Profiling
Georgia Tech / Computational Science and Engineering / Vetter
Communication – MPI
• MPI dominates HPC
• Communication can severely restrict performance and scalability
• Developer has explicit control of MPI in application
  – Communication/computation overlap
  – Collectives
• MPI tools provide a wealth of information
  – Statistics – number and size of messages sent in a certain time
  – Tracing – event-based log per task of all communication events
MPI Provides the MPI Profiling Layer
• MPI Spec provides the MPI Profiling Layer to allow interposition between application and MPI runtime
• PERUSE is a recent attempt to provide more detailed information from the runtime for performance measurement
  – http://www.mpi-peruse.org/
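Interposition means that a tool supplies its own version of each MPI entry point, records what it needs, and then forwards the call to the real implementation. In MPI itself this is done in C through the name-shifted PMPI_ entry points (a tool defines MPI_Send and calls PMPI_Send inside it). The sketch below is only a Python analogy of that pattern: it does not use MPI, and `pmpi_send`, `mpi_send`, and `stats` are hypothetical stand-ins chosen for illustration.

```python
import time
from collections import defaultdict

def pmpi_send(buf, dest):
    """Stand-in for the runtime's real send (plays the role of PMPI_Send)."""
    return len(buf)

# Statistics collected by the interposed layer, keyed by call name.
stats = defaultdict(lambda: {"count": 0, "bytes": 0, "seconds": 0.0})

def mpi_send(buf, dest):
    """Wrapper the application calls (plays the role of MPI_Send)."""
    start = time.perf_counter()
    result = pmpi_send(buf, dest)          # forward to the real routine
    elapsed = time.perf_counter() - start
    entry = stats["Send"]
    entry["count"] += 1
    entry["bytes"] += len(buf)
    entry["seconds"] += elapsed
    return result

# The application code is unchanged: it simply calls mpi_send.
for dest in (1, 2, 3):
    mpi_send(b"x" * 1024, dest)

print(stats["Send"]["count"], "calls,", stats["Send"]["bytes"], "bytes")
```

The application never sees the wrapper, which is why tools built on the profiling layer typically need relinking rather than source changes.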
MPI Performance Tools Provide Varying Levels of Detail
• mpiP (http://mpip.sourceforge.net/)
  – Statistics on
    • Counts, sizes, min, max for point-to-point and collective operations
    • MPI IO counts, sizes, min, max
  – Lightweight
    • Has scaled to 64k processors on BGL
    • No large tracefiles
    • Low perturbation
  – Callsite specific information
• Tau, Vampir, Intel Tracing Tool, Paraver
  – Statistical and tracing information
  – Varying levels of complexity, perturbation, and tracefile size
• Paraver (http://www.bsc.es/plantillaA.php?cat_id=486)
  – Covered in detail here
MPI Profiling
Why do these systems have different performance on POP?
MPI Performance Profiling: mpiP
• mpiP Basics
  – Easy to use tool
    • Statistical-based MPI profiling library
    • Requires relinking but no source level changes
    • Compiling with “-g” is recommended
  – Provides average times for each MPI call site
  – Has been shown to be very useful for scaling analysis
mpiP example: POP MPI performance

@ mpiP
@ Command             : ./pop
@ Version             : 2.4
@ MPIP Build date     : Jul 18 2003, 11:41:57
@ Start time          : 2003 07 18 15:01:16
@ Stop time           : 2003 07 18 15:03:53
@ MPIP env var        : [null]
@ Collector Rank      : 0
@ Collector PID       : 25656
@ Final Output Dir    : .
@ MPI Task Assignment : 0 h0107.nersc.gov
@ MPI Task Assignment : 1 h0107.nersc.gov
More mpiP Output for POP

---------------------------------------------------------------------------
@--- MPI Time (seconds) ---------------------------------------------------
---------------------------------------------------------------------------
Task    AppTime    MPITime     MPI%
   0        157       1.89     1.21
   1        157       6.01     3.84
   *        313       7.91     2.52
---------------------------------------------------------------------------
@--- Callsites: 6 ---------------------------------------------------------
---------------------------------------------------------------------------
 ID Lev File                  Line Parent_Funct  MPI_Call
  1   0 global_reductions.f      0 ??            Wait
  2   0 stencils.f               0 ??            Waitall
  3   0 communicate.f         3122 .MPI_Send     Cart_shift
  4   0 boundary.f            3122 .MPI_Send     Isend
  5   0 communicate.f            0 .MPI_Send     Type_commit
  6   0 boundary.f               0 .MPI_Send     Isend
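The MPI% column in the MPI Time table appears to be MPITime as a percentage of AppTime. The sketch below recomputes it from the printed values under that assumption. Because the table shows rounded numbers, the recomputed percentages agree only to within about 0.01 to 0.02 percentage points.

```python
# (task, AppTime s, MPITime s, reported MPI%), copied from the MPI Time table.
MPI_TIME = [
    ("0", 157.0, 1.89, 1.21),
    ("1", 157.0, 6.01, 3.84),
    ("*", 313.0, 7.91, 2.52),
]

def mpi_percent(app_time, mpi_time):
    """Share of application wall time spent inside MPI, in percent."""
    return 100.0 * mpi_time / app_time

recomputed = {task: mpi_percent(app, mpi) for task, app, mpi, _ in MPI_TIME}

for task, app, mpi, reported in MPI_TIME:
    print(f"task {task}: recomputed {recomputed[task]:.2f}%, reported {reported:.2f}%")
```

Task 1 spends roughly three times as long in MPI as task 0, which is the kind of imbalance this table is meant to expose.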
Still More mpiP Output for POP

---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Call            Site       Time    App%    MPI%
Waitall            4   2.22e+03    0.71   28.08
Waitall            6   1.82e+03    0.58   23.04
Wait               1   1.46e+03    0.46   18.41
Waitall            2        831    0.27   10.51
Allreduce          1        499    0.16    6.31
Bcast              1        275    0.09    3.47
Isend              2        256    0.08    3.24
Isend              4        173    0.06    2.18
Barrier            1        113    0.04    1.43
Irecv              2       80.3    0.03    1.01
Irecv              4       40.6    0.01    0.51
Cart_create        3         28    0.01    0.35
Cart_coords        3       17.4    0.01    0.22
Type_commit        5       12.7    0.00    0.16
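The aggregate table lists time per call site, so the same MPI call can appear several times. Summing by call name shows which operation dominates overall. The sketch below does that for the largest rows of the table; the parsing is an illustration written for this layout, not part of mpiP.

```python
from collections import defaultdict

# Rows copied from the Aggregate Time table:
# call name, call site, time (ms), App%, MPI%.
AGGREGATE = """\
Waitall 4 2.22e+03 0.71 28.08
Waitall 6 1.82e+03 0.58 23.04
Wait 1 1.46e+03 0.46 18.41
Waitall 2 831 0.27 10.51
Allreduce 1 499 0.16 6.31
Bcast 1 275 0.09 3.47
Isend 2 256 0.08 3.24
Isend 4 173 0.06 2.18
Barrier 1 113 0.04 1.43
"""

time_ms = defaultdict(float)
mpi_pct = defaultdict(float)
for line in AGGREGATE.splitlines():
    call, _site, ms, _app_pct, pct = line.split()
    time_ms[call] += float(ms)
    mpi_pct[call] += float(pct)

# Rank MPI calls by total time across all of their call sites.
ranked = sorted(time_ms, key=time_ms.get, reverse=True)
for call in ranked:
    print(f"{call:10s} {time_ms[call]:7.0f} ms  {mpi_pct[call]:6.2f}% of MPI time")
```

Across its three call sites, Waitall accounts for a little over 60% of MPI time in this run, so waiting on nonblocking communication is the first place to look.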
Remaining mpiP Output for POP

---------------------------------------------------------------------------
@--- Aggregate Time (top twenty, descending, milliseconds) ----------------
---------------------------------------------------------------------------
Isend              1       12.7    0.00    0.16
Bcast              3       12.4    0.00    0.16
Barrier            5       12.2    0.00    0.15
Cart_shift         5         12    0.00    0.15
Irecv              1       10.7    0.00    0.13
Isend              6       9.28    0.00    0.12
---------------------------------------------------------------------------
@--- Callsite statistics (all, milliseconds): 53 --------------------------
---------------------------------------------------------------------------
Name         Site  Rank   Count     Max    Mean     Min    App%    MPI%
Allreduce       1     0    1121    2.35   0.182   0.079    0.13   10.79
Allreduce       1     1    1121    11.1   0.263   0.129    0.19    4.90
Allreduce       1     *    2242    11.1   0.222   0.079    0.16    6.31
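The per-rank callsite statistics can be cross-checked against the aggregate rows. The sketch below rebuilds the Allreduce site 1 summary line from the two per-rank lines, using the aggregate AppTime (313 s) and MPITime (7.91 s) from the MPI Time table. It assumes App% and MPI% are the call site's total time divided by those two aggregates; agreement is only approximate because the printed values are rounded.

```python
# (rank, count, mean time per call in ms) for Allreduce at call site 1.
PER_RANK = [(0, 1121, 0.182), (1, 1121, 0.263)]

# Aggregate values from the MPI Time table (the "*" row), in seconds.
APP_TIME_S = 313.0
MPI_TIME_S = 7.91

total_count = sum(count for _, count, _ in PER_RANK)
total_ms = sum(count * mean for _, count, mean in PER_RANK)
mean_ms = total_ms / total_count

# Shares of application time and of MPI time, in percent.
app_pct = 100.0 * total_ms / (APP_TIME_S * 1000.0)
mpi_pct = 100.0 * total_ms / (MPI_TIME_S * 1000.0)

print(f"count {total_count}, total {total_ms:.0f} ms, mean {mean_ms:.4f} ms")
print(f"App% {app_pct:.2f}, MPI% {mpi_pct:.2f}")
```

The recomputed total of about 499 ms matches the Allreduce row of the aggregate table, and the percentages line up with the 0.16 and 6.31 shown in the "*" row.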