New Techniques for Programming GPU Clusters
Yifeng Chen
School of EECS, Peking University, China.
Two Conflicting Approaches for Programmability in HPC
Top-down Approach
The core programming model is high-level (e.g. a functional parallel language), so it must rely on heavy heuristic runtime optimization; low-level program constructs are then added to improve low-level control. Risks:
Programmers tend to avoid using the "extra" constructs. Low-level controls do not fit well into the core model.
Bottom-up Approach (PARRAY, PPoPP'12)
The core programming model exposes the memory hierarchy. Same algorithm, same performance, same intellectual challenge, but shorter code.
GPU Clusters
Tianhe: 1 GPU / 2 CPUs; Tsubame: 3 GPUs / 2 CPUs; Mole-8.5: 6 GPUs / 2 CPUs; PKU McClus: 2 GPUs / 1 CPU
[Figure: a host array (4096 x 2048 per half) is distributed to Proc 0 and Proc 1 with MPI_Scatter over the network, then copied to GPU 0 and GPU 1 with cudaMemcpyHostToDevice over PCI.]
Motivating Examples for PARRAY
Basic Notation
Dimension Tree
Type Reference
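The slides give these topics as headings only. The annotated copy below of the declarations used later in the talk is only my reading of the notation (the comments are annotations, not part of the PARRAY source): a declared array is a tree of dimensions with a memory type attached, and "#" references a previously declared type or one of its sub-dimensions such as H_1.

#parray {paged float [2][[2048][4096]]} H
    // dimension tree: an outer dimension of size 2 whose elements are
    // themselves 2048 x 4096 blocks of paged (host) memory
#parray {dmem float # H_1} D
    // type reference: device memory (dmem) whose shape refers (#) to
    // sub-dimension H_1 of H, i.e. one 2048 x 4096 block
#parray {pthd [2]} P
    // a thread array of two pthreads
#parray {[#P][#D]} G
    // a composite layout pairing each thread of P with a D-shaped block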
Thread Arrays
#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G
float* host;
_pa_pthd* p;
#mainhost
{
  #create P(p)
  #create H(host)
  #detour P(p) {
    float* dev;
    INIT_GPU($tid$);
    #create D(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy P(p)
}
[Diagram: the generated code creates the thread array with pthread_create, synchronizes with sem_post/sem_wait, and joins with pthread_join.]
Generating CUDA+Pthread
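The slides only show the generated calls in a diagram. As a rough hand-written sketch of what the thread-array example above expands to (the names worker, done, NDEV, ROWS and COLS are illustrative, error handling is omitted, and the mapping of INIT_GPU($tid$) to cudaSetDevice is an assumption), the generated CUDA + Pthread code might look like this:

#include <pthread.h>
#include <semaphore.h>
#include <cuda_runtime.h>

#define NDEV 2
#define ROWS 2048
#define COLS 4096

static float *host;                 /* [2][2048][4096] host buffer      */
static sem_t  done[NDEV];           /* one semaphore per worker thread  */

/* Each thread owns one GPU and copies its half of the host array. */
static void *worker(void *arg)
{
    int tid = (int)(long)arg;
    float *dev;
    cudaSetDevice(tid);                          /* roughly INIT_GPU($tid$) */
    cudaMalloc((void **)&dev, (size_t)ROWS * COLS * sizeof(float));
    cudaMemcpy(dev, host + (long)tid * ROWS * COLS,
               (size_t)ROWS * COLS * sizeof(float), cudaMemcpyHostToDevice);
    sem_post(&done[tid]);                        /* signal copy completed */
    cudaFree(dev);
    return 0;
}

int main(void)
{
    pthread_t thr[NDEV];
    cudaMallocHost((void **)&host, (size_t)NDEV * ROWS * COLS * sizeof(float));
    for (int i = 0; i < NDEV; i++) {
        sem_init(&done[i], 0, 0);
        pthread_create(&thr[i], 0, worker, (void *)(long)i);
    }
    for (int i = 0; i < NDEV; i++) {
        sem_wait(&done[i]);                      /* wait for each copy    */
        pthread_join(thr[i], 0);
    }
    cudaFreeHost(host);
    return 0;
}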
#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G
float* host;
_pa_mpi* m;
#mainhosts
{
  #create M(m)
  #create H(host)
  #detour M(m) {
    float* dev;
    #create H_1(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy M(m)
}
Generating MPI or IB/verbs
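For comparison, a plain hand-written MPI version of the same transfer, roughly what the mpi [2] example corresponds to, might be sketched as follows (the buffer names host and local are illustrative, and error handling is omitted):

#include <mpi.h>
#include <stdlib.h>

#define ROWS 2048
#define COLS 4096

int main(int argc, char **argv)
{
    int rank, nprocs;
    float *host = 0, *local;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);      /* run with 2 processes */

    local = malloc((size_t)ROWS * COLS * sizeof(float));
    if (rank == 0)                  /* root owns the full [2][2048][4096] array */
        host = malloc((size_t)nprocs * ROWS * COLS * sizeof(float));

    /* Scatter one 2048 x 4096 slice to each process. */
    MPI_Scatter(host, ROWS * COLS, MPI_FLOAT,
                local, ROWS * COLS, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    free(local);
    if (rank == 0) free(host);
    MPI_Finalize();
    return 0;
}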
[Diagram: the generated MPI_Scatter call.]
Other Communication Patterns
ALLTOALL, BCAST
Generating Code for IB/verbs and YH
Communication Layer
Semi-bypassing the MPI layer; patching the InfiniBand layer; a discontiguous RDMA communication pattern achieving zero-copy.
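The discontiguous zero-copy point can be illustrated with a generic verbs fragment: a single RDMA write whose scatter/gather list covers two discontiguous local blocks, so no intermediate packing buffer is needed. This is only a sketch of the standard ibverbs mechanism, not the patched layer described on the slide; the function name rdma_write_2blocks and its parameters are assumptions.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post one RDMA write that gathers two discontiguous local blocks into a
 * contiguous remote region.  qp, mr, remote_addr and rkey are assumed to
 * come from the usual verbs connection setup. */
static int rdma_write_2blocks(struct ibv_qp *qp, struct ibv_mr *mr,
                              char *blk0, char *blk1, uint32_t len,
                              uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge[2];
    struct ibv_send_wr wr, *bad = NULL;

    sge[0].addr = (uintptr_t)blk0; sge[0].length = len; sge[0].lkey = mr->lkey;
    sge[1].addr = (uintptr_t)blk1; sge[1].length = len; sge[1].lkey = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list             = sge;               /* gather list: two pieces */
    wr.num_sge             = 2;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad);
}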
Large-Scale FFT in 20 Lines
Deeply optimized algorithm (ICS 2010); zero-copy for hmem (before Nov 2011).
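The slides do not show the 20-line code itself. As a hedged structural skeleton only (not the ICS 2010 algorithm), a distributed FFT step typically combines batched local cuFFT transforms with a global transpose via MPI_Alltoall; the function dist_fft_step, its parameters, and the omission of local re-packing are all simplifications of mine.

#include <mpi.h>
#include <cufft.h>
#include <cuda_runtime.h>

/* One step of a 1D-decomposed FFT: batched row FFTs on the GPU, then a
 * global exchange so the other dimension becomes local on each rank. */
void dist_fft_step(cufftComplex *d_rows, cufftComplex *h_send,
                   cufftComplex *h_recv, int local_rows, int N,
                   MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    cufftHandle plan;                                 /* batched row FFTs */
    cufftPlan1d(&plan, N, CUFFT_C2C, local_rows);
    cufftExecC2C(plan, d_rows, d_rows, CUFFT_FORWARD);
    cufftDestroy(plan);

    /* Global transpose: stage through the host and exchange equal blocks.
       (Local re-packing before/after the exchange is omitted here.) */
    size_t bytes = (size_t)local_rows * N * sizeof(cufftComplex);
    int blk = local_rows * N / nprocs * 2;            /* floats per rank  */
    cudaMemcpy(h_send, d_rows, bytes, cudaMemcpyDeviceToHost);
    MPI_Alltoall(h_send, blk, MPI_FLOAT, h_recv, blk, MPI_FLOAT, comm);
    cudaMemcpy(d_rows, h_recv, bytes, cudaMemcpyHostToDevice);
}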
Direct Simulation of Turbulent Flows
Scale: up to a 14336³ single-precision grid; 12 distributed arrays, each with 11 TB of data (128 TB in total); the entire Tianhe-1A with 7168 nodes.
Progress: 4096³ completed; 8192³ half-way; 14336³ tested for performance.
Software Technologies: the PARRAY code is only 300 lines; programming-level resilience technology for stable computation.
Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.
Generated Code
Discussions
Other programming models?
MPI (more expressive datatypes; see the sketch below)
OpenACC (optimization for coalescing accesses)
PGAS (generating PGAS library calls)
IB/verbs (directly generating zero-copy IB calls)
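To make the "more expressive datatypes" point concrete, here is a standard MPI fragment (not PARRAY output; the function send_column is illustrative) that describes a strided column of a 2048 x 4096 row-major array with MPI_Type_vector so it can be sent without manual packing:

#include <mpi.h>

/* Describe one column (stride 4096 floats) of a 2048 x 4096 row-major
 * array as a derived datatype and send it without packing. */
void send_column(float *array, int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    MPI_Type_vector(2048, 1, 4096, MPI_FLOAT, &column);
    MPI_Type_commit(&column);
    MPI_Send(array + col, 1, column, dest, /*tag=*/0, comm);
    MPI_Type_free(&column);
}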
We need a software stack! Irregular structures must be encoded into arrays and can then benefit from PARRAY.
A runtime workflow is possible above PARRAY.
PARRAY generates Pthread + CUDA + MPI code plus macros (future support of FPGA and MIC is possible); the macros are compiled out, so there is no performance loss.
Typical training takes 3 days, friendly to engineers…