New Techniques for Programming GPU Clusters
Yifeng Chen
School of EECS, Peking University, China.
Two Conflicting Approaches for Programmability in HPC
Top-down Approach
The core programming model is high-level (e.g. a functional parallel language), so it must rely on heavy heuristic runtime optimization; low-level program constructs are then added to improve low-level control. Risks:
Programmers tend to avoid using the "extra" constructs. Low-level controls do not fit well into the core model.
Bottom-up Approach (PARRAY, PPoPP'12)
The core programming model exposes the memory hierarchy. Same algorithm, same performance, same intellectual challenge, but shorter code.
GPU Clusters
Tianhe: 1 GPU / 2 CPUs; Tsubame: 3 GPUs / 2 CPUs; Mole-8.5: 6 GPUs / 2 CPUs; PKU McClus: 2 GPUs / 1 CPU
[Figure: a host array (4096 x 2048 per half) is distributed to Proc 0 and Proc 1 with MPI_Scatter over the network, then copied to GPU 0 and GPU 1 with cudaMemcpyHostToDevice over PCI.]
Motivating Examples for PARRAY
Basic Notation
Dimension Tree
Type Reference
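The slides give these topics as headings only. The annotated copy below of the declarations used later in the talk is only my reading of the notation (the comments are annotations, not part of the PARRAY source): a declared array is a tree of dimensions with a memory type attached, and "#" references a previously declared type or one of its sub-dimensions such as H_1.

#parray {paged float [2][[2048][4096]]} H
    // dimension tree: an outer dimension of size 2 whose elements are
    // themselves 2048 x 4096 blocks of paged (host) memory
#parray {dmem float # H_1} D
    // type reference: device memory (dmem) whose shape refers (#) to
    // sub-dimension H_1 of H, i.e. one 2048 x 4096 block
#parray {pthd [2]} P
    // a thread array of two pthreads
#parray {[#P][#D]} G
    // a composite layout pairing each thread of P with a D-shaped block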
Thread Arrays
#parray {pthd [2]} P
#parray {paged float [2][[2048][4096]]} H
#parray {dmem float # H_1} D
#parray {[#P][#D]} G
float* host;
_pa_pthd* p;
#mainhost
{
  #create P(p)
  #create H(host)
  #detour P(p) {
    float* dev;
    INIT_GPU($tid$);
    #create D(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy P(p)
}
[Diagram: the generated code creates the thread array with pthread_create, synchronizes with sem_post/sem_wait, and joins with pthread_join.]
Generating CUDA+Pthread
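The slides only show the generated calls in a diagram. As a rough hand-written sketch of what the thread-array example above expands to (the names worker, done, NDEV, ROWS and COLS are illustrative, error handling is omitted, and the mapping of INIT_GPU($tid$) to cudaSetDevice is an assumption), the generated CUDA + Pthread code might look like this:

#include <pthread.h>
#include <semaphore.h>
#include <cuda_runtime.h>

#define NDEV 2
#define ROWS 2048
#define COLS 4096

static float *host;                 /* [2][2048][4096] host buffer      */
static sem_t  done[NDEV];           /* one semaphore per worker thread  */

/* Each thread owns one GPU and copies its half of the host array. */
static void *worker(void *arg)
{
    int tid = (int)(long)arg;
    float *dev;
    cudaSetDevice(tid);                          /* roughly INIT_GPU($tid$) */
    cudaMalloc((void **)&dev, (size_t)ROWS * COLS * sizeof(float));
    cudaMemcpy(dev, host + (long)tid * ROWS * COLS,
               (size_t)ROWS * COLS * sizeof(float), cudaMemcpyHostToDevice);
    sem_post(&done[tid]);                        /* signal copy completed */
    cudaFree(dev);
    return 0;
}

int main(void)
{
    pthread_t thr[NDEV];
    cudaMallocHost((void **)&host, (size_t)NDEV * ROWS * COLS * sizeof(float));
    for (int i = 0; i < NDEV; i++) {
        sem_init(&done[i], 0, 0);
        pthread_create(&thr[i], 0, worker, (void *)(long)i);
    }
    for (int i = 0; i < NDEV; i++) {
        sem_wait(&done[i]);                      /* wait for each copy    */
        pthread_join(thr[i], 0);
    }
    cudaFreeHost(host);
    return 0;
}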
#parray { mpi [2] } M
#parray { paged float [2][[2048][4096]] } H
#parray { [#M][#H_1] } G
float* host;
_pa_mpi* m;
#mainhosts
{
  #create M(m)
  #create H(host)
  #detour M(m) {
    float* dev;
    #create H_1(dev)
    #insert DataTransfer(dev, G, host, H){}
  }
  #destroy H(host)
  #destroy M(m)
}
Generating MPI or IB/verbs
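For comparison, a plain hand-written MPI version of the same transfer, roughly what the mpi [2] example corresponds to, might be sketched as follows (the buffer names host and local are illustrative, and error handling is omitted):

#include <mpi.h>
#include <stdlib.h>

#define ROWS 2048
#define COLS 4096

int main(int argc, char **argv)
{
    int rank, nprocs;
    float *host = 0, *local;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);      /* run with 2 processes */

    local = malloc((size_t)ROWS * COLS * sizeof(float));
    if (rank == 0)                  /* root owns the full [2][2048][4096] array */
        host = malloc((size_t)nprocs * ROWS * COLS * sizeof(float));

    /* Scatter one 2048 x 4096 slice to each process. */
    MPI_Scatter(host, ROWS * COLS, MPI_FLOAT,
                local, ROWS * COLS, MPI_FLOAT,
                0, MPI_COMM_WORLD);

    free(local);
    if (rank == 0) free(host);
    MPI_Finalize();
    return 0;
}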
[Diagram: the generated MPI_Scatter call.]
Other Communication Patterns
ALLTOALL, BCAST
Generating Code for IB/verbs and YH
Communication Layer
Semi-bypassing the MPI layer; patching the InfiniBand layer; a discontiguous RDMA communication pattern achieving zero-copy.
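The discontiguous zero-copy point can be illustrated with a generic verbs fragment: a single RDMA write whose scatter/gather list covers two discontiguous local blocks, so no intermediate packing buffer is needed. This is only a sketch of the standard ibverbs mechanism, not the patched layer described on the slide; the function name rdma_write_2blocks and its parameters are assumptions.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post one RDMA write that gathers two discontiguous local blocks into a
 * contiguous remote region.  qp, mr, remote_addr and rkey are assumed to
 * come from the usual verbs connection setup. */
static int rdma_write_2blocks(struct ibv_qp *qp, struct ibv_mr *mr,
                              char *blk0, char *blk1, uint32_t len,
                              uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge[2];
    struct ibv_send_wr wr, *bad = NULL;

    sge[0].addr = (uintptr_t)blk0; sge[0].length = len; sge[0].lkey = mr->lkey;
    sge[1].addr = (uintptr_t)blk1; sge[1].length = len; sge[1].lkey = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list             = sge;               /* gather list: two pieces */
    wr.num_sge             = 2;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad);
}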
Large-Scale FFT in 20 Lines
Deeply optimized algorithm (ICS 2010); zero-copy for hmem (before Nov 2011).
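The slides do not show the 20-line code itself. As a hedged structural skeleton only (not the ICS 2010 algorithm), a distributed FFT step typically combines batched local cuFFT transforms with a global transpose via MPI_Alltoall; the function dist_fft_step, its parameters, and the omission of local re-packing are all simplifications of mine.

#include <mpi.h>
#include <cufft.h>
#include <cuda_runtime.h>

/* One step of a 1D-decomposed FFT: batched row FFTs on the GPU, then a
 * global exchange so the other dimension becomes local on each rank. */
void dist_fft_step(cufftComplex *d_rows, cufftComplex *h_send,
                   cufftComplex *h_recv, int local_rows, int N,
                   MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    cufftHandle plan;                                 /* batched row FFTs */
    cufftPlan1d(&plan, N, CUFFT_C2C, local_rows);
    cufftExecC2C(plan, d_rows, d_rows, CUFFT_FORWARD);
    cufftDestroy(plan);

    /* Global transpose: stage through the host and exchange equal blocks.
       (Local re-packing before/after the exchange is omitted here.) */
    size_t bytes = (size_t)local_rows * N * sizeof(cufftComplex);
    int blk = local_rows * N / nprocs * 2;            /* floats per rank  */
    cudaMemcpy(h_send, d_rows, bytes, cudaMemcpyDeviceToHost);
    MPI_Alltoall(h_send, blk, MPI_FLOAT, h_recv, blk, MPI_FLOAT, comm);
    cudaMemcpy(d_rows, h_recv, bytes, cudaMemcpyHostToDevice);
}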
Direct Simulation of Turbulent Flows
Scale: up to a 14336³ single-precision grid; 12 distributed arrays, each with 11 TB of data (128 TB in total); the entire Tianhe-1A with 7168 nodes.
Progress: 4096³ completed; 8192³ half-way; 14336³ tested for performance.
Software Technologies: the PARRAY code is only 300 lines; programming-level resilience technology for stable computation.
Conclusion: GPU-accelerated large simulation on the entire Tianhe-1A is feasible.
Generated Code
Discussions
Other programming models?
MPI (more expressive datatypes; see the sketch below)
OpenACC (optimization for coalescing accesses)
PGAS (generating PGAS library calls)
IB/verbs (directly generating zero-copy IB calls)
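To make the "more expressive datatypes" point concrete, here is a standard MPI fragment (not PARRAY output; the function send_column is illustrative) that describes a strided column of a 2048 x 4096 row-major array with MPI_Type_vector so it can be sent without manual packing:

#include <mpi.h>

/* Describe one column (stride 4096 floats) of a 2048 x 4096 row-major
 * array as a derived datatype and send it without packing. */
void send_column(float *array, int col, int dest, MPI_Comm comm)
{
    MPI_Datatype column;
    MPI_Type_vector(2048, 1, 4096, MPI_FLOAT, &column);
    MPI_Type_commit(&column);
    MPI_Send(array + col, 1, column, dest, /*tag=*/0, comm);
    MPI_Type_free(&column);
}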
We need a software stack! Irregular structures must be encoded into arrays and can then benefit from PARRAY.
A runtime workflow is possible above PARRAY.
PARRAY generates Pthread + CUDA + MPI code plus macros (future support of FPGA and MIC is possible); the macros are compiled out, so there is no performance loss.
Typical training takes 3 days, friendly to engineers…