Page 1:

University of Reims Champagne-Ardenne, France

CUDA Training Day June 26, 2012

GPU Computing with CUDA

Gabriel Noaje [email protected]

Page 2:

Page 3:

INTRODUCTION TO MASSIVELY PARALLEL COMPUTING

Page 4:

Moore’s Law (paraphrased)

“The number of transistors on an integrated circuit doubles every two years.”

– Gordon E. Moore

Page 5:

Moore’s Law (visualized)

Credits: Wikimedia

Page 6:

Serial Performance Scaling is Over

• Cannot continue to scale processor frequencies
  • no 10 GHz chips

• Cannot continue to increase power consumption
  • can't melt chip

• Can continue to increase transistor density
  • as per Moore's Law

Page 7:

How to Use Transistors?

• Instruction-level parallelism
  • out-of-order execution, speculation, …
  • vanishing opportunities in a power-constrained world

• Data-level parallelism
  • vector units, SIMD execution, …
  • increasing … SSE, AVX, Cell SPE, ClearSpeed, GPU

• Thread-level parallelism
  • increasing … multithreading, multicore, manycore
  • Intel Core2, AMD Phenom, Sun Niagara, STI Cell, NVIDIA Fermi, …

Page 8:

Why Massively Parallel Processing?

• A quiet revolution and potential build-up
  • Computation: TFLOPs vs. 100 GFLOPs

• GPU in every PC – massive volume & potential impact

[Chart: peak GFLOP/s over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)]

Page 9:

Why Massively Parallel Processing?

• A quiet revolution and potential build-up
  • Bandwidth: ~10x

• GPU in every PC – massive volume & potential impact

[Chart: memory bandwidth over time – NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere)]

Page 10:

KEPLER ARCHITECTURE

Page 11:

Kepler GK110 Block Diagram

Architecture:
• 7.1B transistors
• 15 SMX units
• > 1 TFLOP FP64
• 1.5 MB L2 cache
• 384-bit GDDR5
• PCI Express Gen3

Page 12:

Kepler GK110 SMX vs Fermi SM

Page 13:

SMX: Efficient Performance

• 192 CUDA cores
• 64 double-precision (FP64) units
• 32 Special Function Units
• 32 load/store units dedicated to memory access
• 65,536 registers
• 64KB shared memory
• 48KB cache

Page 14:

[Diagram: the Host launches Kernel 1 on the Device as Grid 1 (Blocks (0,0), (1,0), (0,1), (1,1)) and Kernel 2 as Grid 2; Block (1,1) is expanded into Threads (0,0,0) … (3,1,0), with a back layer (0,0,1) … (3,0,1)]

Block IDs and Thread IDs

• Each thread uses IDs to decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D

• Simplifies memory addressing when processing multidimensional data
  – Image processing
  – Solving PDEs on volumes
  – …
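To make the addressing point concrete, a small sketch (our own, not from the slides) that maps 2D block and thread IDs onto a row-major image, one thread per pixel:

// Hypothetical example: invert a grayscale image of width x height.
__global__ void invert(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row
    if (x < width && y < height)                    // guard partial blocks
        img[y * width + x] = 255 - img[y * width + x];
}

// launch: 16x16 threads per block, enough blocks to cover the image
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// invert<<<grid, block>>>(d_img, width, height);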

Page 15:

The “New” Moore’s Law

• Computers no longer get faster, just wider

• You must re-think your algorithms to be parallel!

• Data-parallel computing is the most scalable solution
  • Otherwise: refactor code for 2 cores, 4 cores, 8 cores, 16 cores…
  • You will always have more data than cores – build the computation around the data

Page 16:

Generic Multicore Chip

[Diagram: a few processors, each with its own on-chip memory, sharing a Global Memory]

• Handful of processors, each supporting ~1 hardware thread

• On-chip memory near processors (cache, RAM, or both)

• Shared global memory space (external DRAM)

Page 17:

Generic Manycore Chip

[Diagram: many processors (• • •), each with its own on-chip memory, sharing a Global Memory]

• Many processors, each supporting many hardware threads

• On-chip memory near processors (cache, RAM, or both)

• Shared global memory space (external DRAM)

Page 18:

Enter GPU Computing

• Massive economies of scale

• Massively parallel

Page 19:

[Images: consumer GPU workloads – 3D animation, movie playing, CAD (Computer-Aided Design), games]

Page 20:

GPU Evolution

• High throughput computation
  • GeForce GTX 280: 933 GFLOP/s

• High bandwidth memory
  • GeForce GTX 280: 140 GB/s

• High availability to all
  • 180+ million CUDA-capable GPUs in the wild

[Timeline 1995–2010: RIVA 128 (3M transistors), GeForce® 256 (23M), GeForce 3 (60M), GeForce FX (125M), GeForce 8800 (681M), "Fermi" (3B)]

Page 21:

Why is this different from a CPU?

• Different goals produce different designs
  • GPU assumes work load is highly parallel
  • CPU must be good at everything, parallel or not

• CPU: minimize latency experienced by 1 thread
  • big on-chip caches
  • sophisticated control logic

• GPU: maximize throughput of all threads
  • # threads in flight limited by resources => lots of resources (registers, bandwidth, etc.)
  • multithreading can hide latency => skip the big caches
  • share control logic across many threads

Page 22:

[Images: GPU computing applications – financial analysis, molecular docking, medical MRI, geological modeling, weather simulation, computer virus detection]

Page 23:

Initially – use graphics calls:
• Cg by NVIDIA
• HLSL by Microsoft

Nowadays – rich GPU programming ecosystem:
• ATI Stream by AMD
• CUDA by NVIDIA (NVIDIA hardware specific)
• OpenCL by Khronos Group
• DirectCompute by Microsoft

CUDA:
• = "Compute Unified Device Architecture"
• extends the ANSI C standard
• gentle learning curve (compared to Cg, HLSL, etc.)
• opens the underlying architecture to the user

www.nvidia.com/getcuda (driver + toolkit + SDK + docs + …)

Page 24:

CUDA: Scalable parallel programming

• Augment C/C++ with minimalist abstractions
  • let programmers focus on parallel algorithms
  • not mechanics of a parallel programming language

• Provide straightforward mapping onto hardware
  • good fit to GPU architecture
  • maps well to multi-core CPUs too

• Scale to 100s of cores & 10,000s of parallel threads
  • GPU threads are lightweight – create / switch is free
  • GPU needs 1000s of threads for full utilization

Page 25:

Key Parallel Abstractions in CUDA

• Hierarchy of concurrent threads

• Lightweight synchronization primitives

• Shared memory model for cooperating threads

Page 26:

Hierarchy of concurrent threads

• Parallel kernels composed of many threads
  • all threads execute the same sequential program

• Threads are grouped into thread blocks
  • threads in the same block can cooperate

• Threads/blocks have unique IDs

[Diagram: a single Thread t; a Block b containing threads t0 t1 … tB]

Page 27:

CUDA Model of Parallelism

• CUDA virtualizes the physical hardware
  • thread is a virtualized scalar processor (registers, PC, state)
  • block is a virtualized multiprocessor (threads, shared mem.)

• Scheduled onto physical hardware without pre-emption
  • threads/blocks launch & run to completion
  • blocks should be independent

[Diagram: many blocks (• • •), each with its own block memory, over a shared Global Memory]

Page 28:

Heterogeneous Computing

[Diagram: Multicore CPU]

Page 29:

C for CUDA

• Philosophy: provide minimal set of extensions necessary to expose power

• Function qualifiers:
    __global__ void my_kernel() { }
    __device__ float my_device_func() { }

• Variable qualifiers:
    __constant__ float my_constant_array[32];
    __shared__   float my_shared_array[32];

• Execution configuration:
    dim3 grid_dim(100, 50);   // 5000 thread blocks
    dim3 block_dim(4, 8, 8);  // 256 threads per block
    my_kernel<<<grid_dim, block_dim>>>(...);  // Launch kernel

• Built-in variables and functions valid in device code:
    dim3 gridDim;    // Grid dimension
    dim3 blockDim;   // Block dimension
    dim3 blockIdx;   // Block index
    dim3 threadIdx;  // Thread index
    void __syncthreads();  // Thread synchronization
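Pulling several of these extensions together, a minimal sketch (our own, not from the slides), assuming d_data holds N floats on the device with N a multiple of 256:

// Hypothetical example combining __constant__, __shared__, a kernel,
// built-in index variables, and __syncthreads().
__constant__ float scale;                 // set once from the host

__global__ void scale_and_stage(float *data)
{
    __shared__ float tile[256];           // one element per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i] * scale;  // scale read from constant memory
    __syncthreads();                      // tile now fully populated
    data[i] = tile[threadIdx.x];
}

// host side:
// float h_scale = 2.0f;
// cudaMemcpyToSymbol(scale, &h_scale, sizeof(float));
// scale_and_stage<<<N/256, 256>>>(d_data);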

Page 30:

CUDA PROGRAMMING BASICS

Page 31:

Outline of CUDA Basics

• Basic Kernels and Execution on GPU
• Basic Memory Management
• Coordinating CPU and GPU Execution

• See the Programming Guide for the full API

Page 32:

CUDA Programming Model

• Parallel code (kernel) is launched and executed on a device by many threads

• Launches are hierarchical
  • Threads are grouped into blocks
  • Blocks are grouped into grids

• Familiar serial code is written for a thread
  • Each thread is free to execute a unique code path
  • Built-in thread and block ID variables

Page 33:

High Level View

[Diagram: GPU with several SMs (each with SMEM) over Global Memory, connected to the CPU and chipset via PCIe]

Page 34:

Blocks of threads run on an SM

[Diagram: a thread maps to a streaming processor, with registers and per-thread memory; a thread block maps to a streaming multiprocessor, with per-block shared memory (SMEM)]

Page 35:

Whole grid runs on GPU

[Diagram: many blocks of threads ( . . . ) spread across the SMs (SMEM), all over Global Memory]

Page 36:

Thread Hierarchy

• Threads launched for a parallel section are partitioned into thread blocks
  • Grid = all blocks for a given launch

• Thread block is a group of threads that can:
  • Synchronize their execution
  • Communicate via shared memory

Page 37:

Memory Model

[Diagram: sequential kernels – Kernel 0, then Kernel 1 – both reading and writing the same per-device Global Memory]

Page 38:

Memory Model

[Diagram: Host memory, Device 0 memory, and Device 1 memory, with cudaMemcpy() moving data between them]

Page 39:

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}

Device Code: the __global__ vecAdd kernel above.

Page 40:

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}

Host Code: the main() launch above.

Page 41:

Example: Host code for vecAdd

// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …, *h_C = …;  // h_C allocated but empty

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float) );
cudaMalloc( (void**) &d_B, N * sizeof(float) );
cudaMalloc( (void**) &d_C, N * sizeof(float) );

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );

// execute grid of N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

Page 42:

Example: Host code for vecAdd (2)

// execute grid of N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

// copy result back to host memory
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost );

// do something with the result…

// free device (GPU) memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
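Assembled into one program, a minimal end-to-end sketch (our addition; the initialization values and the final check are made up for illustration):

// vecadd_full.cu – hypothetical complete version of the example above
#include <stdio.h>
#include <cuda_runtime.h>

#define N (256*1024)

__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    size_t bytes = N * sizeof(float);
    float *h_A = (float*)malloc(bytes), *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) { h_A[i] = i; h_B[i] = 2*i; }

    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    printf("C[1] = %f (expect 3.0)\n", h_C[1]);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}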

Page 43:

IDs and Dimensions

• Threads:
  • 3D IDs, unique within a block

• Blocks:
  • 2D IDs, unique within a grid

• Dimensions set at launch
  • Can be unique for each grid

• Built-in variables:
  • threadIdx, blockIdx
  • blockDim, gridDim

[Diagram: Device / Grid 1 of Blocks (0..2, 0..1); Block (1, 1) expanded into Threads (0..4, 0..2)]

Page 44:

Kernel Variations and Output

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
// Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
// Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
// Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
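Those outputs match a launch of 4 blocks of 4 threads; a small driver that would produce them (our sketch, not on the slide; assumes one of the kernels above and <stdio.h>):

int main()
{
    const int n = 16;
    int h_a[n];
    int *d_a = 0;
    cudaMalloc((void**)&d_a, n * sizeof(int));

    kernel<<<4, 4>>>(d_a);   // 4 blocks x 4 threads = 16 elements

    cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("%d ", h_a[i]);
    printf("\n");
    cudaFree(d_a);
    return 0;
}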

Page 45:

Code executed on GPU

• C/C++ with some restrictions:
  • Can only access GPU memory
  • No variable number of arguments
  • No static variables
  • No recursion
  • No dynamic polymorphism

• Must be declared with a qualifier:
  • __global__ : launched by CPU, cannot be called from GPU, must return void
  • __device__ : called from other GPU functions, cannot be called by the CPU
  • __host__ : can be called by CPU
  • __host__ and __device__ qualifiers can be combined
    • Function is compiled for both host and device
    • Sample use: overloading operators

Page 46:

Memory Spaces

• CPU and GPU have separate memory spaces
  • Data is moved across the PCIe bus

• Use functions to allocate/set/copy memory on GPU
  • Very similar to corresponding C functions

• Pointers are just addresses
  • Can't tell from the pointer value whether the address is on CPU or GPU
  • Must exercise care when dereferencing:
    • Dereferencing a CPU pointer on the GPU will likely crash
    • Same for vice versa

Page 47:

CUDA Device Memory Model

• Device code can:
  • R/W per-thread registers
  • R/W per-thread local memory
  • R/W per-block shared memory
  • R/W per-grid global memory
  • Read only per-grid constant memory

• Host code can:
  • Transfer data to/from per-grid global and constant memories

Page 48:

GPU Memory Allocation / Release

• Host (CPU) manages device (GPU) memory:
  • cudaMalloc (void ** pointer, size_t nbytes)
  • cudaMemset (void * pointer, int value, size_t count)
  • cudaFree (void* pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );
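Each of these calls returns a cudaError_t; a short sketch of the usual checking idiom (our addition, reusing d_a and nbytes from the snippet above):

// Check the result of a CUDA call; cudaGetErrorString turns the
// error code into a readable message.
cudaError_t err = cudaMalloc((void**)&d_a, nbytes);
if (err != cudaSuccess)
{
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}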

Page 49:

Data Copies

• cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction );
  • returns after the copy is complete
  • blocks the CPU thread until all bytes have been copied
  • doesn't start copying until previous CUDA calls complete

• enum cudaMemcpyKind
  • cudaMemcpyHostToDevice
  • cudaMemcpyDeviceToHost
  • cudaMemcpyDeviceToDevice

• Non-blocking copies are also available
  • cudaMemcpyAsync
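For cudaMemcpyAsync, a minimal sketch (our addition, reusing d_a and nbytes from the previous slide): the host buffer must be page-locked (cudaMallocHost) for the copy to be truly asynchronous, and a stream is used to order and wait on the transfer.

// Async copy sketch: pinned host buffer + a stream, then wait on it.
int *h_buf;
cudaMallocHost((void**)&h_buf, nbytes);  // page-locked host memory

cudaStream_t stream;
cudaStreamCreate(&stream);

cudaMemcpyAsync(d_a, h_buf, nbytes, cudaMemcpyHostToDevice, stream);
// ... CPU work can proceed here while the copy is in flight ...
cudaStreamSynchronize(stream);           // wait for the copy to finish

cudaStreamDestroy(stream);
cudaFreeHost(h_buf);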

Page 50:

Code Walkthrough 1

// walkthrough1.cu
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);
    int *d_a=0, *h_a=0; // device and host pointers

Page 51:

Code Walkthrough 1

// walkthrough1.cu
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);
    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );
    if( 0==h_a || 0==d_a )
    {
        printf("couldn't allocate memory\n");
        return 1;
    }

Page 52:

Code Walkthrough 1

// walkthrough1.cu
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);
    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );
    if( 0==h_a || 0==d_a )
    {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

Page 53:

Code Walkthrough 1

// walkthrough1.cu
#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);
    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );
    if( 0==h_a || 0==d_a )
    {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );
    return 0;
}

Page 54:

Example: Shuffling Data

// Reorder values based on keys
// Each thread moves one element
__global__ void shuffle(int* prev_array, int* new_array, int* indices)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    new_array[i] = prev_array[indices[i]];
}

int main()
{
    // Run grid of N/256 blocks of 256 threads each
    shuffle<<<N/256, 256>>>(d_old, d_new, d_ind);
}

Host Code: the main() launch above.

Page 55:

Kernel with 2D Indexing

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}

Page 56:

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}

int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);
    int *d_a=0, *h_a=0; // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );
    if( 0==h_a || 0==d_a )
    {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x = dimx / block.x;
    grid.y = dimy / block.y;

    kernel<<<grid, block>>>( d_a, dimx, dimy );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int row=0; row<dimy; row++)
    {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }

    free( h_a );
    cudaFree( d_a );
    return 0;
}

Page 57:

Blocks must be independent

• Any possible interleaving of blocks should be valid
  • presumed to run to completion without pre-emption
  • can run in any order
  • can run concurrently OR sequentially

• A thread block is a batch of threads that can cooperate with each other by:
  • synchronizing their execution: __syncthreads()
  • sharing data with shared memory

• Independence requirement gives scalability

Page 58:

CUDA MEMORIES

Page 59:

Hardware Implementation of CUDA Memories

• Each thread can:
  • Read/write per-thread registers
  • Read/write per-thread local memory
  • Read/write per-block shared memory
  • Read/write per-grid global memory
  • Read-only per-grid constant memory

[Diagram: Grid with Blocks (0,0) and (1,0), each containing Shared Memory and per-thread Registers for Threads (0,0) and (1,0); all blocks sit over Global Memory and Constant Memory, which the Host can access]

Page 60:

CUDA Variable Type Qualifiers

• "automatic" scalar variables without qualifier reside in a register
  • compiler will spill to thread-local memory

• "automatic" array variables without qualifier reside in thread-local memory

Variable declaration              Memory     Scope    Lifetime
int var;                          register   thread   thread
int array_var[10];                local      thread   thread
__shared__ int shared_var;        shared     block    block
__device__ int global_var;        global     grid     application
__constant__ int constant_var;    constant   grid     application

Page 61:

CUDA Variable Type Performance

• scalar variables reside in fast, on-chip registers
• shared variables reside in fast, on-chip memories
• thread-local arrays & global variables reside in uncached off-chip memory
• constant variables reside in cached off-chip memory

Variable declaration              Memory     Penalty
int var;                          register   1x
int array_var[10];                local      100x
__shared__ int shared_var;        shared     1x
__device__ int global_var;        global     100x
__constant__ int constant_var;    constant   1x

Page 62:

Where to declare variables?

Can host access it?

• Yes → declare outside of any function:
    __constant__ int constant_var;
    __device__ int global_var;

• No → declare in the kernel:
    int var;
    int array_var[10];
    __shared__ int shared_var;

Page 63:

Example – thread-local variables

// motivate per-thread variables with
// Ten Nearest Neighbors application
__global__ void ten_nn(float2 *result, float2 *ps, float2 *qs, size_t num_qs)
{
    // p goes in a register
    float2 p = ps[threadIdx.x];

    // per-thread heap goes in off-chip memory
    float2 heap[10];

    // read through num_qs points, maintaining
    // the nearest 10 qs to p in the heap
    ...

    // write out the contents of heap to result
    ...
}

Page 64:

Example – shared variables

// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
    // compute this thread's global index
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

    if(i > 0)
    {
        // each thread loads two elements from global memory
        int x_i = input[i];
        int x_i_minus_one = input[i-1];

        result[i] = x_i - x_i_minus_one;
    }
}

Page 65:

Example – shared variables

// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
    // compute this thread's global index
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

    if(i > 0)
    {
        // what are the bandwidth requirements of this kernel?
        int x_i = input[i];              // Two loads
        int x_i_minus_one = input[i-1];

        result[i] = x_i - x_i_minus_one;
    }
}

Page 66:

Example – shared variables

// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
    // compute this thread's global index
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

    if(i > 0)
    {
        // How many times does this kernel load input[i]?
        int x_i = input[i];              // once by thread i
        int x_i_minus_one = input[i-1];  // again by thread i+1

        result[i] = x_i - x_i_minus_one;
    }
}

Page 67:

Example – shared variables

// motivate shared variables with
// Adjacent Difference application:
// compute result[i] = input[i] - input[i-1]
__global__ void adj_diff_naive(int *result, int *input)
{
    // compute this thread's global index
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;

    if(i > 0)
    {
        // Idea: eliminate redundancy by sharing data
        int x_i = input[i];
        int x_i_minus_one = input[i-1];

        result[i] = x_i - x_i_minus_one;
    }
}

Page 68:

Example – shared variables

// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
    // shorthand for threadIdx.x
    int tx = threadIdx.x;

    // allocate a __shared__ array, one element per thread
    __shared__ int s_data[BLOCK_SIZE];

    // each thread reads one element to s_data
    unsigned int i = blockDim.x * blockIdx.x + tx;
    s_data[tx] = input[i];

    // avoid race condition: ensure all loads
    // complete before continuing
    __syncthreads();

    ...
}

Page 69:

Example – shared variables

// optimized version of adjacent difference
__global__ void adj_diff(int *result, int *input)
{
    ...
    if(tx > 0)
        result[i] = s_data[tx] - s_data[tx-1];
    else if(i > 0)
    {
        // handle thread block boundary
        result[i] = s_data[tx] - input[i-1];
    }
}

Page 70:

Example – shared variables

// when the size of the array isn't known at compile time...
__global__ void adj_diff(int *result, int *input)
{
    // use extern to indicate a __shared__ array will be
    // allocated dynamically at kernel launch time
    extern __shared__ int s_data[];
    ...
}

// pass the size of the per-block array, in bytes, as the third
// argument to the triple chevrons
adj_diff<<<num_blocks, block_size, block_size * sizeof(int)>>>(r, i);

Page 71:

A Common Programming Strategy

• Partition data into subsets that fit into shared memory

Page 72:

A Common Programming Strategy

• Handle each data subset with one thread block

Page 73:

A Common Programming Strategy

• Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism

Page 74:

A Common Programming Strategy

• Perform the computation on the subset from shared memory

Page 75:

A Common Programming Strategy

• Copy the result from shared memory back to global memory

Page 76:

A Common Programming Strategy

• Carefully partition data according to access patterns
  • Read-only → __constant__ memory (fast)
  • R/W & shared within block → __shared__ memory (fast)
  • R/W within each thread → registers (fast)
  • Indexed R/W within each thread → local memory (slow)
  • R/W inputs/results → cudaMalloc'ed global memory (slow)

Page 77:

Communication Through Memory

• Question:

__global__ void race(void)
{
    __shared__ int my_shared_variable;
    my_shared_variable = threadIdx.x;

    // what is the value of
    // my_shared_variable?
}

Page 78:

Communication Through Memory

• This is a race condition
  • The result is undefined
  • The order in which threads access the variable is undefined without explicit coordination
  • Use barriers (e.g., __syncthreads) or atomic operations (e.g., atomicAdd) to enforce well-defined semantics

Page 79:

Communication Through Memory

• Use __syncthreads to ensure data is ready for access

__global__ void share_data(int *input)
{
    __shared__ int data[BLOCK_SIZE];
    data[threadIdx.x] = input[threadIdx.x];
    __syncthreads();
    // the state of the entire data array
    // is now well-defined for all threads
    // in this block
}

Page 80:

Communication Through Memory

• Use atomic operations to ensure exclusive access to a variable

// assume *result is initialized to 0
__global__ void sum(int *input, int *result)
{
    atomicAdd(result, input[threadIdx.x]);

    // after this kernel exits, the value of
    // *result will be the sum of the input
}

Page 81:

Hierarchical Atomics

• Divide & Conquer
  • Per-thread atomicAdd to a __shared__ partial sum
  • Per-block atomicAdd to the total sum

[Diagram: per-block partial sums Σ0, Σ1, …, Σi combined into the total Σ]
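A minimal sketch of that two-level scheme (our illustration, not slide code; shared-memory atomics require compute capability 1.2+):

// Hierarchical atomics: threads accumulate into a per-block __shared__
// partial sum; one thread per block then adds it to the global total.
__global__ void sum_hier(int *input, int *result)
{
    __shared__ int partial;
    if (threadIdx.x == 0)
        partial = 0;                 // initialize the per-block sum
    __syncthreads();

    int i = blockDim.x * blockIdx.x + threadIdx.x;
    atomicAdd(&partial, input[i]);   // cheap: shared-memory atomic
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(result, partial);  // one global atomic per block
}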

Page 82:

SMX EXECUTION

Page 83:

How an SM executes threads

• Overview of how a Streaming Multiprocessor works

• SIMT Execution
• Divergence

Page 84:

Scheduling Blocks onto SMs

[Diagram: Thread Blocks 5, 27, 61, 2001 queued onto a Streaming Multiprocessor]

• HW schedules thread blocks onto available SMs
  • No guarantee of ordering among thread blocks
  • HW will schedule a thread block as soon as a previous thread block finishes

Page 85:

Warps

• Each thread block is executed as 32-thread warps
  • An implementation decision, not part of the CUDA programming model
  • Warps are the scheduling units in SMs

• If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in an SM?
  – Each block is divided into 256/32 = 8 warps
  – There are 8 x 3 = 24 warps
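In device code, a thread's warp and lane follow directly from its thread index; a small sketch (our addition) using the built-in warpSize (32 here):

__global__ void warp_ids(int *warp_of, int *lane_of)
{
    int tid  = threadIdx.x;       // index within the block
    int warp = tid / warpSize;    // which warp of the block this thread is in
    int lane = tid % warpSize;    // position within that warp
    int i    = blockDim.x * blockIdx.x + tid;
    warp_of[i] = warp;
    lane_of[i] = lane;
}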

Page 86:

Mapping of Thread Blocks

• Each thread block is mapped to one or more warps

• The hardware schedules each warp independently

[Diagram: Thread Block N (128 threads) split into warps TB N W1, TB N W2, TB N W3, TB N W4]

Page 87:


Thread Scheduling Example

• SM implements zero-overhead warp scheduling
  • At any time, only one of the warps is executed by the SM
  • Warps whose next instruction has its inputs ready for consumption are eligible for execution
  • Eligible warps are selected for execution on a prioritized scheduling policy
  • All threads in a warp execute the same instruction when selected

[Timeline diagram (TB = Thread Block, W = Warp): warps from TB1, TB2, and TB3 interleave on the SM; whenever a warp stalls (TB1 W1, TB2 W1, TB3 W2), another eligible warp is selected and issues its next instructions]

Page 88:

Control Flow Divergence

• What happens if you have the following code?

if(foo(threadIdx.x))
{
    do_A();
}
else
{
    do_B();
}

Page 89:

Control Flow Divergence

[Diagram: at a branch, a warp serializes – the threads taking Path A execute while the rest wait, then the remaining threads execute Path B]

Page 90:

Control Flow Divergence

• Nested branches are handled as well

if(foo(threadIdx.x))
{
    if(bar(threadIdx.x))
        do_A();
    else
        do_B();
}
else
    do_C();

Page 91:

Control Flow Divergence

[Diagram: nested divergence – after the outer branch, the warp serializes Path A and Path B at the inner branch, then Path C]

Page 92:

Control Flow Divergence

• You don't have to worry about divergence for correctness

• You might have to think about it for performance
  • Depends on your branch conditions
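For instance (our sketch, reusing do_A/do_B from the earlier slide): branching at sub-warp granularity diverges every warp, while branching at warp granularity does not.

// Divergent: within every 32-thread warp, half the lanes take each side,
// so both sides execute serially for every warp.
if (threadIdx.x % 2)  do_A(); else do_B();

// Non-divergent: all 32 lanes of a warp share the same warp index,
// so each warp takes exactly one side.
if ((threadIdx.x / warpSize) % 2)  do_A(); else do_B();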

Page 93:

Control Flow Divergence

• Performance drops off with the degree of divergence

switch(threadIdx.x % N)
{
    case 0:
        ...
    case 1:
        ...
}

Page 94:

Divergence

[Plot: Performance (y axis, 0–35) versus Divergence (x axis, 0–18); performance falls as divergence increases]

Page 95:

Compiling a CUDA program

[Diagram: CUDA source is split into .c/.cpp host code and .gpu device code; host code is compiled against the C/C++ libraries, device code against the CUDA libraries; both are combined into an object file / fat binary and linked into an executable with embedded GPU code]

Page 96:

CUDA Makefile

CC       = nvcc
CUDA_DIR = /opt/cuda/4.1
SDK_DIR  = ${CUDA_DIR}/sdk
CFLAGS   = -I. -I${CUDA_DIR}/include -I${SDK_DIR}/C/common/inc
LDFLAGS  = -L${CUDA_DIR}/lib64 -L${SDK_DIR}/lib -L${SDK_DIR}/C/common/lib
LIB      = -lm -lrt
SOURCES  = vector_add.cu
EXECNAME = vector_add

all:
	$(CC) --ptxas-options=-v -g -G -keep -v -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS)

clean:
	$(CC) --ptxas-options=-v -g -G -keep -clean -v -o $(EXECNAME) $(SOURCES) $(LIB) $(LDFLAGS) $(CFLAGS)
	rm -f *.o core

Page 97:

OPENACC API

Page 98:

3 Ways to Accelerate Applications

• Libraries – "Drop-in" Acceleration
• OpenACC Directives – Easily Accelerate Applications
• Programming Languages – Maximum Flexibility

Page 99:

OpenACC Directives

Program myscience
  ... serial code ...
!$acc kernels
  do k = 1,n1
    do i = 1,n2
      ... parallel code ...
    enddo
  enddo
!$acc end kernels
  ...
End Program myscience

• Your original Fortran or C code
• Simple compiler hints (the !$acc directives)
• Compiler parallelizes the code
• Works on many-core GPUs & multicore CPUs
• Serial code runs on the CPU; the directive-marked region is offloaded to the GPU

Page 100:

OpenACC Open Programming Standard for Parallel Computing

Page 101:

OpenACC: The Standard for GPU Directives

• Easy: directives are the easy path to accelerate compute-intensive applications

• Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors

• Powerful: GPU directives allow complete access to the massive parallel power of a GPU

Page 102:

• High-level, with low-level access
  • Compiler directives to specify parallel regions in C, C++, Fortran
  • OpenACC compilers offload parallel regions from host to accelerator
  • Portable across OSes, host CPUs, accelerators, and compilers

• Create high-level heterogeneous programs
  • Without explicit accelerator initialization
  • Without explicit data or program transfers between host and accelerator

• Programming model allows programmers to start simple
  • Enhance with additional guidance for the compiler on loop mappings, data location, and other performance details

• Compatible with other GPU languages and libraries
  • Interoperate between CUDA C/Fortran and GPU libraries, e.g. CUFFT, CUBLAS, CUSPARSE, etc.

Page 103:

OpenACC Specification and Website

• Full OpenACC 1.0 Specification available online: http://www.openacc-standard.org
• Quick reference card also available
• Available compilers:
  • CAPS HMPP OpenACC (soon at Romeo)
  • PGI OpenACC (soon at Romeo)

Page 104:

A Very Simple Exercise: SAXPY

SAXPY in C:

void saxpy(int n, float a, float *x, float *restrict y)
{
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...

SAXPY in Fortran:

subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
!$acc end kernels
end subroutine saxpy

...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...

Page 105:

Directive Syntax

• Fortran
    !$acc directive [clause [[,] clause]…]
  Often paired with a matching end directive surrounding a structured code block:
    !$acc end directive

• C
    #pragma acc directive [clause [[,] clause]…]
  Often followed by a structured code block
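As an illustration of clauses (our addition; the copyin/copy clauses and the [start:length] subarray form are from the OpenACC 1.0 specification), the SAXPY kernel above can state its data movement explicitly instead of leaving it to the compiler:

// copyin: host-to-device only; copy: transfer in and copy the result back.
void saxpy_data(int n, float a, float *x, float *restrict y)
{
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}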

Page 106:

CUDA = much more

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL, University of Illinois, Urbana-Champaign

• Atomic, shuffle operations
• Streams
• GPU Direct
• CUBLAS, NPP, CUFFT, …
• OpenCL

Page 107:

References

• www.nvidia.com/cuda (CUDA zone, developer zone)
• Programming Massively Parallel Processors: A Hands-on Approach, by David Kirk and Wen-mei Hwu
• CUDA by Example: An Introduction to General-Purpose GPU Programming, by Jason Sanders and Edward Kandrot
• CUDA Application Design and Development, by Rob Farber

Page 108:

Questions?

Page 109:

Matrix Multiplication Example

• Generalize the adjacent_difference example

• AB = A * B
  • Each element ABij = dot(row(A,i), col(B,j))

• Parallelization strategy
  • Thread → ABij
  • 2D kernel

Page 110:

First Implementation

__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
    // calculate the row & col index of the element
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;

    float result = 0;

    // do dot product between row of a and col of b
    for(int k = 0; k < width; ++k)
        result += a[row*width+k] * b[k*width+col];

    ab[row*width+col] = result;
}

Page 111:

Idea: Use __shared__ memory to reuse global data

• Each input element is read by width threads

• Load each element into __shared__ memory and have several threads use the local version to reduce the memory bandwidth

[Diagram: width x width input matrices]

Page 112:

Tiled Multiply

• Partition the kernel loop into phases

• Load a tile of both matrices into __shared__ memory each phase

• Each phase, each thread computes a partial result

[Diagram: TILE_WIDTH x TILE_WIDTH tiles of the input matrices]

Page 113:

Better Implementation

__global__ void mat_mul(float *a, float *b, float *ab, int width)
{
    // shorthand
    int tx = threadIdx.x, ty = threadIdx.y;
    int bx = blockIdx.x,  by = blockIdx.y;

    // allocate tiles in __shared__ memory
    __shared__ float s_a[TILE_WIDTH][TILE_WIDTH];
    __shared__ float s_b[TILE_WIDTH][TILE_WIDTH];

    // calculate the row & col index
    int row = by*blockDim.y + ty;
    int col = bx*blockDim.x + tx;

    float result = 0;

Page 114:

Better Implementation

    // loop over the tiles of the input in phases
    for(int p = 0; p < width/TILE_WIDTH; ++p)
    {
        // collaboratively load tiles into __shared__
        s_a[ty][tx] = a[row*width + (p*TILE_WIDTH + tx)];
        s_b[ty][tx] = b[(p*TILE_WIDTH + ty)*width + col];
        __syncthreads();

        // dot product between row of s_a and col of s_b
        for(int k = 0; k < TILE_WIDTH; ++k)
            result += s_a[ty][k] * s_b[k][tx];
        __syncthreads();
    }

    ab[row*width+col] = result;
}

Page 115:

Use of Barriers in mat_mul

• Two barriers per phase:
  • __syncthreads after all data is loaded into __shared__ memory
  • __syncthreads after all data is read from __shared__ memory
  • Note that the second __syncthreads in phase p guards the load in phase p+1

• Use barriers to guard data
  • Guard against using uninitialized data
  • Guard against bashing live data

Page 116:

First Order Size Considerations

• Each thread block should have many threads
  • TILE_WIDTH = 16 → 16*16 = 256 threads

• There should be many thread blocks
  • 1024*1024 matrices → 64*64 = 4096 thread blocks
  • TILE_WIDTH = 16 → gives each SM 4 blocks, 1024 threads
  • Full occupancy

• Each thread block performs 2 * 256 = 512 x 4B loads for 256 * (2 * 16) = 8,192 fp ops (0.25 B/op)
  • Compare to 4 B/op for the naive kernel
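Spelling out that arithmetic (our reading of the slide's numbers): per phase, each block loads two 16x16 tiles and each of its 256 threads does one 16-wide multiply-add pass over each tile pair.

  loads per phase : 2 tiles * 256 elements * 4 B  = 2048 B
  flops per phase : 256 threads * 2 * TILE_WIDTH  = 8192 fp ops
  intensity       : 2048 B / 8192 ops             = 0.25 B/op

  naive kernel    : 2 loads * 4 B per 2 fp ops    = 4 B/op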

Page 117:

TILE_SIZE Effects

Page 118:

Memory Resources as Limit to Parallelism

• Effective use of the different memory resources reduces the number of accesses to global memory

• These resources are finite!
  • The more memory locations each thread requires → the fewer threads an SM can accommodate

Resource            Per GTX480 SM   Full Occupancy on GTX480
Registers           32768           <= 32768 / 1024 threads = 32 per thread
__shared__ Memory   48KB            <= 48KB / 8 blocks = 6KB per block