Transcript: Programming with CUDA and Parallel Algorithms · Programming with CUDA, WS09, Lecture 4

Page 1:

Lecture 4, Tuesday, 3 November 2009

Programming with CUDA and Parallel Algorithms

Waqar Saleem, Jens Müller

Page 2:

Recap

• Grid and block dimensions

• CUDA extensions to C

• built-in variables

• vector types, variable and function qualifiers

• synchronization and timing

• atomic functions

• memory fence functions, volatile variables, math, warp vote, texture memory functions

Page 3:

Recap

• Grid and block dimensions

• CUDA extensions to C

• Memory management (runtime API)

• cudaMalloc, cudaFree, cudaMemcpy

Page 4:

Loose ends

• GPGPU before CUDA? OpenGL, DirectX

• CUDA thread creation and scheduling take only a few cycles

• CPU thread creation can take up to thousands of cycles

• Device initialization happens at the first device function call

• Caution: this causes a delay (see the sketch below)
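The first-call delay matters mainly when timing kernels; a common workaround (not from the slides, a minimal sketch) is to force initialization before any timed section:

#include <cuda_runtime.h>

// Sketch: force device initialization so that context-creation cost
// does not end up inside a timed measurement.
void warmUpDevice()
{
    cudaSetDevice( 0 );   // select device 0
    cudaFree( 0 );        // harmless runtime call that triggers initialization
}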

Page 5:

Handling device memory

// outline
int main() {
    // allocate h_A, h_B, h_C, size N
    // assign values to host vectors
    // initialize device
    // allocate d_A, d_B, d_C, size N
    // copy h_A, h_B to d_A, d_B
    vAdd<<<1,N>>>( d_A, d_B, d_C );
    // copy d_C to h_C
    // output h_C
    // free host variables
    // free device variables
}

int main() {
    int N;                                   // assign N
    size_t size = N * sizeof( int );
    // allocate host vectors
    int *h_A = (int*) malloc( size );
    int *h_B = (int*) malloc( size );
    int *h_C = (int*) malloc( size );
    // assign values to vectors
    // allocate device vectors
    int *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, size );
    cudaMalloc( (void**) &d_B, size );
    cudaMalloc( (void**) &d_C, size );
    // copy inputs to the device
    cudaMemcpy( d_A, h_A, size, cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, size, cudaMemcpyHostToDevice );
    // launch one block of N threads
    vAdd<<<1,N>>>( d_A, d_B, d_C );
    // copy d_C back to h_C, output it, then free host (free) and device (cudaFree) memory
    // ...
}
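The call above assumes the vector-addition kernel from an earlier lecture; as a reminder, a minimal sketch of a matching kernel for a single block of N threads:

__global__ void vAdd( int *A, int *B, int *C )
{
    int i = threadIdx.x;    // one thread per vector element (single-block launch)
    C[i] = A[i] + B[i];
}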

Page 6:

Today

• Thread divergence

• Compiling CUDA programs (intro)

• Thread/block allocation in a MP

• Optimizing memory access

Page 7:

Thread divergence

• Threads in a block diverge when they follow different execution paths

• Divergent threads are serialized

• Slows device performance
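As an illustration (not from the slides), a branch that splits the threads of one warp diverges, while a branch aligned to warp boundaries does not:

__global__ void divergenceExample( float *data )
{
    int i = threadIdx.x;

    // Divergent: odd and even threads of the same warp take different
    // paths, so the hardware runs the two paths one after the other.
    if ( i % 2 == 0 )
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;

    // No divergence within a warp (warp size 32): all 32 threads of a
    // warp take the same path.
    if ( i / 32 == 0 )
        data[i] -= 3.0f;
}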

Page 8:

Thread divergence

• Threads in a block diverge when they follow different execution paths

• Divergent threads are serialized (slow)

[Figure: flow chart A → B? → C / D → E. After A, all threads evaluate condition B?; the threads taking one branch execute C while the others wait, then the remaining threads execute D while the first group waits, and all threads reconverge at E.]

Page 9:

NVIDIA C Compiler (nvcc)

• Compile CUDA programs with nvcc

• Separates host and device code

• host code compiled by host compiler

• device code compiled further by nvcc

• many options: emulation mode, fast math, optimization level, ...
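For example, typical invocations in the CUDA 2.x toolkits of this time look roughly as follows (flag spellings are from that era; check nvcc --help of the installed toolkit):

nvcc -o vecAdd vecAdd.cu                      # split host/device code, build executable
nvcc -deviceemu -o vecAdd vecAdd.cu           # emulation mode: run device code on the CPU
nvcc -use_fast_math -O2 -o vecAdd vecAdd.cu   # fast math intrinsics, host optimization level
nvcc --ptxas-options=-v vecAdd.cu             # report register / shared memory usage per kernel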

Page 10:

Transparent scalability

• Code written once can run on any kind of device

• Scaling (scheduling) is transparent to the user

• Imposes lack of inter-block communication

Page 11:

SPMD / SIMT

• SPMD: Single Program Multiple Data

• SIMT: Single Instruction Multiple Thread

• SIMD exposes the width of the device's vector units to the programmer

• CUDA kernels can be programmed for a device of any specification

Page 12:

• GeForce 8800GTX

• up to 8 active blocks per MP

• up to 768 active threads

• 86.4 GB/s access to global memory

• peak performance of 367 GFLOPS

• warp size of 32
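Some of these figures can be queried at run time through the runtime API; a small sketch:

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: print a few properties of device 0.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, 0 );
    printf( "%s: %d MPs, warp size %d\n",
            prop.name, prop.multiProcessorCount, prop.warpSize );
    printf( "global memory: %lu bytes, shared memory per block: %lu bytes\n",
            (unsigned long) prop.totalGlobalMem, (unsigned long) prop.sharedMemPerBlock );
    printf( "registers per block: %d, max threads per block: %d\n",
            prop.regsPerBlock, prop.maxThreadsPerBlock );
    return 0;
}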

Page 13:

Warps

• Specific to hardware, not a CUDA concept

• Scheduling unit in a MP

• Max of 768 active threads = 24 active warps per MP

• Dedicated hardware tracks IDs and execution status of threads in active warps

• This hardware limits the maximum number of active warps

Page 14:

Priority Queue of warps

• Active warps are queued and prioritized

• While a warp waits for the result of some high latency operation, another can start executing

Page 15:

Variables and memory

Variables            Memory           Scope     Lifetime
Automatic arrays     global (local)   thread    thread
Automatic scalars    registers        thread    thread
__shared__           shared           block     block
__device__           global           grid(s)   application
__constant__         constant         grid(s)   application

• __shared__ memory is allocated statically

• Caution: each block has its own private version of a __shared__ variable

• Constant memory resides in global memory

• cached, 65,536 bytes

• fast access, depending on the access pattern
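A small sketch (not from the slides) showing where each kind of variable is declared and which memory it lands in:

// file scope: one copy for the whole application
__constant__ float coeffs[16];           // constant memory
__device__   float globalBuf[256];       // global memory

__global__ void qualifierExample( float *out )
{
    __shared__ float tile[64];           // shared memory: one copy per block
    int   idx = threadIdx.x;             // automatic scalar: a register, per thread
    float tmp[4];                        // automatic array: local (global) memory, per thread

    tile[idx % 64] = globalBuf[idx % 256] * coeffs[idx % 16];
    __syncthreads();
    tmp[0] = tile[(idx + 1) % 64];
    out[idx] = tmp[0];
}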

Page 16:

Variables and memory

(table as on Page 15)

• No synchronization on global memory, so it is bad for inter-block communication

• used instead to pass information between kernel launches

Page 17:

Memory trade-off

          Speed   Size
Global    Slow    Large
Shared    Fast    Small

• Solution: partition data into small tiles that can fit in shared memory

• Kernel computation on tiles must be independent of each other

• Might require modification of the algorithm

Page 18:

Example: Matrix multiplication

Page 19:

Simple matrix multiplication kernel

// Pd, Md, Nd: global memory; width: passed via shared memory (kernel argument)
// matrices stored row-major as 1D arrays; TILE_WIDTH = block width
__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width )
{
    // row and column indices
    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    // compute Pd entry
    float Pvalue = 0;
    for ( int k = 0; k < width; ++k )
        Pvalue += Md[Row * width + k] * Nd[k * width + Col];
    // store computed entry
    Pd[Row * width + Col] = Pvalue;
}
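On the host side, one thread is launched per Pd element; a sketch of the launch configuration, assuming width is a multiple of TILE_WIDTH:

dim3 dimBlock( TILE_WIDTH, TILE_WIDTH );                 // one thread per element of a tile
dim3 dimGrid( width / TILE_WIDTH, width / TILE_WIDTH );  // one block per output tile
matrixMulKernel<<< dimGrid, dimBlock >>>( Md, Nd, Pd, width );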

Page 20:

Kernel performance

// Pd, Md, Nd: global memory; width: passed via shared memory
__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width )
{
    // ...
    for ( int k = 0; k < width; ++k )
        Pvalue += Md[Row * width + k] * Nd[k * width + Col];
    // ...
}

• Each loop iteration performs 2 global memory accesses and 2 floating-point operations

• Compute operations to Global Memory Access (CGMA) ratio = 1.0

Page 21:

Kernel performance

// Pd, Md, Nd: global memory; width: passed via shared memory
__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width )
{
    // ...
    for ( int k = 0; k < width; ++k )
        Pvalue += Md[Row * width + k] * Nd[k * width + Col];
    // ...
}

• Global memory bandwidth of 86.4 GB/s

• With CGMA = 1, every floating-point operation needs a 4-byte operand from global memory, so the kernel can compute at no more than 86.4/4 = 21.6 GFLOPS

• The card has a peak performance of 367 GFLOPS!

Page 22:

Spotting data parallelism

• Need to re-use data; example: 2x2 blocks for 4x4 matrices

Products computed by the four threads of a block (one column per thread / Pd entry, one row per loop iteration k):

Pd0,0            Pd1,0            Pd0,1            Pd1,1
Md0,0 * Nd0,0    Md0,0 * Nd1,0    Md0,1 * Nd0,0    Md0,1 * Nd1,0
Md1,0 * Nd0,1    Md1,0 * Nd1,1    Md1,1 * Nd0,1    Md1,1 * Nd1,1
Md2,0 * Nd0,2    Md2,0 * Nd1,2    Md2,1 * Nd0,2    Md2,1 * Nd1,2
Md3,0 * Nd0,3    Md3,0 * Nd1,3    Md3,1 * Nd0,3    Md3,1 * Nd1,3

Page 23:

Spotting data parallelism

• Need to re-use data; example: 2x2 blocks for 4x4 matrices

(same table as on Page 22)

Page 24:

Spotting data parallelism (cont'd)

• 4 rows/columns, each is fetched twice

• For an NxN block, each is fetched N times

(same table as on Page 22)

Page 25:

Re-organizing memory access

• Load each row/column once into shared memory and re-use it within the block

• Reduce global memory traffic by N

• Global memory access /= N, CGMA *= N

• Loaded rows, columns form a tile

• Tile size dictated by size of shared memory

• Simplest case, block size = tile size

Page 26:

Tiled kernel using shared memory

__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width )
{
    // allocate the tiles in shared memory
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    // row and column indices
    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    // compute Pd entry tile by tile
    float Pvalue = 0;
    for ( int tileNum = 0; tileNum < width / TILE_WIDTH; ++tileNum ) {
        // collaborative loading of one Md tile and one Nd tile into shared memory
        Mds[ty][tx] = Md[Row * width + tileNum * TILE_WIDTH + tx];
        Nds[ty][tx] = Nd[(tileNum * TILE_WIDTH + ty) * width + Col];
        __syncthreads();   // wait until the whole tile is loaded
        for ( int k = 0; k < TILE_WIDTH; ++k )
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();   // wait before overwriting the tile in the next iteration
    }
    Pd[Row * width + Col] = Pvalue;   // store computed entry
}

Pages 27-37:

[Figures: step-by-step illustration of the tiled kernel on 2x2 tiles. Pages 27-30 highlight threads T0,0, T1,0, T0,1 and T1,1 of block B0,0; pages 31-35 show block B0,0 loading and using the shared-memory tiles for tileNum = 0 and tileNum = 1; pages 36-37 show which Md/Nd tiles (tile 0, tile 1) are read by blocks B0,0, B1,0, B0,1 and B1,1.]

Page 38:

Tiled kernel using shared memory

(kernel code as on Page 26)

Page 39:

[Figure: tiled matrix multiplication diagram; its loop index m corresponds to tileNum in the kernel above]

Page 40:

Performance gain

• Theoretical gain for 16x16 tiles = 16 (each Md/Nd element is fetched from global memory 16 times less often)

• (86.4/4) * 16 = 345.6 GFLOPS

Page 41:

Memory limitations on Parallelism

• MP resources are split between active blocks

• Imposes limits on number of active blocks

• 8K registers for a max of 768 threads

• 8K/768 = 10 registers per thread

• If a kernel uses more than 10 registers per thread, the number of blocks resident on the MP is reduced until register usage fits

• GeForce 8800GTX has 16K shared memory for a max of 8 blocks

• ~2K shared memory per block

• ~16x16 tiles in matrix multiplication are optimal

• If a block uses more than ~2K of shared memory, the number of blocks resident on the MP is reduced until shared memory usage fits
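For instance (numbers purely for illustration): a hypothetical kernel run with 256 threads per block and 16 registers per thread needs 256 * 16 = 4096 registers per block, so the 8K registers of an MP hold only 2 resident blocks, i.e. 512 active threads instead of 768; likewise, a block using 4K of shared memory allows at most 16K / 4K = 4 resident blocks.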

Page 42:

Next time

• CUDA texture memory

• CUDA runtime and driver APIs

• Streams

Page 43:

See you next time!