Programming with CUDA and Parallel Algorithms
Programming with CUDA, WS09
Lecture 4, Tuesday, 3 November 2009
Waqar Saleem, Jens Müller
Recap
• Grid and block dimensions
• CUDA extensions to C
• built-in variables
• vector types, variable and function qualifiers
• synchronization and timing
• atomic functions
• memory fence functions, volatile variables, math, warp vote, texture memory functions
• Memory management (runtime API)
• cudaMalloc, cudaFree, cudaMemcpy
Loose ends
• GPGPU before CUDA? OpenGL, DirectX
• CUDA thread creation and scheduling take only a few cycles
• CPU threads can take up to thousands of cycles
• The device is initialized at the first device function call
• Caution: this causes a delay (see the sketch below)
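One way to avoid paying that initialization delay inside a timed region is to touch the device once up front. A minimal sketch, not from the slides; cudaFree(0) is simply a cheap runtime call that forces context creation:

#include <cuda_runtime.h>

int main() {
    // force device initialization before any timed kernel calls
    cudaSetDevice( 0 ); // select the first device
    cudaFree( 0 );      // trivial runtime call that triggers context creation
    // ... timed device work starts here ...
    return 0;
}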
Handling device memory

void main() {
  // allocate h_A, h_B, h_C, size N
  // assign values to host vectors
  // initialize device
  // allocate d_A, d_B, d_C, size N
  // copy h_A, h_B to d_A, d_B
  vAdd<<<1,N>>>( d_A, d_B, d_C );
  // copy d_C to h_C
  // output h_C
  // free host variables
  // free device variables
}
int main() {
  int N; // assign N
  size_t size = N * sizeof( int );
  int *h_A = (int*) malloc( size );
  int *h_B = (int*) malloc( size );
  int *h_C = (int*) malloc( size );
  // assign values to vectors
  int *d_A, *d_B, *d_C;
  cudaMalloc( (void**) &d_A, size );
  cudaMalloc( (void**) &d_B, size );
  cudaMalloc( (void**) &d_C, size );
  cudaMemcpy( d_A, h_A, size, cudaMemcpyHostToDevice );
  cudaMemcpy( d_B, h_B, size, cudaMemcpyHostToDevice );
  vAdd<<<1,N>>>( d_A, d_B, d_C );
  // ...
}
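The slide elides the remaining steps. Following the outline above (copy back, output, free), they might look like this, together with a matching vAdd kernel; the kernel body is a sketch of the usual one-thread-per-element vector add:

__global__ void vAdd( int *A, int *B, int *C ) {
  int i = threadIdx.x; // one thread per vector element
  C[i] = A[i] + B[i];
}

// ... continuing main() after the kernel launch:
cudaMemcpy( h_C, d_C, size, cudaMemcpyDeviceToHost ); // copy d_C to h_C
// output h_C
free( h_A ); free( h_B ); free( h_C );                // free host variables
cudaFree( d_A ); cudaFree( d_B ); cudaFree( d_C );    // free device variables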
Today
• Thread divergence
• Compiling CUDA programs (intro)
• Thread/block allocation in an MP
• Optimizing memory access
Thread divergence
• Threads in a block diverge when they follow different execution paths
• Divergent threads are serialized
• This slows down device performance
[Diagram: control flow A → branch B? → paths C and D → E. While threads on one path execute, threads on the other path wait; all threads reconverge at E.]
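A minimal sketch of a kernel that triggers such divergence; the kernel name and branch condition are illustrative:

__global__ void divergentKernel( float *data ) {
  int tid = threadIdx.x;
  // even and odd threads of the same warp follow different paths,
  // so the warp executes the two branches one after the other
  if ( tid % 2 == 0 )
    data[tid] *= 2.0f;
  else
    data[tid] += 1.0f;
}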
NVIDIA C Compiler (nvcc)
• Compile CUDA programs with nvcc
• Separates host and device code
• host code compiled by host compiler
• device code compiled further by nvcc
• many options: emulation mode, fast math, optimization level, ...
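For example (the file name vadd.cu is hypothetical; -deviceemu and -use_fast_math are nvcc options of this CUDA generation):

nvcc vadd.cu -o vadd                  # compile host and device code
nvcc -deviceemu vadd.cu -o vadd       # build for device emulation mode
nvcc -use_fast_math vadd.cu -o vadd   # use faster, less precise math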
Transparent scalability
• Code written once can run on any kind of device
• Scaling (scheduling) is transparent to the user
• This imposes a lack of inter-block communication
SPMD / SIMT
• SPMD: Single Program Multiple Data
• SIMT: Single Instruction Multiple Thread
• SIMD exposes the vector width of the device
• CUDA kernels can be programmed for a device of any specification
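One common pattern that makes a kernel independent of the launch configuration and device size is a grid-stride loop; a sketch, not from the slides:

__global__ void vAddAnySize( int *A, int *B, int *C, int N ) {
  // each thread strides over the data, so any grid/block size works
  for ( int i = blockIdx.x * blockDim.x + threadIdx.x;
        i < N;
        i += gridDim.x * blockDim.x )
    C[i] = A[i] + B[i];
}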
GeForce 8800 GTX
• up to 8 active blocks per MP
• up to 768 active threads per MP
• 86.4 GB/s access to global memory
• peak performance of 367 GFLOPS
• warp size of 32
Warps
• Specific to hardware, not a CUDA concept
• The scheduling unit in an MP
• Max of 768 active threads = 24 active warps per MP
• Dedicated hardware tracks the IDs and execution status of threads in active warps
• This hardware limits the maximum number of active warps
Priority queue of warps
• Active warps are queued and prioritized
• While a warp waits for the result of some high-latency operation, another can start executing
Variables and memory

Variables            Memory           Scope     Lifetime
Automatic arrays     global (local)   thread    thread
Automatic scalars    registers        thread    thread
__shared__           shared           block     block
__device__           global           grid(s)   application
__constant__         constant         grid(s)   application
• Shared memory is allocated statically
• Caution: each block has its own private version
• Constant memory resides in global memory
• cached, 65,536 bytes in total
• fast access, depending on the access pattern
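A minimal sketch showing the qualifiers from the table in use; all names are illustrative, and the kernel assumes a launch with 64 threads per block:

__constant__ float coeffs[16]; // constant memory: cached, application lifetime,
                               // filled from the host with cudaMemcpyToSymbol

__device__ int callCount;      // global memory, visible to all grids (illustration only)

__global__ void qualifierDemo( float *out ) {
  __shared__ float tile[64];   // shared memory: one private copy per block
  int i = threadIdx.x;         // automatic scalar: lives in a register
  tile[i] = coeffs[i % 16];
  __syncthreads();             // make the tile visible to the whole block
  out[blockIdx.x * blockDim.x + i] = tile[i];
}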
Variables and memory (cont'd)
• Global memory has no synchronization, so it is bad for inter-block communication within a kernel
• It is used to pass information between kernel launches
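For instance (the kernel names are hypothetical), two launches in the same stream run in order, so the second kernel sees what the first one wrote to global memory:

stage1<<<grid, block>>>( d_data ); // writes results to global memory
stage2<<<grid, block>>>( d_data ); // reads them: launches in one stream are serialized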
Memory trade-off

          Speed   Size
Global    Slow    Large
Shared    Fast    Small

• Solution: partition data into small tiles that can fit in shared memory
• Kernel computation on tiles must be independent of each other
• Might require modification of the algorithm
Example: Matrix multiplication
Simple matrix multiplication kernel

// Pd, Md, Nd: global memory, width: shared memory (kernel arguments are passed via shared memory)
__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width ) {
  // row and column indices
  int bx = blockIdx.x, by = blockIdx.y;
  int tx = threadIdx.x, ty = threadIdx.y;
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  // compute Pd entry; matrices are row-major, so M(row,col) = M[row * width + col]
  float Pvalue = 0;
  for ( int k = 0; k < width; ++k )
    Pvalue += Md[Row * width + k] * Nd[k * width + Col];
  // store computed entry
  Pd[Row * width + Col] = Pvalue;
}
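A launch configuration matching the kernel's indexing, assuming width is a multiple of TILE_WIDTH and that the device pointers d_M, d_N, d_P are set up as in the earlier vector-add example; the value 16 is an assumption consistent with the 16x16 tiles discussed later:

#define TILE_WIDTH 16

dim3 dimGrid( width / TILE_WIDTH, width / TILE_WIDTH );
dim3 dimBlock( TILE_WIDTH, TILE_WIDTH );
matrixMulKernel<<<dimGrid, dimBlock>>>( d_M, d_N, d_P, width );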
Kernel performance

// Pd, Md, Nd: global memory, width: shared memory
__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width ) {
  // ...
  for ( int k = 0; k < width; ++k )
    Pvalue += Md[Row * width + k] * Nd[k * width + Col];
  // ...
}

• Each loop iteration performs 2 global memory accesses and 2 floating-point operations (one multiply, one add)
• Compute operations to Global Memory Access (CGMA) ratio = 1.0
• Global memory bandwidth is 86.4 GB/s
• Each float is 4 bytes, so with CGMA = 1.0 the loop computes at no more than 86.4/4 = 21.6 GFLOPS
• The card has a peak performance of 367 GFLOPS!
Spotting data parallelism
• Need to re-use data; example: 2x2 blocks for 4x4 matrices

Thread:  0,0             1,0             0,1             1,1
         Md0,0 * Nd0,0   Md0,0 * Nd1,0   Md0,1 * Nd0,0   Md0,1 * Nd1,0
         Md1,0 * Nd0,1   Md1,0 * Nd1,1   Md1,1 * Nd0,1   Md1,1 * Nd1,1
         Md2,0 * Nd0,2   Md2,0 * Nd1,2   Md2,1 * Nd0,2   Md2,1 * Nd1,2
         Md3,0 * Nd0,3   Md3,0 * Nd1,3   Md3,1 * Nd0,3   Md3,1 * Nd1,3
Spotting data parallelism (cont'd)
• 4 rows/columns, each is fetched twice (see the table above)
• For an NxN block, each is fetched N times
Re-organizing memory access
• Load each row/column once into shared memory and re-use it within the block
• Reduces global memory traffic by a factor of N
• Global memory accesses /= N, CGMA *= N
• The loaded rows and columns form a tile
• Tile size is dictated by the size of shared memory
• In the simplest case, block size = tile size
Tiled kernel using shared memory

__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width ) {
  // allocate tiles in shared memory
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
  // row and column indices
  int bx = blockIdx.x, by = blockIdx.y, tx = threadIdx.x, ty = threadIdx.y;
  int Row = by * TILE_WIDTH + ty, Col = bx * TILE_WIDTH + tx;
  // compute Pd entry tile-by-tile
  float Pvalue = 0;
  for ( int tileNum = 0; tileNum < width / TILE_WIDTH; ++tileNum ) {
    // collaborative loading into shared memory:
    // each thread loads one element of the Md tile and one of the Nd tile
    Mds[tx][ty] = Md[Row * width + (tileNum * TILE_WIDTH + tx)];
    Nds[tx][ty] = Nd[(tileNum * TILE_WIDTH + ty) * width + Col];
    __syncthreads(); // wait until the whole tile is loaded
    for ( int k = 0; k < TILE_WIDTH; ++k )
      Pvalue += Mds[k][ty] * Nds[tx][k];
    __syncthreads(); // wait before the tile is overwritten in the next iteration
  }
  Pd[Row * width + Col] = Pvalue; // store computed entry
}
[Animation: threads T0,0, T1,0, T0,1 and T1,1 of block B0,0 each compute one entry of the output tile.]
[Animation: for tileNum = 0 and then tileNum = 1, block B0,0 loads the corresponding 2x2 tiles of Md and Nd into shared memory and accumulates the partial products.]
[Diagram: the tile-loading schedule; each of the blocks B0,0, B1,0, B0,1, B1,1 loads its tile 0 of Md and Nd in the first phase and its tile 1 in the second.]
Performance gain
• Theoretical gain for 16x16 tiles = 16
• (86.4/4) * 16 = 345.6 GFLOPS, close to the card's 367 GFLOPS peak
Memory limitations on parallelism
• MP resources are split between active blocks
• This imposes limits on the number of active blocks
• 8K registers per MP for a max of 768 threads
• 8K/768 ≈ 10 registers per thread
• If a kernel uses more than 10 registers per thread, the number of blocks processed by the MP is reduced to fit the registers
• The GeForce 8800 GTX has 16K shared memory per MP for a max of 8 blocks
• ~2K shared memory per block
• 16x16 tiles (2 x 16 x 16 x 4 bytes = 2K per block) are optimal in matrix multiplication
• If a block uses more than ~2K shared memory, the number of blocks processed by the MP is reduced to fit shared memory
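A kernel's actual register and shared-memory usage can be checked at compile time by passing the verbose flag through to the PTX assembler (the file name matrixMul.cu is hypothetical):

nvcc --ptxas-options=-v matrixMul.cu -o matrixMul
     # ptxas then prints the registers and shared memory used by each kernel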
Next time
• CUDA texture memory
• CUDA runtime and driver APIs
• Streams
See you next time!