Programming with CUDA and Parallel Algorithms
Programming with CUDA, WS09
Lecture 4, Tuesday, 3 November 2009
Waqar Saleem, Jens Müller
Recap
• Grid and block dimensions
• CUDA extensions to C
• built-in variables
• vector types, variable and function qualifiers
• synchronization and timing
• atomic functions
• memory fence functions, volatile variables, math, warp vote, texture memory functions
• Memory management (runtime API)
• cudaMalloc, cudaFree, cudaMemcpy
Loose ends
• GPGPU before CUDA? OpenGL, DirectX
• CUDA thread creation and scheduling take only a few cycles
• CPU threads can take up to thousands of cycles
• The device is initialized at the first device function call
• Caution: this causes a delay (see the sketch below)
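One way to avoid paying that initialization delay inside a timed region is to touch the device once up front. A minimal sketch, not from the slides; cudaFree(0) is simply a cheap runtime call that forces context creation:

#include <cuda_runtime.h>

int main() {
    // force device initialization before any timed kernel calls
    cudaSetDevice( 0 ); // select the first device
    cudaFree( 0 );      // trivial runtime call that triggers context creation
    // ... timed device work starts here ...
    return 0;
}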
Handling device memory

void main() {
  // allocate h_A, h_B, h_C, size N
  // assign values to host vectors
  // initialize device
  // allocate d_A, d_B, d_C, size N
  // copy h_A, h_B to d_A, d_B
  vAdd<<<1,N>>>( d_A, d_B, d_C );
  // copy d_C to h_C
  // output h_C
  // free host variables
  // free device variables
}
int main() {
  int N; // assign N
  size_t size = N * sizeof( int );
  int *h_A = (int*) malloc( size );
  int *h_B = (int*) malloc( size );
  int *h_C = (int*) malloc( size );
  // assign values to vectors
  int *d_A, *d_B, *d_C;
  cudaMalloc( (void**) &d_A, size );
  cudaMalloc( (void**) &d_B, size );
  cudaMalloc( (void**) &d_C, size );
  cudaMemcpy( d_A, h_A, size, cudaMemcpyHostToDevice );
  cudaMemcpy( d_B, h_B, size, cudaMemcpyHostToDevice );
  vAdd<<<1,N>>>( d_A, d_B, d_C );
  // ...
}
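The slide elides the remaining steps. Following the outline above (copy back, output, free), they might look like this, together with a matching vAdd kernel; the kernel body is a sketch of the usual one-thread-per-element vector add:

__global__ void vAdd( int *A, int *B, int *C ) {
  int i = threadIdx.x; // one thread per vector element
  C[i] = A[i] + B[i];
}

// ... continuing main() after the kernel launch:
cudaMemcpy( h_C, d_C, size, cudaMemcpyDeviceToHost ); // copy d_C to h_C
// output h_C
free( h_A ); free( h_B ); free( h_C );                // free host variables
cudaFree( d_A ); cudaFree( d_B ); cudaFree( d_C );    // free device variables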
Today
• Thread divergence
• Compiling CUDA programs (intro)
• Thread/block allocation in an MP
• Optimizing memory access
Thread divergence
• Threads in a block diverge when they follow different execution paths
• Divergent threads are serialized
• This slows down device performance
[Diagram: control flow A → branch B? → paths C and D → E. While threads on one path execute, threads on the other path wait; all threads reconverge at E.]
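A minimal sketch of a kernel that triggers such divergence; the kernel name and branch condition are illustrative:

__global__ void divergentKernel( float *data ) {
  int tid = threadIdx.x;
  // even and odd threads of the same warp follow different paths,
  // so the warp executes the two branches one after the other
  if ( tid % 2 == 0 )
    data[tid] *= 2.0f;
  else
    data[tid] += 1.0f;
}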
NVIDIA C Compiler (nvcc)
• Compile CUDA programs with nvcc
• Separates host and device code
• host code compiled by host compiler
• device code compiled further by nvcc
• many options: emulation mode, fast math, optimization level, ...
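For example (the file name vadd.cu is hypothetical; -deviceemu and -use_fast_math are nvcc options of this CUDA generation):

nvcc vadd.cu -o vadd                  # compile host and device code
nvcc -deviceemu vadd.cu -o vadd       # build for device emulation mode
nvcc -use_fast_math vadd.cu -o vadd   # use faster, less precise math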
Transparent scalability
• Code written once can run on any kind of device
• Scaling (scheduling) is transparent to the user
• This imposes a lack of inter-block communication
SPMD / SIMT
• SPMD: Single Program Multiple Data
• SIMT: Single Instruction Multiple Thread
• SIMD exposes the vector width of the device
• CUDA kernels can be programmed for a device of any specification
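One common pattern that makes a kernel independent of the launch configuration and device size is a grid-stride loop; a sketch, not from the slides:

__global__ void vAddAnySize( int *A, int *B, int *C, int N ) {
  // each thread strides over the data, so any grid/block size works
  for ( int i = blockIdx.x * blockDim.x + threadIdx.x;
        i < N;
        i += gridDim.x * blockDim.x )
    C[i] = A[i] + B[i];
}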
GeForce 8800 GTX
• up to 8 active blocks per MP
• up to 768 active threads per MP
• 86.4 GB/s access to global memory
• peak performance of 367 GFLOPS
• warp size of 32
Warps
• Specific to hardware, not a CUDA concept
• The scheduling unit in an MP
• Max of 768 active threads = 24 active warps per MP
• Dedicated hardware tracks the IDs and execution status of threads in active warps
• This hardware limits the maximum number of active warps
Priority queue of warps
• Active warps are queued and prioritized
• While a warp waits for the result of some high-latency operation, another can start executing
Variables and memory

Variables            Memory           Scope     Lifetime
Automatic arrays     global (local)   thread    thread
Automatic scalars    registers        thread    thread
__shared__           shared           block     block
__device__           global           grid(s)   application
__constant__         constant         grid(s)   application
• Shared memory is allocated statically
• Caution: each block has its own private version
• Constant memory resides in global memory
• cached, 65,536 bytes in total
• fast access, depending on the access pattern
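A minimal sketch showing the qualifiers from the table in use; all names are illustrative, and the kernel assumes a launch with 64 threads per block:

__constant__ float coeffs[16]; // constant memory: cached, application lifetime,
                               // filled from the host with cudaMemcpyToSymbol

__device__ int callCount;      // global memory, visible to all grids (illustration only)

__global__ void qualifierDemo( float *out ) {
  __shared__ float tile[64];   // shared memory: one private copy per block
  int i = threadIdx.x;         // automatic scalar: lives in a register
  tile[i] = coeffs[i % 16];
  __syncthreads();             // make the tile visible to the whole block
  out[blockIdx.x * blockDim.x + i] = tile[i];
}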
Variables and memory (cont'd)
• Global memory has no synchronization, so it is bad for inter-block communication within a kernel
• It is used to pass information between kernel launches
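For instance (the kernel names are hypothetical), two launches in the same stream run in order, so the second kernel sees what the first one wrote to global memory:

stage1<<<grid, block>>>( d_data ); // writes results to global memory
stage2<<<grid, block>>>( d_data ); // reads them: launches in one stream are serialized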
Memory trade-off

          Speed   Size
Global    Slow    Large
Shared    Fast    Small

• Solution: partition data into small tiles that can fit in shared memory
• Kernel computation on tiles must be independent of each other
• Might require modification of the algorithm
Example: Matrix multiplication
Simple matrix multiplication kernel

// Pd, Md, Nd: global memory, width: shared memory (kernel arguments are passed via shared memory)
__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width ) {
  // row and column indices
  int bx = blockIdx.x, by = blockIdx.y;
  int tx = threadIdx.x, ty = threadIdx.y;
  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  // compute Pd entry; matrices are row-major, so M(row,col) = M[row * width + col]
  float Pvalue = 0;
  for ( int k = 0; k < width; ++k )
    Pvalue += Md[Row * width + k] * Nd[k * width + Col];
  // store computed entry
  Pd[Row * width + Col] = Pvalue;
}
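A launch configuration matching the kernel's indexing, assuming width is a multiple of TILE_WIDTH and that the device pointers d_M, d_N, d_P are set up as in the earlier vector-add example; the value 16 is an assumption consistent with the 16x16 tiles discussed later:

#define TILE_WIDTH 16

dim3 dimGrid( width / TILE_WIDTH, width / TILE_WIDTH );
dim3 dimBlock( TILE_WIDTH, TILE_WIDTH );
matrixMulKernel<<<dimGrid, dimBlock>>>( d_M, d_N, d_P, width );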
Kernel performance

// Pd, Md, Nd: global memory, width: shared memory
__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width ) {
  // ...
  for ( int k = 0; k < width; ++k )
    Pvalue += Md[Row * width + k] * Nd[k * width + Col];
  // ...
}

• Each loop iteration performs 2 global memory accesses and 2 floating-point operations (one multiply, one add)
• Compute operations to Global Memory Access (CGMA) ratio = 1.0
• Global memory bandwidth is 86.4 GB/s
• Each float is 4 bytes, so with CGMA = 1.0 the loop computes at no more than 86.4/4 = 21.6 GFLOPS
• The card has a peak performance of 367 GFLOPS!
Spotting data parallelism
• Need to re-use data; example: 2x2 blocks for 4x4 matrices

Thread:  0,0             1,0             0,1             1,1
         Md0,0 * Nd0,0   Md0,0 * Nd1,0   Md0,1 * Nd0,0   Md0,1 * Nd1,0
         Md1,0 * Nd0,1   Md1,0 * Nd1,1   Md1,1 * Nd0,1   Md1,1 * Nd1,1
         Md2,0 * Nd0,2   Md2,0 * Nd1,2   Md2,1 * Nd0,2   Md2,1 * Nd1,2
         Md3,0 * Nd0,3   Md3,0 * Nd1,3   Md3,1 * Nd0,3   Md3,1 * Nd1,3
Spotting data parallelism (cont'd)
• 4 rows/columns, each is fetched twice (see the table above)
• For an NxN block, each is fetched N times
Re-organizing memory access
• Load each row/column once into shared memory and re-use it within the block
• Reduces global memory traffic by a factor of N
• Global memory accesses /= N, CGMA *= N
• The loaded rows and columns form a tile
• Tile size is dictated by the size of shared memory
• In the simplest case, block size = tile size
Tiled kernel using shared memory

__global__ void matrixMulKernel( float *Md, float *Nd, float *Pd, int width ) {
  // allocate tiles in shared memory
  __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
  __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
  // row and column indices
  int bx = blockIdx.x, by = blockIdx.y, tx = threadIdx.x, ty = threadIdx.y;
  int Row = by * TILE_WIDTH + ty, Col = bx * TILE_WIDTH + tx;
  // compute Pd entry tile-by-tile
  float Pvalue = 0;
  for ( int tileNum = 0; tileNum < width / TILE_WIDTH; ++tileNum ) {
    // collaborative loading into shared memory:
    // each thread loads one element of the Md tile and one of the Nd tile
    Mds[tx][ty] = Md[Row * width + (tileNum * TILE_WIDTH + tx)];
    Nds[tx][ty] = Nd[(tileNum * TILE_WIDTH + ty) * width + Col];
    __syncthreads(); // wait until the whole tile is loaded
    for ( int k = 0; k < TILE_WIDTH; ++k )
      Pvalue += Mds[k][ty] * Nds[tx][k];
    __syncthreads(); // wait before the tile is overwritten in the next iteration
  }
  Pd[Row * width + Col] = Pvalue; // store computed entry
}
[Animation: threads T0,0, T1,0, T0,1 and T1,1 of block B0,0 each compute one entry of the output tile.]
[Animation: for tileNum = 0 and then tileNum = 1, block B0,0 loads the corresponding 2x2 tiles of Md and Nd into shared memory and accumulates the partial products.]
[Diagram: the tile-loading schedule; each of the blocks B0,0, B1,0, B0,1, B1,1 loads its tile 0 of Md and Nd in the first phase and its tile 1 in the second.]
Performance gain
• Theoretical gain for 16x16 tiles = 16
• (86.4/4) * 16 = 345.6 GFLOPS, close to the card's 367 GFLOPS peak
Memory limitations on parallelism
• MP resources are split between active blocks
• This imposes limits on the number of active blocks
• 8K registers per MP for a max of 768 threads
• 8K/768 ≈ 10 registers per thread
• If a kernel uses more than 10 registers per thread, the number of blocks processed by the MP is reduced to fit the registers
• The GeForce 8800 GTX has 16K shared memory per MP for a max of 8 blocks
• ~2K shared memory per block
• 16x16 tiles (2 x 16 x 16 x 4 bytes = 2K per block) are optimal in matrix multiplication
• If a block uses more than ~2K shared memory, the number of blocks processed by the MP is reduced to fit shared memory
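A kernel's actual register and shared-memory usage can be checked at compile time by passing the verbose flag through to the PTX assembler (the file name matrixMul.cu is hypothetical):

nvcc --ptxas-options=-v matrixMul.cu -o matrixMul
     # ptxas then prints the registers and shared memory used by each kernel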
Next time
• CUDA texture memory
• CUDA runtime and driver APIs
• Streams
See you next time!