CUDA Lecture 4: CUDA Programming Basics
Transcript of CUDA Lecture 4, CUDA Programming Basics.
Prepared 6/22/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Things we need to consider:
- Control
- Synchronization
- Communication
Parallel programming languages offer different ways of dealing with the above.
CUDA Programming Basics – Slide 2
Parallel Programming Basics
CUDA programming model – basic concepts and data types
CUDA application programming interface - basic
Simple examples to illustrate basic concepts and functionalities
Performance features will be covered later
CUDA Programming Basics – Slide 3
Overview
- Basic kernels and execution on GPU
- Basic memory management
- Coordinating CPU and GPU execution
See the programming guide for the full API
CUDA Programming Basics – Slide 4
Outline of CUDA Basics
- Integrated host + device application program in C
  - Serial or modestly parallel parts in host C code
  - Highly parallel parts in device SPMD kernel C code
Programming model
- Parallel code (kernel) is launched and executed on a device by many threads
- Launches are hierarchical
  - Threads are grouped into blocks
  - Blocks are grouped into grids
- Familiar serial code is written for a thread
  - Each thread is free to execute a unique code path
  - Built-in thread and block ID variables
CUDA Programming Basics – Slide 5
CUDA – C with no shader limitations!
CUDA Programming Basics – Slide 6
CUDA – C with no shader limitations!
[Figure: alternating host and device execution]
Serial Code (host)
  . . .
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
  . . .
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
A compute device:
- Is a coprocessor to the CPU or host
- Has its own DRAM (device memory)
- Runs many threads in parallel
- Is typically a GPU but can also be another type of parallel processing device
Data-parallel portions of an application are expressed as device kernels which run on many threads.
CUDA Programming Basics – Slide 7
CUDA Devices and Threads
Differences between GPU and CPU threads:
- GPU threads are extremely lightweight; very little creation overhead
- GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
CUDA Programming Basics – Slide 8
CUDA Devices and Threads
The future of GPUs is programmable processing
So – build the architecture around the processor
CUDA Programming Basics – Slide 9
G80 – Graphics Mode
[Figure: G80 graphics-mode pipeline. Host and Input Assembler feed the Vtx, Geom, and Pixel Thread Issue units; an array of thread processors (SP pairs with L1 caches and texture filter (TF) units) sits above the L2 caches and frame-buffer (FB) partitions. Setup / Rstr / ZCull handles rasterization.]
Processors execute computing threads. New operating mode/hardware interface for computing.
CUDA Programming Basics – Slide 10
G80 CUDA Mode – A Device Example
[Figure: G80 in CUDA mode. Host and Input Assembler feed a Thread Execution Manager; the streaming-processor array is organized into groups, each with a Parallel Data Cache and Texture unit; Load/Store paths connect to Global Memory.]
CUDA Programming Basics – Slide 11
High Level View
[Figure: several SMs, each with embedded memory (EM), share Global Memory on the GPU; the GPU is connected to the CPU and chipset over PCIe.]
CUDA Programming Basics – Slide 12
Blocks of Threads Run on an SM
[Figure: a thread uses registers and per-thread local memory and runs on a streaming processor (SP); a thread block uses per-block shared memory and runs on a streaming multiprocessor (SM).]
CUDA Programming Basics – Slide 13
Whole Grid Runs on GPU
Many blocks of threads
[Figure: the blocks of a grid are distributed across the GPU's SMs; all blocks can access Global Memory.]
CUDA Programming Basics – Slide 14
Extended C
- Type qualifiers: global, device, shared, local, constant
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads
- Runtime API: memory, symbol, execution management
- Function launch
__device__ float filter[N];

__global__ void convolve (float *image) {
  __shared__ float region[M];
  ...
  region[threadIdx.x] = image[i];
  __syncthreads();
  ...
  image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc((void**)&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
Mark Murphy, “NVIDIA’s Experience with Open64,” www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc
CUDA Programming Basics – Slide 15
Extended C
[Figure: compilation flow. Integrated source (foo.cu) goes through cudacc (EDG C/C++ frontend); the CPU host code (foo.cpp) is compiled by gcc / cl, while the device portion passes through the Open64 Global Optimizer to GPU assembly (foo.s), then OCG, producing G80 SASS (foo.sass).]
- A CUDA kernel is executed by an array of threads
- All threads run the same code (SPMD)
- Each thread has an ID that it uses to compute memory addresses and make control decisions
CUDA Programming Basics – Slide 16
Arrays of Parallel Threads
[Figure: an array of threads with threadID 0 through 7, each executing the same code:
  float x = input[threadID]; float y = func(x); output[threadID] = y;]
- Divide monolithic thread array into multiple blocks
- Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
- Threads in different blocks cannot cooperate
CUDA Programming Basics – Slide 17
Thread Blocks: Scalable Cooperation
[Figure: Thread Block 0, Thread Block 1, ..., Thread Block N-1, each with threadID 0 through 7 and each executing the same code:
  float x = input[threadID]; float y = func(x); output[threadID] = y;]
- Threads launched for a parallel section are partitioned into thread blocks
  - Grid = all blocks for a given launch
- A thread block is a group of threads that can
  - Synchronize their executions
  - Communicate via shared memory
CUDA Programming Basics – Slide 18
Thread Hierarchy
- Any possible interleaving of blocks should be valid
  - Presumed to run to completion without preemption
  - Can run in any order
  - Can run concurrently OR sequentially
- Blocks may coordinate but not synchronize
  - Shared queue pointer: OK
  - Shared lock: BAD … can easily deadlock
- Independence requirement gives scalability
CUDA Programming Basics – Slide 19
Blocks Must Be Independent
A CUDA program has two pieces:
- Host code on the CPU which interfaces to the GPU
- Kernel code which runs on the GPU
At the host level, there is a choice of 2 APIs (Application Programming Interfaces):
- Runtime: simpler, more convenient
- Driver: much more verbose, more flexible, closer to OpenCL
We will only use the Runtime API in this course.
CUDA Programming Basics – Slide 20
Basics of CUDA Programming
At the host code level, there are library routines for:
- memory allocation on graphics card
- data transfer to/from device memory
  - constants
  - texture arrays (useful for lookup tables)
  - ordinary data
- error-checking
- timing
There is also a special syntax for launching multiple copies of the kernel process on the GPU.
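As an illustration, a minimal host-side sketch covering these routines (the names h_x, d_x and my_kernel are hypothetical placeholders, not part of the lecture's examples):

int n = 1024;
size_t nbytes = n * sizeof(float);
float *h_x = (float*)malloc(nbytes);
float *d_x;
cudaMalloc((void**)&d_x, nbytes);                      // memory allocation on the graphics card
cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);  // data transfer to device memory

cudaEvent_t start, stop;                               // timing with CUDA events
cudaEventCreate(&start);  cudaEventCreate(&stop);
cudaEventRecord(start);
my_kernel<<<n/256, 256>>>(d_x);                        // special kernel-launch syntax
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms;  cudaEventElapsedTime(&ms, start, stop);

cudaError_t err = cudaGetLastError();                  // error-checking
if (err != cudaSuccess) printf("%s\n", cudaGetErrorString(err));

cudaMemcpy(h_x, d_x, nbytes, cudaMemcpyDeviceToHost);  // data transfer from device memory
cudaFree(d_x);  free(h_x);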
CUDA Programming Basics – Slide 21
Basics of CUDA Programming
Each thread uses IDs to decide what data to work on:
- Block ID: 1-D or 2-D, unique within a grid
- Thread ID: 1-D, 2-D or 3-D, unique within a block
- Dimensions set at launch; can be unique for each grid
CUDA Programming Basics – Slide 22
Block IDs and Thread IDs
[Figure 3.2: An Example of CUDA Thread Organization. The host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; Grid 1 contains Blocks (0, 0), (1, 0), (0, 1) and (1, 1); Block (1, 1) is expanded to show Threads (0,0,0) through (3,1,0) in front and (0,0,1) through (3,0,1) behind. Courtesy: NVIDIA.]
Built-in variables: threadIdx, blockIdx, blockDim, gridDim
Simplifies memory addressing when processing multidimensional data:
- Image processing
- Solving PDEs on volumes
- …
CUDA Programming Basics – Slide 23
Block IDs and Thread IDs
[Figure 3.2 repeated from the previous slide.]
In its simplest form a kernel launch looks like:
  kernel_routine<<<gridDim, blockDim>>>(args);
where
- gridDim is the number of copies of the kernel (the “grid” size)
- blockDim is the number of threads within each copy (the “block” size)
- args is a limited number of arguments, usually mainly pointers to arrays in graphics memory, and some constants which get copied by value
The more general form allows gridDim and blockDim to be 2-D or 3-D to simplify application programs.
CUDA Programming Basics – Slide 24
Basics of CUDA Programming
At the lower level, when one copy of the kernel is started on an SM it is executed by a number of threads, each of which knows about:
- some variables passed as arguments
- pointers to arrays in device memory (also arguments)
- global constants in device memory
- shared memory and private registers/local variables
- some special variables:
  - gridDim: size (or dimensions) of grid of blocks
  - blockIdx: index (or 2-D/3-D indices) of block
  - blockDim: size (or dimensions) of each block
  - threadIdx: index (or 2-D/3-D indices) of thread
CUDA Programming Basics – Slide 25
Basics of CUDA Programming
Suppose we have 1000 blocks, and each one has 128 threads – how does it get executed?
On current Tesla hardware, would probably get 8 blocks running at the same time on each SM, and each block has 4 warps => 32 warps running on each SM
Each clock tick, the SM warp scheduler decides which warp to execute next, choosing from those not waiting for:
- data coming from device memory (memory latency)
- completion of earlier instructions (pipeline delay)
The programmer doesn’t have to worry about this level of detail, just make sure there are lots of threads / warps.
CUDA Programming Basics – Slide 26
Basics of CUDA Programming
In the simplest case, we have a 1-D grid of blocks, and a 1-D set of threads within each block.
If we want to use a 2-D set of threads, then blockDim.x, blockDim.y give the dimensions, and threadIdx.x, threadIdx.y give the thread indices
To launch the kernel we would use something like
  dim3 nthreads(16, 4);
  my_new_kernel<<<nblocks, nthreads>>>(d_x);
where dim3 is a special CUDA datatype with 3 components .x, .y, .z each initialized to 1.
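For instance, since unspecified components default to 1, the launch above creates blocks of 16 × 4 × 1 threads (the grid size of 8 below is an arbitrary illustration):

dim3 nthreads(16, 4);   // nthreads.x = 16, nthreads.y = 4, nthreads.z defaults to 1
dim3 nblocks(8);        // a 1-D grid: nblocks.x = 8, nblocks.y = nblocks.z = 1
my_new_kernel<<<nblocks, nthreads>>>(d_x);   // 8 blocks of 64 threads each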
CUDA Programming Basics – Slide 27
Basics of CUDA Programming
CUDA Programming Basics – Slide 28
For Example
[Figure 3.2 again: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; one Block (1, 1) is expanded to show its 4 × 2 × 2 threads. Courtesy: NVIDIA.]
Launch with
  dim3 dimGrid(2, 2);
  dim3 dimBlock(4, 2, 2);
  kernelFunc<<<dimGrid, dimBlock>>>(…);
Zoomed in on the block with blockIdx.x = blockIdx.y = 1, blockDim.x = 4, blockDim.y = blockDim.z = 2.
Each thread in block has coordinates (threadIdx.x, threadIdx.y, threadIdx.z)
A similar approach is used for 3-D threads and/or 2-D grids. This can be very useful in 2-D / 3-D finite difference applications.
How do 2-D / 3-D threads get divided into warps? A 1-D thread ID is defined by
  threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y
and this is then broken up into warps of size 32.
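As a sketch of that calculation inside a kernel (the names tid, warp and lane are illustrative):

int tid  = threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
int warp = tid / 32;   // e.g. in a 16 x 4 block, threads with tid 0-31 form warp 0, tid 32-63 form warp 1
int lane = tid % 32;   // position within the warp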
CUDA Programming Basics – Slide 29
Basics of CUDA Programming
[Figure: CUDA memory model. A grid contains Blocks (0, 0) and (1, 0); each block has its own Shared Memory; each thread, e.g. Thread (0, 0) and Thread (1, 0), has its own Registers; all blocks and the Host access Global Memory.]
Global memory:
- Main means of communicating R/W data between host and device
- Contents visible to all threads
- Long latency access
We will focus on global memory for now; constant and texture memory will come later.
CUDA Programming Basics – Slide 30
CUDA Memory Model Overview
CUDA Programming Basics – Slide 31
Memory Model
[Figure: Kernel 0 and Kernel 1 execute as sequential kernels, both accessing the per-device Global Memory.]
- The API is an extension to the ANSI C programming language
  - Low learning curve
- The hardware is designed to enable a lightweight runtime and driver
  - High performance
CUDA Programming Basics – Slide 32
CUDA API Highlights: Easy and Lightweight
- CPU and GPU have separate memory spaces
  - Data is moved across the PCIe bus
  - Use functions to allocate/set/copy memory on the GPU; very similar to corresponding C functions
- Pointers are just addresses
  - Can’t tell from the pointer value whether the address is on the CPU or GPU
  - Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and vice-versa
CUDA Programming Basics – Slide 33
Memory Spaces
[Figure: CUDA memory model repeated from Slide 30.]
cudaMalloc():
- Allocates an object in the device global memory
- Requires two parameters
  - Address of a pointer to the allocated object
  - Size of the allocated object
cudaFree():
- Frees objects from device global memory
- Takes a pointer to the freed object
CUDA Programming Basics – Slide 34
CUDA Device Memory Allocation
Code example:
- Allocate a 64-by-64 single precision float array
- Attach the allocated storage to Md
- “d” is often used to indicate a device data structure
CUDA Programming Basics – Slide 35
CUDA Device Memory Allocation
int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaMemset(Md, 0, size);
cudaFree(Md);
[Figure: CUDA memory model repeated from Slide 30.]
cudaMemcpy():
- Memory data transfer
- Requires four parameters
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type of transfer: host to host, host to device, device to host, device to device
- Asynchronous transfer
CUDA Programming Basics – Slide 36
CUDA Host-Device Data Transfer
cudaMemcpy():
- Returns after the copy is complete
- Blocks the CPU thread until all bytes have been copied
- Doesn’t start copying until previous CUDA calls complete
- Non-blocking copies are also available
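A minimal sketch of such a non-blocking copy, assuming page-locked (pinned) host memory and a CUDA stream (both beyond what this lecture covers):

size_t nbytes = 1024 * sizeof(float);
float *h_x, *d_x;
cudaMallocHost((void**)&h_x, nbytes);    // page-locked host allocation (required for true overlap)
cudaMalloc((void**)&d_x, nbytes);

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_x, h_x, nbytes, cudaMemcpyHostToDevice, stream);  // returns immediately
// ... the CPU can do other work here ...
cudaStreamSynchronize(stream);           // wait for the copy to finish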
CUDA Programming Basics – Slide 37
Memory Model
[Figure: Host memory and Device 0 memory / Device 1 memory, with cudaMemcpy() transferring data between them.]
Code example:
- Transfer a 64-by-64 single precision float array
- M is in host memory and Md is in device memory
- cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost and cudaMemcpyDeviceToDevice are symbolic constants
CUDA Programming Basics – Slide 38
CUDA Host-Device Data Transfer
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
CUDA Programming Basics – Slide 39
First Simple CUDA Example

#include <stdio.h>
#include <stdlib.h>

int main()
{
  int dimx = 16;
  int num_bytes = dimx*sizeof(int);
  int *d_a=0, *h_a=0;            // device and host pointers

  h_a = (int*)malloc(num_bytes);
  cudaMalloc((void**)&d_a, num_bytes);

  if (0==h_a || 0==d_a) {
    printf("couldn't allocate memory\n");
    return 1;
  }

  cudaMemset(d_a, 0, num_bytes);
  cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);

  for (int i=0; i<dimx; i++)
    printf("%d ", h_a[i]);
  printf("\n");

  free(h_a);
  cudaFree(d_a);
  return 0;
}
C/C++ with some restrictions:
- Can only access GPU memory
- No variable number of arguments
- No static variables
- No recursion
- No dynamic polymorphism
Must be declared with a qualifier:
- __global__ : launched by CPU, cannot be called from GPU
- __device__ : called from other GPU functions, cannot be called by the CPU
- __host__ : can be called by the CPU
CUDA Programming Basics – Slide 40
Code Executed on GPU
- __global__ defines a kernel function; must return void
- __device__ and __host__ can be used together
  - Sample use: overloading operators
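A small sketch of that overloading use case (the vec2 type here is hypothetical, not from the lecture):

struct vec2 { float x, y; };

// Compiled for both host and device, so the same operator works in CPU and GPU code
__host__ __device__ vec2 operator+(vec2 a, vec2 b)
{
  vec2 c;
  c.x = a.x + b.x;
  c.y = a.y + b.y;
  return c;
}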
CUDA Programming Basics – Slide 41
CUDA Function Declarations

                                   Executed on the   Only callable from the
__device__ float DeviceFunc()      Device            Device
__global__ void  KernelFunc()      Device            Host
__host__   float HostFunc()        Host              Host
__device__ int reduction_lock = 0;
The __device__ prefix tells nvcc this is a global variable in the GPU, not the CPU.
The variable can be read and modified by any kernel
- Its lifetime is the lifetime of the whole application
- Can also declare arrays of fixed size
- Can be read/written by host code using the special routines cudaMemcpyToSymbol and cudaMemcpyFromSymbol, or with standard cudaMemcpy in combination with cudaGetSymbolAddress
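For example, a host-side sketch of resetting and reading reduction_lock with these routines:

int zero = 0, lock_val;
cudaMemcpyToSymbol(reduction_lock, &zero, sizeof(int));        // write the device variable from the host
cudaMemcpyFromSymbol(&lock_val, reduction_lock, sizeof(int));  // read it back

// or, equivalently, with standard cudaMemcpy via the symbol's address
int *d_lock;
cudaGetSymbolAddress((void**)&d_lock, reduction_lock);
cudaMemcpy(d_lock, &zero, sizeof(int), cudaMemcpyHostToDevice);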
CUDA Programming Basics – Slide 42
CUDA Function Declarations
__device__ functions cannot have their address taken
For functions executed on the device:
- No recursion
- No static variable declarations inside the function
- No variable number of arguments
CUDA Programming Basics – Slide 43
CUDA Function Declarations
As seen, a kernel function must be called with an execution configuration:
Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.
CUDA Programming Basics – Slide 44
Calling a Kernel Function – Thread Creation
__global__ void KernelFunc(…);
dim3 DimGrid(100, 50);           // 5000 thread blocks
dim3 DimBlock(4, 8, 8);          // 256 threads/block
size_t SharedMemBytes = 64;      // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(…);
The kernel code looks fairly normal once you get used to two things:
- The code is written from the point of view of a single thread
  - Quite different to OpenMP multithreading
  - Similar to MPI, where you use the MPI “rank” to identify the MPI process
  - All local variables are private to that thread
- You need to think about where each variable lives
  - Any operation involving data in the device memory forces its transfer to/from registers in the GPU
  - No cache on old hardware, so a second operation with the same data will force a second transfer
  - Often better to copy the value into a local register variable (see the sketch below)
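A minimal sketch of that last point (the kernel and array names are made up for illustration):

__global__ void scale_twice(float *x, const float *u)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  float ui = u[i];               // one transfer from device memory into a register
  x[i] = ui * ui + 2.0f * ui;    // reuse the register instead of reading u[i] again
}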
CUDA Programming Basics – Slide 45
Basics of CUDA Programming
CUDA Programming Basics – Slide 46
Next CUDA Example: Vector Addition
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  C[i] = A[i] + B[i];
}

int main()
{
  int N   = 16;   // total number of elements in the vector/array
  int TPB = 4;    // number of threads per block

  float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

  // allocate and initialize host (CPU) memory
  h_A = (float*)malloc(N * sizeof(float));
  h_B = (float*)malloc(N * sizeof(float));
  h_C = (float*)malloc(N * sizeof(float));
  // ... assign values to h_A and h_B ...

  // allocate device (GPU) memory
  cudaMalloc((void**) &d_A, N * sizeof(float));
  cudaMalloc((void**) &d_B, N * sizeof(float));
  cudaMalloc((void**) &d_C, N * sizeof(float));

  // copy host memory to device
  cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

  // Run grid of N/TPB blocks of TPB threads each
  vecAdd<<< N/TPB, TPB >>>(d_A, d_B, d_C);

  // copy result back to host memory
  cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

  // do something with the result ...

  // free device (GPU) and host (CPU) memory
  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
  free(h_A); free(h_B); free(h_C);
  return 0;
}
- The __global__ identifier says it’s a kernel function
- Each thread sets one element of the C[] array
- Within each block of threads, threadIdx.x ranges from 0 to blockDim.x-1, so each thread has a unique value for i
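One common variant, not shown in the lecture: when N is not a multiple of the block size, pass the length to the kernel and guard the store (vecAddGuarded is just a sketch):

__global__ void vecAddGuarded(float* A, float* B, float* C, int n)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < n)                 // threads past the end of the array do nothing
    C[i] = A[i] + B[i];
}

// launched with enough blocks to cover all n elements:
// vecAddGuarded<<<(n + TPB - 1) / TPB, TPB>>>(d_A, d_B, d_C, n);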
CUDA Programming Basics – Slide 47
Next CUDA Example: Vector Addition
CUDA Programming Basics – Slide 48
Kernel Variations and Output

__global__ void kernel( int *a )
{
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
CUDA Programming Basics – Slide 49
Next CUDA Example: Kernel with 2-D Addressing
#include <stdio.h>
#include <stdlib.h>

__global__ void kernel( int *a, int dimx, int dimy )
{
  int ix  = blockIdx.x*blockDim.x + threadIdx.x;
  int iy  = blockIdx.y*blockDim.y + threadIdx.y;
  int idx = iy*dimx + ix;
  a[idx] = a[idx]+1;
}

int main()
{
  int dimx = 16;
  int dimy = 16;
  int num_bytes = dimx*dimy*sizeof(int);
  int *d_a=0, *h_a=0;            // device and host pointers

  h_a = (int*)malloc(num_bytes);
  cudaMalloc((void**)&d_a, num_bytes);

  if (0==h_a || 0==d_a) {
    printf("couldn't allocate memory\n");
    return 1;
  }

  cudaMemset(d_a, 0, num_bytes);

  dim3 grid, block;
  block.x = 4;
  block.y = 4;
  grid.x  = dimx / block.x;
  grid.y  = dimy / block.y;

  kernel<<<grid, block>>>(d_a, dimx, dimy);

  cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);

  for (int row=0; row<dimy; row++) {
    for (int col=0; col<dimx; col++)
      printf("%d ", h_a[row*dimx+col]);
    printf("\n");
  }

  free(h_a);
  cudaFree(d_a);
  return 0;
}
A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs:
- Leave shared memory usage until later
- Local, register usage
- Thread ID usage
- Memory data transfer API between host and device
- Assume square matrices for simplicity
CUDA Programming Basics – Slide 50
A Simple Running Example: Matrix Multiplication
P = M × N of size WIDTH-by-WIDTH
Without tiling:
- One thread calculates one element of P
- M and N are loaded WIDTH times from global memory
CUDA Programming Basics – Slide 51
Programming Model: Square Matrix Multiplication Example
[Figure: matrices M, N and P, each WIDTH by WIDTH; P = M × N.]
CUDA Programming Basics – Slide 52
Memory Layout of a Matrix in C
[Figure: a 4 × 4 matrix M stored in row-major order; linearized as M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3.]
CUDA Programming Basics – Slide 53
Memory Layout of a Matrix in the Textbook
[Figure: the same matrix using the textbook's element labels; linearized as M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3.]
CUDA Programming Basics – Slide 54
Step 1: Matrix Multiplication, A Simple Host Version in C
[Figure: matrices M, N and P with loop indices i, j and k; P(i, j) accumulates over k.]
CUDA Programming Basics – Slide 55
Step 1: Matrix Multiplication, A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}
CUDA Programming Basics – Slide 56
Step 2: Input Matrix Data Transfer (Host-Side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
  int size = Width * Width * sizeof(float);
  float *Md, *Nd, *Pd;

  // allocate and load M, N to device memory
  cudaMalloc((void**)&Md, size);
  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

  cudaMalloc((void**)&Nd, size);
  cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

  // allocate P on the device
  cudaMalloc((void**)&Pd, size);
CUDA Programming Basics – Slide 57
Step 3: Output Matrix Data Transfer (Host-Side Code)

  // kernel invocation code – to be shown later (Step 5)
  …

  // read P from the device
  cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

  // free device matrices
  cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
CUDA Programming Basics – Slide 58
Step 4: Kernel Function (Overview)
[Figure: matrices Md, Nd and Pd, each WIDTH by WIDTH; each thread uses threadIdx.x and threadIdx.y to pick a column of Nd and a row of Md, looping over k.]
CUDA Programming Basics – Slide 59
Step 4: Kernel Function

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  // Pvalue is used to store the element of the matrix
  // that is computed by the thread
  float Pvalue = 0;

  for (int k = 0; k < Width; ++k) {
    float Melement = Md[threadIdx.y * Width + k];
    float Nelement = Nd[k * Width + threadIdx.x];
    Pvalue += Melement * Nelement;
  }

  Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}
CUDA Programming Basics – Slide 60
Step 5: Kernel Invocation (Host-Side Code)

// Set up the execution configuration
dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);

// Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
- One block of threads computes matrix Pd
  - Each thread computes one element of Pd
- Each thread
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Performs one multiply and addition for each pair of Md and Nd elements
  - Compute to off-chip memory access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block
CUDA Programming Basics – Slide 61
Only One Thread Block Used
CUDA Programming Basics – Slide 62
Only One Thread Block Used
[Figure: Grid 1 with a single Block 1 of WIDTH × WIDTH threads computing Pd from Md and Nd; e.g. Thread (2, 2) combines the row (3 2 5 4) of Md with the column (2 4 2 6) of Nd to produce the element 48.]
- Have each 2-D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
  - Each has (TILE_WIDTH)² threads
- Generate a 2-D grid of (WIDTH / TILE_WIDTH)² blocks
- You still need to put a loop around the kernel call for cases where WIDTH / TILE_WIDTH is greater than the max grid size (64K); a sketch follows below
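A hypothetical sketch of that host-side loop, assuming the kernel is extended to take block offsets (bx0, by0 and blocksPerSide are made-up names, not part of the lecture's code):

int blocksPerSide = Width / TILE_WIDTH;
int maxGrid = 65535;                       // per-dimension grid limit on this hardware generation
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);

for (int by0 = 0; by0 < blocksPerSide; by0 += maxGrid)
  for (int bx0 = 0; bx0 < blocksPerSide; bx0 += maxGrid) {
    int gx = blocksPerSide - bx0; if (gx > maxGrid) gx = maxGrid;
    int gy = blocksPerSide - by0; if (gy > maxGrid) gy = maxGrid;
    dim3 dimGrid(gx, gy);
    // inside the kernel, bx0/by0 would be added to blockIdx.x/blockIdx.y when computing Row and Col
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width /*, bx0, by0 */);
  }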
CUDA Programming Basics – Slide 63
Handling Square Matrices with Arbitrary Size
CUDA Programming Basics – Slide 64
Matrix Multiplication Using Multiple Blocks
[Figure: Pd is broken into TILE_WIDTH × TILE_WIDTH tiles; block indices (bx, by) select a tile of Pd and thread indices (tx, ty) select an element within it.]
- Break Pd up into tiles
- Each block calculates one tile
  - Each thread calculates one element
- Block size equals tile size
CUDA Programming Basics – Slide 65
A Small Example: Multiplication
[Figure: a 4 × 4 Pd divided into four 2 × 2 tiles with TILE_WIDTH = 2; Block (0,0), Block (1,0), Block (0,1) and Block (1,1) each compute one tile, e.g. Block (0,0) computes Pd0,0, Pd1,0, Pd0,1 and Pd1,1 from the corresponding rows of Md and columns of Nd.]
CUDA Programming Basics – Slide 66
Revised Matrix Multiplication Kernel Using Multiple Blocks

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  // Calculate the row index of the Pd element and M
  int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  // Calculate the column index of Pd and N
  int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

  float Pvalue = 0;
  // each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k)
    Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

  Pd[Row*Width+Col] = Pvalue;
}
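The slides do not show the matching invocation here; a sketch consistent with the tiling described above (assuming Width is a multiple of TILE_WIDTH) would be:

dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);   // one block per tile of Pd
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                  // one thread per element of the tile
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);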
- All threads in a block execute the same kernel program (SPMD)
- Programmer declares block:
  - Block size: 1 to 512 concurrent threads
  - Block shape: 1-D, 2-D, or 3-D
  - Block dimensions in threads
- Threads have thread id numbers within the block
  - The thread program uses the thread id to select work and address shared data
CUDA Tools and Threads – Slide 67
CUDA Thread Block
[Figure: a thread block with Thread Id # 0 1 2 3 … m, each thread running the thread program. Courtesy: John Nickolls, NVIDIA.]
- Threads in the same block share data and synchronize while doing their share of the work
- Threads in different blocks cannot cooperate
  - Each block can execute in any order relative to other blocks!
CUDA Tools and Threads – Slide 68
CUDA Thread Block
[Figure repeated from the previous slide.]
- Hardware is free to assign blocks to any processor at any time
  - A kernel scales across any number of parallel processors
- Each block can execute in any order relative to other blocks
CUDA Tools and Threads – Slide 69
Transparent Scalability
[Figure: a kernel grid of Blocks 0-7. On a device with two SMs the blocks run two at a time (Block 0, Block 1; then Block 2, Block 3; and so on); on a device with four SMs they run four at a time (Blocks 0-3, then Blocks 4-7). Time runs downward.]
Processors execute computing threads. New operating mode/hardware interface for computing.
CUDA Tools and Threads – Slide 70
G80 CUDA Mode – A Review
[Figure: G80 in CUDA mode, repeated from Slide 11.]
- Threads are assigned to streaming multiprocessors (SMs) in block granularity
  - Up to 8 blocks to each SM as resources allow
  - Each SM in G80 can take up to 768 threads
    - Could be 256 (threads/block) × 3 blocks
    - Or 128 (threads/block) × 6 blocks, etc.
- Threads run concurrently
  - Each SM maintains thread/block id numbers
  - Each SM manages/schedules thread execution
CUDA Tools and Threads – Slide 71
G80 Example: Executing Thread Blocks
CUDA Tools and Threads – Slide 72
G80 Example: Executing Thread Blocks
[Figure: blocks of threads t0 t1 t2 … tm are assigned to SM 0 and SM 1; each SM has an MT IU (multithreaded instruction unit), SPs and Shared Memory.]
Flexible resource allocation
- Each block is executed as 32-thread warps
  - An implementation decision, not part of the CUDA programming model
  - Warps are scheduling units in an SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in an SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 × 3 = 24 warps
CUDA Tools and Threads – Slide 73
G80 Example: Thread Scheduling
CUDA Tools and Threads – Slide 74
G80 Example: Thread Scheduling
[Figure: Block 1 and Block 2 warps (threads t0 t1 t2 … t31) feed the Streaming Multiprocessor's Instruction L1 cache and Instruction Fetch/Dispatch unit, which drives the SPs, SFUs and Shared Memory.]
Each SM implements zero-overhead warp scheduling:
- At any time, only one of the warps is executed by an SM
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible warps are selected for execution on a prioritized scheduling policy
- All threads in a warp execute the same instruction when selected
CUDA Tools and Threads – Slide 75
G80 Example: Thread Scheduling
[Figure: warp scheduling timeline (TB = Thread Block, W = Warp). As TB1 W1, TB2 W1 and TB3 W2 stall in turn, the SM switches among the warps of TB1, TB2 and TB3 (e.g. TB1 W1 instructions 1-6, then TB2 W1, TB3 W1, TB2 W1, TB1 W1 instructions 7-8, TB1 W2, TB1 W3, TB3 W2), so execution never idles. Time runs left to right.]
For matrix multiplication using multiple blocks, should I use 8 × 8, 16 × 16 or 32 × 32 blocks?
For 8 × 8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
For 16 × 16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.
For 32 × 32, we have 1024 threads per Block. Not even one can fit into an SM!
CUDA Tools and Threads – Slide 76
G80 Block Granularity Considerations
The API is an extension to the C programming language
It consists of:
- Language extensions
  - To target portions of the code for execution on the device
- A runtime library split into:
  - A common component providing built-in vector types and a subset of the C runtime library in both host and device codes
  - A host component to control and access one or more devices from the host
  - A device component providing device-specific functions
CUDA Tools and Threads – Slide 77
Application Programming Interface
- dim3 gridDim; : dimensions of the grid in blocks (gridDim.z unused)
- dim3 blockDim; : dimensions of the block in threads
- dim3 blockIdx; : block index within the grid
- dim3 threadIdx; : thread index within the block
CUDA Tools and Threads – Slide 78
Language Extensions: Built-in Variables
- pow, sqrt, cbrt, hypot
- exp, exp2, expm1
- log, log2, log10, log1p
- sin, cos, tan, asin, acos, atan, atan2
- sinh, cosh, tanh, asinh, acosh, atanh
- ceil, floor, trunc, round
- Etc.
When executed on the host, a given function uses the C runtime implementation if available.
These functions are only supported for scalar types, not vector types.
CUDA Tools and Threads – Slide 79
Common Runtime Component: Mathematical Functions
Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. __sin(x)):
- __pow
- __log, __log2, __log10
- __exp
- __sin, __cos, __tan
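In the CUDA headers the single-precision intrinsics carry an f suffix (__sinf, __expf, and so on); a small illustrative kernel comparing the two families (the kernel name is made up):

__global__ void fast_vs_accurate(float *out, const float *in)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  float x = in[i];
  float accurate = sinf(x) * expf(x);      // standard single-precision math functions
  float fast     = __sinf(x) * __expf(x);  // faster, less accurate device-only intrinsics
  out[i] = accurate - fast;                // the difference shows the accuracy trade-off
}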
CUDA Tools and Threads – Slide 80
Common Runtime Component: Mathematical Functions
Provides functions to deal with:
- Device management (including multi-device systems)
- Memory management
- Error handling
Initializes the first time a runtime function is called.
A host thread can invoke device code on only one device; multiple host threads are required to run on multiple devices.
CUDA Tools and Threads – Slide 81
Host Runtime Component
void __syncthreads();
- Synchronizes all threads in a block
- Once all threads have reached this point, execution resumes normally
- Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory
- Allowed in conditional constructs only if the conditional is uniform across the entire thread block
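A minimal sketch of the barrier in use, assuming a block of exactly 256 threads (the kernel name and size are illustrative):

__global__ void reverse_block(float *x)
{
  __shared__ float tmp[256];                 // assumes blockDim.x == 256
  int i = threadIdx.x + blockDim.x * blockIdx.x;

  tmp[threadIdx.x] = x[i];                   // every thread writes one shared-memory element
  __syncthreads();                           // all writes must finish before any thread reads
  x[i] = tmp[blockDim.x - 1 - threadIdx.x];  // safely read an element written by another thread
}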
CUDA Tools and Threads – Slide 82
Device Runtime Component:Synchronization Function
- Memory allocation: cudaMalloc((void **)&xd, nbytes);
- Data copying: cudaMemcpy(xh, xd, nbytes, cudaMemcpyDeviceToHost);
  - Reminder: using d (h) to distinguish an array on the device (host) is not mandatory, just helpful labeling
- A kernel routine is declared with the __global__ prefix and is written from the point of view of a single thread
CUDA Programming Basics – Slide 83
Final Thoughts
Reading: Chapters 3 and 4, “Programming Massively Parallel Processors” by Kirk and Hwu.
Based on original material from:
- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- Oxford University: Mike Giles
- Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 6/22/2011.
CUDA Programming Basics – Slide 84
End Credits