CUDA Lecture 4: CUDA Programming Basics
Transcript of CUDA Lecture 4, CUDA Programming Basics.
Prepared 6/22/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
Things we need to consider:
- Control
- Synchronization
- Communication
Parallel programming languages offer different ways of dealing with the above.
CUDA Programming Basics – Slide 2
Parallel Programming Basics
CUDA programming model – basic concepts and data types
CUDA application programming interface - basic
Simple examples to illustrate basic concepts and functionalities
Performance features will be covered later
CUDA Programming Basics – Slide 3
Overview
- Basic kernels and execution on GPU
- Basic memory management
- Coordinating CPU and GPU execution
See the programming guide for the full API
CUDA Programming Basics – Slide 4
Outline of CUDA Basics
- Integrated host + device application program in C
  - Serial or modestly parallel parts in host C code
  - Highly parallel parts in device SPMD kernel C code
Programming model
- Parallel code (kernel) is launched and executed on a device by many threads
- Launches are hierarchical
  - Threads are grouped into blocks
  - Blocks are grouped into grids
- Familiar serial code is written for a thread
  - Each thread is free to execute a unique code path
  - Built-in thread and block ID variables
CUDA Programming Basics – Slide 5
CUDA – C with no shader limitations!
CUDA Programming Basics – Slide 6
CUDA – C with no shader limitations!
[Figure: alternating host and device execution]
Serial Code (host)
  . . .
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
  . . .
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
A compute device:
- Is a coprocessor to the CPU or host
- Has its own DRAM (device memory)
- Runs many threads in parallel
- Is typically a GPU but can also be another type of parallel processing device
Data-parallel portions of an application are expressed as device kernels which run on many threads.
CUDA Programming Basics – Slide 7
CUDA Devices and Threads
Differences between GPU and CPU threads:
- GPU threads are extremely lightweight; very little creation overhead
- GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few
CUDA Programming Basics – Slide 8
CUDA Devices and Threads
The future of GPUs is programmable processing
So – build the architecture around the processor
CUDA Programming Basics – Slide 9
G80 – Graphics Mode
[Figure: G80 graphics-mode pipeline. Host and Input Assembler feed the Vtx, Geom, and Pixel Thread Issue units; an array of thread processors (SP pairs with L1 caches and texture filter (TF) units) sits above the L2 caches and frame-buffer (FB) partitions. Setup / Rstr / ZCull handles rasterization.]
Processors execute computing threads. New operating mode/hardware interface for computing.
CUDA Programming Basics – Slide 10
G80 CUDA Mode – A Device Example
[Figure: G80 in CUDA mode. Host and Input Assembler feed a Thread Execution Manager; the streaming-processor array is organized into groups, each with a Parallel Data Cache and Texture unit; Load/Store paths connect to Global Memory.]
CUDA Programming Basics – Slide 11
High Level View
[Figure: several SMs, each with embedded memory (EM), share Global Memory on the GPU; the GPU is connected to the CPU and chipset over PCIe.]
CUDA Programming Basics – Slide 12
Blocks of Threads Run on an SM
[Figure: a thread uses registers and per-thread local memory and runs on a streaming processor (SP); a thread block uses per-block shared memory and runs on a streaming multiprocessor (SM).]
CUDA Programming Basics – Slide 13
Whole Grid Runs on GPU
Many blocks of threads
[Figure: the blocks of a grid are distributed across the GPU's SMs; all blocks can access Global Memory.]
CUDA Programming Basics – Slide 14
Extended C
- Type qualifiers: global, device, shared, local, constant
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads
- Runtime API: memory, symbol, execution management
- Function launch
__device__ float filter[N];

__global__ void convolve (float *image) {
  __shared__ float region[M];
  ...
  region[threadIdx.x] = image[i];
  __syncthreads();
  ...
  image[j] = result;
}

// Allocate GPU memory
float *myimage;
cudaMalloc((void**)&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
Mark Murphy, “NVIDIA’s Experience with Open64,” www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc
CUDA Programming Basics – Slide 15
Extended C
[Figure: compilation flow. Integrated source (foo.cu) goes through cudacc (EDG C/C++ frontend); the CPU host code (foo.cpp) is compiled by gcc / cl, while the device portion passes through the Open64 Global Optimizer to GPU assembly (foo.s), then OCG, producing G80 SASS (foo.sass).]
- A CUDA kernel is executed by an array of threads
- All threads run the same code (SPMD)
- Each thread has an ID that it uses to compute memory addresses and make control decisions
CUDA Programming Basics – Slide 16
Arrays of Parallel Threads
[Figure: an array of threads with threadID 0 through 7, each executing the same code:
  float x = input[threadID]; float y = func(x); output[threadID] = y;]
- Divide monolithic thread array into multiple blocks
- Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
- Threads in different blocks cannot cooperate
CUDA Programming Basics – Slide 17
Thread Blocks: Scalable Cooperation
[Figure: Thread Block 0, Thread Block 1, ..., Thread Block N-1, each with threadID 0 through 7 and each executing the same code:
  float x = input[threadID]; float y = func(x); output[threadID] = y;]
- Threads launched for a parallel section are partitioned into thread blocks
  - Grid = all blocks for a given launch
- A thread block is a group of threads that can
  - Synchronize their executions
  - Communicate via shared memory
CUDA Programming Basics – Slide 18
Thread Hierarchy
- Any possible interleaving of blocks should be valid
  - Presumed to run to completion without preemption
  - Can run in any order
  - Can run concurrently OR sequentially
- Blocks may coordinate but not synchronize
  - Shared queue pointer: OK
  - Shared lock: BAD … can easily deadlock
- Independence requirement gives scalability
CUDA Programming Basics – Slide 19
Blocks Must Be Independent
A CUDA program has two pieces:
- Host code on the CPU which interfaces to the GPU
- Kernel code which runs on the GPU
At the host level, there is a choice of 2 APIs (Application Programming Interfaces):
- Runtime: simpler, more convenient
- Driver: much more verbose, more flexible, closer to OpenCL
We will only use the Runtime API in this course.
CUDA Programming Basics – Slide 20
Basics of CUDA Programming
At the host code level, there are library routines for:
- memory allocation on graphics card
- data transfer to/from device memory
  - constants
  - texture arrays (useful for lookup tables)
  - ordinary data
- error-checking
- timing
There is also a special syntax for launching multiple copies of the kernel process on the GPU.
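As an illustration, a minimal host-side sketch covering these routines (the names h_x, d_x and my_kernel are hypothetical placeholders, not part of the lecture's examples):

int n = 1024;
size_t nbytes = n * sizeof(float);
float *h_x = (float*)malloc(nbytes);
float *d_x;
cudaMalloc((void**)&d_x, nbytes);                      // memory allocation on the graphics card
cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);  // data transfer to device memory

cudaEvent_t start, stop;                               // timing with CUDA events
cudaEventCreate(&start);  cudaEventCreate(&stop);
cudaEventRecord(start);
my_kernel<<<n/256, 256>>>(d_x);                        // special kernel-launch syntax
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms;  cudaEventElapsedTime(&ms, start, stop);

cudaError_t err = cudaGetLastError();                  // error-checking
if (err != cudaSuccess) printf("%s\n", cudaGetErrorString(err));

cudaMemcpy(h_x, d_x, nbytes, cudaMemcpyDeviceToHost);  // data transfer from device memory
cudaFree(d_x);  free(h_x);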
CUDA Programming Basics – Slide 21
Basics of CUDA Programming
Each thread uses IDs to decide what data to work on:
- Block ID: 1-D or 2-D, unique within a grid
- Thread ID: 1-D, 2-D or 3-D, unique within a block
- Dimensions set at launch; can be unique for each grid
CUDA Programming Basics – Slide 22
Block IDs and Thread IDs
[Figure 3.2: An Example of CUDA Thread Organization. The host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device; Grid 1 contains Blocks (0, 0), (1, 0), (0, 1) and (1, 1); Block (1, 1) is expanded to show Threads (0,0,0) through (3,1,0) in front and (0,0,1) through (3,0,1) behind. Courtesy: NVIDIA.]
Built-in variables: threadIdx, blockIdx, blockDim, gridDim
Simplifies memory addressing when processing multidimensional data:
- Image processing
- Solving PDEs on volumes
- …
CUDA Programming Basics – Slide 23
Block IDs and Thread IDs
[Figure 3.2 repeated from the previous slide.]
In its simplest form a kernel launch looks like:
  kernel_routine<<<gridDim, blockDim>>>(args);
where
- gridDim is the number of copies of the kernel (the “grid” size)
- blockDim is the number of threads within each copy (the “block” size)
- args is a limited number of arguments, usually mainly pointers to arrays in graphics memory, and some constants which get copied by value
The more general form allows gridDim and blockDim to be 2-D or 3-D to simplify application programs.
CUDA Programming Basics – Slide 24
Basics of CUDA Programming
At the lower level, when one copy of the kernel is started on an SM it is executed by a number of threads, each of which knows about:
- some variables passed as arguments
- pointers to arrays in device memory (also arguments)
- global constants in device memory
- shared memory and private registers/local variables
- some special variables:
  - gridDim: size (or dimensions) of grid of blocks
  - blockIdx: index (or 2-D/3-D indices) of block
  - blockDim: size (or dimensions) of each block
  - threadIdx: index (or 2-D/3-D indices) of thread
CUDA Programming Basics – Slide 25
Basics of CUDA Programming
Suppose we have 1000 blocks, and each one has 128 threads – how does it get executed?
On current Tesla hardware, would probably get 8 blocks running at the same time on each SM, and each block has 4 warps => 32 warps running on each SM
Each clock tick, the SM warp scheduler decides which warp to execute next, choosing from those not waiting for:
- data coming from device memory (memory latency)
- completion of earlier instructions (pipeline delay)
The programmer doesn’t have to worry about this level of detail, just make sure there are lots of threads / warps.
CUDA Programming Basics – Slide 26
Basics of CUDA Programming
In the simplest case, we have a 1-D grid of blocks, and a 1-D set of threads within each block.
If we want to use a 2-D set of threads, then blockDim.x, blockDim.y give the dimensions, and threadIdx.x, threadIdx.y give the thread indices
To launch the kernel we would use something like
  dim3 nthreads(16, 4);
  my_new_kernel<<<nblocks, nthreads>>>(d_x);
where dim3 is a special CUDA datatype with 3 components .x, .y, .z each initialized to 1.
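For instance, since unspecified components default to 1, the launch above creates blocks of 16 × 4 × 1 threads (the grid size of 8 below is an arbitrary illustration):

dim3 nthreads(16, 4);   // nthreads.x = 16, nthreads.y = 4, nthreads.z defaults to 1
dim3 nblocks(8);        // a 1-D grid: nblocks.x = 8, nblocks.y = nblocks.z = 1
my_new_kernel<<<nblocks, nthreads>>>(d_x);   // 8 blocks of 64 threads each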
CUDA Programming Basics – Slide 27
Basics of CUDA Programming
CUDA Programming Basics – Slide 28
For Example
[Figure 3.2 again: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; one Block (1, 1) is expanded to show its 4 × 2 × 2 threads. Courtesy: NVIDIA.]
Launch with
  dim3 dimGrid(2, 2);
  dim3 dimBlock(4, 2, 2);
  kernelFunc<<<dimGrid, dimBlock>>>(…);
Zoomed in on the block with blockIdx.x = blockIdx.y = 1, blockDim.x = 4, blockDim.y = blockDim.z = 2.
Each thread in block has coordinates (threadIdx.x, threadIdx.y, threadIdx.z)
A similar approach is used for 3-D threads and/or 2-D grids. This can be very useful in 2-D / 3-D finite difference applications.
How do 2-D / 3-D threads get divided into warps? A 1-D thread ID is defined by
  threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.x * blockDim.y
and this is then broken up into warps of size 32.
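As a sketch of that calculation inside a kernel (the names tid, warp and lane are illustrative):

int tid  = threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
int warp = tid / 32;   // e.g. in a 16 x 4 block, threads with tid 0-31 form warp 0, tid 32-63 form warp 1
int lane = tid % 32;   // position within the warp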
CUDA Programming Basics – Slide 29
Basics of CUDA Programming
[Figure: CUDA memory model. A grid contains Blocks (0, 0) and (1, 0); each block has its own Shared Memory; each thread, e.g. Thread (0, 0) and Thread (1, 0), has its own Registers; all blocks and the Host access Global Memory.]
Global memory:
- Main means of communicating R/W data between host and device
- Contents visible to all threads
- Long latency access
We will focus on global memory for now; constant and texture memory will come later.
CUDA Programming Basics – Slide 30
CUDA Memory Model Overview
CUDA Programming Basics – Slide 31
Memory Model
[Figure: Kernel 0 and Kernel 1 execute as sequential kernels, both accessing the per-device Global Memory.]
- The API is an extension to the ANSI C programming language
  - Low learning curve
- The hardware is designed to enable a lightweight runtime and driver
  - High performance
CUDA Programming Basics – Slide 32
CUDA API Highlights: Easy and Lightweight
- CPU and GPU have separate memory spaces
  - Data is moved across the PCIe bus
  - Use functions to allocate/set/copy memory on the GPU; very similar to corresponding C functions
- Pointers are just addresses
  - Can’t tell from the pointer value whether the address is on the CPU or GPU
  - Must exercise care when dereferencing: dereferencing a CPU pointer on the GPU will likely crash, and vice-versa
CUDA Programming Basics – Slide 33
Memory Spaces
[Figure: CUDA memory model repeated from Slide 30.]
cudaMalloc():
- Allocates an object in the device global memory
- Requires two parameters
  - Address of a pointer to the allocated object
  - Size of the allocated object
cudaFree():
- Frees objects from device global memory
- Takes a pointer to the freed object
CUDA Programming Basics – Slide 34
CUDA Device Memory Allocation
Code example:
- Allocate a 64-by-64 single precision float array
- Attach the allocated storage to Md
- “d” is often used to indicate a device data structure
CUDA Programming Basics – Slide 35
CUDA Device Memory Allocation
int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

cudaMalloc((void**)&Md, size);
cudaMemset(Md, 0, size);
cudaFree(Md);
[Figure: CUDA memory model repeated from Slide 30.]
cudaMemcpy():
- Memory data transfer
- Requires four parameters
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type of transfer: host to host, host to device, device to host, device to device
- Asynchronous transfer
CUDA Programming Basics – Slide 36
CUDA Host-Device Data Transfer
cudaMemcpy():
- Returns after the copy is complete
- Blocks the CPU thread until all bytes have been copied
- Doesn’t start copying until previous CUDA calls complete
- Non-blocking copies are also available
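A minimal sketch of such a non-blocking copy, assuming page-locked (pinned) host memory and a CUDA stream (both beyond what this lecture covers):

size_t nbytes = 1024 * sizeof(float);
float *h_x, *d_x;
cudaMallocHost((void**)&h_x, nbytes);    // page-locked host allocation (required for true overlap)
cudaMalloc((void**)&d_x, nbytes);

cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(d_x, h_x, nbytes, cudaMemcpyHostToDevice, stream);  // returns immediately
// ... the CPU can do other work here ...
cudaStreamSynchronize(stream);           // wait for the copy to finish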
CUDA Programming Basics – Slide 37
Memory Model
[Figure: Host memory and Device 0 memory / Device 1 memory, with cudaMemcpy() transferring data between them.]
Code example:
- Transfer a 64-by-64 single precision float array
- M is in host memory and Md is in device memory
- cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost and cudaMemcpyDeviceToDevice are symbolic constants
CUDA Programming Basics – Slide 38
CUDA Host-Device Data Transfer
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
CUDA Programming Basics – Slide 39
First Simple CUDA Example

#include <stdio.h>
#include <stdlib.h>

int main()
{
  int dimx = 16;
  int num_bytes = dimx*sizeof(int);
  int *d_a=0, *h_a=0;            // device and host pointers

  h_a = (int*)malloc(num_bytes);
  cudaMalloc((void**)&d_a, num_bytes);

  if (0==h_a || 0==d_a) {
    printf("couldn't allocate memory\n");
    return 1;
  }

  cudaMemset(d_a, 0, num_bytes);
  cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);

  for (int i=0; i<dimx; i++)
    printf("%d ", h_a[i]);
  printf("\n");

  free(h_a);
  cudaFree(d_a);
  return 0;
}
C/C++ with some restrictions:
- Can only access GPU memory
- No variable number of arguments
- No static variables
- No recursion
- No dynamic polymorphism
Must be declared with a qualifier:
- __global__ : launched by CPU, cannot be called from GPU
- __device__ : called from other GPU functions, cannot be called by the CPU
- __host__ : can be called by the CPU
CUDA Programming Basics – Slide 40
Code Executed on GPU
- __global__ defines a kernel function; must return void
- __device__ and __host__ can be used together
  - Sample use: overloading operators
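A small sketch of that overloading use case (the vec2 type here is hypothetical, not from the lecture):

struct vec2 { float x, y; };

// Compiled for both host and device, so the same operator works in CPU and GPU code
__host__ __device__ vec2 operator+(vec2 a, vec2 b)
{
  vec2 c;
  c.x = a.x + b.x;
  c.y = a.y + b.y;
  return c;
}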
CUDA Programming Basics – Slide 41
CUDA Function Declarations

                                   Executed on the   Only callable from the
__device__ float DeviceFunc()      Device            Device
__global__ void  KernelFunc()      Device            Host
__host__   float HostFunc()        Host              Host
__device__ int reduction_lock = 0;
The __device__ prefix tells nvcc this is a global variable in the GPU, not the CPU.
The variable can be read and modified by any kernel
- Its lifetime is the lifetime of the whole application
- Can also declare arrays of fixed size
- Can be read/written by host code using the special routines cudaMemcpyToSymbol and cudaMemcpyFromSymbol, or with standard cudaMemcpy in combination with cudaGetSymbolAddress
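For example, a host-side sketch of resetting and reading reduction_lock with these routines:

int zero = 0, lock_val;
cudaMemcpyToSymbol(reduction_lock, &zero, sizeof(int));        // write the device variable from the host
cudaMemcpyFromSymbol(&lock_val, reduction_lock, sizeof(int));  // read it back

// or, equivalently, with standard cudaMemcpy via the symbol's address
int *d_lock;
cudaGetSymbolAddress((void**)&d_lock, reduction_lock);
cudaMemcpy(d_lock, &zero, sizeof(int), cudaMemcpyHostToDevice);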
CUDA Programming Basics – Slide 42
CUDA Function Declarations
__device__ functions cannot have their address taken
For functions executed on the device:
- No recursion
- No static variable declarations inside the function
- No variable number of arguments
CUDA Programming Basics – Slide 43
CUDA Function Declarations
As seen, a kernel function must be called with an execution configuration:
Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.
CUDA Programming Basics – Slide 44
Calling a Kernel Function – Thread Creation
__global__ void KernelFunc(…);
dim3 DimGrid(100, 50);           // 5000 thread blocks
dim3 DimBlock(4, 8, 8);          // 256 threads/block
size_t SharedMemBytes = 64;      // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(…);
The kernel code looks fairly normal once you get used to two things:
- The code is written from the point of view of a single thread
  - Quite different to OpenMP multithreading
  - Similar to MPI, where you use the MPI “rank” to identify the MPI process
  - All local variables are private to that thread
- You need to think about where each variable lives
  - Any operation involving data in the device memory forces its transfer to/from registers in the GPU
  - No cache on old hardware, so a second operation with the same data will force a second transfer
  - Often better to copy the value into a local register variable (see the sketch below)
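A minimal sketch of that last point (the kernel and array names are made up for illustration):

__global__ void scale_twice(float *x, const float *u)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  float ui = u[i];               // one transfer from device memory into a register
  x[i] = ui * ui + 2.0f * ui;    // reuse the register instead of reading u[i] again
}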
CUDA Programming Basics – Slide 45
Basics of CUDA Programming
CUDA Programming Basics – Slide 46
Next CUDA Example: Vector Addition
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  C[i] = A[i] + B[i];
}

int main()
{
  int N   = 16;   // total number of elements in the vector/array
  int TPB = 4;    // number of threads per block

  float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

  // allocate and initialize host (CPU) memory
  h_A = (float*)malloc(N * sizeof(float));
  h_B = (float*)malloc(N * sizeof(float));
  h_C = (float*)malloc(N * sizeof(float));
  // ... assign values to h_A and h_B ...

  // allocate device (GPU) memory
  cudaMalloc((void**) &d_A, N * sizeof(float));
  cudaMalloc((void**) &d_B, N * sizeof(float));
  cudaMalloc((void**) &d_C, N * sizeof(float));

  // copy host memory to device
  cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

  // Run grid of N/TPB blocks of TPB threads each
  vecAdd<<< N/TPB, TPB >>>(d_A, d_B, d_C);

  // copy result back to host memory
  cudaMemcpy(h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);

  // do something with the result ...

  // free device (GPU) and host (CPU) memory
  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
  free(h_A); free(h_B); free(h_C);
  return 0;
}
- The __global__ identifier says it’s a kernel function
- Each thread sets one element of the C[] array
- Within each block of threads, threadIdx.x ranges from 0 to blockDim.x-1, so each thread has a unique value for i
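One common variant, not shown in the lecture: when N is not a multiple of the block size, pass the length to the kernel and guard the store (vecAddGuarded is just a sketch):

__global__ void vecAddGuarded(float* A, float* B, float* C, int n)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i < n)                 // threads past the end of the array do nothing
    C[i] = A[i] + B[i];
}

// launched with enough blocks to cover all n elements:
// vecAddGuarded<<<(n + TPB - 1) / TPB, TPB>>>(d_A, d_B, d_C, n);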
CUDA Programming Basics – Slide 47
Next CUDA Example: Vector Addition
CUDA Programming Basics – Slide 48
Kernel Variations and Output

__global__ void kernel( int *a )
{
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
CUDA Programming Basics – Slide 49
Next CUDA Example: Kernel with 2-D Addressing
#include <stdio.h>
#include <stdlib.h>

__global__ void kernel( int *a, int dimx, int dimy )
{
  int ix  = blockIdx.x*blockDim.x + threadIdx.x;
  int iy  = blockIdx.y*blockDim.y + threadIdx.y;
  int idx = iy*dimx + ix;
  a[idx] = a[idx]+1;
}

int main()
{
  int dimx = 16;
  int dimy = 16;
  int num_bytes = dimx*dimy*sizeof(int);
  int *d_a=0, *h_a=0;            // device and host pointers

  h_a = (int*)malloc(num_bytes);
  cudaMalloc((void**)&d_a, num_bytes);

  if (0==h_a || 0==d_a) {
    printf("couldn't allocate memory\n");
    return 1;
  }

  cudaMemset(d_a, 0, num_bytes);

  dim3 grid, block;
  block.x = 4;
  block.y = 4;
  grid.x  = dimx / block.x;
  grid.y  = dimy / block.y;

  kernel<<<grid, block>>>(d_a, dimx, dimy);

  cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);

  for (int row=0; row<dimy; row++) {
    for (int col=0; col<dimx; col++)
      printf("%d ", h_a[row*dimx+col]);
    printf("\n");
  }

  free(h_a);
  cudaFree(d_a);
  return 0;
}
A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs:
- Leave shared memory usage until later
- Local, register usage
- Thread ID usage
- Memory data transfer API between host and device
- Assume square matrices for simplicity
CUDA Programming Basics – Slide 50
A Simple Running Example: Matrix Multiplication
P = M × N of size WIDTH-by-WIDTH
Without tiling:
- One thread calculates one element of P
- M and N are loaded WIDTH times from global memory
CUDA Programming Basics – Slide 51
Programming Model: Square Matrix Multiplication Example
[Figure: matrices M, N and P, each WIDTH by WIDTH; P = M × N.]
CUDA Programming Basics – Slide 52
Memory Layout of a Matrix in C
[Figure: a 4 × 4 matrix M stored in row-major order; linearized as M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3.]
CUDA Programming Basics – Slide 53
Memory Layout of a Matrix in the Textbook
[Figure: the same matrix using the textbook's element labels; linearized as M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3.]
CUDA Programming Basics – Slide 54
Step 1: Matrix Multiplication, A Simple Host Version in C
[Figure: matrices M, N and P with loop indices i, j and k; P(i, j) accumulates over k.]
CUDA Programming Basics – Slide 55
Step 1: Matrix Multiplication, A Simple Host Version in C

// Matrix multiplication on the (CPU) host in double precision
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}
CUDA Programming Basics – Slide 56
Step 2: Input Matrix Data Transfer (Host-Side Code)

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
  int size = Width * Width * sizeof(float);
  float *Md, *Nd, *Pd;

  // allocate and load M, N to device memory
  cudaMalloc((void**)&Md, size);
  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);

  cudaMalloc((void**)&Nd, size);
  cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

  // allocate P on the device
  cudaMalloc((void**)&Pd, size);
CUDA Programming Basics – Slide 57
Step 3: Output Matrix Data Transfer (Host-Side Code)

  // kernel invocation code – to be shown later (Step 5)
  …

  // read P from the device
  cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

  // free device matrices
  cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
CUDA Programming Basics – Slide 58
Step 4: Kernel Function (Overview)
[Figure: matrices Md, Nd and Pd, each WIDTH by WIDTH; each thread uses threadIdx.x and threadIdx.y to pick a column of Nd and a row of Md, looping over k.]
CUDA Programming Basics – Slide 59
Step 4: Kernel Function

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  // Pvalue is used to store the element of the matrix
  // that is computed by the thread
  float Pvalue = 0;

  for (int k = 0; k < Width; ++k) {
    float Melement = Md[threadIdx.y * Width + k];
    float Nelement = Nd[k * Width + threadIdx.x];
    Pvalue += Melement * Nelement;
  }

  Pd[threadIdx.y * Width + threadIdx.x] = Pvalue;
}
CUDA Programming Basics – Slide 60
Step 5: Kernel Invocation (Host-Side Code)

// Set up the execution configuration
dim3 dimGrid(1, 1);
dim3 dimBlock(Width, Width);

// Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
- One block of threads computes matrix Pd
  - Each thread computes one element of Pd
- Each thread
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Performs one multiply and addition for each pair of Md and Nd elements
  - Compute to off-chip memory access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block
CUDA Programming Basics – Slide 61
Only One Thread Block Used
CUDA Programming Basics – Slide 62
Only One Thread Block Used
[Figure: Grid 1 with a single Block 1 of WIDTH × WIDTH threads computing Pd from Md and Nd; e.g. Thread (2, 2) combines the row (3 2 5 4) of Md with the column (2 4 2 6) of Nd to produce the element 48.]
- Have each 2-D thread block compute a (TILE_WIDTH)² sub-matrix (tile) of the result matrix
  - Each has (TILE_WIDTH)² threads
- Generate a 2-D grid of (WIDTH / TILE_WIDTH)² blocks
- You still need to put a loop around the kernel call for cases where WIDTH / TILE_WIDTH is greater than the max grid size (64K); a sketch follows below
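A hypothetical sketch of that host-side loop, assuming the kernel is extended to take block offsets (bx0, by0 and blocksPerSide are made-up names, not part of the lecture's code):

int blocksPerSide = Width / TILE_WIDTH;
int maxGrid = 65535;                       // per-dimension grid limit on this hardware generation
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);

for (int by0 = 0; by0 < blocksPerSide; by0 += maxGrid)
  for (int bx0 = 0; bx0 < blocksPerSide; bx0 += maxGrid) {
    int gx = blocksPerSide - bx0; if (gx > maxGrid) gx = maxGrid;
    int gy = blocksPerSide - by0; if (gy > maxGrid) gy = maxGrid;
    dim3 dimGrid(gx, gy);
    // inside the kernel, bx0/by0 would be added to blockIdx.x/blockIdx.y when computing Row and Col
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width /*, bx0, by0 */);
  }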
CUDA Programming Basics – Slide 63
Handling Square Matrices with Arbitrary Size
CUDA Programming Basics – Slide 64
Matrix Multiplication Using Multiple Blocks
[Figure: Pd is broken into TILE_WIDTH × TILE_WIDTH tiles; block indices (bx, by) select a tile of Pd and thread indices (tx, ty) select an element within it.]
- Break Pd up into tiles
- Each block calculates one tile
  - Each thread calculates one element
- Block size equals tile size
CUDA Programming Basics – Slide 65
A Small Example: Multiplication
[Figure: a 4 × 4 Pd divided into four 2 × 2 tiles with TILE_WIDTH = 2; Block (0,0), Block (1,0), Block (0,1) and Block (1,1) each compute one tile, e.g. Block (0,0) computes Pd0,0, Pd1,0, Pd0,1 and Pd1,1 from the corresponding rows of Md and columns of Nd.]
CUDA Programming Basics – Slide 66
Revised Matrix Multiplication Kernel Using Multiple Blocks

// Matrix multiplication kernel – per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  // Calculate the row index of the Pd element and M
  int Row = blockIdx.y*TILE_WIDTH + threadIdx.y;
  // Calculate the column index of Pd and N
  int Col = blockIdx.x*TILE_WIDTH + threadIdx.x;

  float Pvalue = 0;
  // each thread computes one element of the block sub-matrix
  for (int k = 0; k < Width; ++k)
    Pvalue += Md[Row*Width+k] * Nd[k*Width+Col];

  Pd[Row*Width+Col] = Pvalue;
}
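The slides do not show the matching invocation here; a sketch consistent with the tiling described above (assuming Width is a multiple of TILE_WIDTH) would be:

dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);   // one block per tile of Pd
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                  // one thread per element of the tile
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);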
- All threads in a block execute the same kernel program (SPMD)
- Programmer declares block:
  - Block size: 1 to 512 concurrent threads
  - Block shape: 1-D, 2-D, or 3-D
  - Block dimensions in threads
- Threads have thread id numbers within the block
  - The thread program uses the thread id to select work and address shared data
CUDA Tools and Threads – Slide 67
CUDA Thread Block
[Figure: a thread block with Thread Id # 0 1 2 3 … m, each thread running the thread program. Courtesy: John Nickolls, NVIDIA.]
- Threads in the same block share data and synchronize while doing their share of the work
- Threads in different blocks cannot cooperate
  - Each block can execute in any order relative to other blocks!
CUDA Tools and Threads – Slide 68
CUDA Thread Block
[Figure repeated from the previous slide.]
- Hardware is free to assign blocks to any processor at any time
  - A kernel scales across any number of parallel processors
- Each block can execute in any order relative to other blocks
CUDA Tools and Threads – Slide 69
Transparent Scalability
[Figure: a kernel grid of Blocks 0-7. On a device with two SMs the blocks run two at a time (Block 0, Block 1; then Block 2, Block 3; and so on); on a device with four SMs they run four at a time (Blocks 0-3, then Blocks 4-7). Time runs downward.]
Processors execute computing threads. New operating mode/hardware interface for computing.
CUDA Tools and Threads – Slide 70
G80 CUDA Mode – A Review
[Figure: G80 in CUDA mode, repeated from Slide 11.]
- Threads are assigned to streaming multiprocessors (SMs) in block granularity
  - Up to 8 blocks to each SM as resources allow
  - Each SM in G80 can take up to 768 threads
    - Could be 256 (threads/block) × 3 blocks
    - Or 128 (threads/block) × 6 blocks, etc.
- Threads run concurrently
  - Each SM maintains thread/block id numbers
  - Each SM manages/schedules thread execution
CUDA Tools and Threads – Slide 71
G80 Example: Executing Thread Blocks
CUDA Tools and Threads – Slide 72
G80 Example: Executing Thread Blocks
[Figure: blocks of threads t0 t1 t2 … tm are assigned to SM 0 and SM 1; each SM has an MT IU (multithreaded instruction unit), SPs and Shared Memory.]
Flexible resource allocation
- Each block is executed as 32-thread warps
  - An implementation decision, not part of the CUDA programming model
  - Warps are scheduling units in an SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in an SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 × 3 = 24 warps
CUDA Tools and Threads – Slide 73
G80 Example: Thread Scheduling
CUDA Tools and Threads – Slide 74
G80 Example: Thread Scheduling
[Figure: Block 1 and Block 2 warps (threads t0 t1 t2 … t31) feed the Streaming Multiprocessor's Instruction L1 cache and Instruction Fetch/Dispatch unit, which drives the SPs, SFUs and Shared Memory.]
Each SM implements zero-overhead warp scheduling:
- At any time, only one of the warps is executed by an SM
- Warps whose next instruction has its operands ready for consumption are eligible for execution
- Eligible warps are selected for execution on a prioritized scheduling policy
- All threads in a warp execute the same instruction when selected
CUDA Tools and Threads – Slide 75
G80 Example: Thread Scheduling
[Figure: warp scheduling timeline (TB = Thread Block, W = Warp). As TB1 W1, TB2 W1 and TB3 W2 stall in turn, the SM switches among the warps of TB1, TB2 and TB3 (e.g. TB1 W1 instructions 1-6, then TB2 W1, TB3 W1, TB2 W1, TB1 W1 instructions 7-8, TB1 W2, TB1 W3, TB3 W2), so execution never idles. Time runs left to right.]
For matrix multiplication using multiple blocks, should I use 8 × 8, 16 × 16 or 32 × 32 blocks?
For 8 × 8, we have 64 threads per block. Since each SM can take up to 768 threads, that would be 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
For 16 × 16, we have 256 threads per Block. Since each SM can take up to 768 threads, it can take up to 3 Blocks and achieve full capacity unless other resource considerations overrule.
For 32 × 32, we have 1024 threads per Block. Not even one can fit into an SM!
CUDA Tools and Threads – Slide 76
G80 Block Granularity Considerations
The API is an extension to the C programming language
It consists of:
- Language extensions
  - To target portions of the code for execution on the device
- A runtime library split into:
  - A common component providing built-in vector types and a subset of the C runtime library in both host and device codes
  - A host component to control and access one or more devices from the host
  - A device component providing device-specific functions
CUDA Tools and Threads – Slide 77
Application Programming Interface
- dim3 gridDim; : dimensions of the grid in blocks (gridDim.z unused)
- dim3 blockDim; : dimensions of the block in threads
- dim3 blockIdx; : block index within the grid
- dim3 threadIdx; : thread index within the block
CUDA Tools and Threads – Slide 78
Language Extensions: Built-in Variables
- pow, sqrt, cbrt, hypot
- exp, exp2, expm1
- log, log2, log10, log1p
- sin, cos, tan, asin, acos, atan, atan2
- sinh, cosh, tanh, asinh, acosh, atanh
- ceil, floor, trunc, round
- Etc.
When executed on the host, a given function uses the C runtime implementation if available.
These functions are only supported for scalar types, not vector types.
CUDA Tools and Threads – Slide 79
Common Runtime Component: Mathematical Functions
Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. __sin(x)):
- __pow
- __log, __log2, __log10
- __exp
- __sin, __cos, __tan
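In the CUDA headers the single-precision intrinsics carry an f suffix (__sinf, __expf, and so on); a small illustrative kernel comparing the two families (the kernel name is made up):

__global__ void fast_vs_accurate(float *out, const float *in)
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  float x = in[i];
  float accurate = sinf(x) * expf(x);      // standard single-precision math functions
  float fast     = __sinf(x) * __expf(x);  // faster, less accurate device-only intrinsics
  out[i] = accurate - fast;                // the difference shows the accuracy trade-off
}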
CUDA Tools and Threads – Slide 80
Common Runtime Component: Mathematical Functions
Provides functions to deal with:
- Device management (including multi-device systems)
- Memory management
- Error handling
Initializes the first time a runtime function is called.
A host thread can invoke device code on only one device; multiple host threads are required to run on multiple devices.
CUDA Tools and Threads – Slide 81
Host Runtime Component
void __syncthreads();
- Synchronizes all threads in a block
- Once all threads have reached this point, execution resumes normally
- Used to avoid RAW / WAR / WAW hazards when accessing shared or global memory
- Allowed in conditional constructs only if the conditional is uniform across the entire thread block
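A minimal sketch of the barrier in use, assuming a block of exactly 256 threads (the kernel name and size are illustrative):

__global__ void reverse_block(float *x)
{
  __shared__ float tmp[256];                 // assumes blockDim.x == 256
  int i = threadIdx.x + blockDim.x * blockIdx.x;

  tmp[threadIdx.x] = x[i];                   // every thread writes one shared-memory element
  __syncthreads();                           // all writes must finish before any thread reads
  x[i] = tmp[blockDim.x - 1 - threadIdx.x];  // safely read an element written by another thread
}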
CUDA Tools and Threads – Slide 82
Device Runtime Component:Synchronization Function
- Memory allocation: cudaMalloc((void **)&xd, nbytes);
- Data copying: cudaMemcpy(xh, xd, nbytes, cudaMemcpyDeviceToHost);
  - Reminder: using d (h) to distinguish an array on the device (host) is not mandatory, just helpful labeling
- A kernel routine is declared with the __global__ prefix and is written from the point of view of a single thread
CUDA Programming Basics – Slide 83
Final Thoughts
Reading: Chapters 3 and 4, “Programming Massively Parallel Processors” by Kirk and Hwu.
Based on original material from:
- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- Oxford University: Mike Giles
- Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 6/22/2011.
CUDA Programming Basics – Slide 84
End Credits