Tech Talk NVIDIA CUDA


Transcript of Tech Talk NVIDIA CUDA

Page 1: Tech Talk NVIDIA CUDA


NVIDIA CUDA The Compute Unified Device Architecture

Jens Rühmkorf Tech Talk, DLR Köln-Porz, July 22nd 2009

Page 2: Tech Talk NVIDIA CUDA


References

University of Illinois at Urbana-Champaign, Wen-Mei Hwu & David Kirk, course ECE 498 AL, Spring 2009: http://courses.ece.illinois.edu/ece498/al/

Website about General-Purpose Computation on Graphics Hardware: http://gpgpu.org/developer/cuda

ACM Queue, Vol. 6 No. 2, March/April 2008 (issue on GPGPU): http://mags.acm.org/queue/20080304/

Dr. Dobb's: CUDA, Supercomputing for the Masses, Part 1-13: http://www.ddj.com/architect/207200659

NVIDIA CUDA Best Practices Guide: http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_BestPracticesGuide_2.3.pdf

Hubert Nguyen (ed.), GPU Gems 3, Addison-Wesley, 2007, online: http://developer.nvidia.com/object/gpu-gems-3.html

Page 3: Tech Talk NVIDIA CUDA


Multi- and Manycore Architectures: A Difficult Road Lies Ahead

Don Knuth on multicore architectures: "[…] my personal unhappiness with the current trend toward multicore architecture. To me, it looks more or less like the hardware designers have run out of ideas, and that they're trying to pass the blame for the future demise of Moore's Law to the software writers by giving us machines that work faster only on a few key benchmarks! I won't be surprised at all if the whole multithreading idea turns out to be a flop."

In: InformIT, April 25th 2008, http://www.informit.com/articles/article.aspx?p=1193856

Page 4: Tech Talk NVIDIA CUDA


Overview

A high level view on CUDA

CUDA programming model

CUDA memory model

CUDA application programming interface

Simple CUDA example

Page 5: Tech Talk NVIDIA CUDA


Multicore and Manycore (1): Structural Differences

Multicore ("yoke of oxen"): each core is optimized for executing a single thread.

Manycore ("flock of chickens"): cores are optimized for aggregate throughput, deemphasizing individual performance.

Page 6: Tech Talk NVIDIA CUDA


Multicore and Manycore (2): Technical Characteristics

Specifications          | Core i7 960                                    | GTX 285
Processing elements     | 4 cores, 4-way SIMD @ 3.2 GHz                  | 30 cores, 8-way SIMD @ 1.5 GHz
Resident threads (max)  | 4 cores x 2 threads x 4-wide SIMD: 32 strands  | 30 cores x 32 SIMD vectors x 32-wide SIMD: 30720 strands
SP GFLOP/s              | 102                                            | 1080
Memory bandwidth        | 25.6 GB/s                                      | 159 GB/s
Register file           | -                                              | 1.875 MB
Local store             | -                                              | 480 kB

Page 7: Tech Talk NVIDIA CUDA


Multicore and Manycore (3): Performance Comparison CPU vs. GPU

Figure: single-precision floating-point operations per second over time, CPU vs. GPU.

Page 8: Tech Talk NVIDIA CUDA


An Example of the Physical Reality Behind CUDA

Figure: a CPU (host) connected to a GPU with local DRAM (device).

Page 9: Tech Talk NVIDIA CUDA


CUDA Processing Flow

Page 10: Tech Talk NVIDIA CUDA


CUDA in a Nutshell: Key Characteristics

CUDA is designed for wide SIMD/SPMD parallelism and scalability.

CUDA provides three key abstractions, organized as a hierarchy:
- thread groups,
- shared memories, and
- barrier synchronization.

CUDA programs are written in C plus extensions.

OpenCL is inspired by CUDA, but is hardware- and software-vendor neutral; the programming model is essentially identical.

Page 11: Tech Talk NVIDIA CUDA


Hello World

// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    // Run N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}

hello-world.cu
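The slide elides the host-side setup (N, d_A, d_B, and d_C are assumed to exist). A minimal self-contained version might look like this; the host buffer names and the choice of N are ours, not part of the original slide:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(float* A, float* B, float* C) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    const int N = 1024;                 // multiple of 256, so N/256 blocks cover all elements
    const size_t bytes = N * sizeof(float);

    // Host buffers
    float *h_A = new float[N], *h_B = new float[N], *h_C = new float[N];
    for (int i = 0; i < N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    // Device buffers
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", h_C[0]);      // expect 3.0

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    delete[] h_A; delete[] h_B; delete[] h_C;
    return 0;
}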

Page 12: Tech Talk NVIDIA CUDA


Overview

A high level view on CUDA

CUDA programming model

CUDA memory model

CUDA application programming interface

Simple CUDA example

Page 13: Tech Talk NVIDIA CUDA


CUDA Programming Model: Structure of a CUDA Application

An integrated host + device application is a single C program:
- serial or modestly parallel parts run as host C code;
- highly parallel parts run as device SPMD kernel C code.

Execution alternates between host and device:

Serial code (host)
Parallel kernel (device): KernelA<<< nBlk, nTid >>>(args);
Serial code (host)
Parallel kernel (device): KernelB<<< nBlk, nTid >>>(args);

Page 14: Tech Talk NVIDIA CUDA


CUDA Programming Model: CUDA Devices and Threads

A compute device:
- is a coprocessor to the CPU (the host),
- has its own DRAM (device memory),
- runs many threads in parallel, and
- is typically a GPU, but can also be another type of parallel processing device.

Data-parallel portions of an application are expressed as device kernels, which run on many threads.

Page 15: Tech Talk NVIDIA CUDA


CUDA Programming Model: Arrays of Parallel Threads

A kernel is executed by an array of threads:
- all threads run the same code (SPMD);
- each thread uses its thread ID to compute memory addresses and make control decisions.

Figure: threads with threadID 0 … 7, each executing:

float x = input[threadID];
float y = func(x);
output[threadID] = y;

Page 16: Tech Talk NVIDIA CUDA


CUDA Programming Model: Use Thread Blocks for (Scalable) Cooperation

Divide the monolithic thread array into multiple blocks (Thread Block 0, Thread Block 1, …, Thread Block N-1). Each block runs threads 0 … 7 executing the same code:

float x = input[threadID];
float y = func(x);
output[threadID] = y;

Threads within a block can cooperate via shared memory, atomic operations, and barrier synchronization (see the reduction sketch below). Threads in different blocks cannot cooperate.
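To make the cooperation primitives concrete, here is a minimal sketch (ours, not from the talk) of a block-wide sum reduction using shared memory and __syncthreads():

// Each block of 256 threads reduces 256 input elements to one partial sum.
__global__ void blockSum(const float* input, float* partialSums) {
    __shared__ float scratch[256];                  // shared by all threads of this block
    int tid = threadIdx.x;
    scratch[tid] = input[blockIdx.x * blockDim.x + tid];
    __syncthreads();                                // barrier: all loads complete

    // Tree reduction in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            scratch[tid] += scratch[tid + stride];
        __syncthreads();                            // barrier after each step
    }

    if (tid == 0)
        partialSums[blockIdx.x] = scratch[0];       // one result per block
}

// Launch: blockSum<<<numBlocks, 256>>>(d_input, d_partialSums);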

Page 17: Tech Talk NVIDIA CUDA


CUDA Programming Model: Organisation of Thread Blocks

Thread blocks can be one-, two-, or three-dimensional arrays of threads. The host issues a sequence of kernel invocations (kernel 1, kernel 2) to the device. Each kernel is executed as a batch of threads, and this batch is organized as a grid of thread blocks.

Figure: a grid of two-dimensional thread blocks.

Page 18: Tech Talk NVIDIA CUDA


CUDA Programming Model: Block IDs and Thread IDs

Each thread uses its IDs to decide what data to work on:
- block ID (blockIdx): 1D or 2D
- thread ID (threadIdx): 1D, 2D, or 3D

This simplifies memory addressing when processing multidimensional data, as the sketch below shows.
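As an illustration (ours, not from the talk), a kernel processing a 2D image can derive its pixel coordinates directly from the block and thread IDs:

__global__ void brighten(float* image, int width, int height) {
    // 2D position of this thread within the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        image[y * width + x] += 0.1f;   // row-major addressing
}

// Host side: 16 x 16 threads per block, enough blocks to cover the image:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// brighten<<<grid, block>>>(d_image, width, height);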

Page 19: Tech Talk NVIDIA CUDA


Overview

A high level view on CUDA

CUDA programming model

CUDA memory model

CUDA application programming interface

Simple CUDA example

Page 20: Tech Talk NVIDIA CUDA


CUDA Memory Model: Overview

Global memory:
- is the main means of communicating R/W data between host and device,
- has contents visible to all threads, and
- has long access latency.

We will focus on global memory for now; constant and texture memory are not covered here.

Figure: the CUDA memory hierarchy. The host reads and writes global, constant, and texture memory. Within a grid, each block (Block (0, 0), Block (1, 0), …) has its own shared memory, and each thread (Thread (0, 0), Thread (1, 0), …) has its own registers.

Page 21: Tech Talk NVIDIA CUDA


CUDA Memory Model: CUDA Device Memory Allocation


cudaMalloc(): allocates an object in the device global memory. Requires two parameters:
- the address of a pointer to the allocated object, and
- the size of the allocated object in bytes.

cudaFree(): frees an object from the device global memory. Takes the pointer to the freed object as its parameter.
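A small usage sketch (ours, not from the slides):

float* d_buf = NULL;
size_t bytes = 256 * sizeof(float);
cudaMalloc((void**)&d_buf, bytes);   // pass the address of the pointer
// ... use d_buf in kernels ...
cudaFree(d_buf);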

Page 22: Tech Talk NVIDIA CUDA


CUDA Memory Model: CUDA Host-Device Data Transfer

cudaMemcpy(): performs a memory data transfer. Requires four parameters:
- pointer to destination,
- pointer to source,
- number of bytes to copy, and
- type of transfer, which is one of: host to host, host to device, device to host, device to device.

Asynchronous transfers are available as well (via cudaMemcpyAsync()).

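A usage sketch for the transfer (ours; h_buf, d_buf, and N are illustrative names):

// Copy N floats to the device, run kernels, and copy the results back
cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
// ... launch kernels operating on d_buf ...
cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);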

Page 23: Tech Talk NVIDIA CUDA


Overview

A high level view on CUDA

CUDA programming model

CUDA memory model

CUDA application programming interface

Simple CUDA example

Page 24: Tech Talk NVIDIA CUDA


CUDA API: Extended C

Figure: the NVCC build flow. The integrated source (foo.cu) is split by cudacc (an EDG C/C++ frontend): CPU host code (foo.cpp) is compiled with gcc / cl, while device code passes through the Open64 global optimizer to GPU assembly (foo.s) and on through OCG to G80 SASS (foo.sass).

Mark Murphy, "NVIDIA's Experience with Open64", www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

Page 25: Tech Talk NVIDIA CUDA


CUDA API: C for CUDA

- Function type specifiers: __global__, __device__, __host__
- Variable type specifiers: __device__, __shared__, __constant__
- Keywords: threadIdx, blockIdx
- Intrinsics / builtin functions: __syncthreads()
- Runtime API: memory, symbol, and execution management
- Function launch syntax

__device__ float filter[N];

__global__ void convolve(float* image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];
    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
float* myimage;
cudaMalloc((void**)&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);

image-convolution.cu

Page 26: Tech Talk NVIDIA CUDA


CUDA API: CUDA Function Type Qualifiers (1)

__global__ defines a kernel function; it must return void. __device__ and __host__ can be used together.

Qualifier                      | executed on: | only callable from:
__device__ float deviceFunc()  | device       | device
__global__ void kernelFunc()   | device       | host
__host__ float hostFunc()      | host         | host
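For illustration (ours, not from the slides), combining __host__ and __device__ lets one definition serve both sides:

// Compiled for both CPU and GPU; callable from host code and from kernels.
__host__ __device__ float square(float x) {
    return x * x;
}

__global__ void squareAll(float* data) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    data[i] = square(data[i]);   // device-side call
}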

Page 27: Tech Talk NVIDIA CUDA


CUDA API: CUDA Function Type Qualifiers (2)

__device__ functions cannot have their address taken. For functions executed on the device:
- no recursion,
- no static variable declarations inside the function, and
- no variable number of arguments.

Page 28: Tech Talk NVIDIA CUDA


CUDA API: CUDA Variable Type Qualifiers

__device__:
- resides in global memory space,
- has the lifetime of an application, and
- is accessible from all threads within the grid, and from the host through the runtime library.

__shared__ (optionally used together with __device__):
- resides in the shared memory space of a thread block,
- has the lifetime of the block, and
- is only accessible from the threads within the block.

__constant__ (optionally used together with __device__; not covered further here):
- resides in constant memory space,
- has the lifetime of an application, and
- is accessible from all threads within the grid, and from the host through the runtime library.
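The three qualifiers side by side, in an illustrative sketch of ours:

__constant__ float coeffs[16];        // constant memory: application lifetime, read-only in kernels
__device__ float accumulator[256];    // global memory: application lifetime, visible to all threads

__global__ void useMemorySpaces() {
    __shared__ float tile[128];       // shared memory: one copy per block, block lifetime
    tile[threadIdx.x % 128] = coeffs[threadIdx.x % 16];
    __syncthreads();
    accumulator[threadIdx.x % 256] = tile[threadIdx.x % 128];
}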

Page 29: Tech Talk NVIDIA CUDA


CUDA API: Calling a Kernel Function – Execution Configuration

A kernel function (i.e. a __global__ function) must be called with an execution configuration:

__global__ void kernelFunc(...) { ... }

dim3 dimGrid(100, 50);        // 5000 thread blocks
dim3 dimBlock(4, 8, 8);       // 256 threads per block
size_t sharedMemBytes = 64;   // 64 bytes of shared memory per block
kernelFunc<<< dimGrid, dimBlock, sharedMemBytes >>>(...);

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below).
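To make the asynchrony visible, a small self-contained sketch (ours; cudaThreadSynchronize() was the name in the CUDA 2.x runtime, later renamed cudaDeviceSynchronize()):

__global__ void slowKernel(float* data) {
    for (int i = 0; i < 1000; ++i)   // some busywork so the kernel takes a while
        data[threadIdx.x] += 1.0f;
}

int main() {
    float* d_data;
    cudaMalloc((void**)&d_data, 256 * sizeof(float));

    slowKernel<<<1, 256>>>(d_data);  // returns to the host immediately
    // ... the host is free to do independent work here ...
    cudaThreadSynchronize();         // block until the kernel has finished
    // Note: cudaMemcpy() also waits for prior kernels before copying.

    cudaFree(d_data);
    return 0;
}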

Page 30: Tech Talk NVIDIA CUDA


Overview

A high level view on CUDA

CUDA programming model

CUDA memory model

CUDA application programming interface

Simple CUDA example

Page 31: Tech Talk NVIDIA CUDA


A Simple CUDA Example: Matrix Multiplication

A simple matrix multiplication example that illustrates the basic features of memory and thread management in CUDA programs:
- shared memory usage is left for later,
- local, register usage,
- thread ID usage,
- memory data transfer API between host and device.

We assume square matrices for simplicity.

Page 32: Tech Talk NVIDIA CUDA


Simple CUDA Example: Square Matrix Multiplication

P = M * N, all of size WIDTH x WIDTH.

Here: without tiling!
- One thread calculates one element of P.
- M and N are loaded WIDTH times from global memory.

Figure: matrices M, N, and P, each WIDTH x WIDTH.

Page 33: Tech Talk NVIDIA CUDA


Memory Layout of a Matrix in C

Figure: a matrix M is laid out linearly in memory row by row: M0,0 M1,0 M2,0 M3,0, then M0,1 M1,1 M2,1 M3,1, and so on. Element Mx,y (column x, row y) is therefore stored at offset y * width + x.

Page 34: Tech Talk NVIDIA CUDA


Step 1: Matrix Multiplication – A Simple Host Version in C

Figure: computing P[i][j] walks row i of M and column j of N with index k.

// Matrix multiplication on the (CPU) host, accumulating in double precision
void matrixMulOnHost(float* M, float* N, float* P, int width) {
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j) {
            double sum = 0;
            for (int k = 0; k < width; ++k) {
                double a = M[i * width + k];
                double b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}

Page 35: Tech Talk NVIDIA CUDA


Step 2: Input Matrix Data Transfer (Host-sided Code)

void matrixMulOnDevice(float* M, float* N, float* P, int width) {
    int size = width * width * sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);

Page 36: Tech Talk NVIDIA CUDA


Step 3: Output Matrix Data Transfer (Host-sided Code)

    // 2. Kernel invocation code – to be shown later (step 5)

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md);
    cudaFree(Nd);
    cudaFree(Pd);
}

Page 37: Tech Talk NVIDIA CUDA


Step 4: Kernel Function (1)

// Matrix multiplication kernel – per-thread code
__global__ void matrixMulKernel(float* Md, float* Nd, float* Pd, int width) {
    // Pvalue stores the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;
    // continued on the next page …

Page 38: Tech Talk NVIDIA CUDA


Step 4: Kernel Function (2)

Figure: thread (tx, ty) walks row ty of Md and column tx of Nd with index k.

    for (int k = 0; k < width; ++k) {
        float Melement = Md[threadIdx.y * width + k];
        float Nelement = Nd[k * width + threadIdx.x];
        Pvalue += Melement * Nelement;
    }

    // Write the computed value to the output matrix
    int i = threadIdx.x + threadIdx.y * width;
    Pd[i] = Pvalue;
}

Page 39: Tech Talk NVIDIA CUDA


Step 5: Kernel Invocation (Host-sided Code)

    // Insert into step 2. from before
    // Set up the execution configuration
    dim3 dimGrid(1, 1);
    dim3 dimBlock(width, width);

    // Launch the device computation threads!
    matrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);

Page 40: Tech Talk NVIDIA CUDA


Example Far from Ideal: Only One Thread Block Used

Figure: Grid 1 contains a single block; thread (2, 2) computes one element of Pd from a row of Md and a column of Nd.

One block of threads computes the whole matrix Pd; each thread computes one element of Pd.

Each thread:
- loads a row of matrix Md,
- loads a column of matrix Nd,
- performs one multiply and one addition for each pair of Md and Nd elements, and
- has a compute to off-chip memory access ratio close to 1:1 (not very high).

The size of the matrix is limited by the number of threads allowed in a thread block; a multi-block variant (sketched below) removes this limit.
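The obvious fix is a 2D grid of blocks, so that matrix size is no longer bounded by the per-block thread limit. A sketch (ours, not from the talk; the tile size is arbitrary):

#define TILE 16

__global__ void matrixMulKernelMultiBlock(float* Md, float* Nd, float* Pd, int width) {
    // Global row and column handled by this thread
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= width || col >= width) return;   // guard for widths not divisible by TILE

    float Pvalue = 0;
    for (int k = 0; k < width; ++k)
        Pvalue += Md[row * width + k] * Nd[k * width + col];
    Pd[row * width + col] = Pvalue;
}

// Host side:
// dim3 dimBlock(TILE, TILE);
// dim3 dimGrid((width + TILE - 1) / TILE, (width + TILE - 1) / TILE);
// matrixMulKernelMultiBlock<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);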

Page 41: Tech Talk NVIDIA CUDA


CUDA: A Bright Future?

"Let your word be 'Yes, yes; no, no'; whatever is more than these is of evil."

Matthew 5:37

Page 42: Tech Talk NVIDIA CUDA


NVIDIA CUDA: Appendix Best Practices & Things to Watch Out For

Page 43: Tech Talk NVIDIA CUDA


Appendix: Best Practices & Things to Watch Out For

- Obtain relevant hardware data
- Compiling a CUDA program
- Linking
- Debugging
- C for CUDA vs. CUDA Driver API
- Watch out: floating point computations
- Unsupported C language elements
- Branching of code
- Coalesced access to device global memory
- Access patterns to avoid bank conflicts

Page 44: Tech Talk NVIDIA CUDA


Obtain Relevant Hardware Data

Make sure to obtain the relevant hardware data for your device: call cudaGetDeviceProperties() (a query sketch follows below).
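A small query sketch (ours); the fields shown are standard members of cudaDeviceProp:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int d = 0; d < deviceCount; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s\n", d, prop.name);
        printf("  multiprocessors:       %d\n", prop.multiProcessorCount);
        printf("  shared memory / block: %u bytes\n", (unsigned)prop.sharedMemPerBlock);
        printf("  max threads / block:   %d\n", prop.maxThreadsPerBlock);
        printf("  total global memory:   %u MB\n", (unsigned)(prop.totalGlobalMem >> 20));
    }
    return 0;
}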

Page 45: Tech Talk NVIDIA CUDA


Compiling a CUDA Program (1): Parallel Thread eXecution (PTX)

PTX is a virtual machine and ISA (instruction set architecture); it defines the programming model and the execution resources and state. NVCC compiles a C/C++ CUDA application to CPU code plus virtual PTX code; a PTX-to-target compiler then translates the PTX into target code for the physical GPU (G80, …).

For example, the source lines

float4 me = gx[gtid];
me.x += me.y * me.z;

compile to the PTX

ld.global.v4.f32 {$f1,$f3,$f5,$f7}, [$r9+0];
mad.f32 $f1, $f5, $f3, $f1;

Page 46: Tech Talk NVIDIA CUDA


Compiling a CUDA Program (2): NVCC as a Compiler Driver

Any source file containing CUDA language extensions must be compiled with NVCC. NVCC is a compiler driver: it works by invoking all the necessary tools and compilers, such as cudacc, g++, cl, …

NVCC outputs:
- C code (host CPU code), which must then be compiled with the rest of the application using another tool;
- PTX, either as object code directly, or as PTX source interpreted at runtime.

Page 47: Tech Talk NVIDIA CUDA


Linking

Any executable with CUDA code requires two dynamic libraries:
- the CUDA runtime library (cudart), and
- the CUDA core library (cuda).

Page 48: Tech Talk NVIDIA CUDA


Debugging Using the Device Emulation Mode

An executable compiled in device emulation mode (enabled via nvcc -deviceemu) runs completely on the host using the CUDA runtime:
- no device and no CUDA driver are needed;
- each device thread is emulated with a host thread.

When running in device emulation mode, one can:
- use host native debug support (breakpoints, inspection, etc.);
- access any device-specific data from host code and vice versa;
- call any host function from device code (e.g. printf) and vice versa;
- detect deadlock situations caused by improper usage of __syncthreads().

Page 49: Tech Talk NVIDIA CUDA


Device Emulation Mode Pitfalls

Emulated device threads execute sequentially, so simultaneous accesses to the same memory location by multiple threads can produce different results. Dereferencing device pointers on the host, or host pointers on the device, can produce correct results in device emulation mode, but will generate an error in device execution mode.

Page 50: Tech Talk NVIDIA CUDA


CUDA Driver API vs. C for CUDA (1): Extended C

Figure: the NVCC build flow once more (see page 24): integrated source (foo.cu) → cudacc (EDG C/C++ frontend) → CPU host code (foo.cpp, compiled with gcc / cl) and device code (Open64 global optimizer → GPU assembly foo.s → OCG → G80 SASS foo.sass).

Mark Murphy, "NVIDIA's Experience with Open64", www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

Page 51: Tech Talk NVIDIA CUDA


CUDA Driver API vs. C for CUDA (2): Mutually Exclusive – Choose One or the Other

The C runtime for CUDA handles kernel loading and sets up kernels before they are launched. Implicit code initialization, CUDA context management, CUDA module management (cubin and function mapping), kernel configuration, and parameter passing are all performed by the C runtime for CUDA. It comprises two principal parts:
- the low-level functions (cuda_runtime_api.h), which have a C-style interface that does not require compilation with nvcc;
- the high-level functions (cuda_runtime.h), which have a C++-style interface built on top of the low-level functions.

Of these, the high-level functions are the most commonly used. They wrap some of the low-level functions, using overloading, references, and default arguments. These wrappers can be used from C++ code and can be compiled with any C++ compiler.

The driver API is a lower-level API than the runtime API. Compared with the runtime API, the driver API has these advantages:
- no dependency on the runtime library;
- more control over devices (for example, only the driver API enables one CPU thread to control multiple GPUs);
- no C extensions in the host code, so compilers other than the default CPU compiler can be used.

Its primary disadvantages:
- verbose code;
- greater difficulty in debugging;
- no device emulation.

A key point is that for every runtime API function, there is an equivalent driver API function. The driver API, however, includes other functions missing in the runtime API, such as those for migrating a context from one host thread to another.

Page 52: Tech Talk NVIDIA CUDA


CUDA Driver API vs. C for CUDA (3): Example – Vector Addition Using C for CUDA

const unsigned int cnBlockSize = 512;
const unsigned int cnBlocks = 3;
const unsigned int cnDimension = cnBlocks * cnBlockSize;

// create CUDA device & context
cudaSetDevice(0); // pick first device

// allocate host vectors
float* pA = new float[cnDimension];
float* pB = new float[cnDimension];
float* pC = new float[cnDimension];

// initialize host memory
randomInit(pA, cnDimension);
randomInit(pB, cnDimension);

// allocate device memory
float *pDeviceMemA, *pDeviceMemB, *pDeviceMemC;
cudaMalloc((void**)&pDeviceMemA, cnDimension * sizeof(float));
cudaMalloc((void**)&pDeviceMemB, cnDimension * sizeof(float));
cudaMalloc((void**)&pDeviceMemC, cnDimension * sizeof(float));

// copy host vectors to device
cudaMemcpy(pDeviceMemA, pA, cnDimension * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(pDeviceMemB, pB, cnDimension * sizeof(float), cudaMemcpyHostToDevice);

vectorAdd<<<cnBlocks, cnBlockSize>>>(pDeviceMemA, pDeviceMemB, pDeviceMemC);

// copy result from device to host
cudaMemcpy(pC, pDeviceMemC, cnDimension * sizeof(float), cudaMemcpyDeviceToHost);

delete[] pA;
delete[] pB;
delete[] pC;
cudaFree(pDeviceMemA);
cudaFree(pDeviceMemB);
cudaFree(pDeviceMemC);

Page 53: Tech Talk NVIDIA CUDA


CUDA Driver API vs. C for CUDA (4): Example – Vector Addition Using the CUDA Driver API

const unsigned int cnBlockSize = 512;
const unsigned int cnBlocks = 3;
const unsigned int cnDimension = cnBlocks * cnBlockSize;

CUdevice hDevice;
CUcontext hContext;
CUmodule hModule;
CUfunction hFunction;

// create CUDA device & context
cuInit(0);
cuDeviceGet(&hDevice, 0); // pick first device
cuCtxCreate(&hContext, 0, hDevice);
cuModuleLoad(&hModule, "vectorAdd.cubin");
cuModuleGetFunction(&hFunction, hModule, "vectorAdd");

// allocate host vectors
float* pA = new float[cnDimension];
float* pB = new float[cnDimension];
float* pC = new float[cnDimension];

// initialize host memory
randomInit(pA, cnDimension);
randomInit(pB, cnDimension);

// allocate memory on the device
CUdeviceptr pDeviceMemA, pDeviceMemB, pDeviceMemC;
cuMemAlloc(&pDeviceMemA, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemB, cnDimension * sizeof(float));
cuMemAlloc(&pDeviceMemC, cnDimension * sizeof(float));

// copy host vectors to device
cuMemcpyHtoD(pDeviceMemA, pA, cnDimension * sizeof(float));
cuMemcpyHtoD(pDeviceMemB, pB, cnDimension * sizeof(float));

// set up parameter values
cuFuncSetBlockShape(hFunction, cnBlockSize, 1, 1);

#define ALIGN_UP(offset, alignment) \
    (offset) = ((offset) + (alignment) - 1) & ~((alignment) - 1)

int offset = 0;
void* ptr;
ptr = (void*)(size_t)pDeviceMemA;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);
ptr = (void*)(size_t)pDeviceMemB;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);
ptr = (void*)(size_t)pDeviceMemC;
ALIGN_UP(offset, __alignof(ptr));
cuParamSetv(hFunction, offset, &ptr, sizeof(ptr));
offset += sizeof(ptr);
cuParamSetSize(hFunction, offset);

// execute kernel
cuLaunchGrid(hFunction, cnBlocks, 1);

// copy the result from device back to host
cuMemcpyDtoH((void*)pC, pDeviceMemC, cnDimension * sizeof(float));

delete[] pA;
delete[] pB;
delete[] pC;
cuMemFree(pDeviceMemA);
cuMemFree(pDeviceMemB);
cuMemFree(pDeviceMemC);

Page 54: Tech Talk NVIDIA CUDA


Watch Out: Floating Point Computations – Differing Results of FP Computations

Results of floating-point computations will differ slightly because of:
- different compiler outputs and instruction sets, and
- use of extended precision for intermediate results.

There are various options to force strict single precision on the host.

Page 55: Tech Talk NVIDIA CUDA


Watch Out: Floating Point Computations – Single and Double Precision Operations

Double precision: no deviations from the IEEE 754 standard.

Single precision:
- denormals and signalling NaNs are not supported;
- only two IEEE rounding modes are supported (chop and round-to-nearest even); and
- the precision of division/square root is slightly lower than single precision.

Page 56: Tech Talk NVIDIA CUDA


Limitations (1): Only a Subset of C Available

C for CUDA offers only a subset of the C language. It is:
- recursion-free, and
- function-pointer-free: functions reside in the global device memory, therefore we cannot obtain their addresses.

Page 57: Tech Talk NVIDIA CUDA


Limitations (2): Branching in Program Code

For best performance, threads should run in groups of 32 threads (32 threads = 1 warp). All threads of a warp should take the same execution path; otherwise, branching will probably hurt performance. The sketch below illustrates the difference.
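An illustrative pair of branches (ours, not from the slides):

__global__ void branching(float* data) {
    int tid = threadIdx.x;

    // Divergent: even and odd threads of the SAME warp take different
    // paths, so the warp executes both paths one after the other.
    if (tid % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;

    // Uniform per warp: the condition is constant within each group of
    // 32 threads, so no warp diverges.
    if ((tid / 32) % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;
}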

Page 58: Tech Talk NVIDIA CUDA


Coalesced Access to Device Global Memory

High priority: ensure global memory accesses are coalesced whenever possible. Global memory loads and stores by the threads of a half-warp (16 threads) are coalesced by the device into as few as one transaction (or two transactions in the case of 128-bit words), but certain access requirements have to be met (see the sketch below).
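A sketch of the difference (ours; the strided pattern is uncoalesced on G80-class hardware):

__global__ void coalescedVsNot(const float* input, float* output) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: thread k of a half-warp reads the k-th consecutive float.
    float v = input[tid];

    // Uncoalesced on G80-class hardware: a stride of two words spreads one
    // half-warp's requests over twice the segment width.
    float w = input[2 * tid];

    output[tid] = v + w;   // caller must size input for the strided read
}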

Page 59: Tech Talk NVIDIA CUDA


Coalesced Access (Reading Floats)

Page 60: Tech Talk NVIDIA CUDA


Uncoalesced Access (Reading Floats)

Page 61: Tech Talk NVIDIA CUDA


Shared Memory – Bank Conflicts

Shared memory: 16 KB, organized in 16 banks of 1 KB each. Shared memory is as fast as a register … if no bank conflicts occur!

Bank conflict: more than one thread in the same half-warp accesses the same bank. Such accesses have to be serialized; cost = max(number of simultaneous accesses to the same bank). A padding sketch follows below.
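A classic illustration (ours, not from the slides): when the threads of a half-warp walk a column of a 16 x 16 shared-memory tile, all accesses hit the same bank; padding each row by one word restores conflict-free access.

__global__ void transposeTile(const float* in, float* out) {
    // With tile[16][16], the column reads below would have a stride of
    // 16 words, so all 16 threads of a half-warp would hit one bank.
    // Padding to 17 floats per row puts consecutive rows of a column
    // into different banks.
    __shared__ float tile[16][17];

    tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * 16 + threadIdx.x];
    __syncthreads();
    // Read transposed: each half-warp reads a column of the tile.
    out[threadIdx.x * 16 + threadIdx.y] = tile[threadIdx.x][threadIdx.y];
}

// Launch for this sketch: transposeTile<<<1, dim3(16, 16)>>>(d_in, d_out);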

Page 62: Tech Talk NVIDIA CUDA


Shared Memory – No Bank Conflicts

Figure: conflict-free access patterns:
- linear addressing, step size = 1 word;
- random permutation;
- linear addressing, step size = 3 words;
- broadcast.

Page 63: Tech Talk NVIDIA CUDA


Shared Memory – Bank Conflicts

Figure: conflicting access patterns:
- linear addressing, step size = 2 words;
- linear addressing, step size = 8 words;
- no conflict or 5-way conflict.