GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved...
![Page 1: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/1.jpg)
University of Maryland
GPU WORKSHOP
![Page 2: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/2.jpg)
AGENDA
1 Intro to GPU Computing
2 OpenACC with hands-on
3 CUDA C/C++ with hands-on
![Page 3: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/3.jpg)
3
Parallel programming—Why do you care?
![Page 4: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/4.jpg)
The world IS parallel
![Page 5: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/5.jpg)
5
Accelerator Programming—Why do you Care?
![Page 6: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/6.jpg)
HPC's Biggest Challenge: Power
Power of a 300-petaflop CPU-only supercomputer = power for the city of San Francisco
![Page 7: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/7.jpg)
UNPRECEDENTED VALUE TO SCIENTIFIC COMPUTING
AMBER Molecular Dynamics Simulation, DHFR NVE Benchmark
1 Tesla K40 GPU: 102 ns/day
64 Sandy Bridge CPUs: 58 ns/day
![Page 8: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/8.jpg)
3 WAYS TO ACCELERATE APPLICATIONS
Applications
Libraries: "Drop-in" Acceleration
OpenACC Directives: Easily Accelerate Applications
Programming Languages: Maximum Flexibility
![Page 9: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/9.jpg)
GPU ACCELERATED LIBRARIES
"Drop-in" Acceleration for your Applications
Linear Algebra (FFT, BLAS, SPARSE, Matrix): NVIDIA cuFFT, cuBLAS, cuSPARSE
Numerical & Math (RAND, Statistics): NVIDIA Math Lib, NVIDIA cuRAND
Data Struct. & AI (Sort, Scan, Zero Sum): GPU AI (Board Games), GPU AI (Path Finding)
Visual Processing (Image & Video): NVIDIA NPP, NVIDIA Video Encode
![Page 10: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/10.jpg)
3 WAYS TO ACCELERATE APPLICATIONS
Applications
Libraries: "Drop-in" Acceleration
OpenACC Directives: Easily Accelerate Applications
Programming Languages: Maximum Flexibility
![Page 11: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/11.jpg)
11
OPENACC DIRECTIVES
Program myscience
... serial code ...
!$acc kernels
do k = 1,n1
do i = 1,n2
... parallel code ...
enddo
enddo
!$acc end kernels
...
End Program myscience
Your original Fortran or C code
Simple compiler hints
Compiler parallelizes code
Works on many-core GPUs & multicore CPUs
[Figure labels: the serial code runs on the CPU; the region marked with the OpenACC compiler hint runs on the GPU]
![Page 12: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/12.jpg)
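FAMILIAR TO OPENMP PROGRAMMERS
OpenMP (CPU)
#include <stdio.h>
#define N 1000000  /* number of intervals; value assumed, not given on the slide */
int main(void) {
    double pi = 0.0; long i;
    #pragma omp parallel for reduction(+:pi)
    for (i=0; i<N; i++)
    {
        double t = (double)((i+0.5)/N);  /* midpoint of interval i */
        pi += 4.0/(1.0+t*t);
    }
    printf("pi = %f\n", pi/N);
    return 0;
}
OpenACC (CPU and GPU)
#include <stdio.h>
#define N 1000000  /* number of intervals; value assumed, not given on the slide */
int main(void) {
    double pi = 0.0; long i;
    #pragma acc kernels
    for (i=0; i<N; i++)
    {
        double t = (double)((i+0.5)/N);  /* midpoint of interval i */
        pi += 4.0/(1.0+t*t);
    }
    printf("pi = %f\n", pi/N);
    return 0;
}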
![Page 13: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/13.jpg)
DIRECTIVES: EASY & POWERFUL
Real-Time Object Detection (Global Manufacturer of Navigation Systems): 5x in 40 Hours
Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company): 2x in 4 Hours
Interaction of Solvents and Biomolecules (University of Texas at San Antonio): 5x in 8 Hours
"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications."
-- Developer at the Global Manufacturer of Navigation Systems
![Page 14: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/14.jpg)
A VERY SIMPLE EXERCISE: SAXPY
SAXPY in C
void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
#pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...
SAXPY in Fortran
subroutine saxpy(n, a, x, y)
  real :: x(:), y(:), a
  integer :: n, i
!$acc kernels
  do i=1,n
    y(i) = a*x(i)+y(i)
  enddo
!$acc end kernels
end subroutine saxpy
...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d, y_d)
...
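The C SAXPY above passes bare pointers, so the compiler may not be able to work out how much data to move to the GPU. The following is a minimal sketch, not the workshop's own solution, that states the array extents with the standard OpenACC `copyin` and `copy` data clauses. The names `saxpy_acc` and `saxpy_acc_check` are illustrative assumptions. When built without an OpenACC compiler the pragma is ignored and the loop simply runs on the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* SAXPY with explicit OpenACC data clauses: x is only read (copyin),
   y is read and written (copy). Without OpenACC support the pragma is
   ignored and the loop runs serially on the CPU. */
void saxpy_acc(int n, float a, const float *x, float *restrict y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Hypothetical self-check: fills x[i] = i and y[i] = 1, runs SAXPY with
   a = 2, and returns the number of elements that are not 2*i + 1. */
int saxpy_acc_check(int n)
{
    float x[256], y[256];
    int errors = 0;
    assert(n >= 0 && n <= 256);
    for (int i = 0; i < n; ++i) {
        x[i] = (float)i;
        y[i] = 1.0f;
    }
    saxpy_acc(n, 2.0f, x, y);
    for (int i = 0; i < n; ++i)
        if (y[i] != 2.0f * (float)i + 1.0f)
            ++errors;
    return errors;
}
```

Calling `saxpy_acc_check(256)` should return 0 whether or not the loop was offloaded, which is a quick way to confirm the directive did not change the result.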
![Page 15: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/15.jpg)
15
GPU Architecture
![Page 16: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/16.jpg)
GPU ARCHITECTURE
Two Main Components
Global memory
Analogous to RAM in a CPU server
Accessible by both GPU and CPU
Currently up to 12 GB
ECC on/off options for Quadro and Tesla products
Streaming Multiprocessors (SM)
Perform the actual computation
Each SM has its own: control units, registers, execution pipelines, caches
![Page 17: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/17.jpg)
GPU ARCHITECTURE
Streaming Multiprocessor (SM)
Many CUDA Cores per SM
Architecture dependent
Special-function units
cos/sin/tan, etc.
Shared mem + L1 cache
Thousands of 32-bit registers
[Figure: SM block diagram showing an instruction cache, two scheduler/dispatch pairs, a register file, 32 cores, 16 load/store units, 4 special-function units, an interconnect network, 64K configurable cache/shared memory, and a uniform cache]
![Page 18: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/18.jpg)
GPU ARCHITECTURE
CUDA Core
Floating point & Integer unit
IEEE 754-2008 floating-point standard
Fused multiply-add (FMA) instruction for both single and double precision
Logic unit
Move, compare unit
Branch unit
[Figure: SM block diagram with one CUDA core expanded to show its dispatch port, operand collector, FP unit, INT unit, and result queue]
![Page 19: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/19.jpg)
GPU MEMORY HIERARCHY REVIEW
[Figure: each SM (SM-0, SM-1, ..., SM-N) has its own registers, L1 cache, and shared memory (SMEM); all SMs share the L2 cache, which sits in front of global memory]
![Page 20: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/20.jpg)
GPU ARCHITECTURE
Memory System on each SM
Extremely fast, but small, i.e., tens of KB
Programmer chooses how the on-chip memory is split between L1 cache and shared memory
L1
Hardware-managed; used for things like register spilling
Should NOT attempt to utilize like CPU caches
Shared Memory: programmer MUST synchronize data accesses!
User-managed scratch pad
Repeated access to same data or multiple threads with same data
![Page 21: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/21.jpg)
GPU ARCHITECTURE
Memory system on each GPU board
Unified L2 cache (hundreds of KB)
Fast, coherent data sharing across all cores in the GPU
ECC protection
DRAM
ECC supported for GDDR5 memory
All major internal memories are ECC protected
Register file, L1 cache, L2 cache
![Page 22: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/22.jpg)
22
CUDA Programming model
![Page 23: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/23.jpg)
ANATOMY OF A CUDA C/C++ APPLICATION
Serial code executes in a Host (CPU) thread
Parallel code executes in many Device (GPU) threads across multiple processing elements
[Figure: a CUDA C/C++ application alternates between serial code on the Host (CPU) and parallel code on the Device (GPU)]
![Page 24: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/24.jpg)
CUDA C : C WITH A FEW KEYWORDS
Standard C Code
void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);
Parallel C Code
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
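To see why `nblocks = (n + 255) / 256` and the `if (i < n)` guard belong together, the launch can be emulated on the CPU. The sketch below is plain C, not CUDA: the built-in variables `blockIdx.x`, `blockDim.x` and `threadIdx.x` are passed as ordinary arguments, and two nested loops stand in for the grid of threads. The names `blocks_needed`, `saxpy_emulated` and `saxpy_emulated_errors` are assumptions made for illustration.

```c
#include <assert.h>

/* Number of blocks so that nblocks * block_dim >= n (ceiling division). */
int blocks_needed(int n, int block_dim)
{
    assert(n >= 0 && block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}

/* Body of saxpy_parallel for a single thread, with the CUDA built-ins
   passed in as ordinary arguments. */
static void saxpy_thread(int block_idx, int block_dim, int thread_idx,
                         int n, float a, const float *x, float *y)
{
    int i = block_idx * block_dim + thread_idx;  /* global element index */
    if (i < n)                                   /* last block may be partial */
        y[i] = a * x[i] + y[i];
}

/* Serial stand-in for saxpy_parallel<<<nblocks, block_dim>>>(n, a, x, y). */
void saxpy_emulated(int block_dim, int n, float a, const float *x, float *y)
{
    int nblocks = blocks_needed(n, block_dim);
    for (int b = 0; b < nblocks; ++b)
        for (int t = 0; t < block_dim; ++t)
            saxpy_thread(b, block_dim, t, n, a, x, y);
}

/* Hypothetical self-check: returns the number of wrong elements. */
int saxpy_emulated_errors(int n)
{
    float x[1024], y[1024];
    int errors = 0;
    assert(n >= 0 && n <= 1024);
    for (int i = 0; i < n; ++i) {
        x[i] = (float)i;
        y[i] = 1.0f;
    }
    saxpy_emulated(256, n, 2.0f, x, y);
    for (int i = 0; i < n; ++i)
        if (y[i] != 2.0f * (float)i + 1.0f)
            ++errors;
    return errors;
}
```

For `n = 1000` this launches 4 blocks of 256 threads, so 24 threads in the last block fall outside the array; without the guard they would write past the end of `y`.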
![Page 25: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/25.jpg)
26
CUDA KERNELS
Parallel portion of application: execute as a kernel
Entire GPU executes kernel, many threads
CUDA threads:
Lightweight
Fast switching
1000s execute simultaneously
CPU Host Executes functions
GPU Device Executes kernels
![Page 26: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/26.jpg)
27
CUDA KERNELS: PARALLEL THREADS
A kernel is a function executed on the GPU as an array of threads in parallel
All threads execute the same code, can take different paths
Each thread has an ID
Select input/output data
Control decisions
float x = input[threadIdx.x];
float y = func(x);
output[threadIdx.x] = y;
![Page 27: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/27.jpg)
CUDA Kernels: Subdivide into Blocks
Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed as a grid of blocks of threads
GPU
![Page 32: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/32.jpg)
Kernel Execution
CUDA thread / CUDA core
• Each thread is executed by a core
CUDA thread block / CUDA Streaming Multiprocessor
• Each block is executed by one SM and does not migrate
• Several concurrent blocks can reside on one SM depending on the blocks’ memory requirements and the SM’s memory resources
CUDA kernel grid / CUDA-enabled GPU
• Each kernel is executed on one device
• Multiple kernels can execute on a device at one time
![Page 33: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/33.jpg)
Thread blocks allow cooperation
Threads may need to cooperate:
Cooperatively load/store blocks of memory all will use
Share results with each other or cooperate to produce a single result
Synchronize with each other
[Figure: SM block diagram (register file, schedulers, cores, load/store units, special-function units, configurable cache/shared memory)]
![Page 34: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/34.jpg)
THREAD BLOCKS ALLOW SCALABILITY
Blocks can execute in any order, concurrently or sequentially
This independence between blocks gives scalability:
A kernel scales across any number of SMs
[Figure: a kernel grid of 8 blocks (Block 0 to Block 7) is launched. On a device with 2 SMs the blocks run two at a time (Blocks 0-1, 2-3, 4-5, 6-7); on a device with 4 SMs they run four at a time (Blocks 0-3, 4-7)]
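The claim that blocks may run in any order can be checked with a small CPU emulation. The sketch below, in plain C rather than CUDA, runs the blocks of a SAXPY grid once in ascending and once in descending order and compares the results. The two orders are only examples, since a real GPU chooses its own schedule, and every name here (`run_block`, `run_grid`, `block_order_is_irrelevant`, the sizes) is an assumption for illustration.

```c
#include <assert.h>
#include <string.h>

enum { NUM_ELEMS = 1000, BLOCK_DIM = 128 };

/* Work done by one whole block of the SAXPY kernel (all of its threads). */
static void run_block(int b, float a, const float *x, float *y)
{
    assert(b >= 0);
    for (int t = 0; t < BLOCK_DIM; ++t) {
        int i = b * BLOCK_DIM + t;
        if (i < NUM_ELEMS)
            y[i] = a * x[i] + y[i];
    }
}

/* Runs every block exactly once, in ascending or descending order. */
static void run_grid(int descending, float a, const float *x, float *y)
{
    int nblocks = (NUM_ELEMS + BLOCK_DIM - 1) / BLOCK_DIM;
    for (int k = 0; k < nblocks; ++k)
        run_block(descending ? nblocks - 1 - k : k, a, x, y);
}

/* Returns 1 if both block orders produce bit-identical output. */
int block_order_is_irrelevant(void)
{
    static float x[NUM_ELEMS], y_fwd[NUM_ELEMS], y_rev[NUM_ELEMS];
    for (int i = 0; i < NUM_ELEMS; ++i) {
        x[i] = (float)i;
        y_fwd[i] = 1.0f;
        y_rev[i] = 1.0f;
    }
    run_grid(0, 2.0f, x, y_fwd);
    run_grid(1, 2.0f, x, y_rev);
    return memcmp(y_fwd, y_rev, sizeof y_fwd) == 0;
}
```

The orders agree because each block touches only its own slice of `y`; a kernel in which one block read another block's output would not pass this kind of check.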
![Page 35: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/35.jpg)
36
Memory System Hierarchy
![Page 36: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/36.jpg)
MEMORY HIERARCHY
Thread:
Registers
Local memory
Block of threads:
Shared memory
![Page 39: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/39.jpg)
40
MEMORY HIERARCHY : SHARED MEMORY
__shared__ int a[SIZE];
Allocated per thread block, same lifetime as the block
Accessible by any thread in the block
Several uses:
Sharing data among threads in a block
User-managed cache (reducing global memory accesses)
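As a rough illustration of sharing data among the threads of a block, the sketch below emulates a block-level sum in plain C. It is not CUDA code: the local array `smem` stands in for a `__shared__` array, each inner `for (t ...)` loop plays the part of all threads executing one phase, and the end of a phase stands in for the barrier `__syncthreads()`. The names `block_sum`, `block_sum_demo` and the block size of 8 are assumptions chosen to keep the example small.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK_DIM = 8 };  /* threads per block; must be a power of two here */

/* Emulates one thread block summing BLOCK_DIM inputs via a shared array. */
int block_sum(const int *in)
{
    int smem[BLOCK_DIM];  /* stands in for: __shared__ int smem[BLOCK_DIM]; */
    assert(in != NULL);

    /* Phase 0: every thread loads one element from global memory. */
    for (int t = 0; t < BLOCK_DIM; ++t)
        smem[t] = in[t];
    /* (barrier) */

    /* Tree reduction: halve the number of active threads each phase. */
    for (int stride = BLOCK_DIM / 2; stride > 0; stride /= 2) {
        for (int t = 0; t < stride; ++t)
            smem[t] += smem[t + stride];
        /* (barrier) */
    }
    return smem[0];  /* thread 0 would write this back to global memory */
}

/* Hypothetical usage: sums the integers 1 through 8. */
int block_sum_demo(void)
{
    int in[BLOCK_DIM] = {1, 2, 3, 4, 5, 6, 7, 8};
    return block_sum(in);
}
```

Within one phase the active threads read indices at or above `stride` and write indices below it, so they never conflict; the barriers are needed only between phases, which is why the loop structure above is a faithful stand-in.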
![Page 40: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/40.jpg)
41
MEMORY HIERARCHY
Thread:
Registers
Local memory
Block of threads:
Shared memory
All blocks:
Global memory
![Page 41: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/41.jpg)
42
MEMORY HIERARCHY : GLOBAL MEMORY
Accessible by all threads of any kernel
Data lifetime: from allocation to deallocation by host code
cudaMalloc (void ** pointer, size_t nbytes)
cudaMemset (void * pointer, int value, size_t count)
cudaFree (void* pointer)
![Page 42: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/42.jpg)
43
CUDA memory management
![Page 43: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/43.jpg)
44
MEMORY SPACES
CPU and GPU have separate memory spaces
Data is moved across PCIe bus
Use functions to allocate/set/copy memory on GPU just like standard C
Pointers are just addresses
Can’t tell from the pointer value whether the address is on CPU or GPU
Must use cudaPointerGetAttributes(…)
Must exercise care when dereferencing:
Dereferencing CPU pointer on GPU will likely crash
Dereferencing GPU pointer on CPU will likely crash
![Page 44: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/44.jpg)
45
GPU MEMORY ALLOCATION / RELEASE
Host (CPU) manages device (GPU) memory
cudaMalloc (void ** pointer, size_t nbytes)
cudaMemset (void * pointer, int value, size_t count)
cudaFree (void* pointer)
int n = 1024;
size_t nbytes = n*sizeof(int);
int * d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes);
cudaFree(d_a);
Note: Device memory, from the GPU point of view, is also referred to as global memory.
![Page 45: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/45.jpg)
46
DATA COPIES
cudaMemcpy( void *dst, const void *src, size_t nbytes, enum cudaMemcpyKind direction);
returns after the copy is complete
blocks CPU thread until all bytes have been copied
doesn’t start copying until previous CUDA calls complete
enum cudaMemcpyKind
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
Non-blocking memcopies are provided (cudaMemcpyAsync)
![Page 46: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/46.jpg)
47
Basic kernels and execution
![Page 47: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/47.jpg)
48
CUDA PROGRAMMING MODEL REVISITED
Parallel code (kernel) is launched and executed on a device by many threads
Threads are grouped into thread blocks
Parallel code is written for a thread
Each thread is free to execute a unique code path
Built-in thread and block ID variables
![Page 48: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/48.jpg)
49
THREAD HIERARCHY
Threads launched for a parallel section are partitioned into thread blocks
Grid = all blocks for a given launch
Thread block is a group of threads that can:
Synchronize their execution
Communicate via shared memory
![Page 49: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/49.jpg)
IDS AND DIMENSIONS
Threads
3D IDs, unique within a block
Blocks
2D IDs in this figure (up to 3D on compute capability 2.0 and later), unique within a grid
Dimensions set at launch time
Can be unique for each grid
Built-in variables
threadIdx, blockIdx
blockDim, gridDim
[Figure: on the device, Grid 1 is a 3 x 2 array of blocks, Block (0, 0) through Block (2, 1); Block (1, 1) is expanded into a 5 x 3 array of threads, Thread (0, 0) through Thread (4, 2)]
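The built-in variables are usually combined into a global position with `blockIdx * blockDim + threadIdx` on each axis. The following plain C sketch does that arithmetic on the CPU for the layout in the figure (a 3 x 2 grid of 5 x 3 blocks). The `idx2` struct is a simplified stand-in for CUDA's `dim3`/`uint3` types, and the function names are assumptions for illustration.

```c
#include <assert.h>

/* Simplified stand-in for CUDA's dim3 / uint3 (only x and y are used). */
typedef struct { int x, y; } idx2;

/* Global (column, row) of a thread: blockIdx * blockDim + threadIdx per axis. */
idx2 global_xy(idx2 block_idx, idx2 block_dim, idx2 thread_idx)
{
    idx2 g;
    assert(thread_idx.x < block_dim.x && thread_idx.y < block_dim.y);
    g.x = block_idx.x * block_dim.x + thread_idx.x;
    g.y = block_idx.y * block_dim.y + thread_idx.y;
    return g;
}

/* Row-major linear index of that thread within the whole launch. */
int global_linear(idx2 block_idx, idx2 block_dim, idx2 thread_idx, idx2 grid_dim)
{
    idx2 g = global_xy(block_idx, block_dim, thread_idx);
    int width = grid_dim.x * block_dim.x;  /* threads per row of the grid */
    return g.y * width + g.x;
}

/* Hypothetical example from the figure: Thread (2, 1) of Block (1, 1). */
int figure_example(void)
{
    idx2 grid = {3, 2}, block = {5, 3}, b = {1, 1}, t = {2, 1};
    return global_linear(b, block, t, grid);
}
```

For Thread (2, 1) in Block (1, 1) the global position is column 7, row 4, and with 15 threads per row that gives linear index 67.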
![Page 51: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/51.jpg)
52
LAUNCHING KERNELS ON GPU
Launch parameters (triple chevron <<<>>> notation)
grid dimensions (up to 3D on compute capability 2.0 and later), dim3 type
thread-block dimensions (up to 3D), dim3 type
shared memory: number of bytes per block, for extern shared-memory variables declared without size
Optional, 0 by default
stream ID
Optional, 0 by default
dim3 grid(16, 16);
dim3 block(16,16);
kernel<<<grid, block, 0, 0>>>(...);
kernel<<<32, 512>>>(...);
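Grid dimensions are typically derived from the problem size by rounding up, so that the blocks cover the whole domain. Below is a minimal CPU-side sketch in plain C of that calculation for a 2D problem; `dim2` is a simplified stand-in for `dim3`, and the function names and the 1000 x 600 example size are assumptions, not values from the workshop.

```c
#include <assert.h>

typedef struct { int x, y; } dim2;  /* stand-in for the x and y fields of dim3 */

/* Smallest grid of block-sized tiles that covers a width x height domain. */
dim2 grid_for(int width, int height, dim2 block)
{
    dim2 grid;
    assert(width >= 0 && height >= 0 && block.x > 0 && block.y > 0);
    grid.x = (width  + block.x - 1) / block.x;  /* ceiling division */
    grid.y = (height + block.y - 1) / block.y;
    return grid;
}

/* Total threads launched, including idle ones in partial edge blocks. */
long threads_launched(dim2 grid, dim2 block)
{
    return (long)grid.x * grid.y * block.x * block.y;
}

/* Hypothetical example: a 1000 x 600 domain tiled with 16 x 16 blocks. */
long example_threads(void)
{
    dim2 block = {16, 16};
    return threads_launched(grid_for(1000, 600, block), block);
}
```

The example needs a 63 x 38 grid, which launches 612,864 threads for 600,000 elements; the surplus threads are the ones a bounds check inside the kernel has to turn away. By the same arithmetic, the slide's `grid(16, 16)` with `block(16, 16)` launches 65,536 threads.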
![Page 52: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/52.jpg)
53
GPU KERNEL EXECUTION
Kernel launches on a grid of blocks, <<<grid,block>>>(arg1,…)
Each block is launched on one SM
A block is divided into warps of 32 threads each (think 32-way vector)
Warps in a block are scheduled and executed.
All threads in a warp execute same instruction simultaneously (think SIMD)
Number of blocks/SM determined by resources required by the block
Registers, shared memory, total warps, etc.
Block runs to completion on SM it started on, no migration.
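The two resource rules above (a block is split into 32-thread warps, and the number of resident blocks per SM is set by the scarcest resource) can be turned into a back-of-the-envelope calculation. The plain C sketch below is a simplification: it ignores allocation granularity, and the per-SM limits in `example_blocks_per_sm` are assumed numbers for illustration, not the specification of any particular GPU.

```c
#include <assert.h>

enum { WARP_SIZE = 32 };

/* Warps needed for one block; a partial last warp still occupies a warp. */
int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

static int min_int(int a, int b) { return a < b ? a : b; }

/* Hypothetical per-SM limits; real values depend on the architecture. */
typedef struct { int max_warps, registers, smem_bytes, max_blocks; } sm_limits;

/* Resident blocks per SM = the tightest of the individual limits. */
int blocks_per_sm(sm_limits sm, int threads_per_block,
                  int regs_per_thread, int smem_per_block)
{
    assert(regs_per_thread > 0 && smem_per_block >= 0);
    int by_warps = sm.max_warps / warps_per_block(threads_per_block);
    int by_regs  = sm.registers / (regs_per_thread * threads_per_block);
    int by_smem  = smem_per_block > 0 ? sm.smem_bytes / smem_per_block
                                      : sm.max_blocks;
    return min_int(min_int(by_warps, by_regs),
                   min_int(by_smem, sm.max_blocks));
}

/* Example with assumed limits: 64 warps, 65536 registers, 48 KB shared
   memory, and at most 16 blocks per SM. */
int example_blocks_per_sm(void)
{
    sm_limits sm = {64, 65536, 49152, 16};
    return blocks_per_sm(sm, 256, 32, 16384);
}
```

With these assumed numbers, warps and registers would each allow 8 blocks of 256 threads, but 16 KB of shared memory per block allows only 3, so shared memory is the limiting resource.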
![Page 53: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/53.jpg)
WARPS (THE REST OF THE STORY…)
A thread block consists of 32-thread warps
A warp is executed physically in parallel (SIMD) on a multiprocessor
[Figure: a thread block running on a multiprocessor, shown split into warps of 32 threads each]
![Page 54: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/54.jpg)
EXECUTION MODEL
Software: Thread / Hardware: Scalar Processor
Threads are executed by scalar processors
Software: Thread Block / Hardware: Multiprocessor
Thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
Software: Grid / Hardware: Device
A kernel is launched as a grid of thread blocks
![Page 55: GPU WORKSHOP - University Of Maryland · CPU and GPU have separate memory spaces Data is moved across PCIe bus Use functions to allocate/set/copy memory on GPU just like standard](https://reader033.fdocuments.net/reader033/viewer/2022053009/5f0c56787e708231d434e73e/html5/thumbnails/55.jpg)
56
BLOCKS MUST BE INDEPENDENT
Any possible interleaving of blocks should be valid
presumed to run to completion without pre-emption
can run in any order
can run concurrently OR sequentially
Blocks may coordinate but not synchronize
shared queue pointer: OK
shared lock: BAD … any dependence on order easily deadlocks
Independence requirement gives scalability
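The "shared queue pointer: OK" case can be sketched as blocks claiming work items through an atomic counter. The example below is a single-threaded CPU emulation in plain C using C11 `<stdatomic.h>`; on a GPU the same idea would typically use `atomicAdd` on a counter in global memory. The turn-taking scheme and all names (`block_turn`, `queue_demo`, `items_claimed_by`) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

enum { NUM_ITEMS = 10, NUM_BLOCKS = 3 };

static atomic_int next_item;             /* the shared queue pointer */
static int times_processed[NUM_ITEMS];   /* how often each item was handled */
static int items_per_block[NUM_BLOCKS];  /* how many items each block claimed */

/* One turn of one block: claim the next unclaimed item and process it.
   Returns 0 once the queue is empty. A block never waits for another. */
static int block_turn(int block)
{
    int item = atomic_fetch_add(&next_item, 1);
    if (item >= NUM_ITEMS)
        return 0;
    times_processed[item] += 1;
    items_per_block[block] += 1;
    return 1;
}

/* Gives turns to blocks 0, step, 2*step, ... (mod NUM_BLOCKS) until the
   queue is empty. Returns 1 if every item was processed exactly once. */
int queue_demo(int step)
{
    assert(step >= 0);
    atomic_store(&next_item, 0);
    for (int i = 0; i < NUM_ITEMS; ++i) times_processed[i] = 0;
    for (int b = 0; b < NUM_BLOCKS; ++b) items_per_block[b] = 0;

    for (int turn = 0; block_turn((turn * step) % NUM_BLOCKS); ++turn)
        ;
    for (int i = 0; i < NUM_ITEMS; ++i)
        if (times_processed[i] != 1)
            return 0;
    return 1;
}

/* Number of items the given block claimed in the last queue_demo run. */
int items_claimed_by(int block)
{
    assert(block >= 0 && block < NUM_BLOCKS);
    return items_per_block[block];
}
```

With `step = 0` a single block drains the whole queue (the sequential case), while `step = 1` interleaves the blocks round-robin; both finish with every item handled once. A lock would differ in exactly this respect: a block holding it could sit behind a block that has not been scheduled yet.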