CUDA - 101
Basics
Overview
• What is CUDA?
• Data Parallelism
• Host-Device model
• Thread execution
• Matrix-Multiplication
GPU revisited!
What is CUDA?
• Compute Unified Device Architecture
• Programming interface to the GPU
• Supports C/C++ and Fortran natively
  – Third-party wrappers for Python, Java, MATLAB, etc.
• Various libraries available
  – cuBLAS, cuFFT and many more…
  – https://developer.nvidia.com/gpu-accelerated-libraries
CUDA computing stack
[Figure: the CUDA computing stack, built up layer by layer across several slides]
Data Parallel programming
[Figure: inputs i1, i2, i3, …, iN each pass through an instance of the kernel, producing outputs o1, o2, o3, …, oN]
Data parallel algorithm
• Dot product: C = A · B
[Figure: each kernel instance multiplies one pair, Ci = Ai × Bi; the partial products are then summed]
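A minimal sketch of such a kernel (the name pairwiseProduct and a single-block launch are assumptions, not from the slides). Each thread computes one product Ci = Ai × Bi; summing the partial products would be a separate reduction step, omitted here.

    // Each thread computes one pairwise product of A and B.
    // Summing all C[i] into the final dot product would be a
    // separate reduction step, not shown.
    __global__ void pairwiseProduct(const float *A, const float *B,
                                    float *C, int N)
    {
        int i = threadIdx.x;      // each thread handles one element
        if (i < N)                // guard against extra threads
            C[i] = A[i] * B[i];
    }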
Host-Device model
[Figure: the CPU (Host) issues commands to the GPU (Device)]
Threads
• A thread is an instance of the kernel program
  – Independent in a data-parallel model
  – Can be executed on a different core
• Host tells the device to run a kernel program
  – And how many threads to launch (a launch sketch follows below)
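In CUDA C the host specifies the thread count in the launch configuration; a hedged sketch, reusing the pairwise-product kernel above (d_A, d_B, d_C are assumed device pointers):

    // Launch one block of N threads; each thread runs one instance
    // of the kernel on its own data element.
    pairwiseProduct<<<1, N>>>(d_A, d_B, d_C, N);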
Matrix-Multiplication
CPU-only Matrix-Multiplication
[Code slide: a nested loop executes the dot-product code for all elements of P; a sketch follows the indexing note below]
Memory Indexing in C (and CUDA)
M(i, j) = M[i + j * width]
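A plain-C sketch of the loop the slide above alludes to, using this flattened indexing (here i indexes columns and j rows; the function and variable names are illustrative):

    // CPU-only matrix multiplication: P = M * N, all square, size width.
    // Element M(i, j) is stored flat as M[i + j * width].
    void matrixMulOnHost(const float *M, const float *N, float *P, int width)
    {
        for (int j = 0; j < width; ++j) {        // row of P
            for (int i = 0; i < width; ++i) {    // column of P
                float sum = 0.0f;
                for (int k = 0; k < width; ++k)  // row of M times column of N
                    sum += M[k + j * width] * N[i + k * width];
                P[i + j * width] = sum;
            }
        }
    }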
CUDA version - I
CUDA program flow
• Allocate input and output memory on host– Do the same for device
• Transfer input data from host -> device• Launch kernel on device• Transfer output data from device -> host
Allocating Device memory
• Host tells the device when to allocate and free device memory
• Functions for the host program
  – cudaMalloc(memory reference, size)
  – cudaFree(memory reference)
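For example, allocating and later freeing a width × width matrix of floats on the device (cudaMalloc takes a pointer-to-pointer and a byte count; error checking omitted, width assumed defined):

    float *d_M;                                   // device pointer, held on the host
    size_t bytes = width * width * sizeof(float);
    cudaMalloc((void **)&d_M, bytes);             // allocate device memory
    /* ... use d_M ... */
    cudaFree(d_M);                                // release it when done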
Transfer Data to/from device
• Again, the host tells the device when to transfer data
• cudaMemcpy(target, source, size, flag)
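The flag is one of the cudaMemcpyKind constants, which select the copy direction; for example (buffer names continue the sketch above):

    // Host -> device: copy input matrix M into device buffer d_M.
    cudaMemcpy(d_M, M, bytes, cudaMemcpyHostToDevice);
    // Device -> host: copy the result P back after the kernel has run.
    cudaMemcpy(P, d_P, bytes, cudaMemcpyDeviceToHost);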
CUDA version - II
[Flow diagram between Host Memory and Device Memory:]
1. Allocate matrix M on device; transfer M from host -> device
2. Allocate matrix N on device; transfer N from host -> device
3. Allocate matrix P on device
4. Execute kernel on device
5. Transfer P from device -> host
6. Free device memories for M, N and P
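Putting the steps in this diagram together, a sketch of the host side (names are illustrative; a single block of width × width threads is assumed, so width must stay within one block's limit; the kernel itself is sketched under the next slide; error checking omitted):

    void matrixMulOnDevice(const float *M, const float *N, float *P, int width)
    {
        size_t bytes = width * width * sizeof(float);
        float *d_M, *d_N, *d_P;

        cudaMalloc((void **)&d_M, bytes);                    // allocate M on device
        cudaMemcpy(d_M, M, bytes, cudaMemcpyHostToDevice);   // transfer M

        cudaMalloc((void **)&d_N, bytes);                    // allocate N on device
        cudaMemcpy(d_N, N, bytes, cudaMemcpyHostToDevice);   // transfer N

        cudaMalloc((void **)&d_P, bytes);                    // allocate P on device

        dim3 block(width, width);                            // one thread per element of P
        matrixMulKernel<<<1, block>>>(d_M, d_N, d_P, width); // execute kernel

        cudaMemcpy(P, d_P, bytes, cudaMemcpyDeviceToHost);   // transfer P back

        cudaFree(d_M); cudaFree(d_N); cudaFree(d_P);         // free device memory
    }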
Matrix Multiplication Kernel
• Kernel specifies the function to be executed on the device
  – Parameters = device memories, width
  – Thread = each element of output matrix P
  – Dot product of M's row and N's column
  – Write dot product at current location (sketched below)
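A sketch of the kernel these bullets describe (one thread per element of P; the single-block launch from the previous sketch is assumed, so threadIdx alone indexes the whole matrix):

    __global__ void matrixMulKernel(const float *M, const float *N,
                                    float *P, int width)
    {
        int i = threadIdx.x;   // column of P this thread computes
        int j = threadIdx.y;   // row of P this thread computes

        float sum = 0.0f;
        for (int k = 0; k < width; ++k)              // dot product of M's row j
            sum += M[k + j * width] * N[i + k * width];  // and N's column i

        P[i + j * width] = sum;                      // write at current location
    }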
Extensions : Function qualifiers
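For reference, the three standard CUDA C function qualifiers this slide covers:

    __global__ void kernelFunc();   // runs on device, launched from host (a kernel); must return void
    __device__ float deviceFunc();  // runs on device, callable only from device code
    __host__   float hostFunc();    // runs on host (the default for unqualified functions)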
Extensions : Thread indexing
• All threads execute the same code
  – But they need to work on separate memory data
• threadIdx.x & threadIdx.y
  – These variables automatically receive the corresponding values for their threads
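A small illustration (the kernel name is hypothetical), assuming a 2-D block of width × width threads:

    __global__ void writeOnes(float *P, int width)
    {
        int i = threadIdx.x;      // set automatically: this thread's column
        int j = threadIdx.y;      // set automatically: this thread's row
        P[i + j * width] = 1.0f;  // each thread writes a distinct element
    }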
Thread Grid
• Represents the group of all threads to be executed for a particular kernel
• Two-level hierarchy
  – A Grid is composed of Blocks
  – Each Block is composed of Threads
Thread Grid
[Figure: a width × width block of thread coordinates, from (0, 0), (1, 0), (2, 0), …, (width-1, 0) in the first row down to (0, width-1), …, (width-1, width-1) in the last]
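Grid and block dimensions are passed at kernel launch via dim3; a sketch matching the single-block layout above (kernel and buffer names continue the earlier sketches):

    dim3 dimGrid(1, 1);            // one block in the grid
    dim3 dimBlock(width, width);   // width x width threads per block
    matrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, width);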
Conclusion
• Sample code and tutorials
• CUDA nodes?
• Programming guide
  – http://docs.nvidia.com/cuda/cuda-c-programming-guide/
• SDK
  – https://developer.nvidia.com/cuda-downloads
  – Available for Windows, Mac and Linux
  – Lots of sample programs
QUESTIONS?