Zellescher Weg 16
Trefftz-Bau (HRSK-Anbau) - HRSK 151
Tel. +49 351 - 463 - 39871
Guido Juckeland ([email protected])
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
Einführung in die Programmierung von
Grafikprozessoren
PGI Accelerate, NVIDIA CUDA und OpenCL
Folie 2
Guido Juckeland
Gliederung
1. Motivation, Inhalte und Ziele des Kurses
2. Grundlagen
3. PGI Accelerate – der einfache Einstieg
4. Einführung NVIDIA CUDA
5. CUDA etwas tiefer
6. Einführung OpenCL
7. Wie weiter?
Folie 3
Motivation
Guido Juckeland
• Jeder hat einen Grafikprozessor im Rechner
• 2 TFLOP/s Rechenleistung (einfache Genauigkeit)
• Einfach nutzbare Programmierschnittstellen
• Ähnliche Datentypen im Grafikbereich und im wissenschaftlichen Rechnen
• Nächste Hochleistungsrechner-Beschaffung durch das ZIH
• Möglichkeit die Systemgröße (und auch den Stromverbrauch) kleiner zu halten
Folie 4
Guido Juckeland
Was ist Inhalt des Kurses?
Einführung (!) in die Thematik
Kennenlernen der Umgebungen von PGI, AMD und NVIDIA
Erste eigene Programme testen
Folie 5
Guido Juckeland
Was ist nicht Inhalt des Kurses?
Hardwarenahe Programmierung von Grafikprozessoren
Hybrid-parallele Anwendungen
Folie 6
Guido Juckeland
2. Grundlagen
Folie 7
Grafikkarte
Systemaufbau
Guido Juckeland
CPU
GPU
Haupt-
speicher
Grafik-
speicher
Folie 8
Wichtige Begriffe
• Anwendung: Programm, dass auf dem Hauptprozessor ausgeführt
• Übertragungsbandbreite: Anzahl der Daten, die pro Zeiteinheit zwischen Hauptspeicher und Grafikspeicher übertragen werden können
• Thread: Ein Stück „Arbeit‚, dass von einer Programminstanz bearbeitet wird
• SIMD (Single Instruction Multiple Data): Mit einer Prozessoranweisung mehr als ein Datum berechnen
• FLOP/s: Anzahl der Gleitkommabefehle, die pro Sekunde abgearbeitet werden
• Hauptprozessor: Allzweckprozessor, der jeden Algorithmus, der sich mit einer Turing-Maschine beschreiben lässt, ausführen kann
• Grafikprozessor: Spezialrecheneinheit, um 2- bzw. 3-dimensionale Bilder für die Bildausgabe berechnet
Guido Juckeland
Folie 9
Warum nicht schon früher?
• Man muss jetzt sowieso Multi-threading programmieren, um alle CPU Kerne zu nutzen – warum dann nicht gleich 100 GPU-Kerne nutzen?
• Softwareinfrastruktur vorhanden (gute Entwicklungsumgebung mit C-ähnlichen Sprachen)
• Frei programmierbare GPU Recheneinheiten
• Unterstützung mehrerer Datentypen auf den Grafikprozessoren
• DirectX9 Grafikkarten werden von STREAM/CUDA unterstützt
Guido Juckeland
Folie 10
Vergleich GPU und CPU
CPU
•5% der Fläche für ALU
•Hauptspeicher mit niedriger Zugriffszeit (1/10 GPU)
•Großer Cache (10*GPU)
•Linearer Speicher
GPU
•40% der Fläche für ALU
•Speicher mit hoher Bandbreite (10*CPU)
•Kleiner Cache
•Ausgefeilte Speichertechniken
Guido Juckeland
Folie 11
Guido Juckeland
3. PGI Accelerate – der einfache Einstieg
http://www.pgroup.com/resources/accel.htm
Folie 12
Components
• CUDA Fortran interface
PGI CUDA Fortran
• The OpenMP of CUDA programming
PGI Accelerate
Guido Juckeland
Works only with NVIDIA GPUs!!
Folie 13
How does it work?
/* alloc */ a = (float*)malloc(n*sizeof(float)); r = (float*)malloc(n*sizeof(float));
/* initialize */ for( i = 0; i < n; ++i ) a[i] = (float)(i+1);
/* compute */
for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
Guido Juckeland
Folie 14
How does it work (2)?
/* alloc */ a = (float*)malloc(n*sizeof(float)); r = (float*)malloc(n*sizeof(float));
/* initialize */ for( i = 0; i < n; ++i ) a[i] = (float)(i+1);
/* compute */
#pragma acc region
{
for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
}
Guido Juckeland
Folie 15
Erste Übung
• Einloggen (SSH) mit Username pXX (Passwort: CUDApXX!) auf joker.zih.tu-dresden.de
• Wenn grafischer Editor gebraucht, dann mit X-Forwarding einloggen und „gedit‚ oder „emacs‚ verwenden
• PGI Umgebung laden: module load software module load pgi
• Copy the examples to your home directory and unpack: cp /tmp/pgi_accelerator_examples.tar . tar xf pgi_accelerator_examples.tar
• Build and run the first example: make c1.exe # (f1.exe for Fortan) ./c1.exe
• Are you really using the GPU? export ACC_NOTIFY=1
• (Achtung, es streiten sich hier alle um EINE Grafikkarte)
Guido Juckeland
Folie 16
Cool – that was easy, but how fast was it?
Example 2
• Computes: for( i = 0; i < n; ++i ){ s = sinf(a[i]); c = cosf(a[i]); r[i] = s*s + c*c; }
• Correctness check: for( i = 0; i < n; ++i ) assert( fabsf(r[i] - e[i]) < 0.000001f );
• Watch out! GPUs not (yet) IEEE -754 compliant!
• Hint: When you have a choice, use trigonometric functions on the GPU
instead of square roots.
Guido Juckeland
Folie 17
Let‘s run it
• Compile and run: make c2.exe ./c2.exe 100000 iterations completed 2064439 microseconds on GPU 1692 microseconds on host
• What’s the deal? Compile again and run again: make c2.time ./c2.time /home/h1/juckel/joker/pgi1/c2.c main 32: region entered 1 time time(us): total=2037112 init=2035604 region=1508 kernels=83 data=806 w/o init: total=1508 max=1508 min=1508 avg=1508 34: kernel launched 1 times grid: [391] block: [256] time(us): total=83 max=83 min=83 avg=83
Guido Juckeland
Folie 18
Oh - one more thing (actually two)
• How to fix the initialization issue: Uncomment acc_init( acc_device_nvidia );
• What if I don‘t know if I (will) have a GPU?
• Build universal binaries make c2.uni
• Run on host only export ACC_DEVICE=host
• Now go and program a matrix-matrix multiply.
Guido Juckeland
Folie 19
Tuning accelerated code (1)
• In general use „time‚ option of pgi compiler!
• Initialization
• Always once per accelerated application
• Can only be avoided by running pgcudainit & (CAUTION!)
• Host-Accelerator Traffic
• Eats up a lot of time for typical kernels
• Mark temporary arrays as private #pragma acc region private(a[0:n-1])
• Mark data that is not need back on CPU as local #pragma acc region local(b[0:n-1])
• Use contigous memory and mark it as such #pragma acc region copy(a[0:n-1][0:m-1])
Guido Juckeland
Folie 20
Tuning accelerated code (2)
• Kernel performance
• Stride 1 accesses !! (Superword = 16 words)
• Scheduling • Specify the number of parallel running threads • Usually done automatically and done pretty good • If not enough parallelism in loop, do it manually #pragma acc region vector(x)
• Use performance tools (e.g. cudaprof)
• Data regions
• Keep data on GPU for multiple loops #pragma acc data region copy(a(1:n,1:m)) local(b(2:n-1,2:m-1)) copyin(w(2:n-1)) { for ... #pragma acc region
Guido Juckeland
Folie 21
Guido Juckeland
4. Einführung NVIDIA CUDA
http://www.nvidia.com/cuda
© 2008 NVIDIA Corporation.
What is CUDA?
CUDA is a scalable parallel programming model and a software environment for parallel computing
Minimal extensions to familiar C/C++ environmentHeterogeneous serial-parallel programming model
NVIDIA’s TESLA architecture accelerates CUDAExpose the computational horsepower of NVIDIA GPUsEnable GPU computing
CUDA also maps well to multicore CPUs
© 2008 NVIDIA Corporation.
Outline
CUDA programming modelBasics of CUDA programming
Software stackData managementExecuting code on the GPU
CUDA librariesBLASFFT
© 2008 NVIDIA Corporation.
Some Design Goals
Scale to 100’s of cores, 1000’s of parallel threads
Let programmers focus on parallel algorithmsNot on the mechanics of a parallel programming language
Enable heterogeneous systems (i.e. CPU + GPU)CPU and GPU are separate devices with separate DRAMs
© 2008 NVIDIA Corporation.
CUDA Kernels and Threads
Parallel portions of an application are executed on the device as kernels
One kernel is executed at a timeMany threads execute each kernel
Differences between CUDA and CPU threads CUDA threads are extremely lightweight
Very little creation overheadInstant switching
CUDA uses 1000s of threads to achieve efficiencyMulti-core CPUs can use only a few
Definitions Device = GPU Host = CPU
Kernel = function that runs on the device
© 2008 NVIDIA Corporation.
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threadsAll threads run the same codeEach thread has an ID that it uses to compute memory addresses and make control decisions
0 1 2 3 4 5 6 7
…float x = input[threadID];float y = func(x);output[threadID] = y;…
threadID
© 2008 NVIDIA Corporation.
Thread Cooperation
The Missing Piece: threads may need to cooperate
Thread cooperation is valuableShare results to avoid redundant computationShare memory accesses
Drastic bandwidth reduction
Thread cooperation is a powerful feature of CUDA
Cooperation between a monolithic array of threads is not scalable
Cooperation within smaller batches of threads is scalable
© 2008 NVIDIA Corporation.
Thread Batching
Kernel launches a grid of thread blocksThreads within a block cooperate via shared memoryThreads within a block can synchronizeThreads in different blocks cannot cooperate
Allows programs to transparently scale to different GPUs
GridThread Block 0
Shared Memory
Thread Block 1
Shared Memory
Thread Block N-1
Shared Memory
…
© 2008 NVIDIA Corporation.
Transparent Scalability
Kernel grid
Block 2 Block 3
Block 4 Block 5
Block 6 Block 7
Device Device
Block 0 Block 1 Block 2 Block 3
Block 4 Block 5 Block 6 Block 7
Block 0 Block 1
Block 2 Block 3
Block 4 Block 5
Block 6 Block 7
Block 0 Block 1
Hardware is free to schedule thread blocks on any processor
A kernel scales across parallel multiprocessors
© 2008 NVIDIA Corporation.
8-Series Architecture (G80)
128 thread processors execute kernel threads16 multiprocessors, each contains
8 thread processorsShared memory enables thread cooperation
SharedMemory
SharedMemory
SharedMemory
SharedMemory
SharedMemory
SharedMemory
SharedMemory
SharedMemory
SharedMemory
SharedMemory
SharedMemory
SharedMemory
SharedMemory
SharedMemory
SharedMemory
SharedMemory
Multiprocessor
ThreadProcessors
SharedMemory
© 2008 NVIDIA Corporation.
10-Series Architecture
240 thread processors execute kernel threads30 multiprocessors, each contains
8 thread processorsOne double-precision unitShared memory enables thread cooperation
ThreadProcessors
Multiprocessor
SharedMemory
Double
© 2008 NVIDIA Corporation.
Kernel Memory Access
Per-thread
Per-block
Per-device
ThreadRegisters
Local Memory
SharedMemoryBlock
...Kernel 0
...Kernel 1
GlobalMemoryTi
me
On-chip
Off-chip, uncached
• On-chip, small• Fast
• Off-chip, large• Uncached• Persistent across
kernel launches• Kernel I/O
© 2008 NVIDIA Corporation.
Physical Memory Layout
“Local” memory resides in device DRAMUse registers and shared memory to minimize local memory use
Host can read and write global memory but not shared memory
Multiprocessor
Host
CPU
ChipsetDRAM
Device
DRAM
Local Memory
GlobalMemory
GPU
Multiprocessor
MultiprocessorRegisters
Shared Memory
© 2008 NVIDIA Corporation.
Execution ModelSoftware Hardware
Threads are executed by thread processors
Thread
Thread Processor
Thread Block Multiprocessor
Thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on one multiprocessor - limited by multiprocessor resources (shared memory and register file)
...
Grid Device
A kernel is launched as a grid of thread blocks
Only one kernel can execute on a device at one time
© 2008 NVIDIA Corporation.
Key Parallel Abstractions in CUDA
Trillions of lightweight threadsSimple decomposition model
Hierarchy of concurrent threadsSimple execution model
Lightweight synchronization of primitivesSimple synchronization model
Shared memory model for thread cooperationSimple communication model
© 2008 NVIDIA Corporation.
Outline
CUDA programming modelBasics of CUDA programming
Software stackData managementExecuting code on the GPU
CUDA librariesBLASFFT
© 2008 NVIDIA Corporation.
CUDA Installation
CUDA installation consists of DriverCUDA Toolkit (compiler, libraries)CUDA SDK (example codes)
© 2008 NVIDIA Corporation.
CUDA Software Development
NVIDIA C Compiler
NVIDIA Assemblyfor Computing (PTX) CPU Host Code
Integrated CPU + GPUC Source Code
CUDA Optimized Libraries:math.h, FFT, BLAS, …
CUDADriver Profiler Standard C Compiler
GPU CPU
© 2008 NVIDIA Corporation.
Compiling CUDA Code
NVCC
C/C++ CUDAApplication
PTX to TargetCompiler
G80 … GPU
Target code
PTX Code Virtual
Physical
CPU Code
© 2008 NVIDIA Corporation.
Build Configurations
nvcc <filename>.cu [-o <executable>]Builds release mode
nvcc -g <filename>.cuBuilds debug modeCan debug host code but not device code
nvcc -deviceemu <filename>.cuBuilds device emulation modeAll code runs on CPU, no debug symbols
nvcc -deviceemu -g <filename>.cuBuilds debug device emulation modeAll code runs on CPU, with debug symbols
© 2008 NVIDIA Corporation.
Outline
CUDA programming modelBasics of CUDA programming
Software stackData managementExecuting code on the GPU
CUDA librariesBLASFFT
© 2008 NVIDIA Corporation.
Managing Memory
CPU and GPU have separate memory spacesHost (CPU) code manages device (GPU) memory:
Allocate / freeCopy data to and from deviceApplies to global device memory (DRAM)
Multiprocessor
Host
CPU
ChipsetDRAM
Device
DRAM
Local Memory
GlobalMemory
GPU
Multiprocessor
MultiprocessorRegisters
Shared Memory
© 2008 NVIDIA Corporation.
GPU Memory Allocation / Release
cudaMalloc(void ** pointer, size_t nbytes)cudaMemset(void * pointer, int value, size_t count)cudaFree(void* pointer)
int n = 1024;int nbytes = 1024*sizeof(int);int *a_d = 0;cudaMalloc( (void**)&a_d, nbytes );cudaMemset( a_d, 0, nbytes);cudaFree(a_d);
© 2008 NVIDIA Corporation.
Data Copies
cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);
direction specifies locations (host or device) of src and dstBlocks CPU thread: returns after the copy is completeDoesn’t start copying until previous CUDA calls complete
enum cudaMemcpyKindcudaMemcpyHostToDevicecudaMemcpyDeviceToHostcudaMemcpyDeviceToDevice
© 2008 NVIDIA Corporation.
Data Movement Example int main(void){ float *a_h, *b_h; // host data float *a_d, *b_d; // device data int N = 14, nBytes, i ;
nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);
for (i=0, i<N; i++) a_h[i] = 100.f + i;
cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
for (i=0; i< N; i++) assert( a_h[i] == b_h[i] ); free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return 0;}
© 2008 NVIDIA Corporation.
Data Movement Example int main(void){ float *a_h, *b_h; // host data float *a_d, *b_d; // device data int N = 14, nBytes, i ;
nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);
for (i=0, i<N; i++) a_h[i] = 100.f + i;
cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
for (i=0; i< N; i++) assert( a_h[i] == b_h[i] ); free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return 0;}
Host
a_h
b_h
© 2008 NVIDIA Corporation.
Data Movement Example int main(void){ float *a_h, *b_h; // host data float *a_d, *b_d; // device data int N = 14, nBytes, i ;
nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);
for (i=0, i<N; i++) a_h[i] = 100.f + i;
cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
for (i=0; i< N; i++) assert( a_h[i] == b_h[i] ); free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return 0;}
Host Device
a_h
b_h
a_d
b_d
© 2008 NVIDIA Corporation.
Data Movement Example int main(void){ float *a_h, *b_h; // host data float *a_d, *b_d; // device data int N = 14, nBytes, i ;
nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);
for (i=0, i<N; i++) a_h[i] = 100.f + i;
cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
for (i=0; i< N; i++) assert( a_h[i] == b_h[i] ); free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return 0;}
Host Device
a_h
b_h
a_d
b_d
© 2008 NVIDIA Corporation.
Data Movement Example int main(void){ float *a_h, *b_h; // host data float *a_d, *b_d; // device data int N = 14, nBytes, i ;
nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);
for (i=0, i<N; i++) a_h[i] = 100.f + i;
cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
for (i=0; i< N; i++) assert( a_h[i] == b_h[i] ); free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return 0;}
Host Device
a_h
b_h
a_d
b_d
© 2008 NVIDIA Corporation.
Data Movement Example int main(void){ float *a_h, *b_h; // host data float *a_d, *b_d; // device data int N = 14, nBytes, i ;
nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);
for (i=0, i<N; i++) a_h[i] = 100.f + i;
cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
for (i=0; i< N; i++) assert( a_h[i] == b_h[i] ); free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return 0;}
Host Device
a_h
b_h
a_d
b_d
© 2008 NVIDIA Corporation.
Data Movement Example int main(void){ float *a_h, *b_h; // host data float *a_d, *b_d; // device data int N = 14, nBytes, i ;
nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);
for (i=0, i<N; i++) a_h[i] = 100.f + i;
cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
for (i=0; i< N; i++) assert( a_h[i] == b_h[i] ); free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return 0;}
Host Device
a_h
b_h
a_d
b_d
© 2008 NVIDIA Corporation.
Data Movement Example int main(void){ float *a_h, *b_h; // host data float *a_d, *b_d; // device data int N = 14, nBytes, i ;
nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);
for (i=0, i<N; i++) a_h[i] = 100.f + i;
cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
for (i=0; i< N; i++) assert( a_h[i] == b_h[i] ); free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return 0;}
Host Device
a_h
b_h
a_d
b_d
© 2008 NVIDIA Corporation.
Data Movement Example int main(void){ float *a_h, *b_h; // host data float *a_d, *b_d; // device data int N = 14, nBytes, i ;
nBytes = N*sizeof(float); a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); cudaMalloc((void **) &a_d, nBytes); cudaMalloc((void **) &b_d, nBytes);
for (i=0, i<N; i++) a_h[i] = 100.f + i;
cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice); cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice); cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
for (i=0; i< N; i++) assert( a_h[i] == b_h[i] ); free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d); return 0;}
Host Device
Folie 23
Erste Übung
• Einloggen (SSH) mit Username pXX (Passwort: CUDApXX!) auf joker.zih.tu-dresden.de
• Wenn grafischer Editor gebraucht, dann mit X-Forwarding einloggen und „gedit‚ oder „emacs‚ verwenden
• CUDA Umgebung laden: module load software module load cuda
• Erzeugen Sie ein CUDA Programm für das letzte Beispiel
• Übersetzen Sie es mit „nvcc‚ in ein ausführbares Programm
• Führen Sie das Programm mit „./a.out‚ aus
• (Achtung, es streiten sich hier alle um EINE Grafikkarte)
Guido Juckeland
Folie 24
Guido Juckeland
6. CUDA etwas tiefer
© 2008 NVIDIA Corporation.
Outline
CUDA programming modelBasics of CUDA programming
Software stackData managementExecuting code on the GPU
CUDA librariesFFTBLAS
© 2008 NVIDIA Corporation.
Executing Code on the GPU
Kernels are C functions with some restrictions
Cannot access host memoryMust have void return typeNo variable number of arguments (“varargs”)Not recursiveNo static variables
Function arguments automatically copied from host to device
© 2008 NVIDIA Corporation.
Function Qualifiers
Kernels designated by function qualifier:__global__
Function called from host and executed on deviceMust return void
Other CUDA function qualifiers__device__
Function called from device and run on deviceCannot be called from host code
__host__
Function called from host and executed on host (default)__host__ and __device__ qualifiers can be combined to generate both CPU and GPU code
© 2008 NVIDIA Corporation.
Launching Kernels
Modified C function call syntax:
kernel<<<dim3 dG, dim3 dB>>>(…)
Execution Configuration (“<<< >>>”)dG - dimension and size of grid in blocks
Two-dimensional: x and yBlocks launched in the grid: dG.x * dG.y
dB - dimension and size of blocks in threads: Three-dimensional: x, y, and zThreads per block: dB.x * dB.y * dB.z
Unspecified dim3 fields initialize to 1
© 2008 NVIDIA Corporation.
Execution Configuration Examples
kernel<<<32,512>>>(...);
dim3 grid, block;grid.x = 2; grid.y = 4;block.x = 8; block.y = 16;
kernel<<<grid, block>>>(...);
dim3 grid(2, 4), block(8,16);
kernel<<<grid, block>>>(...);
Equivalent assignment using constructor functions
© 2008 NVIDIA Corporation.
CUDA Built-in Device Variables
All __global__ and __device__ functions have access to these automatically defined variables
dim3 gridDim;Dimensions of the grid in blocks (at most 2D)
dim3 blockDim;Dimensions of the block in threads
dim3 blockIdx;Block index within the grid
dim3 threadIdx;Thread index within the block
© 2008 NVIDIA Corporation.
Unique Thread IDs
Built-in variables are used to determine unique thread IDs
Map from local thread ID (threadIdx) to a global ID which can be used as array indices
0
0 1 2 3 4
1
0 1 2 3 4
2
0 1 2 3 4
blockIdx.x
blockDim.x = 5
threadIdx.x
blockIdx.x*blockDim.x+ threadIdx.x
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Grid
© 2008 NVIDIA Corporation.
Minimal Kernels
__global__ void minimal( int* a_d, int value){ *a_d = value;}
__global__ void assign( int* a_d, int value){ int idx = blockDim.x * blockIdx.x + threadIdx.x; a_d[idx] = value;}
© 2008 NVIDIA Corporation.
Increment Array Example
CPU program CUDA program
void inc_cpu(int *a, int N){ int idx;
for (idx = 0; idx<N; idx++) a[idx] = a[idx] + 1;}
int main(){ ... inc_cpu(a, N);}
__global__ void inc_gpu(int *a, int N){ int idx = blockIdx.x * blockDim.x + threadIdx.x; if (idx < N) a[idx] = a[idx] + 1;}
int main(){ … dim3 dimBlock (blocksize); dim3 dimGrid( ceil( N / (float)blocksize) ); inc_gpu<<<dimGrid, dimBlock>>>(a, N);}
© 2008 NVIDIA Corporation.
Host Synchronization
All kernel launches are asynchronouscontrol returns to CPU immediatelykernel executes after all previous CUDA calls have completed
cudaMemcpy() is synchronouscontrol returns to CPU after copy completescopy starts after all previous CUDA calls have completed
cudaThreadSynchronize()blocks until all previous CUDA calls complete
© 2008 NVIDIA Corporation.
Host Synchronization Example
// copy data from host to devicecudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);
// execute the kernelinc_gpu<<<ceil(N/(float)blocksize), blocksize>>>(a_d, N);
// run independent CPU coderun_cpu_stuff();
// copy data from device back to hostcudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);
© 2008 NVIDIA Corporation.
Variable Qualifiers (GPU code)
__device__Stored in global memory (large, high latency, no cache)Allocated with cudaMalloc (__device__ qualifier implied)Accessible by all threadsLifetime: application
__shared__Stored in on-chip shared memory (very low latency)Specified by execution configuration or at compile timeAccessible by all threads in the same thread blockLifetime: thread block
Unqualified variables:Scalars and built-in vector types are stored in registersWhat doesn’t fit in registers spills to “local” memory
© 2008 NVIDIA Corporation.
Using shared memory
Size known at compile time
__global__ void kernel(…){ … __shared__ float sData[256]; …}
int main(void){ … kernel<<<nBlocks,blockSize>>>(…); …}
Size known at kernel launch
__global__ void kernel(…){ … extern __shared__ float sData[]; …} int main(void){ … smBytes = blockSize*sizeof(float); kernel<<<nBlocks, blockSize,
smBytes>>>(…); …}
© 2008 NVIDIA Corporation.
GPU Thread Synchronization
void __syncthreads();Synchronizes all threads in a block
Generates barrier synchronization instructionNo thread can pass this barrier until all threads in the block reach itUsed to avoid RAW / WAR / WAW hazards when accessing shared memory
Allowed in conditional code only if the conditional is uniform across the entire thread block
© 2008 NVIDIA Corporation.
GPU Atomic Operations
Associative operationsadd, sub, increment, decrement, min, max, ...and, or, xorexchange, compare, swap
Atomic operations on 32-bit words in global memory
Requires compute capability 1.1 or higher (G84/G86/G92)
Atomic operations on 32-bit words in shared memory and 64-bit words in global memory
Requires compute capability 1.2 or higher
© 2008 NVIDIA Corporation.
Built-in Vector Types
Can be used in GPU and CPU code
[u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4], double[1..2]
Structures accessed with x, y, z, w fields: uint4 param; int y = param.y;
dim3Based on uint3Used to specify dimensionsDefault value (1,1,1)
© 2008 NVIDIA Corporation.
CUDA Error Reporting to CPU
All CUDA calls return error code:Except for kernel launchescudaError_t type
cudaError_t cudaGetLastError(void)Returns the code for the last error (no error has a code)Can be used to get error from kernel execution
char* cudaGetErrorString(cudaError_t code)Returns a null-terminated character string describing the error
printf(“%s\n”, cudaGetErrorString( cudaGetLastError() ) );
© 2008 NVIDIA Corporation.
CUDA Programming Resources
We only covered basic featuresSee Programming Guide for more of the APIAdditional features covered in the Optimization session
CUDA SDK examples
CUDA Zone - http://www.nvidia.com/cudaCUDA UForums
© 2008 NVIDIA Corporation.
Outline
CUDA programming modelBasics of CUDA programming
Software stackData managementExecuting code on the GPU
CUDA librariesBLASFFT
© 2008 NVIDIA Corporation.
CUBLAS
Implementation of BLAS (Basic Linear Algebra Subprograms) on top of CUDA driver
Self contained at the API level, no direct interaction with CUDA driver
Basic model for useCreate matrix and vector objects in GPU memory spaceFill objects with dataCall CUBLAS functionsRetrieve data
CUBLAS library helper functionsCreating and destroying data in GPU spaceWriting data to and retrieving data from objects
© 2008 NVIDIA Corporation.
Supported Features
Single Precision Double Precision*
Real Complex Real Complex
Level 1 ✔ ✔ ✔
Level 2 ✔ dgemv, dger,dsyr, dtrsv
Level 3 ✔ cgemm ✔ zgemm
*Double-precision functions only supported on GPUs with double-precision hardware
© 2008 NVIDIA Corporation.
Using CUBLAS
Interface to CUBLAS library is in cublas.hFunction naming convention
cublas + BLAS nameeg. cublasSgemm
Following BLAS convention, CUBLAS uses column-major storage Error handling
CUBLAS core functions do not return an errorCUBLAS provides function to retrieve last error recorded
CUBLAS helper functions do return an errorImplemented using C-based CUDA tool chain
Interfacing to C/C++ applications is trivial
© 2008 NVIDIA Corporation.
CUBLAS Helper Functions
cublasInit()Initializes CUBLAS library
cublasShutdown()Releases resources used by CUBLAS library
cublasGetError()Returns last error from CUBLAS core function (+ resets)
cublasAlloc()Wrapper around cudaMalloc() to allocate space for array
cublasFree()destroys object in GPU memory
cublas[Set|Get][Vector|Matrix]()Copies array elements between CPU and GPU memoryAccommodates non-unit strides
© 2008 NVIDIA Corporation.
sgemmExample.c#include <stdio.h>#include <stdlib.h>#include "cublas.h"
int main(void){ float *a_h, *b_h, *c_h; float *a_d, *b_d, *c_d; float alpha = 1.0f, beta = 0.0f; int N = 2048, n2 = N*N; int nBytes, i;
nBytes = n2*sizeof(float);
a_h = (float *)malloc(nBytes); b_h = (float *)malloc(nBytes); c_h = (float *)malloc(nBytes);
for (i=0; i < n2; i++) {
a_h[i] = rand() / (float) RAND_MAX;
b_h[i] = rand() / (float) RAND_MAX; }
cublasInit();
cublasAlloc(n2, sizeof(float), (void **)&a_d); cublasAlloc(n2, sizeof(float), (void **)&b_d); cublasAlloc(n2, sizeof(float), (void **)&c_d);
cublasSetVector(n2, sizeof(float), a_h, 1, a_d, 1); cublasSetVector(n2, sizeof(float), b_h, 1, b_d, 1);
cublasSgemm('n', 'n', N, N, N, alpha, a_d, N, b_d, N, beta, c_d, N);
cublasGetVector(n2, sizeof(float), c_d, 1, c_h, 1);
free(a_h); free(b_h); free(c_h); cublasFree(a_d); cublasFree(b_d); cublasFree(c_d);
cublasShutdown(); return 0;}
© 2008 NVIDIA Corporation.
Additional Resources
CUDA SDK examplesimpleCUBLAS
CUBLAS Library documentationin doc folder of CUDA Toolkit or download from CUDA Zone
© 2008 NVIDIA Corporation.
Outline
CUDA programming modelBasics of CUDA programming
Software stackData managementExecuting code on the GPU
CUDA librariesBLASFFT
© 2008 NVIDIA Corporation.
CUFFT
CUFFT is the CUDA FFT library1D, 2D, and 3D transforms of complex and real single-precision dataBatched execution for multiple 1D transforms in parallel1D transforms up to 8 million elements2D and 3D transforms in the range of [2,16384]In-place and out-of-place
© 2008 NVIDIA Corporation.
More on CUFFT
For 2D and 3D transforms, CUFFT uses row-major orderCUFFT performs un-normalized transforms
IFFT(FFT(A)) = length(A)*ACUFFT modeled after FFTW
Based on plans used to specify optimal configuration for a particular sized FFTOnce a plan is created it can be reused (to avoid recomputing the optimal configuration)
© 2008 NVIDIA Corporation.
CUFFT Types and Definitions
cufftHandleType used to store and access CUFFT plans
cufftResultsEnumeration of API function return values
cufftRealsingle-precision, real datatype
cufftComplexsingle-precision, complex datatype
Real and complex transformsCUFFT_C2C, CUFFT_C2R, CUFFT_R2C
DirectionsCUFFT_FORWARD, CUFFT_INVERSE
© 2008 NVIDIA Corporation.
CUFFT Example#include <stdio.h>#include <math.h>#include "cufft.h"
int main(int argc, char *argv[]) { cufftComplex *a_h, *a_d; cufftHandle plan; int N = 1024, batchSize = 10; int i, nBytes; double maxError;
nBytes = sizeof(cufftComplex)*N*batchSize; a_h = (cufftComplex *)malloc(nBytes);
for (i=0; i < N*batchSize; i++) { a_h[i].x = sinf(i); a_h[i].y = cosf(i); }
cudaMalloc((void **)&a_d, nBytes); cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
cufftPlan1d(&plan, N, CUFFT_C2C, batchSize);
cufftExecC2C(plan, a_d, a_d, CUFFT_FORWARD); cufftExecC2C(plan, a_d, a_d, CUFFT_INVERSE);
cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);
// check error - normalize for (maxError = 0.0, i=0; i < N*batchSize; i++) { maxError = max(fabs(a_h[i].x/N-sinf(i)), maxError); maxError = max(fabs(a_h[i].y/N-cosf(i)), maxError); }
printf("Max fft error = %g\n", maxError);
cufftDestroy(plan); free(a_h); cudaFree(a_d);
return 0;}
© 2008 NVIDIA Corporation.
Additional CUFFT Resources
CUDA SDK examplessimpleCUFFTconvolutionFFT2DoceanFFT
CUFFT Library documentationIn doc folder of CUDA Toolkit or download from CUDA Zone
© 2008 NVIDIA Corporation.
Getting Started with CUDAGreg Ruetsch, Brent Oster
Folie 26
Guido Juckeland
7. Wie weiter?
Folie 27
Bekannte Probleme
2 getrennte (Haupt-) Speicherbereiche für GPU und CPU
Keine Fehlererkennung/-korrektur im GPU-Speichersystem(!!)
Bei vorhandenem Algorithmus dauert eine schnelle Implementierung etwa 3-6 MM
Was, wenn eine GPU nicht reicht??
Guido Juckeland
Folie 28
Fragen & Manöverkritik
Guido Juckeland
Top Related