OpenCL Tutorial - Basics
OpenCL Tutorial
Guillermo Marcus

14:00 Part I: OpenCL Overview, Hello Vector
15:30 Coffee Break
16:00 Part II: Reduction, Matrix Multiply
Overview
About me
Dr. Guillermo [email protected]
PhD from Heidelberg in Computer Science, 2011
Head of the Scientific Computing Research Group until March 2013
NVIDIA (OptiX Group) from May 2013
Taught the ZITI Master Lecture in GPU Computing between 2011-2013
OpenCL Overview
Standardized language to program accelerators
http://www.khronos.org/opencl
C-based: the APIs and the GPU code are C or C-like
Compiles at runtime
Supported by multiple hardware vendors: NVIDIA, AMD, ARM, PowerVR, Altera
While code is portable, optimizations are not!
OpenCL Basics
Application Models
Execution Model
Memory Model
Application Model
Activities are driven by the host computer
Multiple platforms, multiple devices possible
IO is an important part of the model
GPU Kernels
- Starts a computation in the GPU
- "Launches" (starts) a collection of threads
- Requires code to execute AND a specification (how the threads are organized)
- Can be blocking or non-blocking
Execution Model
Work Items
Kernel code
"Serial" execution thread
Private variables

Work Groups
Synchronization inside the group
Data sharing inside the group

Program Grid
Collection of Work Groups
No synchronization
No data sharing
int a[N], b[N], c[N];
int i, tid;

tid = getThreadID();
for (i = tid; i < N; i += 4)
    c[i] = a[i] + b[i];
Work Item
Work Items
A single thread in the GPU
They are normally executed as SIMT
Thread code is the same for all work items
Work items can have private variables
Have a unique ID inside the kernel
int a[N], b[N], c[N];
int i, tid;

tid = getThreadID();
for (i = tid; i < N; i += 4)
    c[i] = a[i] + b[i];
Single Instruction, Multiple Threads
Combines the flexibility of the thread model with the efficiency of the Single Instruction, Multiple Data architecture.
Normally, there are many more threads than workers.
[Diagram: threads distributed across workers 1-4]
int a[N], b[N], c[N];
int tid;

tid = getThreadID();
c[tid] = a[tid] + b[tid];
Work Groups
Work Groups are collections of Work Items
Items inside a Work Group ...
are executed in parallel*
share local data
have a local ID
can be organized as 1D, 2D, 3D* arrays
Work Groups ...
are independent of each other
have a unique ID inside the kernel
Program Grid
Work Groups are organized as a 1D, 2D, 3D array
Between Work Groups there is ...
No communication
No data synchronization
In fact, often there is not even data coherency between work groups!
Memory Model
Hierarchical organization of areas: Host, Global, Local, Registers
Moving data between areas is expensive
Data coherency is not guaranteed at all times or across all areas
Every area has its own constraint set
Controlled by attributes in the code definition
Memory Model Overview
Host Memory
Main Memory of the Host Computer
Can move data only between the Host and the GPU Global Memory
Transfer is always initiated by the Host; can be Synchronous or Asynchronous
Bandwidth is limited by the PCIe links
Global Memory
Main GPU Memory, available to all threads
Biggest in size, up to several GBs
Huge bandwidth, but also huge latency
typically 400-800 cycles
not always cached
Performance is very dependent on access patterns
Local Memory
Available to all threads inside a Work Group
Limited in size (typical: 8KB-64KB)
Latency comparable to registers
Constrained by access rules (i.e. bank conflicts) limiting the performance by access patterns
Used as scratchpad or cache of global memory
GPU Registers
Private to every thread
Normally hidden, no direct access, optimized by the compiler
Fastest access, only constrained in number of available registers
Some platforms may use more registers than others ... it depends on the hardware architecture
Constant Memory
Read only memory
Cached
Good for storing Look Up Tables and non-changeable values
It is normally a small area of the global memory
Private Memory
Unique to every Work Item
Normally it is mapped first to registers, then spilled to global memory when no free registers remain
Kernel Specification
Defines the number and distribution of threads inside the kernel.
A GPU program can be launched with different specifications, creating different kernels.
The distribution is defined as global and local settings, defining the total number of threads, and the number of threads per work group, respectively, as well as their organization.
Global and Local Settings (1D)
// Create kernel specification (ND range)
NDRange global(VECT_SIZE);
NDRange local(1);

// Create kernel specification (ND range)
int groups = VECT_SIZE/64 + ((VECT_SIZE % 64 == 0) ? 0 : 1);
NDRange global(64*groups);
NDRange local(64);
Global and Local Settings (2D)

// Create kernel specification (ND range)
int gX = X_SIZE/4 + ((X_SIZE % 4 == 0) ? 0 : 1);
int gY = Y_SIZE/3 + ((Y_SIZE % 3 == 0) ? 0 : 1);
NDRange global(gX*4, gY*3);
NDRange local(4, 3);
Basic built-in function values