OpenCL Tutorial - Basics

Post on 08-Nov-2014


OpenCL Tutorial
Guillermo Marcus

14:00 Part I: OpenCL Overview, Hello Vector

15:30 Coffee Break

16:00 Part II: Reduction, Matrix Multiply

Overview

About me

Dr. Guillermo Marcus (guillermo.marcus@gmail.com)

PhD in Computer Science from Heidelberg, 2011
Head of the Scientific Computing Research Group until March 2013
NVIDIA (OptiX Group) from May 2013

Taught the ZITI Master Lecture on GPU Computing, 2011-2013

OpenCL Overview

Standardized language to program accelerators: http://www.khronos.org/opencl

C-based: the host APIs are C, and GPU code is C or C-like. Kernels are compiled at runtime.

Supported by multiple hardware vendors: NVIDIA, AMD, ARM, PowerVR, Altera

While code is portable, optimizations are not!

OpenCL Basics

Application Models

Execution Model

Memory Model

Application Model

Activities are driven by the host computer

Multiple platforms, multiple devices possible

I/O is an important part of the model

GPU Kernels

- Starts a computation on the GPU
- "Launches" (starts) a collection of threads
- Requires code to execute AND a specification (how the threads are organized)
- Can be blocking or non-blocking
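The "code + specification" pairing above can be sketched on the CPU. This is a hypothetical illustration, not the OpenCL API: `LaunchSpec` and `launchKernel` are invented names, and the serial loop stands in for the GPU's parallel execution.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Sketch (not OpenCL): a "launch" pairs kernel code with a
// specification of how many work items to run.
struct LaunchSpec {
    std::size_t globalSize;  // total number of work items
};

// Run the kernel body once per work item, serially on the CPU;
// each invocation plays the role of one thread.
void launchKernel(const LaunchSpec& spec,
                  const std::function<void(std::size_t)>& kernel) {
    for (std::size_t tid = 0; tid < spec.globalSize; ++tid)
        kernel(tid);
}
```

Because the loop runs to completion before returning, this sketch is "blocking" by construction; a real launch may return immediately and require an explicit wait.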

Execution Model

Work Items
- Kernel code
- "Serial" execution thread
- Private variables

Work Groups
- Synchronization inside the group
- Data sharing inside the group

Program Grid
- Collection of Work Groups
- No synchronization
- No data sharing

int a[N], b[N], c[N];
int i, tid;

tid = getThreadID();
for (i = tid; i < N; i += 4)
    c[i] = a[i] + b[i];
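A CPU sketch of the strided loop above shows why it covers every element: with 4 work items, each starts at its own ID and steps by the number of work items, so the items partition the array between them. The function name is illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation (not OpenCL itself) of the strided work-item loop:
// work item `tid` handles elements tid, tid+W, tid+2W, ...
std::vector<int> stridedAdd(const std::vector<int>& a,
                            const std::vector<int>& b,
                            std::size_t numWorkItems) {
    std::vector<int> c(a.size(), 0);
    for (std::size_t tid = 0; tid < numWorkItems; ++tid)            // each work item...
        for (std::size_t i = tid; i < a.size(); i += numWorkItems)  // ...strides over the array
            c[i] = a[i] + b[i];
    return c;
}
```

Note the stride must equal the number of work items; a mismatch would skip or duplicate elements.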

Work Item

Work Items

A single thread on the GPU. They are normally executed as SIMT (Single Instruction, Multiple Threads).

Thread code is the same for all work items
Work items can have private variables
Each work item has a unique ID inside the kernel

int a[N], b[N], c[N];
int i, tid;

tid = getThreadID();
for (i = tid; i < N; i += 4)
    c[i] = a[i] + b[i];

Single Instruction, Multiple Threads

Combines the flexibility of the thread model with the efficiency of the Single Instruction, Multiple Data architecture.

Normally, there are many more threads than workers.

(Diagram: the threads distributed across workers 1-4)

int a[N], b[N], c[N];
int tid;

tid = getThreadID();
c[tid] = a[tid] + b[tid];
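When there are more threads than workers, the hardware runs them in rounds. This is a simplified sketch of that scheduling idea (real hardware groups threads into warps/wavefronts and may interleave them differently); the function name is invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: W workers execute N threads in rounds; in each round the
// workers run the same instruction stream for W consecutive threads.
std::vector<std::size_t> roundOfThread(std::size_t numThreads,
                                       std::size_t numWorkers) {
    std::vector<std::size_t> round(numThreads);
    for (std::size_t tid = 0; tid < numThreads; ++tid)
        round[tid] = tid / numWorkers;  // threads 0..W-1 in round 0, W..2W-1 in round 1, ...
    return round;
}
```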

Work Groups

Work Groups are collections of Work Items.

Items inside a Work Group ...
- are executed in parallel*
- share local data
- have a local ID
- can be organized as 1D, 2D, 3D* arrays

Work Groups ...
- are independent of each other
- have a unique ID inside the kernel

Program Grid

Work Groups are organized as a 1D, 2D, 3D array

Between Work Groups there is ...
- No communication
- No data synchronization

In fact, often there is not even data coherency between work groups!

Memory Model

Hierarchical organization of areas: Host, Global, Local, Registers

Moving data between areas is expensive

Data coherency is not guaranteed at all times or across all areas

Every area has its own constraint set

Controlled by attributes in the code definition

Memory Model Overview

Host Memory

Main Memory of the Host Computer

Can move data only between the host and the GPU Global Memory

Transfers are always initiated by the Host, and can be Synchronous or Asynchronous

Bandwidth is limited by the PCIe links

Global Memory

Main GPU Memory available to all threads
Biggest in size, up to several GBs

Huge bandwidth, but also huge latency
- typically 400-800 cycles
- not always cached

Performance is highly dependent on access patterns

Local Memory

Available to all threads inside a Work Group
Limited in size (typically 8KB-64KB)

Latency comparable to registers

Constrained by access rules (i.e. bank conflicts) limiting the performance by access patterns

Used as scratchpad or cache of global memory
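One common bank-conflict workaround is padding: when a 2D tile in local memory is read by columns, adding one extra element per row makes successive column accesses land in different banks. A sketch of the index arithmetic, assuming a hypothetical 16-bank layout where bank = address mod bank count:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t TILE = 16;       // tile width (illustrative)
constexpr std::size_t NUM_BANKS = 16;  // assumed bank count for illustration

// Bank hit by element (row, col) of a row-major tile, without padding:
// column accesses with row stride 16 all map to the same bank.
std::size_t bankUnpadded(std::size_t row, std::size_t col) {
    return (row * TILE + col) % NUM_BANKS;
}

// With one element of padding per row (stride 17), consecutive rows of
// the same column map to different banks.
std::size_t bankPadded(std::size_t row, std::size_t col) {
    return (row * (TILE + 1) + col) % NUM_BANKS;
}
```

The padded tile wastes one element per row but turns a serialized column access into a conflict-free one on hardware matching these assumptions.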

GPU Registers

Private to every thread

Normally hidden, no direct access, optimized by the compiler

Fastest access, only constrained in number of available registers

Some platforms may use more registers than others ... it depends on the hardware architecture

Constant Memory

Read only memory

Cached

Good for storing Look Up Tables and non-changeable values

It is normally a small area of the global memory

Private Memory

Unique to every Work Item

Normally it is mapped first to registers, then spilled to global memory when no free registers remain

Kernel Specification

Defines the number and distribution of threads inside the kernel.

A GPU program can be launched with different specifications, creating different kernels.

The distribution is defined as global and local settings, defining the total number of threads, and the number of threads per work group, respectively, as well as their organization.

Global and Local Settings (1D)

// Create kernel specification (ND range)
NDRange global(VECT_SIZE);
NDRange local(1);

// Create kernel specification (ND range)
int groups = VECT_SIZE/64 + ((VECT_SIZE % 64 == 0) ? 0 : 1);
NDRange global(64*groups);
NDRange local(64);
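The group count above rounds up so the global size always covers the whole vector. A compact equivalent (for positive sizes) uses ceiling division; the function names here are illustrative:

```cpp
#include <cassert>

// Ceiling division: smallest number of groups of `localSize` work items
// whose total is at least `vectSize`.
int roundUpGroups(int vectSize, int localSize) {
    return (vectSize + localSize - 1) / localSize;
}

// Resulting global size: always a multiple of localSize, >= vectSize.
int globalSize(int vectSize, int localSize) {
    return roundUpGroups(vectSize, localSize) * localSize;
}
```

Because the global size can exceed VECT_SIZE, the kernel itself must guard against out-of-range IDs (e.g. `if (tid < VECT_SIZE)`).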

Global and Local Settings (2D)

// Create kernel specification (ND range)
int gX = X_SIZE/4 + ((X_SIZE % 4 == 0) ? 0 : 1);
int gY = Y_SIZE/3 + ((Y_SIZE % 3 == 0) ? 0 : 1);
NDRange global(gX*4, gY*3);
NDRange local(4,3);
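The 2D case applies the same rounding independently per dimension: each axis is padded up to a multiple of its work-group extent (4x3 in the slide's example). A sketch, with an illustrative function name:

```cpp
#include <cassert>
#include <utility>

// Pad each dimension up to a multiple of the work-group extent.
std::pair<int, int> globalSize2D(int xSize, int ySize,
                                 int localX, int localY) {
    int gX = (xSize + localX - 1) / localX;  // groups along X
    int gY = (ySize + localY - 1) / localY;  // groups along Y
    return {gX * localX, gY * localY};
}
```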

Basic built-in function values