Transcript of "Efficient GPU Programming in Modern C++" (CppCon 2019)
© 2019 Codeplay Software Ltd.

Page 1

Efficient GPU Programming in Modern C++

Gordon Brown, Principal Software Engineer, SYCL & C++

CppCon 2019 – Sep 2019

Page 2

Page 3

This talk is based on the SYCL programming model

Terminology may differ for other programming models

Page 4

Agenda

Why use the GPU?

Brief introduction to SYCL

SYCL programming model

Optimising GPU programs

● Choosing the right algorithm

● Basic GPU programming principles

● Ideas for further optimisations

Page 5

Why use the GPU?

Page 6

“The free lunch is over”

“The end of Moore’s Law”

“The future is parallel”

Page 7

Take a typical Intel chip

Intel Core i7 7th Gen
○ 4x CPU cores
  ■ Each with hyperthreading
  ■ Each with support for 256bit AVX2 instructions
○ Intel Gen 9.5 GPU
  ■ With 1280 processing elements

Page 8

Regular sequential C++ code (non-vectorised) running on a single thread only takes advantage of a very small amount of the available resources of the chip

Page 9

Vectorisation allows you to fully utilise a single CPU core

Page 10

Multi-threading allows you to fully utilise all CPU cores

Page 11

Heterogeneous dispatch allows you to fully utilise the entire chip

Page 12

GPGPU programming was once a niche technology

● Limited to specific domains
● Separate source solutions
● Verbose low-level APIs
● Very steep learning curve

Page 13

This is not the case anymore

● Almost everything has a GPU now
● Single source solutions
● Modern C++ programming models
● More accessible to the average C++ developer

C++AMP, SYCL, CUDA, Agency, Kokkos, HPX, Raja

Page 14

Brief introduction to SYCL

Page 15

Cross-platform, single-source, high-level, C++ programming layer

Built on top of OpenCL and based on standard C++11

Delivering a heterogeneous programming solution for C++

Page 16

[Diagram: the SYCL for OpenCL software stack; applications and C++ template libraries sit on top of SYCL for OpenCL, which runs on OpenCL and OpenCL-enabled devices; a host compiler and a device compiler produce device IR (SPIR, SPIR-V, etc)]

Page 17

CUDA:

__global__ void vec_add(float *a, float *b, float *c) {
  c[i] = a[i] + b[i];
}

float *a, *b, *c;
vec_add<<<range>>>(a, b, c);

Pragma-based parallel for:

vector<float> a, b, c;
#pragma parallel_for
for (int i = 0; i < a.size(); i++) {
  c[i] = a[i] + b[i];
}

SYCL:

cgh.parallel_for<vec_add>(range, [=](cl::sycl::id<2> idx) {
  c[idx] = a[idx] + b[idx];
});

C++AMP:

array_view<float> a, b, c;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});

Page 18

int main(int argc, char *argv[]) {

}

Page 19

#include <CL/sycl.hpp>

using namespace cl::sycl;

int main(int argc, char *argv[]) {

}

The whole SYCL API is included in the CL/sycl.hpp header file

Page 20

#include <CL/sycl.hpp>

using namespace cl::sycl;

int main(int argc, char *argv[]) {

queue gpuQueue{gpu_selector{}};

}

A queue is used to enqueue work to a device such as a GPU

A device selector is a function object which provides a heuristic for selecting a suitable device
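To make the heuristic concrete, here is a minimal sketch (assuming the SYCL 1.2.1 interface; the class name is hypothetical) of a custom device selector. The device scoring highest in operator() is the one the queue will use.

class prefer_gpu_selector : public device_selector {
public:
  int operator()(const device &dev) const override {
    // Prefer a GPU, but fall back to any other available device
    return dev.is_gpu() ? 100 : 1;
  }
};

// Usage: queue gpuQueue{prefer_gpu_selector{}};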

Page 21

#include <CL/sycl.hpp>

using namespace cl::sycl;

int main(int argc, char *argv[]) {

queue gpuQueue{gpu_selector{}};

gpuQueue.submit([&](handler &cgh){

});

}

A command group describes a unit of work to be executed by a device

A command group is created by a function object passed to the submit function of the queue

Page 22

#include <CL/sycl.hpp>

using namespace cl::sycl;

int main(int argc, char *argv[]) {

std::vector<float> dA{ … }, dB{ … }, dO{ … };

queue gpuQueue{gpu_selector{}};

buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));

buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));

buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));

gpuQueue.submit([&](handler &cgh){

});

}

Buffers take ownership of data and manage it across the host and any number of devices

Page 23

#include <CL/sycl.hpp>

using namespace cl::sycl;

int main(int argc, char *argv[]) {

std::vector<float> dA{ … }, dB{ … }, dO{ … };

queue gpuQueue{gpu_selector{}};

{

buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));

buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));

buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));

gpuQueue.submit([&](handler &cgh){

});

}

}

Buffers synchronise on destruction via RAII, waiting for any command groups that need to write back to them

Page 24

#include <CL/sycl.hpp>

using namespace cl::sycl;

int main(int argc, char *argv[]) {

std::vector<float> dA{ … }, dB{ … }, dO{ … };

queue gpuQueue{gpu_selector{}};

{

buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));

buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));

buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));

gpuQueue.submit([&](handler &cgh){

auto inA = bufA.get_access<access::mode::read>(cgh);

auto inB = bufB.get_access<access::mode::read>(cgh);

auto out = bufO.get_access<access::mode::write>(cgh);

});

}

}

Accessors describe the way in which you would like to access a buffer

They are also used to access the data from within a kernel function

Page 25

#include <CL/sycl.hpp>

using namespace cl::sycl;

class add;

int main(int argc, char *argv[]) {

std::vector<float> dA{ … }, dB{ … }, dO{ … };

queue gpuQueue{gpu_selector{}};

{

buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));

buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));

buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));

gpuQueue.submit([&](handler &cgh){

auto inA = bufA.get_access<access::mode::read>(cgh);

auto inB = bufB.get_access<access::mode::read>(cgh);

auto out = bufO.get_access<access::mode::write>(cgh);

cgh.parallel_for<add>(range<1>(dA.size()),

[=](id<1> i){ out[i] = inA[i] + inB[i]; });

});

}

}

Commands such as parallel_for can be used to define kernel functions

The first argument here is a range, specifying the iteration space

The second argument is a function object that represents the entry point for the SYCL kernel

The function object must take an id parameter that describes the current iteration being executed

Page 26

#include <CL/sycl.hpp>

using namespace cl::sycl;

class add;

int main(int argc, char *argv[]) {

std::vector<float> dA{ … }, dB{ … }, dO{ … };

queue gpuQueue{gpu_selector{}};

{

buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));

buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));

buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));

gpuQueue.submit([&](handler &cgh){

auto inA = bufA.get_access<access::mode::read>(cgh);

auto inB = bufB.get_access<access::mode::read>(cgh);

auto out = bufO.get_access<access::mode::write>(cgh);

cgh.parallel_for<add>(range<1>(dA.size()),

[=](id<1> i){ out[i] = inA[i] + inB[i]; });

});

}

}

Kernel functions defined using lambdas have to have a typename to provide them with a name

The reason for this is that C++ does not have a standard ABI for lambdas so they are represented differently across the host and device compiler

Page 27

#include <CL/sycl.hpp>

using namespace cl::sycl;

class add;

int main(int argc, char *argv[]) {

std::vector<float> dA{ … }, dB{ … }, dO{ … };

queue gpuQueue{gpu_selector{}};

{

buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));

buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));

buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));

gpuQueue.submit([&](handler &cgh){

auto inA = bufA.get_access<access::mode::read>(cgh);

auto inB = bufB.get_access<access::mode::read>(cgh);

auto out = bufO.get_access<access::mode::write>(cgh);

cgh.parallel_for<add>(range<1>(dA.size()),

[=](id<1> i){ out[i] = inA[i] + inB[i]; });

});

}

}

The rest of this talk will focus on kernels and how to optimise them

Page 28

SYCL programming model


Page 34

1. A processing element executes a single work-item

2. Each work-item can access private memory, a dedicated memory region for each processing element

3. A compute unit is composed of a number of processing elements and executes one or more work-groups, which are composed of a number of work-items

4. Each work-item can access the local memory of its work-group, a dedicated memory region for each compute unit

5. A device can execute multiple work-groups

6. Each work-item can access global memory, a single memory region available to all processing elements

Page 35

Private memory < Local memory < Global memory

Page 36

GPUs execute a large number of work-items

Page 37

They are not all guaranteed to execute concurrently; however, most GPUs do execute a number of work-items uniformly (in lock-step)

Page 38

The number that are executed concurrently varies between different GPUs

There is no guarantee as to the order in which they execute

Page 39

What are GPUs good at?

➢ Highly parallel
  ○ GPUs can run a very large number of processing elements in parallel

➢ Efficient at floating point operations
  ○ GPUs can achieve very high FLOPs (floating-point operations per second)

➢ Large bandwidth
  ○ GPUs are optimised for throughput and can handle a very large bandwidth of data

Page 40

Optimising GPU programs


Page 42

There are different levels of optimisations you can apply

➢ Choosing the right algorithm
  ➢ This means choosing an algorithm that is well suited to parallelism

➢ Basic GPU programming principles
  ➢ Such as coalescing global memory access or using local memory

➢ Architecture specific optimisations
  ➢ Optimising for register usage or avoiding bank conflicts

➢ Micro-optimisations
  ➢ Such as floating point denorm hacks

This talk will mostly focus on the first two of these

Page 43

Choosing the right algorithm

Page 44

What to parallelise on a GPU

➢ Find hotspots in your code base
  ○ Look for areas of your codebase that are hit often and well suited to parallelism on the GPU

➢ Follow an adaptive optimisation approach such as APOD
  ○ Analyse -> Parallelise -> Optimise -> Deploy

➢ Avoid over-optimisation
  ○ You may reach a point where optimisations provide diminishing returns

Page 45

What to look for in an algorithm

➢ Naturally data parallel
  ○ Performing the same operation on multiple items in the computation

➢ Large problem
  ○ Enough work to utilise the GPU's processing elements

➢ Independent progress
  ○ Little or no dependencies between items in the computation

➢ Non-divergent control flow
  ○ Little or no branch or loop divergence

Page 46

As a motivational example we will be looking at an image convolution

➢ The image convolution algorithm is "embarrassingly parallel"
  ○ Each item in the computation can be calculated entirely independently

➢ The image convolution algorithm is very computation heavy
  ○ A large number of operations have to be calculated for each item in the computation, particularly when using larger filters

➢ Image processing requires a large bandwidth
  ○ A lot of data must be passed through the GPU to process an image, particularly if the image is very high resolution
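Before looking at the GPU implementations, a minimal scalar sketch of the convolution (not from the slides; single channel, ignoring the image borders) makes the data access pattern explicit:

void convolve(const float *in, float *out, const float *filter,
              int width, int height, int halfFilterSize) {
  for (int y = halfFilterSize; y < height - halfFilterSize; ++y) {
    for (int x = halfFilterSize; x < width - halfFilterSize; ++x) {
      float sum = 0.0f;
      int fIndex = 0;
      // Multiply each neighbouring pixel by the corresponding filter element
      for (int r = -halfFilterSize; r <= halfFilterSize; ++r) {
        for (int c = -halfFilterSize; c <= halfFilterSize; ++c) {
          sum += in[(y + r) * width + (x + c)] * filter[fIndex++];
        }
      }
      out[y * width + x] = sum;
    }
  }
}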

Page 47

[Diagram: a 16x16 grid of example pixel values to be convolved with the 3x3 filter below]

1 2 1
2 4 2    x 1/16
1 2 1

Approximate gaussian blur 3x3

Page 48

Page 49

Basic GPU programming principles

Page 50

Optimising GPU programs means maximising throughput

➢ Maximise compute operations

➢ Reduce time spent on memory

Page 51

Optimising GPU programs means maximising throughput

➢ Maximise compute operations per cycle
  ➢ Make effective utilisation of the GPU's hardware

➢ Reduce time spent on memory operations
  ➢ Reduce latency of memory access

Page 52

Avoid divergent control flow

➢ Divergent branches and loops can cause inefficient utilisation

➢ If consecutive work-items execute different branches they must execute separate instructions

➢ If some work-items execute more iterations of a loop than neighbouring work-items, those neighbouring work-items are left doing nothing

Page 53

a[globalId] = 0;

if (globalId < 4) {
  a[globalId] = x();
} else {
  a[globalId] = y();
}


Page 60

for (int i = 0; i < globalId; i++) {
  do_something();
}

[Diagram: neighbouring work-items execute different numbers of loop iterations (x2, x3, x4, x5, x6, x7)]


Page 62

cgh.parallel_for<naive>(cl::sycl::nd_range<2>(globalRange, localRange),
  [=](cl::sycl::nd_item<2> item) {
    int rowOffset = item.get_global_id(0) * WIDTH * NUM_CHANNELS;
    int my = NUM_CHANNELS * item.get_global_id(1) + rowOffset;
    int fIndex = 0;
    float sumR = 0.0f, sumG = 0.0f, sumB = 0.0f, sumA = 0.0f;

    for (int r = -HALF_FILTER_SIZE; r <= HALF_FILTER_SIZE; r++) {
      int curRow = my + r * (WIDTH * NUM_CHANNELS);
      for (int c = -HALF_FILTER_SIZE; c <= HALF_FILTER_SIZE;
           c++, fIndex += NUM_CHANNELS) {
        int offset = c * NUM_CHANNELS;
        sumR += inputAcc[curRow + offset] * filterAcc[fIndex];
        sumG += inputAcc[curRow + offset + 1] * filterAcc[fIndex + 1];
        sumB += inputAcc[curRow + offset + 2] * filterAcc[fIndex + 2];
        sumA += inputAcc[curRow + offset + 3] * filterAcc[fIndex + 3];
      }
    }

    outputAcc[my] = sumR;
    outputAcc[my + 1] = sumG;
    outputAcc[my + 2] = sumB;
    outputAcc[my + 3] = sumA;
});

First we calculate the linear position of the data element within global memory relative to the current work-item

Page 63


Then we loop over each element in the filter, incrementing an offset as we go

Page 64


Then we multiply each data element in global memory with the corresponding element of the filter and add it to a sum, for each channel

Page 65


Finally we write out the sums to global memory again

Page 66

[Chart: kernel time in ms for the Naive kernel, filter sizes 3x3, 5x5, 7x7, 9x9 and 11x11; image convolution, 512x512 source image, Intel HD Graphics 530]

Page 67

Coalesced global memory access

➢ Reading and writing from global memory is very expensive
  ➢ It often means copying across an off-chip bus

➢ Reading and writing from global memory is done in chunks
  ➢ This means accessing data that is physically close together in memory is more efficient


Page 71

float data[size];

...

f(data[globalId]);

100% global access utilisation


Page 73

float data[size];

...

f(data[globalId * 2]);

50% global access utilisation

Page 74

This becomes very important when dealing with multiple dimensions

It's important to ensure that the order in which work-items are executed aligns with the order in which data elements are accessed

This maintains coalesced global memory access

auto id0 = get_global_id(0);
auto id1 = get_global_id(1);
auto linearId = (id1 * 4) + id0;
a[linearId] = f();

Row-major

Page 75

Here data elements are accessed in row-major order and work-items are executed in row-major order

Global memory access is coalesced

[Diagram: work-items 0 to 15 arranged in a 4x4 grid in row-major order, mapping onto data elements 0 to 15 laid out contiguously in memory]

auto id0 = get_global_id(0);
auto id1 = get_global_id(1);
auto linearId = (id1 * 4) + id0;
a[linearId] = f();

Page 76

If the work-items were executed in column-major order

Global memory access is no longer coalesced

[Diagram: work-items numbered in column-major order across the 4x4 grid, while data elements are still accessed in row-major order]

auto id0 = get_global_id(0);
auto id1 = get_global_id(1);
auto linearId = (id1 * 4) + id0;
a[linearId] = f();

Page 77

However if you were to switch the data access pattern to column-major

Global memory access is coalesced again

[Diagram: work-items executed in column-major order, now mapping onto data elements accessed in column-major order]

auto id0 = get_global_id(0);
auto id1 = get_global_id(1);
auto linearId = (id0 * 4) + id1;
a[linearId] = f();

Page 78

cgh.parallel_for<naive>(cl::sycl::nd_range<2>(globalRange, localRange),
  [=](cl::sycl::nd_item<2> item) {
    int rowOffset = item.get_global_id(1) * WIDTH * NUM_CHANNELS;
    int my = NUM_CHANNELS * item.get_global_id(0) + rowOffset;
    int fIndex = 0;
    float sumR = 0.0f, sumG = 0.0f, sumB = 0.0f, sumA = 0.0f;

    for (int r = -HALF_FILTER_SIZE; r <= HALF_FILTER_SIZE; r++) {
      int curRow = my + r * (WIDTH * NUM_CHANNELS);
      for (int c = -HALF_FILTER_SIZE; c <= HALF_FILTER_SIZE;
           c++, fIndex += NUM_CHANNELS) {
        int offset = c * NUM_CHANNELS;
        sumR += inputAcc[curRow + offset] * filterAcc[fIndex];
        sumG += inputAcc[curRow + offset + 1] * filterAcc[fIndex + 1];
        sumB += inputAcc[curRow + offset + 2] * filterAcc[fIndex + 2];
        sumA += inputAcc[curRow + offset + 3] * filterAcc[fIndex + 3];
      }
    }

    outputAcc[my] = sumR;
    outputAcc[my + 1] = sumG;
    outputAcc[my + 2] = sumB;
    outputAcc[my + 3] = sumA;
});

Reversing the global ids will flip the linearization from row-major to column-major

Whether column-major or row-major linearization is more efficient depends on the device you are on

Page 79

[Chart: kernel time as a percentage of the baseline, Naive vs Coalesced, filter sizes 3x3 to 11x11; image convolution, 512x512 source image, Intel HD Graphics 530]

Page 80

Make use of vector operations

➢ GPUs are vector processors
  ➢ Each processing element is capable of wide instructions which can operate on multiple elements of data at once

➢ Many compilers can auto-vectorise
  ➢ This can affect the amount of performance gain you may see in vectorising your kernels

Page 81

float rS, gS, bS, aS;
float r1, g1, b1, a1;
float r2, g2, b2, a2;
...
rS = r1 + r2;   // 32bit FP add
gS = g1 + g2;   // 32bit FP add
bS = b1 + b2;   // 32bit FP add
aS = a1 + a2;   // 32bit FP add

Page 82

cl::sycl::float4 vS;
cl::sycl::float4 v1;
cl::sycl::float4 v2;
...
vS = v1 + v2;   // 128bit FP vector add

Page 83

cgh.parallel_for<naive>(cl::sycl::nd_range<2>(globalRange, localRange),
  [=](cl::sycl::nd_item<2> item) {
    int rowOffset = item.get_global_id(1) * WIDTH;
    int my = item.get_global_id(0) + rowOffset;
    int fIndex = 0;
    cl::sycl::float4 sum = cl::sycl::float4{0.0f};

    for (int r = -HALF_FILTER_SIZE; r <= HALF_FILTER_SIZE; r++) {
      int curRow = my + r * WIDTH;
      for (int c = -HALF_FILTER_SIZE; c <= HALF_FILTER_SIZE; c++) {
        sum += inputAcc[curRow + c] * filterAcc[fIndex];
        fIndex++;
      }
    }

    outputAcc[my] = sum;
});

To vectorise the kernel define all accessors in terms of SYCL vector types

This allows us to remove the calculations to factor in the number of channels

This also allows us to reduce the multiplications and assignments to single vector operators

Page 84

[Chart: kernel time as a percentage of the baseline, Coalesced vs Vectorised, filter sizes 3x3 to 11x11; image convolution, 512x512 source image, Intel HD Graphics 530]

Page 85

Make use of local memory

➢ Local memory is much lower latency to access than global memory

➢ Cache commonly accessed data and temporary results in local memory rather than reading and writing to global memory

➢ Using local memory is not necessarily always more efficient

➢ If data is not accessed frequently enough to warrant the copy to local memory you may not see a performance gain
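For reference, a minimal sketch (assuming the SYCL 1.2.1 interface; TILE_WIDTH and TILE_HEIGHT are hypothetical constants) of how a per-work-group scratchpad in local memory can be requested inside the command group, alongside the other accessors:

// One tile of float4 pixels per work-group, allocated in local memory
accessor<cl::sycl::float4, 1, access::mode::read_write, access::target::local>
    scratchpad(range<1>(TILE_WIDTH * TILE_HEIGHT), cgh);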

Page 86

[Diagram: the 16x16 grid of example pixel values, showing the neighbourhood read by one work-item]

Each item in the computation needs to read neighbouring elements

This means each element of data is read multiple times

• 3x3 filter: up to 9 ops
• 5x5 filter: up to 25 ops
• And so on…

If each of these operations loads from global memory this can be very expensive

1 2 1
2 4 2    x 1/16
1 2 1

Page 87

[Diagram: the 16x16 grid of example pixel values, with an 8x8 tile highlighted]

A common technique for using local memory is to break up your input into tiles

Then each tile can be moved to local memory while the work-group is working on it

Page 88

Synchronise work-groups when necessary

➢ Synchronising with a work-group barrier waits for all work-items to reach the same point

➢ Use a work-group barrier if you are copying data to local memory that neighbouring work-items will need to access

➢ Use a work-group barrier if you have temporary results that will be shared with other work-items

Page 89

Remember that work-items are not all guaranteed to execute concurrently

Page 90

A work-item can share results with other work-items via local and global memory

Page 91

This means that it's possible for a work-item to read a result that hasn't yet been written: you have a data race

Page 92

This problem can be solved by a synchronisation primitive called a work-group barrier

Page 93

Work-items will block until all work-items in the work-group have reached that point


Page 95

So now you can be sure that all of the results that you want to read from have been written to

Page 96

However this does not apply across work-group boundaries, and you have a data race again

[Diagram: work-group 0 and work-group 1, with a data race across the work-group boundary]

Page 97

cgh.parallel_for<naive>(cl::sycl::nd_range<2>(globalRange, localRange),
  [=](cl::sycl::nd_item<2> item) {
    int globalRowOffset = item.get_global_id(1) * WIDTH;
    int global = item.get_global_id(0) + globalRowOffset;
    int localRowOffset = item.get_local_id(1) * WIDTH;
    int local = item.get_local_id(0) + localRowOffset;
    int fIndex = 0;
    cl::sycl::float4 sum = cl::sycl::float4{0.0f};

    copy_tile(scratchpad, inputAcc, local, global);

    item.barrier(cl::sycl::access::fence_space::local_space);

    for (int r = -HALF_FILTER_SIZE; r <= HALF_FILTER_SIZE; r++) {
      int curRow = local + r * WIDTH;
      for (int c = -HALF_FILTER_SIZE; c <= HALF_FILTER_SIZE; c++) {
        sum += scratchpad[curRow + c] * filterAcc[fIndex];
        fIndex++;
      }
    }

    outputAcc[global] = sum;
});

To use local memory we need to also calculate the linear position in the current work-group

We can then use this to copy a tile from global memory into the local memory of the current work-group

Now the multiply operators within the loop are reading from local memory
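The copy_tile helper used above is not shown on the slide. A minimal sketch of one possible implementation (hypothetical, and ignoring the halo of border elements a real convolution tile would also need) is simply each work-item copying its own element:

template <typename LocalAcc, typename GlobalAcc>
void copy_tile(LocalAcc scratchpad, GlobalAcc inputAcc, int local, int global) {
  // Each work-item copies the element it is responsible for into the tile
  scratchpad[local] = inputAcc[global];
}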

Page 98

cgh.parallel_for<naive>(cl::sycl::nd_range<2>(globalRange, localRange),
  [=](cl::sycl::nd_item<2> item) {
    int globalRowOffset = item.get_global_id(1) * WIDTH;
    int global = item.get_global_id(0) + globalRowOffset;
    int localRowOffset = item.get_local_id(1) * WIDTH;
    int local = item.get_local_id(0) + localRowOffset;
    int fIndex = 0;
    cl::sycl::float4 sum = cl::sycl::float4{0.0f};

    copy_tile(scratchpad, inputAcc, local, global);

    item.barrier(cl::sycl::access::fence_space::global_and_local);

    for (int r = -HALF_FILTER_SIZE; r <= HALF_FILTER_SIZE; r++) {
      int curRow = local + r * WIDTH;
      for (int c = -HALF_FILTER_SIZE; c <= HALF_FILTER_SIZE; c++) {
        sum += scratchpad[curRow + c] * filterAcc[fIndex];
        fIndex++;
      }
    }

    outputAcc[global] = sum;
});

Since we're moving a tile into local memory and then performing operations on it there, we need a barrier to ensure all elements of the tile are copied

Page 99

[Chart: kernel time as a percentage of the baseline, Vectorised vs Local mem, filter sizes 3x3 to 11x11; image convolution, 512x512 source image, Intel HD Graphics 530]

Page 100

Choosing a good work-group size

➢ The occupancy of a kernel can be limited by a number of factors of the GPU

➢ Total number of processing elements

➢ Total number of compute units

➢ Total registers available to the kernel

➢ Total local memory available to the kernel

➢ You can query the preferred work-group size once the kernel is compiled

➢ However this is not guaranteed to give you the best performance

➢ It’s good practice to benchmark various work-group sizes and choose the best
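A minimal sketch (not from the slides) of querying the limits that typically bound this choice, before benchmarking candidate sizes such as 8x8 and 16x16:

device dev = gpuQueue.get_device();
auto maxWorkGroupSize = dev.get_info<info::device::max_work_group_size>();
auto computeUnits     = dev.get_info<info::device::max_compute_units>();
auto localMemSize     = dev.get_info<info::device::local_mem_size>();
// Pick candidate work-group sizes that fit within these limits (and within the
// kernel's own register and local memory usage), then benchmark each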

Page 101

[Chart: kernel time as a percentage of the baseline, Vectorised vs Local mem (8x8) vs Local mem (16x16), filter sizes 3x3 to 11x11; image convolution, 512x512 source image, Intel HD Graphics 530]

Page 102

Ideas for further optimisations

Page 103

Use constant memory

➢ Some GPUs provide a region of global memory that is read-only

➢ This can be faster to access as it doesn’t require caching

Private memory < Local memory < Global memory, with constant memory as a separate read-only region of global memory
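In SYCL this can be requested through an accessor target; a minimal sketch (assuming the SYCL 1.2.1 interface, with filterBuf being the buffer holding the filter weights) is:

// Read-only access to the filter, placed in constant memory where supported
auto filterAcc =
    filterBuf.get_access<access::mode::read, access::target::constant_buffer>(cgh);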

Page 104

Use texture memory

➢ Most GPUs have texture memory

➢ This can be faster to access for data that is represented as pixels

➢ This also provides sampling operations

Private memory < Local memory < Global memory, with texture memory as a separate region for pixel data

Page 105

Batch work together

➢ Hitting occupancy limitations of a GPU can lead to drops in performance gain

➢ This is because single work-items are having to do more chunks of work

➢ Batching work for each work-item allows reusing cached data

➢ Batching work that share neighbouring data allows you to further share local memory and registers
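A minimal sketch (not from the slides; BATCH, FILTER_SIZE, numOutputs and the accessor names are assumptions) of batching, where each work-item produces several consecutive outputs of a 1D convolution so that the filter values it loads are reused:

constexpr int BATCH = 4;
cgh.parallel_for<class batched>(range<1>(numOutputs / BATCH), [=](id<1> i) {
  int base = i[0] * BATCH;
  for (int b = 0; b < BATCH; ++b) {
    float sum = 0.0f;
    for (int f = 0; f < FILTER_SIZE; ++f) {
      sum += inputAcc[base + b + f] * filterAcc[f];
    }
    outputAcc[base + b] = sum;
  }
});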

Page 106

Use double buffering

➢ If you hit occupancy limitations you will have more tiles than can be computed at once

➢ This means each work-group will compute more than one tile

[Diagram: local memory holding tiles {0,0}, {1,0}, {0,1} and {1,1}]


Page 110

[Diagram: a timeline of Copy{0,0}, Compute{0,0}, Copy{1,0}, Compute{1,0}, Copy{0,1}, Compute{0,1}, Copy{1,1}, Compute{1,1}, with each copy overlapping the previous tile's compute]

Overlapping copy and compute within kernels allows for better utilisation of GPU processing elements and therefore better throughput

Page 111

Loop unrolling

cgh.parallel_for<naive>(cl::sycl::nd_range<2>(globalRange, localRange),
  [=](cl::sycl::nd_item<2> item) {
    int rowOffset = item.get_global_id(1) * WIDTH;
    int my = item.get_global_id(0) + rowOffset;
    cl::sycl::float4 sum = cl::sycl::float4{0.0f};

    sum += inputAcc[(my - 1 * WIDTH) - 1] * filterAcc[0];
    sum += inputAcc[(my - 1 * WIDTH)] * filterAcc[1];
    sum += inputAcc[(my - 1 * WIDTH) + 1] * filterAcc[2];
    sum += inputAcc[my - 1] * filterAcc[3];
    sum += inputAcc[my] * filterAcc[4];
    sum += inputAcc[my + 1] * filterAcc[5];
    sum += inputAcc[(my + 1 * WIDTH) - 1] * filterAcc[6];
    sum += inputAcc[(my + 1 * WIDTH)] * filterAcc[7];
    sum += inputAcc[(my + 1 * WIDTH) + 1] * filterAcc[8];

    outputAcc[my] = sum;
});

➢ Here we unroll the loop over the filter

➢ This allows the compiler more freedom in how it vectorises and allocates registers

➢ However this does make the code more obfuscated and less flexible

Page 112

Further tips

➢ Use profiling tools to gather more accurate information about your programs

➢ SYCL provides kernel profiling

➢ Most OpenCL implementations provide proprietary profiler tools

➢ Follow vendor optimisation guides

➢ Most OpenCL vendors provide optimisation guides that detail recommendations on how to optimise programs for their respective GPU
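A minimal sketch (assuming the SYCL 1.2.1 interface) of SYCL kernel profiling: create the queue with profiling enabled, keep the event returned by submit, and read the start and end timestamps:

queue gpuQueue{gpu_selector{},
               property_list{property::queue::enable_profiling{}}};

event e = gpuQueue.submit([&](handler &cgh) {
  // ... accessors and parallel_for as before ...
});
e.wait();

auto start = e.get_profiling_info<info::event_profiling::command_start>();
auto end   = e.get_profiling_info<info::event_profiling::command_end>();
// Timestamps are in nanoseconds
auto kernelTimeMs = (end - start) * 1e-6;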

Page 113

Takeaways

➢ Identify which parts of your code to offload and which algorithms to use

➢ Look for hotspots in your code that are bottlenecks

➢ Identify opportunity for parallelism

➢ Optimising GPU programs means maximising throughput

➢ Maximize compute operations

➢ Minimise time spent on memory operations

➢ Use profilers to analyse your GPU programs and consult optimisation guides

Page 114

/codeplaysoft   @codeplaysoft   codeplay.com

Thank you for listening