CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES · 2017. 5. 18. · CUDA OPTIMIZATION TIPS, TRICKS...

Stephen Jones, GTC 2017 CUDA OPTIMIZATION TIPS, TRICKS AND TECHNIQUES


The art of doing more with less

RULE #1: DON’T TRY TOO HARD

[Chart: Performance (vertical axis) against Time (horizontal axis), with a Peak Performance line]

Unrealistic Effort/Reward

Reduce this time

Don’t waste this time

Get on this curve

Trough of despair

Point of diminishing returns

Premature excitement

Wait, it’s going slower??

Hire an intern

Here be ninjas

Most people give up here

4 weeks and this is it?

PERFORMANCE CONSTRAINTS

Memory: 75%

Occupancy: 10%

Instruction: 2%

Divergence: 3%

Compute Intensity: 10%


PERFORMANCE CONSTRAINTS

CPU <> GPU Transfer

Coalescence

Cache Inefficiency

Register Spilling

Divergent Access

Occupancy

Instruction

Divergence

Compute Intensity


MEMORY ORDERS OF MAGNITUDE

Link                              Bandwidth
CPU <> DRAM                       150GB/sec
CPU <> GPU over the PCIe bus      16GB/sec
GDRAM <> L2 Cache                 300GB/sec
L2 Cache <> L1$                   2,000GB/sec
L1$ <> SM (regs, shmem)           20,000GB/sec


TALK BREAKDOWN

1. Why Didn’t I Think Of That?

2. CPU Memory to GPU Memory (the PCIe Bus)

3. GPU Memory to the SM

4. Registers & Shared Memory

5. Occupancy, Divergence & Latency

6. Weird Things You Never Thought Of (and probably shouldn’t try)

In no particular order


WHERE TO BEGIN?


THE OBVIOUS

Start with the Visual Profiler

NVIDIA Visual Profiler


CPU <> GPU DATA MOVEMENT

PCI ISSUES

Moving data over the PCIe bus

[Diagram: CPU memory connected to the GPU (SMs with regs and shmem) by the PCIe bus at 16GB/sec]

PIN YOUR CPU MEMORY

[Diagram sequence: Data in CPU Memory is copied to GPU Memory by the DMA Controller. The page holding the Data can be swapped out, so the DMA Controller works from a Pinned Copy of Data instead.]

CPU allocates & pins page then copies locally before DMA

[Diagram: with User Pinned Data, the DMA Controller reads the Data directly from CPU Memory]

cudaHostAlloc( &data, size, cudaHostAllocMapped );          // allocate new pinned memory

cudaHostRegister( data, size, cudaHostRegisterDefault );    // or pin memory you already own
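The two pinning routes can be sketched as follows. This is a minimal illustration rather than code from the talk: `roundTrip` and `pinnedMemoryDemo` are hypothetical names, the buffer size is arbitrary, `cudaHostAllocDefault` is used where the slide shows `cudaHostAllocMapped`, and error checking is mostly omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: sends `count` ints to the GPU and back, then
// reports whether they returned unchanged.
static bool roundTrip(int *host, size_t count) {
    size_t bytes = count * sizeof(int);
    int *dev = nullptr;
    cudaMalloc(&dev, bytes);
    for (size_t i = 0; i < count; i++) host[i] = (int)i;
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    memset(host, 0, bytes);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    for (size_t i = 0; i < count; i++)
        if (host[i] != (int)i) return false;
    return true;
}

bool pinnedMemoryDemo() {
    const size_t count = 1 << 20, bytes = count * sizeof(int);

    // Route 1: let CUDA allocate page-locked memory.
    int *pinned = nullptr;
    if (cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault) != cudaSuccess)
        return false;
    bool ok = roundTrip(pinned, count);
    cudaFreeHost(pinned);

    // Route 2: pin memory the application already owns.
    int *owned = (int *)malloc(bytes);
    if (cudaHostRegister(owned, bytes, cudaHostRegisterDefault) != cudaSuccess) {
        free(owned);
        return false;
    }
    ok = roundTrip(owned, count) && ok;
    cudaHostUnregister(owned);
    free(owned);
    return ok;
}
```

Calling `pinnedMemoryDemo()` from `main` should return `true` on a working CUDA device. Note that `cudaHostRegister` takes the buffer pointer itself, while `cudaHostAlloc` takes the address of the pointer.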



REMEMBER: PCIe GOES BOTH WAYS

STREAMS & CONCURRENCY

Hiding the cost of data transfer

Operations in a single stream are ordered

But hardware can copy and compute at the same time

[Timeline, single stream: Copy data to GPU, Compute, Copy data to Host, one after another]

[Timeline, Stream 1 and Stream 2: each stream runs its own Copy up, Work, Copy back; the streams overlap, giving Saved Time against the single stream]

[Timeline: 1 stream, 2 streams, 8 streams]

Can keep on breaking work into smaller chunks and saving time
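A minimal sketch of chunked copy/compute overlap, under stated assumptions: the kernel `addOne`, the driver `streamsDemo`, the choice of four streams and the chunk size are all illustrative and not taken from the talk. The host buffer is pinned because asynchronous copies need page-locked memory to overlap with kernels.

```cuda
#include <cuda_runtime.h>

// Illustrative workload: increments every element of its chunk.
__global__ void addOne(int *data, int n) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < n) data[id] += 1;
}

bool streamsDemo() {
    const int nStreams = 4, chunk = 1 << 18, total = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(int);

    int *host = nullptr, *dev = nullptr;
    cudaHostAlloc((void **)&host, total * sizeof(int), cudaHostAllocDefault);
    cudaMalloc(&dev, total * sizeof(int));
    for (int i = 0; i < total; i++) host[i] = i;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; s++) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk up, works on it, and copies it back.
    // Work within a stream stays ordered; different streams may overlap.
    for (int s = 0; s < nStreams; s++) {
        int off = s * chunk;
        cudaMemcpyAsync(dev + off, host + off, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        addOne<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dev + off, chunk);
        cudaMemcpyAsync(host + off, dev + off, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    bool ok = true;
    for (int i = 0; i < total; i++)
        if (host[i] != i + 1) ok = false;

    for (int s = 0; s < nStreams; s++) cudaStreamDestroy(streams[s]);
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```

Whether more streams keep helping depends on the transfer size per chunk, which is the question the next slide raises; a profiler timeline is the way to confirm the overlap on a given system.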

SMALL PCIe TRANSFERS

PCIe is designed for large data transfers

But fine-grained copy/compute overlap prefers small transfers

So how small can we go?

[Timeline: 1, 2, 8, too many streams]


APPARENTLY NOT THAT SMALL


FROM GPU MEMORY TO GPU THREADS

FEEDING THE MACHINE

From GPU Memory to the SMs

[Diagram: memory hierarchy from the PCIe bus to the SMs, each with regs and shmem]

USE THE PARALLEL ARCHITECTURE

Hardware is optimized to use all SIMT threads at once

Threads run in groups of 32

Cache is sized to service sets of 32 requests at a time

High-speed GPU memory works best with linear access

[Diagram: L2 Cache Line]

VECTORIZE MEMORY LOADS

Multi-Word as well as Multi-Thread

int: T0-T31

int2: T0-T15, T16-T31

int4: T0-T7, T8-T15, T16-T23, T24-T31

Fill multiple cache lines in a single fetch
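As a hedged illustration of a multi-word load, the sketch below copies an `int` array through `int4` accesses, so each thread moves 16 bytes per load. The names `copyInt4` and `vectorCopyDemo` are hypothetical. It assumes the element count is a multiple of 4 and relies on `cudaMalloc` returning memory aligned well beyond the 16 bytes that `int4` needs.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread copies one int4, i.e. four ints in a single load and store.
__global__ void copyInt4(const int4 *input, int4 *output, int count4) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < count4) output[id] = input[id];
}

bool vectorCopyDemo() {
    const int n = 1 << 20;          // number of ints; assumed a multiple of 4
    const int count4 = n / 4;       // number of int4 elements
    const size_t bytes = n * sizeof(int);

    std::vector<int> host(n), result(n, 0);
    for (int i = 0; i < n; i++) host[i] = i;

    int *in = nullptr, *out = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemcpy(in, host.data(), bytes, cudaMemcpyHostToDevice);

    copyInt4<<<(count4 + 255) / 256, 256>>>(
        reinterpret_cast<const int4 *>(in),
        reinterpret_cast<int4 *>(out), count4);

    cudaMemcpy(result.data(), out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);
    return result == host;
}
```

A pointer that is offset into a buffer by a number of ints that is not a multiple of 4 would break the alignment assumption, so that case needs a scalar fallback.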

DO MULTIPLE LOADS PER THREAD

Multi-Thread, Multi-Word AND Multi-Iteration

One copy per thread: maximum overhead

__global__ void copy(int2 *input, int2 *output, int max) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if( id < max ) {
        output[id] = input[id];
    }
}

Multiple copies per thread: amortize overhead

__global__ void copy(int2 *input, int2 *output, int max, int loadsPerThread) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    for(int n=0; n<loadsPerThread; n++) {
        if( id >= max ) {
            break;
        }
        output[id] = input[id];
        id += blockDim.x * gridDim.x;    // stride by the total number of threads in the grid
    }
}


“MAXIMAL” LAUNCHES ARE BEST
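The slide's chart did not survive extraction, so the sketch below rests on an assumption: that a "maximal" launch means a grid sized to fill every SM exactly once, with a loop inside the kernel covering the rest of the data. The names `copyStride`, `maximalGridSize` and `maximalLaunchDemo` are hypothetical; `cudaOccupancyMaxActiveBlocksPerMultiprocessor` is the CUDA runtime's occupancy query.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride copy: correct for any grid size, so the grid can be
// chosen to match the hardware rather than the data.
__global__ void copyStride(const int *input, int *output, int max) {
    for (int id = threadIdx.x + blockDim.x * blockIdx.x; id < max;
         id += blockDim.x * gridDim.x)
        output[id] = input[id];
}

// Number of blocks that can be resident on the whole device at once.
int maximalGridSize(int blockSize) {
    int device = 0, blocksPerSM = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, copyStride,
                                                  blockSize, 0);
    return blocksPerSM * prop.multiProcessorCount;
}

bool maximalLaunchDemo() {
    const int n = 1 << 22, blockSize = 256;
    const size_t bytes = n * sizeof(int);
    int grid = maximalGridSize(blockSize);
    if (grid <= 0) return false;

    std::vector<int> host(n), result(n, 0);
    for (int i = 0; i < n; i++) host[i] = i;

    int *in = nullptr, *out = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemcpy(in, host.data(), bytes, cudaMemcpyHostToDevice);
    copyStride<<<grid, blockSize>>>(in, out, n);
    cudaMemcpy(result.data(), out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);
    return result == host;
}
```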

COALESCED MEMORY ACCESS

It’s not just good enough to use all SIMT threads

Coalesced: Sequential memory accesses are adjacent

Uncoalesced: Sequential memory accesses are unassociated

SIMT PENALTIES WHEN NOT COALESCED

x = data[threadIdx.x]     Single 32-wide operation

x = data[rand()]          32 one-wide operations
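The two access patterns can be written as a pair of kernels for side-by-side profiling. This is a sketch with hypothetical names (`readCoalesced`, `readScattered`, `accessPatternDemo`); because `rand()` is not callable in device code, the random access is modelled with a precomputed index array holding a permutation.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Adjacent threads read adjacent elements: one wide access per warp.
__global__ void readCoalesced(const int *data, int *out, int n) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < n) out[id] = data[id];
}

// Adjacent threads read unrelated elements: up to 32 accesses per warp.
__global__ void readScattered(const int *data, const int *index, int *out, int n) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < n) out[id] = data[index[id]];
}

bool accessPatternDemo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    std::vector<int> data(n), index(n), a(n), b(n);
    for (int i = 0; i < n; i++) {
        data[i] = i;
        // Multiplying by an odd number modulo a power of two is a permutation.
        index[i] = (int)(((long long)i * 7919) % n);
    }

    int *dData = nullptr, *dIndex = nullptr, *dOut = nullptr;
    cudaMalloc(&dData, bytes);
    cudaMalloc(&dIndex, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dData, data.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dIndex, index.data(), bytes, cudaMemcpyHostToDevice);

    readCoalesced<<<(n + 255) / 256, 256>>>(dData, dOut, n);
    cudaMemcpy(a.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    readScattered<<<(n + 255) / 256, 256>>>(dData, dIndex, dOut, n);
    cudaMemcpy(b.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dData);
    cudaFree(dIndex);
    cudaFree(dOut);
    for (int i = 0; i < n; i++)
        if (a[i] != data[i] || b[i] != data[index[i]]) return false;
    return true;
}
```

Both kernels produce correct results; the difference only shows up in their run time and memory-transaction counts under a profiler.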

SCATTER & GATHER

Gathering: Reading randomly, writing sequentially

Scattering: Reading sequentially, writing randomly

[Diagrams: elements 1 2 3 4 moved between scattered and sequential positions]


AVOID SCATTER/GATHER IF YOU CAN


SORTING MIGHT BE AN OPTION

If reading non-sequential data is expensive, is it worth sorting it to make it sequential?

[Diagram: Gathering 2 4 1 3 is slow; after a Sort, a Coalesced Read of 1 2 3 4 is fast]

Even if you’re only going to read it twice, then yes!


PRE-SORTING TURNS OUT TO BE GOOD
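One way to pre-sort on the GPU is Thrust's `sort_by_key`, sketched below. The talk does not say which sort it used, so treat this as an assumption. Here each value carries a key giving its position in access order; after the sort, the values sit in that order and later passes can read them linearly. `presortDemo` is a hypothetical name.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>

bool presortDemo() {
    const int n = 1 << 20;
    thrust::host_vector<int> hKey(n), hVal(n);
    for (int i = 0; i < n; i++) {
        // Keys form a permutation of 0..n-1 (odd multiplier, power-of-two n).
        hKey[i] = (int)(((long long)i * 7919) % n);
        hVal[i] = 2 * hKey[i];      // value tied to its key so the result can be checked
    }

    thrust::device_vector<int> key = hKey, val = hVal;

    // Pay for the reordering once; afterwards val is in access order.
    thrust::sort_by_key(key.begin(), key.end(), val.begin());

    thrust::host_vector<int> sorted = val;
    for (int i = 0; i < n; i++)
        if (sorted[i] != 2 * i) return false;
    return true;
}
```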

DATA LAYOUT: “AOS vs. SOA”

Sometimes you can’t just sort your data

Array-of-Structures

#define NPTS (1024 * 1024)

struct Coefficients_AOS {
    double u[3];
    double x[3][3];
    double p;
    double rho;
    double eta;
};

Coefficients_AOS gridData[NPTS];

Structure-of-Arrays

#define NPTS (1024 * 1024)

struct Coefficients_SOA {
    double u[3][NPTS];
    double x[3][3][NPTS];
    double p[NPTS];
    double rho[NPTS];
    double eta[NPTS];
};

Coefficients_SOA gridData;

Single-thread code prefers arrays of structures, for cache efficiency

SIMT code prefers structures of arrays, for execution & memory efficiency

Conceptual layout of one Coefficients_AOS element:

u0 u1 u2

x00 x01 x02

x10 x11 x12

x20 x21 x22

p

rho

eta

AOS: STRIDED ARRAY ACCESS

GPU reads data one element at a time, but in parallel by 32 threads in a warp

double u0 = gridData[threadIdx.x].u[0];

Array-of-Structures Memory Layout: each structure holds 15 doubles (120 bytes), so adjacent threads read addresses 120 bytes apart

SOA: COALESCED BUT COMPLEX

GPU reads data one element at a time, but in parallel by 32 threads in a warp

double u0 = gridData.u[0][threadIdx.x];

Structure-of-Arrays Memory Layout: adjacent threads read adjacent doubles, 8 bytes apart
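The two layouts can be compared with a pair of kernels that read the same field. This sketch reuses the slide's structure definitions but assumes a smaller `NPTS` to keep the example light; the kernel and driver names (`readU0_AOS`, `readU0_SOA`, `layoutDemo`) are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define NPTS (1 << 16)   // assumption: smaller than the slide's 1024 * 1024

struct Coefficients_AOS {
    double u[3]; double x[3][3]; double p; double rho; double eta;
};
struct Coefficients_SOA {
    double u[3][NPTS]; double x[3][3][NPTS];
    double p[NPTS]; double rho[NPTS]; double eta[NPTS];
};

// Strided: consecutive threads touch addresses sizeof(Coefficients_AOS) apart.
__global__ void readU0_AOS(const Coefficients_AOS *grid, double *out) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < NPTS) out[id] = grid[id].u[0];
}

// Coalesced: consecutive threads touch consecutive doubles.
__global__ void readU0_SOA(const Coefficients_SOA *grid, double *out) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < NPTS) out[id] = grid->u[0][id];
}

bool layoutDemo() {
    std::vector<Coefficients_AOS> aos(NPTS);
    std::vector<Coefficients_SOA> soa(1);   // on the heap: too large for the stack
    for (int i = 0; i < NPTS; i++) {
        aos[i].u[0] = i;
        soa[0].u[0][i] = i;
    }

    Coefficients_AOS *dAos = nullptr;
    Coefficients_SOA *dSoa = nullptr;
    double *dOut = nullptr;
    cudaMalloc(&dAos, NPTS * sizeof(Coefficients_AOS));
    cudaMalloc(&dSoa, sizeof(Coefficients_SOA));
    cudaMalloc(&dOut, NPTS * sizeof(double));
    cudaMemcpy(dAos, aos.data(), NPTS * sizeof(Coefficients_AOS), cudaMemcpyHostToDevice);
    cudaMemcpy(dSoa, soa.data(), sizeof(Coefficients_SOA), cudaMemcpyHostToDevice);

    std::vector<double> a(NPTS), b(NPTS);
    readU0_AOS<<<NPTS / 256, 256>>>(dAos, dOut);
    cudaMemcpy(a.data(), dOut, NPTS * sizeof(double), cudaMemcpyDeviceToHost);
    readU0_SOA<<<NPTS / 256, 256>>>(dSoa, dOut);
    cudaMemcpy(b.data(), dOut, NPTS * sizeof(double), cudaMemcpyDeviceToHost);

    cudaFree(dAos); cudaFree(dSoa); cudaFree(dOut);
    for (int i = 0; i < NPTS; i++)
        if (a[i] != i || b[i] != i) return false;
    return true;
}
```

Both kernels return the same values; the layouts differ only in how many memory transactions each warp needs to fetch them.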

BLOCK-WIDE LOAD VIA SHARED MEMORY

Read data linearly as bytes. Use shared memory to convert to struct

1. Block copies data to shared memory (Device Memory to Shared Memory)

2. Threads which own the data grab it from shared memory
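The two steps above can be sketched as a single kernel. Everything here is illustrative: the `Point` structure, the block size and the names `lengthSquared` and `blockLoadDemo` are assumptions, and the data is staged as 4-byte words rather than individual bytes, which keeps the global reads coalesced while staying simple.

```cuda
#include <cuda_runtime.h>
#include <vector>

struct Point { float x, y, z; };                     // hypothetical 12-byte struct
constexpr int WORDS = sizeof(Point) / sizeof(float); // 3 words per struct
constexpr int BLOCK = 128;                           // threads (and structs) per block

__global__ void lengthSquared(const Point *pts, float *out, int n) {
    __shared__ float tile[BLOCK * WORDS];
    int base = blockIdx.x * BLOCK;                   // first struct owned by this block
    const float *src = reinterpret_cast<const float *>(pts + base);
    int valid = min(BLOCK, n - base) * WORDS;        // words this block may read

    // Step 1: the whole block copies its structs linearly, word by word.
    for (int w = threadIdx.x; w < valid; w += BLOCK) tile[w] = src[w];
    __syncthreads();

    // Step 2: each thread picks up the struct it owns from shared memory.
    int id = base + threadIdx.x;
    if (id < n) {
        const float *p = &tile[threadIdx.x * WORDS];
        out[id] = p[0] * p[0] + p[1] * p[1] + p[2] * p[2];
    }
}

bool blockLoadDemo() {
    const int n = 1000;                              // deliberately not a multiple of BLOCK
    std::vector<Point> pts(n);
    std::vector<float> result(n);
    for (int i = 0; i < n; i++) pts[i] = Point{(float)i, 1.0f, 2.0f};

    Point *dPts = nullptr;
    float *dOut = nullptr;
    cudaMalloc(&dPts, n * sizeof(Point));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dPts, pts.data(), n * sizeof(Point), cudaMemcpyHostToDevice);
    lengthSquared<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(dPts, dOut, n);
    cudaMemcpy(result.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dPts);
    cudaFree(dOut);

    for (int i = 0; i < n; i++)
        if (result[i] != (float)i * (float)i + 5.0f) return false;
    return true;
}
```

The kernel must be launched with exactly `BLOCK` threads per block. With three words per struct, the stride-3 reads in step 2 land in distinct shared-memory banks.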

CLEVER AOS/SOA TRICKS

Helps for any data size

HANDY LIBRARY TO HELP YOU

Trove: a utility library for fast AOS/SOA access and transposition

https://github.com/bryancatanzaro/trove


(AB)USING THE CACHE

MAKING THE MOST OF L2-CACHE

L2 cache is fast but small: GDRAM 300GB/sec, L2 Cache 2,000GB/sec

Architecture   L2 Cache Size   Total Threads   Cache Bytes per Thread
Kepler         1536 KB         30,720          51
Maxwell        3072 KB         49,152          64
Pascal         4096 KB         114,688         36
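The last column is the L2 size divided by the number of threads that can be resident at once, for example 1536 × 1024 / 30,720 ≈ 51 for Kepler. A small sketch, with the hypothetical name `l2BytesPerThread`, computes the same figure for whatever device is installed using fields of `cudaDeviceProp`:

```cuda
#include <cuda_runtime.h>

// Returns L2 cache bytes per resident thread for the given device,
// or -1.0 if the device properties cannot be read.
double l2BytesPerThread(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1.0;

    // Total resident threads = SM count x maximum threads per SM.
    double totalThreads = (double)prop.multiProcessorCount *
                          (double)prop.maxThreadsPerMultiProcessor;
    return (double)prop.l2CacheSize / totalThreads;
}
```

Calling `l2BytesPerThread(0)` gives the per-thread budget that a tile-sizing decision would start from.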


TRAINING DEEP NEURAL NETWORKS

LOTS OF PASSES OVER DATA

[Diagram: FFT; 3x3, 5x5 and 7x7 convolutions; +; weights W1, W2, W3; output "Cat!"]


MULTI-RESOLUTION CONVOLUTIONS

Pass 1 : 3x3

Pass 2: 5x5

Pass 3: 7x7

TILED, MULTI-RESOLUTION CONVOLUTION

Do 3 passes per-tile

Each tile sized to fit in L2 cache

Pass 1 : 3x3

Pass 2: 5x5

Pass 3: 7x7
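A hedged sketch of the tiling loop follows. To stay short it uses a 1D box sum as a stand-in for the real convolutions, and the tile size is an assumed value, not one from the talk. `boxSum`, `tiledPasses` and `tiledDemo` are hypothetical names. The point is the loop order: all three passes run over one tile before the next tile is touched, so the second and third passes can find their input in L2.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Stand-in for a W-wide convolution: sums W neighbours, clamping at the array edges.
template <int W>
__global__ void boxSum(const float *in, float *out, int n, int begin, int end) {
    int i = begin + threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= end) return;
    float acc = 0.0f;
    for (int k = -W / 2; k <= W / 2; k++)
        acc += in[min(max(i + k, 0), n - 1)];
    out[i] = acc;
}

// Runs all three passes on one tile before moving on to the next.
void tiledPasses(const float *in, float *out3, float *out5, float *out7,
                 int n, int tileElems) {
    for (int begin = 0; begin < n; begin += tileElems) {
        int end = std::min(begin + tileElems, n);
        int blocks = (end - begin + 255) / 256;
        boxSum<3><<<blocks, 256>>>(in, out3, n, begin, end);
        boxSum<5><<<blocks, 256>>>(in, out5, n, begin, end);
        boxSum<7><<<blocks, 256>>>(in, out7, n, begin, end);
    }
}

bool tiledDemo() {
    const int n = 1 << 22, tileElems = 1 << 18;   // 1 MB of floats per tile (assumed)
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f), r3(n), r5(n), r7(n);

    float *in = nullptr, *o3 = nullptr, *o5 = nullptr, *o7 = nullptr;
    cudaMalloc(&in, bytes); cudaMalloc(&o3, bytes);
    cudaMalloc(&o5, bytes); cudaMalloc(&o7, bytes);
    cudaMemcpy(in, host.data(), bytes, cudaMemcpyHostToDevice);

    tiledPasses(in, o3, o5, o7, n, tileElems);

    cudaMemcpy(r3.data(), o3, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(r5.data(), o5, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(r7.data(), o7, bytes, cudaMemcpyDeviceToHost);
    cudaFree(in); cudaFree(o3); cudaFree(o5); cudaFree(o7);

    // With an all-ones input and clamped edges, every output equals W.
    for (int i = 0; i < n; i++)
        if (r3[i] != 3.0f || r5[i] != 5.0f || r7[i] != 7.0f) return false;
    return true;
}
```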


LAUNCHING FEWER THAN MAXIMUM THREADS


SHARED MEMORY: DEFINITELY WORTH IT

USING SHARED MEMORY WISELY

Shared memory arranged into “banks” for concurrent SIMT access

▪ 32 threads can read simultaneously so long as they read from separate banks

Shared memory has 4-byte and 8-byte “bank” sizes
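A common way to keep a warp's accesses in separate banks is to pad a shared-memory tile by one column. The matrix transpose below is a standard illustration of that padding and is not taken from the talk; `transposePadded` and `transposeDemo` are hypothetical names, and 4-byte banks are assumed.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;

__global__ void transposePadded(const float *in, float *out, int n) {
    // The extra column makes column-wise reads stride 33 words, so the
    // 32 threads of a warp hit 32 different banks instead of one.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Swap the block coordinates and read the tile column-wise.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n) out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

bool transposeDemo() {
    const int n = 256;
    const size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> host(n * n), result(n * n);
    for (int i = 0; i < n * n; i++) host[i] = (float)i;

    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemcpy(in, host.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    transposePadded<<<grid, block>>>(in, out, n);

    cudaMemcpy(result.data(), out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);

    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            if (result[r * n + c] != host[c * n + r]) return false;
    return true;
}
```

Declaring the tile as `tile[TILE][TILE]` gives the same result but serializes the column-wise reads, which is the bank conflict the slide warns about.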

STENCIL ALGORITHM

Many algorithms have high data re-use: potentially good for shared memory

“Stencil” algorithms accumulate data from neighbours onto a central point

▪ Stencil has width “W” (in the illustrated case, W=5)

Adjacent threads will share (W-1) items of data: good potential for data re-use


STENCILS IN SHARED MEMORY
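A minimal 1D sketch of a W=5 stencil staged through shared memory, assuming zero padding outside the array. The names `stencil5` and `stencilDemo` and the block size are illustrative. Each block loads its elements plus a halo of W/2 on either side once, and every thread then reads its W inputs from shared memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int W = 5;            // stencil width
constexpr int R = W / 2;        // halo on each side
constexpr int BLOCK = 256;      // threads per block

__global__ void stencil5(const int *in, int *out, int n) {
    __shared__ int tile[BLOCK + 2 * R];
    int g = blockIdx.x * BLOCK + threadIdx.x;   // global index
    int l = threadIdx.x + R;                    // index inside the tile

    tile[l] = (g < n) ? in[g] : 0;
    if (threadIdx.x < R) {                      // first R threads also load both halos
        int left = g - R, right = g + BLOCK;
        tile[l - R] = (left >= 0 && left < n) ? in[left] : 0;
        tile[l + BLOCK] = (right < n) ? in[right] : 0;
    }
    __syncthreads();

    if (g < n) {
        int acc = 0;
        for (int k = -R; k <= R; k++) acc += tile[l + k];
        out[g] = acc;
    }
}

bool stencilDemo() {
    const int n = 1000;
    const size_t bytes = n * sizeof(int);
    std::vector<int> host(n), result(n);
    for (int i = 0; i < n; i++) host[i] = i % 7;

    int *in = nullptr, *out = nullptr;
    cudaMalloc(&in, bytes);
    cudaMalloc(&out, bytes);
    cudaMemcpy(in, host.data(), bytes, cudaMemcpyHostToDevice);
    stencil5<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(in, out, n);
    cudaMemcpy(result.data(), out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);

    // Compare against a straightforward CPU reference with the same zero padding.
    for (int i = 0; i < n; i++) {
        int expected = 0;
        for (int k = -R; k <= R; k++)
            if (i + k >= 0 && i + k < n) expected += host[i + k];
        if (result[i] != expected) return false;
    }
    return true;
}
```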


SIZE MATTERS

PERSISTENT KERNELS

Revisiting the tiled convolutions

Avoid multiple kernel launches by caching in shared memory instead of L2

Separate kernel launches with L2 re-use

void tiledConvolution() {
    convolution<3><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
    convolution<5><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
    convolution<7><<< numblocks, blockdim, 0, s >>>(ptr, chunkSize);
}

Single kernel launch with persistent kernel

__global__ void convolutionShared(int *data, int count, int sharedelems) {
    extern __shared__ int shdata[];
    shdata[threadIdx.x] = data[threadIdx.x + blockDim.x*blockIdx.x];    // stage this block's chunk once
    __syncthreads();

    convolve<3>(threadIdx.x, shdata, sharedelems);
    __syncthreads();
    convolve<5>(threadIdx.x, shdata, sharedelems);
    __syncthreads();
    convolve<7>(threadIdx.x, shdata, sharedelems);
}


OPERATING DIRECTLY FROM CPU MEMORY

Can save memory copies. It’s obvious when you think about it ...

[Timeline: Copy data to GPU, Compute, Copy data to Host]

Compute only begins when 1st copy has finished. Task only ends when 2nd copy has finished.

[Timeline: Read from CPU, Compute, Write to CPU, all overlapping]

Compute begins after first fetch. Uses lots of threads to cover host-memory access latency. Takes advantage of bi-directional PCI.
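A sketch of a kernel that reads its input from, and writes its output to, mapped host memory, with no explicit copies. The names `scale` and `zeroCopyDemo` and the launch configuration are assumptions. It relies on `cudaHostAllocMapped` allocations together with `cudaHostGetDevicePointer`; on older setups without unified addressing, `cudaSetDeviceFlags(cudaDeviceMapHost)` may have to be called before any other CUDA work.

```cuda
#include <cuda_runtime.h>

// Reads from and writes to host memory directly across PCIe.
__global__ void scale(const float *in, float *out, int n) {
    for (int id = threadIdx.x + blockDim.x * blockIdx.x; id < n;
         id += blockDim.x * gridDim.x)
        out[id] = 2.0f * in[id];
}

bool zeroCopyDemo() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *hIn = nullptr, *hOut = nullptr, *dIn = nullptr, *dOut = nullptr;

    // Pinned host buffers that are also mapped into the GPU's address space.
    if (cudaHostAlloc((void **)&hIn, bytes, cudaHostAllocMapped) != cudaSuccess)
        return false;
    if (cudaHostAlloc((void **)&hOut, bytes, cudaHostAllocMapped) != cudaSuccess) {
        cudaFreeHost(hIn);
        return false;
    }
    for (int i = 0; i < n; i++) hIn[i] = (float)i;

    bool ok = cudaHostGetDevicePointer((void **)&dIn, hIn, 0) == cudaSuccess &&
              cudaHostGetDevicePointer((void **)&dOut, hOut, 0) == cudaSuccess;
    if (ok) {
        // Many threads in flight help cover the host-memory access latency.
        scale<<<256, 256>>>(dIn, dOut, n);
        ok = cudaDeviceSynchronize() == cudaSuccess;
    }
    for (int i = 0; ok && i < n; i++)
        if (hOut[i] != 2.0f * (float)i) ok = false;

    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return ok;
}
```

This pattern suits data that is read or written once; data that the kernel revisits is usually better copied to GPU memory first, since every access here crosses the PCIe bus.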



OCCUPANCY AND REGISTER LIMITATIONS

Register file is bigger than shared memory and L1 cache!

Occupancy can kill you if you use too many registers

Often worth forcing fewer registers to allow more blocks per SM

But watch out for math functions!

Function       float   double
log            7       18
cos            16      28
acos           6       18
cosh           7       10
tan            15      28
erfc           14      22
exp            7       10
log10          6       18
normcdf        16      26
cbrt           8       20
sqrt           6       12
rsqrt          5       12
y0             20      30
y1             22      30
fdivide        11      20
pow            11      24
grad. desc.    14      22

__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
__global__ void compute() {
    y = acos(pow(log(fdivide(tan(cosh(erfc(x))), 2)), 3));
}
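A compilable version of the fragment above might look like the sketch below. The bounds of 256 threads per block and 4 blocks per multiprocessor are example values, not recommendations from the talk, and `computeRegisterCount` and `launchBoundsDemo` are hypothetical helpers. `cudaFuncGetAttributes` reports how many registers the compiler ended up using, which is one way to see the effect of changing the bounds.

```cuda
#include <cuda_runtime.h>

// Example bounds (assumed): at most 256 threads per block, and the compiler
// should limit register use so that at least 4 blocks fit on each SM.
__global__ void __launch_bounds__(256, 4)
compute(const double *x, double *y, int n) {
    int id = threadIdx.x + blockDim.x * blockIdx.x;
    if (id < n)
        y[id] = acos(pow(log(fdivide(tan(cosh(erfc(x[id]))), 2.0)), 3.0));
}

// Registers per thread chosen by the compiler, or -1 on failure.
int computeRegisterCount() {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, compute) != cudaSuccess) return -1;
    return attr.numRegs;
}

bool launchBoundsDemo() {
    const int n = 1024;
    double *x = nullptr, *y = nullptr;
    cudaMalloc(&x, n * sizeof(double));
    cudaMalloc(&y, n * sizeof(double));
    cudaMemset(x, 0, n * sizeof(double));       // all inputs are 0.0

    compute<<<n / 256, 256>>>(x, y, n);         // block size must respect the bound
    bool ok = cudaDeviceSynchronize() == cudaSuccess &&
              computeRegisterCount() > 0;

    cudaFree(x);
    cudaFree(y);
    return ok;
}
```

The demo checks only that the kernel launches and reports its register count; the chained math expression can produce NaN for many inputs, so its numerical output is not verified here.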


THANK YOU!