
CUDA OPTIMIZATION WITH

NVIDIA NSIGHT™ VISUAL STUDIO EDITION

Julien Demouth, NVIDIA

WHAT WILL YOU LEARN? An iterative method to optimize your GPU code

A way to apply that method with NVIDIA Nsight VSE

https://github.com/jdemouth/nsight-gtc2014

A WORD ABOUT THE APPLICATION

Image pipeline: Grayscale → Blur → Edges

A WORD ABOUT THE APPLICATION Grayscale Conversion

// r, g, b: red, green, blue components of the pixel p
foreach pixel p:
    p = 0.298839f*r + 0.586811f*g + 0.114350f*b;
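A minimal CUDA sketch of this conversion (illustrative only, not the repo's exact code; it assumes packed uchar3 RGB input and a single-channel output):

__global__ void grayscale(int w, int h, const uchar3 *rgb, unsigned char *gray)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    uchar3 p = rgb[y*w + x];
    // Weighted sum of the three channels, weights as above.
    float g = 0.298839f*p.x + 0.586811f*p.y + 0.114350f*p.z;
    gray[y*w + x] = (unsigned char)(g + 0.5f);  // round to nearest
}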

A WORD ABOUT THE APPLICATION Blur: 7x7 Gaussian Filter

foreach pixel p:
    p = weighted sum of p and its 48 neighbors

The 7x7 weights (the outer product of [1 2 3 4 3 2 1] with itself):

 1  2  3  4  3  2  1
 2  4  6  8  6  4  2
 3  6  9 12  9  6  3
 4  8 12 16 12  8  4
 3  6  9 12  9  6  3
 2  4  6  8  6  4  2
 1  2  3  4  3  2  1

Image from Wikipedia
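A naive CUDA sketch of this filter, in the spirit of the gaussian_filter_7x7_v0 kernel profiled below (the real code lives in the companion repo; names here are illustrative):

__constant__ int g_weights[7][7];  // the integer weights above, copied in at startup

__global__ void gaussian_7x7_naive(int w, int h, const unsigned char *src, unsigned char *dst)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int sum = 0, norm = 0;
    for (int dy = -3; dy <= 3; ++dy)
        for (int dx = -3; dx <= 3; ++dx) {
            int xx = min(max(x+dx, 0), w-1);  // clamp at the image borders
            int yy = min(max(y+dy, 0), h-1);
            int wgt = g_weights[dy+3][dx+3];
            sum  += wgt * src[yy*w + xx];
            norm += wgt;
        }
    dst[y*w + x] = (unsigned char)(sum / norm);
}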

A WORD ABOUT THE APPLICATION Edges: 3x3 Sobel Filters

foreach pixel p:
    Gx = weighted sum of p and its 8 neighbors
    Gy = weighted sum of p and its 8 neighbors
    p = sqrt(Gx*Gx + Gy*Gy)

Weights for Gx:

-1  0  1
-2  0  2
-1  0  1

Weights for Gy:

 1  2  1
 0  0  0
-1 -2 -1
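A matching CUDA sketch (again illustrative; the clamped fetch mirrors the blur kernel above):

__device__ __forceinline__ int pix(const unsigned char *src, int w, int h, int x, int y)
{
    x = min(max(x, 0), w-1);  // clamp at the image borders
    y = min(max(y, 0), h-1);
    return src[y*w + x];
}

__global__ void sobel_3x3(int w, int h, const unsigned char *src, unsigned char *dst)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int gx = -  pix(src,w,h,x-1,y-1) +   pix(src,w,h,x+1,y-1)
             -2*pix(src,w,h,x-1,y  ) + 2*pix(src,w,h,x+1,y  )
             -  pix(src,w,h,x-1,y+1) +   pix(src,w,h,x+1,y+1);
    int gy =    pix(src,w,h,x-1,y-1) + 2*pix(src,w,h,x,y-1) + pix(src,w,h,x+1,y-1)
             -  pix(src,w,h,x-1,y+1) - 2*pix(src,w,h,x,y+1) - pix(src,w,h,x+1,y+1);
    float mag = sqrtf((float)(gx*gx + gy*gy));
    dst[y*w + x] = (unsigned char)fminf(mag, 255.0f);
}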

OPTIMIZATION METHOD Trace the Application

Identify the Hot Spot and Profile it

Identify the Performance Limiter

— Memory Bandwidth

— Instruction Throughput

— Latency

Optimize the Code

Iterate

We focus on the Assess and Optimize steps of the APOD method; we do not cover the Parallelize and Deploy steps.
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#assess-parallelize-optimize-deploy

ENVIRONMENT NVIDIA Tesla K20c (GK110, SM3.5) without ECC

Microsoft Windows 7 x64

Microsoft Visual Studio 2012

NVIDIA CUDA 6.0

NVIDIA Nsight Visual Studio Edition 4.0

BEFORE WE START Some slides are background material

Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them, GTC 2013

http://on-demand.gputechconf.com/gtc/2013/video/S3466-Performance-Optimization-Guidelines-GPU-Architecture-Details.mp4

http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf

CUDA Best Practices Guide

http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Chameleon from http://www.vectorportal.com, Creative Commons

BEFORE WE START Instructions are executed by warps of threads

— It is a hardware concept

— There are 32 threads per warp

ITERATION 1

TRACE THE APPLICATION

In Nsight: select "Trace Application", activate CUDA tracing, launch, and verify the parameters. [Screenshot of the Nsight activity configuration]

TIMELINE

CUDA LAUNCH SUMMARY

The hotspot is gaussian_filter_7x7_v0

Kernel            Time     Speedup
Original version  6.265ms

PROFILE THE HOTSPOT

In Nsight: select "Profile CUDA Application", select the kernel launch, and select the experiments (All). [Screenshot of the Nsight profile configuration]

IDENTIFY THE MAIN LIMITER Is it limited by memory bandwidth?

Is it limited by instruction throughput?

Is it limited by latency?

MEMORY BANDWIDTH

[Diagram: each SM has its registers and SMEM/L1$; all SMs reach global memory (the framebuffer) through a shared L2$. DRAM bandwidth: 208GB/s on K20.]

MEMORY BANDWIDTH Low utilization of the L2$ BW, and DRAM BW < 2%

Not limited by memory bandwidth

INSTRUCTION THROUGHPUT

Each SM has 4 schedulers (Kepler)

Schedulers issue instructions to pipes

A scheduler issues up to 2 instructions/cycle — Sustainable peak is 7 instructions/cycle per SM (not 4x2 = 8)

A scheduler issues inst. from a single warp

Cannot issue to a pipe if its issue slot is full

[Diagram: an SM with its registers and SMEM/L1$; 4 schedulers, each feeding its own group of pipes.]

INSTRUCTION THROUGHPUT

Three example profiles (issue-slot utilization vs. per-pipe utilization across the Load/Store, Texture, Control Flow and ALU pipes):

— Schedulers saturated: 90% issue-slot utilization, but no pipe saturated (pipe utilizations of 65%, 11%, 8%, 6%)

— Schedulers and a pipe saturated: 92% issue-slot utilization, one pipe at 90% (others at 27% and 4%)

— A pipe saturated without saturating the schedulers: one pipe at 78% (others at 24% and 4%), only 64% issue-slot utilization

WARP ISSUE EFFICIENCY

Percentage of issue slots used, aggregated over all the schedulers

PIPE UTILIZATION

Percentages of issue slots used per pipe

Accounts for pipe throughputs

Four groups of pipes:

— Load/Store

— Texture

— Control Flow

— Arithmetic (ALU)

INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated

Not limited by the instruction throughput

LATENCY GPUs cover latencies by having a lot of work in flight

[Diagram: warps 0-9 alternate between issuing and waiting. While one warp waits on a long-latency operation, the schedulers issue from other warps, so the latency is fully covered. When no warp can issue, the latency is exposed.]

LATENCY: LACK OF OCCUPANCY Not enough active warps

The schedulers cannot find eligible warps at every cycle

[Diagram: with only warps 0-3 in flight, there are cycles where no warp issues.]

LATENCY

50% of theoretical occupancy

31.2 active warps per cycle

3.57 warps eligible per cycle

Hard to tell. Let’s start with occupancy

OCCUPANCY Each SM has limited resources

64K registers (32-bit) shared by the threads

Up to 48KB of shared memory

16 block slots

Full occupancy: 2048 threads per SM (64 warps)

Values for SM 3.0/3.5; they vary with Compute Capability

OCCUPANCY (BLOCK DIMENSION) Limited by the number of blocks

Blocks are too small (64 threads/block): 16 blocks x 64 threads = 1024 threads, only 50% of the 2048-thread maximum

OCCUPANCY (BLOCK DIMENSION)

Increase the block size for (slightly) better occupancy

OCCUPANCY (BLOCK DIMENSION) Increase the block size to 128 threads (8x16)

It runs slightly faster: 6.075ms

Kernel            Time     Speedup
Original version  6.265ms
Larger blocks     6.075ms  1.03x

ITERATION 2

TRACE THE APPLICATION

The hotspot is still gaussian_filter_7x7_v0

Kernel            Time     Speedup
Original version  6.265ms
Larger blocks     6.075ms  1.03x

IDENTIFY THE MAIN LIMITER Is it limited by memory bandwidth?

Is it limited by instruction throughput?

Is it limited by latency?

MEMORY BANDWIDTH Low utilization of the L2$ BW, and DRAM BW < 2%

Not limited by memory bandwidth

INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated

Not limited by the instruction throughput

LATENCY (OCCUPANCY) Limited occupancy (56%) but 4.62 eligible warps/cycle (> 4)

There is probably something else limiting our performance

MEMORY TRANSACTIONS Memory requests are issued per warp of 32 threads

L1 transaction: 128B – Alignment: 128B (0, 128, 256, …)

L2 transaction: 32B – Alignment: 32B (0, 32, 64, 96, …)

MEMORY TRANSACTIONS A warp issues 32x4B aligned and consecutive loads/stores

Threads read different elements of the same 128B segment

1x L1 transaction: 128B needed / 128B transferred

4x L2 transactions: 128B needed / 128B transferred

1x 128B L1 transaction per warp

4x 32B L2 transactions per warp

MEMORY TRANSACTIONS Threads in a warp read/write 4B words, 128B between words

Each thread reads the first 4B of a 128B segment

32x L1 transactions: 128B needed / 32x 128B transferred

32x L2 transactions: 128B needed / 32x 32B transferred

1x 128B L1 transaction per thread

1x 32B L2 transaction per thread
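A sketch of the two patterns as CUDA kernels (illustrative only, over 4B floats):

__global__ void coalesced(const float *data, float *out)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    // Consecutive threads read consecutive 4B words of one 128B
    // segment: 1x 128B L1 transaction per warp.
    out[idx] = data[idx];
}

__global__ void strided(const float *data, float *out)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    // 128B between the words read by consecutive threads: every
    // thread hits its own segment, 32x L1 transactions per warp.
    out[idx] = data[32 * idx];
}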


TRANSACTIONS AND REPLAYS A warp reads from addresses spanning 3 lines of 128B

1 instr. executed and 2 replays = 1 request and 3 transactions

[Timeline diagram: the instruction is issued once for threads 0-7/24-31, re-issued (1st replay) for threads 8-15, and re-issued again (2nd replay) for threads 16-23.]

TRANSACTIONS AND REPLAYS With replays, requests take more time and use more resources

— More instructions issued

— More memory traffic

— Increased execution time

[Timeline diagram: with replays, instructions 0-2 each occupy several issue slots and their data transfers are serialized, so each instruction completes later: extra latency, extra work for the SM, and extra memory traffic.]

TRANSACTIONS PER REQUEST Transactions per Request: 4.20 (Load) / 4.00 (Store)

Too many memory transactions (too much pressure on LSU)

TRANSACTIONS PER REQUEST Our blocks are 8x16

We should use blocks of size 32x4 (or 32x8); see the sketch after the diagram below

[Diagram: thread-index layout within a block. With 8x16 blocks, the 32 threads of a warp span four 8-pixel rows, so each warp load touches four separate memory segments. With 32x4 blocks, warps 0-3 each map to one contiguous 32-pixel row, with threadIdx.x running 0-31 along the row.]
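The fix is purely in the launch configuration; a sketch, assuming the usual grid computation over the image size (variable names illustrative):

// Before: dim3 block(8, 16);  each warp spans 4 rows -> ~4 transactions per request.
// After: each warp reads one contiguous 32-pixel row.
dim3 block(32, 4);
dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
gaussian_filter_7x7_v0<<<grid, block>>>(w, h, src, dst);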

IMPROVED MEMORY ACCESSES Blocks of size 32x4

It runs faster: 3.605ms

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x

ITERATION 3

TRACE THE APPLICATION

The hotspot is still gaussian_filter_7x7_v0

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x

We use the same block size for the Sobel filter kernel, which is why it also improves (2nd row of the Nsight table).

MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 2%

Not limited by memory bandwidth

INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated

Not limited by instruction throughput

LATENCY (OCCUPANCY) Limited occupancy and not enough eligible warps/cycle (2.0)

Need more active warps (i.e. occupancy)

LATENCY (OCCUPANCY) Limited by register usage

REDUCE REGISTER USAGE Use the __launch_bounds__ attribute

10 gives the best results (48 registers): 1.949ms

__global__ __launch_bounds__(128, 10)  // 128 = number of threads per block, 10 = minimum number of blocks per SM
void gaussian_filter_7x7_v1(int w, int h, const uchar *src, uchar *dst)

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x
Fewer registers         1.949ms  3.21x

ITERATION 4

TRACE THE APPLICATION

The hotspot is gaussian_filter_7x7_v1

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x
Fewer registers         1.949ms  3.21x

MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 2%

Not limited by memory bandwidth

INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated

Not limited by instruction throughput

LATENCY (OCCUPANCY) Enough active and eligible warps per cycle

Not limited by a lack of occupancy

BRANCH DIVERGENCE Threads of a warp take different branches of a conditional

if (threadIdx.x < 12) { ... } else { ... }

[Timeline diagram: threads 0-11 execute the "if" branch while the others idle, then threads 12-31 execute the "else" branch.]

Execution time = "if" branch + "else" branch

BRANCH DIVERGENCE But no divergence in our code

OTHER IDEAS Shared memory bank conflicts

— Conflicts: two threads read different addresses in the same bank

Excessive use of synchronization (__syncthreads)

But neither of those applies in our case

MEMORY TRANSFERS Our image size: 2560x1600 = 4MB

We read 385MB from L2$: too much traffic! With a 7x7 filter, each input pixel is re-read by up to 49 neighboring output pixels.

MEMORY TRANSFERS We do not saturate the issue slot of the Load-Store unit

But we saturate inside the Load-Store unit

Unfortunately, we cannot detect that with Nsight (yet)

Adjacent pixels have neighbors in common

We should use shared memory to store those common pixels

SHARED MEMORY

__shared__ unsigned char smem_pixels[10][64];  // 10 rows x 64 columns: a 32x4 block plus a 3-pixel apron on each side
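A sketch of how such a tile can be filled and used (assumptions: 32x4 blocks, clamped borders, and the constant-memory weights from the naive sketch earlier; the repo's gaussian_filter_7x7_v2 may differ in details):

__global__ void gaussian_7x7_smem(int w, int h, const unsigned char *src, unsigned char *dst)
{
    __shared__ unsigned char smem_pixels[10][64];

    int x0 = blockIdx.x*32, y0 = blockIdx.y*4;

    // Cooperatively load the 38x10 tile (32x4 block + 3-pixel apron),
    // strided by the 32x4 = 128 threads of the block.
    for (int i = threadIdx.y*32 + threadIdx.x; i < 10*38; i += 128) {
        int tx = i % 38, ty = i / 38;
        int xx = min(max(x0 + tx - 3, 0), w-1);  // clamp at the borders
        int yy = min(max(y0 + ty - 3, 0), h-1);
        smem_pixels[ty][tx] = src[yy*w + xx];
    }
    __syncthreads();

    int x = x0 + threadIdx.x, y = y0 + threadIdx.y;
    if (x >= w || y >= h) return;

    int sum = 0, norm = 0;
    for (int dy = 0; dy < 7; ++dy)
        for (int dx = 0; dx < 7; ++dx) {
            int wgt = g_weights[dy][dx];  // constant-memory weights, as before
            sum  += wgt * smem_pixels[threadIdx.y + dy][threadIdx.x + dx];
            norm += wgt;
        }
    dst[y*w + x] = (unsigned char)(sum / norm);
}

Each tile pixel is now fetched from L2$/DRAM once per block instead of up to 49 times.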

USE SHARED MEMORY Use shared memory to keep data on the SM: 1.211ms

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x
Fewer registers         1.949ms  3.21x
Shared memory           1.211ms  5.17x

ITERATION 5

TRACE THE APPLICATION

The hotspot is gaussian_filter_7x7_v2

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x
Fewer registers         1.949ms  3.21x
Shared memory           1.211ms  5.17x

MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 4%

Not limited by memory bandwidth

INSTRUCTION THROUGHPUT Not saturating the schedulers

But we use 73% of the Load-Store issue slot

LOAD-STORE INSTRUCTIONS LSU executes global and shared memory instructions

Change global loads to use the Read-Only path

READ-ONLY CACHE (TEXTURE UNITS)

[Diagram: loads through the Read-Only path go via the Texture Units, skipping the LSU, and are cached; they still reach global memory (the framebuffer) through the L2$.]

READ-ONLY PATH Annotate our pointer with const __restrict

The compiler generates LDG instructions: 1.019ms

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x
Fewer registers         1.949ms  3.21x
Shared memory           1.211ms  5.17x
Read-Only path          1.019ms  6.15x

__global__ void gaussian_filter_7x7_v3(int w, int h, const uchar *__restrict src, uchar *dst)
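On SM 3.5, the same path can also be requested explicitly with the __ldg() intrinsic instead of annotating the pointer (uchar being the deck's byte typedef):

uchar v = __ldg(&src[y*w + x]);  // load through the read-only (texture) cache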

INSTRUCTION THROUGHPUT We are doing much better

Things to investigate next:

— Improve memory efficiency

— Reduce computational intensity (separable filter; sketched below)
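Because the 7x7 weights are the outer product of [1 2 3 4 3 2 1] with itself, the blur separates into a horizontal and a vertical 1D pass: 7+7 = 14 taps per pixel instead of 49. A sketch of the horizontal pass (the vertical pass is symmetric and divides by the total weight 256; names illustrative):

__constant__ int g_weights_1d[7] = {1, 2, 3, 4, 3, 2, 1};

__global__ void gaussian_1x7_horizontal(int w, int h, const unsigned char *src, unsigned short *tmp)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int sum = 0;
    for (int dx = -3; dx <= 3; ++dx) {
        int xx = min(max(x + dx, 0), w-1);  // clamp at the borders
        sum += g_weights_1d[dx + 3] * src[y*w + xx];
    }
    tmp[y*w + x] = (unsigned short)sum;  // at most 16*255 = 4080, fits in 16 bits
}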

MORE IN OUR COMPANION CODE

Kernel                                                      Time     Speedup
Original version                                            6.265ms
Larger blocks                                               6.075ms  1.03x
Better memory accesses                                      3.605ms  1.74x
Fewer registers                                             1.949ms  3.21x
Shared memory                                               1.211ms  5.17x
Read-Only path                                              1.019ms  6.15x
Separable filter                                            0.656ms  9.55x
Process two pixels per thread (improve memory efficiency + add ILP)  0.511ms  12.26x
Use 64-bit shared memory (remove bank conflicts)            0.499ms  12.56x
Use float instead of int (increase instruction throughput)  0.434ms  14.44x

Your next idea!!!

https://github.com/jdemouth/nsight-gtc2014

CONCLUSION

OPTIMIZATION METHOD Trace the Application

Identify the Hot Spot and Profile it

Identify the Performance Limiter

— Memory Bandwidth

— Instruction Throughput

— Latency

Optimize the Code

Iterate

https://github.com/jdemouth/nsight-gtc2014