
CUDA OPTIMIZATION WITH

NVIDIA NSIGHT™ VISUAL STUDIO EDITION

Julien Demouth, NVIDIA

WHAT WILL YOU LEARN? An iterative method to optimize your GPU code

A way to apply that method with NVIDIA Nsight VSE

https://github.com/jdemouth/nsight-gtc2014

A WORD ABOUT THE APPLICATION

Image pipeline: Grayscale → Blur → Edges

A WORD ABOUT THE APPLICATION Grayscale Conversion

// r, g, b: red, green, blue components of the pixel p
foreach pixel p:
    p = 0.298839f*r + 0.586811f*g + 0.114350f*b;
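A minimal CUDA sketch of this conversion (illustrative only, not the repo's exact code; it assumes packed uchar3 RGB input and a single-channel output):

__global__ void grayscale(int w, int h, const uchar3 *rgb, unsigned char *gray)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    uchar3 p = rgb[y*w + x];
    // Weighted sum of the three channels, weights as above.
    float g = 0.298839f*p.x + 0.586811f*p.y + 0.114350f*p.z;
    gray[y*w + x] = (unsigned char)(g + 0.5f);  // round to nearest
}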

A WORD ABOUT THE APPLICATION Blur: 7x7 Gaussian Filter

foreach pixel p:
    p = weighted sum of p and its 48 neighbors

The 7x7 weights (the outer product of [1 2 3 4 3 2 1] with itself):

 1  2  3  4  3  2  1
 2  4  6  8  6  4  2
 3  6  9 12  9  6  3
 4  8 12 16 12  8  4
 3  6  9 12  9  6  3
 2  4  6  8  6  4  2
 1  2  3  4  3  2  1

Image from Wikipedia
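A naive CUDA sketch of this filter, in the spirit of the gaussian_filter_7x7_v0 kernel profiled below (the real code lives in the companion repo; names here are illustrative):

__constant__ int g_weights[7][7];  // the integer weights above, copied in at startup

__global__ void gaussian_7x7_naive(int w, int h, const unsigned char *src, unsigned char *dst)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int sum = 0, norm = 0;
    for (int dy = -3; dy <= 3; ++dy)
        for (int dx = -3; dx <= 3; ++dx) {
            int xx = min(max(x+dx, 0), w-1);  // clamp at the image borders
            int yy = min(max(y+dy, 0), h-1);
            int wgt = g_weights[dy+3][dx+3];
            sum  += wgt * src[yy*w + xx];
            norm += wgt;
        }
    dst[y*w + x] = (unsigned char)(sum / norm);
}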

A WORD ABOUT THE APPLICATION Edges: 3x3 Sobel Filters

foreach pixel p:
    Gx = weighted sum of p and its 8 neighbors
    Gy = weighted sum of p and its 8 neighbors
    p = sqrt(Gx*Gx + Gy*Gy)

Weights for Gx:

-1  0  1
-2  0  2
-1  0  1

Weights for Gy:

 1  2  1
 0  0  0
-1 -2 -1
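A matching CUDA sketch (again illustrative; the clamped fetch mirrors the blur kernel above):

__device__ __forceinline__ int pix(const unsigned char *src, int w, int h, int x, int y)
{
    x = min(max(x, 0), w-1);  // clamp at the image borders
    y = min(max(y, 0), h-1);
    return src[y*w + x];
}

__global__ void sobel_3x3(int w, int h, const unsigned char *src, unsigned char *dst)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int gx = -  pix(src,w,h,x-1,y-1) +   pix(src,w,h,x+1,y-1)
             -2*pix(src,w,h,x-1,y  ) + 2*pix(src,w,h,x+1,y  )
             -  pix(src,w,h,x-1,y+1) +   pix(src,w,h,x+1,y+1);
    int gy =    pix(src,w,h,x-1,y-1) + 2*pix(src,w,h,x,y-1) + pix(src,w,h,x+1,y-1)
             -  pix(src,w,h,x-1,y+1) - 2*pix(src,w,h,x,y+1) - pix(src,w,h,x+1,y+1);
    float mag = sqrtf((float)(gx*gx + gy*gy));
    dst[y*w + x] = (unsigned char)fminf(mag, 255.0f);
}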

OPTIMIZATION METHOD Trace the Application

Identify the Hot Spot and Profile it

Identify the Performance Limiter

— Memory Bandwidth

— Instruction Throughput

— Latency

Optimize the Code

Iterate

We focus on the Assess and Optimize steps of the APOD method; we do not cover the Parallelize and Deploy steps.
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#assess-parallelize-optimize-deploy

ENVIRONMENT NVIDIA Tesla K20c (GK110, SM3.5) without ECC

Microsoft Windows 7 x64

Microsoft Visual Studio 2012

NVIDIA CUDA 6.0

NVIDIA Nsight Visual Studio Edition 4.0

BEFORE WE START Some slides are background material

Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them, GTC 2013

http://on-demand.gputechconf.com/gtc/2013/video/S3466-Performance-Optimization-Guidelines-GPU-Architecture-Details.mp4

http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf

CUDA Best Practices Guide

http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Chameleon from http://www.vectorportal.com, Creative Commons

BEFORE WE START Instructions are executed by warps of threads

— It is a hardware concept

— There are 32 threads per warp

ITERATION 1

TRACE THE APPLICATION

In Nsight: select "Trace Application", activate CUDA tracing, launch, and verify the parameters. [Screenshot of the Nsight activity configuration]

TIMELINE

CUDA LAUNCH SUMMARY

The hotspot is gaussian_filter_7x7_v0

Kernel            Time     Speedup
Original version  6.265ms

PROFILE THE HOTSPOT

In Nsight: select "Profile CUDA Application", select the kernel launch, and select the experiments (All). [Screenshot of the Nsight profile configuration]

IDENTIFY THE MAIN LIMITER Is it limited by memory bandwidth?

Is it limited by instruction throughput?

Is it limited by latency?

MEMORY BANDWIDTH

[Diagram: each SM has its registers and SMEM/L1$; all SMs reach global memory (the framebuffer) through a shared L2$. DRAM bandwidth: 208GB/s on K20.]

MEMORY BANDWIDTH Low utilization of the L2$ BW, and DRAM BW < 2%

Not limited by memory bandwidth

INSTRUCTION THROUGHPUT

Each SM has 4 schedulers (Kepler)

Schedulers issue instructions to pipes

A scheduler issues up to 2 instructions/cycle — Sustainable peak is 7 instructions/cycle per SM (not 4x2 = 8)

A scheduler issues inst. from a single warp

Cannot issue to a pipe if its issue slot is full

[Diagram: an SM with its registers and SMEM/L1$; 4 schedulers, each feeding its own group of pipes.]

INSTRUCTION THROUGHPUT

Three example profiles (issue-slot utilization vs. per-pipe utilization across the Load/Store, Texture, Control Flow and ALU pipes):

— Schedulers saturated: 90% issue-slot utilization, but no pipe saturated (pipe utilizations of 65%, 11%, 8%, 6%)

— Schedulers and a pipe saturated: 92% issue-slot utilization, one pipe at 90% (others at 27% and 4%)

— A pipe saturated without saturating the schedulers: one pipe at 78% (others at 24% and 4%), only 64% issue-slot utilization

WARP ISSUE EFFICIENCY

Percentage of issue slots used, aggregated over all the schedulers

PIPE UTILIZATION

Percentages of issue slots used per pipe

Accounts for pipe throughputs

Four groups of pipes:

— Load/Store

— Texture

— Control Flow

— Arithmetic (ALU)

INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated

Not limited by the instruction throughput

LATENCY GPUs cover latencies by having a lot of work in flight

[Diagram: warps 0-9 alternate between issuing and waiting. While one warp waits on a long-latency operation, the schedulers issue from other warps, so the latency is fully covered. When no warp can issue, the latency is exposed.]

LATENCY: LACK OF OCCUPANCY Not enough active warps

The schedulers cannot find eligible warps at every cycle

[Diagram: with only warps 0-3 in flight, there are cycles where no warp issues.]

LATENCY

50% of theoretical occupancy

31.2 active warps per cycle

3.57 warps eligible per cycle

Hard to tell. Let’s start with occupancy

OCCUPANCY Each SM has limited resources

64K registers (32-bit) shared by the threads

Up to 48KB of shared memory

16 block slots

Full occupancy: 2048 threads per SM (64 warps)

Values for SM 3.0/3.5; they vary with Compute Capability

OCCUPANCY (BLOCK DIMENSION) Limited by the number of blocks

Blocks are too small (64 threads/block): 16 blocks x 64 threads = 1024 threads, only 50% of the 2048-thread maximum

OCCUPANCY (BLOCK DIMENSION)

Increase the block size for (slightly) better occupancy

OCCUPANCY (BLOCK DIMENSION) Increase the block size to 128 threads (8x16)

It runs slightly faster: 6.075ms

Kernel            Time     Speedup
Original version  6.265ms
Larger blocks     6.075ms  1.03x

ITERATION 2

TRACE THE APPLICATION

The hotspot is still gaussian_filter_7x7_v0

Kernel            Time     Speedup
Original version  6.265ms
Larger blocks     6.075ms  1.03x

IDENTIFY THE MAIN LIMITER Is it limited by memory bandwidth?

Is it limited by instruction throughput?

Is it limited by latency?

MEMORY BANDWIDTH Low utilization of the L2$ BW, and DRAM BW < 2%

Not limited by memory bandwidth

INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated

Not limited by the instruction throughput

LATENCY (OCCUPANCY) Limited occupancy (56%) but 4.62 eligible warps/cycle (> 4)

There is probably something else limiting our performance

MEMORY TRANSACTIONS Memory requests are issued per warp of 32 threads

L1 transaction: 128B – Alignment: 128B (0, 128, 256, …)

L2 transaction: 32B – Alignment: 32B (0, 32, 64, 96, …)

MEMORY TRANSACTIONS A warp issues 32x4B aligned and consecutive loads/stores

Threads read different elements of the same 128B segment

1x L1 transaction: 128B needed / 128B transferred

4x L2 transactions: 128B needed / 128B transferred

1x 128B L1 transaction per warp

4x 32B L2 transactions per warp

MEMORY TRANSACTIONS Threads in a warp read/write 4B words, 128B between words

Each thread reads the first 4B of a 128B segment

32x L1 transactions: 128B needed / 32x 128B transferred

32x L2 transactions: 128B needed / 32x 32B transferred

1x 128B L1 transaction per thread

1x 32B L2 transaction per thread
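A sketch of the two patterns as CUDA kernels (illustrative only, over 4B floats):

__global__ void coalesced(const float *data, float *out)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    // Consecutive threads read consecutive 4B words of one 128B
    // segment: 1x 128B L1 transaction per warp.
    out[idx] = data[idx];
}

__global__ void strided(const float *data, float *out)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    // 128B between the words read by consecutive threads: every
    // thread hits its own segment, 32x L1 transactions per warp.
    out[idx] = data[32 * idx];
}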


TRANSACTIONS AND REPLAYS A warp reads from addresses spanning 3 lines of 128B

1 instr. executed and 2 replays = 1 request and 3 transactions

[Timeline diagram: the instruction is issued once for threads 0-7/24-31, re-issued (1st replay) for threads 8-15, and re-issued again (2nd replay) for threads 16-23.]

TRANSACTIONS AND REPLAYS With replays, requests take more time and use more resources

— More instructions issued

— More memory traffic

— Increased execution time

[Timeline diagram: with replays, instructions 0-2 each occupy several issue slots and their data transfers are serialized, so each instruction completes later: extra latency, extra work for the SM, and extra memory traffic.]

TRANSACTIONS PER REQUEST Transactions per Request: 4.20 (Load) / 4.00 (Store)

Too many memory transactions (too much pressure on LSU)

TRANSACTIONS PER REQUEST Our blocks are 8x16

We should use blocks of size 32x4 (or 32x8); see the sketch after the diagram below

[Diagram: thread-index layout within a block. With 8x16 blocks, the 32 threads of a warp span four 8-pixel rows, so each warp load touches four separate memory segments. With 32x4 blocks, warps 0-3 each map to one contiguous 32-pixel row, with threadIdx.x running 0-31 along the row.]
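The fix is purely in the launch configuration; a sketch, assuming the usual grid computation over the image size (variable names illustrative):

// Before: dim3 block(8, 16);  each warp spans 4 rows -> ~4 transactions per request.
// After: each warp reads one contiguous 32-pixel row.
dim3 block(32, 4);
dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
gaussian_filter_7x7_v0<<<grid, block>>>(w, h, src, dst);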

IMPROVED MEMORY ACCESSES Blocks of size 32x4

It runs faster: 3.605ms

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x

ITERATION 3

TRACE THE APPLICATION

The hotspot is still gaussian_filter_7x7_v0

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x

We use the same block size for the Sobel filter kernel, which is why it also improves (2nd row of the Nsight table).

MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 2%

Not limited by memory bandwidth

INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated

Not limited by instruction throughput

LATENCY (OCCUPANCY) Limited occupancy and not enough eligible warps/cycle (2.0)

Need more active warps (i.e. occupancy)

LATENCY (OCCUPANCY) Limited by register usage

REDUCE REGISTER USAGE Use the __launch_bounds__ attribute

10 gives the best results (48 registers): 1.949ms

__global__ __launch_bounds__(128, 10)  // 128 = number of threads per block, 10 = minimum number of blocks per SM
void gaussian_filter_7x7_v1(int w, int h, const uchar *src, uchar *dst)

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x
Fewer registers         1.949ms  3.21x

ITERATION 4

TRACE THE APPLICATION

The hotspot is gaussian_filter_7x7_v1

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x
Fewer registers         1.949ms  3.21x

MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 2%

Not limited by memory bandwidth

INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated

Not limited by instruction throughput

LATENCY (OCCUPANCY) Enough active and eligible warps per cycle

Not limited by a lack of occupancy

BRANCH DIVERGENCE Threads of a warp take different branches of a conditional

if (threadIdx.x < 12) { ... } else { ... }

[Timeline diagram: threads 0-11 execute the "if" branch while the others idle, then threads 12-31 execute the "else" branch.]

Execution time = "if" branch + "else" branch

BRANCH DIVERGENCE But no divergence in our code

OTHER IDEAS Shared memory bank conflicts

— Conflicts: two threads read different addresses in the same bank

Excessive use of synchronization (__syncthreads)

But neither of those applies in our case

MEMORY TRANSFERS Our image size: 2560x1600 = 4MB

We read 385MB from L2$: too much traffic! With a 7x7 filter, each input pixel is re-read by up to 49 neighboring output pixels.

MEMORY TRANSFERS We do not saturate the issue slot of the Load-Store unit

But we saturate inside the Load-Store unit

Unfortunately, we cannot detect that with Nsight (yet)

Adjacent pixels have neighbors in common

We should use shared memory to store those common pixels

SHARED MEMORY

__shared__ unsigned char smem_pixels[10][64];  // 10 rows x 64 columns: a 32x4 block plus a 3-pixel apron on each side
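A sketch of how such a tile can be filled and used (assumptions: 32x4 blocks, clamped borders, and the constant-memory weights from the naive sketch earlier; the repo's gaussian_filter_7x7_v2 may differ in details):

__global__ void gaussian_7x7_smem(int w, int h, const unsigned char *src, unsigned char *dst)
{
    __shared__ unsigned char smem_pixels[10][64];

    int x0 = blockIdx.x*32, y0 = blockIdx.y*4;

    // Cooperatively load the 38x10 tile (32x4 block + 3-pixel apron),
    // strided by the 32x4 = 128 threads of the block.
    for (int i = threadIdx.y*32 + threadIdx.x; i < 10*38; i += 128) {
        int tx = i % 38, ty = i / 38;
        int xx = min(max(x0 + tx - 3, 0), w-1);  // clamp at the borders
        int yy = min(max(y0 + ty - 3, 0), h-1);
        smem_pixels[ty][tx] = src[yy*w + xx];
    }
    __syncthreads();

    int x = x0 + threadIdx.x, y = y0 + threadIdx.y;
    if (x >= w || y >= h) return;

    int sum = 0, norm = 0;
    for (int dy = 0; dy < 7; ++dy)
        for (int dx = 0; dx < 7; ++dx) {
            int wgt = g_weights[dy][dx];  // constant-memory weights, as before
            sum  += wgt * smem_pixels[threadIdx.y + dy][threadIdx.x + dx];
            norm += wgt;
        }
    dst[y*w + x] = (unsigned char)(sum / norm);
}

Each tile pixel is now fetched from L2$/DRAM once per block instead of up to 49 times.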

USE SHARED MEMORY Use shared memory to keep data on the SM: 1.211ms

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x
Fewer registers         1.949ms  3.21x
Shared memory           1.211ms  5.17x

ITERATION 5

TRACE THE APPLICATION

The hotspot is gaussian_filter_7x7_v2

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x
Fewer registers         1.949ms  3.21x
Shared memory           1.211ms  5.17x

MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 4%

Not limited by memory bandwidth

INSTRUCTION THROUGHPUT Not saturating the schedulers

But we use 73% of the Load-Store issue slot

LOAD-STORE INSTRUCTIONS LSU executes global and shared memory instructions

Change global loads to use the Read-Only path

READ-ONLY CACHE (TEXTURE UNITS)

[Diagram: loads through the Read-Only path go via the Texture Units, skipping the LSU, and are cached; they still reach global memory (the framebuffer) through the L2$.]

READ-ONLY PATH Annotate our pointer with const __restrict

The compiler generates LDG instructions: 1.019ms

Kernel                  Time     Speedup
Original version        6.265ms
Larger blocks           6.075ms  1.03x
Better memory accesses  3.605ms  1.74x
Fewer registers         1.949ms  3.21x
Shared memory           1.211ms  5.17x
Read-Only path          1.019ms  6.15x

__global__ void gaussian_filter_7x7_v3(int w, int h, const uchar *__restrict src, uchar *dst)
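On SM 3.5, the same path can also be requested explicitly with the __ldg() intrinsic instead of annotating the pointer (uchar being the deck's byte typedef):

uchar v = __ldg(&src[y*w + x]);  // load through the read-only (texture) cache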

INSTRUCTION THROUGHPUT We are doing much better

Things to investigate next:

— Improve memory efficiency

— Reduce computational intensity (separable filter; sketched below)
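Because the 7x7 weights are the outer product of [1 2 3 4 3 2 1] with itself, the blur separates into a horizontal and a vertical 1D pass: 7+7 = 14 taps per pixel instead of 49. A sketch of the horizontal pass (the vertical pass is symmetric and divides by the total weight 256; names illustrative):

__constant__ int g_weights_1d[7] = {1, 2, 3, 4, 3, 2, 1};

__global__ void gaussian_1x7_horizontal(int w, int h, const unsigned char *src, unsigned short *tmp)
{
    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int sum = 0;
    for (int dx = -3; dx <= 3; ++dx) {
        int xx = min(max(x + dx, 0), w-1);  // clamp at the borders
        sum += g_weights_1d[dx + 3] * src[y*w + xx];
    }
    tmp[y*w + x] = (unsigned short)sum;  // at most 16*255 = 4080, fits in 16 bits
}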

MORE IN OUR COMPANION CODE

Kernel                                                      Time     Speedup
Original version                                            6.265ms
Larger blocks                                               6.075ms  1.03x
Better memory accesses                                      3.605ms  1.74x
Fewer registers                                             1.949ms  3.21x
Shared memory                                               1.211ms  5.17x
Read-Only path                                              1.019ms  6.15x
Separable filter                                            0.656ms  9.55x
Process two pixels per thread (improve memory efficiency + add ILP)  0.511ms  12.26x
Use 64-bit shared memory (remove bank conflicts)            0.499ms  12.56x
Use float instead of int (increase instruction throughput)  0.434ms  14.44x

Your next idea!!!

https://github.com/jdemouth/nsight-gtc2014

CONCLUSION

OPTIMIZATION METHOD Trace the Application

Identify the Hot Spot and Profile it

Identify the Performance Limiter

— Memory Bandwidth

— Instruction Throughput

— Latency

Optimize the Code

Iterate

https://github.com/jdemouth/nsight-gtc2014