Page 1: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(1)

ISA Execution and Occupancy

©Sudhakar Yalamanchili unless otherwise noted

Page 2: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(2)

Objectives

• Understand the sequencing of instructions in a kernel through a set of scalar pipelines (width =warp)

• Have a basic, high level understanding of instruction fetch, decode, issue, and memory access

Page 3: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(3)

Reading Assignment

• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6.3

• CUDA Programming Guide http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract

• Get the CUDA Occupancy Calculator: find it at nvidia.com, or use the web version at http://lxkarthi.github.io/cuda-calculator/


Page 4: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(4)

Recap

__global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

__host__ void vecAdd() {
    dim3 DimGrid(ceil(n / 256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

[Figure: a launched kernel's thread blocks Blk 0 … Blk N-1 are scheduled onto the GPU's multiprocessors M0 … Mk, which share device RAM]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois,

Urbana-Champaign
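For context, a minimal self-contained host-side version of this recap might look as follows (a sketch: the kernel and the 256-thread block size are from the slide; the array names, allocation, and copies are illustrative):

    #include <cuda_runtime.h>

    __global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }

    void vecAdd(float *A_h, float *B_h, float *C_h, int n) {
        size_t bytes = n * sizeof(float);
        float *A_d, *B_d, *C_d;
        cudaMalloc(&A_d, bytes);                          // device buffers
        cudaMalloc(&B_d, bytes);
        cudaMalloc(&C_d, bytes);
        cudaMemcpy(A_d, A_h, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(B_d, B_h, bytes, cudaMemcpyHostToDevice);
        dim3 DimGrid((n + 255) / 256, 1, 1);              // ceil(n/256) thread blocks
        dim3 DimBlock(256, 1, 1);                         // 256 threads per block
        vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
        cudaMemcpy(C_h, C_d, bytes, cudaMemcpyDeviceToHost);
        cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
    }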

Page 5: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(5)

Kernel Launch: CUDA Streams

• Commands from the host are issued through streams (illustrated in the sketch below)
   o Kernels in the same stream execute sequentially
   o Kernels in different streams may execute concurrently

• Streams are mapped to hardware queues in the device's kernel management unit (KMU)
   o Mapping multiple streams to the same queue serializes some kernels

• A kernel launch distributes its thread blocks to the SMs

[Figure: the host processor issues commands through streams to hardware queues in the Kernel Management Unit (device); the KMU dispatches kernels to the SMs]
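As a hedged illustration of the stream rules above (the kernel and variable names are made up for this sketch, not from the slide):

    #include <cuda_runtime.h>

    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    void launch_in_streams(float *x_d, float *y_d, int n) {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);
        int threads = 256, blocks = (n + threads - 1) / threads;
        scale<<<blocks, threads, 0, s0>>>(x_d, 2.0f, n);  // stream s0
        scale<<<blocks, threads, 0, s0>>>(x_d, 0.5f, n);  // same stream: runs after the first launch
        scale<<<blocks, threads, 0, s1>>>(y_d, 3.0f, n);  // different stream: may overlap with s0's work
        cudaStreamSynchronize(s0);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }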

Page 6: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(6)

TBs, Warps, & Scheduling

• Imagine one TB has 64 threads or 2 warps

• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels

[Figure: thread blocks awaiting distribution to SMX0 … SMX12; each SMX has its own register file, cores, and L1 cache/shared memory]

Page 7: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(7)

TBs, Warps, & Utilization

• One TB has 64 threads or 2 warps

• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels

[Figure (animation step): thread blocks being distributed across SMX0 … SMX12; each SMX has its own register file, cores, and L1 cache/shared memory]

Page 8: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(8)

TBs, Warps, & Utilization

• One TB has 64 threads or 2 warps

• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels

[Figure (animation step): thread blocks distributed across SMX0 … SMX12]

Page 9: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(9)

TBs, Warps, & Utilization

• One TB has 64 threads or 2 warps

• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels

[Figure (animation step): thread blocks distributed across SMX0 … SMX12; remaining TBs are queued]

Page 10: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(10)

NVIDIA GK110 (Kepler): Thread Block Scheduler

Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/

Page 11: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(11)

SMX Organization

Multiple Warp Schedulers

192 cores – 6 clusters of 32 cores each

64K 32-bit registers

Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/

Page 12: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(12)

Execution Sequencing in a Single SM

• Cycle level view of the execution of a GPU kernel

[Figure: the SM pipeline. I-Fetch selects among pending warps (Warp 1, Warp 2, … Warp 6); instructions flow through an I-Buffer, Decode, Issue, register file read (RF/PRF), and a set of scalar pipelines; memory accesses go to the D-Cache (all hit? / miss?) before Writeback]

Execution Sequencing?

Page 13: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(13)

Example: VectorAdd on GPU

CUDA:

__global__ void vector_add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N) c[index] = a[index] + b[index];
}

PTX (assembly):

        setp.lt.s32  %p, %r5, %rd4;     // r5 = index, rd4 = N
    @p  bra  L1;
        bra  L2;
L1:     ld.global.f32  %f1, [%r6];      // r6 = &a[index]
        ld.global.f32  %f2, [%r7];      // r7 = &b[index]
        add.f32  %f3, %f1, %f2;
        st.global.f32  [%r8], %f3;      // r8 = &c[index]
L2:     ret;
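As a tooling note (an assumption about workflow, not stated on the slide): PTX like the listing above can be regenerated from the CUDA source with nvcc -ptx vector_add.cu, which writes vector_add.ptx; register numbers and instruction ordering will vary with the compiler version.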

Page 14: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(14)

Example: VectorAdd on GPU

• N = 8, 8 threads, 1 block, warp size = 4

• 1 SM, 4 cores

• Pipeline:
   Fetch:
   o One instruction from each warp
   o Round-robin through all warps
   Execution:
   o In-order execution within warps
   o With proper data forwarding
   o 1 cycle per stage

• How many warps?
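(With 8 threads and a warp size of 4, the single block is split into 2 warps, Warp0 and Warp1, which is what the following slides trace.)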

Page 15: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(15)

Execution Sequence

setp.lt.s32  %p, %r5, %rd4;
@p  bra  L1;
    bra  L2;
L1: ld.global.f32  %f1, [%r6];
    ld.global.f32  %f2, [%r7];
    add.f32  %f3, %f1, %f2;
    st.global.f32  [%r8], %f3;
L2: ret;

[Figure: Warp0 and Warp1 step through the code above on one SM: a fetch/decode (FE/DE) front end feeding four scalar lanes, each with EXE, MEM, and WB stages. The pipeline starts out empty.]

Page 16: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(16)

Execution Sequence (cont.)

(The PTX listing and the pipeline diagram are the same as on the previous slide; only the pipeline contents change from here on.)

[Pipeline snapshot: setp (W0) enters fetch/decode; the four lanes are still empty.]

Page 17: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(17)

Execution Sequence (cont.)

[Pipeline snapshot: setp (W1) in fetch, setp (W0) in decode.]

Page 18: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(18)

Execution Sequence (cont.)

[Pipeline snapshot: the branch of W0 in fetch, setp (W1) in decode; setp (W0) issued across the four lanes.]

Page 19: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(19)

Execution Sequence (cont.)

[Pipeline snapshot: @p bra (W1) in fetch, @p bra (W0) in decode; setp (W1) and setp (W0) in the lanes.]

Page 20: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(20)

Execution Sequence (cont.)

[Pipeline snapshot: bra L2 in fetch, @p bra (W1) in decode; bra (W0), setp (W1), and setp (W0) in the lanes.]

Page 21: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(21)

Execution Sequence (cont.)

[Pipeline snapshot: bra L2 in the front end; bra (W1), bra (W0), and setp (W1) in the lanes.]

Page 22: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(22)

Execution Sequence (cont.)

[Pipeline snapshot: ld (W0) in fetch/decode; bra (W1) and bra (W0) draining through the lanes.]

Page 23: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(23)

Execution Sequence (cont.)

[Pipeline snapshot: ld (W1) in fetch, ld (W0) in decode; bra (W1) in the lanes.]

Page 24: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(24)

Execution Sequence (cont.)

[Pipeline snapshot: the next ld (W0) in fetch, ld (W1) in decode; ld (W0) in the lanes.]

Page 25: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(25)

Execution Sequence (cont.)

[Pipeline snapshot: the next ld (W1) in fetch, ld (W0) in decode; ld (W1) and ld (W0) in the lanes.]

Page 26: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(26)

Execution Sequence (cont.)

[Pipeline snapshot: add (W0) in fetch, ld (W1) in decode; the earlier loads of W1 and W0 in the lanes.]

Page 27: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(27)

Execution Sequence (cont.)

[Pipeline snapshot: add (W1) in fetch, add (W0) in decode; the loads of W0 and W1 in the lanes.]

Page 28: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(28)

Execution Sequence (cont.)

[Pipeline snapshot: st (W0) in fetch, add (W1) in decode; add (W0) and the remaining loads in the lanes.]

Page 29: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(29)

Execution Sequence (cont.)

[Pipeline snapshot: st (W1) in fetch, st (W0) in decode; add (W1), add (W0), and ld (W1) in the lanes.]

Page 30: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(30)

Execution Sequence (cont.)

[Pipeline snapshot: ret (W0) in fetch, st (W1) in decode; st (W0), add (W1), and add (W0) in the lanes.]

Page 31: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(31)

Execution Sequence (cont.)

[Pipeline snapshot: ret (W1) in fetch, ret (W0) in decode; st (W1), st (W0), and add (W1) in the lanes.]

Page 32: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(32)

Execution Sequence (cont.)

[Pipeline snapshot: ret (W1) in decode; ret (W0), st (W1), and st (W0) in the lanes.]

Page 33: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(33)

Execution Sequence (cont.)

[Pipeline snapshot: the front end is empty; ret (W1), ret (W0), and st (W1) drain through the lanes.]

Page 34: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(34)

Execution Sequence (cont.)

[Pipeline snapshot: only the two rets remain in the lanes.]

Page 35: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(35)

Execution Sequence (cont.)

[Pipeline snapshot: the last ret reaches writeback; the pipeline is almost drained.]

Page 36: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(36)

Execution Sequence (cont.)

[Pipeline snapshot: the pipeline has drained; both warps have completed.]

Note: Idealized execution without memory or special function delays

Page 37: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(37)

Fine Grained Multithreading

• First introduced in the Denelcor HEP (1980’s)

• Can eliminate data bypassing within an instruction stream
   o Maintain separation between successive instructions in a warp

• Simplifies dependency handling between successive instructions
   o E.g., have only one instruction per warp in the pipeline at a time

• What happens when warp size > #lanes? Why have such a relationship?

Page 38: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(38)

Multi-Cycle Dispatch

[Figure: one warp is dispatched to the scalar pipelines over two cycles (one dispatch in Cycle 1, another in Cycle 2)]

• Issue the warp over multiple cycles

• NB: fetch bandwidth

Page 39: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(39)

SIMD vs. SIMT

Flynn taxonomy (instruction streams vs. data streams):

                   Single Data    Multiple Data
Single Instr.      SISD           SIMD  (e.g., SSE/AVX)
Multiple Instr.    MISD           MIMD  (e.g., pthreads)

[Figure: a single scalar thread with its register file (SISD); a single thread driving vector lanes in lockstep over one register file (SIMD, synchronous operation); multiple loosely synchronized threads (MIMD); and SIMT, multiple threads with per-thread register files (RF) executing synchronously (e.g., PTX, HSA)]

Page 40: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(40)

Comparison

[Figure: MIMD (independent threads, each with its own register file), SIMD (one register file feeding vector lanes), and SIMT (per-thread register files, synchronous execution)]

• MIMD
   o Multiple independent threads and explicit synchronization
   o TLP, DLP, ILP, SPMD

• SIMD
   o Single thread + vector operations
   o DLP
   o Easily embedded in sequential code

• SIMT
   o Multiple, synchronous threads
   o TLP, DLP, ILP (more recent)
   o Explicitly parallel

Page 41: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(41)

Performance Metrics: Occupancy

• Performance implications of programming model properties
   o Warps, thread blocks, register usage

• Capture which resources can be dynamically shared, and how to reason about the resource demands of a CUDA kernel
   o Enables device-specific online tuning of kernel parameters

Page 42: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(42)

CUDA Occupancy

• Occupancy = (#active warps) / (#maximum active warps)
   o A measure of how well you are using the SM's capacity (see the sketch after this list)

• Limits on the numerator?
   o Registers/thread
   o Shared memory/thread block
   o Number of scheduling slots: thread blocks or warps

• Limits on the denominator?
   o Memory bandwidth
   o Scheduler slots

• What is the performance impact of varying kernel resource demands?
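As a hedged sketch of computing this ratio at run time (vecAddKernel is the earlier recap kernel; the helper function is illustrative), the CUDA runtime can report the numerator directly:

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void vecAddKernel(float *A, float *B, float *C, int n);  // defined as in the recap slide

    void report_occupancy(int blockSize) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int maxBlocksPerSM = 0;   // resident blocks of this kernel per SM at this block size
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, vecAddKernel, blockSize, 0);

        int activeWarps = maxBlocksPerSM * blockSize / prop.warpSize;
        int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
        printf("Occupancy at %d threads/block: %.2f\n", blockSize, (double)activeWarps / maxWarps);
    }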

Page 43: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(43)

Performance Goals

[Figure: two performance goals, core utilization and memory bandwidth (BW) utilization]

What are the limiting factors?

Page 44: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(44)

Resource Limits on Occupancy

[Figure: the Kernel Distributor dispatches to the SMs (backed by DRAM). Within an SM: the SM scheduler and thread block control limit the number of thread blocks; warp contexts and warp schedulers limit the number of threads; the register file limits the number of threads; L1/shared memory limits the number of thread blocks]

SM – Stream Multiprocessor; SP – Stream Processor

Page 45: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(45)

Programming Model Attributes

[Figure: Grid 1 contains Blocks (0,0), (1,0), (0,1), (1,1); Block (1,1) contains threads indexed Thread (x,y,z)]

• What do we need to know about the target processor? (queried in the sketch below)
   o #SMs
   o Max threads/block
   o Max threads/dim
   o Warp size

• How do these map to kernel configuration parameters that we can control?
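A minimal sketch of querying exactly these attributes at run time (the printed labels are mine; the field names are the CUDA runtime's cudaDeviceProp members):

    #include <cuda_runtime.h>
    #include <cstdio>

    void print_target_attributes() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);              // device 0
        printf("#SMs:              %d\n", prop.multiProcessorCount);
        printf("Max threads/block: %d\n", prop.maxThreadsPerBlock);
        printf("Max threads/dim:   %d x %d x %d\n",
               prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Warp size:         %d\n", prop.warpSize);
    }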

Page 46: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(46)

Impact of Thread Block Size

• Consider Fermi with 1536 threads/SM
   o With 512 threads/block, at most 3 thread blocks can be executing at an instant
   o With 128 threads/block, up to 12 thread blocks per SM
   o Consider how many instructions can be in flight

• Now consider a limit of 8 thread blocks/SM
   o Only 1024 threads are active at a time
   o Occupancy = 0.666 (worked out after this list)

• To maximize utilization, thread block size should balance demand for thread blocks vs. thread slots
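Working the slide's numbers: 1536 / 512 = 3 resident blocks vs. 1536 / 128 = 12 resident blocks per SM. If the SM also caps residency at 8 thread blocks, then with 128-thread blocks only 8 × 128 = 1024 of the 1536 thread slots are occupied, giving occupancy = 1024 / 1536 ≈ 0.67.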

Page 47: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(47)

Impact of #Registers Per Thread

• Assume 10 registers/thread and a thread block size of 256

• Number of registers per SM = 16K

• A thread block then requires 2560 registers, for a maximum of 6 thread blocks per SM
   o Uses all 1536 thread slots

• What is the impact of increasing the number of registers per thread by 2?
   o The granularity of management is a thread block!
   o Loss of concurrency of 256 threads! (worked out below)
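Working the slide's numbers: 10 registers/thread × 256 threads = 2,560 registers per block, and ⌊16,384 / 2,560⌋ = 6 blocks, i.e. 6 × 256 = 1,536 threads, which fills all of the SM's thread slots. At 12 registers/thread a block needs 3,072 registers, so only ⌊16,384 / 3,072⌋ = 5 blocks fit: 1,280 threads, a loss of one whole block (256 threads) of concurrency.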

Page 48: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(48)

Impact of Shared Memory

• Shared memory is allocated per thread block
   o Can limit the number of thread blocks executing concurrently per SM

• As we change the gridDim and blockDim parameters, how does the demand change for shared memory, thread slots, and thread block slots? (see the sketch below)
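A minimal sketch of how per-block shared memory enters the launch configuration (the kernel and variable names are illustrative; the 48 KB figure is just an example capacity):

    __global__ void tileKernel(const float *in, float *out, int n) {
        extern __shared__ float tile[];                  // sized by the launch below
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;      // stage one element per thread
        __syncthreads();
        if (i < n) out[i] = 2.0f * tile[threadIdx.x];
    }

    void launch(const float *in_d, float *out_d, int n) {
        int threads = 256;
        size_t smemBytes = threads * sizeof(float);       // 1 KB of shared memory per block
        // Each resident block holds smemBytes of shared memory; with, say, 48 KB per SM,
        // shared memory alone would allow up to 48 resident blocks here.
        tileKernel<<<(n + threads - 1) / threads, threads, smemBytes>>>(in_d, out_d, n);
    }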

Page 49: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(49)

Thread Granularity

• How fine grained should threads be? Coarser-grain threads mean:
   o Increased register pressure and shared memory pressure
   o Lower pressure on thread block slots and thread slots

• Merge thread blocks

• Share row values
   o Reduces memory bandwidth

[Figure: matrix multiply with d_M, d_N, and d_P; the two adjacent threads labeled "merge these two threads" can share row values]

Page 50: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(50)

Balance

[Figure: the balance among #threads/block, #thread blocks, shared memory/thread block, and #registers/thread]

• Navigate the tradeoffs to maximize core utilization and memory bandwidth utilization for the target device

• Goal: Increase occupancy until one or the other is saturated

Page 51: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(51)

Performance Tuning

• Auto-tuning to maximize performance for a device

• Query device properties to tune kernel parameters prior to launch

• Tune kernel properties to maximize efficiency of execution

• http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/index.html
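A hedged sketch of this kind of pre-launch tuning with the runtime occupancy API (vecAddKernel is the earlier recap kernel; the wrapper function is illustrative):

    #include <cuda_runtime.h>

    __global__ void vecAddKernel(float *A, float *B, float *C, int n);  // defined elsewhere

    void tuned_launch(float *A_d, float *B_d, float *C_d, int n) {
        int minGridSize = 0, blockSize = 0;
        // Ask the runtime for a block size that maximizes occupancy for this kernel.
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, vecAddKernel, 0, 0);
        int gridSize = (n + blockSize - 1) / blockSize;   // enough blocks to cover n elements
        vecAddKernel<<<gridSize, blockSize>>>(A_d, B_d, C_d, n);
    }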

Page 52: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(52)

Device Properties

Page 53: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

(53)

Summary

• Sequence warps through the scalar pipelines

• Overlap execution from multiple warps to hide memory latency (or special functions)

• Out-of-order execution is not shown; a scoreboard is needed for correctness

• The occupancy calculator provides a first-order estimate of occupancy

Page 54: (1) ISA Execution and Occupancy ©Sudhakar Yalamanchili unless otherwise noted.

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Questions?