
(1)

ISA Execution and Occupancy

©Sudhakar Yalamanchili unless otherwise noted

(2)

Objectives

• Understand the sequencing of instructions in a kernel through a set of scalar pipelines (width = warp size)

• Have a basic, high level understanding of instruction fetch, decode, issue, and memory access

(3)

Reading Assignment

• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6.3

• CUDA Programming Guide http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract

• Get the CUDA Occupancy Calculator: find it at nvidia.com, or use the web version at http://lxkarthi.github.io/cuda-calculator/

(4)

Recap

__global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

__host__ void vecAdd() {
    dim3 DimGrid(ceil(n / 256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

[Figure: a kernel is a set of thread blocks, Blk 0 … Blk N-1, scheduled onto the GPU's multiprocessors M0 … Mk, which are attached to RAM]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

(5)

Kernel Launch: CUDA Streams

[Figure: the host processor issues commands through streams into HW queues in the device's Kernel Management Unit (KMU), which dispatches kernels to the SMs]

• Commands issued by the host go through streams: kernels in the same stream execute sequentially, while kernels in different streams may execute concurrently

• Streams are mapped to hardware queues in the device's kernel management unit (KMU); mapping multiple streams to the same queue serializes some kernels

• Kernel launch distributes thread blocks to SMs (see the stream sketch below)
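The sketch below is a minimal illustration (not from the slides; kernelA and kernelB are placeholder kernels) of placing two kernels in separate streams so that the KMU is free to dispatch them concurrently:

#include <cuda_runtime.h>

// Placeholder kernels used only to illustrate stream placement.
__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, block = 256;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Kernels in different streams map to different HW queues in the KMU and
    // may execute concurrently; kernels in the same stream serialize.
    kernelA<<<(n + block - 1) / block, block, 0, s0>>>(x, n);
    kernelB<<<(n + block - 1) / block, block, 0, s1>>>(y, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(x);
    cudaFree(y);
    return 0;
}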

(6)

TBs, Warps, & Scheduling

• Imagine one TB has 64 threads or 2 warps

• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels

[Figure: SMX0, SMX1, …, SMX12, each with a register file, cores, and L1 cache/shared memory; thread blocks await scheduling onto the SMXs]

(7)

TBs, Warps, & Utilization

• One TB has 64 threads or 2 warps

• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels

[Figure: thread blocks being assigned to SMX0 … SMX12, each with a register file, cores, and L1 cache/shared memory]


(9)

TBs, Warps, & Utilization

• One TB has 64 threads or 2 warps

• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels

[Figure: warp slots on SMX0 … SMX12 filled with thread blocks; remaining TBs are queued]

(10)

NVIDIA GK110 (Kepler): Thread Block Scheduler

Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/

(11)

SMX Organization

Multiple Warp Schedulers

192 cores – 6 clusters of 32 cores each

64K 32-bit registers

Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/

(12)

Execution Sequencing in a Single SM

• Cycle-level view of the execution of a GPU kernel

[Figure: SM pipeline: I-Fetch, I-Buffer, Decode, Issue, then the register file (RF) feeding several scalar pipelines and the D-cache (all hit? miss?) before Writeback; warps such as Warp 1, Warp 2, …, Warp 6 wait in the pending-warps pool]

Execution sequencing?

(13)

Example: VectorAdd on GPU

CUDA:

__global__ void vector_add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N) c[index] = a[index] + b[index];
}

PTX (assembly):

    setp.lt.s32 %p, %r5, %rd4;   // r5 = index, rd4 = N
    @p bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];    // r6 = &a[index]
    ld.global.f32 %f2, [%r7];    // r7 = &b[index]
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;    // r8 = &c[index]
L2: ret;

(14)

Example: VectorAdd on GPU

• N = 8, 8 threads, 1 block, warp size = 4

• 1 SM, 4 cores

• Pipeline:
  Fetch:
  o One instruction from each warp
  o Round-robin through all warps
  Execution:
  o In-order execution within warps
  o With proper data forwarding
  o 1 cycle per stage

• How many warps? (8 threads / 4 threads per warp = 2 warps)

(15)

Execution Sequence

[Figure: a fetch/decode (FE/DE) stage feeds four scalar lanes, each with EXE, MEM, and WB stages; Warp 0 and Warp 1 execute the VectorAdd PTX shown above (repeated on each following slide for reference); the pipeline starts empty]

(16)

Execution Sequence (cont.)

[Figure: setp (Warp 0) in fetch/decode]

(17)

Execution Sequence (cont.)

[Figure: setp (Warp 1) fetched; setp (Warp 0) in decode/issue]

(18)

Execution Sequence (cont.)

[Figure: bra (Warp 0) fetched; setp (Warp 1) in issue; setp (Warp 0) executing on the four lanes (EXE)]

(19)

Execution Sequence (cont.)

[Figure: @p bra (Warp 1) fetched; @p bra (Warp 0) in issue; setp (Warp 1) in EXE; setp (Warp 0) in MEM]

(20)

Execution Sequence (cont.)

[Figure: bra L2 (Warp 0) fetched; @p bra (Warp 1) in issue; bra (Warp 0) in EXE; setp (Warp 1) in MEM; setp (Warp 0) in WB]

(21)

Execution Sequence (cont.)

[Figure: bra L2 fetched; bra (Warp 1) in EXE; bra (Warp 0) in MEM; setp (Warp 1) in WB]

(22)

Execution Sequence (cont.)

[Figure: ld (Warp 0) fetched; bra (Warp 1) and bra (Warp 0) draining through the lanes]

(23)

Execution Sequence (cont.)

[Figure: ld (Warp 1) fetched; ld (Warp 0) in issue; bra (Warp 1) completing in the lanes]

(24)

Execution Sequence (cont.)

[Figure: second ld (Warp 0) fetched; ld (Warp 1) in issue; first ld (Warp 0) in EXE]

(25)

Execution Sequence (cont.)

[Figure: second ld (Warp 1) fetched; second ld (Warp 0) in issue; ld (Warp 1) in EXE; ld (Warp 0) in MEM]

(26)

Execution Sequence (cont.)

[Figure: add (Warp 0) fetched; ld (Warp 1) in issue; ld (Warp 1) in EXE; the two ld (Warp 0) instructions in MEM and WB]

(27)

Execution Sequence (cont.)

[Figure: add (Warp 1) fetched; add (Warp 0) in issue; the ld instructions of Warps 0 and 1 advancing through EXE, MEM, and WB]

(28)

Execution Sequence (cont.)

[Figure: st (Warp 0) fetched; add (Warp 1) in issue; add (Warp 0) in EXE; ld (Warp 1) in MEM; ld (Warp 0) in WB]

(29)

Execution Sequence (cont.)

[Figure: st (Warp 1) fetched; st (Warp 0) in issue; add (Warp 1) in EXE; add (Warp 0) in MEM; ld (Warp 1) in WB]

(30)

Execution Sequence (cont.)

[Figure: ret (Warp 0) fetched; st (Warp 1) in issue; st (Warp 0) in EXE; add (Warp 1) in MEM; add (Warp 0) in WB]

(31)

Execution Sequence (cont.)

[Figure: ret (Warp 1) fetched; ret (Warp 0) in issue; st (Warp 1) in EXE; st (Warp 0) in MEM; add (Warp 1) in WB]

(32)

Execution Sequence (cont.)

[Figure: the ret instructions entering the lanes; st (Warp 1) in MEM; st (Warp 0) in WB]

(33)

Execution Sequence (cont.)

[Figure: the ret instructions of both warps in EXE and MEM; st (Warp 1) in WB]

(34)

Execution Sequence (cont.)

[Figure: the two ret instructions draining through MEM and WB]

(35)

Execution Sequence (cont.)

[Figure: the final ret completing in WB]

(36)

Execution Sequence (cont.)

[Figure: pipeline empty; both warps have completed]

Note: Idealized execution without memory or special function delays

(37)

Fine Grained Multithreading

• First introduced in the Denelcor HEP (1980’s)

• Can eliminate data bypassing in an instruction stream: maintain separation between successive instructions in a warp

• Simplifies dependency handling between successive instructions, e.g., by having only one instruction from a warp in the pipeline at a time

• What happens when warp size > #lanes? Why have such a relationship?

(38)

Multi-Cycle Dispatch

[Figure: eight scalar pipelines; a single warp is dispatched over multiple cycles, one part in Cycle 1 and the rest in Cycle 2]

Issue the warp over multiple cycles. NB: fetch bandwidth.

(39)

SIMD vs. SIMT

[Figure: Flynn taxonomy, instruction streams × data streams: SISD, SIMD, MISD, MIMD]

• SISD: a single scalar thread

• SIMD: a single thread issuing synchronous operations over a shared register file and vector lanes (e.g., SSE/AVX)

• MIMD: multiple, loosely synchronized threads (e.g., pthreads)

• SIMT: multiple threads executing synchronously, each lane with its own register file (RF) (e.g., PTX, HSA)

(40)

Comparison

[Figure: MIMD (independent threads, separate register files), SIMD (one register file feeding a vector ALU), SIMT (per-lane register files)]

• MIMD: multiple independent threads and explicit synchronization; TLP, DLP, ILP, SPMD

• SIMD: single thread + vector operations; DLP; easily embedded in sequential code

• SIMT: multiple, synchronous threads; TLP, DLP, ILP (more recent); explicitly parallel

(41)

Performance Metrics: Occupancy

• Performance implications of programming model properties: warps, thread blocks, register usage

• Captures which resources can be dynamically shared, and how to reason about the resource demands of a CUDA kernel; enables device-specific online tuning of kernel parameters

(42)

CUDA Occupancy

• Occupancy = (#active warps) / (#maximum active warps): a measure of how well you are using the SM's maximum capacity (see the query sketch after this list)

• Limits on the numerator? Registers/thread, shared memory/thread block, number of scheduling slots (thread blocks or warps)

• Limits on the denominator? Memory bandwidth, scheduler slots

• What is the performance impact of varying kernel resource demands?
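As a concrete illustration (a minimal sketch, not from the slides; myKernel and the block size are placeholders), the CUDA runtime's occupancy API reports how many blocks of a given size can be resident on one SM, from which occupancy follows directly:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *a, int n) {      // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256, blocksPerSM = 0;
    // Max resident blocks of myKernel per SM, given its register and
    // shared-memory demands (0 bytes of dynamic shared memory here).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0);

    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy = %d/%d warps = %.2f\n",
           activeWarps, maxWarps, (float)activeWarps / maxWarps);
    return 0;
}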

(43)

Performance Goals

Core utilization

Memory bandwidth (BW) utilization

What are the limiting factors?

(44)

Resource Limits on Occupancy

[Figure: the Kernel Distributor dispatches thread blocks to the SMs (SM scheduler, DRAM). Within an SM: warp schedulers and warp contexts; the register file limits the #threads; thread block control and L1/shared memory limit the #thread blocks; the SPs execute the warps. SM = streaming multiprocessor, SP = streaming processor]

(45)

Programming Model Attributes

[Figure: Grid 1 of thread blocks, Block (0,0) … Block (1,1); Block (1,1) expanded into its threads (0,0,0) … (1,0,3)]

• What do we need to know about the target processor?
  o #SMs
  o Max threads/block
  o Max threads/dimension
  o Warp size

• How do these map to kernel configuration parameters that we can control?

(46)

Impact of Thread Block Size

• Consider Fermi with 1536 thread slots per SM
  o With 512 threads/block, only up to 3 thread blocks can be executing at an instant
  o With 128 threads/block, 12 thread blocks per SM
  o Consider how many instructions can be in flight

• Consider a limit of 8 thread blocks/SM: with 128 threads/block, only 1024 threads are active at a time, so occupancy = 0.666

• To maximize utilization, thread block size should balance the demand for thread block slots vs. thread slots; the small calculation below makes the tradeoff concrete
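A host-side sketch using the Fermi-class numbers above (the 8-block-per-SM limit is the assumed value from the slide, not queried from a device):

#include <cstdio>
#include <algorithm>

int main() {
    const int maxThreadsPerSM = 1536;   // Fermi-class thread slots (from the slide)
    const int maxBlocksPerSM  = 8;      // assumed thread-block slot limit (from the slide)

    const int blockSizes[] = {512, 256, 128};
    for (int blockSize : blockSizes) {
        int blocksByThreads = maxThreadsPerSM / blockSize;          // 3, 6, 12
        int blocks          = std::min(blocksByThreads, maxBlocksPerSM);
        int activeThreads   = blocks * blockSize;
        printf("blockSize=%3d -> %2d blocks/SM, %4d threads, occupancy=%.3f\n",
               blockSize, blocks, activeThreads,
               (double)activeThreads / maxThreadsPerSM);
    }
    return 0;   // 512 -> 1.000, 256 -> 1.000, 128 -> 8 blocks, 1024 threads, 0.667
}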

(47)

Impact of #Registers Per Thread

• Assume 10 registers/thread and a thread block size of 256

• Number of registers per SM = 16K

• A thread block requires 2560 registers, for a maximum of 6 thread blocks per SM, which uses all 1536 thread slots

• What is the impact of increasing the number of registers per thread by 2? The granularity of management is a thread block, so one fewer block fits: a loss of 256 threads of concurrency (see the sketch below)
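The same arithmetic, written out as a small sketch (the limits are the slide's numbers, not queried from a device):

#include <cstdio>
#include <algorithm>

int main() {
    const int regsPerSM       = 16 * 1024;   // 16K registers per SM (from the slide)
    const int threadsPerBlock = 256;
    const int maxThreadsPerSM = 1536;

    const int regCounts[] = {10, 12};        // 10 regs/thread vs. 10 + 2
    for (int regsPerThread : regCounts) {
        int regsPerBlock   = regsPerThread * threadsPerBlock;      // 2560 or 3072
        int blocksByRegs   = regsPerSM / regsPerBlock;             // 6 or 5
        int blocksByThread = maxThreadsPerSM / threadsPerBlock;    // 6
        int blocks         = std::min(blocksByRegs, blocksByThread);
        printf("%2d regs/thread -> %d blocks/SM, %d resident threads\n",
               regsPerThread, blocks, blocks * threadsPerBlock);
    }
    return 0;   // 10 regs -> 6 blocks, 1536 threads; 12 regs -> 5 blocks, 1280 threads
}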

(48)

Impact of Shared Memory

• Shared memory is allocated per thread block, which can limit the number of thread blocks executing concurrently per SM

• As we change the gridDim and blockDim parameters, how does the demand change for shared memory, thread slots, and thread block slots? (A launch sketch follows below.)
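A minimal sketch (the kernel and sizes are illustrative assumptions, not from the slides) showing that per-block dynamic shared memory is requested at launch time; the more each block requests, the fewer blocks can be resident on an SM:

#include <cuda_runtime.h>

// Each block stages its slice of the input in dynamically sized shared memory.
__global__ void scaleWithTile(const float *in, float *out, int n) {
    extern __shared__ float tile[];                 // sized by the launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

int main() {
    const int n = 1 << 20, block = 256;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    size_t smemPerBlock = block * sizeof(float);    // 1 KB per block here
    // Concurrent blocks per SM <= (shared memory per SM) / smemPerBlock,
    // on top of the thread-slot and register limits.
    scaleWithTile<<<(n + block - 1) / block, block, smemPerBlock>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}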

(49)

Thread Granularity

• How fine grained should threads be?
  Coarser-grain threads:
  o Increased register pressure and shared memory pressure
  o Lower pressure on thread block slots and thread slots

• Merge thread blocks

• Share row values to reduce memory bandwidth

[Figure: matrices d_M, d_N, and d_P; "Merge these two threads" that share row values]

(50)

Balance

[Figure: balance #threads/block, #thread blocks, shared memory/thread block, and #registers/thread]

• Navigate the tradeoffs to maximize core utilization and memory bandwidth utilization for the target device

• Goal: increase occupancy until one or the other is saturated

(51)

Performance Tuning

• Auto-tuning to maximize performance for a device

• Query device properties to tune kernel parameters prior to launch (see the sketch below)

• Tune kernel properties to maximize efficiency of execution

• CUDA Toolkit documentation: http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/index.html
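A minimal sketch of querying the properties listed above before launch (the chosen block size of 256 is an arbitrary assumption):

#include <cuda_runtime.h>
#include <cstdio>
#include <algorithm>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("SMs:                    %d\n",  prop.multiProcessorCount);
    printf("Warp size:              %d\n",  prop.warpSize);
    printf("Max threads/block:      %d\n",  prop.maxThreadsPerBlock);
    printf("Max threads/SM:         %d\n",  prop.maxThreadsPerMultiProcessor);
    printf("32-bit registers/block: %d\n",  prop.regsPerBlock);
    printf("Shared memory/block:    %zu bytes\n", prop.sharedMemPerBlock);

    // A simple tuning step: pick a block size that is a multiple of the warp
    // size and within the device limit.
    int blockSize = std::min(256, prop.maxThreadsPerBlock);
    blockSize = (blockSize / prop.warpSize) * prop.warpSize;
    printf("Chosen block size:      %d\n", blockSize);
    return 0;
}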

(52)

Device Properties

(53)

Summary

• Warps are sequenced through the scalar pipelines

• Execution from multiple warps is overlapped to hide memory (or special function) latency

• Out-of-order execution (not shown here) requires a scoreboard for correctness

• The occupancy calculator gives a first-order estimate of how fully the device's warp slots are used

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Questions?