
(1)

ISA Execution and Occupancy

©Sudhakar Yalamanchili unless otherwise noted

(2)

Objectives

• Understand the sequencing of instructions in a kernel through a set of scalar pipelines (width = warp size)

• Have a basic, high level understanding of instruction fetch, decode, issue, and memory access

(3)

Reading Assignment

• Kirk and Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” Chapter 6.3

• CUDA Programming Guide http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract

• Get the CUDA Occupancy Calculator: find it at nvidia.com, or use the web version at http://lxkarthi.github.io/cuda-calculator/

(4)

Recap

__global__ void vecAddKernel(float *A_d, float *B_d, float *C_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

__host__ void vecAdd() {
    dim3 DimGrid(ceil(n / 256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

[Figure: a kernel is a set of thread blocks, Blk 0 … Blk N-1, scheduled onto the GPU's multiprocessors M0 … Mk, which are attached to RAM]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

(5)

Kernel Launch: CUDA Streams

[Figure: the host processor issues commands through streams into HW queues in the device's Kernel Management Unit (KMU), which dispatches kernels to the SMs]

• Commands issued by the host go through streams: kernels in the same stream execute sequentially, while kernels in different streams may execute concurrently

• Streams are mapped to hardware queues in the device's kernel management unit (KMU); mapping multiple streams to the same queue serializes some kernels

• Kernel launch distributes thread blocks to SMs (see the stream sketch below)
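The sketch below is a minimal illustration (not from the slides; kernelA and kernelB are placeholder kernels) of placing two kernels in separate streams so that the KMU is free to dispatch them concurrently:

#include <cuda_runtime.h>

// Placeholder kernels used only to illustrate stream placement.
__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, block = 256;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Kernels in different streams map to different HW queues in the KMU and
    // may execute concurrently; kernels in the same stream serialize.
    kernelA<<<(n + block - 1) / block, block, 0, s0>>>(x, n);
    kernelB<<<(n + block - 1) / block, block, 0, s1>>>(y, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(x);
    cudaFree(y);
    return 0;
}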

(6)

TBs, Warps, & Scheduling

• Imagine one TB has 64 threads or 2 warps

• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels

[Figure: SMX0, SMX1, …, SMX12, each with a register file, cores, and L1 cache/shared memory; thread blocks await scheduling onto the SMXs]

(7)

TBs, Warps, & Utilization

• One TB has 64 threads or 2 warps

• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels

[Figure: thread blocks being assigned to SMX0 … SMX12, each with a register file, cores, and L1 cache/shared memory]


(9)

TBs, Warps, & Utilization

• One TB has 64 threads or 2 warps

• On K20c GK110 GPU: 13 SMX, max 64 warps/SMX, max 32 concurrent kernels

[Figure: warp slots on SMX0 … SMX12 filled with thread blocks; remaining TBs are queued]

(10)

NVIDIA GK110 (Kepler): Thread Block Scheduler

Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/

(11)

SMX Organization

Multiple Warp Schedulers

192 cores – 6 clusters of 32 cores each

64K 32-bit registers

Image from http://mandetech.com/2012/05/20/nvidia-new-gpu-and-visualization/

(12)

Execution Sequencing in a Single SM

• Cycle-level view of the execution of a GPU kernel

[Figure: SM pipeline: I-Fetch, I-Buffer, Decode, Issue, then the register file (RF) feeding several scalar pipelines and the D-cache (all hit? miss?) before Writeback; warps such as Warp 1, Warp 2, …, Warp 6 wait in the pending-warps pool]

Execution sequencing?

(13)

Example: VectorAdd on GPU

CUDA:

__global__ void vector_add(float *a, float *b, float *c, int N) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N) c[index] = a[index] + b[index];
}

PTX (assembly):

    setp.lt.s32 %p, %r5, %rd4;   // r5 = index, rd4 = N
    @p bra L1;
    bra L2;
L1: ld.global.f32 %f1, [%r6];    // r6 = &a[index]
    ld.global.f32 %f2, [%r7];    // r7 = &b[index]
    add.f32 %f3, %f1, %f2;
    st.global.f32 [%r8], %f3;    // r8 = &c[index]
L2: ret;

(14)

Example: VectorAdd on GPU

• N = 8, 8 threads, 1 block, warp size = 4

• 1 SM, 4 cores

• Pipeline:
  Fetch:
  o One instruction from each warp
  o Round-robin through all warps
  Execution:
  o In-order execution within warps
  o With proper data forwarding
  o 1 cycle per stage

• How many warps? (8 threads / 4 threads per warp = 2 warps)

(15)

Execution Sequence

[Figure: a fetch/decode (FE/DE) stage feeds four scalar lanes, each with EXE, MEM, and WB stages; Warp 0 and Warp 1 execute the VectorAdd PTX shown above (repeated on each following slide for reference); the pipeline starts empty]

(16)

Execution Sequence (cont.)

[Figure: setp (Warp 0) in fetch/decode]

(17)

Execution Sequence (cont.)

[Figure: setp (Warp 1) fetched; setp (Warp 0) in decode/issue]

(18)

Execution Sequence (cont.)

[Figure: bra (Warp 0) fetched; setp (Warp 1) in issue; setp (Warp 0) executing on the four lanes (EXE)]

(19)

Execution Sequence (cont.)

[Figure: @p bra (Warp 1) fetched; @p bra (Warp 0) in issue; setp (Warp 1) in EXE; setp (Warp 0) in MEM]

(20)

Execution Sequence (cont.)

[Figure: bra L2 (Warp 0) fetched; @p bra (Warp 1) in issue; bra (Warp 0) in EXE; setp (Warp 1) in MEM; setp (Warp 0) in WB]

(21)

Execution Sequence (cont.)

[Figure: bra L2 fetched; bra (Warp 1) in EXE; bra (Warp 0) in MEM; setp (Warp 1) in WB]

(22)

Execution Sequence (cont.)

[Figure: ld (Warp 0) fetched; bra (Warp 1) and bra (Warp 0) draining through the lanes]

(23)

Execution Sequence (cont.)

[Figure: ld (Warp 1) fetched; ld (Warp 0) in issue; bra (Warp 1) completing in the lanes]

(24)

Execution Sequence (cont.)

[Figure: second ld (Warp 0) fetched; ld (Warp 1) in issue; first ld (Warp 0) in EXE]

(25)

Execution Sequence (cont.)

[Figure: second ld (Warp 1) fetched; second ld (Warp 0) in issue; ld (Warp 1) in EXE; ld (Warp 0) in MEM]

(26)

Execution Sequence (cont.)

[Figure: add (Warp 0) fetched; ld (Warp 1) in issue; ld (Warp 1) in EXE; the two ld (Warp 0) instructions in MEM and WB]

(27)

Execution Sequence (cont.)

[Figure: add (Warp 1) fetched; add (Warp 0) in issue; the ld instructions of Warps 0 and 1 advancing through EXE, MEM, and WB]

(28)

Execution Sequence (cont.)

[Figure: st (Warp 0) fetched; add (Warp 1) in issue; add (Warp 0) in EXE; ld (Warp 1) in MEM; ld (Warp 0) in WB]

(29)

Execution Sequence (cont.)

[Figure: st (Warp 1) fetched; st (Warp 0) in issue; add (Warp 1) in EXE; add (Warp 0) in MEM; ld (Warp 1) in WB]

(30)

Execution Sequence (cont.)

[Figure: ret (Warp 0) fetched; st (Warp 1) in issue; st (Warp 0) in EXE; add (Warp 1) in MEM; add (Warp 0) in WB]

(31)

Execution Sequence (cont.)

[Figure: ret (Warp 1) fetched; ret (Warp 0) in issue; st (Warp 1) in EXE; st (Warp 0) in MEM; add (Warp 1) in WB]

(32)

Execution Sequence (cont.)

[Figure: the ret instructions entering the lanes; st (Warp 1) in MEM; st (Warp 0) in WB]

(33)

Execution Sequence (cont.)

[Figure: the ret instructions of both warps in EXE and MEM; st (Warp 1) in WB]

(34)

Execution Sequence (cont.)

[Figure: the two ret instructions draining through MEM and WB]

(35)

Execution Sequence (cont.)

[Figure: the final ret completing in WB]

(36)

Execution Sequence (cont.)

[Figure: pipeline empty; both warps have completed]

Note: Idealized execution without memory or special function delays

(37)

Fine Grained Multithreading

• First introduced in the Denelcor HEP (1980’s)

• Can eliminate data bypassing in an instruction stream: maintain separation between successive instructions in a warp

• Simplifies dependency handling between successive instructions, e.g., by having only one instruction from a warp in the pipeline at a time

• What happens when warp size > #lanes? Why have such a relationship?

(38)

Multi-Cycle Dispatch

[Figure: eight scalar pipelines; a single warp is dispatched over multiple cycles, one part in Cycle 1 and the rest in Cycle 2]

Issue the warp over multiple cycles. NB: fetch bandwidth.

(39)

SIMD vs. SIMT

[Figure: Flynn taxonomy, instruction streams × data streams: SISD, SIMD, MISD, MIMD]

• SISD: a single scalar thread

• SIMD: a single thread issuing synchronous operations over a shared register file and vector lanes (e.g., SSE/AVX)

• MIMD: multiple, loosely synchronized threads (e.g., pthreads)

• SIMT: multiple threads executing synchronously, each lane with its own register file (RF) (e.g., PTX, HSA)

(40)

Comparison

[Figure: MIMD (independent threads, separate register files), SIMD (one register file feeding a vector ALU), SIMT (per-lane register files)]

• MIMD: multiple independent threads and explicit synchronization; TLP, DLP, ILP, SPMD

• SIMD: single thread + vector operations; DLP; easily embedded in sequential code

• SIMT: multiple, synchronous threads; TLP, DLP, ILP (more recent); explicitly parallel

(41)

Performance Metrics: Occupancy

• Performance implications of programming model properties: warps, thread blocks, register usage

• Captures which resources can be dynamically shared, and how to reason about the resource demands of a CUDA kernel; enables device-specific online tuning of kernel parameters

(42)

CUDA Occupancy

• Occupancy = (#active warps) / (#maximum active warps): a measure of how well you are using the SM's maximum capacity (see the query sketch after this list)

• Limits on the numerator? Registers/thread, shared memory/thread block, number of scheduling slots (thread blocks or warps)

• Limits on the denominator? Memory bandwidth, scheduler slots

• What is the performance impact of varying kernel resource demands?
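As a concrete illustration (a minimal sketch, not from the slides; myKernel and the block size are placeholders), the CUDA runtime's occupancy API reports how many blocks of a given size can be resident on one SM, from which occupancy follows directly:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel(float *a, int n) {      // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256, blocksPerSM = 0;
    // Max resident blocks of myKernel per SM, given its register and
    // shared-memory demands (0 bytes of dynamic shared memory here).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel,
                                                  blockSize, 0);

    int activeWarps = blocksPerSM * blockSize / prop.warpSize;
    int maxWarps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    printf("Occupancy = %d/%d warps = %.2f\n",
           activeWarps, maxWarps, (float)activeWarps / maxWarps);
    return 0;
}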

(43)

Performance Goals

Core utilization

Memory bandwidth (BW) utilization

What are the limiting factors?

(44)

Resource Limits on Occupancy

[Figure: the Kernel Distributor dispatches thread blocks to the SMs (SM scheduler, DRAM). Within an SM: warp schedulers and warp contexts; the register file limits the #threads; thread block control and L1/shared memory limit the #thread blocks; the SPs execute the warps. SM = streaming multiprocessor, SP = streaming processor]

(45)

Programming Model Attributes

[Figure: Grid 1 of thread blocks, Block (0,0) … Block (1,1); Block (1,1) expanded into its threads (0,0,0) … (1,0,3)]

• What do we need to know about the target processor?
  o #SMs
  o Max threads/block
  o Max threads/dimension
  o Warp size

• How do these map to kernel configuration parameters that we can control?

(46)

Impact of Thread Block Size

• Consider Fermi with 1536 thread slots per SM
  o With 512 threads/block, only up to 3 thread blocks can be executing at an instant
  o With 128 threads/block, 12 thread blocks per SM
  o Consider how many instructions can be in flight

• Consider a limit of 8 thread blocks/SM: with 128 threads/block, only 1024 threads are active at a time, so occupancy = 0.666

• To maximize utilization, thread block size should balance the demand for thread block slots vs. thread slots; the small calculation below makes the tradeoff concrete
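A host-side sketch using the Fermi-class numbers above (the 8-block-per-SM limit is the assumed value from the slide, not queried from a device):

#include <cstdio>
#include <algorithm>

int main() {
    const int maxThreadsPerSM = 1536;   // Fermi-class thread slots (from the slide)
    const int maxBlocksPerSM  = 8;      // assumed thread-block slot limit (from the slide)

    const int blockSizes[] = {512, 256, 128};
    for (int blockSize : blockSizes) {
        int blocksByThreads = maxThreadsPerSM / blockSize;          // 3, 6, 12
        int blocks          = std::min(blocksByThreads, maxBlocksPerSM);
        int activeThreads   = blocks * blockSize;
        printf("blockSize=%3d -> %2d blocks/SM, %4d threads, occupancy=%.3f\n",
               blockSize, blocks, activeThreads,
               (double)activeThreads / maxThreadsPerSM);
    }
    return 0;   // 512 -> 1.000, 256 -> 1.000, 128 -> 8 blocks, 1024 threads, 0.667
}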

(47)

Impact of #Registers Per Thread

• Assume 10 registers/thread and a thread block size of 256

• Number of registers per SM = 16K

• A thread block requires 2560 registers, for a maximum of 6 thread blocks per SM, which uses all 1536 thread slots

• What is the impact of increasing the number of registers per thread by 2? The granularity of management is a thread block, so one fewer block fits: a loss of 256 threads of concurrency (see the sketch below)
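The same arithmetic, written out as a small sketch (the limits are the slide's numbers, not queried from a device):

#include <cstdio>
#include <algorithm>

int main() {
    const int regsPerSM       = 16 * 1024;   // 16K registers per SM (from the slide)
    const int threadsPerBlock = 256;
    const int maxThreadsPerSM = 1536;

    const int regCounts[] = {10, 12};        // 10 regs/thread vs. 10 + 2
    for (int regsPerThread : regCounts) {
        int regsPerBlock   = regsPerThread * threadsPerBlock;      // 2560 or 3072
        int blocksByRegs   = regsPerSM / regsPerBlock;             // 6 or 5
        int blocksByThread = maxThreadsPerSM / threadsPerBlock;    // 6
        int blocks         = std::min(blocksByRegs, blocksByThread);
        printf("%2d regs/thread -> %d blocks/SM, %d resident threads\n",
               regsPerThread, blocks, blocks * threadsPerBlock);
    }
    return 0;   // 10 regs -> 6 blocks, 1536 threads; 12 regs -> 5 blocks, 1280 threads
}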

(48)

Impact of Shared Memory

• Shared memory is allocated per thread block, which can limit the number of thread blocks executing concurrently per SM

• As we change the gridDim and blockDim parameters, how does the demand change for shared memory, thread slots, and thread block slots? (A launch sketch follows below.)
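A minimal sketch (the kernel and sizes are illustrative assumptions, not from the slides) showing that per-block dynamic shared memory is requested at launch time; the more each block requests, the fewer blocks can be resident on an SM:

#include <cuda_runtime.h>

// Each block stages its slice of the input in dynamically sized shared memory.
__global__ void scaleWithTile(const float *in, float *out, int n) {
    extern __shared__ float tile[];                 // sized by the launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

int main() {
    const int n = 1 << 20, block = 256;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    size_t smemPerBlock = block * sizeof(float);    // 1 KB per block here
    // Concurrent blocks per SM <= (shared memory per SM) / smemPerBlock,
    // on top of the thread-slot and register limits.
    scaleWithTile<<<(n + block - 1) / block, block, smemPerBlock>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}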

(49)

Thread Granularity

• How fine grained should threads be?
  Coarser-grain threads:
  o Increased register pressure and shared memory pressure
  o Lower pressure on thread block slots and thread slots

• Merge thread blocks

• Share row values to reduce memory bandwidth

[Figure: matrices d_M, d_N, and d_P; "Merge these two threads" that share row values]

(50)

Balance

[Figure: balance #threads/block, #thread blocks, shared memory/thread block, and #registers/thread]

• Navigate the tradeoffs to maximize core utilization and memory bandwidth utilization for the target device

• Goal: increase occupancy until one or the other is saturated

(51)

Performance Tuning

• Auto-tuning to maximize performance for a device

• Query device properties to tune kernel parameters prior to launch (see the sketch below)

• Tune kernel properties to maximize efficiency of execution

• CUDA Toolkit documentation: http://developer.download.nvidia.com/compute/cuda/4_1/rel/toolkit/docs/online/index.html
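A minimal sketch of querying the properties listed above before launch (the chosen block size of 256 is an arbitrary assumption):

#include <cuda_runtime.h>
#include <cstdio>
#include <algorithm>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("SMs:                    %d\n",  prop.multiProcessorCount);
    printf("Warp size:              %d\n",  prop.warpSize);
    printf("Max threads/block:      %d\n",  prop.maxThreadsPerBlock);
    printf("Max threads/SM:         %d\n",  prop.maxThreadsPerMultiProcessor);
    printf("32-bit registers/block: %d\n",  prop.regsPerBlock);
    printf("Shared memory/block:    %zu bytes\n", prop.sharedMemPerBlock);

    // A simple tuning step: pick a block size that is a multiple of the warp
    // size and within the device limit.
    int blockSize = std::min(256, prop.maxThreadsPerBlock);
    blockSize = (blockSize / prop.warpSize) * prop.warpSize;
    printf("Chosen block size:      %d\n", blockSize);
    return 0;
}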

(52)

Device Properties

(53)

Summary

• Warps are sequenced through the scalar pipelines

• Execution from multiple warps is overlapped to hide memory (or special function) latency

• Out-of-order execution (not shown here) requires a scoreboard for correctness

• The occupancy calculator gives a first-order estimate of how fully the device's warp slots are used

© Sudhakar Yalamanchili, Georgia Institute of Technology (except as indicated)

Questions?