Equalizer: Dynamically Tuning GPU Resources for Efficient Execution Ankit Sethia* Scott Mahlke...

Equalizer: Dynamically Tuning GPU Resources for Efficient Execution

Ankit Sethia*, Scott Mahlke
University of Michigan

GPU usage is expanding

Application domains: Graphics, Simulation, Linear Algebra, Data Analytics, Machine Learning, Computer Vision

Resource requirements of GPU applications are diverging

[Figure: SM, L1 Cache, Memory]

Motivation: Imbalanced GPU resource utilization

GPUs use SIMT execution model
• 1 thread imbalanced → 30270 threads imbalanced
• 30270 threads want a resource → Resource saturated

Imbalanced GPU resource utilization

[Figure: SM, L1 Cache, and Memory utilization; panels: Compute Intensive, Memory Intensive]

A large number of threads causes early saturation of some resources and under-utilization of others

Kernels saturate some resource much faster than others

Opportunity 1: Boost bottleneck resource for performance improvement
Opportunity 2: Throttle under-utilized resources for energy savings

Modulating hardware resources

    Resource                          Parameter            Var.
1   Compute: Core (FPU, IALU, etc.)   Frequency            ±15%
2   Memory: Memory (L2, DRAM, etc.)   Frequency            ±15%
3   L1 Data Cache                     # of Thread Blocks   1 - Max
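
The three modulated parameters can be pictured as a small configuration object. This is only an illustrative sketch: the class name `Knobs`, its field names, the three discrete frequency levels encoded as -1/0/+1, and the default `max_blocks` value are assumptions made for the example. Only the ±15% frequency variation and the 1-to-Max range for thread blocks come from the slide.

```python
from dataclasses import dataclass

FREQ_STEP = 0.15  # slide: core and memory frequency vary by +/-15%


@dataclass
class Knobs:
    """Hypothetical container for the three modulated parameters."""
    core_level: int = 0   # -1, 0, +1 -> f - 15%, f, f + 15%
    mem_level: int = 0    # same three levels for the memory side
    num_blocks: int = 1   # concurrent thread blocks per SM
    max_blocks: int = 8   # assumed upper limit ("Max" on the slide)

    def core_freq(self, base_mhz: float) -> float:
        """Core frequency for the current level."""
        return base_mhz * (1.0 + FREQ_STEP * self.core_level)

    def mem_freq(self, base_mhz: float) -> float:
        """Memory frequency for the current level."""
        return base_mhz * (1.0 + FREQ_STEP * self.mem_level)

    def step(self, field: str, delta: int) -> None:
        """Move one knob by delta, clamped to its legal range."""
        if field in ("core_level", "mem_level"):
            setattr(self, field, max(-1, min(1, getattr(self, field) + delta)))
        elif field == "num_blocks":
            self.num_blocks = max(1, min(self.max_blocks, self.num_blocks + delta))
        else:
            raise ValueError(f"unknown knob: {field}")
```

Clamping in `step` keeps every request inside the ranges listed in the table, so a repeated boost request cannot push a frequency beyond f + 15%.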

Boosting bottleneck resources

[Figure: Performance vs. Energy Efficiency (both axes 0.8 to 1.2). Legend: Memory Intensive, Cache Sensitive, Compute Intensive. Example: Increasing core frequency]

Actions for performance improvement
[Table: Kernel Type (Compute, Memory, Cache) against Core Frequency, Memory Frequency, Number of Threads]

Throttling under-utilized resources

[Figure: Performance vs. Energy Efficiency (both axes 0.8 to 1.2). Legend: Memory Intensive, Cache Sensitive, Compute Intensive. Example: Decreasing core frequency]

Actions for energy savings
[Table: Kernel Type (Compute, Memory, Cache) against Core Frequency, Memory Frequency, Number of Threads]

Dynamic vs. Static Decisions

[Figure: Inter-invocation performance variation. Relative Execution Time (0 to 1) vs. Invocation number (1 to 12); # of thread blocks: 1, 2, 3, Opt]
[Figure: Intra-invocation variation. # of Extra Memory Warps (0 to 25) vs. Execution Time]

Dynamic decisions are needed to fully utilize resource modulation

Objectives
• Modulate 3 key parameters
  – Core-side Frequency
  – Memory-side Frequency
  – Number of thread blocks
• 2 modes of operation
  – Performance Mode (boosting resources)
  – Energy Mode (throttling resources)
• Make dynamic decisions
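
One way to picture the two modes is as a lookup from (mode, kernel type) to a change in the three parameters. The mapping below is an assumption built from the principles stated in the talk (boost the bottleneck resource, throttle the under-utilized one, reduce threads for cache sensitive kernels); the per-cell contents of the action tables did not survive in this transcript, and the names `Mode`, `Kernel`, and `choose_action` are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    PERFORMANCE = "performance"  # boost the bottleneck resource
    ENERGY = "energy"            # throttle the under-utilized resource


class Kernel(Enum):
    COMPUTE = "compute"
    MEMORY = "memory"
    CACHE = "cache"


def choose_action(mode: Mode, kernel: Kernel) -> dict:
    """Return knob deltas (assumed mapping, not taken from the slides).

    'core' and 'mem' are frequency-level steps; 'blocks' is a change in
    the number of concurrent thread blocks.
    """
    action = {"core": 0, "mem": 0, "blocks": 0}
    if kernel is Kernel.CACHE:
        # Assumed: fewer threads ease cache contention in either mode.
        action["blocks"] = -1
    elif mode is Mode.PERFORMANCE:
        # Boost whichever side is the bottleneck.
        action["core" if kernel is Kernel.COMPUTE else "mem"] = 1
    else:
        # Throttle the side that is not the bottleneck.
        action["mem" if kernel is Kernel.COMPUTE else "core"] = -1
    return action
```

For example, under this assumed mapping a memory intensive kernel in Energy Mode lowers the core-side frequency and leaves the memory side alone.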

Equalizer Overview

[Figure: Equalizer architecture. Labels: SM 0, SM 1, ..., SM N, each with an Equalizer; inside an SM: Inst. Buffer, Warp Scheduler, Counters, Equalizer; Block Scheduler (New Parameters, Blocks); Frequency Manager (Requests); VF Regulator (New Frequency)]

• Samples of warp state are taken over a window of cycles
• New decisions are made every window
• The frequency manager arbitrates the decisions of all cores
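
The windowed sampling and the arbitration step might look roughly like the sketch below. The window and sampling lengths are the values listed on the experimental setup slide. The arbitration policy shown (a strict majority vote over per-SM requests) is an assumption, since the slides only say that the frequency manager arbitrates; the function names are hypothetical.

```python
from collections import Counter

WINDOW_CYCLES = 4096   # observation window (experimental setup slide)
SAMPLE_CYCLES = 128    # sampling interval (experimental setup slide)
SAMPLES_PER_WINDOW = WINDOW_CYCLES // SAMPLE_CYCLES  # 32 samples per window


def window_average(samples):
    """Average per-state warp counts over one window.

    `samples` is a list of dicts such as {"waiting": 10, "xmem": 3, "xalu": 1}.
    """
    if not samples:
        raise ValueError("need at least one sample")
    keys = samples[0].keys()
    return {k: sum(s[k] for s in samples) / len(samples) for k in keys}


def arbitrate(requests):
    """Pick one frequency step from per-SM requests in {-1, 0, +1}.

    Assumption: strict majority vote; without a majority the frequency
    is left unchanged (step 0).
    """
    if not requests:
        return 0
    step, votes = Counter(requests).most_common(1)[0]
    return step if votes > len(requests) / 2 else 0
```

A single arbiter is needed in this sketch because a frequency domain is shared, whereas the number of thread blocks can be changed per SM.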

State of warps - System heartbeat

State of warps:
• Waiting (W) – Waiting for data to be ready
• Excess Mem (Xmem) – Ready to issue to memory pipeline
• Excess ALU (Xalu) – Ready to issue to arithmetic pipeline

[Figure: Distribution of Warps (0% to 100%) across Waiting, Excess Mem, and Excess ALU per kernel, grouped as Compute, Memory, Cache, Unsaturated. Kernels: cutcp, lavaMD, mri_g-3, pf, sgemm, cfd-1, histo-3, leuko-1, bfs, histo-1, mmer, spmv, bp-1, mri_g-2, sc]
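
The per-kernel distribution of warp states can be computed from a snapshot of warp states as in the sketch below. The names `WarpState`, `tally`, and `distribution` are hypothetical, and the `OTHER` state is an assumed catch-all for warps in none of the three states named on the slide.

```python
from collections import Counter
from enum import Enum


class WarpState(Enum):
    WAITING = "W"    # waiting for data to be ready
    XMEM = "Xmem"    # ready to issue to the memory pipeline
    XALU = "Xalu"    # ready to issue to the arithmetic pipeline
    OTHER = "other"  # assumed catch-all, not named on the slide


def tally(states):
    """Count how many warps are in each state in one sample."""
    counts = Counter(states)
    return {s: counts.get(s, 0) for s in WarpState}


def distribution(states):
    """Fraction of warps per state, over Waiting/Xmem/Xalu only."""
    counts = tally(states)
    tracked = [WarpState.WAITING, WarpState.XMEM, WarpState.XALU]
    total = sum(counts[s] for s in tracked)
    if total == 0:
        return {s: 0.0 for s in tracked}
    return {s: counts[s] / total for s in tracked}
```

A kernel whose distribution is dominated by Xalu leans compute intensive, one dominated by Xmem leans memory intensive, and one dominated by Waiting has saturated neither pipeline.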

Distinguishing memory and cache requirements

Assumption:
i) Run maximum threads for memory intensive kernels
ii) Reduce threads for cache sensitive kernels

Reducing threads for memory kernels won’t hurt as long as bandwidth is fully utilized

[Figure: Memory intensive kernels. Performance vs. # of threads; regions: Under-utilization, Bandwidth Saturated]
[Figure: Cache sensitive kernels. Performance vs. # of threads; regions: Under-utilization, Optimal, Cache thrashing]

Equalizer Algorithm

Input: Xmem, Xalu, Waiting, Active, WCTA
(WCTA is the number of warps in a thread block)

[Flowchart: each check has Y/N branches]
• Check if highly compute intensive (Xalu > WCTA)
• Check if highly memory intensive (Xmem > WCTA)
• Check if memory intensive (Xmem > 2)
• Check if majority warps are idle (Waiting < Active/2)
• Check more compute or more memory (Xmem > Xalu)

Outcomes: Take compute intensive actions, Take memory intensive actions, Take cache sensitive actions
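
The checks can be read as a decision cascade. The flattened flowchart does not preserve which branch leads to which action, so the sketch below is one plausible reading rather than the authors' exact algorithm: the thresholds and comparisons are taken from the slide, while the check order, the pairing of each check with an outcome, and the "no action" result are assumptions. The function name `classify` is hypothetical.

```python
def classify(xmem, xalu, waiting, active, wcta):
    """Return 'compute', 'memory', 'cache', or None (no action).

    Inputs are per-window warp counts; wcta is the number of warps in a
    thread block. Branch-to-action pairing is an assumed reading.
    """
    if xalu > wcta:
        # Highly compute intensive.
        return "compute"
    if xmem > wcta:
        # Highly memory intensive; assumed to trigger cache sensitive actions.
        return "cache"
    if xmem > 2:
        # Memory intensive.
        return "memory"
    if waiting < active / 2:
        # Comparison as printed on the slide; assumed to mean no change is needed.
        return None
    # Otherwise lean toward whichever pipeline has more excess warps.
    return "memory" if xmem > xalu else "compute"
```

For example, with 8 warps per thread block, `classify(xmem=4, xalu=1, waiting=5, active=24, wcta=8)` falls through the first two checks and returns `'memory'` at the third.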

Experimental Setup

Simulator             GPGPU-Sim 3.2.2
Kernels               27, Rodinia and Parboil
Power Modelling       GPUWattch
Voltage Regulation    On-chip, 512 cycles latency
SM/Memory Frequency   f - 15%, f, f + 15%
Observation window    4096 cycles
Sampling rate         128 cycles

Results – Performance mode

[Figure: Speedup (0.8 to 1.8, one off-scale value labeled 2.84) of Equalizer, SM Boost, and Memory Boost per kernel, grouped as Compute, Memory, Cache, Unsaturated. Kernels: cutcp, lavaMD, mri_g-3, pf, sgemm, cfd-2, lbm, bfs, histo-1, mmer, spmv, mri_g-1, sad-1, stncl; plus GMEAN]
[Figure: Energy Increase (-10% to 25%) with per-group AVG bars; labeled values: -67%, -19%, -11%]

Technique    Performance   Energy
Equalizer    22%           7%
SM Boost     6%            12%
Mem Boost    7%            8%

Equalizer Dynamism

[Figure: Inter-invocation adaptiveness. Relative Execution Time (0 to 1) over invocations 1 to 12; # of concurrent Blocks: 1, 2, 3, Opt, Eq]
[Figure: Intra-invocation adaptiveness. # of Warps (0 to 50) vs. Time; series: Waiting, Equalizer, Total Warps]

Conclusion
• Critical to match hardware’s abilities to kernel’s requirements
• Equalizer understands kernel’s requirement by watching state of warps
• By modulating hardware dynamically:
  – 22% performance at 6% energy overhead
  – 15% energy savings at 5% performance gain

Equalizer: Dynamically Tuning GPU Resources for Efficient Execution

Questions?
