Equalizer: Dynamically Tuning GPU Resources for Efficient Execution
Ankit Sethia*, Scott Mahlke
University of Michigan
GPU usage is expanding
• Graphics
• Simulation
• Linear Algebra
• Data Analytics
• Machine Learning
• Computer Vision
Resource requirements of GPU applications are diverging
GPUs use SIMT execution model
• 1 thread imbalanced → 30270 threads imbalanced
• 30270 threads want a resource → resource saturated
Imbalanced GPU resource utilization
[Diagram: SMs, each with an L1 cache, and memory; panels: Compute Intensive, Memory Intensive]
A large number of threads causes early saturation of some resources and under-utilization of others
Kernels saturate some resources much faster than others
Opportunity 1: Boost bottleneck resource for performance improvement
Opportunity 2: Throttle under-utilized resources for energy savings
Modulating hardware resources

Resource        Parameter                            Var.
Compute         Core (FPU, IALU, etc.) frequency     ±15%
Memory          Memory (L2, DRAM, etc.) frequency    ±15%
L1 Data Cache   # of thread blocks                   1 - Max
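The three knobs in the table can be pictured as a small bounded state. The sketch below is only an illustration: the `Knobs` class, its `adjust` method, and the `max_blocks` default of 8 are hypothetical names and values, not part of Equalizer. The ±15% step and the 1 to Max range for thread blocks come from the slides.

```python
from dataclasses import dataclass

FREQ_STEP = 0.15  # from the table: frequencies vary by ±15% around nominal


def clamp(value, low, high):
    """Keep value inside the closed range [low, high]."""
    return max(low, min(high, value))


@dataclass(frozen=True)
class Knobs:
    """Hypothetical container for the three tunable parameters."""
    core_level: int = 0   # -1, 0, +1 stand for f - 15%, f, f + 15%
    mem_level: int = 0    # same encoding for the memory-side frequency
    num_blocks: int = 1   # concurrent thread blocks, between 1 and max_blocks
    max_blocks: int = 8   # assumed hardware limit, for illustration only

    def adjust(self, core=0, mem=0, blocks=0):
        """Return a new setting with each knob moved and clamped to its range."""
        return Knobs(
            core_level=clamp(self.core_level + core, -1, 1),
            mem_level=clamp(self.mem_level + mem, -1, 1),
            num_blocks=clamp(self.num_blocks + blocks, 1, self.max_blocks),
            max_blocks=self.max_blocks,
        )

    def core_frequency(self, nominal):
        """Core frequency implied by the current level."""
        return nominal * (1 + FREQ_STEP * self.core_level)

    def mem_frequency(self, nominal):
        """Memory frequency implied by the current level."""
        return nominal * (1 + FREQ_STEP * self.mem_level)
```

Clamping inside `adjust` means a caller can request a change on every decision without tracking whether a knob is already at its limit.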
Boosting bottleneck resources
Actions for performance improvement
[Figure: performance vs. energy efficiency, both axes from 0.8 to 1.2; legend: Memory Intensive, Cache Sensitive, Compute Intensive. Example: increasing core frequency]
[Table: kernel type (Compute, Memory, Cache) vs. core frequency, memory frequency, and number of threads]
Throttling under-utilized resources
Actions for energy savings
[Figure: performance vs. energy efficiency, both axes from 0.8 to 1.2; legend: Memory Intensive, Cache Sensitive, Compute Intensive. Example: decreasing core frequency]
[Table: kernel type (Compute, Memory, Cache) vs. core frequency, memory frequency, and number of threads]
Dynamic vs. Static Decisions
[Figure, inter-invocation performance variation: relative execution time (0 to 1) vs. invocation number (1 to 12); legend (# of thread blocks): 1, 2, 3, Opt]
[Figure, intra-invocation variation: # of extra memory warps (0 to 25) vs. execution time]
Dynamic decisions are needed to fully utilize resource modulation
Objectives
• Modulate 3 key parameters
  – Core-side frequency
  – Memory-side frequency
  – Number of thread blocks
• 2 modes of operation
  – Performance mode (boosting resources)
  – Energy mode (throttling resources)
• Make dynamic decisions
Equalizer Overview
[Diagram labels: block scheduler; SM 0, SM 1, ... SM N, each with an Equalizer; inside an SM: instruction buffer, warp scheduler, Equalizer, counters; frequency manager; VF regulator; signals: requests, blocks, new parameters, new frequency]
• Samples of warp state are taken over a window of cycles
• New decisions are made every window
• The frequency manager arbitrates the decisions of all cores
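The slides say only that the frequency manager arbitrates between the per-SM decisions, not how. As one possible illustration, the sketch below uses a strict majority vote. Both the voting policy and the names (`arbitrate`, the `"up"`, `"down"`, `"hold"` request strings) are assumptions made for this example.

```python
from collections import Counter


def arbitrate(requests):
    """Pick one chip-wide frequency change from per-SM requests.

    requests: list with one entry per SM, each "up", "down", or "hold".
    Assumed policy: follow a request only if a strict majority of SMs
    asked for it; otherwise keep the current frequency.
    """
    if not requests:
        return "hold"
    choice, votes = Counter(requests).most_common(1)[0]
    return choice if votes > len(requests) / 2 else "hold"
```

For example, `arbitrate(["up", "up", "down"])` follows the two SMs asking for a boost, while an even split leaves the frequency unchanged.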
State of warps - System heartbeat
State of warps:
• Waiting (W) – Waiting for data to be ready
• Excess Mem (Xmem) – Ready to issue to memory pipeline
• Excess ALU (Xalu) – Ready to issue to arithmetic pipeline
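A minimal sketch of how the three warp-state counts might be collected and summarized over one window. The 4096-cycle window and 128-cycle sampling interval are the values listed in the experimental setup. Averaging the samples is an assumption, and `WarpSample` and `average_window` are hypothetical names.

```python
from dataclasses import dataclass

WINDOW_CYCLES = 4096  # observation window from the experimental setup
SAMPLE_CYCLES = 128   # sampling interval from the experimental setup
SAMPLES_PER_WINDOW = WINDOW_CYCLES // SAMPLE_CYCLES  # 32 samples


@dataclass
class WarpSample:
    """Warp counts per state on one SM at one sampling point."""
    waiting: float  # W: waiting for data to be ready
    xmem: float     # Xmem: ready to issue to the memory pipeline
    xalu: float     # Xalu: ready to issue to the arithmetic pipeline
    active: float   # all warps currently resident on the SM


def average_window(samples):
    """Average a window of samples (assumed summary; the slides do not say)."""
    if not samples:
        raise ValueError("need at least one sample")
    n = len(samples)
    return WarpSample(
        waiting=sum(s.waiting for s in samples) / n,
        xmem=sum(s.xmem for s in samples) / n,
        xalu=sum(s.xalu for s in samples) / n,
        active=sum(s.active for s in samples) / n,
    )
```

With these settings each decision is based on 32 samples per SM.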
[Figure: distribution of warps (0% to 100%) across the Waiting, Excess Mem, and Excess ALU states for kernels cutcp, lavaMD, mri_g-3, pf, sgemm, cfd-1, histo-3, leuko-1, bfs, histo-1, mmer, spmv, bp-1, mri_g-2, and sc; kernel groups: Compute, Memory, Cache, Unsaturated]
Distinguishing memory and cache requirements
Assumption:
i) Run maximum threads for memory intensive kernels
ii) Reduce threads for cache sensitive kernels
Reducing threads for memory kernels won’t hurt as long as bandwidth is fully utilized
[Figure, memory intensive kernels: performance vs. # of threads; regions: under-utilization, bandwidth saturated]
[Figure, cache sensitive kernels: performance vs. # of threads; regions: under-utilization, optimal, cache thrashing]
Equalizer Algorithm
Input: Xmem, Xalu, Waiting, Active, WCTA
WCTA is the number of warps in a thread block
Checks (each with a Y/N branch in the flowchart):
• Check if highly compute intensive (Xalu > WCTA)
• Check if highly memory intensive (Xmem > WCTA)
• Check if memory intensive (Xmem > 2)
• Check if majority warps are idle (Waiting < Active/2)
• Check more compute or more memory (Xmem > Xalu)
Outcomes:
• Take compute intensive actions
• Take memory intensive actions
• Take cache sensitive actions
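The flowchart's checks can be sketched as a chain of comparisons. The transcript does not preserve which branch leads to which action, so the mapping below is one plausible reading and not the authors' definitive algorithm. Two points are assumptions: the highly memory intensive case is mapped to the cache sensitive actions, and a "yes" on the Waiting < Active/2 check is taken to mean no change. `Action` and `equalizer_decide` are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    COMPUTE = "compute intensive actions"
    MEMORY = "memory intensive actions"
    CACHE = "cache sensitive actions"
    NONE = "no change"  # assumed outcome, not named on the slide


def equalizer_decide(xmem, xalu, waiting, active, wcta):
    """Map one window of warp-state counts to an action (assumed branches)."""
    # Highly compute intensive: more than a block's worth of warps
    # are ready for the arithmetic pipeline.
    if xalu > wcta:
        return Action.COMPUTE
    # Highly memory intensive: assumed to trigger the cache sensitive
    # actions, since the slides argue that reducing threads does not hurt
    # memory kernels while bandwidth stays fully utilized.
    if xmem > wcta:
        return Action.CACHE
    # Memory intensive.
    if xmem > 2:
        return Action.MEMORY
    # Condition exactly as written on the slide; a "yes" here is assumed
    # to leave the hardware settings unchanged.
    if waiting < active / 2:
        return Action.NONE
    # Otherwise lean toward whichever pipeline has more excess warps.
    return Action.MEMORY if xmem > xalu else Action.COMPUTE
```

For instance, with `wcta=8`, a window with `xalu=10` is classified as compute intensive on the first check, while a window with `xmem=4` falls through to the third check and takes the memory intensive actions.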
Experimental Setup

Simulator             GPGPU-Sim 3.2.2
Kernels               27, Rodinia and Parboil
Power Modelling       GPUWattch
Voltage Regulation    On-chip, 512 cycles latency
SM/Memory Frequency   f - 15%, f, f + 15%
Observation window    4096 cycles
Sampling rate         128 cycles
Results – Performance mode
[Figure: speedup (0.8 to 1.8) of Equalizer, SM Boost, and Memory Boost for kernels cutcp, lavaMD, mri_g-3, pf, sgemm, cfd-2, lbm, GMEAN, bfs, histo-1, mmer, spmv, mri_g-1, sad-1, and stncl; kernel groups: Compute, Memory, Cache, Unsaturated]
[Figure: energy increase (-10% to 25%), AVG per group]
[Figure annotations: 2.84, -67%, -19%, -11%]

Technique    Performance   Energy
Equalizer    22%           7%
SM Boost     6%            12%
Mem Boost    7%            8%
Equalizer Dynamism
[Figure, inter-invocation adaptiveness: relative execution time (0 to 1), x-axis 1 to 12; legend (# of concurrent blocks): 1, 2, 3, Opt, Eq]
[Figure, intra-invocation adaptiveness: # of warps (0 to 50) vs. time; series: Waiting, Equalizer, Total Warps]
Conclusion
• Critical to match hardware’s abilities to kernel’s requirements
• Equalizer understands kernel’s requirement by watching state of warps
• By modulating hardware dynamically:
  – 22% performance at 6% energy overhead
  – 15% energy savings at 5% performance gain