Asymmetry-Aware Work-Stealing Runtimes
Christopher Torng, Moyang Wang, and Christopher Batten
School of Electrical and Computer Engineering, Cornell University
43rd Int’l Symp. on Computer Architecture, June 2016
• Motivation • First-Order Modeling Asymmetry-Aware Work-Stealing Runtimes Evaluation
[Figure: diagram relating work-stealing runtimes, static asymmetry (single-ISA heterogeneous architectures), and dynamic asymmetry (dynamic voltage and frequency scaling).]
How can we use asymmetry awareness to improve the
performance and energy efficiency of a work-stealing runtime?
Cornell University Christopher Torng 2 / 21
Work-Stealing Runtimes
[Animation: Core 0 through Core 3, each with a task queue and a work-in-progress slot. Steps shown in order: Spawn Task B (with Task A in progress); Dequeue Task B; Spawn Task C; Spawn Task D; Steal Task D and Steal Task C; Spawn Task E and Spawn Task F; Steal Task E and Steal Task F.]
• Work stealing has good performance, space requirements, and communication overheads in both theory and practice
• Supported in many popular concurrency platforms, including Intel’s Cilk Plus, Intel’s C++ TBB, Microsoft’s .NET Task Parallel Library, Java’s Fork/Join Framework, and OpenMP
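The spawn, dequeue, and steal steps on this slide can be sketched as a tiny single-threaded simulation. This is only an illustration under stated assumptions, not the runtime from the talk: the `Worker` class and its method names are hypothetical, each task queue is a plain `collections.deque`, and the sketch assumes the common convention that the owner works at the tail of its queue while thieves take from the head. Which core steals which task is also chosen arbitrarily here.

```python
from collections import deque

class Worker:
    """One core with a private task queue (hypothetical helper)."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.queue = deque()   # tasks waiting to run
        self.current = None    # task in progress

    def spawn(self, task):
        # The owner pushes a newly spawned task onto the tail of its own queue.
        self.queue.append(task)

    def dequeue(self):
        # The owner takes the most recently spawned task from the tail.
        self.current = self.queue.pop() if self.queue else None
        return self.current

    def steal_from(self, victim):
        # A thief takes the oldest task from the head of the victim's queue.
        self.current = victim.queue.popleft() if victim.queue else None
        return self.current

cores = [Worker(i) for i in range(4)]
cores[0].current = "Task A"
cores[0].spawn("Task B")
cores[0].dequeue()             # core 0 finishes Task A and picks up Task B
cores[0].spawn("Task C")
cores[0].spawn("Task D")
cores[1].steal_from(cores[0])  # core 1 takes the oldest queued task
cores[2].steal_from(cores[0])  # core 2 takes the next one
```

Keeping the owner and the thieves at opposite ends of the queue is what lets them mostly avoid contending for the same task.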
Static Asymmetry vs. Dynamic Asymmetry
[Figure, left: Samsung Exynos Octa mobile processor, with little ARM cores (A7), big ARM cores (A15), and L2 caches (L2$).]
[Figure, right: test chip with four integrated voltage regulators (labels: Cell Control, Power Mux, Load), and a plot of voltage (V, 0.7 to 1.4) versus time (ns, 100 to 400) annotated with 120 ns and 150 ns. Caption: Integrated Voltage Regulation.]
From W. Godycki, C. Torng, I. Bukreyev, A. Apsel, and C. Batten, “Enabling Realistic Fine-Grain Voltage Scaling with Reconfigurable Power Distribution Networks,” MICRO, 2014
[Figure: diagram relating work-stealing runtimes, static asymmetry (single-ISA heterogeneous architectures), and dynamic asymmetry (dynamic voltage and frequency scaling), annotated with prior work:]
• Bender et al., "Online Scheduling of Parallel Programs on Heterogeneous Sys ...", SPAA 2002
• Ribic et al., "Energy-Efficient Work-Stealing Language Runtimes", ASPLOS 2014
• Azizi et al., "Energy-performance Tradeoffs in Processor Architecture and Circuit Design: A Marginal Cost Analysis", ISCA 2010
How can we use asymmetry awareness to improve the
performance and energy efficiency of a work-stealing runtime?
Talk Outline
• Motivation
• First-Order Modeling
• Asymmetry-Aware Work-Stealing Runtimes
• Evaluation
[Figure: diagram of work-stealing runtimes, static asymmetry, and dynamic asymmetry for big (B) and little (L) cores, with work-pacing, work-sprinting, and work-mugging placed on it.]
Building Intuition by Exploring a 1 Big 1 Little System
[Plot: normalized power (1 to 8) versus normalized IPS (0.5 to 3.0) for a little core and a four-way big core. Marked points: little core (L) at (1.0, 1.0), big core (B) at (2.0, 6.0). System with 1 big 1 little: IPS 3.0, power 7.0.]
Final build: Same Power, 10% Performance Increase
The Law of Equi-Marginal Utility
Alfred Marshall (1842 - 1924), British Economist
"Other things being equal, a consumer gets maximum satisfaction when he allocates his limited income to the purchase of different goods in such a way that the Marginal Utility derived from the last unit of money spent on each item of expenditure tend to be equal."
Balance the ratio of utility (IPS) to cost (power)
[Plot: normalized power versus normalized IPS, with the slope (dy, cost, over dx, utility) marked at each core's operating point. First build: slopes marked at 1.0 V and 1.0 V. Second build: slopes marked at 0.9 V and 1.3 V, annotated Arbitrage, "Buy Low, Sell High".]
Systematic Approach for Balancing Marginal Utility
[Plot: normalized energy efficiency (0.7 to 1.4) versus normalized IPS (0.6 to 1.4), relative to the 1 big 1 little system at nominal voltage. Legend: individual (VB, VL) pair; Pareto-optimal frontier; isopower line. Regions across the builds: performance at expense of energy efficiency; energy efficiency at expense of performance; improve both performance and energy efficiency.]
Assumptions
• Perfectly parallel application
• Ideal load balancing
Marginal Utility-Based Optimization Problem
• Constraint: isopower line
• Objective: maximize performance
• Solved numerically
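To make the optimization problem concrete, here is a minimal sketch that grid-searches (VB, VL) for the highest total IPS without exceeding the nominal power. It is not the model used in the talk. The assumptions are: frequency scales linearly with voltage, per-core power scales with the cube of voltage, voltages range from 0.70 V to 1.30 V, and the nominal operating points are the (IPS, power) pairs (2.0, 6.0) and (1.0, 1.0) from the 1 big 1 little slide. All function names are hypothetical.

```python
def core_ips(ipc, v):
    # Assumption: frequency is proportional to voltage, so IPS = ipc * v.
    return ipc * v

def core_power(p_nominal, v):
    # Assumption: power ~ V^2 * f with f ~ V, so power scales as v cubed.
    return p_nominal * v ** 3

def best_voltage_pair(budget=7.0, v_min=0.70, v_max=1.30, step=0.01):
    """Grid-search (VB, VL) for maximum total IPS within the power budget."""
    big_ipc, big_p = 2.0, 6.0        # big core at nominal: (IPS, power) = (2.0, 6.0)
    little_ipc, little_p = 1.0, 1.0  # little core at nominal: (1.0, 1.0)
    n = int(round((v_max - v_min) / step))
    grid = [round(v_min + i * step, 4) for i in range(n + 1)]
    best = None
    for vb in grid:
        for vl in grid:
            power = core_power(big_p, vb) + core_power(little_p, vl)
            if power > budget + 1e-9:   # stay on or under the isopower budget
                continue
            ips = core_ips(big_ipc, vb) + core_ips(little_ipc, vl)
            if best is None or ips > best[0]:
                best = (ips, vb, vl, power)
    return best

ips, vb, vl, power = best_voltage_pair()
print(f"VB={vb:.2f} V, VL={vl:.2f} V, IPS={ips:.3f}, power={power:.3f}")
```

Under these assumptions the search lowers the big core's voltage and raises the little core's, which is the same direction as the talk's "Buy Low, Sell High" intuition. The exact voltages and gains depend entirely on the assumed curves.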
Work-Pacing: Building Intuition
Balance performance/power across cores in the high-parallel (HP) region
[Figure: 2 big, 2 little system with both big cores active and both little cores active; activity legend: Busy, Steal Loop.]
[Plot, left: normalized power (0 to 7) versus normalized IPS (0.0 to 3.0) for the big (B) and little (L) cores.]
[Plots, right: marginal IPS/W for B and L, and aggregate system IPS (aggregate throughput, 0.0 to 1.2), versus paired voltage settings VL = 0.70, 1.00, 1.30, 1.60, 1.90 and VB = 1.04, 1.00, 0.92, 0.76, 0.24.]
Work-Pacing, Work-Sprinting, and Work-Mugging
[Figure: activity over time of two little (L) and two big (B) cores; activity legend: Busy, Steal Loop.]
Work-Pacing
Balance performance/power across cores in the high-parallel (HP) region
Work-Sprinting
Rest cores in the steal loop at the lowest voltage. With additional power slack, balance performance/power across busy cores in the low-parallel (LP) region
Work-Mugging
Move work from slow little cores to fast big cores in the low-parallel (LP) region
Work-Pacing and Work-Sprinting Mechanisms
[Block diagram: two big (B) and two little (L) cores, each with an L1$, connected by an on-chip interconnect to shared L2$ cache banks and a DRAM memory controller. The task queues and work in progress of the four cores (Big, Big, Little, Little) supply hints (which cores are stealing? big or little?) to the DVFS controller, whose decision sets Vdd through the voltage regulators.]
Activity patterns and voltages (A = Active, s = Stealing; core order: Big, Big, Little, Little)
Activity Pattern    VB       VL       Stealing
A A A A             0.91V    1.30V
A A A s             0.98V    1.30V    0.70V
A A s s             1.03V    1.30V    0.70V
A s A A             1.04V    1.30V    0.70V
A s A s             1.13V    1.30V    0.70V
A s s s             1.21V    0.70V    0.70V
s s A A             0.70V    1.30V    0.70V
s s A s             0.70V    1.30V    0.70V
s s s s             0.70V    0.70V    0.70V
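The DVFS controller's decision can be viewed as a table lookup keyed by the activity hints. The sketch below is hypothetical rather than the controller from the talk. It assumes the slide's voltage rows pair up with the activity patterns in the order listed, that cores are ordered Big, Big, Little, Little, and that patterns differing only in which core of a type is stealing share a row. The names `LOOKUP`, `canonical`, and `core_voltages` are invented for illustration.

```python
# Key: activity of (big, big, little, little); "A" = active, "s" = stealing.
# Value: (VB, VL, V_stealing) in volts; None where the slide lists no value.
LOOKUP = {
    "AAAA": (0.91, 1.30, None),
    "AAAs": (0.98, 1.30, 0.70),
    "AAss": (1.03, 1.30, 0.70),
    "AsAA": (1.04, 1.30, 0.70),
    "AsAs": (1.13, 1.30, 0.70),
    "Asss": (1.21, 0.70, 0.70),
    "ssAA": (0.70, 1.30, 0.70),
    "ssAs": (0.70, 1.30, 0.70),
    "ssss": (0.70, 0.70, 0.70),
}

def canonical(pattern):
    # Within each core type, list active cores first ("A" sorts before "s"),
    # so that for example "sAsA" maps to the table row "AsAs".
    big = "".join(sorted(pattern[:2]))
    little = "".join(sorted(pattern[2:]))
    return big + little

def core_voltages(pattern):
    """Return the voltage for each of the four cores given its activity."""
    vb, vl, v_steal = LOOKUP[canonical(pattern)]
    active_v = (vb, vb, vl, vl)
    return [active_v[i] if state == "A" else v_steal
            for i, state in enumerate(pattern)]

print(core_voltages("AAAs"))
print(core_voltages("sAsA"))
```

A precomputed table keeps the runtime decision cheap: the controller only has to read the hints and index the table.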
Work-Mugging Mechanisms
[Block diagram: two big (B) and two little (L) cores with L1$, on-chip interconnect, shared L2$ cache banks, DRAM memory controller, voltage regulators, DVFS controller, and a user-level interrupt network. Labeled steps: Initiate Mug, Mug Interrupt, Thread Context Swap.]
Mug Instruction
• Thread ID to mug
• Address of thread-swapping handler
User-Level Interrupt Network
• Simple, low-bandwidth inter-core network
• Latency on order of 20 cycles
Thread Context Swap
• Threads store architectural state to separate locations in shared memory
• Both threads sync
• Threads load architectural state from other location
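The three thread context swap steps (store, sync, load from the other location) can be mimicked in software. In this rough sketch Python threads stand in for the two cores, dictionaries stand in for architectural state, and a two-element list stands in for the separate save locations in shared memory. None of this reflects the actual hardware mechanism, and `swap_contexts` is a hypothetical name.

```python
import threading

def swap_contexts(ctx_a, ctx_b):
    """Exchange two threads' state through separate shared-memory slots."""
    slots = [None, None]     # one save location per thread
    loaded = [None, None]    # what each thread ends up running with
    barrier = threading.Barrier(2)

    def run(core, ctx):
        slots[core] = dict(ctx)          # 1. store own architectural state
        barrier.wait()                   # 2. both threads sync
        loaded[core] = slots[1 - core]   # 3. load state from the other location

    threads = [threading.Thread(target=run, args=(i, c))
               for i, c in enumerate((ctx_a, ctx_b))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return loaded[0], loaded[1]

big_now, little_now = swap_contexts(
    {"pc": 0x100, "task": "steal loop"},   # state that was on the big core
    {"pc": 0x200, "task": "Task C"},       # state that was on the little core
)
```

The sync step matters: without it, one thread could load from the other's location before the state has been stored there.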
Evaluation Methodology: Modeling
Work-Stealing Runtime
• State-of-the-art Intel TBB-inspired work-stealing scheduler
• Chase-Lev task queues with occupancy-based victim selection
• Instrumented with activity hints
Cycle-Level Modeling
• Heterogeneous system modeled in gem5 cycle-approximate simulator
• Support for scaling per-core frequencies + central DVFS Controller
Energy Modeling
• Event-based energy modeling based on detailed RTL/gate-level sims (Synopsys ASIC toolflow, TSMC LP, 65 nm 1.0 V)
• Carefully selected subset of McPAT results tuned to our µarchitecture
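The slide does not define occupancy-based victim selection. One plausible reading, assumed here, is that a thief prefers the queue holding the most tasks. The sketch below shows that reading only; `pick_victim` is a hypothetical helper, and the tie-break toward the lowest core index is an arbitrary choice.

```python
def pick_victim(queue_lengths, thief):
    """Return the index of the fullest non-empty queue other than the thief's.

    Returns None when every other queue is empty.
    """
    # Pair each occupancy with the negated index so that max() prefers the
    # highest occupancy and, on ties, the lowest core index.
    candidates = [(n, -i) for i, n in enumerate(queue_lengths)
                  if i != thief and n > 0]
    if not candidates:
        return None
    _, neg_index = max(candidates)
    return -neg_index

print(pick_victim([3, 0, 5, 1], thief=1))
```

Compared with picking a victim at random, this policy is less likely to waste a steal attempt on an empty queue.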
Work-Pacing in cilk-sort
[Figure: per-core timelines for big and little cores, with no AAWS techniques (top) and with work-pacing (bottom). Each timeline shows an activity bar (Busy, Steal Loop) and the DVFS controller decision (legend: 0.70 V, 0.90 V, 1.00 V, 1.04 V, 1.24 V, 1.30 V). Annotation: speedup of 1.11x.]
[Plot: normalized energy efficiency (0.9 to 1.5) versus performance (1.0 to 1.5), with the isopower line.]
Speedup 1.11x, Energy Efficiency 1.11x
Work-Sprinting in quicksort
[Figure: per-core timelines for big and little cores, with no AAWS techniques (top) and with work-sprinting (bottom). Each timeline shows an activity bar (Busy, Steal Loop) and the DVFS controller decision (legend: 0.70 V, 1.00 V, 1.04 V, 1.10 V, 1.24 V, 1.30 V). Annotation: speedup of 1.34x.]
[Plot: normalized energy efficiency (0.9 to 1.5) versus performance (1.0 to 1.5), with the isopower line.]
Speedup 1.34x, Energy Efficiency 1.16x
Work-Mugging in radix sort
[Figure: per-core timelines for big and little cores, with no AAWS techniques (top) and with work-mugging (bottom). Each timeline shows an activity bar (Busy, Steal Loop) and the DVFS controller decision (legend: 0.70 V, 0.90 V, 1.00 V, 1.04 V, 1.24 V, 1.30 V). Annotation: speedup 1.17x.]
[Plot: normalized energy efficiency (0.9 to 1.5) versus performance (1.0 to 1.5), with the isopower line.]
Speedup 1.17x, Energy Efficiency 1.40x
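The three case studies report both a speedup and an energy-efficiency gain, and the two together imply a change in average power. The sketch below works that out, assuming energy efficiency means work per joule, so that efficiency gain = speedup / relative power. Under that definition a point on the isopower line has equal speedup and efficiency gain. The slides do not state this definition, and `relative_power` is a hypothetical helper.

```python
def relative_power(speedup, energy_efficiency_gain):
    # Assumption: energy = power * time and efficiency = work / energy,
    # so relative average power = speedup / efficiency gain.
    return speedup / energy_efficiency_gain

# (speedup, energy-efficiency gain) as reported on the three case-study slides
results = {
    "work-pacing (cilk-sort)": (1.11, 1.11),
    "work-sprinting (quicksort)": (1.34, 1.16),
    "work-mugging (radix sort)": (1.17, 1.40),
}
for name, (speedup, gain) in results.items():
    print(f"{name}: relative power = {relative_power(speedup, gain):.2f}")
```

Under this reading the work-pacing result sits on the isopower line, the work-sprinting result draws more average power than the baseline, and the work-mugging result draws less.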
Evaluation of Complete AAWS Runtime
[Plot: normalized energy efficiency (0.9 to 1.5) versus performance (1.0 to 1.5) for each application kernel, with the isopower line.]
Application Kernels
• pbbs-bfs
• pbbs-quicksort
• pbbs-samplesort
• pbbs-dictionary
• pbbs-convex-hull
• pbbs-radix-sort
• pbbs-knn
• pbbs-max-independent-set
• pbbs-nbody
• pbbs-remove-duplicates
• pbbs-suffix-array
• pbbs-spanning-tree
• cilk-cholesky
• cilk-cilksort
• cilk-heat
• cilk-knapsack
• cilk-matrix-multiply
• parsec-blackscholes
• unbalanced-tree-search
Median: 1.10x, Max: 1.32x Performance
Median: 1.11x, Max: 1.53x Energy Efficiency
Take-Away Point
Holistically combining
• work-stealing runtimes
• static asymmetry
• dynamic asymmetry
through the use of
• work-pacing
• work-sprinting
• work-mugging
can improve both performance and energy efficiency in future multicore systems
This work was partially supported by the NSF, AFOSR, and donations from Intel and Synopsys