The Yin and Yang of Hardware Heterogeneity: Can Software Survive?
Kathryn S McKinley
Computation: Turing, 1936

The Transistor: Shockley, Bardeen, Brattain, 1947
Virtuous Cycle
Device innovation → doubling of transistors (faster, smaller, cheaper, …) → hardware complexity → sequential interface → software innovation → software complexity → device innovation
Hardware
Dennard Scaling is over: Power ∝ Clock Speed × Voltage²
Performance now trades against power everywhere: battery life on devices, and electricity costs in U.S. data centers, which grew from $4.5 billion in 2006 to $7.4 billion in 2011 [U.S. EPA 2007].
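The arithmetic behind this shift can be sketched numerically. Dynamic CMOS power follows P = C × f × V². Under Dennard scaling, voltage and capacitance shrank with feature size, so doubling transistors kept power roughly flat; once voltage stopped scaling, power doubles with transistor count. All constants below are illustrative, not measurements from the talk.

```python
# Illustrative sketch of dynamic CMOS power: P = C * f * V^2.
# Constants are made up; only the scaling relationships matter.

def dynamic_power(capacitance_nf, freq_ghz, voltage_v):
    """Dynamic switching power in watts: P = C * f * V^2."""
    return capacitance_nf * 1e-9 * freq_ghz * 1e9 * voltage_v ** 2

# Baseline generation.
p_old = dynamic_power(1.0, 3.0, 1.2)

# Dennard era: dimensions scale by 0.7, so per-device C and V scale by 0.7
# and frequency rises 1.4x; with 2x transistors, total power stays ~flat.
p_dennard = dynamic_power(2 * 1.0 * 0.7, 3.0 * 1.4, 1.2 * 0.7)

# Post-Dennard: voltage is stuck near threshold, so the same 2x transistor
# doubling at the same frequency gain roughly doubles power.
p_stuck = dynamic_power(2 * 1.0 * 0.7, 3.0 * 1.4, 1.2)

print(p_old, p_dennard, p_stuck)
```

With voltage scaling, the doubled-transistor chip draws about the same power as its predecessor; without it, power nearly doubles, forcing the dark-silicon budget shown on the next slide.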
Dark silicon [Goulding et al., Hot Chips 2010]
Powered fraction of a chip at 40mm² and 30W: 50% at 90nm, 18% at 45nm, 9% at 32nm.
Multicore Hardware (each die shown at correct scale)
Pentium 4 "Northwood" (2003, 130nm): 55M transistors, 131mm², 1 core, 2-way SMT
Core 2 Duo/Quad "Conroe"/"Kentsfield" (2006, 65nm): 291M transistors, 143mm², 2 cores, no SMT
Core i7 "Bloomfield" (2008, 45nm): 731M transistors, 263mm², 4 cores, 2-way SMT
Atom "Diamondville" (2008, 45nm): 47M transistors, 36mm², 1 core, 2-way SMT
Core 2 Duo "Wolfdale" (2009, 45nm): 228M transistors, 82mm², 2 cores, no SMT
Atom D "Pineview" (2009, 45nm): 176M transistors, 87mm², 2 cores + GPU, 2-way SMT
Core i5 "Clarkdale" (2010, 32nm): 382M transistors, 81mm², 2 cores, 2-way SMT
Virtuous Cycle, revisited
The same cycle: device innovation → doubling of transistors (faster, smaller, cheaper, …) → hardware complexity → sequential interface → software innovation → software complexity. The end of Dennard Scaling breaks it: the sequential interface becomes a parallel interface.
Software
The PC software era is ending
Software = Mobile + Web + Cloud
Software People
App writers vs. computer scientists
Languages People Use
[Chart: usage of popular languages, including C++, C, and PHP.]
Software
Performance: fast enough
Productivity: managing complexity, abstractions
Hardware & Software: pulling in opposite directions
Hardware & Software Marriage
How’s this marriage working out?
What should we measure?
Energy = ∫ Power dt
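In practice the integral is evaluated over sampled power measurements. A minimal sketch, with hypothetical sample data:

```python
# Energy is the integral of power over time. Given power samples (watts)
# at timestamps (seconds), trapezoidal integration yields joules;
# divide by 3600 for watt-hours.

def energy_joules(timestamps, power_watts):
    total = 0.0
    for (t0, p0), (t1, p1) in zip(zip(timestamps, power_watts),
                                  zip(timestamps[1:], power_watts[1:])):
        total += 0.5 * (p0 + p1) * (t1 - t0)  # trapezoid area per segment
    return total

# Example: a device drawing a constant 2 W for 10 s consumes 20 J.
ts = [0, 5, 10]
ps = [2.0, 2.0, 2.0]
print(energy_joules(ts, ps))  # 20.0
```

This distinction matters for the measurements that follow: a faster chip may draw more power yet finish sooner, so only energy (the integral) reveals which configuration is actually cheaper.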
Multicore Hardware (revisited: the same seven dies, each at correct scale)
Workloads: 61 benchmarks, native and managed, scalable and non-scalable
Native non-scalable: 27 C, C++, Fortran (SPEC CPU2006)
Native scalable: 11 C, C++ (PARSEC)
Java non-scalable: 18 Java (SPEC jvm98, DaCapo, pjbb2005)
Java scalable: 5 Java (DaCapo)
Power vs. Performance
[Scatter plot: power (W) vs. performance relative to a reference, for Pentium 4 (130nm, 2003), Core 2 Duo (65nm, 2006), Core 2 Duo (45nm, 2008), i7 (45nm, 2008), and i5 (32nm, 2010).]
Power is benchmark dependent.
Parallelism did not solve the power problem

Web Page Energy on Windows Phone
[Bar chart: load energy (mWh) and data downloaded (bytes) for amazon, apple, baidu, bing, ebay, google, msn, paypal, wikipedia, and youtube.]
What’s next in hardware?
Pareto Analysis (45nm)
[Plot: normalized group energy vs. group performance / group reference performance.]
Workload determines the energy-efficient architecture.
Workload determines the energy-efficient architecture
Pareto-efficient 45nm configurations range from Atom 1C2T@1.7GHz through Core 2 Duo 2C1T at 1.6 and 3.1GHz to i7 configurations from 1C1T to 4C2T at 1.6-2.7GHz, with and without Turbo Boost.
[Table: each workload class (native non-scalable, native scalable, Java non-scalable, Java scalable) checks a different subset of 4-7 configurations; no single configuration is efficient for all.]
Parallelism & Heterogeneity
[Diagram: chips composed of big cores, small cores, and custom logic.]
Single-ISA Heterogeneity: Motivation
NVIDIA Tegra 3: 5 Cortex A9 cores (4 x 1.4 GHz, 1 x 500 MHz)
Texas Instruments OMAP5432: 2 Cortex A15 + 2 Cortex M4
Heterogeneous Hardware
Complexity vs. energy efficiency
Heterogeneous parallel hardware + software
July 31, 1922: train wreck at Laurel, Maryland [Washington Post, August 1, 1922]
Exploiting Heterogeneity: Parallelism, Ubiquity, Differentiation
Case Studies
Mobile UI [UIST’13]
Managed runtime [ISCA’12]
Always awake [NSDI’12]
Interactive cloud [ICAC’13]
Case Studies
Mobile UI [UIST’13]
Managed runtime [ISCA’12]
Always awake [NSDI’12]
Interactive cloud [ICAC’13]
User Interface I/O on OMAP big/little
Kihm, Guimbretière UIST’13
[Diagram: the app runs on A9 big cores; keyboard, scrolling, and inking tasks run on M3 little cores.]
User Interface I/O & Heterogeneity
Characteristics of UI I/O: parallelism, ubiquity, differentiation
A9+M3 Heterogeneity
[Bar chart: battery-life increase for scrolling, pen inking, and virtual keyboard under four configurations: A9; A9 + display control; A9 + M3 dispatch; A9 + M3 execute.]
Case Studies
Mobile UI [UIST’13]
Managed runtime [ISCA’12]
Always awake [NSDI’12]
Interactive cloud [ICAC’13]
VM Services on little cores (Cao et al., ISCA'12)
[Diagram: application threads run on big cores; VM services (GC + JIT) run on little cores.]
VM Services & Heterogeneity
Characteristics of VM services: parallelism, ubiquity, differentiation?
[Bar chart: performance, power, energy, and PPE for GC, JIT, and GC & JIT, normalized to one core at 2.8 GHz; measured (filled) vs. model (empty) results for 2.8 GHz AMD + 2.8 GHz AMD and 2.8 GHz AMD + 1.66 GHz Atom.]
Case Studies
Mobile UI [UIST’13]
Managed runtime [ISCA’12]
Always awake [NSDI’12]
Interactive cloud [ICAC’13]
Somniloquy: application stubs on little cores wake up the big core as needed (Agarwal et al., NSDI'12)
[Diagram: the laptop host (big core, RAM, peripherals, …) runs the OS, network stack, and apps; the network interface adds a little core with CPU + DRAM + Flash running an embedded OS, its own network stack, app stubs, and wakeup filters. A Somniloquy daemon handles application filtering, notifications, downloads, and keep-alives.]
Big: Lenovo X60 laptop; little: gumstix
[Bar chart: power consumption in watts for the big core, the big core at low power, Somniloquy, and little-core sleep.]
Case Studies
Mobile UI [UIST'13] ✔
Managed runtime [ISCA'12] ✔
Always awake [NSDI'12] ✔
Interactive cloud [ICAC'13]
One-time improvements
Case Studies
Mobile UI [UIST’13]
Managed runtime [ISCA’12]
Always awake [NSDI’12]
Interactive cloud [ICAC’13]
Interactive Cloud Services: Bing, Finance, Recommendations, Games (Ren et al., ICAC'13)
Characteristics of interactive services: parallelism, ubiquity, differentiation?
1. Responsiveness deadline of ~100ms = interactive
2. Partial execution trades quality for responsiveness
Interactive Services: Workload Characterization of Bing Search
[Plot: quality vs. completion ratio.]
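The idea of partial execution under a responsiveness deadline can be sketched as an "anytime" search that refines its answer until the deadline expires, then returns the best partial result. This is a hypothetical illustration, not Bing's implementation; all names and data are invented.

```python
# Sketch of partial execution: process index shards until the deadline,
# then return whatever has been found. Quality grows with completion.
import time

def anytime_search(query_terms, index_shards, deadline_s=0.1):
    """Return (matching docs, completion ratio) within the deadline."""
    start = time.monotonic()
    results, processed = [], 0
    for shard in index_shards:
        if time.monotonic() - start > deadline_s:
            break  # deadline hit: respond now with a partial, lower-quality answer
        results.extend(d for d in shard if all(t in d for t in query_terms))
        processed += 1
    completion = processed / len(index_shards)
    return results, completion

shards = [["big little cores", "dark silicon"], ["dennard scaling"]]
hits, completion = anytime_search(["silicon"], shards)
print(hits, completion)  # ['dark silicon'] 1.0
```

A completion ratio below 1.0 trades answer quality for meeting the ~100ms responsiveness target, which is exactly the quality-vs-completion curve the characterization measures.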
Data Centers are Power Limited
Homogeneous options at ~88 watts: 5 SMT Nehalem cores (i5 670, 32nm) or 22 SMT Atom D cores (Bonnell, 32nm)
Homogeneous Cores: throughput or quality, but not both!
[Plot: quality vs. queries per second for 5 Nehalems and 22 Atoms against the quality goal.]
Interactive Services: Workload Characterization of Bing Search
1. Responsiveness deadline of ~100ms = interactive
2. Partial execution trades quality for responsiveness
3. Unknown, highly variable service demand
[Plots: quality vs. completion ratio; demand distribution over demand (ms).]
Differentiated & Heterogeneous: 3 SMT Nehalem + 9 SMT Atom cores
Slow to Fast scheduling
[Diagram: short jobs and long jobs scheduled from slow to fast cores.]
Long jobs: no difference between slow-to-fast and fast-to-slow.
Short jobs: slow-to-fast consumes less energy.
Unknown service demand: long jobs migrate to fast cores.
FOF: Fast Old & First Scheduler
1. Schedule the fastest core first.
2. Promote older jobs to faster cores.
[Diagram: big, medium, and small cores.]
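The two rules above can be sketched in a few lines. This is a minimal illustration of the policy, not the paper's implementation: core speeds and job names are hypothetical, and a real scheduler would also handle queuing, SMT sharing, and migration costs.

```python
# Minimal sketch of the FOF policy: new jobs go to the fastest idle core;
# when a fast core frees up, the oldest job on a strictly slower core is
# promoted to it.
class FofScheduler:
    def __init__(self, core_speeds):
        # Cores sorted fastest-first; each slot holds (arrival, job) or None.
        self.speeds = sorted(core_speeds, reverse=True)
        self.slots = [None] * len(self.speeds)

    def submit(self, job, arrival):
        """Rule 1: schedule the fastest idle core first."""
        for i in range(len(self.slots)):
            if self.slots[i] is None:
                self.slots[i] = (arrival, job)
                return i
        return None  # all cores busy; a real system would queue the job

    def finish(self, core):
        """Rule 2: on completion, promote the oldest job from a slower core."""
        self.slots[core] = None
        slower = [i for i in range(len(self.slots))
                  if self.slots[i] is not None
                  and self.speeds[i] < self.speeds[core]]
        if slower:
            oldest = min(slower, key=lambda i: self.slots[i][0])
            self.slots[core], self.slots[oldest] = self.slots[oldest], None

sched = FofScheduler([3.3, 3.3, 1.6, 1.6])  # two fast cores, two slow cores
for t, job in enumerate(["a", "b", "c", "d"]):
    sched.submit(job, t)
sched.finish(0)  # a fast core frees; oldest slow-core job "c" is promoted
print(sched.slots)  # [(2, 'c'), (1, 'b'), None, (3, 'd')]
```

Because service demand is unknown, seniority stands in for length: any job that has lived long enough to be promoted is, by that fact, a long job that belongs on a fast core.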
Data Centers are Power Limited
Homogeneous: 5 SMT Nehalem cores (i5 670, 32nm) or 22 SMT Atom D cores (Bonnell, 32nm) at ~88 watts
Heterogeneous: 3 SMT Nehalem + 9 SMT Atom cores
FOF Heterogeneous Cores: throughput and quality!
[Plot: quality vs. queries per second for 5 Nehalems, 22 Atoms, and the heterogeneous 3 Nehalems + 9 Atoms against the quality goal.]
50% throughput increase, or buy 33% fewer servers.
Simultaneous Multithreading = Dynamic Heterogeneity
[Diagram, for a 4-core, 2-way SMT chip: 4 fast cores (SMT off); 3 fast + 2 slow (1 SMT core); 2 fast + 4 slow (2 SMT cores); 8 slow (all SMT).]
FOF Scheduler for SMT
1. Schedule the fastest core first; fastest = unshared, else share with the youngest job.
2. Promote older jobs to faster cores: given a free core, find a sharing pair (oldest, other) and move the other job to the free core.
Slow to Fast on SMT: Monte Carlo Finance Server
6-core, 2-way SMT, 3.33 GHz Intel Xeon
[Plot: quality vs. queries per second for SMT off, SMT + Share Old, and SMT + FOF: a 16% improvement.]
Slow to Fast for Energy: start on slow
Monte Carlo Finance Server, 6-core, 2-way SMT, 3.33 GHz Intel Xeon
[Plot: energy (joules) vs. average latency (seconds) for Homogeneous, Deadline Only, and Slow to Fast.]
Key Analytical Results
Slow to Fast Theorem: an optimal solution migrates jobs from slower to faster cores to minimize energy.
Heterogeneity Theorem: more heterogeneity (a higher ratio of fastest to slowest core speed) is desirable for higher p in the Lp norm (the tail!).
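Why a higher p in the Lp norm "means the tail" can be shown in a few lines: as p grows, the norm of a latency distribution is increasingly dominated by its largest element. The latencies below are invented for illustration.

```python
# Sketch: the Lp norm of a latency distribution approaches the maximum
# (the worst-case tail request) as p grows.
latencies = [0.05, 0.06, 0.07, 0.30]  # seconds; one slow "tail" request

def lp_norm(xs, p):
    """Lp norm: (sum of x^p) ^ (1/p)."""
    return sum(x ** p for x in xs) ** (1.0 / p)

for p in (1, 2, 8, 32):
    print(p, round(lp_norm(latencies, p), 3))
# As p grows, the norm converges to the 0.30 s tail request,
# so optimizing a high-p norm is optimizing tail latency.
```

This is why the theorem ties heterogeneity to the tail: a wider fast-to-slow speed ratio gives the scheduler more headroom to rescue exactly the requests that dominate a high-p objective.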
Software Challenges in a Power-Constrained World
Optimizing performance, power, & energy
Software/hardware co-design
Software portability
Programming models & abstractions [OOPSLA'13]
New Virtuous Cycle?
The same cycle, with specialization replacing the doubling of transistors and new interfaces & abstractions replacing the sequential interface: device innovation → specialization → hardware complexity → new interfaces & abstractions → software innovation → software complexity.
Collaborators on this work
Steve Blackburn, Ting Cao, Hadi Esmaeilzadeh, Ivan Jibaja
Yuxiong He, Shaolei Ren, Sameh Elnikety, Xi Yang, Yong Hun Eom
Todd Mytkowicz, James Bornholt, Aman Kansal
The Future?
Thank you
Bibliography
• Exploiting Processor Heterogeneity in Interactive Systems, S. Ren, Y. He, S. Elnikety, and K. S. McKinley, USENIX International Conference on Autonomic Computing (ICAC), San Jose, CA, June 2013.
• Asymmetric Cores for Low Power User Interface Systems, J. Kihm and F. Guimbretière, ACM Symposium on User Interface Software and Technology (UIST), October 2013. Poster.
• The Yin and Yang of Hardware Heterogeneity: Can Software Survive?, K. S. McKinley, ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA), Indianapolis, IN, October 2013.
• The New Global Ecosystem in Advanced Computing: Implications for U.S. Competitiveness and National Security, Committee on Global Approaches to Advanced Computing (member), Board on Global Science and Technology, Policy and Global Affairs Division, National Research Council of the National Academies, The National Academies Press, Washington, D.C., 2012.
• The Model Is Not Enough: Understanding Energy Consumption in Mobile Devices, J. Bornholt, T. Mytkowicz, and K. S. McKinley, Hot Chips, San Jose, CA, August 2012.
• Looking Back and Looking Forward: Power, Performance, and Upheaval, H. Esmaeilzadeh, T. Cao, X. Yang, S. M. Blackburn, and K. S. McKinley, Communications of the ACM (CACM), Research Highlights, 55(7), July 2012.
• The Yin and Yang of Power and Performance for Asymmetric Hardware and Managed Software, T. Cao, T. Gao, S. M. Blackburn, and K. S. McKinley, ACM/IEEE International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012.
• Somniloquy: Augmenting Network Interfaces to Reduce PC Energy Usage, Y. Agarwal, S. Hodges, R. Chandra, J. Scott, P. Bahl, and R. Gupta, USENIX Symposium on Networked Systems Design & Implementation (NSDI), April 2009.
A Byte of My Story
ACM Fellow
Congressional Testimony
Mentors
Family
I fail a lot
Rejected job applications: 1984 (all), 1993 (8 of 11), 2011 (4 of 8)
Failed PhD qualifying exam
First three grant applications rejected
Most cited paper rejected 3 times
Rejected papers, grants, papers, …
I learn & persist