Dally's Keynote
Transcript of Dally's Keynote
Challenges for Future Computing Systems Bill Dally | Chief Scientist and SVP, Research NVIDIA | Professor (Research), EE&CS, Stanford
Exascale Computing Will Enable Transformational Science
Climate
Comprehensive Earth System Model at 1KM scale, enabling modeling of cloud convection and ocean eddies.
Combustion
First-principles simulation of combustion for new high-efficiency, low-emision engines.
Biology
Coupled simulation of entire cells at molecular, genetic, chemical and biological levels.
Astrophysics
Predictive calculations for thermonuclear and core-collapse supernovae, allowing confirmation of theoretical models.
Exascale Computing Will Enable Transformational Science
High-Performance Computers are
Scientific Instruments
18,688 NVIDIA Tesla K20X GPUs 27 Petaflops Peak: 90% of Performance from GPUs 17.59 Petaflops Sustained Performance on Linpack
TITAN
Tsubame KFC 4.5GFLOPS/W #1 on Green500 List
US to Build Two Flagship Supercomputers
Major Step Forward on the Path to Exascale
150-300 PFLOPS Peak Performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink High Speed Interconnect
40 TFLOPS per Node, >3,400 Nodes
2017
SUMMIT SIERRA
Scientific Discovery and Business Analytics
Driving an Insatiable Demand for More Computing Performance
2 Trends
The End of Dennard Scaling
Pervasive Parallelism
2 Trends
The End of Dennard Scaling
Pervasive Parallelism
Moore’s Law doubles processor performance every few years, right?
Moore’s Law doubles processor performance every few years, right?
Wrong!
Moore’s Law gives us transistors Which we used to turn into scalar performance
Moore, Electronics 38(8) April 19, 1965
Moore’s Law is Only Part of the Story
2001: 42M Tx 2004: 275M Tx
1993: 3M Tx
2010: 3B Tx 2007: 580M Tx
1997: 7.5M Tx
2012: 7B Tx
1
1.5
2
2.5
3
1 1.5 2 2.5 3
Chip
Pow
er
Chip Capability
Classic Dennard Scaling 2.8x chip capability in same power
2x more transistors
Scale chip features down 0.7x per process generation
1.4x faster transistors
0.7x voltage
0.7x capacitance
But L3 energy scaling ended in 2005
Moore, ISSCC Keynote, 2003
Post Dennard Scaling 2x chip capability at 1.4x power
1.4x chip capability at same power
1
1.5
2
2.5
3
1 1.5 2 2.5 3
Chip
Pow
er
Chip Capability
2x more transistors
0.7x capacitance
Transistors are no faster Static leakage limits reduction in Vth => Vdd stays constant
Reality isn’t even this good 1.8x chip capability at 1.5x power 1.2x chip capability at same power
1
1.5
2
2.5
3
1 1.5 2 2.5 3
Chi
p Po
wer
Chip Capability
1.8x more transistors 0.85x energy
Restrictive geometry rules reduce transistor count Energy scales slower than linearly
“Moore’s Law gives us more transistors…
Dennard scaling made them useful.”
Bob Colwell, DAC 2013, June 4, 2013
ISAT LCC: 24
Also, ILP was ‘mined out’ in 2000
1e-41e-31e-21e-11e+01e+11e+21e+31e+41e+51e+61e+7
1980 1990 2000 2010 2020
Perf (ps/Inst)Linear (ps/Inst)
19%/year 30:1
1,000:1 30,000:1
Dally et al. “The Last Classical Computer”, ISAT Study, 2001
Result: The End of Historic Scaling
C Moore, Data Processing in ExaScale-ClassComputer Systems, Salishan, April 2011
The End of Dennard Scaling
! Processors aren’t getting faster, just wider ! Future gains in performance are from parallelism
! Future systems are energy limited ! Efficiency IS Performance
! Process matters less ! One generation is 1.2x, not 2.8x
Its not about the FLOPs
16nm chip, 10mm on a side, 200W
DFMA 0.01mm2 10pJ/OP – 2GFLOPs A chip with 104 FPUs: 100mm2
200W 20TFLOPS Pack 50,000 of these in racks 1EFLOPS 10MW
Overhead
Locality
CPU 1690 pJ/flop Optimized for Latency
Caches
Westmere 32 nm
GPU 140 pJ/flop
Optimized for Throughput Explicit Management of On-chip Memory
Kepler 28 nm
Fixed-Function Logic is Even More Efficient
Energy/Op CPU 1.7nJ GPU 140pJ Fixed-Function 10pJ
How is Power Spent in a CPU?
In-order Embedded OOO Hi-perf
Clock + Control Logic 24%
Data Supply 17%
Instruction Supply 42%
Register File 11%
ALU 6% Clock + Pins
45%
ALU 4%
Fetch 11%
Rename 10%
Issue 11%
RF 14%
Data Supply 5%
Dally [2008] (Embedded in-order CPU) Natarajan [2003] (Alpha 21264)
Overhead 980pJ
Payload Arithmetic
20pJ
4/11/11 Milad Mohammadi 33
ORF ORFORF
LS/BRFP/IntFP/Int
To LD/ST
L0AddrL1Addr
Net
LM Bank
0
To LD/ST
LM Bank
3
RFL0AddrL1Addr
Net
RF
Net
DataPath
L0I$
Thre
ad P
CsAc
tive
PCs
Inst
ControlPath
Sche
duler
64 threads4 active threads2 DFMAs (4 FLOPS/clock)ORF bank: 16 entries (128 Bytes)L0 I$: 64 instructions (1KByte)LM Bank: 8KB (32KB total)
Simpler Cores = Energy Efficiency
Source: Azizi [PhD 2010]
Overhead 20pJ
Payload Arithmetic
20pJ
SIMT Lanes
Streaming Multiprocessor (SM)
Warp Scheduler
Shared Memory 32 KB
15% of SM Energy Main Register File 32 banks
ALU SFU MEM TEX
Hierarchical Register File
0%
20%
40%
60%
80%
100%
Perc
ent
of A
ll Va
lues
Pro
duce
d
Read >2 Times
Read 2 Times
Read 1 Time
Read 0 Times
0%
20%
40%
60%
80%
100%
Perc
ent
of A
ll Va
lues
Pro
duce
d
Lifetime >3
Lifetime 3
Lifetime 2
Lifetime 1
Register File Caching (RFC)
S F U
M E M
T E X
Operand Routing
Operand Buffering
MRF 4x128-bit Banks (1R1W)
RFC 4x32-bit (3R1W) Banks
ALU
Energy Savings from RF Hierarchy 54% Energy Reduction
Source: Gebhart, et. al (Micro 2011)
Temporal SIMT
32-wide datapath
time
1 cy
c
1 warp instruction = 32 threads
thread 0
thread 31
ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ld ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
ml ml ml ml ad ad ad ad st st st st
Spatial SIMT (current GPUs)
1-wide
time
1 cy
c
ld ld ld ld ld ld ld
0 (threads)
1 2 3 4 5 6 7 ld
ld ld
8 9
ld 10
Pure Temporal SIMT
Temporal SIMT Optimizations
Control divergence — hybrid MIMD/SIMT
Scalarization
Factor common instructions from multiple threads Execute once – place results in common registers
32-wide (41%)
4-wide (65%)
1-wide (100%)
Scalar Instructions in SIMT Lanes
Scalar instruction spanning warp
Scalar register visible to all threads
Temporal execution of Warp Multiple
lanes/warps Y. Lee, CGO 2013
Scalarization Eliminates Work
0.0!
0.2!
0.4!
0.6!
0.8!
1.0!
1.2!
mm!
nn!
lavaMD!
gaussian! bfs
!
backprop!
streamcluster!
lud!
srad-v1! fft!
srad-v2!
mri-q!
cutcp!
nw!
hotspot! lbm
!cfd!
stencil!
leukocyte!
pathfinder!
heartwall!
spmv!
kmeans!
average!
(b) O
ps e
xecu
ted!
!
Series2! Series3! Series4! Series5! Series1!
Each group shows results for warp sizes of 4,8,16,32
Microarchitecture Action Savings (WS=32)
Instructions issued 41%
Operations executed 29%
Registers read 31%
Registers written 30%
Memory addresses and cache tag checks 47%
Memory data elements accessed 38%
Y. Lee, CGO 2013
64-bit DP 20pJ 26 pJ 256 pJ
1 nJ
500 pJ Efficient off-chip link
256-bit buses
16 nJ DRAM Rd/Wr
256-bit access 8 kB SRAM 50 pJ
20mm
Communication Dominates Arithmetic
Processor Technology 40 nm 10nm
Vdd (nominal) 0.9 V 0.7 V
DFMA energy 50 pJ 7.6 pJ
64b 8 KB SRAM Rd 14 pJ 2.1 pJ
Wire energy (256 bits, 10mm) 310 pJ 174 pJ
Memory Technology 45 nm 16nm
DRAM interface pin bandwidth 4 Gbps 50 Gbps
DRAM interface energy 20-30 pJ/bit 2 pJ/bit
DRAM access energy 8-15 pJ/bit 2.5 pJ/bit
Keckler [Micro 2011], Vogelsang [Micro 2010]
Energy Shopping List
FP Op lower bound =
4 pJ
● ● ● ● ● ● ● ● ●●●● ● ● ● ● ●
●●●
●●
●
●●
●
●
0.6 0.8 1.0 1.2 1.4 1.6 1.8
050
100
150
200
250
300
Maximum Frequency (GHz)
En
erg
y p
er
bit
per
mm
(fJ
)
● FSI
LSI (200 mV)
LSI (400 mV)
CDI
SCI
DRAM Wordline Segmentation ! Local Wordline Segmentation
(LWS) ! Activate a portion of each mat ! Activate the same segment in all
mats
! Master Wordline Segmentation (MWS) ! Activate subset of mats ! New data layout such that a DRAM
atom resides in the activated mats ! Multiple narrow subchannels in
each channel
LWS
MWS
Activation Energy Reduction
! LWS+MWS: up to 73%, average 68%
! LWS: 26% Reductions in DRAM Row Energy
! MWS: 49%
0
2
4
6
8
10
12
bitonic-sort
debayer
fft-1d_k0
fft-2d_k0
fft-2d_k1
fft-2d_k2
fft-2d_k3
gpusv
m_k0
gpusv
m_k1
gpusv
m_k2
gpusv
m_k3
img_
reg_
k0
img_
reg_
k1
img_
reg_
k2
img_
reg_
k3
PERFECT (avg
)
Rodin
ia (avg
)
pJ/bit
Row Energy Consumption (per bit transferred to/from DRAM)
Baseline LWS_4 MWS_4 MWS_4 + LWS_4
GRS Test Chips
Probe Station
Test Chip #1 on Board
Test Chip #2 fabricated on production GPU
Eye Diagram from Probe Poulton et al. ISSCC 2013, JSSCC Dec 2013
Circuits: Ground Referenced Signaling
On-chip Signaling ! Testsite on GM2xx GPU ! <45fJ/bit-mm
TX
Repeater
RX
Goal: reduce Energy/bit 200fJ/bit-mm g 20fJ/bit-mm
With GPU noise
Without GPU noise
Measurement Results
The Locality Challenge Data Movement Energy 77pJ/F
0
50
100
150
200
250
0
1
2
3
4
5
6
7
8
9
1K 32K 1M 32M 1G
pJ/B
pJ
/Fx5
B/F
B/F
pJ/B
pJ/F x5
Optimized Architecture & Circuits 77pJ/F -> 18pJ/F
0
5
10
15
20
25
30
0
1
2
3
4
5
6
7
8
9
1K 32K 1M 32M 1G
pJ/B
pJ
/Fx5
B/F
B/F
pJ/B
pJ/F x5
Moore’s Law is Only Part of the Story
2001: 42M Tx 2004: 275M Tx
1993: 3M Tx
2010: 3B Tx 2007: 580M Tx
1997: 7.5M Tx
2012: 7B Tx
2 Trends
The End of Dennard Scaling
Pervasive Parallelism
Processors aren’t getting faster just wider
Thread Count (CPU+GPU)
2013 (28nm) 2020 (7nm)
Cell Phone 4+1,000 16+4,000
Server Node 12+10,000 32+40,000
Supercomputer 105+108 106+109
Parallel programming is not inherently any more difficult than serial programming However, we can make it a lot more difficult
A simple parallel program
!forall molecule in set { // launch a thread array! forall neighbor in molecule.neighbors { // nested! forall force in forces { // doubly nested! molecule.force = ! reduce_sum(force(molecule, neighbor))! }! }!}!
Why is this easy?
!forall molecule in set { // launch a thread array! forall neighbor in molecule.neighbors { // nested! forall force in forces { // doubly nested! molecule.force = ! reduce_sum(force(molecule, neighbor))! }! }!}!
No machine details All parallelism is expressed Synchronization is semantic (in reduction)
We could make it hard
!pid = fork() ; // explicitly managing threads!!lock(struct.lock) ; // complicated, error-prone synchronization!// manipulate struct!unlock(struct.lock) ;!!code = send(pid, tag, &msg) ; // partition across nodes!
Programmers, tools, and architecture Need to play their positions
Programmer
Architecture Tools
!forall molecule in set { // launch a thread array! forall neighbor in molecule.neighbors { //! forall force in forces { // doubly nested! molecule.force = ! reduce_sum(force(molecule, neighbor))! }! }!}!
Map foralls in time and space Map molecules across memories Stage data up/down hierarchy Select mechanisms
Exposed storage hierarchy Fast comm/sync/thread mechanisms
Target-Independent
Source
Mapping Tools
Target-Dependent Executable
Profiling & Visualization
Mapping Directives
Optimized Circuits 77pJ/F -> 18pJ/F
0
5
10
15
20
25
30
0
1
2
3
4
5
6
7
8
9
1K 32K 1M 32M 1G
pJ/B
pJ
/Fx5
B/F
B/F
pJ/B
pJ/F x5
Autotuned Software 18pJ/F -> 9pJ/F
0
5
10
15
20
25
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
1K 32K 1M 32M 1G
pJ/B
pJ
/Fx5
B/F
B/F
pJ/B
pJ/F x5
Conclusion
! Power-limited: from data centers to cell phones ! Perf/W is Perf
! Throughput cores ! Reduce overhead
! Data movement ! Circuits: 200 -> 20 ! Optimized software
! Parallel programming is simple – we can make it hard ! Target-independent programming – mapping via tools