Dally's Keynote

Challenges for Future Computing Systems
Bill Dally | Chief Scientist and SVP, Research, NVIDIA | Professor (Research), EE&CS, Stanford

Transcript of Dally's Keynote

Page 1: Dally's Keynote

Challenges for Future Computing Systems
Bill Dally | Chief Scientist and SVP, Research, NVIDIA | Professor (Research), EE&CS, Stanford

Page 2: Dally's Keynote

Exascale Computing Will Enable Transformational Science

Page 3: Dally's Keynote

Climate

Comprehensive Earth System Model at 1KM scale, enabling modeling of cloud convection and ocean eddies.

Page 4: Dally's Keynote

Combustion

First-principles simulation of combustion for new high-efficiency, low-emission engines.

Page 5: Dally's Keynote

Biology

Coupled simulation of entire cells at molecular, genetic, chemical and biological levels.

Page 6: Dally's Keynote

Astrophysics

Predictive calculations for thermonuclear and core-collapse supernovae, allowing confirmation of theoretical models.

Page 7: Dally's Keynote

Exascale Computing Will Enable Transformational Science

High-Performance Computers are Scientific Instruments

Page 8: Dally's Keynote

TITAN

18,688 NVIDIA Tesla K20X GPUs
27 Petaflops Peak: 90% of Performance from GPUs
17.59 Petaflops Sustained Performance on Linpack

Page 9: Dally's Keynote

Tsubame KFC: 4.5 GFLOPS/W, #1 on the Green500 List

Page 10: Dally's Keynote

US to Build Two Flagship Supercomputers

Major Step Forward on the Path to Exascale

150-300 PFLOPS Peak Performance

IBM POWER9 CPU + NVIDIA Volta GPU

NVLink High Speed Interconnect

40 TFLOPS per Node, >3,400 Nodes

2017

SUMMIT SIERRA

Page 11: Dally's Keynote

Scientific Discovery and Business Analytics

Driving an Insatiable Demand for More Computing Performance

Page 12: Dally's Keynote
Page 13: Dally's Keynote

2 Trends

The End of Dennard Scaling

Pervasive Parallelism

Page 14: Dally's Keynote

2 Trends

The End of Dennard Scaling

Pervasive Parallelism

Page 15: Dally's Keynote

Moore’s Law doubles processor performance every few years, right?

Page 16: Dally's Keynote

Moore’s Law doubles processor performance every few years, right?

Wrong!

Page 17: Dally's Keynote

Moore's Law gives us transistors, which we used to turn into scalar performance.

Moore, Electronics 38(8) April 19, 1965

Page 18: Dally's Keynote

Moore’s Law is Only Part of the Story

1993: 3M Tx
1997: 7.5M Tx
2001: 42M Tx
2004: 275M Tx
2007: 580M Tx
2010: 3B Tx
2012: 7B Tx

Page 19: Dally's Keynote

[Plot: Chip Power vs. Chip Capability]

Classic Dennard Scaling: 2.8x chip capability in the same power

Scale chip features down 0.7x per process generation:
2x more transistors
1.4x faster transistors
0.7x voltage
0.7x capacitance

Page 20: Dally's Keynote

But energy scaling ended in 2005

Moore, ISSCC Keynote, 2003

Page 21: Dally's Keynote

Post-Dennard Scaling: 2x chip capability at 1.4x power; 1.4x chip capability at the same power

[Plot: Chip Power vs. Chip Capability]

2x more transistors
0.7x capacitance
Transistors are no faster
Static leakage limits reduction in Vth => Vdd stays constant

Page 22: Dally's Keynote

Reality isn't even this good: 1.8x chip capability at 1.5x power; 1.2x chip capability at the same power

[Plot: Chip Power vs. Chip Capability]

1.8x more transistors, 0.85x energy per op

Restrictive geometry rules reduce the transistor count. Energy scales slower than linearly.
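
To make the arithmetic behind the last three charts explicit, here is a minimal host-side C++ sketch (an editor's illustration, not from the talk) that multiplies out the per-generation factors quoted above; it reproduces, to rounding, the 2.8x-in-the-same-power, 2x-at-1.4x-power, and 1.8x-at-1.5x-power figures.

#include <cstdio>

int main() {
    struct Regime { const char *name; double tx, freq, cap, vdd, energy; };
    // Per-generation factors quoted on the preceding slides; "energy" models
    // the observed case, where energy per op scales only to 0.85x.
    const Regime regimes[] = {
        { "Classic Dennard", 2.0, 1.4, 0.7, 0.7, 1.0  },
        { "Post-Dennard",    2.0, 1.0, 0.7, 1.0, 1.0  },
        { "Observed",        1.8, 1.0, 1.0, 1.0, 0.85 },
    };
    for (const Regime &r : regimes) {
        double capability = r.tx * r.freq;                             // more units x faster clock
        double power = capability * r.cap * r.vdd * r.vdd * r.energy;  // op rate x C*V^2 energy per op
        std::printf("%-16s %.1fx capability at %.2fx power (%.1fx at constant power)\n",
                    r.name, capability, power, capability / power);
    }
    return 0;
}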

Page 23: Dally's Keynote

“Moore’s Law gives us more transistors…

Dennard scaling made them useful.”

Bob Colwell, DAC 2013, June 4, 2013

Page 24: Dally's Keynote


Also, ILP was ‘mined out’ in 2000

[Plot: processor performance (ps/Inst) vs. year, 1980-2020; the trend slows to 19%/year, with annotated gaps of 30:1, 1,000:1, and 30,000:1]

Dally et al. “The Last Classical Computer”, ISAT Study, 2001

Page 25: Dally's Keynote

Result: The End of Historic Scaling

C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011

Page 26: Dally's Keynote

The End of Dennard Scaling

Processors aren't getting faster, just wider
  Future gains in performance are from parallelism
Future systems are energy limited
  Efficiency IS Performance
Process matters less
  One generation is 1.2x, not 2.8x

Page 27: Dally's Keynote

It's not about the FLOPs

16nm chip, 10mm on a side, 200W

DFMA: 0.01 mm², 10 pJ/op, 2 GFLOPS
A chip with 10^4 FPUs: 100 mm², 200 W, 20 TFLOPS
Pack 50,000 of these in racks: 1 EFLOPS, 10 MW
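
A quick back-of-the-envelope check of those numbers (host-side sketch, using only the per-FPU figures quoted on the slide):

#include <cstdio>

int main() {
    const double fpu_area_mm2  = 0.01;   // DFMA area
    const double fpu_energy_pj = 10.0;   // energy per op
    const double fpu_gflops    = 2.0;    // throughput per FPU
    const double fpus_per_chip = 1e4;
    const double chips         = 50000;  // packed into racks

    double chip_area   = fpus_per_chip * fpu_area_mm2;               // 100 mm^2 of FPUs
    double chip_tflops = fpus_per_chip * fpu_gflops / 1e3;           // 20 TFLOPS
    double chip_watts  = chip_tflops * 1e12 * fpu_energy_pj * 1e-12; // 200 W

    std::printf("chip:   %.0f mm^2 of FPUs, %.0f TFLOPS, %.0f W\n",
                chip_area, chip_tflops, chip_watts);
    std::printf("system: %.1f EFLOPS, %.0f MW\n",
                chips * chip_tflops / 1e6, chips * chip_watts / 1e6);
    return 0;
}

The FLOPs themselves fit in the power budget; as the next slides argue, the hard part is the overhead and data movement around them.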

Page 28: Dally's Keynote

Overhead

Locality

Page 29: Dally's Keynote

CPU (Westmere, 32 nm): 1690 pJ/flop, optimized for latency, caches

GPU (Kepler, 28 nm): 140 pJ/flop, optimized for throughput, explicit management of on-chip memory

Page 30: Dally's Keynote

Fixed-Function Logic is Even More Efficient

Energy/Op: CPU 1.7 nJ, GPU 140 pJ, Fixed-Function 10 pJ

Page 31: Dally's Keynote

How is Power Spent in a CPU?

In-order embedded CPU (Dally [2008]):
  Clock + Control Logic 24%
  Data Supply 17%
  Instruction Supply 42%
  Register File 11%
  ALU 6%

Out-of-order high-performance CPU (Alpha 21264, Natarajan [2003]):
  Clock + Pins 45%
  ALU 4%
  Fetch 11%
  Rename 10%
  Issue 11%
  RF 14%
  Data Supply 5%

Page 32: Dally's Keynote

Overhead: 980 pJ
Payload arithmetic: 20 pJ

Page 33: Dally's Keynote


Page 34: Dally's Keynote

[Diagram: throughput core datapath and control path, with operand register files (ORFs), FP/Int and LS/BR units, L0/L1 address stages, local-memory (LM) banks, an L0 I$, thread PCs, and a scheduler]

64 threads, 4 active threads
2 DFMAs (4 FLOPS/clock)
ORF bank: 16 entries (128 bytes)
L0 I$: 64 instructions (1 KByte)
LM bank: 8 KB (32 KB total)

Page 35: Dally's Keynote

Simpler Cores = Energy Efficiency

Source: Azizi [PhD 2010]

Page 36: Dally's Keynote

Overhead: 20 pJ
Payload arithmetic: 20 pJ

Page 37: Dally's Keynote

Streaming Multiprocessor (SM)
  Warp Scheduler
  Main Register File: 32 banks, 15% of SM energy
  SIMT Lanes: ALU, SFU, MEM, TEX
  Shared Memory: 32 KB

Page 38: Dally's Keynote

Hierarchical Register File

[Bar charts: percent of all values produced, broken down by number of reads (0, 1, 2, >2 times) and by lifetime (1, 2, 3, >3)]

Page 39: Dally's Keynote

Register File Caching (RFC)

[Diagram: MRF, 4x128-bit banks (1R1W); operand buffering and operand routing; RFC, 4x32-bit banks (3R1W); feeding the ALU and the SFU, MEM, and TEX units]

Page 40: Dally's Keynote

Energy Savings from RF Hierarchy: 54% Energy Reduction

Source: Gebhart et al. (MICRO 2011)

Page 41: Dally's Keynote

Temporal SIMT

[Diagram: Spatial SIMT (current GPUs) uses a 32-wide datapath, issuing 1 warp instruction = 32 threads (thread 0 through thread 31) in one cycle; Pure Temporal SIMT uses a 1-wide datapath, issuing the warp's threads one per cycle over time]

Page 42: Dally's Keynote

Temporal SIMT Optimizations

Control divergence: hybrid MIMD/SIMT

Scalarization: factor common instructions from multiple threads; execute once and place the results in common registers (see the sketch below)

32-wide (41%), 4-wide (65%), 1-wide (100%)
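
As an illustration of what scalarization factors out, here is a minimal CUDA sketch (a hypothetical kernel, not from the talk): every thread in a warp computes the same row base address, so one lane can compute it once and broadcast it, standing in for the common (scalar) register described above.

__global__ void scale_rows(float *data, int row_len, float alpha) {
    int row  = blockIdx.x;    // one block (one 32-thread warp) per row
    int lane = threadIdx.x;

    // Without scalarization, all 32 lanes would execute this identical
    // multiply. Here lane 0 executes it once and the result is broadcast,
    // mimicking a scalar register shared by the warp.
    int base = 0;
    if (lane == 0)
        base = row * row_len;
    base = __shfl_sync(0xffffffffu, base, 0);

    // The genuinely per-thread (vector) work stays per thread.
    for (int i = lane; i < row_len; i += 32)
        data[base + i] *= alpha;
}

In the architecture the talk describes, a compiler would do this factoring automatically and issue the common instruction once for the whole warp; the shuffle above is only a way to mimic that effect on current hardware.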

Page 43: Dally's Keynote

Scalar Instructions in SIMT Lanes

Scalar instruction spanning a warp
Scalar register visible to all threads
Temporal execution of a warp
Multiple lanes/warps

Y. Lee, CGO 2013

Page 44: Dally's Keynote

Scalarization Eliminates Work

[Bar chart: operations executed, normalized, for mm, nn, lavaMD, gaussian, bfs, backprop, streamcluster, lud, srad-v1, fft, srad-v2, mri-q, cutcp, nw, hotspot, lbm, cfd, stencil, leukocyte, pathfinder, heartwall, spmv, kmeans, and the average; each group shows results for warp sizes of 4, 8, 16, 32]

Microarchitecture action                   Savings (WS=32)
Instructions issued                        41%
Operations executed                        29%
Registers read                             31%
Registers written                          30%
Memory addresses and cache tag checks      47%
Memory data elements accessed              38%

Y. Lee, CGO 2013

Page 45: Dally's Keynote

Communication Dominates Arithmetic

[Diagram of a 20 mm chip: 64-bit DP op 20 pJ; 256-bit access to an 8 kB SRAM 50 pJ; 256-bit on-chip bus transfers 26 pJ and 256 pJ; efficient off-chip link 500 pJ; off-chip transfer 1 nJ; DRAM Rd/Wr 16 nJ]
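
Using the figures on this slide, a small sketch of the ratio it implies (assuming, as an illustration, that one 256-bit access supplies four 64-bit operands):

#include <cstdio>

int main() {
    const double dp_op_pj = 20.0;     // 64-bit DP op
    const double sram_pj  = 50.0;     // 256-bit access to an 8 kB SRAM
    const double dram_pj  = 16000.0;  // 256-bit DRAM read/write (16 nJ)
    const double words    = 4.0;      // 64-bit operands per 256-bit access

    std::printf("operand from local SRAM: %.1fx the op energy\n",
                (sram_pj / words) / dp_op_pj);
    std::printf("operand from DRAM:       %.0fx the op energy\n",
                (dram_pj / words) / dp_op_pj);
    return 0;
}

Feeding an operand from a nearby SRAM costs less than the op itself; feeding it from DRAM costs roughly two hundred times more, which is why the rest of the talk is about locality.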

Page 46: Dally's Keynote

Energy Shopping List

Processor: 40 nm -> 10 nm
  Vdd (nominal): 0.9 V -> 0.7 V
  DFMA energy: 50 pJ -> 7.6 pJ
  64b 8 KB SRAM Rd: 14 pJ -> 2.1 pJ
  Wire energy (256 bits, 10 mm): 310 pJ -> 174 pJ

Memory: 45 nm -> 16 nm
  DRAM interface pin bandwidth: 4 Gbps -> 50 Gbps
  DRAM interface energy: 20-30 pJ/bit -> 2 pJ/bit
  DRAM access energy: 8-15 pJ/bit -> 2.5 pJ/bit

FP op lower bound = 4 pJ

Keckler [Micro 2011], Vogelsang [Micro 2010]

Page 47: Dally's Keynote

[Plot: energy per bit per mm (fJ) vs. maximum frequency (GHz) for FSI, LSI (200 mV), LSI (400 mV), CDI, and SCI signaling]

Page 48: Dally's Keynote

DRAM Wordline Segmentation

Local Wordline Segmentation (LWS)
  Activate a portion of each mat
  Activate the same segment in all mats

Master Wordline Segmentation (MWS)
  Activate a subset of mats
  New data layout such that a DRAM atom resides in the activated mats
  Multiple narrow subchannels in each channel

[Diagrams: LWS, MWS]

Page 49: Dally's Keynote

Activation Energy Reduction

Reductions in DRAM row energy:
  LWS: 26%
  MWS: 49%
  LWS+MWS: up to 73%, average 68%

[Bar chart: row energy consumption in pJ/bit (per bit transferred to/from DRAM) for Baseline, LWS_4, MWS_4, and MWS_4 + LWS_4, across bitonic-sort, debayer, fft-1d_k0, fft-2d_k0-k3, gpusvm_k0-k3, img_reg_k0-k3, PERFECT (avg), and Rodinia (avg)]

Page 50: Dally's Keynote

GRS Test Chips

[Photos: probe station; Test Chip #1 on a board; Test Chip #2 fabricated on a production GPU; eye diagram from probe]

Poulton et al., ISSCC 2013 and JSSC, Dec. 2013

Page 51: Dally's Keynote

Circuits: Ground-Referenced Signaling

Goal: reduce energy/bit from 200 fJ/bit-mm to 20 fJ/bit-mm

On-chip signaling: test site on a GM2xx GPU, <45 fJ/bit-mm

[Diagram: TX, repeater, RX; measurement results with and without GPU noise]

Page 52: Dally's Keynote

The Locality Challenge: Data Movement Energy 77 pJ/F

[Plot: pJ/B, pJ/F (x5), and B/F vs. capacity from 1K to 1G]

Page 53: Dally's Keynote

Optimized Architecture & Circuits: 77 pJ/F -> 18 pJ/F

[Plot: pJ/B, pJ/F (x5), and B/F vs. capacity from 1K to 1G]

Page 54: Dally's Keynote

Moore’s Law is Only Part of the Story

1993: 3M Tx
1997: 7.5M Tx
2001: 42M Tx
2004: 275M Tx
2007: 580M Tx
2010: 3B Tx
2012: 7B Tx

Page 55: Dally's Keynote

2 Trends

The End of Dennard Scaling

Pervasive Parallelism

Page 56: Dally's Keynote

Processors aren't getting faster, just wider

Page 57: Dally's Keynote

Thread Count (CPU+GPU)

                 2013 (28 nm)     2020 (7 nm)
Cell Phone       4 + 1,000        16 + 4,000
Server Node      12 + 10,000      32 + 40,000
Supercomputer    10^5 + 10^8      10^6 + 10^9

Page 58: Dally's Keynote

Parallel programming is not inherently any more difficult than serial programming. However, we can make it a lot more difficult.

Page 59: Dally's Keynote

A simple parallel program

forall molecule in set {                      // launch a thread array
  forall neighbor in molecule.neighbors {     // nested
    forall force in forces {                  // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

Page 60: Dally's Keynote

Why is this easy?

forall molecule in set {                      // launch a thread array
  forall neighbor in molecule.neighbors {     // nested
    forall force in forces {                  // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

No machine details
All parallelism is expressed
Synchronization is semantic (in the reduction)
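
As one concrete illustration of the mapping that the tools, not the programmer, are supposed to perform, here is a hedged CUDA sketch (the types, names, and the one-thread-per-molecule choice are the editor's assumptions, not from the talk) of how this forall nest might land on a GPU:

// Hypothetical types standing in for the pseudocode's objects.
struct Molecule {
    int   n_neighbors;
    int   neighbor[8];   // indices of neighboring molecules
    float force;
};

// Hypothetical per-pair force term, selected by index f.
__device__ float force_term(const Molecule &a, const Molecule &b, int f) {
    return 0.0f;         // placeholder physics
}

__global__ void compute_forces(Molecule *set, int n_molecules, int n_forces) {
    // Outer forall -> a thread array: one thread per molecule.
    int m = blockIdx.x * blockDim.x + threadIdx.x;
    if (m >= n_molecules) return;

    // Inner foralls -> loops inside the thread; reduce_sum -> a local accumulation.
    float total = 0.0f;
    for (int j = 0; j < set[m].n_neighbors; ++j)
        for (int f = 0; f < n_forces; ++f)
            total += force_term(set[m], set[set[m].neighbor[j]], f);

    set[m].force = total;
}

// One possible launch: compute_forces<<<(n + 255) / 256, 256>>>(d_set, n, n_forces);

The source above never says any of this; whether the inner foralls become loops, warps, or separate kernels is exactly the mapping decision the next slides assign to the tools.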

Page 61: Dally's Keynote

We could make it hard

pid = fork();                  // explicitly managing threads

lock(struct.lock);             // complicated, error-prone synchronization
// manipulate struct
unlock(struct.lock);

code = send(pid, tag, &msg);   // partition across nodes

Page 62: Dally's Keynote

Programmers, tools, and architecture need to play their positions

Programmer:

forall molecule in set {                      // launch a thread array
  forall neighbor in molecule.neighbors {     // nested
    forall force in forces {                  // doubly nested
      molecule.force =
        reduce_sum(force(molecule, neighbor))
    }
  }
}

Tools:
  Map foralls in time and space
  Map molecules across memories
  Stage data up/down the hierarchy
  Select mechanisms

Architecture:
  Exposed storage hierarchy
  Fast comm/sync/thread mechanisms

Page 63: Dally's Keynote

[Flow: target-independent source -> mapping tools -> target-dependent executable; profiling & visualization feed back mapping directives]

Page 64: Dally's Keynote

Optimized Circuits: 77 pJ/F -> 18 pJ/F

[Plot: pJ/B, pJ/F (x5), and B/F vs. capacity from 1K to 1G]

Page 65: Dally's Keynote

Autotuned Software: 18 pJ/F -> 9 pJ/F

[Plot: pJ/B, pJ/F (x5), and B/F vs. capacity from 1K to 1G]

Page 66: Dally's Keynote

Conclusion

Power-limited: from data centers to cell phones
  Perf/W is Perf
Throughput cores
  Reduce overhead
Data movement
  Circuits: 200 -> 20 fJ/bit-mm
  Optimized software
Parallel programming is simple; we can make it hard
Target-independent programming; mapping via tools