Dally's Keynote

Challenges for Future Computing Systems

Bill Dally | Chief Scientist and SVP, Research, NVIDIA | Professor (Research), EE&CS, Stanford

Exascale Computing Will Enable Transformational Science

Climate

Comprehensive Earth System Model at 1KM scale, enabling modeling of cloud convection and ocean eddies.

Combustion

First-principles simulation of combustion for new high-efficiency, low-emission engines.

Biology

Coupled simulation of entire cells at molecular, genetic, chemical and biological levels.

Astrophysics

Predictive calculations for thermonuclear and core-collapse supernovae, allowing confirmation of theoretical models.

High-Performance Computers are Scientific Instruments

TITAN

18,688 NVIDIA Tesla K20X GPUs
27 Petaflops Peak: 90% of Performance from GPUs
17.59 Petaflops Sustained Performance on Linpack

Tsubame KFC 4.5GFLOPS/W #1 on Green500 List

US to Build Two Flagship Supercomputers

Major Step Forward on the Path to Exascale

150-300 PFLOPS Peak Performance

IBM POWER9 CPU + NVIDIA Volta GPU

NVLink High Speed Interconnect

40 TFLOPS per Node, >3,400 Nodes

2017

SUMMIT SIERRA

Scientific Discovery and Business Analytics

Driving an Insatiable Demand for More Computing Performance

2 Trends

The End of Dennard Scaling

Pervasive Parallelism

Moore’s Law doubles processor performance every few years, right?

Wrong!

Moore’s Law gives us transistors, which we used to turn into scalar performance

Moore, Electronics 38(8) April 19, 1965

Moore’s Law is Only Part of the Story

1993: 3M Tx
1997: 7.5M Tx
2001: 42M Tx
2004: 275M Tx
2007: 580M Tx
2010: 3B Tx
2012: 7B Tx

Classic Dennard Scaling: 2.8x chip capability in same power

[Figure: chip power vs. chip capability, both axes from 1 to 3]

2x more transistors

Scale chip features down 0.7x per process generation

1.4x faster transistors

0.7x voltage

0.7x capacitance

But L^3 energy scaling ended in 2005

Moore, ISSCC Keynote, 2003
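The 2.8x figure follows from multiplying the per-generation factors listed on this slide. A minimal sketch of that arithmetic, assuming power per transistor scales as capacitance x voltage^2 x frequency (the usual CMOS switching-power relation, which the slide does not spell out); the function name is illustrative:

```python
def dennard_generation(transistors=2.0, speed=1.4, voltage=0.7, capacitance=0.7):
    """Return (capability, power) ratios for one classic process generation."""
    # Capability: more transistors, each switching faster.
    capability = transistors * speed
    # Assumed model: switching power per transistor ~ C * V^2 * f.
    power_per_transistor = capacitance * voltage ** 2 * speed
    power = transistors * power_per_transistor
    return capability, power

capability, power = dennard_generation()
print(f"capability x{capability:.2f}, power x{power:.2f}")
```

Under these assumptions capability grows 2.8x while total power stays within a few percent of the previous generation, which is what "in same power" means here.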

Post-Dennard Scaling

2x chip capability at 1.4x power

1.4x chip capability at same power

[Figure: chip power vs. chip capability, both axes from 1 to 3]

2x more transistors

0.7x capacitance

Transistors are no faster

Static leakage limits reduction in Vth => Vdd stays constant

Reality isn’t even this good

1.8x chip capability at 1.5x power

1.2x chip capability at same power

[Figure: chip power vs. chip capability, both axes from 1 to 3]

1.8x more transistors

0.85x energy

Restrictive geometry rules reduce transistor count

Energy scales slower than linearly
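The post-Dennard and "reality" numbers can be reproduced with the same kind of arithmetic. This is a sketch under two assumptions not stated on the slides: power is transistor count times relative energy per operation at constant clock, and "capability at same power" is capability divided by the power ratio. The function name is hypothetical.

```python
def generation(transistors, energy_per_op, speed=1.0):
    """Return (capability, power, capability at same power) ratios.

    energy_per_op is the relative switching energy per transistor;
    speed is the relative clock rate (1.0 = transistors are no faster).
    """
    capability = transistors * speed
    power = transistors * energy_per_op * speed
    return capability, power, capability / power

# Post-Dennard: 2x transistors, 0.7x capacitance, Vdd constant.
post = generation(transistors=2.0, energy_per_op=0.7)
# "Reality": 1.8x transistors, 0.85x energy.
real = generation(transistors=1.8, energy_per_op=0.85)

for name, (cap, pwr, iso) in (("post-Dennard", post), ("reality", real)):
    print(f"{name}: {cap:.1f}x capability at {pwr:.2f}x power, "
          f"{iso:.2f}x capability at same power")
```

Rounded to one decimal place, the results match the slide values: 1.4x power and 1.4x iso-power capability for the post-Dennard case, 1.5x and 1.2x for the realistic one.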

“Moore’s Law gives us more transistors…

Dennard scaling made them useful.”

Bob Colwell, DAC 2013, June 4, 2013

Also, ILP was ‘mined out’ in 2000

[Figure: Perf (ps/Inst) and Linear (ps/Inst) on a log scale from 1e-4 to 1e+7, 1980 to 2020. Annotations: 19%/year; 30:1; 1,000:1; 30,000:1]

Dally et al. “The Last Classical Computer”, ISAT Study, 2001

Result: The End of Historic Scaling

C. Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011


- Processors aren’t getting faster, just wider
  - Future gains in performance are from parallelism
- Future systems are energy limited
  - Efficiency IS Performance
- Process matters less
  - One generation is 1.2x, not 2.8x

It’s not about the FLOPs

16nm chip, 10mm on a side, 200W

DFMA: 0.01mm^2, 10pJ/OP, 2GFLOPS

A chip with 10^4 FPUs: 100mm^2, 200W, 20TFLOPS

Pack 50,000 of these in racks: 1EFLOPS, 10MW
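The slide's back-of-the-envelope numbers multiply out as follows. This sketch assumes the 10 pJ figure is energy per floating-point operation and that the chip holds 10^4 FPUs; the constant names are illustrative.

```python
FPU_AREA_MM2 = 0.01          # area of one DFMA unit
FPU_FLOPS = 2e9              # 2 GFLOPS per DFMA unit
ENERGY_PER_FLOP_J = 10e-12   # 10 pJ, read here as energy per flop (assumption)
FPUS_PER_CHIP = 10**4
CHIPS = 50_000

chip_area_mm2 = FPUS_PER_CHIP * FPU_AREA_MM2
chip_flops = FPUS_PER_CHIP * FPU_FLOPS
chip_power_w = chip_flops * ENERGY_PER_FLOP_J
system_flops = CHIPS * chip_flops
system_power_w = CHIPS * chip_power_w

print(f"chip: {chip_area_mm2:.0f} mm^2, {chip_flops / 1e12:.0f} TFLOPS, {chip_power_w:.0f} W")
print(f"system: {system_flops / 1e18:.0f} EFLOPS, {system_power_w / 1e6:.0f} MW")
```

Arithmetic units alone would reach an exaflop in about 10 MW, which is why the talk turns next to overhead and locality rather than FLOPs.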

Overhead

Locality

CPU: 1690 pJ/flop (Westmere, 32 nm)
Optimized for Latency
Caches

GPU: 140 pJ/flop (Kepler, 28 nm)
Optimized for Throughput
Explicit Management of On-chip Memory

Fixed-Function Logic is Even More Efficient

Energy/Op
CPU: 1.7nJ
GPU: 140pJ
Fixed-Function: 10pJ

How is Power Spent in a CPU?

In-order Embedded
Clock + Control Logic 24%
Data Supply 17%
Instruction Supply 42%
Register File 11%
ALU 6%

OOO Hi-perf
Clock + Pins 45%
ALU 4%
Fetch 11%
Rename 10%
Issue 11%
RF 14%
Data Supply 5%

Dally [2008] (Embedded in-order CPU) Natarajan [2003] (Alpha 21264)

Overhead: 980pJ

Payload (Arithmetic): 20pJ

[Figure: throughput core block diagram. Control path: L0 I$, thread PCs, active PCs, Inst, scheduler. Data path: ORFs, FP/Int, FP/Int, and LS/BR units, RF, L0Addr, L1Addr, networks to LD/ST, LM banks 0 to 3]

64 threads
4 active threads
2 DFMAs (4 FLOPS/clock)
ORF bank: 16 entries (128 Bytes)
L0 I$: 64 instructions (1KByte)
LM Bank: 8KB (32KB total)

Simpler Cores = Energy Efficiency

Source: Azizi [PhD 2010]

Overhead: 20pJ

Payload (Arithmetic): 20pJ
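The two overhead/payload pairs in the talk (980 pJ vs. 20 pJ earlier, 20 pJ vs. 20 pJ here) can be compared directly. A small sketch; the function name is illustrative:

```python
def payload_fraction(overhead_pj, payload_pj):
    """Fraction of energy per operation spent on the arithmetic itself."""
    return payload_pj / (overhead_pj + payload_pj)

complex_core = payload_fraction(overhead_pj=980, payload_pj=20)
simple_core = payload_fraction(overhead_pj=20, payload_pj=20)
total_ratio = (980 + 20) / (20 + 20)

print(f"complex core: {complex_core:.0%} of energy is payload")
print(f"simple core:  {simple_core:.0%} of energy is payload")
print(f"energy per operation drops {total_ratio:.0f}x")
```

The arithmetic cost is the same 20 pJ in both cases; the whole difference comes from the overhead term.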

SIMT Lanes

Streaming Multiprocessor (SM)

Warp Scheduler

Shared Memory 32 KB

Main Register File: 32 banks, 15% of SM Energy

ALU SFU MEM TEX

Hierarchical Register File

[Figure: percent of all values produced (0% to 100%), broken down by number of reads: 0, 1, 2, or more than 2 times]

[Figure: percent of all values produced (0% to 100%), broken down by lifetime: 1, 2, 3, or more than 3]

Register File Caching (RFC)

MRF: 4x128-bit Banks (1R1W)

Operand Buffering

Operand Routing

RFC: 4x32-bit (3R1W) Banks

ALU, SFU, MEM, TEX

Energy Savings from RF Hierarchy: 54% Energy Reduction

Source: Gebhart et al. (MICRO 2011)
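The idea behind register file caching is that most values are read soon after they are produced, so a small structure in front of the main register file (MRF) can serve most reads. The following is a toy model only, not the design evaluated by Gebhart et al.: it assumes a per-thread FIFO cache of recent results, and the instruction trace is hypothetical.

```python
from collections import OrderedDict

def count_reads(trace, rfc_entries):
    """Count operand reads served by the RFC vs. the main register file.

    trace is a list of (dest, sources) register names for one thread.
    Every result is written into a small FIFO cache; a source read hits
    the RFC only if its value is still among the last rfc_entries writes.
    """
    rfc = OrderedDict()
    rfc_reads = mrf_reads = 0
    for dest, sources in trace:
        for src in sources:
            if src in rfc:
                rfc_reads += 1
            else:
                mrf_reads += 1
        rfc.pop(dest, None)          # rewriting a register refreshes its slot
        rfc[dest] = True
        if len(rfc) > rfc_entries:
            rfc.popitem(last=False)  # evict the oldest value
    return rfc_reads, mrf_reads

# Hypothetical instruction stream in which results are consumed quickly.
trace = [
    ("r1", ("r10", "r11")),
    ("r2", ("r1", "r12")),
    ("r3", ("r2", "r1")),
    ("r4", ("r3", "r13")),
    ("r5", ("r4", "r10")),
]
rfc_reads, mrf_reads = count_reads(trace, rfc_entries=4)
print(f"RFC reads: {rfc_reads}, MRF reads: {mrf_reads}")
```

In this made-up trace half of the operand reads never touch the large banked register file, even with only four cached entries.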

Temporal SIMT

Spatial SIMT (current GPUs)

32-wide datapath: 1 warp instruction = 32 threads (thread 0 to thread 31), issued across the lanes in 1 cycle

Pure Temporal SIMT

1-wide datapath: threads 0, 1, 2, ... execute the same instruction in successive cycles

[Figure: instruction timelines (ld, ml, ad, st) over time for both organizations]

Temporal SIMT Optimizations

Control divergence: hybrid MIMD/SIMT

Scalarization

Factor common instructions from multiple threads

Execute once and place results in common registers

32-wide (41%)

4-wide (65%)

1-wide (100%)

Scalar Instructions in SIMT Lanes

Scalar instruction spanning warp

Scalar register visible to all threads

Temporal execution of Warp

Multiple lanes/warps

Y. Lee, CGO 2013
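Scalarization executes an instruction once when every thread in the warp would compute the same thing. The sketch below illustrates the counting only: it detects uniform instructions by comparing operand values at run time, whereas a real implementation would typically rely on compiler analysis. The instruction list and function name are hypothetical.

```python
def count_ops(instructions, warp_size):
    """Return (ops without scalarization, ops with scalarization).

    instructions holds one list per instruction, containing the operand
    tuple seen by each thread. If all threads see identical operands the
    instruction is uniform and is executed once for the whole warp.
    """
    baseline = scalarized = 0
    for per_thread_operands in instructions:
        assert len(per_thread_operands) == warp_size
        baseline += warp_size
        uniform = len(set(per_thread_operands)) == 1
        scalarized += 1 if uniform else warp_size
    return baseline, scalarized

WARP = 4
instructions = [
    [(100, 8)] * WARP,                     # base + stride: same in every thread
    [(108, tid) for tid in range(WARP)],   # base + thread id: differs per thread
    [(3, 3)] * WARP,                       # loop-bound compare: same in every thread
]
baseline, scalarized = count_ops(instructions, WARP)
print(f"{baseline} ops without scalarization, {scalarized} with")
```

Two of the three instructions are uniform, so the warp executes 6 operations instead of 12.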

Scalarization Eliminates Work

[Figure: (b) Ops executed, normalized (axis 0.0 to 1.2), for mm, nn, lavaMD, gaussian, bfs, backprop, streamcluster, lud, srad-v1, fft, srad-v2, mri-q, cutcp, nw, hotspot, lbm, cfd, stencil, leukocyte, pathfinder, heartwall, spmv, kmeans, and the average. Each group shows results for warp sizes of 4, 8, 16, 32.]

Microarchitecture Action                   Savings (WS=32)
Instructions issued                        41%
Operations executed                        29%
Registers read                             31%
Registers written                          30%
Memory addresses and cache tag checks      47%
Memory data elements accessed              38%

Y. Lee, CGO 2013

[Figure: energy costs on a 20mm chip]

64-bit DP: 20pJ
256-bit access, 8 kB SRAM: 50 pJ
256-bit buses: 26 pJ, 256 pJ, 1 nJ
Efficient off-chip link: 500 pJ
DRAM Rd/Wr: 16 nJ

Communication Dominates Arithmetic

Processor Technology               40 nm           10 nm
Vdd (nominal)                      0.9 V           0.7 V
DFMA energy                        50 pJ           7.6 pJ
64b 8 KB SRAM Rd                   14 pJ           2.1 pJ
Wire energy (256 bits, 10mm)       310 pJ          174 pJ

Memory Technology                  45 nm           16 nm
DRAM interface pin bandwidth       4 Gbps          50 Gbps
DRAM interface energy              20-30 pJ/bit    2 pJ/bit
DRAM access energy                 8-15 pJ/bit     2.5 pJ/bit

Keckler [Micro 2011], Vogelsang [Micro 2010]
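Dividing the table's wire energy by its DFMA energy shows how the balance shifts between the two process nodes. A short sketch using only the values listed above; the variable names are illustrative:

```python
# Values copied from the table: (40 nm, 10 nm), in picojoules.
DFMA_PJ = (50.0, 7.6)
SRAM_READ_PJ = (14.0, 2.1)
WIRE_256B_10MM_PJ = (310.0, 174.0)

for node, dfma, sram, wire in zip(("40 nm", "10 nm"), DFMA_PJ,
                                  SRAM_READ_PJ, WIRE_256B_10MM_PJ):
    print(f"{node}: wire/DFMA = {wire / dfma:.1f}, SRAM/DFMA = {sram / dfma:.2f}")

wire_ratio_40, wire_ratio_10 = (w / d for w, d in zip(WIRE_256B_10MM_PJ, DFMA_PJ))
```

Moving 256 bits across 10mm costs about 6 DFMAs at 40 nm and about 23 at 10 nm: arithmetic and SRAM scale together, while wires improve far less.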

Energy Shopping List

FP Op lower bound = 4 pJ

[Figure: energy per bit per mm (fJ, axis 0 to 300) vs. maximum frequency (GHz, axis 0.6 to 1.8) for FSI, LSI (200 mV), LSI (400 mV), CDI, and SCI]

DRAM Wordline Segmentation

- Local Wordline Segmentation (LWS)
  - Activate a portion of each mat
  - Activate the same segment in all mats
- Master Wordline Segmentation (MWS)
  - Activate subset of mats
  - New data layout such that a DRAM atom resides in the activated mats
  - Multiple narrow subchannels in each channel

[Figure: LWS and MWS]

Activation Energy Reduction

Reductions in DRAM Row Energy
- LWS: 26%
- MWS: 49%
- LWS+MWS: up to 73%, average 68%

[Figure: Row Energy Consumption (per bit transferred to/from DRAM), in pJ/bit (axis 0 to 12), for Baseline, LWS_4, MWS_4, and MWS_4 + LWS_4 across bitonic-sort, debayer, fft-1d_k0, fft-2d_k0 to fft-2d_k3, gpusvm_k0 to gpusvm_k3, img_reg_k0 to img_reg_k3, PERFECT (avg), and Rodinia (avg)]

GRS Test Chips

Probe Station

Test Chip #1 on Board

Test Chip #2 fabricated on production GPU

Eye Diagram from Probe

Poulton et al. ISSCC 2013, JSSC Dec 2013

Circuits: Ground Referenced Signaling

On-chip Signaling
- Testsite on GM2xx GPU
- <45fJ/bit-mm

[Figure: TX, Repeater, RX]

Goal: reduce Energy/bit 200fJ/bit-mm -> 20fJ/bit-mm
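To put the signaling goal in the same units as the arithmetic energies quoted earlier, the per-bit-mm figure can be scaled to a whole transfer. A minimal sketch; the 256-bit, 10mm transfer is chosen for illustration and the function name is hypothetical:

```python
def wire_energy_pj(bits, distance_mm, fj_per_bit_mm):
    """Energy in picojoules to move `bits` over `distance_mm` of on-chip wire."""
    return bits * distance_mm * fj_per_bit_mm / 1000.0

today = wire_energy_pj(bits=256, distance_mm=10, fj_per_bit_mm=200)
goal = wire_energy_pj(bits=256, distance_mm=10, fj_per_bit_mm=20)
print(f"256 bits over 10mm: {today:.1f} pJ at 200fJ/bit-mm, {goal:.1f} pJ at 20fJ/bit-mm")
```

At 200fJ/bit-mm the transfer costs 512 pJ, far more than the 20 pJ quoted for a 64-bit DP operation; at the 20fJ/bit-mm goal it drops to 51.2 pJ.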

Measurement Results

[Figure: measured results with GPU noise and without GPU noise]

The Locality Challenge: Data Movement Energy 77pJ/F

[Figure: B/F, pJ/B, and pJ/F x5, with the horizontal axis from 1K to 1G]

Optimized Architecture & Circuits 77pJ/F -> 18pJ/F

[Figure: B/F, pJ/B, and pJ/F x5, with the horizontal axis from 1K to 1G]

2 Trends


Pervasive Parallelism

Processors aren’t getting faster, just wider

Thread Count (CPU+GPU)

                 2013 (28nm)    2020 (7nm)
Cell Phone       4+1,000        16+4,000
Server Node      12+10,000      32+40,000
Supercomputer    10^5+10^8      10^6+10^9

Parallel programming is not inherently any more difficult than serial programming. However, we can make it a lot more difficult.

A simple parallel program

    forall molecule in set {                    // launch a thread array
      forall neighbor in molecule.neighbors {   // nested
        forall force in forces {                // doubly nested
          molecule.force =
            reduce_sum(force(molecule, neighbor))
        }
      }
    }

Why is this easy?

    forall molecule in set {                    // launch a thread array
      forall neighbor in molecule.neighbors {   // nested
        forall force in forces {                // doubly nested
          molecule.force =
            reduce_sum(force(molecule, neighbor))
        }
      }
    }

No machine details

All parallelism is expressed

Synchronization is semantic (in reduction)
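The same structure can be written in an ordinary language without naming threads, locks, or nodes. The sketch below is one possible Python rendering, not the notation the talk proposes: the two force functions and the one-dimensional positions are hypothetical stand-ins, and `ThreadPoolExecutor` is used only to show that the outer loop is free to run in parallel (in CPython it will not speed up this arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the slide's `forces`.
def spring(a, b):
    return b - a

def repel(a, b):
    return -1.0 if b > a else 1.0

FORCES = (spring, repel)

def net_force(position, neighbors):
    """The two inner foralls collapsed into a single sum reduction."""
    return sum(force(position, n) for n in neighbors for force in FORCES)

def all_forces(positions, neighbor_lists):
    # The outer forall: molecules are independent, so map() may process
    # them in any order or concurrently, and no locks are needed.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(net_force, positions, neighbor_lists))

positions = [0.0, 1.0, 3.0]
neighbor_lists = [[1.0], [0.0, 3.0], [1.0]]
print(all_forces(positions, neighbor_lists))
```

The only synchronization is the reduction itself, which is the sense in which the slide calls it semantic.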

We could make it hard

    pid = fork() ;                  // explicitly managing threads

    lock(struct.lock) ;             // complicated, error-prone synchronization
    // manipulate struct
    unlock(struct.lock) ;

    code = send(pid, tag, &msg) ;   // partition across nodes

Programmers, tools, and architecture need to play their positions

Programmer

    forall molecule in set {                    // launch a thread array
      forall neighbor in molecule.neighbors {   // nested
        forall force in forces {                // doubly nested
          molecule.force =
            reduce_sum(force(molecule, neighbor))
        }
      }
    }

Tools
- Map foralls in time and space
- Map molecules across memories
- Stage data up/down hierarchy
- Select mechanisms

Architecture
- Exposed storage hierarchy
- Fast comm/sync/thread mechanisms

Target-Independent Source

Mapping Tools

Target-Dependent Executable

Profiling & Visualization

Mapping Directives

Optimized Circuits 77pJ/F -> 18pJ/F

[Figure: B/F, pJ/B, and pJ/F x5, with the horizontal axis from 1K to 1G]

Autotuned Software 18pJ/F -> 9pJ/F

[Figure: B/F, pJ/B, and pJ/F x5, with the horizontal axis from 1K to 1G]
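Autotuning, in its simplest form, means searching over mapping parameters and keeping the one with the lowest measured or modelled cost. The sketch below is purely illustrative: the cost model, the capacities, and the pJ values are invented for the example and do not come from the talk.

```python
def energy_per_flop(tile, levels):
    """Hypothetical cost model for a blocked kernel.

    levels lists (capacity_words, pj_per_word) from the smallest memory to
    the largest. A tile x tile block is held in the smallest level that
    fits it, and each word is reused `tile` times, so the data movement
    energy per flop falls as 1/tile until the block spills to the next level.
    """
    for capacity, pj_per_word in levels:
        if tile * tile <= capacity:
            return pj_per_word / tile
    raise ValueError("tile does not fit in any level")

def autotune(candidates, levels):
    """Exhaustive search: return the candidate with the lowest modelled cost."""
    return min(candidates, key=lambda tile: energy_per_flop(tile, levels))

# Made-up memory hierarchy: capacities in words, costs in pJ per word.
LEVELS = [(1024, 1.0), (65536, 10.0), (2**30, 160.0)]
best = autotune([8, 16, 32, 64, 128, 256], LEVELS)
print(f"best tile: {best}, modelled cost: {energy_per_flop(best, LEVELS):.4f} pJ/flop")
```

With these invented numbers the search settles on the largest tile that still fits the smallest memory, which is the kind of target-dependent choice the talk assigns to tools rather than to the programmer.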

Conclusion

- Power-limited: from data centers to cell phones
  - Perf/W is Perf
- Throughput cores
  - Reduce overhead
- Data movement
  - Circuits: 200 -> 20
  - Optimized software
- Parallel programming is simple; we can make it hard
  - Target-independent programming, with mapping via tools
