
Exascale Computing—

a fact or a fiction?

Shekhar Borkar

Intel Corp.

May 21, 2013

This research was, in part, funded by the U.S. Government, DOE and DARPA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

IPDPS 2013


Outline

Compute roadmap & technology outlook

Challenges & solutions for:

– Compute,

– Memory,

– Interconnect,

– Resiliency, and

– Software stack

Summary


Compute Performance Roadmap

[Chart: peak compute performance (GFLOPS, log scale) vs. year, 1960-2020, marking the Mega, Giga, Tera, Peta, and Exa milestones, reached roughly 12, 11, and 10 years apart, with client and hand-held trends below.]

From Giga to Exa, via Tera & Peta

[Three charts, 1986-2016 (log scale), tracking the Giga, Tera, Peta, and Exa generations: relative processor frequency, processor performance, concurrency, and power, with annotations of 30X, 250X, 36X, 80X, 4,000X, 1M X (power), and 2.5M X (concurrency).]

System performance increases faster

Parallelism continues to increase

Power & energy challenge continues


Where is the Energy Consumed?

A teraflop system today dissipates roughly 3 kW:

– Compute: ~100 W (100 pJ per FLOP)
– Memory: ~150 W (0.1 Byte/FLOP at 1.5 nJ per Byte)
– Communication: ~100 W (100 pJ of communication per FLOP)
– Disk: ~100 W (10 TB at 1 TB/disk, 10 W per disk)
– ~2.5 kW more goes to decode and control, address translations, power-supply losses, and bloated, inefficient architectural features

Goal: ~20 W for the same teraflop, with per-component budgets on the order of a few watts each (roughly 5 W, 2 W, ~5 W, ~3 W, and 5 W).
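As a quick sanity check on these component numbers, a minimal sketch in Python, assuming a sustained 1 TFLOP/s and the per-operation figures quoted above:

```python
# Rough power breakdown of a ~3 kW teraflop system from the quoted energies.
FLOPS = 1e12                          # sustained FLOP/s

compute_w  = FLOPS * 100e-12          # 100 pJ per FLOP          -> 100 W
memory_w   = FLOPS * 0.1 * 1.5e-9     # 0.1 B/FLOP @ 1.5 nJ/Byte -> 150 W
comm_w     = FLOPS * 100e-12          # 100 pJ of com per FLOP   -> 100 W
disk_w     = 10 * 10                  # 10 disks @ 10 W each     -> 100 W
overhead_w = 2500                     # decode/control, translations, supply losses

print(compute_w + memory_w + comm_w + disk_w + overhead_w)   # ~2950 W, i.e. ~3 kW
```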

The UHPC* Challenge

At 20 pJ per operation: 20 MW for Exa, 20 kW for Peta, 20 W for Tera, 2 W for 100 Giga, 20 mW for Giga, and 20 µW for Mega.

*DARPA, Ubiquitous HPC Program
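The scaling is simply power = (operations per second) × (energy per operation); a minimal check:

```python
ENERGY_PER_OP = 20e-12   # 20 pJ per operation

for name, ops_per_s in [("Mega", 1e6), ("Giga", 1e9), ("100 Giga", 1e11),
                        ("Tera", 1e12), ("Peta", 1e15), ("Exa", 1e18)]:
    print(f"{name:>8}: {ops_per_s * ENERGY_PER_OP:.0e} W")
# Mega -> 2e-05 W (20 uW), Giga -> 2e-02 W (20 mW), Tera -> 2e+01 W,
# Peta -> 2e+04 W (20 kW), Exa -> 2e+07 W (20 MW)
```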

Technology Scaling Outlook

[Charts, 45nm through 5nm: relative transistor density keeps scaling at 1.75-2X per generation; relative frequency is almost flat; relative supply voltage is almost flat; relative energy shows some scaling but falls short of the ideal curve.]

Energy per Compute Operation

[Chart, 45nm through 7nm: energy per operation for a DP FP op, DP RF operand access, DRAM access (pJ/bit), and communication (pJ/bit), with annotations at 10, 25, 75, and 100 pJ/bit. Source: Intel]

Voltage Scaling

[Chart: normalized frequency, total power, leakage, and energy efficiency vs. Vdd, normalized (0.3 to 0.9): frequency and total power drop steeply as Vdd is reduced, while energy efficiency improves.]

When designed to voltage scale
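A toy model (not from the talk; a minimal sketch assuming dynamic energy per operation proportional to Vdd squared, frequency roughly proportional to Vdd minus the threshold voltage, and a fixed leakage current) reproduces the qualitative shape of these curves: frequency and total power fall rapidly with Vdd, while energy efficiency keeps improving until leakage starts to dominate near threshold.

```python
# Toy voltage-scaling model (illustrative constants, not measured data):
# dynamic energy/op ~ C*Vdd^2, frequency ~ K*(Vdd - Vt), leakage power ~ Vdd*I_leak.
VT, C, K, I_LEAK = 0.3, 1e-9, 2e9, 5e-3   # V, F, Hz/V, A (all assumed)

def metrics(vdd):
    freq = K * (vdd - VT)                  # operations per second
    p_total = C * vdd**2 * freq + vdd * I_LEAK
    return freq, p_total, freq / p_total   # ops/s, W, ops per joule

for vdd in (1.1, 0.9, 0.7, 0.5, 0.4, 0.35):
    f, p, eff = metrics(vdd)
    print(f"Vdd={vdd:.2f} V  freq={f/1e9:4.2f} GHz  "
          f"power={p*1e3:7.1f} mW  eff={eff/1e9:4.1f} Gops/J")
```

With these assumed constants the efficiency rises several-fold as Vdd approaches threshold, which is the effect the NTV measurements on the following slides quantify.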

Near Threshold-Voltage (NTV)

[Charts (65nm CMOS, 50°C): maximum frequency (MHz, 1 to 10^4) and total power (mW, 10^-2 to 10^2) vs. supply voltage from 0.2 V to 1.4 V, with the subthreshold region below ~320 mV; and energy efficiency (GOPS/Watt) and active leakage power (mW) vs. supply voltage, with energy efficiency peaking ~9.6X higher near threshold. Annotations: 4 orders, <3 orders.]

H. Kaul et al, 16.6, ISSCC 2008

Subthreshold Leakage at NTV

[Chart, 45nm through 5nm: source-drain leakage power as a fraction of total power at 100%, 75%, 50%, and 40% of nominal Vdd, with increasing variations at reduced Vdd.]

NTV operation reduces total power and improves energy efficiency, but subthreshold leakage power becomes a substantial portion of the total.

Mitigating Impact of Variation

1. Variation control with body biasing
– Body effect is substantially reduced in advanced technologies
– Energy cost of body biasing could become substantial
– Fully-depleted transistors have no body left

2. Variation tolerance at the system level
Example: a many-core system. Running all cores at full frequency exceeds the energy budget, so run each core at its native frequency (f, f/2, f/4, ...) and rely on the law of large numbers—averaging across many cores.

Experimental NTV Processor

[Die photo: IA-32 core, 1.1 mm × 1.8 mm, with logic, scan, ROM, L1 instruction and data caches, and level shifters plus clock spine; mounted on a custom interposer in a 951-pin FCBGA package on a legacy Socket-7 motherboard.]

Technology: 32nm High-K Metal Gate
Interconnect: 1 Poly, 9 Metal (Cu)
Transistors: 6 Million (core)
Core area: 2 mm²

S. Jain, et al, “A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS”, ISSCC 2012

Wide Dynamic Range

[Chart: energy efficiency vs. voltage, from the subthreshold region through the NTV sweet spot to the normal operating range; ~5x improvement demonstrated at NTV.]

                 Subthreshold        NTV                  Normal range
                 (ultra-low power)   (energy efficient)   (high performance)
Supply voltage   280 mV              0.45 V               1.2 V
Frequency        3 MHz               60 MHz               915 MHz
Power            2 mW                10 mW                737 mW
Efficiency       1500 Mips/W         5830 Mips/W          1240 Mips/W

Observations

[Chart: power breakdown (memory leakage, memory dynamic, logic leakage, logic dynamic) at sub-Vt, NTV, and full-Vdd operating points.]

At NTV, leakage power dominates; fine-grain leakage power management is required.

Integration of Power Delivery

For efficiency and management.

[Photos: integrated voltage regulator test chip (converter chip and load chip with air-core inductors, input and output capacitors) in standard OLGA packaging. Chart: converter efficiency (70-90%) vs. load current (0-20 A) for 2.4 V-to-1.5 V and 2.4 V-to-1.2 V conversion at 60-100 MHz with 0.8 nH and 1.9 nH inductors.]

Schrom et al, “A 100MHz 8-Phase Buck Converter Delivering 12A in 25mm2 Using Air-Core Inductors”, APEC 2007

Power delivery moves closer to the load for: 1. improved efficiency, 2. fine-grain power management.

Fine-grain Power Management

Dynamic, chip level:
– STANDBY: memory retains data, 50% less power per tile
– FULL SLEEP: memories fully off, 80% less power per tile

Dynamic, within a core (21 sleep regions per tile):
– FP engine 1 sleeping: 90% less power
– FP engine 2 sleeping: 90% less power
– Router sleeping: 10% less power (stays on to pass traffic)
– Data memory sleeping: 57% less power
– Instruction memory sleeping: 56% less power

Mode      State                    Power saving   Wake up
Normal    All active               -              -
Standby   Logic off, memory on     50%            Fast
Sleep     Logic and memory off     80%            Slow

Energy efficiency increases by 60%.

Vangal et al, “An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS”, JSSC, January 2008
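A minimal sketch of how such a mode table translates into chip-level power; the tile count and per-tile power below are assumptions for illustration, not figures from the paper:

```python
# Hypothetical illustration: chip power when tiles are put into the quoted modes.
MODE_SAVING = {"normal": 0.0, "standby": 0.50, "sleep": 0.80}   # fraction of tile power saved

def chip_power(tile_power_w, tiles_per_mode):
    """tiles_per_mode: dict mapping mode -> number of tiles in that mode."""
    return sum(count * tile_power_w * (1.0 - MODE_SAVING[mode])
               for mode, count in tiles_per_mode.items())

# Assumed example: 80 tiles at ~1.2 W each (roughly a sub-100 W chip when all active).
print(chip_power(1.2, {"normal": 80}))                              # 96.0 W
print(chip_power(1.2, {"normal": 20, "standby": 40, "sleep": 20}))  # 52.8 W
```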

Memory & Storage Technologies

[Chart: energy/bit (pJ), capacity (Gbit), and cost/bit (pico-$) for SRAM, DRAM, NAND/PCM (endurance issue), and disk, spanning several orders of magnitude. Source: Intel]

Compare Memory Technologies

DRAM for first-level capacity memory; NAND/PCM for next-level storage. (Source: Intel)

Revise DRAM Architecture

Traditional DRAM (RAS/CAS):
– Activates many pages
– Lots of reads and writes (refresh)
– Only a small amount of the read data is used
– Requires a small number of pins

New DRAM architecture:
– Activates few pages
– Reads and writes (refreshes) only what is needed
– All read data is used
– Requires a large number of IOs (3D)

[Chart, 90nm through 7nm: DRAM bandwidth must increase exponentially (GB/sec) while energy must decrease exponentially (pJ/bit).]

3D-Integration of DRAM and Logic

[Diagram: DRAM stack on a logic buffer chip in a single package.]

Logic buffer chip, technology optimized for:
– High-speed signaling
– Energy-efficient logic circuits
– Implementing intelligence

DRAM stack, technology optimized for:
– Memory density
– Lower cost

3D Integration provides best of both worlds

1Tb/s HMC DRAM Prototype

• 3D integration technology
• 1Gb DRAM array
• 512 MB total DRAM per cube
• 128 GB/s bandwidth
• <10 pJ/bit energy

                      Bandwidth      Energy efficiency
DDR-3 (today)         10.66 GB/sec   50-75 pJ/bit
Hybrid Memory Cube    128 GB/sec     8 pJ/bit

10X higher bandwidth, 10X lower energy. Source: Micron
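To see why the DRAM energy target matters at scale, a minimal sketch assuming an exaflop machine that touches 0.1 Byte of DRAM per FLOP (the ratio used in the earlier teraflop breakdown):

```python
FLOPS = 1e18            # exaflop/s
BYTES_PER_FLOP = 0.1    # DRAM traffic assumption carried over from the teraflop example
BITS_PER_S = FLOPS * BYTES_PER_FLOP * 8

for label, pj_per_bit in [("DDR-3 today", 50), ("HMC demonstrated", 8), ("~2 pJ/b target", 2)]:
    power_mw = BITS_PER_S * pj_per_bit * 1e-12 / 1e6
    print(f"{label:>16}: {power_mw:5.1f} MW for DRAM access alone")
# DDR-3 today: 40 MW, HMC demonstrated: 6.4 MW, ~2 pJ/b target: 1.6 MW
```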

Communication Energy

[Chart: energy per bit (pJ/bit) vs. interconnect distance (cm), from on-die through chip-to-chip and board-to-board to between cabinets.]

On-die interconnect: interconnect energy (per mm) reduces more slowly than compute energy, so on-die data movement energy will start to dominate.

[Chart, 90nm through 7nm: relative compute energy vs. on-die interconnect energy. Source: Intel]

Network On Chip (NoC)

80-core TFLOPS chip (2006):
– Tile power breakdown: dual FPMACs 36%, router + links 28%, IMEM + DMEM 21%, clock distribution 11%, 10-port RF 4%
– 8 × 10 mesh, 32-bit links
– 320 GB/sec bisection BW @ 5 GHz
– [Die photo: 21.72 mm × 12.64 mm, single tile 1.5 mm × 2.0 mm, I/O areas, PLL, TAP]

48-core Single-chip Cloud computer (2009):
– Chip power breakdown: cores 70%, MC & DDR3-800 19%, routers & 2D mesh 10%, global clocking 1%
– 2-core clusters in a 6 × 4 mesh (why not 6 × 8?)
– 128-bit links
– 256 GB/sec bisection BW @ 2 GHz
– [Die photo: 21.4 mm × 26.5 mm, four DDR3 MCs, system interface + I/O, VRC, PLL, JTAG]

On-chip Interconnects

[Chart: delay (ps) and energy (pJ/bit) vs. wire length (mm) at 0.3µ pitch, 0.5 V: wire delay, repeated-wire delay, router delay, wire energy, and router energy.]

Circuit Switched NoC

Anders et al, “A 4.1Tb/s bisection-bandwidth 560Gb/s/W streaming circuit-switched 8×8 mesh network-on-chip in 45nm CMOS”, ISSCC 2010

[Die photo: 8×8 NoC with clock and I/O; ports to core, north, south, east, and west; traffic generation and measurement.]

Process: 45nm Hi-K/MG CMOS, 9 Metal Cu
Nominal supply: 1.1V
Arbitration and router logic: supports 512b data
Transistors: 2.85M
Die area: 6.25 mm²

A narrow, high-frequency packet-switched network establishes a circuit; data is then transferred over the wide, slower circuit-switched bus. A differential, low-swing bus improves energy efficiency.

2 to 3X increased energy efficiency over a packet-switched network.

Hierarchical & Heterogeneous

Use a bus to connect cores over short distances, then a hierarchy of busses (clusters of cores on first-level busses joined by a second-level bus), or hierarchical circuit- and packet-switched networks with routers between clusters.

Electrical Interconnect < 1 Meter

[Chart, 1.2µ through 32nm: per-link energy (pJ/bit) and data rate (Gb/sec) over technology generations. Source: ISSCC papers]

Bandwidth and energy efficiency improve, but not enough.

Electrical Interconnect Advances

Employ new, low-loss, non-traditional interconnects (low-loss flex connectors, low-loss twinax, and top-of-the-package connectors) and co-optimize interconnects and circuits for energy efficiency.

O’Mahony et al, “A 47x10Gb/s 1.4mW/(Gb/s) Parallel Interface in 45nm CMOS”, ISSCC 2010; and J. Jaussi, RESS004, IDF 2010

[Chart: energy (pJ/bit) vs. channel length (cm) for HDI, flex, and twinax channels compared with the state of the art; eye diagrams.]

Optical Interconnect > 1 Meter

[Charts: energy (pJ/bit) vs. laser efficiency (1%, 10%, 20%) at 100% link utilization, and energy (pJ/bit) vs. link utilization (100%, 50%, 10%) for laser efficiencies of 1%, 10%, and 20%.]

Energy in supporting electronics is very low

Link energy dominated by laser (efficiency)

Sustained, high link utilization required

Source: PETE Study group

Straw-man System Interconnect

[Diagram: ~0.8 PF nodes grouped into clusters of 35 with 1,000 fibers each; 35 such clusters connected with 8,000 fibers each.]

Assume: 40 Gbps links, 10 pJ/b, $0.6/Gbps, 8 B/FLOP, naïve tapering.

Result: $35M of interconnect and 217 MW of data-movement power.

Bandwidth Tapering

33

24

6.65

1.13

0.19

0.03

0.0045

0.0005

0.00005

1.E-05

1.E-04

1.E-03

1.E-02

1.E-01

1.E+00

1.E+01

1.E+02

Core L1 L2 Chip Board Cab L1SysL2Sys

Byte

/FL

OP

8 Byte/Flop total

Naïve, 4X

Severe

0

0.5

1

1.5

L1 L2 Chip Board Cab Sys

Data

M

ovem

en

t P

ow

er

(MW

)

Total DM Power = 3 MW

0.1

1

10

100

1000

L1 L2 Chip Board Cab Sys

Data

M

ovem

en

t P

ow

er

(MW

)

Total DM Power = 217 MW

Intelligent BW tapering is necessary
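The underlying arithmetic: per-level data-movement power is FLOPS × (Bytes/FLOP at that level) × 8 bits/Byte × (energy per bit). The sketch below reuses Bytes/FLOP points from the chart, but the per-level pJ/bit values are placeholder assumptions (the talk's own energy inputs are what produce the 217 MW and 3 MW totals), so read it for the shape of the calculation rather than the totals:

```python
# Data-movement power for an exaflop machine, level by level.
FLOPS = 1e18

levels         = ["L1", "L2", "Chip", "Board", "Cab", "Sys"]
bytes_per_flop = [6.65, 1.13, 0.19, 0.03, 0.0045, 0.0005]   # taper (chart data points)
pj_per_bit     = [0.1, 0.3, 2.0, 5.0, 10.0, 10.0]           # assumed per-level energies

total_mw = 0.0
for name, bpf, pj in zip(levels, bytes_per_flop, pj_per_bit):
    power_mw = FLOPS * bpf * 8 * pj * 1e-12 / 1e6
    total_mw += power_mw
    print(f"{name:>5}: {power_mw:6.2f} MW")
print(f"Total: {total_mw:6.2f} MW")   # ~12.7 MW with these assumed energies
```

Cutting the Bytes/FLOP at the outer, expensive levels (severe tapering) is what collapses the total, which is why intelligent tapering and locality awareness matter.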

Road to Unreliability?

From Peta to Exa, reliability issues:

1,000X parallelism:
– More hardware for something to go wrong
– >1,000X intermittent faults due to soft errors

Aggressive Vcc scaling to reduce power/energy:
– Gradual faults due to increased variations
– More susceptible to Vcc droops (noise)
– More susceptible to dynamic temperature variations
– Exacerbates intermittent faults—soft errors

Deeply scaled technologies:
– Aging-related faults
– Lack of burn-in?
– Variability increases dramatically

Resiliency will be the cornerstone.

Soft Errors and Reliability

[Chart: normalized soft-error rate per cell (sea level) vs. voltage for 250nm, 180nm, 130nm, 90nm, and 65nm. The soft-error rate per bit decreases each generation, and NTV has only a nominal impact on it.]

[Chart: memory and latch soft-error rates relative to 130nm, from 180nm through 32nm, assuming a 2X increase in bit/latch count per generation. Soft errors at the system level will continue to increase.]

Positive impact of NTV on reliability:
– Lower voltage means lower electric fields, and lower power means lower temperature
– Device aging effects will be less of a concern
– Fewer electromigration-related defects

N. Seifert et al, "Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices", 44th Reliability Physics Symposium, 2006

Resiliency

Example fault classes:
– Permanent faults: stuck-at 0 & 1
– Gradual faults: variability, temperature
– Intermittent faults: soft errors, voltage droops
– Aging faults: degradation

Faults cause errors (data & control):
– Datapath errors: detected by parity/ECC
– Silent data corruption: needs HW hooks
– Control errors: control lost (blue screen)

Resiliency functions (error detection, fault isolation, fault confinement, reconfiguration, recovery & adaptation) span circuit & design, microarchitecture, microcode & platform, system software, the programming system, and applications, with minimal overhead for resiliency.

Architecture needs a Paradigm Shift

Must revisit and evaluate each (even legacy) architecture feature.

Architect's past and present priorities:
– Single-thread performance: frequency
– Programming productivity: legacy, compatibility, architecture features for productivity
– Constraints: (1) cost, (2) reasonable power/energy

Architect's future priorities should be:
– Throughput performance: parallelism, application-specific HW
– Power/energy: architecture features for energy, simplicity
– Constraints: (1) programming productivity, (2) cost

HW-SW Co-design

38

Circuits & Design

Applications and SW stack provide guidance for efficient system design

Architecture

Programming Sys

System SW

Applications

Limitations, issues and opportunities to exploit

Bottom-up Guidance

1. NTV reduces energy but exacerbates variations: small & fast cores, randomly distributed, temperature dependent.

2. Limited NTV for arrays (memory) due to stability issues: performance scales disproportionately with voltage for compute versus memory, so memory arrays can be made larger.

3. On-die interconnect energy (per mm) does not reduce as much as compute: from 45nm to 7nm, roughly 6X for compute versus 1.6X for interconnect.

4. At NTV, leakage power is a substantial portion of the total power: expect ~50% leakage, and idle hardware consumes energy.

5. DRAM energy scales, but not enough: ~50 pJ/b today, 8 pJ/b demonstrated with the 3D Hybrid Memory Cube, and <2 pJ/b needed.

6. System interconnect is limited by laser energy and cost (40 Gbps photonic links @ 10 pJ/b; data-movement power across die, clusters, islands, boards, cabinets, and system): BW tapering and locality awareness are necessary.

Today’s HW System Architecture

[Diagram: four sockets, each with 4GB of DRAM, form a coherent domain; many such nodes form the non-coherent domain.]

Coherent domain: 1.5 TF peak, 660 pJ/F (660 MW/Exa), 10 mB of DRAM per FLOP.

Today's programming model comprehends this system architecture.

Processor: 16 cores at ~3 GHz, ~100 W; each core has ×8 FP, a small RF, 32K I$, 32K D$, 128K I$, and 128K D$, sharing a 16 MB L3.

384 GF/s peak, 260 pJ/F (260 MW/Exa), 55 mB of local memory per FLOP.
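The pJ/F to MW-per-Exa conversion used above is direct; a minimal check:

```python
def pj_per_flop_and_mw_per_exa(power_w, peak_flops):
    pj_per_flop = power_w / peak_flops * 1e12          # energy per FLOP in pJ
    mw_per_exa = pj_per_flop * 1e-12 * 1e18 / 1e6      # scaled to 1e18 FLOP/s, in MW
    return pj_per_flop, mw_per_exa

print(pj_per_flop_and_mw_per_exa(100, 384e9))    # ~(260 pJ/F, 260 MW) for the processor
print(pj_per_flop_and_mw_per_exa(1000, 1.5e12))  # ~(667 pJ/F, 667 MW), assuming ~1 kW/node
```

(The ~1 kW node power in the second line is an assumption, used only to show where a figure like 660 pJ/F comes from.)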

Straw-man Exascale Processor

[Diagram: the simplest core (600K transistors: arithmetic unit, RF, logic); eight PEs plus a service core share a 1MB L2 at the first level of the hierarchy; groups of these clusters share a next-level cache over an interconnect, and the clusters together share a last-level cache.]

Technology: 7nm, 2018
Die area: 500 mm²
Cores: 2048
Frequency: 4.2 GHz
TFLOPs: 17.2
Power: 600 Watts
E Efficiency: 34 pJ/FLOP

Computations alone consume 34 MW for Exascale
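The table is internally consistent if each core does two FLOPs per cycle (an implied assumption); a minimal check:

```python
cores, freq_hz, flops_per_cycle = 2048, 4.2e9, 2       # 2 FLOPs/cycle assumed
peak_tflops = cores * freq_hz * flops_per_cycle / 1e12
print(peak_tflops)                                      # ~17.2 TFLOPs

power_w = 600
pj_per_flop = power_w / (peak_tflops * 1e12) * 1e12
print(pj_per_flop)                                      # ~35 pJ/FLOP (the slide rounds to 34)
print(pj_per_flop * 1e-12 * 1e18 / 1e6)                 # ~35 MW for compute alone at exascale
```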


Straw-man Architecture at NTV

                 Full Vdd       50% Vdd
Technology       7nm, 2018      7nm, 2018
Die area         500 mm²        500 mm²
Cores            2048           2048
Frequency        4.2 GHz        600 MHz
TFLOPs           17.2           2.5
Power            600 Watts      37 Watts
E Efficiency     34 pJ/Flop     15 pJ/Flop

At 50% Vdd: reduced frequency and FLOPs, but reduced power and improved energy efficiency. Compute energy efficiency is close to the Exascale goal.

[Diagram: the same simplest-core / PE-cluster / cache hierarchy as the full-Vdd straw-man processor.]

Local Memory Capacity

[Chart: local memory capacity (micro-Bytes per FLOP) across the memory hierarchy (L1 through L4), comparing a traditional design with the Exa straw-man at NTV; the straw-man provides more local memory per FLOP.]

Higher local memory capacity promotes data locality

Interconnect Structures

Structure                    Energy            Distance     Scalability
Shared bus                   1 to 10 fJ/bit    0 to 5 mm    Limited
Multi-ported shared memory   10 to 100 fJ/bit  1 to 5 mm    Limited
Crossbar switch              0.1 to 1 pJ/bit   2 to 10 mm   Moderate
Packet-switched network      1 to 3 pJ/bit     >5 mm        Scalable

[Diagram: boards connect through first-level switches, cabinets through second-level switches, and clusters of cabinets through cluster switches to form the system.]

SW Challenges

1. Extreme parallelism (1,000X due to Exa, an additional 4X due to NTV)

2. Data locality—reduce data movement

3. Intelligent scheduling—move the thread to the data if necessary

4. Fine-grain resource management (objective function)

5. Applications and algorithms incorporate the paradigm change

Programming & Execution Model

Event-driven tasks (EDT): dataflow-inspired, tiny codelets (self-contained), non-blocking, no preemption.

Programming model:
– Separation of concerns: domain specification & HW mapping
– Express data locality with hierarchical tiling
– Global, shared, non-coherent address space
– Optimization and auto-generation of EDTs (HW specific)

Execution model:
– Dynamic, event-driven scheduling, non-blocking
– Dynamic decision to move computation to data
– Observation-based adaptation (self-awareness)
– Implemented in the runtime environment

Separation of concerns: user application, control, and resource management.
Over-provisioning, introspection, self-awareness.
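A minimal sketch of the event-driven-task idea (the class and method names here are illustrative, not the talk's runtime API): codelets declare the events they depend on, and a non-blocking scheduler runs each codelet to completion only once all of its inputs have been satisfied.

```python
# Illustrative EDT runtime sketch: self-contained codelets, no blocking, no preemption.
from collections import defaultdict

class EDTRuntime:
    def __init__(self):
        self.waiting = defaultdict(list)   # event -> [(codelet, remaining events)]
        self.fired = set()

    def create_edt(self, codelet, depends_on=()):
        pending = {e for e in depends_on if e not in self.fired}
        if not pending:
            codelet()                      # all inputs already satisfied: run now
            return
        entry = (codelet, pending)
        for e in pending:
            self.waiting[e].append(entry)

    def satisfy(self, event):
        self.fired.add(event)
        for codelet, remaining in self.waiting.pop(event, []):
            remaining.discard(event)
            if not remaining:
                codelet()                  # last dependency satisfied: run to completion

rt = EDTRuntime()
rt.create_edt(lambda: print("C: combine results"), depends_on=("A_done", "B_done"))
rt.create_edt(lambda: (print("A: produce"), rt.satisfy("A_done")))
rt.create_edt(lambda: (print("B: produce"), rt.satisfy("B_done")))
```

In a real runtime the same structure would also carry data-locality hints and the dynamic decision to move a codelet to its data rather than the data to the codelet.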

Addressing variations:
1. Provide more compute HW
2. Law of large numbers
3. Static profile

[Diagram: a mix of fast (F), medium (M), and slow (S) cores scattered across the chip.]

System SW implements an introspective execution model:
1. Schedule threads based on objectives and resources (sketched below)
2. Dynamically control and manage resources
3. Identify sensors and functions in HW for implementation

Dynamic reconfiguration for: 1. energy efficiency, 2. latency, 3. dynamic resource management.

Fine-grain resource management across the processor chip: 16 clusters (each built from groups of PEs with service cores and 1MB L2s sharing an 8MB LLC) sharing a 64MB LLC.

Sensors for introspection: 1. energy consumption, 2. instantaneous power, 3. computations, 4. data movement.
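A minimal sketch of item 1, objective-based thread scheduling over variation-affected cores (the core mix, power numbers, and objective function are illustrative assumptions):

```python
# Illustrative: pick a core per thread given that variation leaves a mix of
# fast, medium, and slow (near-NTV) cores on the chip.
CORES = [{"id": i, "ghz": g, "watts": w, "busy": False}
         for i, (g, w) in enumerate([(4.2, 0.30), (4.2, 0.30),    # fast
                                     (2.1, 0.12), (2.1, 0.12),    # medium
                                     (0.6, 0.02), (0.6, 0.02)])]  # slow

def pick_core(objective):
    free = [c for c in CORES if not c["busy"]]
    if objective == "latency":
        best = max(free, key=lambda c: c["ghz"])               # fastest available
    else:                                                      # "energy"
        best = max(free, key=lambda c: c["ghz"] / c["watts"])  # most ops per joule
    best["busy"] = True
    return best

print(pick_core("latency")["id"])   # a fast core
print(pick_core("energy")["id"])    # a slow but energy-efficient core
```

A real runtime would drive the same decision from the introspection sensors listed above (energy, instantaneous power, computations, data movement).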

Over-provisioned, Introspectively Resource-Managed System

Over-provisioned in design; dynamically tuned for the given objective.

[Chart, 45nm through 5nm: system power (MW) split into data-movement power and power without data movement, against the 20 MW system goal.]

[Chart, 45nm through 5nm: data-movement power (MW) by level (die, clusters, islands, boards, cabinets, system) with 40 Gbps photonic links @ 10 pJ/b; about 3.23 MW total.]

Summary

Power & energy challenge continues

Opportunistically employ NTV operation

3D integration for DRAM

Communication energy will far exceed computation

Data locality will be paramount

Revolutionary software stack needed to make Exascale real