Efficient Computing Beyond the Validity of Moore’s...

ROMOL 2016 – Barcelona March 17, 2016. Per Stenstrom © Efficient Computing Beyond the Validity of Moore’s Law Per Stenström Chalmers University of Technology Sweden


Page 1: Efficient Computing Beyond the Validity of Moore’s Law · romol2016.bsc.es/_media/3.2.romol2016_pers.pdf · ROMOL2016_PerS.ppt · Author: Per Stenström · Created Date: 3/17/2016 3:42:28

ROMOL 2016 – Barcelona March 17, 2016. Per Stenstrom ©

Efficient Computing Beyond the Validity of Moore’s Law

Per Stenström

Chalmers University of Technology Sweden

Page 2:

Outline

•  Computer architecture beyond Moore…
•  The MECCA 3P Vision for
   – Parallelism management
   – Power management
   – Predictability management
•  Runtime assisted cache management
•  Concluding remarks

Page 3:

Scaling (as we know it) is ending soon…

A radically new way of thinking about compute efficiency is needed

Page 4:

Back to Yale…

Figure: the layers of the computing stack, from top to bottom: Problem, Algorithm, Program, ISA, Microarchitecture, Circuits, Electrons

Page 5:

Need to Put the Stake in the Ground

Figure: the layers of the computing stack, from top to bottom: Problem, Algorithm, Program, ISA, Microarchitecture, Circuits, Electrons

Page 6:

MECCA’s Approach – Interaction Across Layers

Figure: the interacting layers: Parallel programming model plus semantic information, Resource Management, ISA + primitives, Microarchitecture

Page 7:

MECCA’s 3P

•  Parallelism is ubiquitous but hard to deal with
•  Power is heavily constraining performance growth
•  Predictable time responses are becoming harder to meet

Page 8:

MECCA’s Vision for Efficient Computing

•  P1: Parallelism:
   – Programmers: Unlock parallelism and provide semantic information
   – Resource manager (runtime): Manage parallelism “under the hood”
•  P2: Power:
   – Programmers: Express quality of service attributes
   – Resource manager (runtime): Translate them into efficient use of hardware resources “under the hood”
•  P3: Predictability:
   – Programmers: Express quality of service attributes
   – Resource manager (runtime): Manage parallelism predictably to meet timing constraints “under the hood”

Page 9:

Parallelism Management

Page 10:

Task-based Dataflow Prog. Models

Figure: task graph in which TaskA feeds both TaskB and TaskC.

#pragma css task output(a)
void TaskA( float a[M][M]);
#pragma css task input(a)
void TaskB( float a[M][M]);
#pragma css task input(a)
void TaskC( float a[M][M]);

•  Programmer annotations for task dependences
•  Annotations used by run-time for scheduling
•  Dataflow task graph constructed dynamically

Convey semantic information to runtime for efficient scheduling
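
How a runtime might build the dataflow task graph dynamically from input/output annotations can be sketched as follows. This is a minimal sketch in plain C, not the actual `css` runtime: `task_graph`, `submit_task`, `has_predecessor`, `ACC_IN` and `ACC_OUT` are hypothetical names, and each task is assumed to access exactly one data object, identified by its address.

```c
#include <assert.h>

#define MAX_TASKS 16

/* Hypothetical access modes mirroring input(...) and output(...). */
enum access { ACC_IN, ACC_OUT };

struct task_graph {
    int n;                            /* tasks submitted so far         */
    const void *data[MAX_TASKS];      /* object each task accesses      */
    enum access mode[MAX_TASKS];      /* how the task accesses it       */
    int edge[MAX_TASKS][MAX_TASKS];   /* edge[p][c] = 1: c waits for p  */
};

/* Submit a task in program order and add edges from the earlier tasks
 * it depends on. Returns the id of the new task. */
int submit_task(struct task_graph *g, const void *obj, enum access mode)
{
    assert(g->n < MAX_TASKS);
    int id = g->n++;
    g->data[id] = obj;
    g->mode[id] = mode;
    for (int p = id - 1; p >= 0; p--) {
        if (g->data[p] != obj)
            continue;                 /* different object: no dependence */
        if (mode == ACC_IN && g->mode[p] == ACC_IN)
            continue;                 /* two readers never conflict      */
        g->edge[p][id] = 1;           /* RAW, WAR or WAW dependence      */
        if (g->mode[p] == ACC_OUT)
            break;                    /* older tasks are ordered via p   */
    }
    return id;
}

/* Returns 1 if some earlier task has an edge to `id`. Edges are never
 * cleared in this sketch; a real runtime would remove them as
 * predecessors finish. */
int has_predecessor(const struct task_graph *g, int id)
{
    for (int p = 0; p < id; p++)
        if (g->edge[p][id])
            return 1;
    return 0;
}
```

With a zero-initialised graph (`struct task_graph g = {0};`), submitting TaskA with `ACC_OUT` and then TaskB and TaskC with `ACC_IN` on the same array yields edges from TaskA to both consumers and none between the two consumers, so the consumers may run in parallel once TaskA completes.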

Page 11:

Runtime Management of Cache Hierarchy

View: The runtime is part of the chip and is responsible for managing the cache hierarchy, in analogy with the OS managing virtual memory

Runtime assisted cache management:
•  Runtime-guided cache coherence optimizations
•  Runtime-assisted dead-block management
•  Runtime-assisted global cache management (ongoing)

Page 12:

Runtime Guided Cache Coherence Optimizations

Dependency annotations allow for optimizations with high accuracy (like in message passing)

Optimizations:
•  Prefetching
•  Migratory sharing optimization
•  Bulk data transfer
•  Forwarding

Figure (four panels): three panels show a producer task (Prod) with Output A[1000] and a consumer task (Cons) with Input A[1000]; one panel shows tasks T1 and T2, both with Inout A[1000].
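
To illustrate why dependence annotations give the runtime accurate sharing information, the sketch below classifies a dependence edge from the access modes of its two tasks and counts the cache blocks a whole-region transfer would cover. It is an illustrative assumption, not the policy presented in the talk: the names `classify_edge` and `blocks_in_region` are hypothetical, and the usage example assumes `float` elements and 64-byte blocks, neither of which the slide specifies.

```c
#include <assert.h>
#include <stddef.h>

enum access  { ACC_INPUT, ACC_OUTPUT, ACC_INOUT };
enum sharing { SHARING_PRODUCER_CONSUMER, SHARING_MIGRATORY, SHARING_OTHER };

/* Classify a dependence edge from the annotations of its two tasks. */
enum sharing classify_edge(enum access first, enum access second)
{
    if (first == ACC_OUTPUT && second == ACC_INPUT)
        return SHARING_PRODUCER_CONSUMER;  /* Prod/Cons panels */
    if (first == ACC_INOUT && second == ACC_INOUT)
        return SHARING_MIGRATORY;          /* T1/T2 panel      */
    return SHARING_OTHER;
}

/* Number of cache blocks covering a region of `bytes` bytes that starts
 * on a block boundary: the amount of data one bulk transfer, or a
 * prefetch of the whole region, would have to move. */
size_t blocks_in_region(size_t bytes, size_t block_size)
{
    assert(block_size > 0);
    return (bytes + block_size - 1) / block_size;
}
```

For example, `blocks_in_region(1000 * sizeof(float), 64)` covers the A[1000] array of the figure under the stated assumptions. Because the region and both endpoints are known before the consumer runs, the runtime can act on the whole region at once, much as a message-passing program names sender, receiver and payload explicitly.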

Page 13:

Runtime Guided Cache Coherence [Manivannan & Stenstrom IPDPS 2014]

•  Self-invalidation provides significant gains
•  SP+D+I provides added gains

Page 14:

RADAR: Run-time Assisted Dead-block Management

[Manivannan et al., HPCA 2016]

Problem: Dead (cache) blocks waste precious cache space
State-of-the-art: dynamic dead-block schemes do not respond to future changes
RADAR:
•  Dead blocks are identified based on prediction of future access – limited success (“look-back” approach)
•  Region annotations can give runtime an outlook of future access (“look-ahead” approach)
•  “Look-back” and “look-ahead” can be combined
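
The “look-ahead” idea can be sketched with a per-region reference count kept by the runtime. This is a minimal sketch under stated assumptions, not RADAR's actual implementation: `region_table`, `region_acquire` and `region_release` are hypothetical names, regions are identified by small integers, and the runtime is assumed to learn at task creation which regions the task will access.

```c
#include <assert.h>

#define MAX_REGIONS 64

/* pending[r] = number of created-but-unfinished tasks that will access
 * region r. Zero-initialise before use. */
struct region_table {
    int pending[MAX_REGIONS];
};

/* Called when the runtime creates a task that declares an access to
 * `region`. */
void region_acquire(struct region_table *t, int region)
{
    assert(region >= 0 && region < MAX_REGIONS);
    t->pending[region]++;
}

/* Called when a task finishes its access to `region`. Returns 1 if no
 * known future task touches the region, so the runtime may hint the
 * cache that the region's blocks are dead; returns 0 otherwise. */
int region_release(struct region_table *t, int region)
{
    assert(region >= 0 && region < MAX_REGIONS);
    assert(t->pending[region] > 0);
    return --t->pending[region] == 0;
}
```

The outlook is only as far as the set of tasks already created: a task submitted later can still touch a region that was declared dead, which is one reason to combine this with a history-based (“look-back”) predictor.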

Page 15:

RADAR Overview

Figure: three-layer block diagram.
•  Application Layer: Prog. Model Dependency Annotations; Specify Task and its Region Accesses
•  Runtime Layer: Dep. Analysis, Dep. Table, Ref. Counting; Region History, Access Predictor; decision “Is Region dead after current access?” using i) Look-Ahead (LA), ii) Look-Back (LB), iii) Combined (CS + CA); Region Eviction Hints for regions found DEAD
•  Architecture Layer: Monitoring LLC Miss Counts; Evict (demote to LRU) blocks of region from LLC
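
What “evict (demote to LRU)” could mean for a single cache set is sketched below. It is a simplified software model under stated assumptions, not the hardware mechanism from the talk: `cache_set` and `demote_region` are hypothetical names, the set has 4 ways, and each block is tagged with the integer id of the region it belongs to.

```c
#include <assert.h>

#define WAYS 4

/* One cache set as a recency stack: region[0] is the MRU block and
 * region[WAYS-1] the LRU block (the next victim). Each entry stores the
 * region id of the block held in that position. */
struct cache_set {
    int region[WAYS];
};

/* Move every block of `dead_region` to the LRU end of the stack while
 * keeping the relative order of all other blocks. Returns the number of
 * blocks demoted. */
int demote_region(struct cache_set *s, int dead_region)
{
    int live[WAYS], n_live = 0, n_dead = 0;
    for (int i = 0; i < WAYS; i++) {
        if (s->region[i] == dead_region)
            n_dead++;
        else
            live[n_live++] = s->region[i];
    }
    assert(n_live + n_dead == WAYS);
    for (int i = 0; i < n_live; i++)
        s->region[i] = live[i];        /* live blocks keep their order */
    for (int i = n_live; i < WAYS; i++)
        s->region[i] = dead_region;    /* dead blocks become victims   */
    return n_dead;
}
```

Demoting rather than invalidating means a wrong hint costs little: a block that is reused before the next miss in its set is still a hit.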

Page 16:

Miss Rate Improvements

Figure: LLC misses normalized to LRU (%), y-axis from 50 to 100, for Look-Ahead, Look-Back and RADAR-Combined.

•  Look-Ahead consistently does better than Look-Back
•  RADAR-Combined provides added gains

Page 17:

Exec. Time Improvements

Figure: Execution time normalized to LRU (%), y-axis from 80 to 100, for Look-Ahead, Look-Back and RADAR-Combined on cholesky, matmul, sparselu, gauss, jacobi, redblack and their average.

Memory-bound apps show significant gains

Page 18:

Miss Reduction: SHiP vs RADAR

Figure: LLC misses normalized to LRU (%), y-axis from 50 to 110, for SHiP-PC and RADAR-Combined. SHiP-PC configuration: 2-bit SRRIP, 16K SHCT.

•  SHiP can evict data from LLCs prematurely
•  RADAR outperforms the best performing predictor

Page 19:

Power Management

What if
•  Users expressed how long a computation can take?
•  Resource manager could track progress against deadlines?
•  Resource manager could predict the remaining time as a function of resources?

Opportunities:
•  Controlled throttling of resources
•  Controlled scheduling of computations on heterogeneous substrates

In general: Considerable room for trading performance for reduced power consumption
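
The “what if” questions above can be made concrete with a small sketch of deadline-driven throttling. Everything here is an assumption for illustration: `config`, `remaining_time` and `pick_config` are hypothetical names, and the remaining time is modelled as remaining work divided by the speed of the chosen configuration, which real workloads only approximate.

```c
#include <assert.h>

/* Hypothetical resource configuration (for example a core type or a
 * voltage/frequency point) with its speed and power draw. */
struct config {
    double speed;   /* work units per second */
    double power;   /* watts                 */
};

/* Predicted remaining time as a function of the resources used. */
double remaining_time(double work_left, const struct config *c)
{
    assert(c->speed > 0.0);
    return work_left / c->speed;
}

/* Return the index of the lowest-power configuration predicted to
 * finish `work_left` within `time_left`, or -1 if none can meet the
 * deadline. */
int pick_config(const struct config *cfg, int n,
                double work_left, double time_left)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (remaining_time(work_left, &cfg[i]) > time_left)
            continue;                    /* would miss the deadline */
        if (best < 0 || cfg[i].power < cfg[best].power)
            best = i;
    }
    return best;
}
```

A resource manager tracking progress would call `pick_config` again whenever its estimate of the remaining work changes, speeding up only when the deadline is at risk and throttling down whenever there is slack.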

Page 20:

Approach – Ongoing work

Figure: three-layer approach.
•  Application Layer: QoS & WCET requirements
   – Soft periodicity, deadline req.
   – WCET: Loop bounds etc.
•  Compilation Layer: Identify critical nodes (example task graph annotated with the values 5, 2, 10, 30, 5)
•  RT and Arch. Layer: Resource mgmt on a heterogeneous substrate (big.LITTLE, core fusion)

Opportunity: Perfect match-making between resources and timing requirements

Page 21:

Predictability Management

Context: Real-time applications
Sequential processing: Establishing tight bounds on execution time (WCET) is fairly well understood
Parallel processing: Unexplored terrain

Figure: timelines of tasks T1, T2, T3 and T4 for “One execution” and “Another execution” (time on the horizontal axis).

Deterministic dynamic scheduling => New playground for trading performance for power under strict timing guarantees

WCET: max (T1+T2, T3+T4)
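
The bound above generalises to any fixed task-to-core mapping: tasks on the same core run back to back, cores run in parallel, so the bound is the largest per-core sum. The sketch below computes it under simplifying assumptions that are not from the talk: tasks are independent, there is no interference between cores, and the name `schedule_wcet` and its parameters are hypothetical.

```c
#include <assert.h>

/* WCET bound of a fixed mapping. wcet[i] is the WCET of task i on its
 * own and core_of[i] the core it is mapped to (0 .. n_cores-1). */
double schedule_wcet(const double *wcet, const int *core_of,
                     int n_tasks, int n_cores)
{
    assert(n_tasks >= 0 && n_cores > 0);
    double worst = 0.0;
    for (int c = 0; c < n_cores; c++) {
        double sum = 0.0;                /* total work on core c */
        for (int i = 0; i < n_tasks; i++)
            if (core_of[i] == c)
                sum += wcet[i];
        if (sum > worst)
            worst = sum;
    }
    return worst;
}
```

With T1 and T2 mapped to one core and T3 and T4 to another, the function returns max(T1+T2, T3+T4). A different mapping of the same four tasks generally gives a different bound, which is why a deterministic schedule is needed before any such guarantee can be stated.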

Page 22:

Concluding Remarks

Figure: the interacting layers: Parallel programming model plus semantic information, Resource Management, ISA + primitives, Microarchitecture

MECCA team members: Madhavan Manivannan, Mehrzad Nejat, Vasileios Papaefstathiou, Risat Pathan, Miquel Pericàs, Per Stenstrom, Mohammad Waqar, Petros Voudouris

Page 23:

?