Efficient Computing Beyond the Validity of Moore’s...

ROMOL 2016 – Barcelona March 17, 2016. Per Stenstrom ©

Efficient Computing Beyond the Validity of Moore’s Law

Per Stenström

Chalmers University of Technology Sweden


Outline

•  Computer architecture beyond Moore… •  The MECCA 3P Vision for

– Parallelism management – Power management – Predictability management

•  Runtime assisted cache management •  Concluding remarks


Scaling (as we know it) is ending soon…

A radical new way of how we think about

compute efficiency is needed


Program

ISA

Problem

Back to Yale…

Algorithm

Microarchitecture

Circuits

Electrons


Program

ISA

Problem

Need to Put the Stake in the Ground

Algorithm

Microarchitecture

Circuits

Electrons


MECCAs Approach – Interaction Across Layers

ISA + primitives

Resource Management

Microarchitecture

Parallel programming model plus semantic information


•  Parallelism is ubiquitous but hard to deal with •  Power is heavily constraining performance growth •  Predictable time responses are becoming harder to

meet

MECCAs 3P


MECCA’s Vision for Efficient Computing

•  P1: Parallelism: –  Programmers: Unlock parallelism and provide semantic

information –  Resource manager (runtime): Manage parallelism “under the

hood” •  P2: Power:

–  Programmers: Express quality of service attributes –  Resource manager (runtime): Translate them into efficient use

of hardware resources “under the hood” •  P3: Predictability:

–  Programmers: Express quality of service attributes –  Resource manager (runtime): Manage parallelism predictably

to meet timing constraints “under the hood”


Parallelism Management


Task-based Dataflow Prog. Models

TaskA

TaskC TaskB

#pragma css task output(a) void TaskA( float a[M][M]); #pragma css task input(a) void TaskB( float a[M][M]); #pragma css task input(a) void TaskC( float a[M][M]);

•  Programmer annotations for task dependences •  Annotations used by run-time for scheduling •  Dataflow task graph constructed dynamically

Convey semantic information to runtime for efficient scheduling


Runtime Management of Cache Hierarchy

View: Runtime is part of the chip and responsible for management of the cache hierarchy - in analogy with OS managing virtual memory

Runtime assisted cache management: •  Runtime-guided cache coherence optimizations •  Runtime-assisted dead-block management •  Runtime-assisted global cache management (ongoing)


Runtime Guided Cache Coherence Optimizations

Dependency annotations allow for optimizations with high accuracy (like in message passing)

Prefetching

Migratory sharing optimization

Bulk data transfer

Forwarding

Prod Cons

Output A[1000]

Input A[1000]

Prod Cons

Output A[1000]

Input A[1000]

Prod Cons

Output A[1000]

Input A[1000]

T1 T2

Inout A[1000]

Inout A[1000]


•  Self-invalidation provides significant gains •  SP+D+I provides added gains

Runtime Guided Cache Coherence [Manivannan & Stenstrom IPDPS 2014]


RADAR: Run-time Assisted Dead-block Management

[Manivannan et al., HPCA 2016]

Problem: Dead (cache) blocks waste precious cache space State-of-the-art: dynamic dead-block schemes do not respond to future changes RADAR: •  Dead blocks are identified based on prediction of future access

– limited success (“look-back” approach) •  Region annotations can give runtime an outlook of future access

(“look-ahead” approach) •  “Look-back” and “look-ahead” can be combined


RADAR Overview

15

Application Layer

Runtime Layer

Architecture Layer

Prog. Model Dependency Annotations

Dep. Analysis Dep. Table

Ref. Counting

Monitoring LLC Miss Counts

Specify Task and its Region Accesses

Region History

Access Predictor

Is Region dead after current access?

DEAD

i) Look-Ahead (LA) ii) Look-back (LB) iii) Combined (CS + CA)

Evict (demote to LRU) blocks of region

from LLC

Region Eviction

Hints


Miss Rate Improvements

50

60

70

80

90

100 LL

C M

isse

s n

orm

aliz

ed to

LR

U (%

) Look-Ahead Look-Back RADAR-Combined

Look-Ahead consistently does better than Look-back RADAR-Combined provides added gains


Exec. Time Improvements

cholesky matmul sparselu gauss jacobi redblack Average 80

84

88

92

96

100 E

xecu

tion

Tim

e n

orm

aliz

ed to

LR

U (%

)

Look-Ahead Look-Back RADAR-Combined

Memory bound apps provide significant gains


Miss Reduction: SHIP vs RADAR

50 60 70 80 90

100 110

LLC

Mis

ses

no

rmal

ized

to L

RU

(%)

SHiP-PC RADAR-Combined

SHiP can evict data from LLCs prematurely RADAR outperforms the best performing predictor

SHiP-PC 2-bit SRRIP 16K SHCT


Power Management What if •  Users expressed how long time a computation can take? •  Resource manager could track progress against deadlines? •  Resource manager could predict the remaining time as a

function of resources? Opportunities: •  Controlled throttling of resources •  Controlled scheduling of computations on heterogeneous

substrates In general:

Considerable room for trading performance for reduced power consumption


Approach – Ongoing work

Compilation Layer

Identify Critical nodes

5

2 10

30 5

RT and Arch. Layer

Resource mgmt

Heterogeneous substrate

big LITTLE Core fusion

Application Layer

QoS & WCET requirements

•  Soft periodicity, deadline req.

•  WCET: Loop bounds etc.

Opportunity: Perfect match-making between resources and timing requirements


Predictability Management Context: Real-time applications Sequential processing: Establishing tight bounds on execution time (WCET) is fairly well understood Parallel processing: Unexplored terrain

Another execution T1

T2

T3

T4

One execution T1

T2

T3

T4

Time

Deterministic dynamic scheduling => New playground for trading performance for power under strict timing

guarantees

WCET: max (T1+T2, T3+T4)


Concluding Remarks

ISA + primitives

Resource Management

Microarchitecture

Parallel programming model plus semantic information

MECCA team members: Madhavan Manivannan, Mehrzad Nejat, Vasileios Papaefstathiou, Risat Pathan, Miquel Pericàs, Per Stenstrom, Mohammad Waqar, Petros Voudouris


?

Efficient Computing Beyond the Validity of Moore’s...

Documents

Transcript of Efficient Computing Beyond the Validity of Moore’s...