ROMOL 2016 – Barcelona March 17, 2016. Per Stenstrom ©
Efficient Computing Beyond the Validity of Moore’s Law
Per Stenström
Chalmers University of Technology Sweden
Outline
• Computer architecture beyond Moore…
• The MECCA 3P Vision for
  – Parallelism management
  – Power management
  – Predictability management
• Runtime assisted cache management
• Concluding remarks
Scaling (as we know it) is ending soon…
A radically new way of thinking about compute efficiency is needed
Back to Yale…
Layers (top to bottom): Problem, Algorithm, Program, ISA, Microarchitecture, Circuits, Electrons
Need to Put the Stake in the Ground
Layers (top to bottom): Problem, Algorithm, Program, ISA, Microarchitecture, Circuits, Electrons
MECCA’s Approach – Interaction Across Layers
Layers: Parallel programming model plus semantic information, Resource Management, ISA + primitives, Microarchitecture
MECCA’s 3P
• Parallelism is ubiquitous but hard to deal with
• Power is heavily constraining performance growth
• Predictable time responses are becoming harder to meet
MECCA’s Vision for Efficient Computing
• P1: Parallelism:
  – Programmers: Unlock parallelism and provide semantic information
  – Resource manager (runtime): Manage parallelism “under the hood”
• P2: Power:
  – Programmers: Express quality of service attributes
  – Resource manager (runtime): Translate them into efficient use of hardware resources “under the hood”
• P3: Predictability:
  – Programmers: Express quality of service attributes
  – Resource manager (runtime): Manage parallelism predictably to meet timing constraints “under the hood”
Parallelism Management
Task-based Dataflow Prog. Models
[Figure: task graph in which TaskA feeds TaskB and TaskC]
#pragma css task output(a)
void TaskA(float a[M][M]);
#pragma css task input(a)
void TaskB(float a[M][M]);
#pragma css task input(a)
void TaskC(float a[M][M]);
• Programmer annotations for task dependences
• Annotations used by run-time for scheduling
• Dataflow task graph constructed dynamically
Convey semantic information to runtime for efficient scheduling
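One way a runtime might turn such annotations into a dataflow graph is sketched below. This is a minimal illustration, not the actual runtime: the names `graph_t`, `data_obj_t`, and `submit_task` are hypothetical, each task touches a single data object, and only read-after-write dependences are tracked.

```c
#include <assert.h>

#define MAX_TASKS 16
#define MAX_DEPS  8

typedef enum { ACCESS_INPUT, ACCESS_OUTPUT } access_t;

typedef struct {
    const char *name;
    int deps[MAX_DEPS];   /* indices of tasks this task must wait for */
    int ndeps;
} task_t;

typedef struct {
    task_t tasks[MAX_TASKS];
    int ntasks;
} graph_t;

/* Per-object bookkeeping: which task last produced this data (-1 = none). */
typedef struct {
    int last_writer;
} data_obj_t;

/* Submit a task that touches one data object; returns the task index.
 * Hypothetical helper: edges are added as tasks arrive, so the graph
 * is built dynamically at run time. */
int submit_task(graph_t *g, const char *name, data_obj_t *obj, access_t mode)
{
    assert(g->ntasks < MAX_TASKS);
    int id = g->ntasks++;
    task_t *t = &g->tasks[id];
    t->name = name;
    t->ndeps = 0;

    if (mode == ACCESS_INPUT && obj->last_writer >= 0) {
        /* Read-after-write: the reader depends on the last producer. */
        t->deps[t->ndeps++] = obj->last_writer;
    } else if (mode == ACCESS_OUTPUT) {
        /* Later readers of this object will depend on this task. */
        obj->last_writer = id;
    }
    return id;
}
```

Submitting TaskA with `ACCESS_OUTPUT` on `a`, then TaskB and TaskC with `ACCESS_INPUT`, gives both consumers a single dependence on TaskA, matching the task graph on the slide.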
Runtime Management of Cache Hierarchy
View: The runtime is part of the chip and is responsible for managing the cache hierarchy, in analogy with an OS managing virtual memory
Runtime assisted cache management:
• Runtime-guided cache coherence optimizations
• Runtime-assisted dead-block management
• Runtime-assisted global cache management (ongoing)
Runtime Guided Cache Coherence Optimizations
Dependency annotations allow for optimizations with high accuracy (like in message passing)
Optimizations:
• Prefetching
• Migratory sharing optimization
• Bulk data transfer
• Forwarding
[Figure: three producer-consumer panels (Prod: Output A[1000]; Cons: Input A[1000]) and one panel in which tasks T1 and T2 both declare Inout A[1000]]
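The annotation pairs shown in the figure could be mapped to sharing patterns roughly as follows. This is a sketch under assumptions: the enum and function names are hypothetical, and the mapping (output followed by input as producer-consumer, inout followed by inout as migratory) is an interpretation of the figure, not the published mechanism.

```c
typedef enum { ACC_INPUT, ACC_OUTPUT, ACC_INOUT } access_t;

typedef enum {
    SHARE_NONE,               /* no optimization suggested by this pair */
    SHARE_PRODUCER_CONSUMER,  /* output followed by input */
    SHARE_MIGRATORY           /* inout followed by inout */
} sharing_t;

/* Classify how two consecutive tasks share the same data, based only on
 * their declared access modes (hypothetical helper). */
sharing_t classify_sharing(access_t first, access_t second)
{
    if (first == ACC_OUTPUT && second == ACC_INPUT)
        return SHARE_PRODUCER_CONSUMER;  /* candidate for forwarding, prefetching, bulk transfer */
    if (first == ACC_INOUT && second == ACC_INOUT)
        return SHARE_MIGRATORY;          /* candidate for the migratory sharing optimization */
    return SHARE_NONE;
}
```

Because the access modes are declared by the programmer rather than guessed by hardware, the classification is exact for the annotated data, which is the sense in which the slide compares it to message passing.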
Runtime Guided Cache Coherence [Manivannan & Stenstrom IPDPS 2014]
• Self-invalidation provides significant gains
• SP+D+I provides added gains
RADAR: Run-time Assisted Dead-block Management
[Manivannan et al., HPCA 2016]
Problem: Dead (cache) blocks waste precious cache space
State-of-the-art: dynamic dead-block schemes do not respond to future changes
RADAR:
• Dead blocks are identified based on prediction of future access, with limited success (“look-back” approach)
• Region annotations can give runtime an outlook of future access (“look-ahead” approach)
• “Look-back” and “look-ahead” can be combined
RADAR Overview
[Figure: RADAR block diagram spanning the Application, Runtime, and Architecture layers. Labels: Prog. model dependency annotations; specify task and its region accesses; dep. analysis; dep. table; ref. counting; monitoring LLC miss counts; region history; access predictor; region eviction hints]
Is region dead after current access?
i) Look-Ahead (LA)
ii) Look-Back (LB)
iii) Combined (CS + CA)
If DEAD: evict (demote to LRU) blocks of region from LLC
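The look-ahead path (dependence analysis plus reference counting) can be sketched as below. This is a simplified illustration, not RADAR's implementation: `region_t` and both functions are hypothetical, and the eviction hint is reduced to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int pending_accesses;  /* submitted tasks that still have to touch this region */
    bool dead_hint_sent;   /* set once an eviction hint has been issued */
} region_t;

/* Called when a task naming this region in its annotations is submitted. */
void region_task_submitted(region_t *r)
{
    r->pending_accesses++;
    r->dead_hint_sent = false;  /* a new future access revives the region */
}

/* Called when such a task finishes. Returns true if the region is now
 * considered dead, i.e. no known future task will access it. */
bool region_task_finished(region_t *r)
{
    assert(r->pending_accesses > 0);
    r->pending_accesses--;
    if (r->pending_accesses == 0) {
        r->dead_hint_sent = true;  /* stand-in for "demote region blocks to LRU" */
        return true;
    }
    return false;
}
```

A region with two pending consumers is reported dead only when the second one finishes. A look-back predictor would instead have to infer this from past access history.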
Miss Rate Improvements
[Figure: LLC misses normalized to LRU (%), axis from 50 to 100, for Look-Ahead, Look-Back, and RADAR-Combined]
• Look-Ahead consistently does better than Look-Back
• RADAR-Combined provides added gains
Exec. Time Improvements
[Figure: execution time normalized to LRU (%), axis from 80 to 100, for Look-Ahead, Look-Back, and RADAR-Combined on cholesky, matmul, sparselu, gauss, jacobi, redblack, and Average]
Memory-bound apps see significant gains
Miss Reduction: SHiP vs RADAR
[Figure: LLC misses normalized to LRU (%), axis from 50 to 110, for SHiP-PC and RADAR-Combined]
• SHiP can evict data from LLCs prematurely
• RADAR outperforms the best performing predictor
SHiP configuration: SHiP-PC, 2-bit SRRIP, 16K SHCT
Power Management
What if
• Users expressed how long a computation may take?
• Resource manager could track progress against deadlines?
• Resource manager could predict the remaining time as a function of resources?
Opportunities:
• Controlled throttling of resources
• Controlled scheduling of computations on heterogeneous substrates
In general:
Considerable room for trading performance for reduced power consumption
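As an illustration of predicting remaining time as a function of resources, the sketch below picks the lowest clock frequency that still meets a deadline. Everything here is an assumption made for the example: the function names, the use of frequency as the only resource, and the model that time scales inversely with frequency.

```c
#include <assert.h>
#include <stddef.h>

/* Predicted remaining time (seconds) for `cycles_left` cycles at `freq_hz`.
 * Assumes time scales inversely with frequency, which ignores memory stalls. */
double predict_remaining(double cycles_left, double freq_hz)
{
    assert(freq_hz > 0.0);
    return cycles_left / freq_hz;
}

/* Returns the index of the lowest frequency that still meets the deadline,
 * or -1 if none does. `freqs_hz` must be sorted in ascending order. */
int pick_frequency(const double *freqs_hz, size_t n,
                   double cycles_left, double time_left_s)
{
    for (size_t i = 0; i < n; i++) {
        if (predict_remaining(cycles_left, freqs_hz[i]) <= time_left_s)
            return (int)i;
    }
    return -1;
}
```

With frequencies of 1.0, 1.5, and 2.0 GHz, 3e9 cycles left, and 2.5 s to the deadline, the middle setting is chosen: 1.0 GHz would need 3 s, and 2.0 GHz would finish earlier than required at a higher power cost.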
Approach – Ongoing work
• Application Layer: QoS & WCET requirements
  – Soft periodicity, deadline req.
  – WCET: Loop bounds etc.
• Compilation Layer: Identify critical nodes
  [Figure: task graph annotated with the values 5, 2, 10, 30, 5]
• RT and Arch. Layer: Resource mgmt on a heterogeneous substrate (big.LITTLE, core fusion)
Opportunity: Perfect match-making between resources and timing requirements
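Identifying critical nodes can be illustrated with a longest-path computation over a weighted task graph. This is a sketch only, since the slides do not show the compiler analysis itself: the function name and the adjacency-matrix representation are hypothetical, and tasks are assumed to be numbered in topological order.

```c
#include <stdbool.h>

#define N 5  /* number of tasks in this hypothetical graph */

/* Marks tasks on a longest (critical) path of a DAG and returns its length.
 * edge[i][j] != 0 means task i must finish before task j starts.
 * Tasks are assumed to be numbered in topological order (i < j for every edge). */
int critical_nodes(const int w[N], const int edge[N][N], bool critical[N])
{
    int head[N];  /* longest path ending at i, including w[i] */
    int tail[N];  /* longest path starting at i, including w[i] */
    int total = 0;

    for (int j = 0; j < N; j++) {
        head[j] = w[j];
        for (int i = 0; i < j; i++)
            if (edge[i][j] && head[i] + w[j] > head[j])
                head[j] = head[i] + w[j];
        if (head[j] > total)
            total = head[j];
    }
    for (int i = N - 1; i >= 0; i--) {
        tail[i] = w[i];
        for (int j = i + 1; j < N; j++)
            if (edge[i][j] && tail[j] + w[i] > tail[i])
                tail[i] = tail[j] + w[i];
    }
    /* A task is critical if the longest path through it equals the total. */
    for (int i = 0; i < N; i++)
        critical[i] = (head[i] + tail[i] - w[i] == total);
    return total;
}
```

Tasks off the critical path have slack, so a resource manager could place them on slower, lower-power cores without lengthening the schedule.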
Predictability Management
Context: Real-time applications
Sequential processing: Establishing tight bounds on execution time (WCET) is fairly well understood
Parallel processing: Unexplored terrain
[Figure: two timelines, "One execution" and "Another execution", each scheduling tasks T1, T2, T3, and T4 over time]
Deterministic dynamic scheduling => New playground for trading performance for power under strict timing guarantees
WCET: max (T1+T2, T3+T4)
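The bound above can be read as two cores, each running two tasks back to back, with the slower core determining the overall worst case. The helper below is a minimal sketch of that reading; the function name is hypothetical and the two chains are assumed not to interfere with each other.

```c
#include <assert.h>

static double max2(double a, double b) { return a > b ? a : b; }

/* WCET bound for two chains T1->T2 and T3->T4 running on separate cores.
 * Arguments are per-task WCETs; the chains are assumed not to interfere. */
double wcet_two_chains(double t1, double t2, double t3, double t4)
{
    assert(t1 >= 0 && t2 >= 0 && t3 >= 0 && t4 >= 0);
    return max2(t1 + t2, t3 + t4);
}
```

The bound only holds if the task-to-core mapping is the same in every execution, which is why the slide pairs it with deterministic dynamic scheduling.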
Concluding Remarks
Interaction across layers: Parallel programming model plus semantic information, Resource Management, ISA + primitives, Microarchitecture
MECCA team members: Madhavan Manivannan, Mehrzad Nejat, Vasileios Papaefstathiou, Risat Pathan, Miquel Pericàs, Per Stenstrom, Mohammad Waqar, Petros Voudouris
?