Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency

Elad Alon, Krste Asanovic (Director), Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson, Koushik Sen, John Wawrzynek

aspire.eecs.berkeley.edu

UC Berkeley Future Application Drivers

Pervasive Speech

BIG DATA

Augmented Reality

Personalized Medicine

Environment

Social Networks

Robotics

Compute Energy “Iron Law”

When power is constrained, need better energy efficiency for more performance

Where performance is constrained (real-time), want better energy efficiency to lower power

Improving energy efficiency is a critical goal for all future systems and workloads

Performance (Tasks/Second) = Power (Joules/Second) × Energy Efficiency (Tasks/Joule)
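As a rough worked example of the Iron Law, the sketch below plugs made-up numbers into both cases: a fixed power budget, and a fixed real-time performance target. The function names and every numeric value are illustrative assumptions, not figures from the talk.

```python
def performance(power_watts, efficiency_tasks_per_joule):
    """Tasks/second = (Joules/second) * (Tasks/Joule)."""
    return power_watts * efficiency_tasks_per_joule


def power_needed(target_tasks_per_second, efficiency_tasks_per_joule):
    """Rearranged Iron Law: Joules/second = (Tasks/second) / (Tasks/Joule)."""
    return target_tasks_per_second / efficiency_tasks_per_joule


# Power-constrained case: a hypothetical 2 W budget.
perf_base = performance(2.0, 50.0)      # baseline efficiency (assumed)
perf_better = performance(2.0, 100.0)   # doubled efficiency (assumed)

# Performance-constrained case: a hypothetical real-time target of 60 tasks/s.
power_base = power_needed(60.0, 50.0)
power_better = power_needed(60.0, 100.0)

print(perf_base, perf_better)    # performance doubles at the same power
print(power_base, power_better)  # power halves at the same performance
```

Doubling efficiency either doubles throughput under the power cap or halves power at the real-time target, which is why the slide treats efficiency as the common goal.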

Good News: Moore’s Law Continues

“Cramming more components onto integrated circuits”, Gordon E. Moore, Electronics, 1965

More Transistors/Chip

Cheaper!

Bad News: Dennard (Voltage) Scaling Over

Dennard Scaling

Post-Dennard Scaling

Moore, ISSCC Keynote, 2003

1st Impact of End of Scaling: End of Sequential Processor Era

Parallelism: A one-time gain

Use more, slower cores for better energy efficiency: either simpler cores, or run cores at lower Vdd/frequency

Even simpler general-purpose microarchitectures?
- Limited by smallest sensible core

Even lower Vdd/frequency?
- Limited by Vdd/Vt scaling, errors

Now what?

2nd Impact of End of Scaling: “Dark Silicon”

Cannot switch all transistors at full frequency!

[Muller, ARM CTO, 2009]

No savior device technology on the horizon. Future energy-efficiency innovations must be above the transistor level.

The End of General-Purpose Processors?

Most computing happens in specialized, heterogeneous processors
- Can be 100-1000X more efficient than a general-purpose processor

Challenges:
- Hardware design costs
- Software development costs

NVIDIA Tegra2

The Real Scaling Challenge: Communication

As transistors become smaller and cheaper, communication dominates performance and energy

At all scales:
- Across chip
- Up and down memory hierarchy
- Chip-to-chip
- Board-to-board
- Rack-to-rack

ASPIRE: From Better to Best

What is the best we can do?
- For a fixed target technology (e.g., 7nm)

Can we prove a bound? Can we design an implementation approaching the bound?

Provably Optimal Implementations

Specialize and optimize communication and computation across the whole stack, from applications to hardware

Communication-Avoiding Algorithms: Algorithm Cost Measures

1. Arithmetic (FLOPs)
2. Communication: moving data between
   - levels of a memory hierarchy (sequential case)
   - processors over a network (parallel case)

[Figure: sequential case, a CPU with cache and DRAM; parallel case, four CPU+DRAM nodes]

Modeling Runtime & Energy

A few examples of speedups

Matrix multiplication
- Up to 12x on IBM BG/P for n=8K on 64K cores; 95% less communication

QR decomposition (used in least squares, data mining, …)
- Up to 8x on 8-core dual-socket Intel Clovertown, for 10M x 10
- Up to 6.7x on 16-proc. Pentium III cluster, for 100K x 200
- Up to 13x on Tesla C2050 / Fermi, for 110k x 100
- Up to 4x on Grid of 4 cities (Dongarra, Langou et al)
- “Infinite speedup” for out-of-core on PowerPC laptop: LAPACK thrashed virtual memory, didn’t finish

Eigenvalues of band symmetric matrices
- Up to 17x on Intel Gainestown, 8 core, vs MKL 10.0 (up to 1.9x sequential)

Iterative sparse linear equations solvers (GMRES)
- Up to 4.3x on Intel Clovertown, 8 core

N-body (direct particle interactions with cutoff distance)
- Up to 10x on Cray XT-4 (Hopper), 24K particles on 6K procs.

Modeling Energy: Dynamic

Modeling Energy: Memory Retention

Modeling Energy: Background Power

Energy Lower Bounds

Early Result: Perfect Strong Scaling in Time and Energy

Every time you add a processor, use its memory $M$ too

Start with minimal number of procs: $PM = 3n^2$

Increase $P$ by a factor of $c$: total memory increases by a factor of $c$

Notation for timing model:
- $\gamma_t$, $\beta_t$, $\alpha_t$ = secs per flop, per word_moved, per message of size $m$

$$T(cP) = \frac{n^3}{cP}\left[\gamma_t + \frac{\beta_t}{M^{1/2}} + \frac{\alpha_t}{m\,M^{1/2}}\right] = \frac{T(P)}{c}$$

Notation for energy model:
- $\gamma_e$, $\beta_e$, $\alpha_e$ = Joules for same operations
- $\delta_e$ = Joules per word of memory used per sec
- $\varepsilon_e$ = Joules per sec for leakage, etc.

$$E(cP) = cP\left\{\frac{n^3}{cP}\left[\gamma_e + \frac{\beta_e}{M^{1/2}} + \frac{\alpha_e}{m\,M^{1/2}}\right] + \delta_e M\,T(cP) + \varepsilon_e\,T(cP)\right\} = E(P)$$

Perfect scaling extends to n-body, Strassen, … [IPDPS, 2013]
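The timing and energy models above can be checked numerically. The sketch below evaluates both formulas at P and at cP, holding the per-processor memory M fixed, and confirms that time drops by c while energy stays the same. All machine parameters here are invented placeholders chosen only to exercise the formulas; they are not measurements from the talk.

```python
import math


def run_time(n, P, M, gamma_t, beta_t, alpha_t, m):
    """T(P) = n^3/P * [gamma_t + beta_t/sqrt(M) + alpha_t/(m*sqrt(M))]."""
    per_flop = gamma_t + beta_t / math.sqrt(M) + alpha_t / (m * math.sqrt(M))
    return (n ** 3 / P) * per_flop


def energy(n, P, M, T, gamma_e, beta_e, alpha_e, delta_e, eps_e, m):
    """E(P) = P * { n^3/P * [gamma_e + ...] + delta_e*M*T + eps_e*T }."""
    per_flop = gamma_e + beta_e / math.sqrt(M) + alpha_e / (m * math.sqrt(M))
    return P * ((n ** 3 / P) * per_flop + delta_e * M * T + eps_e * T)


# Hypothetical problem and machine parameters (assumptions for illustration).
n, P_min, c, m = 1024, 64, 4, 512
M = 3 * n * n // P_min                 # minimal-memory start: P * M = 3 n^2
t_params = dict(gamma_t=1e-10, beta_t=1e-8, alpha_t=1e-6, m=m)
e_params = dict(gamma_e=1e-9, beta_e=1e-7, alpha_e=1e-5,
                delta_e=1e-9, eps_e=1e-1, m=m)

T_P = run_time(n, P_min, M, **t_params)
T_cP = run_time(n, c * P_min, M, **t_params)
E_P = energy(n, P_min, M, T_P, **e_params)
E_cP = energy(n, c * P_min, M, T_cP, **e_params)

print("T(P)/T(cP) =", T_P / T_cP)   # close to c
print("E(cP)/E(P) =", E_cP / E_P)   # close to 1
```

The energy result follows because each per-processor term carries a factor T(cP) = T(P)/c, which cancels the factor c in the processor count.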

C-A Algorithms Not Just for HPC

In ASPIRE, apply to other key application areas: machine vision, databases, speech recognition, software-defined radio, …

Initial results on lower bounds of database join algorithms

From C-A Algorithms to Provably Optimal Systems?

1) Prove lower bounds on communication for a computation

2) Develop algorithm that achieves lower bound on a system

3) Find that communication time/energy cost is >90% of resulting implementation

4) We know we’re within 10% of optimal!
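The four steps above amount to a short arithmetic argument, sketched here under stated assumptions: total cost is treated as communication plus everything else, the implementation's communication cost equals the proven lower bound, and no implementation can cost less than that bound. The function name and the example numbers are hypothetical.

```python
def optimality_gap(total_cost, comm_cost_at_lower_bound):
    """Largest fraction of total_cost that a better implementation could save.

    Assumes communication already meets its proven lower bound, so any
    implementation must pay at least comm_cost_at_lower_bound.
    """
    if not 0 < comm_cost_at_lower_bound <= total_cost:
        raise ValueError("lower bound must be positive and at most the total")
    return (total_cost - comm_cost_at_lower_bound) / total_cost


# Hypothetical measurement: 10 J total, of which 9.2 J is communication.
gap = optimality_gap(10.0, 9.2)
print(f"at most {gap:.0%} of the cost could still be removed")
```

This is also why the supporting technique matters: shrinking compute cost raises the communication share and tightens the bound on the remaining gap.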

Supporting technique: Optimizing software stack and compute engines to reduce compute costs and expose unavoidable communication costs

ESP: An Applications Processor Architecture for ASPIRE

Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore

It is well known how to customize hardware engines for a specific task

ESP challenge is using specialized engines for general-purpose code

[Figure: Intel Ivy Bridge (22nm) and Qualcomm Snapdragon MSM8960 (28nm), each labeled “ESP”]

ESP: Ensembles of Specialized Processors

General-purpose hardware: flexible but inefficient

Fixed-function hardware: efficient but inflexible

Par Lab insight: patterns capture common operations across many applications, each with unique communication & computation structure

Build an ensemble of specialized engines, each individually optimized for a particular pattern but collectively covering application needs

Bet: Will give us efficiency plus flexibility
- Any given core can have a different mix of these depending on workload

Par Lab: Motifs common across apps

[Figure: applications (Audio Recognition, Object Recognition, Scene Analysis) mapped onto Berkeley View “Dwarfs” or Motifs (Dense, Sparse, Graph, …)]

[Figure: motif (née “Dwarf”) popularity (red = hot, blue = cool) across Par Lab apps and computing domains]

Architecting Parallel Software

Identify the Software Structure
• Pipe-and-Filter
• Agent-and-Repository
• Event-based
• Bulk Synchronous
• Map-Reduce
• Layered Systems
• Model-View-Controller
• Arbitrary Task Graphs
• Puppeteer

Identify the Key Computations
• Graph Algorithms
• Dynamic Programming
• Dense/Sparse Linear Algebra
• Un/Structured Grids
• Graphical Models
• Finite State Machines
• Backtrack Branch-and-Bound
• N-Body Methods
• Circuits
• Spectral Methods
• Monte-Carlo

Mapping Software to ESP: Specializers

Capture desired functionality at high-level using patterns in a productive high-level language

Use pattern-specific compilers (Specializers) with autotuners to produce efficient low-level code

ASP specializer infrastructure, open-source download
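To make the autotuning half of a specializer concrete, here is a minimal sketch: a kernel with one tunable parameter (the tile size of a blocked matrix multiply) and a loop that times each candidate and keeps the fastest. This is plain Python written for illustration; it does not use the ASP infrastructure or its API, and `blocked_matmul`, `autotune`, and the candidate tile sizes are all assumed names and values.

```python
import time


def blocked_matmul(A, B, n, block):
    """Multiply two n x n matrices (lists of lists) using block x block tiles."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    row_a, row_c = A[i], C[i]
                    for k in range(kk, min(kk + block, n)):
                        a_ik, row_b = row_a[k], B[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += a_ik * row_b[j]
    return C


def autotune(kernel, args, candidates, repeats=2):
    """Time kernel(*args, candidate) for each candidate; return the fastest."""
    timings = {}
    for cand in candidates:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(*args, cand)
            best = min(best, time.perf_counter() - start)
        timings[cand] = best
    return min(timings, key=timings.get), timings


n = 32
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float((i * j) % 7) for j in range(n)] for i in range(n)]
candidates = (4, 8, 16, 32)
best_block, timings = autotune(blocked_matmul, (A, B, n), candidates)
print("fastest tile size on this machine:", best_block)
```

The winning tile size depends on the machine it runs on, which is the point: the pattern-level code stays the same while the tuned low-level variant changes per target.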

[Figure: applications (Object Recognition, Scene Analysis) expressed with Berkeley View “Dwarfs” or Motifs; specializers with SEJITS implementations and autotuning produce ESP code (glue, dense, sparse, and graph code) for the ESP core engines (ILP, dense, sparse, and graph engines)]

Replacing Fixed Accelerators with Programmable Fabric

Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore

Fabric challenge is achieving extreme energy efficiency while retaining programmability

[Figure: Intel Ivy Bridge (22nm) and Qualcomm Snapdragon MSM8960 (28nm), each labeled “Fabric”]

Strawman Fabric Architecture

[Figure: array of tiles labeled “ARM”]

Will never have a C compiler

Only programmed using pattern-based DSLs

More dynamic, less static than earlier approaches

Dynamic dataflow-driven execution

Dynamic routing

Large memory support

“Agile Hardware” Development

Current hardware design is slow and arduous

But now have a huge design space to explore

How to examine many design points efficiently?

Build parameterized generators, not point designs!

Adopt and adapt best practices from Agile Software
- Complete LVS-DRC clean physical design of current version every ~two weeks (“tapein”)
- Incremental feature addition
- Test & verification as first step

Chisel: Constructing Hardware In a Scala Embedded Language

Embed a hardware-description language in Scala, using Scala’s extension facilities

A hardware module is just a data structure in Scala

Different output routines can generate different types of output (C, FPGA-Verilog, ASIC-Verilog) from the same hardware representation

Full power of Scala for writing hardware generators
- Object-Oriented: Factory objects, traits, overloading, etc.
- Functional: Higher-order funcs, anonymous funcs, currying
- Compiles to JVM: Good performance, Java interoperability
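The idea that a hardware module is a data structure with several output routines can be sketched in a few lines. The example below is written in Python rather than Scala and is not Chisel or its API: `Input`, `Add`, `adder_tree`, `emit_verilog`, and `simulate` are hypothetical names invented for this illustration. A parameterized generator builds an adder tree as plain objects, and two separate backends walk the same structure, one emitting Verilog text and one simulating it in software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Input:
    name: str
    width: int


@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object


def adder_tree(n_inputs, width):
    """Generator: build a balanced adder tree over n_inputs inputs."""
    nodes = [Input(f"in{i}", width) for i in range(n_inputs)]
    while len(nodes) > 1:
        paired = [Add(nodes[i], nodes[i + 1]) for i in range(0, len(nodes) - 1, 2)]
        if len(nodes) % 2:
            paired.append(nodes[-1])  # odd node carries over to the next level
        nodes = paired
    return nodes[0]


def inputs_of(node):
    if isinstance(node, Input):
        return [node]
    return inputs_of(node.lhs) + inputs_of(node.rhs)


def to_verilog_expr(node):
    if isinstance(node, Input):
        return node.name
    return f"({to_verilog_expr(node.lhs)} + {to_verilog_expr(node.rhs)})"


def emit_verilog(name, root, width):
    """Backend 1: print the structure as a Verilog module."""
    ports = [f"  input  [{p.width - 1}:0] {p.name}," for p in inputs_of(root)]
    lines = [f"module {name}(", *ports,
             f"  output [{width - 1}:0] sum", ");",
             f"  assign sum = {to_verilog_expr(root)};", "endmodule"]
    return "\n".join(lines)


def simulate(node, values, width):
    """Backend 2: evaluate the same structure in software, wrapping at width bits."""
    mask = (1 << width) - 1
    if isinstance(node, Input):
        return values[node.name] & mask
    return (simulate(node.lhs, values, width) + simulate(node.rhs, values, width)) & mask


root = adder_tree(4, 8)
verilog = emit_verilog("adder_tree4", root, 8)
result = simulate(root, {"in0": 200, "in1": 100, "in2": 3, "in3": 4}, 8)
print(verilog)
print("simulated sum:", result)
```

Changing the arguments to `adder_tree` yields a different design point from the same generator, and neither backend needs to change.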

Chisel Design Flow

[Figure: the Chisel program runs on Scala/JVM and produces three outputs: C++ code → C++ Compiler → Software Simulator; FPGA Verilog → FPGA Tools → FPGA Emulation; ASIC Verilog → ASIC Tools → GDS Layout]

Chisel is much more than an HDL

The base Chisel system allows you to use the full power of Scala to describe the RTL of a design, then generate Verilog or C++ output from the RTL

But Chisel can be extended above with domain-specific languages (e.g., signal processing) for fabric

Importantly, Chisel can also be extended below with new backends or to add new tools or features (e.g., quantum computing circuits)

Only ~6,000 lines of code in current version including libraries!

BSD-licensed open source at: chisel.eecs.berkeley.edu

Many processor tapeouts in few years with small group (45nm, 28nm)

[Figure: chip layout with clock, SRAM, and DCDC test sites, and a processor site containing Cores 0-3 (labeled VC0-VC3) and a 512KB L2 (labeled VFIXED)]

Resilient Circuits & Modeling

Future scaled technologies have high variability but want to run with lowest-possible margins to save energy

Significant increase in soft errors, need resilient systems

Technology modeling to determine tradeoff between MTBF and energy per task for logic, SRAM, & interconnect.

Techniques to reduce operating voltage can be worse for energy due to rapid rise in errors

[Figure: ASPIRE overview (Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency), spanning software to hardware. Labels: applications (Object Recognition, Scene Analysis); Computational and Structural Patterns (Pipe&Filter, Map-Reduce, …); Communication-Avoiding Algorithms (C-A GEMM, C-A BFS, C-A SpMV); Specializers with SEJITS Implementations and Autotuning; ESP Code (Glue Code, Dense Code, Sparse Code, Graph Code); ESP (Ensembles of Specialized Processors) Architecture; ESP Core (ILP Engine, Dense Engine, Sparse Engine, Graph Engine); Local Stores + DMA; Hardware Cache Coherence; Hardware Generators using Chisel HDL; C++ Simulation and FPGA Emulation for Validation/Verification; Implementation Technologies (ASIC SoC, FPGA Computer); Deep HW/SW Design-Space Exploration]

ASPIRE Project

Initial $15.6M/5.5 year funding from DARPA PERFECT program
- Started 9/28/2012
- Located in Par Lab space + BWRC

Looking for industrial affiliates (see Krste!)

Open House today, 5th floor Soda Hall

Research funded by DARPA Award Number HR0011-12-2-0016. Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.
