Page 1: Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency

Elad Alon, Krste Asanovic (Director), Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson, Koushik Sen, John Wawrzynek

[email protected]

aspire.eecs.berkeley.edu

Page 2: UC Berkeley Future Application Drivers

Pervasive Speech

BIG DATA

Augmented Reality

Personalized Medicine

Environment

Social Networks

Robotics

Page 3: Compute Energy “Iron Law”

When power is constrained, need better energy efficiency for more performance

Where performance is constrained (real-time), want better energy efficiency to lower power

Improving energy efficiency is a critical goal for all future systems and workloads

Performance (Tasks/Second) = Power (Joules/Second) * Energy Efficiency (Tasks/Joule)
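The two constrained cases on this slide can be worked through numerically. The following is a minimal sketch of the iron law only; the function names and the example numbers (a 2 W budget, 10 and 20 tasks per Joule) are illustrative assumptions, not figures from the talk.

```python
def performance(power_watts: float, tasks_per_joule: float) -> float:
    """Tasks per second delivered at a given power and energy efficiency."""
    return power_watts * tasks_per_joule


def required_power(tasks_per_second: float, tasks_per_joule: float) -> float:
    """Watts needed to sustain a fixed task rate."""
    return tasks_per_second / tasks_per_joule


# Power-constrained case: a fixed 2 W budget (assumed value).
base = performance(2.0, 10.0)
better = performance(2.0, 20.0)      # doubling efficiency doubles performance

# Performance-constrained case: a fixed 20 tasks/s real-time requirement.
power_before = required_power(20.0, 10.0)
power_after = required_power(20.0, 20.0)   # doubling efficiency halves power

print(base, better, power_before, power_after)
```

In both cases the only lever that helps is the Tasks/Joule term, which is the point of the slide.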

Page 4: Good News: Moore’s Law Continues

“Cramming more components onto integrated circuits”, Gordon E. Moore, Electronics, 1965

More Transistors/Chip

Cheaper!

Page 5: Bad News: Dennard (Voltage) Scaling Over

Dennard Scaling

Post-Dennard Scaling

Moore, ISSCC Keynote, 2003

Page 6: 1st Impact of End of Scaling: End of Sequential Processor Era

Page 7: Parallelism: A one-time gain

Use more, slower cores for better energy efficiency. Either simpler cores, or run cores at lower Vdd/frequency

Even simpler general-purpose microarchitectures?
- Limited by smallest sensible core

Even lower Vdd/frequency?
- Limited by Vdd/Vt scaling, errors

Now what?
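Why more, slower cores save energy can be seen with the standard dynamic-power model, P = C * Vdd^2 * f per core. The sketch below is an idealized illustration, not a model from the talk: it assumes the workload parallelizes perfectly and that Vdd can be lowered in proportion to frequency, which is exactly the assumption the Vdd/Vt limit on this slide breaks. All values are normalized.

```python
def dynamic_power(cores: int, capacitance: float, vdd: float, freq: float) -> float:
    """Total switching power, using C * Vdd^2 * f for each core."""
    return cores * capacitance * vdd ** 2 * freq


def throughput(cores: int, freq: float) -> float:
    """Aggregate work rate, assuming the workload parallelizes perfectly."""
    return cores * freq


# One fast core at nominal voltage and frequency (normalized units).
one_fast = dynamic_power(cores=1, capacitance=1.0, vdd=1.0, freq=1.0)

# Two slow cores at half frequency. ASSUMPTION: Vdd can be halved as well.
two_slow = dynamic_power(cores=2, capacitance=1.0, vdd=0.5, freq=0.5)

# Both configurations deliver the same aggregate throughput.
assert throughput(1, 1.0) == throughput(2, 0.5)
print(one_fast, two_slow)
```

Under these assumptions the two-core design uses a quarter of the power for the same throughput. The gain is one-time because Vdd cannot keep falling once it approaches Vt.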

Page 8: 2nd Impact of End of Scaling: “Dark Silicon”

Cannot switch all transistors at full frequency! [Muller, ARM CTO, 2009]

No savior device technology on horizon. Future energy-efficiency innovations must be above transistor level.

Page 9: The End of General-Purpose Processors?

Most computing happens in specialized, heterogeneous processors
- Can be 100-1000X more efficient than general-purpose processor

Challenges:
- Hardware design costs
- Software development costs

NVIDIA Tegra2

Page 10: The Real Scaling Challenge: Communication

As transistors become smaller and cheaper, communication dominates performance and energy

All scales:
- Across chip
- Up and down memory hierarchy
- Chip-to-chip
- Board-to-board
- Rack-to-rack

Page 11: ASPIRE: From Better to Best

What is the best we can do?
- For a fixed target technology (e.g., 7nm)

Can we prove a bound?

Can we design implementation approaching bound?

Provably Optimal Implementations

Specialize and optimize communication and computation across whole stack from applications to hardware

Page 12: Communication-Avoiding Algorithms: Algorithm Cost Measures

1. Arithmetic (FLOPS)
2. Communication: moving data between
- levels of a memory hierarchy (sequential case)
- processors over a network (parallel case)

[Figure: sequential case, a CPU with Cache and DRAM; parallel case, several CPU + DRAM nodes]

Page 13: Modeling Runtime & Energy

Page 14: A few examples of speedups

Matrix multiplication
- Up to 12x on IBM BG/P for n=8K on 64K cores; 95% less communication

QR decomposition (used in least squares, data mining, …)
- Up to 8x on 8-core dual-socket Intel Clovertown, for 10M x 10
- Up to 6.7x on 16-proc. Pentium III cluster, for 100K x 200
- Up to 13x on Tesla C2050 / Fermi, for 110k x 100
- Up to 4x on Grid of 4 cities (Dongarra, Langou et al)
- “infinite speedup” for out-of-core on PowerPC laptop
  - LAPACK thrashed virtual memory, didn’t finish

Eigenvalues of band symmetric matrices
- Up to 17x on Intel Gainestown, 8 core, vs MKL 10.0 (up to 1.9x sequential)

Iterative sparse linear equations solvers (GMRES)
- Up to 4.3x on Intel Clovertown, 8 core

N-body (direct particle interactions with cutoff distance)
- Up to 10x on Cray XT-4 (Hopper), 24K particles on 6K procs.

Page 15: Modeling Energy: Dynamic

Page 16: Modeling Energy: Memory Retention

Page 17: Modeling Energy: Background Power

Page 18: Energy Lower Bounds

Page 19: Early Result: Perfect Strong Scaling in Time and Energy

Every time you add a processor, use its memory M too

Start with minimal number of procs: $PM = 3n^2$

Increase P by factor c, so total memory increases by factor c

Notation for timing model:
- $\gamma_t$, $\beta_t$, $\alpha_t$ = secs per flop, per word moved, per message of size $m$

$T(cP) = \frac{n^3}{cP}\left[\gamma_t + \frac{\beta_t}{M^{1/2}} + \frac{\alpha_t}{m M^{1/2}}\right] = \frac{T(P)}{c}$

Notation for energy model:
- $\gamma_e$, $\beta_e$, $\alpha_e$ = Joules for same operations
- $\delta_e$ = Joules per word of memory used per sec
- $\epsilon_e$ = Joules per sec for leakage, etc.

$E(cP) = cP\left\{\frac{n^3}{cP}\left[\gamma_e + \frac{\beta_e}{M^{1/2}} + \frac{\alpha_e}{m M^{1/2}}\right] + \delta_e M\,T(cP) + \epsilon_e\,T(cP)\right\} = E(P)$

Perfect scaling extends to n-body, Strassen, … [IPDPS, 2013]
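The scaling claim on this slide can be checked by evaluating the two formulas directly. The sketch below is only a numerical check of T and E as written above; every machine parameter in it is a made-up placeholder, and it does not model the range of c over which the formulas actually apply.

```python
import math

# HYPOTHETICAL machine parameters, chosen only to exercise the formulas.
TIME = dict(gamma=1e-9, beta=1e-8, alpha=1e-6)      # secs per flop / word / message
ENERGY = dict(gamma=1e-10, beta=1e-9, alpha=1e-7,   # Joules per flop / word / message
              delta=1e-12, epsilon=1e-3)            # J per word-sec, J per sec (leakage)


def cost_per_flop(c, M, m):
    """Bracketed term: gamma + beta / M^(1/2) + alpha / (m * M^(1/2))."""
    root_m = math.sqrt(M)
    return c["gamma"] + c["beta"] / root_m + c["alpha"] / (m * root_m)


def runtime(n, P, M, m):
    """T(P) = n^3 / P * [...], with M words of memory per processor."""
    return n ** 3 / P * cost_per_flop(TIME, M, m)


def energy(n, P, M, m):
    """E(P) = P * { n^3 / P * [...] + delta * M * T(P) + epsilon * T(P) }."""
    T = runtime(n, P, M, m)
    dynamic = n ** 3 / P * cost_per_flop(ENERGY, M, m)
    return P * (dynamic + ENERGY["delta"] * M * T + ENERGY["epsilon"] * T)


n, m = 4096, 1024
P0 = 64
M = 3 * n * n // P0          # minimal configuration: P * M = 3 * n^2

for c in (1, 2, 4, 8):
    P = c * P0               # M per processor is fixed, so total memory grows by c
    print(c, runtime(n, P, M, m), energy(n, P, M, m))
```

Each row should show the runtime dropping by the factor c while the energy stays constant, because P * T(P) does not depend on P in this model.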

Page 20: C-A Algorithms Not Just for HPC

In ASPIRE, apply to other key application areas: machine vision, databases, speech recognition, software-defined radio, …

Initial results on lower bounds of database join algorithms

Page 21: From C-A Algorithms to Provably Optimal Systems?

1) Prove lower bounds on communication for a computation

2) Develop algorithm that achieves lower bound on a system

3) Find that communication time/energy cost is >90% of resulting implementation

4) We know we’re within 10% of optimal!

Supporting technique: Optimizing software stack and compute engines to reduce compute costs and expose unavoidable communication costs
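The four-step argument above reduces to a short piece of arithmetic. If the communication cost equals a proven lower bound, no implementation can cost less than that, so the measured total divided by the communication cost bounds the distance from optimal. The sketch below is one reading of that argument; the function name and the cost values are illustrative assumptions.

```python
def optimality_bound(comm_cost: float, total_cost: float) -> float:
    """Upper bound on total_cost / optimal_cost.

    ASSUMPTION: comm_cost equals a proven lower bound on communication, so
    no implementation can have a total cost below comm_cost.
    """
    if not 0 < comm_cost <= total_cost:
        raise ValueError("need 0 < comm_cost <= total_cost")
    return total_cost / comm_cost


comm_fraction = 0.9
ratio = optimality_bound(comm_cost=comm_fraction * 100.0, total_cost=100.0)
print(f"at most {ratio:.3f}x the optimal cost")
print(f"avoidable cost is at most {1 - comm_fraction:.0%} of the total")
```

Read this way, "within 10%" means that at most 10% of the measured cost could be removed by any other implementation.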

Page 22: ESP: An Applications Processor Architecture for ASPIRE

Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore

Well-known how to customize hardware engines for specific task

ESP challenge is using specialized engines for general-purpose code

[Figure labels: Intel Ivy Bridge (22nm); Qualcomm Snapdragon MSM8960 (28nm); ESP]

Page 23: ESP: Ensembles of Specialized Processors

General-purpose hardware, flexible but inefficient

Fixed-function hardware, efficient but inflexible

Par Lab Insight: Patterns capture common operations across many applications, each with unique communication & computation structure

Build an ensemble of specialized engines, each individually optimized for particular pattern but collectively covering application needs

Bet: Will give us efficiency plus flexibility
- Any given core can have a different mix of these depending on workload

Page 24: Par Lab: Motifs common across apps

Applications: Audio Recognition, Object Recognition, Scene Analysis

Berkeley View “Dwarfs” or Motifs: Dense, Sparse, Graph, …

Page 25: Motif (nee “Dwarf”) Popularity (Red Hot / Blue Cool)

[Figure: motif popularity across Par Lab Apps and Computing Domains]

Page 26: Architecting Parallel Software

Application

Identify the Software Structure
• Pipe-and-Filter
• Agent-and-Repository
• Event-based
• Bulk Synchronous
• Map-Reduce
• Layered Systems
• Model-View-Controller
• Arbitrary Task Graphs
• Puppeteer

Identify the Key Computations
• Graph Algorithms
• Dynamic programming
• Dense/Sparse Linear Algebra
• Un/Structured Grids
• Graphical Models
• Finite State Machines
• Backtrack Branch-and-Bound
• N-Body Methods
• Circuits
• Spectral Methods
• Monte-Carlo

Page 27: Mapping Software to ESP: Specializers

Capture desired functionality at high-level using patterns in a productive high-level language

Use pattern-specific compilers (Specializers) with autotuners to produce efficient low-level code

ASP specializer infrastructure, open-source download

[Figure: stack diagram, top to bottom:
- Applications: Audio Recognition, Object Recognition, Scene Analysis
- Berkeley View “Dwarfs” or Motifs: Dense, Sparse, Graph, …
- Specializers with SEJITS Implementations and Autotuning
- ESP Code: Glue Code, Dense Code, Sparse Code, Graph Code
- ESP Core: ILP Engine, Dense Engine, Sparse Engine, Graph Engine]
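The specializer-plus-autotuner idea on this slide can be illustrated with a toy. The sketch below is not the ASP or SEJITS API: the registry, the `variant` decorator, and the `specialize` function are hypothetical names invented for this example, and where a real specializer would generate and compile low-level code, this one only picks the fastest of several pre-written Python variants of a pattern.

```python
import time
from typing import Callable, Dict, List

# HYPOTHETICAL registry: pattern name -> candidate implementations.
VARIANTS: Dict[str, List[Callable]] = {}


def variant(pattern: str):
    """Register a function as one candidate implementation of a pattern."""
    def register(fn):
        VARIANTS.setdefault(pattern, []).append(fn)
        return fn
    return register


@variant("dot")
def dot_loop(xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


@variant("dot")
def dot_builtin(xs, ys):
    return float(sum(x * y for x, y in zip(xs, ys)))


def autotune(pattern: str, *sample_args, repeats: int = 3) -> Callable:
    """Time every registered variant on sample inputs and return the fastest."""
    def best_time(fn):
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(*sample_args)
            times.append(time.perf_counter() - start)
        return min(times)
    return min(VARIANTS[pattern], key=best_time)


def specialize(pattern: str, *sample_args) -> Callable:
    """Stand-in for a specializer: choose tuned code for a high-level pattern."""
    return autotune(pattern, *sample_args)


xs = [float(i) for i in range(1000)]
ys = [2.0] * 1000
dot = specialize("dot", xs, ys)
print(dot.__name__, dot(xs, ys))
```

Which variant wins depends on the machine, which is why the choice is made by measurement rather than fixed in the source.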

Page 28: Replacing Fixed Accelerators with Programmable Fabric

Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore

Fabric challenge is retaining extreme energy efficiency while retaining programmability

[Figure labels: Intel Ivy Bridge (22nm); Qualcomm Snapdragon MSM8960 (28nm); Fabric]

Page 29: Strawman Fabric Architecture

[Figure: fabric array of tiles labeled in a repeating “A R M” pattern]

Will never have a C compiler

Only programmed using pattern-based DSLs

More dynamic, less static than earlier approaches

Dynamic dataflow-driven execution

Dynamic routing

Large memory support

Page 30: “Agile Hardware” Development

Current hardware design slow and arduous

But now have huge design space to explore

How to examine many design points efficiently?

Build parameterized generators, not point designs!

Adopt and adapt best practices from Agile Software
- Complete LVS-DRC clean physical design of current version every ~ two weeks (“tapein”)
- Incremental feature addition
- Test & Verification first step
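The difference between a point design and a parameterized generator can be shown in a few lines. The sketch below is a toy illustration, not Chisel and not code from the project: `adder_verilog` is a hypothetical Python function that emits Verilog text for an adder of any width, so a design-space sweep becomes a loop rather than a set of hand-written modules.

```python
def adder_verilog(width: int) -> str:
    """Generate Verilog for a `width`-bit adder with one extra carry bit."""
    if width < 1:
        raise ValueError("width must be positive")
    return (
        f"module adder_{width}(\n"
        f"  input  [{width - 1}:0] a,\n"
        f"  input  [{width - 1}:0] b,\n"
        f"  output [{width}:0] sum\n"
        f");\n"
        f"  assign sum = a + b;\n"
        f"endmodule\n"
    )


# Sweep the design space instead of hand-writing each point design.
designs = {width: adder_verilog(width) for width in (8, 16, 32)}
print(designs[8])
```

A real generator would expose many more parameters (pipeline depth, cache sizes, number of cores), but the principle is the same: one program, many design points.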

Page 31: Chisel: Constructing Hardware In a Scala Embedded Language

Embed a hardware-description language in Scala, using Scala’s extension facilities

A hardware module is just a data structure in Scala

Different output routines can generate different types of output (C, FPGA-Verilog, ASIC-Verilog) from same hardware representation

Full power of Scala for writing hardware generators
- Object-Oriented: Factory objects, traits, overloading etc
- Functional: Higher-order funcs, anonymous funcs, currying
- Compiles to JVM: Good performance, Java interoperability
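The idea that a hardware module is just a data structure, with several output routines walking it, can be sketched outside Scala. The example below is not Chisel: it is a small Python analogy with hypothetical node types (`Input`, `Add`, `Mux`), one routine that emits a Verilog expression, and one that evaluates the same structure in software, standing in for the simulator backend.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Input:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class Mux:
    select: "Node"
    if_true: "Node"
    if_false: "Node"


Node = Union[Input, Add, Mux]


def emit_verilog(node: Node) -> str:
    """Output routine 1: render the structure as a Verilog expression."""
    if isinstance(node, Input):
        return node.name
    if isinstance(node, Add):
        return f"({emit_verilog(node.left)} + {emit_verilog(node.right)})"
    return (f"({emit_verilog(node.select)} ? {emit_verilog(node.if_true)}"
            f" : {emit_verilog(node.if_false)})")


def simulate(node: Node, inputs: Dict[str, int]) -> int:
    """Output routine 2: evaluate the same structure in software."""
    if isinstance(node, Input):
        return inputs[node.name]
    if isinstance(node, Add):
        return simulate(node.left, inputs) + simulate(node.right, inputs)
    chosen = node.if_true if simulate(node.select, inputs) else node.if_false
    return simulate(chosen, inputs)


circuit = Mux(Input("sel"), Add(Input("a"), Input("b")), Input("a"))
print(emit_verilog(circuit))
print(simulate(circuit, {"sel": 1, "a": 3, "b": 4}))
```

Adding another backend means adding another function over the same nodes; the circuit description itself does not change.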

Page 32: Chisel Design Flow

[Figure: flow diagram. Chisel Program -> Scala/JVM, then three output paths:
- C++ code -> C++ Compiler -> Software Simulator
- FPGA Verilog -> FPGA Tools -> FPGA Emulation
- ASIC Verilog -> ASIC Tools -> GDS Layout]

Page 33: Chisel is much more than an HDL

The base Chisel system allows you to use the full power of Scala to describe the RTL of a design, then generate Verilog or C++ output from the RTL

But Chisel can be extended above with domain-specific languages (e.g., signal processing) for fabric

Importantly, Chisel can also be extended below with new backends or to add new tools or features (e.g., quantum computing circuits)

Only ~6,000 lines of code in current version including libraries!

BSD-licensed open source at: chisel.eecs.berkeley.edu

Page 34: Many processor tapeouts in few years with small group (45nm, 28nm)

[Figure: die plots. Labels: Clock test site; SRAM test site; DCDC test site; Processor Site; Test Sites; CORE 0 to CORE 3 with VC0 to VC3; 512KB L2; VFIXED]

Page 35: Resilient Circuits & Modeling

Future scaled technologies have high variability but want to run with lowest-possible margins to save energy

Significant increase in soft errors, need resilient systems

Technology modeling to determine tradeoff between MTBF and energy per task for logic, SRAM, & interconnect.

Techniques to reduce operating voltage can be worse for energy due to rapid rise in errors
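The last point, that lowering voltage can cost energy once errors rise, can be illustrated with a toy model. Everything in the sketch below is a hypothetical assumption made for illustration, not the project's technology model: energy per attempt scales as Vdd^2, the error probability grows exponentially as Vdd approaches an assumed 0.40 V floor, and a failed task is simply retried until it succeeds.

```python
import math


def error_probability(vdd: float, v_floor: float = 0.40, scale: float = 0.03) -> float:
    """HYPOTHETICAL error model: failures rise exponentially near v_floor."""
    return min(0.99, math.exp(-(vdd - v_floor) / scale))


def energy_per_task(vdd: float) -> float:
    """Energy of one attempt (proportional to Vdd^2) times expected attempts."""
    attempts = 1.0 / (1.0 - error_probability(vdd))   # retry until success
    return vdd ** 2 * attempts


voltages = [0.40 + 0.01 * i for i in range(61)]        # 0.40 V to 1.00 V
best = min(voltages, key=energy_per_task)
for v in (1.00, 0.60, best, 0.42, 0.40):
    print(f"Vdd={v:.2f}  energy/task={energy_per_task(v):.3f}")
```

In this toy model the energy per task falls as Vdd is lowered, reaches a minimum somewhat above the floor, and then climbs steeply as retries dominate, which is the tradeoff the slide describes.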

Page 36: Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency

[Figure: overview of the ASPIRE stack, spanning Software and Hardware:
- Applications: Audio Recognition, Object Recognition, Scene Analysis
- Computational and Structural Patterns: Dense, Sparse, Graph, …; Pipe&Filter, Map-Reduce, …
- Communication-Avoiding Algorithms: C-A GEMM, C-A SpMV, C-A BFS
- Specializers with SEJITS Implementations and Autotuning
- ESP Code: Glue Code, Dense Code, Sparse Code, Graph Code
- ESP (Ensembles of Specialized Processors) Architecture: ESP Core with ILP Engine, Dense Engine, Sparse Engine, Graph Engine; Local Stores + DMA; Hardware Cache Coherence
- Hardware Generators using Chisel HDL
- C++ Simulation, FPGA Emulation, Validation/Verification
- Implementation Technologies: ASIC SoC, FPGA Computer
- Deep HW/SW Design-Space Exploration]

Page 37: ASPIRE Project

Initial $15.6M/5.5 year funding from DARPA PERFECT program
- Started 9/28/2012
- Located in Par Lab space + BWRC

Looking for industrial affiliates (see Krste!)

Open House today, 5th floor Soda Hall

Research funded by DARPA Award Number HR0011-12-2-0016. Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.