Transcript of: The Parallel Computing Landscape: A View from Berkeley 2.0 (Dagstuhl, September 3, 2007)

Page 1: The Parallel Computing Landscape: A View from Berkeley 2.0

Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Kathy Yelick

Dagstuhl, September 3, 2007

Page 2: A Parallel Revolution, Ready or Not

Embedded: from per-product ASICs to programmable platforms. The multicore chip is the most competitive path: amortize design costs, reduce design risk, flexible platforms.

PC, Server: Power Wall + Memory Wall = Brick Wall. The end of the way we've scaled uniprocessors for the last 40 years.

The new Moore's Law is 2X processors ("cores") per chip every technology generation, but at the same clock rate. "This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs ...; instead, this ... is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional solutions." (The Parallel Computing Landscape: A Berkeley View)

Sea change for the HW & SW industries, since it changes the model of programming and debugging.

Page 3: P.S. Parallel Revolution May Fail

[Chart: millions of PCs sold per year, 1985-2015, on a scale of 0-250M.]

John Hennessy, President, Stanford University, 1/07: "... when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. ... I would be panicked if I were in industry." ("A Conversation with Hennessy & Patterson," ACM Queue Magazine, 4:10, 1/07.)

100% failure rate of parallel computer companies: Convex, Encore, MasPar, nCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Transputer, Thinking Machines, ...

What if IT goes from a growth industry to a replacement industry? If SW can't effectively use 8, 16, 32, ... cores per chip, then SW is no faster on a new computer, and customers only buy when the old computer wears out.

Page 4: Need a Fresh Approach to Parallelism

Berkeley researchers from many backgrounds have been meeting since Feb. 2005 to discuss parallelism: Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, Kathy Yelick, ...

Backgrounds span circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis.

Tried to learn from successes in high performance computing (LBNL) and parallel embedded computing (BWRC).

Led to the "Berkeley View" tech report and the new Parallel Computing Laboratory ("Par Lab").

Goal: productive, efficient, correct programming of 100+ cores, scaling as the number of cores doubles every 2 years (!)

Page 5: Why Target 100+ Cores?

[Chart: cores per chip (log scale, 1 to 512) vs. year (2003-2015), comparing a multicore curve (2X every 2 years) with a manycore curve (8-16X multicore); a region at the low end is labeled "Automatic Parallelization, Thread Level Speculation".]

A 5-year research program must aim 8+ years out. Multicore: 2X cores / 2 yrs ≈ 64 cores in 8 years. Manycore: 8X to 16X multicore.

Page 6: Re-inventing Client/Server

Laptop/Handheld as the future client, Datacenter as the future server. Focus on a single socket (Laptop/Handheld or Blade).

"The Datacenter is the Computer": building-sized computers (Microsoft, ...).

"The Laptop/Handheld is the Computer": '08: Dell sells more laptops than desktops? '06 profit? 1B cell phones/yr, increasing in function. Otellini demoed a "Universal Communicator": a combination cell phone, PC, and video device. Apple iPhone.

Page 7: Try Application-Driven Research?

Conventional wisdom in CS research: users don't know what they want. Computer scientists solve individual parallel problems with a clever language feature (e.g., futures), a new compiler pass, or a novel hardware widget (e.g., SIMD). Approach: push (foist?) CS nuggets/solutions on users. Problem: stupid users don't learn/use the proper solution.

Another approach: work with domain experts developing compelling apps. Provide the HW/SW infrastructure necessary to build, compose, and understand parallel software written in multiple languages. Research is guided by commonly recurring patterns actually observed while developing compelling apps.

Page 8: 4 Themes of View 2.0 / Par Lab

1. Applications: compelling apps driving the research agenda top-down.

2. Identify computational bottlenecks: breaking through disciplinary boundaries.

3. Developing parallel software with productivity, efficiency, and correctness: 2 layers + C&C language, autotuning.

4. OS and architecture: composable primitives, not pre-packaged solutions; deconstruction, fast barrier synchronization, partitions.

Page 9: Par Lab Research Overview

Easy to write correct programs that run efficiently on manycore.

[Diagram: the Par Lab software stack. Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser, Dwarfs. Productivity Layer: Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks. Efficiency Layer: Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Efficiency Languages, Efficiency Language Compilers. Correctness (spanning both layers): Static Verification, Type Systems, Directed Testing, Dynamic Checking, Debugging with Replay. OS: Legacy OS, OS Libraries & Services, Hypervisor. Arch: Multicore/GPGPU, RAMP Manycore.]

Page 10: Theme 1. Applications. What are the problems?

"Who needs 100 cores to run M/S Word?" We need compelling apps that use 100s of cores.

How did we pick applications?

1. An enthusiastic expert application partner, a leader in the field, who promises to help design, use, and evaluate our technology.

2. Compelling in terms of likely market or social impact, with short-term feasibility and longer-term potential.

3. Requires significant speed-up, or a smaller, more efficient platform, to work as intended.

4. As a whole, the applications cover the most important platforms (handheld, laptop, games) and markets (consumer, business, health).

Page 11: Coronary Artery Disease

Modeling to help patient compliance? 400k deaths/year, 16M with symptoms, 72M with high blood pressure.

Massively parallel, real-time variations: CFD + FE; solid (non-linear), fluid (Newtonian), pulsatile flow; varying blood pressure, activity, habitus, cholesterol.

[Images: artery before and after.]

Page 12: Content-Based Image Retrieval

[Diagram: query by example against an image database; a similarity metric produces candidate results from 1000's of images; relevance feedback from the user refines the query until a final result is returned.]

Built around key characteristics of personal databases: a very large number of pictures (>5K), non-labeled images, many pictures of few people, complex pictures including people, events, places, and objects.
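For illustration, here is a minimal sketch of the query-by-example loop above (Python with made-up names such as retrieve and relevance_feedback; not Par Lab code). Feature vectors stand in for images, cosine similarity plays the role of the similarity metric, and a simple Rocchio-style update plays the role of relevance feedback; the ranking over the whole database is the data-parallel part.

# Illustrative sketch, not Par Lab code. Images are stood in for by feature
# vectors; similarity is cosine similarity; relevance feedback re-weights the
# query toward images the user marks as relevant.
import numpy as np

def cosine_similarity(query, database):
    """Similarity of one query vector against every database vector."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    return db @ q

def retrieve(query, database, k=10):
    """Return indices of the k most similar images (candidate results)."""
    scores = cosine_similarity(query, database)
    return np.argsort(scores)[::-1][:k]

def relevance_feedback(query, database, relevant_ids, alpha=0.7, beta=0.3):
    """Move the query toward the centroid of user-marked relevant images."""
    centroid = database[relevant_ids].mean(axis=0)
    return alpha * query + beta * centroid

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    database = rng.normal(size=(5000, 128))   # >5K non-labeled images
    query = rng.normal(size=128)              # query by example
    candidates = retrieve(query, database)
    # Pretend the user marked the top 3 candidates as relevant.
    query = relevance_feedback(query, database, candidates[:3])
    final_result = retrieve(query, database)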

Page 13: Compelling Laptop/Handheld Apps

Meeting Diarist: laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting.

Teleconference speaker identifier, speech helper: laptops/handhelds used for a teleconference identify who is speaking and provide a "closed caption" hint of what is being said.

Page 14: Compelling Laptop/Handheld Apps

Musicians have an insatiable appetite for computation: more channels, more instruments, more processing, more interaction! Latency must be low (5 ms). Must be reliable (no clicks).

Music Enhancer: enhanced sound delivery systems for home sound systems using large microphone and speaker arrays; the laptop/handheld recreates 3D sound over ear buds.

Hearing Augmenter: the laptop/handheld as an accelerator for a hearing aid.

Novel Instrument User Interface: new composition and performance systems beyond keyboards; an input device for the laptop/handheld.

The Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: a 10-inch-diameter icosahedron incorporating 120 tweeters.

Page 15: Parallel Browser

Goal: desktop-quality browsing on handhelds, enabled by 4G networks and better output devices.

Bottlenecks to parallelize: parsing, rendering, scripting.

SkipJax: a parallel replacement for JavaScript/AJAX, based on Brown's FlapJax.

Page 16: Theme 2. Use computational bottlenecks instead of conventional benchmarks?

How do we invent the parallel systems of the future while tied to the old code, programming models, and CPUs of the past?

Look for common patterns of communication and computation across:

1. Embedded computing (42 EEMBC benchmarks)
2. Desktop/server computing (28 SPEC2006)
3. Database / text mining software
4. Games/graphics/vision
5. Machine learning
6. High performance computing (the original "7 Dwarfs")

Result: 13 "Dwarfs".

Page 17: How do compelling apps relate to the 13 dwarfs?

[Table: dwarf popularity heat map (red = hot, blue = cool) across Embedded, SPEC, DB, Games, ML, HPC, and the Par Lab apps (Health, Image, Speech, Music, Browser). Rows are the 13 dwarfs:

1. Finite State Machines
2. Combinational Logic
3. Graph Traversal
4. Structured Grid
5. Dense Matrix
6. Sparse Matrix
7. Spectral (FFT)
8. Dynamic Programming
9. N-Body
10. MapReduce
11. Backtrack / Branch & Bound
12. Graphical Models
13. Unstructured Grid]

Page 18: What about I/O in Dwarfs?

High-speed serial lines mean many parallel I/O channels per chip.

Storage will be much faster: "Flash is disk, disk is tape," at least for the laptop/handheld. iPod, camera, cellphone, and laptop dollars rapidly improve flash memory. New technologies are even more promising: Phase-Change RAM is denser, writes faster, and wears out 100X more slowly. Going from disk to flash means a 10,000x improvement in latency (10 milliseconds to 1 microsecond).

The network will be much faster: 10 Gbit/s copper is standard; 100 Gbit/s is coming. Parallel serial channels multiply bandwidth. We already need multiple fat cores to handle TCP/IP at 10 Gbit/s. Super 3G -> 4G: 100 Mb/s -> 1 Gb/s WAN wireless to the handset.

Page 19: 4 Valuable Roles of Dwarfs

1. "Anti-benchmarks": dwarfs are not tied to code or language artifacts, so they encourage innovation in algorithms, languages, data structures, and/or hardware.

2. A universal, understandable vocabulary: to talk across disciplinary boundaries.

3. Bootstrapping: parallelize parallel research; allow analysis of HW & SW designs without waiting years for full apps.

4. Targets for libraries (see later).

Page 20: Themes 1 and 2 Summary

Application-driven research vs. CS solution-driven research.

Drill down on (initially) 5 app areas to guide the research agenda.

Dwarfs represent a broader set of apps to guide the research agenda.

Dwarfs help break through traditional interfaces: benchmarking, multidisciplinary conversations, parallelizing parallel research, and a target for libraries.

Page 21: Theme 3: Developing Parallel Software

[The Par Lab overview diagram from Page 9, introducing the Productivity Layer, Efficiency Layer, and Correctness portions of the software stack.]

Page 22: Dwarfs become Libraries or Frameworks

Observation: dwarfs are of 2 types.

Libraries: dense matrices, sparse matrices, spectral, combinational, finite state machines.

Patterns/Frameworks: MapReduce, graph traversal, graphical models, dynamic programming, backtracking/B&B, N-Body, (un)structured grid.

Algorithms in the dwarfs can be implemented either as compact parallel computations within a traditional library, or as compute/communicate patterns implemented as a framework.

Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce framework, mapping 1D FFTs and then transposing (generalized reduce).
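As a concrete illustration of that last point, here is a minimal sketch (Python; par_map and fft2_via_framework are made-up names, and numpy's FFT stands in for a tuned 1D kernel) of a 2D FFT expressed as "map 1D FFTs over rows, transpose, map again" — the map/transpose structure the slide describes, not the Par Lab framework itself.

# Illustrative sketch: a 2D FFT as an instance of a map + transpose pattern.
import numpy as np

def par_map(fn, rows):
    """Framework 'map': apply fn to each independent row (parallel in spirit)."""
    return np.stack([fn(r) for r in rows])

def fft2_via_framework(a):
    step1 = par_map(np.fft.fft, a)      # 1D FFTs along the rows
    exchanged = step1.T                 # transpose = the all-to-all data exchange
    step2 = par_map(np.fft.fft, exchanged)
    return step2.T

if __name__ == "__main__":
    a = np.random.rand(64, 64)
    assert np.allclose(fft2_via_framework(a), np.fft.fft2(a))

Because each row FFT is independent, the framework is free to schedule the maps across cores; the transpose is the only communication step.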

Page 23: Two Layers for Two Types of Programmer

Efficiency Layer (10% of today's programmers): expert programmers build frameworks & libraries, hypervisors, ... "Bare metal" efficiency is possible at the Efficiency Layer.

Productivity Layer (90% of today's programmers): domain experts / naive programmers productively build parallel apps using frameworks & libraries. Frameworks & libraries are composed to form app frameworks. Effective composition techniques allow the efficiency programmers to be highly leveraged. Create a language for Composition and Coordination (C&C).

Page 24: Productivity Layer

[The Par Lab overview diagram from Page 9, highlighting the Productivity Layer: Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks.]

Page 25: C&C Language Requirements

Applications specify the C&C language requirements:

Constructs for creating application frameworks. Primitive parallelism constructs: data parallelism, divide-and-conquer parallelism, event-driven execution.

Constructs for composing frameworks: frameworks require independence, and independence is proven at instantiation with a variety of techniques.

The language needs low runtime overhead and the ability to measure and control performance.

Page 26: Coordination & Composition

Composition is key to software reuse. Solutions exist for libraries with hidden parallelism: partitions in the OS help runtime composition. Instantiating parallel frameworks is harder: the framework specifies the required independence; e.g., operations in map or divide-and-conquer must not interfere.

Independence is guaranteed through types: type system extensions (side effects, ownership). Example annotation: Image READ Array[double].

Data decomposition may be implicit or explicit:
Partition: Array[T] -> List[Array[T]] (well understood)
Partition: Graph[T] -> List[Graph[T]]

Efficiency-layer code carries these specifications at its interfaces, where they are verified, tested, or asserted. Independence is proven by checking side effects and overlap at instantiation.
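The sketch below illustrates the Partition: Array[T] -> List[Array[T]] idea in Python (names like partition and independent_map are invented for this example; the real check would be done by the type system or static analysis at instantiation, not a runtime assert): the decomposition produces disjoint pieces, so mapping over them cannot interfere.

# Illustrative sketch of a checked data decomposition, not Par Lab code.
import numpy as np

def partition(array, n_pieces):
    """Split an array into contiguous pieces (views, no copies)."""
    bounds = np.linspace(0, len(array), n_pieces + 1, dtype=int)
    pieces = [array[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]
    # Instantiation-time check: the pieces tile the index space exactly
    # (contiguous and therefore disjoint by construction).
    assert sum(len(p) for p in pieces) == len(array)
    return pieces

def independent_map(fn, pieces):
    """Safe to run in parallel: each piece is a disjoint region of memory."""
    for p in pieces:
        fn(p)   # in-place update of a disjoint slice; no interference

if __name__ == "__main__":
    data = np.arange(16, dtype=float)
    independent_map(lambda p: np.multiply(p, 2.0, out=p), partition(data, 4))
    assert np.allclose(data, np.arange(16) * 2.0)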

Page 27: Coordination & Composition (continued)

Coordination is used to create parallelism, supporting the parallelism patterns of the applications:
Data parallelism (degree = data size, not core count); may be nested: forall images, forall blocks in an image.
Divide-and-conquer (parallelism from recursion).
Event-driven: nondeterminism at the algorithm level (e.g., the Branch & Bound dwarf).
Examples on the slide: DCT, etc.; Delaunay; sparse LU.

Serial semantics with limited nondeterminism:
Data parallelism comes from array/aggregate operations and loops without side effects (serial).
Divide-and-conquer parallelism comes from recursive function invocations with non-overlapping side effects (serial).
Event-driven programs are written as guarded atomic commands, which may be implemented as transactions.

Discovered parallelism is mapped to the available resources; techniques include static scheduling, autotuning, dynamic scheduling, and possibly hints.
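A minimal sketch of the "serial semantics, implicit parallelism" idea (illustrative Python; scale_in_place is a made-up example, and a thread pool stands in for the Par Lab runtime): the two recursive calls touch non-overlapping halves of the data, so a runtime may run them in parallel without changing the serial result.

# Illustrative sketch: divide-and-conquer with non-overlapping side effects.
from concurrent.futures import ThreadPoolExecutor

def scale_in_place(data, lo, hi, factor, pool=None, cutoff=1 << 12):
    """Multiply data[lo:hi] by factor; the halves are disjoint, so recursion is safe."""
    if hi - lo <= cutoff:
        for i in range(lo, hi):               # serial base case
            data[i] *= factor
        return
    mid = (lo + hi) // 2
    if pool is None:                          # pure serial semantics
        scale_in_place(data, lo, mid, factor)
        scale_in_place(data, mid, hi, factor)
    else:                                     # same result, left half as a parallel task
        left = pool.submit(scale_in_place, data, lo, mid, factor)
        scale_in_place(data, mid, hi, factor, pool)
        left.result()

if __name__ == "__main__":
    data = list(range(100_000))
    with ThreadPoolExecutor() as pool:
        scale_in_place(data, 0, len(data), 3, pool)
    assert data[10] == 30                      # identical to the serial run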

Page 28: Efficiency Layer

[The Par Lab overview diagram from Page 9, highlighting the Efficiency Layer: Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Efficiency Languages, Efficiency Language Compilers.]

Page 29: Efficiency Layer: Ground OS/HW Requirements

Create parallelism and manage resources: affinity, bandwidth. Queries: #cores, #memory spaces, etc. Latency hiding (prefetch, DMA, ...) to enable bandwidth saturation, i.e., SpMV should run at peak bandwidth. Programming in legacy languages with threads.

For portability across local RAM vs. cache: a global address space with a memory hierarchy. Build by extending UPC.

[Diagram: partitioned global address space — per-thread private on-chip memory (l:, m:), shared partitioned on-chip memory (x:, y: variables), and shared off-chip DRAM.]

Page 30: Efficiency Layer: Software RAMP

The efficiency layer is an abstract machine model plus selective virtualization. Software RAMP provides add-ons:

Schedulers: add a runtime with dynamic scheduling for a dynamic task tree (divide-and-conquer with task stealing; a toy sketch follows below), and a semi-static schedule for a given general task graph with weights and structure.

Memory movement / sharing primitives: support to manage local RAM.

Synchronization primitives.

Use of verification and testing at this level.

Page 31: Synthesis via Sketching and Autotuning

Autotuning: efficient by search. Examples: Spectral (FFTW, SPIRAL), Dense (PHiPAC, ATLAS), Sparse (OSKI), Structured grids (stencils). Autotuners can select from algorithm and data-structure changes not producible by any compiler transform. Extensive tuning knobs at the efficiency level; performance feedback from the hardware and OS.

Sketching: correct by construction. The spec is a simple implementation (a 3-loop 3D stencil); the sketch is an optimized skeleton (5 loops, missing some indices/bounds); synthesis completes it into optimized code (tiled, prefetched, time-skewed).

Example optimization: 1.5x more entries (explicit zeros) for a 1.5x speedup. Compilers won't do this!
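A minimal sketch of "efficient by search" (Python; blocked_matmul and autotune are made-up names, and the single knob here is only a stand-in for the much richer spaces searched by PHiPAC, ATLAS, FFTW, SPIRAL, or OSKI): enumerate values of a tuning knob, time each variant on the actual machine, and keep the fastest.

# Illustrative autotuning-by-search sketch: pick the tile size of a blocked
# matrix multiply by timing candidates on the target machine.
import time
import numpy as np

def blocked_matmul(a, b, tile):
    n = a.shape[0]
    c = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                c[i:i+tile, j:j+tile] += a[i:i+tile, k:k+tile] @ b[k:k+tile, j:j+tile]
    return c

def autotune(n=256, candidate_tiles=(8, 16, 32, 64, 128)):
    a, b = np.random.rand(n, n), np.random.rand(n, n)
    best_tile, best_time = None, float("inf")
    for tile in candidate_tiles:          # exhaustive search over one knob
        t0 = time.perf_counter()
        blocked_matmul(a, b, tile)
        dt = time.perf_counter() - t0
        if dt < best_time:
            best_tile, best_time = tile, dt
    return best_tile

if __name__ == "__main__":
    print("best tile size on this machine:", autotune())

The winning tile size depends on the cache sizes of the machine the search runs on, which is exactly why the selection is made by measurement rather than by a compiler model.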

Page 32: Autotuning Sparse Matrices

Sparse matrix-vector multiply (SpMV) tuning steps: register block to compress the matrix data structure (choose r1 x r2); cache block so the corresponding vectors fit in local memory (c1 x c2); parallelize by dividing the matrix evenly (p1 x p2); prefetch for some distance d; machine-specific code for SSE, etc.

[Chart: GFlop/s (0-3) for naive serial, naive parallel, and tuned parallel SpMV on the matrices Protein, FEM1, FEM2, FEM3, Epidem, Circuit, and LP, on an Opteron (2x dual-core) and a Clovertown (2x quad-core).]

Autotuning is more important than parallelism on these machines!
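The sketch below illustrates just the register-blocking step of that tuning sequence (Python; pick_register_block is a made-up name, and scipy's BSR format stands in for the hand-tuned kernels an autotuner such as OSKI would generate): storing the matrix in r1 x r2 blocks pads it with explicit zeros but lets the inner kernel work on small dense blocks, and the best (r1, r2) is chosen by timing on the target machine.

# Illustrative sketch of choosing an SpMV register-block size by search.
import time
import numpy as np
import scipy.sparse as sp

def pick_register_block(csr, candidates=((1, 1), (2, 2), (4, 2), (4, 4), (8, 8))):
    x = np.random.rand(csr.shape[1])
    best, best_time = None, float("inf")
    for r1, r2 in candidates:
        if csr.shape[0] % r1 or csr.shape[1] % r2:
            continue                                  # blocks must tile the matrix
        blocked = csr.tobsr(blocksize=(r1, r2))       # adds explicit zeros (fill)
        t0 = time.perf_counter()
        for _ in range(20):
            blocked @ x                               # SpMV with the blocked structure
        dt = time.perf_counter() - t0
        if dt < best_time:
            best, best_time = (r1, r2), dt
    return best

if __name__ == "__main__":
    a = sp.random(4096, 4096, density=0.002, format="csr")
    print("best register block:", pick_register_block(a))

The best block depends on both the nonzero pattern of the matrix and the machine, which is why it is a search rather than a fixed choice.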

Page 33: Theme 3 Summary

Two layers for two types of programmers.

Productivity layer: safety. The Coordination and Composition language has serial semantics with controlled nondeterminism. Frameworks and libraries have enforced contracts and leverage expert programmers by reusing library/framework code.

Efficiency layer: expressiveness and portability. The base is an abstract machine model; Software RAMP provides selective virtualization (add-ons); sketching and autotuning support library/framework development.

Page 34: Theme 4: OS and Architecture

[The Par Lab overview diagram from Page 9, highlighting the OS and Architecture layers: Legacy OS, OS Libraries & Services, Hypervisor; Multicore/GPGPU, RAMP Manycore.]

Page 35: Why change OS and Architecture?

Old world: the OS is the only parallel code; user jobs are sequential. New world: many user applications are highly parallel.

The efficiency layer wants to control its own scheduling of threads onto cores (e.g., select non-virtualized or virtualized). Legacy OS + virtual machine layers make this intractable: VMM scheduler + guest OS scheduler + application scheduler?

Parallel applications need efficient inter-thread communication and synchronization operations. Legacy architectures have poor inter-thread communication (cache-coherent shared memory) and unwieldy synchronization (spin locks).

Managing data movement is important for the efficiency layer, which must be able to saturate memory bandwidth with independent requests. A legacy OS plus caches/prefetching makes bad decisions about where data should sit relative to the accessing thread within an application.

Page 36: Par Lab OS and Architecture Theme

Deconstruct the OS and architecture into software-composable primitives, not pre-packaged solutions.

A very thin hypervisor layer provides hierarchical partitions: protection only, no scheduling/multiplexing in the hypervisor; traditional OS functionality lives in user libraries and service partitions. The efficiency layer takes over thread scheduling within its partition.

Low-overhead hardware synchronization primitives: barriers + signals, atomic memory operations (fetch+op), user-level active messages; memory read/write set tracking to support hybrid hardware/software coherence, transactional memory (TM), race detection, and checkpointing. The efficiency layer uses and exports customized synchronization facilities (a software analogue is sketched below).

Configurable memory hierarchy: private and/or shared caches, local RAM, DMA engine. The efficiency layer controls whether and how to cache data, and controls data movement around the system.
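The sketch below is only a software analogue (Python threading; FetchAdd and worker are invented names) of the two simplest primitives named above — a barrier and an atomic fetch-and-add — to show the usage pattern an efficiency-layer runtime would export; in the Par Lab proposal these would be low-overhead hardware operations, not threading objects.

# Software analogue of barrier + fetch-and-add usage (illustrative only).
from threading import Barrier, Lock, Thread

class FetchAdd:
    """fetch+op: atomically return the old value and add a delta."""
    def __init__(self, value=0):
        self._value, self._lock = value, Lock()
    def fetch_add(self, delta):
        with self._lock:
            old, self._value = self._value, self._value + delta
            return old

def worker(tid, barrier, counter, results):
    results[tid] = counter.fetch_add(1)   # grab a unique ticket atomically
    barrier.wait()                        # every thread reaches this point...
    results[tid] += 100                   # ...before any thread starts the next phase

if __name__ == "__main__":
    n = 8
    barrier, counter, results = Barrier(n), FetchAdd(), [None] * n
    threads = [Thread(target=worker, args=(i, barrier, counter, results)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    assert sorted(r - 100 for r in results) == list(range(n))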

Page 37: Partition-Based Deconstructed OS

The hypervisor is a very thin software layer over the hardware providing only a resource protection mechanism, with no resource multiplexing. Traditional OS functionality is provided as libraries (cf. Exokernel) and service partitions (e.g., device drivers).

The hypervisor allows a partition to subdivide into hierarchical child partitions, where the parent retains control over its children. Motivating examples: protecting the browser from third-party plug-ins; predictable real time for audio processing units in the music application; supporting legacy parallel libraries/frameworks with their runtime in its own partition; a trusted OS or VM (e.g., Singularity or a Java VM) can avoid mandatory hardware protection checking within its own partition.

[Diagram: a root partition containing a device-driver partition, a Singularity partition, a web browser partition with plug-in child partitions, a music framework partition with UI and audio-unit child partitions, and a legacy OS partition running processes.]
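As a data-structure illustration of the hierarchy in that diagram (Python; Partition, subdivide, and revoke are invented names, not the Par Lab hypervisor interface): a parent carves child partitions out of its own resource allocation and retains the right to reclaim them, so a child can never hold more than its parent granted.

# Toy model of hierarchical partitions (illustrative only).
class Partition:
    def __init__(self, name, cores, parent=None):
        self.name, self.cores, self.parent = name, cores, parent
        self.children = []

    def subdivide(self, name, cores):
        """Carve a child partition out of this partition's core allocation."""
        allocated = sum(c.cores for c in self.children)
        if allocated + cores > self.cores:
            raise ValueError("child allocation exceeds parent's resources")
        child = Partition(name, cores, parent=self)
        self.children.append(child)
        return child

    def revoke(self, child):
        """Parent retains control: it can reclaim a child's resources."""
        self.children.remove(child)

if __name__ == "__main__":
    root = Partition("root", cores=16)
    browser = root.subdivide("web browser", cores=8)
    plugin = browser.subdivide("third-party plug-in", cores=2)  # isolated from the browser
    audio = root.subdivide("music framework", cores=4)          # predictable real time
    browser.revoke(plugin)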

Page 38: Hardware Partitioning Support

A partition can contain its own cores, L1 and L2 cache/RAM, DRAM, and interconnect bandwidth allocation. Inter-partition communication goes through protected shared memory.

"Un-virtual" partitions give the efficiency layer complete "bare metal" control of the hardware. Per-partition performance counters give feedback to runtimes and autotuners.

Benefits: efficiency (fewer layers, custom OS), new exposed HW primitives, performance isolation/predictability, robustness to faults/errors, and security.

[Diagram: tiles of CPU + L1 + L2 bank + DRAM connected by an L2 interconnect and a DRAM & I/O interconnect; two partitions are outlined, communicating through a protected shared memory region.]

Page 39: Example: 16x16 Array of Cores — Customized Parallel Processors

Efficiency-layer software uses the hardware synchronization and communication primitives to glue cores together into a customized application "processor". A partition is always swapped in en masse; software protocols can support dynamic partition resizing.

If "processors are the new transistors", then "partitions are the new processors".

[Diagram: a 16x16 array of cores with partition boundaries drawn around groups of cores; only cores are shown, with separately sizable allocations for other resources, e.g., L2 memory.]

Page 40: RAMP Manycore Prototype

[Photo: RAMP Blue, July 2007 — 1000+ RISC cores @ 90 MHz. Works! Runs the UPC version of the NAS parallel benchmarks.]

A separate multi-university RAMP project is building the FPGA emulation infrastructure (BEE3 boards with Chuck Thacker/Microsoft). Expect to fit hundreds of 64-bit cores with full instrumentation in one rack, running at ~100 MHz, fast enough for application software development.

Flexible cycle-accurate timing models: what if DRAM latency is 100 cycles? 200? 1000? What if a barrier takes 5 cycles? 20? 50?

"Tapeout" every day, to incorporate feedback from the application and software layers. Rapidly distribute hardware ideas to the larger community.

Page 41: Summary: A Berkeley View 2.0

Our industry has bet its future on parallelism (!). Recruit the best minds to help?

Try apps-driven vs. CS solution-driven research.

Dwarfs as lingua franca, anti-benchmarks, ...

Efficiency layer for ≈ 10% of today's programmers; productivity layer for ≈ 90% of today's programmers.

C&C language to help compose and coordinate.

Autotuners vs. parallelizing compilers.

OS & HW: composable primitives vs. solutions.

[The Par Lab overview diagram from Page 9: easy to write correct programs that run efficiently on manycore.]

Page 42: Acknowledgments

The Berkeley View material comes from discussions with: Profs. Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, John Wawrzynek, Kathy Yelick; UCB grad students Bryan Catanzaro, Jike Chong, Joe Gebis, William Plishker, and Sam Williams; LBNL: Parry Husbands, Bill Kramer, Lenny Oliker, John Shalf. See view.eecs.berkeley.edu.

RAMP is based on the work of the RAMP developers: Krste Asanovic (Berkeley), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley, Co-PI), and John Wawrzynek (Berkeley, PI). See ramp.eecs.berkeley.edu.