CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala)...

52
CS442: High Productivity and Performance with Domain- specific Languages in Scala

Transcript of CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala)...

Page 1: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

CS442: High Productivity and

Performance with Domain-

specific Languages in Scala

Page 2: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 2

CS442 Overview

Experimental Research Course Lots of interest in DSLs

Opportunity to do something new

New ideas: DSLs for parallelism and productivity

New software: DSL infrastructure

May lead to publications

Goals

Knowledge and tools for DSL development

Bring domain experts and CS experts together

Develop new DSLs and push development of DSL infrastructure

Learn Scala: important new programming lang.

Page 3: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 3

Course Information

Instructor: Kunle Olukotun E-mail: [email protected]

Office: Gates 302

Office Hours: 11:00am-noon M/W or by appointment

TAs Hassan Chafi, Arvind Sujeeth, Kevin Brown

E-Mail: cs442-spr1011-staff at lists dot stanford dot edu

Course Support Darlene Hadding

E-Mail: darleneh at stanford dot edu

Location:Gates 408

Telephone: 650-723-1430

Office Hours: M-F 10:00 AM-3:00 PM

Page 4: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 4

Course Information

Prerequisites

CS background: CS108, and a systems course (CS143, CS140), preferably CS 149.

Non CS background: Expertise in a particular domain

Participation (10% of grade)

You are expected to participate in the class discussion

Website

http://www.stanford.edu/class/cs442

Programming Assignments

Scala

DSL embedding with

Final Project (90% of grade)

The Final Project is an open-ended research project

New DSL or adapt existing DSL

Page 5: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 5

2020 Vision for Parallelism

Make parallelism accessible to all programmers

Parallelism is not for the average programmer Too difficult to find parallelism, to debug,

maintain and get good performance for the masses

Need a solution for “Joe/Jane the programmer”

Can’t expose average programmers to parallelism But auto parallelizatoin doesn’t work

Page 6: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 6

What Is Computing?

Predicting the future

Modeling and simulation (weather, materials, products)

Decide what to build and experiment or instead of build and experiment ⇒ third pilar of science

Coping with the present (real time)

Embedded systems control (cars, planes, communication)

Virtual worlds (second life, facebook)

Electronic trading (airline reservation, stock market)

Robotics (manufacturing, cars, household)

Understanding the past Large data set analysis (commerce, web, census, simulation)

Discover trends and develop insight

Page 7: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 7

The Changing Nature of Science

Page 8: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 8

Explosion of Data Sources

Page 9: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 9

Genetics Gets Personal

Page 10: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 10

Computing System Power

second

OpsEnergyPower

Op

FIXED

Page 11: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 11

Heterogeneous Hardware

Heterogeneous HW for energy efficiency Multi-core, ILP, threads, data-parallel engines, custom engines

H.264 encode study

1

10

100

1000

4 cores + ILP + SIMD + custom inst

ASIC

Performance

Energy Savings

Source: Understanding Sources of Inefficiency in General-Purpose Chips (ISCA’10)

3 orders of magnitude

Page 12: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 12

DE Shaw Research: Anton

D. E. Shaw et al. SC 2009, Best Paper and Gordon Bell Prize

100 times more power efficient

Molecular dynamics computer

Page 13: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 13

Heterogeneous Parallel Architectures Today

Cray

Jaguar

Sun

T2

Nvidia

Fermi

Altera

FPGA

Page 14: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 14

Heterogeneous Parallel Programming

Cray

Jaguar

Sun

T2

Nvidia

Fermi

Altera

FPGA

MPIPGAS

PthreadsOpenMP

CUDA OpenCL

VerilogVHDL

Page 15: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 15

Programmability Chasm

Too many different programming models

Cray

Jaguar

Sun

T2

Nvidia

Fermi

Altera

FPGA

MPIPGAS

PthreadsOpenMP

CUDA OpenCL

VerilogVHDL

Virtual

Worlds

Personal

Robotics

Datainformatics

Scientific

Engineering

Applications

Page 16: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 16

Hypothesis

It is possible to write one programand

run it on all these machines

Page 17: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 17

Programmability Chasm

Cray

Jaguar

Sun

T2

Nvidia

Fermi

Altera

FPGA

MPIPGAS

PthreadsOpenMP

CUDA OpenCL

VerilogVHDL

Virtual

Worlds

Personal

Robotics

Datainformatics

Scientific

Engineering

Applications

Ideal Parallel

Programming Language

Page 18: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 18

Performance

Productivity Generality

The Ideal Parallel Programming Language

Page 19: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 19

Successful Languages

Performance

Productivity Generality

Page 20: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 20

True Hypothesis ⇒Domain Specific Languages

DomainSpecific

Languages

Performance(Heterogeneous Parallelism)

Productivity Generality

Page 21: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 21

Domain Specific Languages

Domain Specific Languages (DSLs)

Programming language with restricted expressiveness for a particular domain

High-level, usually declarative, and deterministic

Page 22: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 22

Benefits of Using DSLs for Parallelism

Productivity

•Shield average programmers from the difficulty of parallel programming

•Focus on developing algorithms and applications and not on low level implementation details

Performance

•Match high level domain abstraction to generic parallel execution patterns

•Restrict expressiveness to more easily and fully extract available parallelism

•Use domain knowledge for static/dynamic optimizations

Portability and forward scalability

•DSL & Runtime can be evolved to take advantage of latest hardware features

•Applications remain unchanged

•Allows innovative HW without worrying about application portability

Page 23: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 23

Bridging the Programmability Chasm

Domain Embedding Language (Scala)

Virtual

Worlds

Personal

RoboticsData

informatics

Scientific

Engineering

Physics

(Liszt)

Data Analysis

(SQL)

Probabilistic

(RandomT)

Machine Learning(OptiML)

Rendering

Parallel Runtime (Delite)

Dynamic Domain Spec. Opt. Locality Aware Scheduling

StagingPolymorphic Embedding

Applications

Domain

Specific

Languages

Heterogeneous

Hardware

DSL

Infrastructure

Task & Data Parallelism

Static Domain Specific Opt.

Page 24: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 24

Fuel injectionTransition Therm

al

Turbulence

Turbulence

Combustion

Liszt: DSL for Mesh PDEs

Z. DeVito, N. Joubert, P. Hanrahan

Solvers for mesh-based PDEs

Complex physical systems

Huge domains

millions of cells

Example: Unstructured Reynolds-averaged Navier Stokes (RANS) solver

Goal: simplify code of mesh-based PDE solvers

Write once, run on any type of parallel machine

From multi-cores and GPUs to clusters

Page 25: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 25

Liszt Language Features

Minimal Programming language Aritmetic, short vectors, functions, control flow

Built-in mesh interface for arbitrary polyhedra Vertex, Edge, Face, Cell

Optimized memory representation of mesh

Collections of mesh elements

Element Sets: faces(c:Cell), edgesCCW(f:Face)

Mapping mesh elements to fields

Fields: val vert_position = position(v)

Parallelizable iteration

forall statements: for( f <- faces(cell) ) { … }

Page 26: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 26

Liszt Code Example

Simple Set Comprehension

Functions, Function Calls

Mesh Topology Operators

Field Data Storage

for(edge <- edges(mesh)) {

val flux = flux_calc(edge)

val v0 = head(edge)

val v1 = tail(edge)

Flux(v0) += flux

Flux(v1) -= flux

}

Code contains possible write conflicts!

We use architecture specific strategies guided by domain knowledge

MPI: Ghost cell-based message passing

GPU: Coloring-based use of shared memory

Page 27: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 27

MPI Performance

Using 8 cores per node, scaling up to 96 cores (12 nodes, 8 cores per node, all communication using MPI)

0

20

40

60

80

100

120

0 20 40 60 80 100 120

Spe

ed

up

ove

r Sc

alar

Number of MPI Nodes

MPI Speedup 750k Mesh

Linear Scaling Liszt Scaling Joe Scaling

1

10

100

1000

0 20 40 60 80 100 120

Ru

nti

m L

og

Scal

e (

seco

nd

s)

Number of MPI Nodes

MPI Wall-Clock Runtime

Liszt Runtime Joe Runtime

Page 28: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 28

GPU Performance

Scaling mesh size from 50K (unit-sized) cells to 750K (16x) on a Tesla C2050. Comparison is against single threaded runtime on host CPU (Core 2 Quad 2.66Ghz)

Single-Precision: 31.5x, Double-precision:

28x

0

5

10

15

20

25

30

35

0 5 10 15 20

Spe

ed

up

ove

r Sc

alar

Problem Size

GPU Speedup over Single-Core

Speedup Double

Speedup Float

Page 29: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 29

OptiML: A DSL for ML

A. Sujeeth and H. Chafi

Machine Learning domain Learning patterns from data

Applying the learned models to tasks

Regression, classification, clustering, estimation

Computationally expensive

Regular and irregular parallelism

Motivation for OptiML Raise the level of abstraction

Use domain knowledge to identify coarse-grained parallelism

Single source ⇒ multiple heterogeneous targets

Domain specific optimizations

Page 30: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 30

OptiML Language Features

Provides a familiar (MATLAB-like) language and API for writing ML applications Ex. val c = a * b (a, b are Matrix[Double])

Implicitly parallel data structures General data types : Vector[T], Matrix[T]

Independent from the underlying implementation

Special data types : TrainingSet, TestSet, IndexVector, Image, Video ..

Encode semantic information

Implicitly parallel control structures sum{…}, (0::end) {…}, gradient { … }, untilconverged { … }

Allow anonymous functions with restricted semantics to be passed as arguments of the control structures

Page 31: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 31

% x : Matrix, y: Vector

% mu0, mu1: Vector

n = size(x,2);

sigma = zeros(n,n);

parfor i=1:length(y)

if (y(i) == 0)

sigma = sigma + (x(i,:)-mu0)’*(x(i,:)-mu0);

else

sigma = sigma + (x(i,:)-mu1)’*(x(i,:)-mu1);

end

end

Example OptiML / MATLAB code(Gaussian Discriminant Analysis)

// x : TrainingSet[Double]

// mu0, mu1 : Vector[Double]

val sigma = sum(0,x.numSamples) {

if (x.labels(_) == false) {

(x(_)-mu0).trans.outer(x(_)-mu0)

}

else {

(x(_)-mu1).trans.outer(x(_)-mu1)

}

}

OptiML code (parallel) MATLAB code

ML-specific data types

Implicitly parallel

control structures

Restricted index

semantics

Page 32: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 32

Benchmark Applications

6 machine learning applications Gaussian Discriminant Analysis (GDA)

Generative learning algorithm for probability distribution

Loopy Belief Propagation (LBP)

Graph based inference algorithm

Naïve Bayes (NB)

Supervised learning algorithm for classification

K-means Clustering (K-means)

Unsupervised learning algorithm for clustering

Support Vector Machine (SVM)

Optimal margin classifier using SMO algorithm

Restricted Boltzmann Machine (RBM)

Stochastic recurrent neural network

Page 33: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 33

Experimental Setup

4 Different implementations OptiML+Delite

MATLAB (Original, GPU, Jacket)

System 1: Performance Tests Intel Nehalem

2 sockets, 8 cores, 16 threads

24 GB DRAM

NVIDIA GTX 275 GPU

System 2: Scalability Tests Sun Niagara T2+

4 sockets, 32 cores, 256 threads

128 GB DRAM

Page 34: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 34

OptiML vs. MATLAB1.0

1.6

1.8

1.9

41.3

0.5

0.9

1.4

1.6

2.6

13.2

0.00

0.50

1.00

1.50

2.00

2.50

1

CPU

2

CPU

4

CPU

8

CPU

CPU

+

GPUNo

rm

alized

Execu

tio

n

Tim

e

GDA

1.0

1.7

2.7

3.5

11.0

1.0

1.9

3.2

4.7

8.9

16.1

0.00

0.20

0.40

0.60

0.80

1.00

1.20

1 CPU 2 CPU 4 CPU 8 CPU CPU +

GPU

RBM

1.0

1.3

1.4

2.0 1.7

0.5

0.9

1.3 1.1

0.4 0.3

0.00

0.50

1.00

1.50

2.00

2.50

3.00

3.50

1 CPU 2 CPU 4 CPU 8 CPU CPU + GPU

Linear Regression

1.0

1.7

3.0

3.8

1.0

0.7

0.9

0.9

1.2

0.00

0.50

1.00

1.50

2.00

1 CPU 2 CPU 4 CPU 8 CPU CPU + GPU

0.1

0.1

7.00

15.00

SVM

... ...

1.0

1.9

3.6

5.8 1.1

0.1

0.2

0.2

0.3

0.1

0.00

2.00

4.00

6.00

8.00

10.00

1 CPU 2 CPU 4 CPU 8 CPU CPU +

GPU

0.0

1

100.00

110.00

Naive Bayes

...

DELITE MATLAB JacketOptiML

1.0

2.1

4.0

6.7

10.0

0.3

0.4

0.4

0.4

0.3

0.3

0.00

0.50

1.00

1.50

2.00

2.50

3.00

3.50

1 CPU 2 CPU 4 CPU 8 CPU CPU +

GPU

K-means

Page 35: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 35

Measuring Intracellular Signaling with Mass Cytometry

Bioinformatics Algorithm Spanning-tree Progression Analysis of Density-normalized Events (SPADE)

P. Qiu, E. Simonds, M. Linderman, P. Nolan

Peng Qiu

Page 36: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 36

SPADE is computationally intensive

Processing time for 30 files:

Matlab (parfor & vectorized loops)

2.5 days

C++ (hand-optimized OpenMP)

2.5 hours

…what happens when we have 1,000 files?

Page 37: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 37

SPADE Downsample: OptiML

for(node <- G.nodes if node.density == 0) {

val (closeNbrs,closerNbrs) =

node.neighbors filter {dist(_,node) < kernelWidth}

{dist(_,node) < approxWidth}

node.density = closeNbrs.count

for(nbr <- closerNbrs) {

nbr.density = closeNbrs.count

}

}

kernelWidth

Downsample:

L1 distances between all 106

events in 13D space… reduce to

50,000 events

B. Wang and A. Sujeeth

Page 38: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 38

SPADE Downsample: Matlabwhile sum(local_density==0)~=0

% process no more than 1000 nodes each time

ind = find(local_density==0); ind = ind(1:min(1000,end));

data_tmp = data(:,ind);

local_density_tmp = local_density(ind);

all_dist = zeros(length(ind), size(data,2));

parfor i=1:size(data,2)

all_dist(:,i) = sum(abs(repmat(data(:,i),1,size(data_tmp,2)) –data_tmp),1)';

end

for i=1:size(data_tmp,2)

local_density_tmp(i) = sum(all_dist(i,:) < kernel_width);

local_density(all_dist(i,:) < apprx_width) = local_density_tmp(i);

end

end

Page 39: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 39

OptiML vs. C++

OptiML provides much simpler programming model

OptiML performance as good as C++ on full applications

1.0

1.7

3.1

4.9

0.7

0.00

0.20

0.40

0.60

0.80

1.00

1.20

1.40

1.60

1 CPU 2 CPU 4 CPU 8 CPU

No

rm

alized

Execu

tion

Tim

e

Template Match

DELITE C++

1.0

1.9

3.4

5.4

1.0

2.0

3.6

5.9

0.00

0.20

0.40

0.60

0.80

1.00

1.20

1 CPU 2 CPU 4 CPU 8 CPU

SPADE

1.0

2.5

4.9

9.8

1.6

2.1

4.3

7.1

0.00

0.20

0.40

0.60

0.80

1.00

1.20

1 CPU 2 CPU 4 CPU 8 CPU

LBP

OptiML

Page 40: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 40

More DSLs …

Graphs

Social networks, data analysis

Visualization

Data analysis

Query Language

Data analysis, financial trading

Bio-simulation

Molecular dynamics, cells & viruses , drug-design, prosthetics

Bio-informatics

Data analyis

Your DSL goes here

Page 41: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 41

New Problem

We need to develop all of these DSLs

Current DSL methods are unsatisfactory

Page 42: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 42

Current DSL Development Approaches

Stand-alone DSLs Can include extensive optimizations Enormous effort to develop to a sufficient degree of maturity

Actual Compiler/Optimizations Tooling (IDE, Debuggers,…)

Interoperation between multiple DSLs is very difficult

Purely embedded DSLs ⇒ “just a library” Easy to develop (can reuse full host language) Easier to learn DSL Can Combine multiple DSLs in one program Can Share DSL infrastructure among several DSLs Hard to optimize using domain knowledge Target same architecture as host language

Need to do better

Page 43: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 43

Need to Do Better

Goal: Develop embedded DSLs that perform as well as stand-alone ones

Intuition: General-purpose languages should be designed with DSL embedding in mind

Page 44: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 44

DSL Embedding Language

Mixes OO and FP paradigms

Targets JVM

Stanford/EPFL collaboration on leveraging Scala for parallelism

“Language Virtualization for Heterogeneous Parallel Computing” Onward 2010, Reno

Page 45: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 45

What is Scala

Statically typed Allows OOP

Improves it by adding traits Everything is an object

Allows FP Functions are just like any other object

can be passed to and returned from other functions/methods

Also offers closures

Compiles to JVM/.NET Succinct, elegant and flexible syntax

Type inference Syntactic sugaring features make libraries feel “native”

Sophisticated type system Scalable language

Page 46: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 46

Embedded DSL gets it all for free,but can’t change any of it

Lightweight Modular Staging Approach

Lexer ParserType

checkerAnalysis Optimization

Code gen

DSLs adopt front-end from highly expressive

embedding language

but can customize IR and participate in backend phases

Stand-alone DSL implements everything

Typical Compiler

Modular Staging provides a hybrid approach

GPCE’10: Lightweight modular staging: a pragmatic approach to runtime code generation and compiled DSLs

Page 47: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 47

Delite: A Framework for DSL Parallelism

Lexer ParserType

checkerAnalysis Optimization

Code gen

DSLs adopt front-end from highly expressive

embedding language

but can customize IR and participate in backend phases

Need a framework to simplify development of DSL backends

Delite

H. Chafi, A. Sujeeth, K. Brown, H. Lee

Page 48: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 48

Benefits of a Shared DSL Infrastructure

Parallelization of DSLs Map domain abstractions to low level parallel execution

patterns Why not provide a collection of these execution patterns?

Code Generation to multiple targets Most DSLs have common functionality

Why no handle this once and for all DSLs?

Shared parallel runtime to accelerate DSL development

Handle other issues in a consistent fashion across DSLs Debugging IDE and tooling support

Page 49: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 49

Intermediate Representation (IR)

Delite DSL Compiler

Provide a common IR that can be extended while still benefitting from generic analysis and opt.

Extend common IR and provide IR nodes that encode data parallel execution patterns

Now can do parallel optimizations and mapping

DSL extends appropriate data parallel nodes for their operations

Now can do domain-specific analysis and opt.

Generate an execution graph, kernels and data structures

Scala Embedding

Framework

Delite

Execution

Graph

Delite Parallelism

Framework

Base IR

Generic

Analysis & Opt.

Code Generation

Kernels(Sc

ala, C,

Cuda, MPI

Verilog, …)

Liszt

program

OptiML

program

DS IR

Domain

Analysis & Opt.

Delite IR

Parallelism Analysis,

Opt. & Mapping

⇒⇒

Data

Structures(arra

ys, trees,

graphs, …)

Page 50: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 50

The Delite IR

Matrix

Plus

Vector

Exp

Matrix

Sum

ReduceMapZipWith

Expression

s = sum(M)V1 = exp(V2)M1 = M2 + M3

Domain

Analysis & Opt.

Domain User

Interface

Parallelism

Analysis & Opt.

Code Generation

DSL

User

Generic Analysis

& Opt.

Application

DS IR

Delite Op IR

Base IR

DSL

Author

Delite

Delite

Collection

Quicksort

Divide &

Conquer

C2 = sort(C1)

Page 51: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 51

Partial schedules, Fused & specialized kernels

Cluster

GPUSMP

Machine Inputs

Delite

Execution

Graph

Kernels

(Scala, C,

Cuda,

Verilog, …)

Data

Structures

(arrays, trees,

graphs, …)

Application Inputs

Delite Execution

SchedulerCode Generator

Fusion, Specialization, Synchronization

Walk-Time

Schedule Dispatch, Dynamic load balancing, Memory management,

Lazy data transfers, Kernel auto-tuning, Fault tolerance

Maps the machine-agnostic DSL compiler output onto the machine configuration for execution

Walk-time scheduling produces partial schedules

Code generation produces fused, specialized kernels to be launched on each resource

Run-time executor controls and optimizes execution

Run-Time

Page 52: CS442: High Productivity and Performance with Domain ... · Domain Embedding Language (Scala) Virtual Worlds Personal Robotics Data informatics Scientific Engineering Physics (Liszt)

(C) 2011 Kunle Olukotun 52

Conclusions

DSLs have potential to solve the heterogeneous parallel programming problem

Don’t expose programmers to explicit parallelism unless they ask for it

Look at the process of developing DSLs for parallelism

Need programming languages to be designed for flexible embedding (Scala)

Lightweight modular staging in Scala allows for more powerful embedded DSLs

Delite provides a framework for adding parallelism

Explore DSLs and DSL infrastucture in this course