SystemML - Datapalooza Denver - 05.17.16 MWD

Apache SystemML Mike Dusenberry Engineer, Machine Learning & SystemML Spark Technology Center @dusenberrymw Datapalooza, Denver - 05.19.16

Transcript of SystemML - Datapalooza Denver - 05.17.16 MWD

Page 1: SystemML - Datapalooza Denver - 05.17.16 MWD

Apache SystemML

Mike Dusenberry

Engineer, Machine Learning & SystemML
Spark Technology Center

@dusenberrymw
Datapalooza, Denver - 05.19.16

Page 2: SystemML - Datapalooza Denver - 05.17.16 MWD

Apache SystemML

Agenda

1. Background
   a. Machine Learning
   b. Declarative ML
2. SystemML
   a. Overview
   b. Language
   c. Compiler/Optimizer
   d. Runtime
3. Demo
4. Current Work
   a. Deep Learning: SystemML-NN
5. Questions

Page 4: SystemML - Datapalooza Denver - 05.17.16 MWD

Machine Learning

Page 5: SystemML - Datapalooza Denver - 05.17.16 MWD

Machine Learning

● Data
  ○ Multiple “examples”
  ○ Multiple “features” per “example”
  ○ “Label(s)” for each “example” (supervised)
● Model
  ○ Construct/select a model that fits the problem.
  ○ Examples:
    ■ Linear/Logistic Regression
    ■ SVM
    ■ Neural Networks
● Loss
  ○ An “evaluation” of how well the model fits the data.
● Optimizer
  ○ Minimize “loss” by adjusting the model to better fit the data.
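The four ingredients on this slide (data, model, loss, optimizer) can be sketched in a few lines of plain Python. This is only an illustration: the toy data, the choice of linear regression with mean squared error, the learning rate, and all function names are assumptions made for the example, not anything taken from the talk.

```python
# Data: four examples, one feature each, with a numeric label (supervised).
# The values are made up for illustration and satisfy y = 2*x + 1 exactly.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]


def predict(w, b, x):
    """Model: linear regression, a weighted sum of the features plus a bias."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def loss(w, b, X, y):
    """Loss: mean squared error between predictions and labels."""
    return sum((predict(w, b, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(X)


def gradient_step(w, b, X, y, lr):
    """Optimizer: one step of batch gradient descent on the loss."""
    n = len(X)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        err = predict(w, b, xi) - yi
        for j, xj in enumerate(xi):
            grad_w[j] += 2.0 * err * xj / n
        grad_b += 2.0 * err / n
    new_w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
    return new_w, b - lr * grad_b


def fit(X, y, lr=0.05, steps=2000):
    """Start from zeros and repeatedly adjust the model to reduce the loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        w, b = gradient_step(w, b, X, y, lr)
    return w, b


w, b = fit(X, y)
print("weights:", w, "bias:", b, "loss:", loss(w, b, X, y))
```

Swapping in a different model or loss changes only `predict` and `loss` (and the matching gradient); the optimizer loop keeps the same shape.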

Page 6: SystemML - Datapalooza Denver - 05.17.16 MWD

Declarative Machine Learning

Page 7: SystemML - Datapalooza Denver - 05.17.16 MWD

Exploratory Data Analysis Today

[Diagram labels: Data Scientist; Laptop; R, Python, Others; Data]

Page 8: SystemML - Datapalooza Denver - 05.17.16 MWD

Exploratory Data Analysis Today

[Diagram labels: Data Scientist; Laptop; R, Python, Others]

Page 9: SystemML - Datapalooza Denver - 05.17.16 MWD

Current Best Practice for Big Data Analysis

[Diagram labels: Data Scientist (x3); Hadoop Engineer; Spark Engineer; MPI Engineer; R, Python, Others]

Page 10: SystemML - Datapalooza Denver - 05.17.16 MWD

Vision: Declarative Machine Learning

[Diagram labels: Data Scientist; R, Python, Others; Query Optimization; Laptop; Scale-up; Cluster]

Page 11: SystemML - Datapalooza Denver - 05.17.16 MWD

Declarative Machine Learning

Common patterns:

• Changes in feature set
• Changes in data size
• Algorithm customization
• Quick iteration

Page 12: SystemML - Datapalooza Denver - 05.17.16 MWD

Landscape of Existing Work

Classification by level of abstraction (different target user)

• Distributed Systems w/ DSLs: Spark, Flink, REEF, GraphLab, (R, Matlab, SAS)
• Large-Scale ML Libraries (fixed plan): MLlib, Mahout MR, MADlib, ORE, Rev R, HP Dist R, Custom alg.
• Declarative ML (fixed algorithm): SystemML, (Mahout Samsara, Tupleware, Cumulon, Dmac, SimSQL)
• Declarative ML++ (fixed task): MLbase*, Specific sys.

Page 13: SystemML - Datapalooza Denver - 05.17.16 MWD

Requirements to Support Declarative ML

• Goal: Write ML algorithms independent of input data and cluster characteristics.
• R1: Full flexibility
  ▪ Specify new / customize existing ML algorithms.
  ▪ ➔ ML DSL
• R2: Data independence
  ▪ Hide physical data representation (sparse/dense, row/column-major, blocking configs, partitioning, caching, compression).
  ▪ ➔ Abstract data types and coarse-grained logical operations.
• R3: Efficiency and scalability
  ▪ Very small to very large use-cases.
  ▪ ➔ Automatic optimization and hybrid runtime plans.
• R4: Specified algorithm semantics
  ▪ Understand, debug, and control algorithm behavior.
  ▪ ➔ Optimization for performance only, not accuracy.
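Requirement R2 (data independence) can be made concrete with a small Python sketch: one logical `Matrix` type whose physical layout, dense or sparse, is chosen internally and never seen by the caller. The class, the `sparsity_threshold` parameter, and the selection rule are hypothetical and exist only for this illustration; they are not SystemML's actual data structures.

```python
class Matrix:
    """A logical matrix whose physical layout (dense or sparse) is hidden."""

    def __init__(self, rows, sparsity_threshold=0.4):
        self.nrow = len(rows)
        self.ncol = len(rows[0]) if rows else 0
        nnz = sum(1 for r in rows for v in r if v != 0)
        cells = self.nrow * self.ncol
        # Hypothetical rule: store sparsely when few cells are non-zero.
        self.is_sparse = cells > 0 and nnz / cells < sparsity_threshold
        if self.is_sparse:
            # Sparse layout: one {column: value} dict per row.
            self._data = [{j: v for j, v in enumerate(r) if v != 0} for r in rows]
        else:
            # Dense layout: one list of values per row.
            self._data = [list(r) for r in rows]

    def matvec(self, vec):
        """Coarse-grained logical operation: matrix-vector product."""
        if len(vec) != self.ncol:
            raise ValueError("dimension mismatch")
        if self.is_sparse:
            return [sum(v * vec[j] for j, v in row.items()) for row in self._data]
        return [sum(v * x for v, x in zip(row, vec)) for row in self._data]


dense = Matrix([[1, 2], [3, 4]])
sparse = Matrix([[0, 0, 5], [0, 0, 0], [1, 0, 0]])
print(dense.is_sparse, dense.matvec([1, 1]))
print(sparse.is_sparse, sparse.matvec([1, 2, 3]))
```

The algorithm author only ever writes `matvec`; whether the data sits in a dense or sparse layout is a decision the system can change without touching the algorithm.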

Page 14: SystemML - Datapalooza Denver - 05.17.16 MWD

Apache SystemML

Page 15: SystemML - Datapalooza Denver - 05.17.16 MWD

Sidenote: Fun Stuff - Neural Art

- A Neural Algorithm of Artistic Style, L.A. Gatys, A.S. Ecker, M. Bethge
- https://github.com/jcjohnson/neural-style

Page 16: SystemML - Datapalooza Denver - 05.17.16 MWD

Apache SystemML

Page 17: SystemML - Datapalooza Denver - 05.17.16 MWD

Apache SystemML

● High-level language
  ○ DML -> R-like
  ○ PyDML -> Python-like
  ○ Focus is on matrices and linear algebra.
● Engine
  ○ Compiler/Optimizer
  ○ Lots of optimizations, such as rewrites.
● Runtime
  ○ Laptop
  ○ Spark
  ○ (also Hadoop)

[Diagram labels: (DML), (PyDML), Engine]
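To give a feel for what a compiler "rewrite" is, here is a minimal Python sketch of a rule-based simplifier over expression trees. The tuple encoding and the two rules (double transpose, multiplication by one) are standard algebraic identities picked for illustration; they are assumptions of this example and not a listing of SystemML's actual rewrite set.

```python
def rewrite(expr):
    """Simplify an expression tree bottom-up.

    An expression is a variable name (str), a number, or a tuple
    (op, arg1, ...), e.g. ("t", "X") for the transpose of X.
    """
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    # Simplify the children first so rules see already-simplified inputs.
    args = [rewrite(a) for a in args]
    # Rule 1: t(t(X)) -> X
    if op == "t" and isinstance(args[0], tuple) and args[0][0] == "t":
        return args[0][1]
    # Rule 2: X * 1 -> X and 1 * X -> X
    if op == "*":
        if args[1] == 1:
            return args[0]
        if args[0] == 1:
            return args[1]
    return (op, *args)


# t(t(X)) %*% (v * 1) simplifies to X %*% v
before = ("%*%", ("t", ("t", "X")), ("*", "v", 1))
after = rewrite(before)
print(after)
```

The user-visible result of the program is unchanged; only the amount of work needed to compute it shrinks, which is the point of optimizing for performance rather than accuracy.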

Page 19: SystemML - Datapalooza Denver - 05.17.16 MWD

SystemML - Example: Logistic Regression (DML)

Page 20: SystemML - Datapalooza Denver - 05.17.16 MWD

SystemML - Example: Sigmoid Function (DML)
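As a rough Python analogue of what a sigmoid function computes (this is not the DML code from the slide), the sketch below evaluates the logistic sigmoid for scalars and for matrices stored as lists of rows. The function names and the branch used to avoid overflow are choices made for this example.

```python
import math


def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), computed without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(z) is small, so this form cannot overflow.
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_matrix(Z):
    """Apply the sigmoid elementwise to a matrix given as a list of rows."""
    return [[sigmoid(z) for z in row] for row in Z]


print(sigmoid_matrix([[0.0, 2.0], [-2.0, -1000.0]]))
```

In a matrix-oriented language such as DML the elementwise loop disappears, because arithmetic on a matrix already applies to every cell.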

Page 23: SystemML - Datapalooza Denver - 05.17.16 MWD

SystemML - Compilation Chain

Page 30: SystemML - Datapalooza Denver - 05.17.16 MWD

More Fun...

https://github.com/google/deepdream

Page 33: SystemML - Datapalooza Denver - 05.17.16 MWD

SystemML - Compilation Chain

Spark

CP + b sb _mVar1
SPARK mapmm X.MATRIX.DOUBLE _mVar1.MATRIX.DOUBLE _mVar2.MATRIX.DOUBLE RIGHT false NONE
CP * y _mVar2 _mVar3
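The instruction listing mixes single-node (`CP`) and distributed (`SPARK`) operators in one plan. One way such a hybrid plan could come about is sketched below in Python: each operator is assigned an execution type by comparing a worst-case memory estimate against a budget. The matrix dimensions, the 2 GB budget, and the decision rule are all hypothetical values chosen for this sketch; the slide does not state them.

```python
def dense_size_bytes(rows, cols):
    """Worst-case in-memory size of a dense matrix of 8-byte doubles."""
    return rows * cols * 8


def choose_exec_type(input_dims, output_dims, memory_budget_bytes):
    """Pick "CP" if all operands fit in the budget, otherwise "SPARK"."""
    total = sum(dense_size_bytes(r, c) for r, c in input_dims + [output_dims])
    return "CP" if total <= memory_budget_bytes else "SPARK"


# Hypothetical shapes: X is 10M x 1000, b is 1000 x 1, sb is a scalar,
# and y is 10M x 1. The budget is a made-up 2 GB for the single-node side.
BUDGET = 2 * 1024 ** 3
operators = [
    ("+", [(1000, 1), (1, 1)], (1000, 1)),
    ("mapmm", [(10_000_000, 1000), (1000, 1)], (10_000_000, 1)),
    ("*", [(10_000_000, 1), (10_000_000, 1)], (10_000_000, 1)),
]
plan = [choose_exec_type(ins, out, BUDGET) for _, ins, out in operators]
for (name, _, _), exec_type in zip(operators, plan):
    print(exec_type, name)
```

Under these assumed shapes only the multiplication involving the large matrix exceeds the budget, which gives the same CP / SPARK / CP pattern as the listing.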

Page 35: SystemML - Datapalooza Denver - 05.17.16 MWD

SystemML Architecture (APIs and runtime)

[Architecture diagram labels]
● APIs: Command Line, JMLC, Spark MLContext, Spark ML
● Compiler: Parser/Language, High-Level Operators (HOPs), Low-Level Operators (LOPs), Cost-based optimizations
● Runtime: Control Program, Runtime Prog, Buffer Pool, ParFor Optimizer/Runtime, Recompiler, CP Inst, Spark Inst, MR Inst, Mem/FS IO, DFS IO, Generic MR, MatrixBlock Library (single/multi-threaded)

Page 37: SystemML - Datapalooza Denver - 05.17.16 MWD

Demo

Page 38: SystemML - Datapalooza Denver - 05.17.16 MWD

Current Work

Page 39: SystemML - Datapalooza Denver - 05.17.16 MWD

Current Work

● Usability / Applications:
  ○ Deep Learning (SYSTEMML-540)
  ○ Embedded Scala/Python/R DSL with sufficient optimization scope (SYSTEMML-451)
● Optimizer:
  ○ Cost-model enhancement (SYSTEMML-416)
  ○ Global program optimization (SYSTEMML-421)
  ○ Source code generation for automatic operator fusion (SYSTEMML-448)
● Runtime:
  ○ Add GPU backend (SYSTEMML-445) => CUDA / OpenCL
  ○ Frame support / Sparse block representation
  ○ Integrate Apache Flink as additional backend for SystemML (SYSTEMML-636 / PR-119)
  ○ NUMA-aware single node backend (SYSTEMML-406)

Page 40: SystemML - Datapalooza Denver - 05.17.16 MWD

Deep Learning - Plans

● Deep Learning library for SystemML written in DML (SYSTEMML-618).
  ○ SystemML-NN [https://github.com/dusenberrymw/systemml-nn]
● Built-in DML functions for computationally-intensive layers.
  ○ Convolution (2D), Max Pooling
● GPU acceleration for these built-in functions (SYSTEMML-445).
● Integration with existing deep learning libraries (Keras, TensorFlow, Torch, etc.)?

Page 41: SystemML - Datapalooza Denver - 05.17.16 MWD

Deep Learning - SystemML-NN Library

● Deep learning library written in DML (and PyDML soon…).
● Multiple layers:
  ○ Core:
    ■ Affine, 2D Convolution, Max Pooling
  ○ Nonlinearity/Transfer:
    ■ Sigmoid, Tanh, Softmax, ReLU
  ○ Regularization:
    ■ Dropout, L1, L2
  ○ Loss:
    ■ Log-loss, Cross-entropy, L1, L2
● Multiple optimizers:
  ○ SGD, SGD w/ momentum, SGD w/ Nesterov momentum, Adagrad, RMSprop, Adam

https://github.com/dusenberrymw/systemml-nn

Page 42: SystemML - Datapalooza Denver - 05.17.16 MWD

Deep Learning - SystemML-NN Library (cont.)

https://github.com/dusenberrymw/systemml-nn

● Each layer type has a simple `forward(...)` and `backward(...)` API.

○ `forward(...)` computes the output of the function based on the inputs.

○ `backward(...)` computes the partial derivatives (gradients) of some function deeper in the network (usually the loss function at the end) w.r.t. the inputs to this function.

● Each optimizer has a simple `update(...)` API.

○ `update(...)` adjusts the given parameters based on their partial derivatives.

● Includes test code in DML.
  ○ Gradient checks, unit tests
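The `forward(...)` / `backward(...)` / `update(...)` pattern and the idea of a gradient check can be sketched in plain Python as follows. This is an analogue written for illustration, not code from the SystemML-NN library: the function names, argument orders, the particular L2 loss scaling, and the toy data are all assumptions of this example.

```python
def affine_forward(X, W, b):
    """out = X W + b for X (N x D), W (D x M), b (1 x M)."""
    return [[sum(x[d] * W[d][m] for d in range(len(W))) + b[0][m]
             for m in range(len(b[0]))] for x in X]


def affine_backward(dout, X, W):
    """Given dLoss/dout (N x M), return dLoss/dX, dLoss/dW, dLoss/db."""
    N, D, M = len(X), len(W), len(W[0])
    dX = [[sum(dout[n][m] * W[d][m] for m in range(M)) for d in range(D)]
          for n in range(N)]
    dW = [[sum(X[n][d] * dout[n][m] for n in range(N)) for m in range(M)]
          for d in range(D)]
    db = [[sum(dout[n][m] for n in range(N)) for m in range(M)]]
    return dX, dW, db


def l2_loss_forward(pred, y):
    """Average over examples of 0.5 * squared error."""
    return sum(0.5 * (p - t) ** 2
               for pr, yr in zip(pred, y) for p, t in zip(pr, yr)) / len(pred)


def l2_loss_backward(pred, y):
    """Gradient of the loss above w.r.t. the predictions."""
    return [[(p - t) / len(pred) for p, t in zip(pr, yr)]
            for pr, yr in zip(pred, y)]


def sgd_update(param, dparam, lr):
    """Plain SGD: param - lr * dparam, elementwise."""
    return [[p - lr * g for p, g in zip(pr, gr)] for pr, gr in zip(param, dparam)]


# Toy data and parameters, made up for this example.
X = [[1.0, 2.0], [3.0, -1.0]]
y = [[1.0], [0.0]]
W = [[0.5], [-0.25]]
b = [[0.1]]

# Forward pass, then backward pass in reverse order.
out = affine_forward(X, W, b)
loss_before = l2_loss_forward(out, y)
dout = l2_loss_backward(out, y)
dX, dW, db = affine_backward(dout, X, W)


# Gradient check: compare dW[0][0] with a central finite difference.
def loss_with_w00(v):
    return l2_loss_forward(affine_forward(X, [[v], W[1]], b), y)


eps = 1e-5
numeric = (loss_with_w00(W[0][0] + eps) - loss_with_w00(W[0][0] - eps)) / (2 * eps)
analytic = dW[0][0]

# One optimizer step on every parameter.
W_new = sgd_update(W, dW, lr=0.1)
b_new = sgd_update(b, db, lr=0.1)
loss_after = l2_loss_forward(affine_forward(X, W_new, b_new), y)
print(loss_before, loss_after, analytic, numeric)
```

Because every layer exposes the same two calls, a network is just a chain of `forward` calls followed by the matching `backward` calls in reverse, and the finite-difference comparison gives a cheap sanity test for each hand-written gradient.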

Page 43: SystemML - Datapalooza Denver - 05.17.16 MWD

Deep Learning - SystemML-NN Library (cont.)

SystemML-NN

SystemML Engine

Page 44: SystemML - Datapalooza Denver - 05.17.16 MWD

Apache SystemML

Agenda Revisited

1. Background
   a. Machine Learning
   b. Declarative ML
2. SystemML
   a. Overview
   b. Language
   c. Compiler/Optimizer
   d. Runtime
3. Demo
4. Current Work
   a. Deep Learning: SystemML-NN
5. Questions

Page 46: SystemML - Datapalooza Denver - 05.17.16 MWD

Questions?

Page 47: SystemML - Datapalooza Denver - 05.17.16 MWD

Thanks!