Combining Machine Learning Frameworks with Apache Spark

Combining Machine Learning frameworks with Apache Spark Tim Hunter Hadoop Summit June 2016

Transcript of Combining Machine Learning Frameworks with Apache Spark

Page 1: Combining Machine Learning Frameworks with Apache Spark

Combining Machine Learning frameworks with Apache Spark
Tim Hunter
Hadoop Summit
June 2016

Page 2: Combining Machine Learning Frameworks with Apache Spark

About me

Apache Spark contributor (since Spark 0.6)

Software Engineer @ Databricks

Ph.D. in Machine Learning @ UC Berkeley

Page 3: Combining Machine Learning Frameworks with Apache Spark

About Databricks

Founded by the team who created Apache Spark

Offers a hosted service:
• Apache Spark in the cloud
• Notebooks
• Cluster management
• Production environment

Page 4: Combining Machine Learning Frameworks with Apache Spark

Apache Spark

• The most active open-source project in big data

Page 5: Combining Machine Learning Frameworks with Apache Spark

Spark MLlib

• Large-scale machine learning on Apache Spark

Page 6: Combining Machine Learning Frameworks with Apache Spark

MLlib’s Mission

MLlib’s mission is to make practical machine learning easy and scalable.

• Easy to build machine learning applications
• Capable of learning from large-scale datasets
• Easy to integrate into existing workflows

Page 7: Combining Machine Learning Frameworks with Apache Spark

Algorithm Coverage

List based on Spark 2.0

• Classification
– Logistic regression
– Naive Bayes
– Streaming logistic regression
– Linear SVMs
– Decision trees
– Random forests
– Gradient-boosted trees
– Multilayer perceptron

• Regression
– Ordinary least squares
– Ridge regression
– Lasso
– Isotonic regression
– Decision trees
– Random forests
– Gradient-boosted trees
– Streaming linear methods
– Generalized Linear Models

• Frequent itemsets
– FP-growth
– PrefixSpan

• Clustering
– Gaussian mixture models
– K-Means
– Streaming K-Means
– Latent Dirichlet Allocation
– Power Iteration Clustering
– Bisecting K-Means

• Statistics
– Pearson correlation
– Spearman correlation
– Online summarization
– Chi-squared test
– Kernel density estimation
– Kolmogorov–Smirnov test
– Online hypothesis testing
– Survival analysis

• Linear algebra
– Local dense & sparse vectors & matrices
– Normal equation for least squares
– Distributed matrices: block-partitioned matrix, row matrix, indexed row matrix, coordinate matrix
– Matrix decompositions

• Recommendation
– Alternating Least Squares

• Feature extraction & selection
– Word2Vec
– Chi-Squared selection
– Hashing term frequency
– Inverse document frequency
– Normalizer
– Standard scaler
– Tokenizer
– One-Hot Encoder
– StringIndexer
– VectorIndexer
– VectorAssembler
– Binarizer
– Bucketizer
– ElementwiseProduct
– PolynomialExpansion
– Quantile discretizer
– SQL transformer

• Model import/export

• Pipelines

Page 8: Combining Machine Learning Frameworks with Apache Spark

Outline

• ML workflows are complex
• Distributing single-machine ML frameworks:
• Embedding with Spark:
• Unified cross-language ML pipelines with MLlib

Page 9: Combining Machine Learning Frameworks with Apache Spark

ML workflows are complex

• Specify the pipeline
• Re-run on new data
• Inspect the results
• Tune the parameters

• Usually, each step of a pipeline is easier with one particular framework

Page 10: Combining Machine Learning Frameworks with Apache Spark

ML Workflows are Complex

[Diagram: workflow with boxes Datasource 1, Datasource 2, Datasource 3; Extract features (twice); Feature transform 1, Feature transform 2, Feature transform 3; Train model 1; Train model 2; Ensemble; Evaluate]

Page 11: Combining Machine Learning Frameworks with Apache Spark

Existing tools

• Scikit-learn
– Excellent documentation
– Standard for Python

• R
– Lots of packages available

• Pandas
– Very easy to use

• A lot of investment in tooling and education
– How to integrate big data with these tools?

Page 12: Combining Machine Learning Frameworks with Apache Spark

Common misconceptions

• Spark is for big data only
• Spark can only work with dedicated, distributed libraries

Page 13: Combining Machine Learning Frameworks with Apache Spark

Spark as a scheduler

• A lot of tasks in ML are “embarrassingly parallel”
• Use Spark for data management and for scheduling

Page 14: Combining Machine Learning Frameworks with Apache Spark

One example: learning digits

• Learning task: given a set of images, recognize the digits
• Standard benchmark dataset in computer vision built by NIST:

Page 15: Combining Machine Learning Frameworks with Apache Spark

Training Deep Learning algorithms

• Training a neural network is hard:
– It is a sequential procedure (present one image after the other to learn from)
– It can be sensitive to noise and order of images: robustness analysis is critical
– Tuning the training parameters (descent rate, batch sizes, etc.) is very important. Otherwise, learning is too slow or gets stuck in a local minimum. A lot of heuristics are used in practice.

Page 16: Combining Machine Learning Frameworks with Apache Spark

TensorFlow as a training library

• A lot of algorithms have been presented for this task; we will choose TensorFlow, from Google:
– Popular choice for neural network training and deep learning
– Competitive performance
– Easy to experiment with
– Python interface makes it easy to integrate with Spark

Page 17: Combining Machine Learning Frameworks with Apache Spark

Distributing TensorFlow computations

• Even if TF is used as a single-machine library, we get speedups from Spark

[Diagram: Distributed Cross Validation, with boxes Model #1 Training, Model #2 Training, Model #3 Training, ... and Best Model]

Page 18: Combining Machine Learning Frameworks with Apache Spark

Distributing TensorFlow computations

[Diagram: Distributed Cross Validation, with boxes Model #1 Training through Model #6 Training, ... and Best Model]

Page 19: Combining Machine Learning Frameworks with Apache Spark

Results

• Running a 2-layer neural network, and testing for different update rates and different layer sizes

[Chart comparing 1 node, 2 nodes, and 13 nodes; axis range 0 to 12000]

Page 20: Combining Machine Learning Frameworks with Apache Spark

Embedding deep learning in Spark

• Best known algorithms are essentially sequential during training
• Careful selection of training parameters is critical
• Spark can help iterate quickly and find a good set of parameters
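The idea of training one independent model per parameter setting can be sketched in a few lines. This is a minimal local illustration, not the code from the talk: it uses a standard-library thread pool as a stand-in for Spark, a toy linear model instead of a TensorFlow network, and a hypothetical grid of learning rates and batch sizes. All names (`train_and_score`, `XS`, `YS`, `grid`) are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Toy training set for y = 2x + 1 (hypothetical stand-in for real data).
XS = [i / 10.0 for i in range(20)]
YS = [2.0 * x + 1.0 for x in XS]


def train_and_score(params):
    """Fit y = w*x + b by mini-batch gradient descent; return (loss, params)."""
    learning_rate, batch_size = params
    w, b = 0.0, 0.0
    for _ in range(200):  # fixed number of passes over the data
        for start in range(0, len(XS), batch_size):
            xs = XS[start:start + batch_size]
            ys = YS[start:start + batch_size]
            # Gradients of the mean squared error on this batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
            grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    loss = sum((w * x + b - y) ** 2 for x, y in zip(XS, YS)) / len(XS)
    return loss, params


# Every (learning rate, batch size) pair is an independent training task.
grid = list(product([0.001, 0.01, 0.1], [5, 20]))

# Local stand-in for the cluster. With Spark, and assuming an existing
# SparkContext named sc, the same map could be written as:
#   results = sc.parallelize(grid).map(train_and_score).collect()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(train_and_score, grid))

best_loss, best_params = min(results)
```

Each training run stays sequential, as on a single machine; only the search over parameter settings is spread across workers.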

Page 21: Combining Machine Learning Frameworks with Apache Spark

Managing ML workflows with Spark


Page 22: Combining Machine Learning Frameworks with Apache Spark

A data scientist’s wish list:

• Run original code on a production environment
• Use distributed data sources
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings

Page 23: Combining Machine Learning Frameworks with Apache Spark

Example: sentiment analysis


Given a review (text), predict the user’s rating.

Data from https://snap.stanford.edu/data/web-Amazon.html

Page 24: Combining Machine Learning Frameworks with Apache Spark

ML Workflow

[Diagram: Load data → Extract features → Train model → Evaluate]

Review: This product doesn't seem to be made to last… Rating: 2

feature_vector: [0.1 -1.3 0.23 … -0.74] rating: 2.0

Regression: (review: String) => Double

Page 25: Combining Machine Learning Frameworks with Apache Spark

Load Data

Data sources for DataFrames (built-in and external): { JSON }, JDBC, LIBSVM, and more …

[Diagram: Load data → Extract features → Train model → Evaluate]

Page 26: Combining Machine Learning Frameworks with Apache Spark

Extract Features

[Diagram: Load data → Tokenizer → Hashed Term Frequency → Train model → Evaluate]

Review: This product doesn't seem to be made to last… Rating: 2

words: [this, product, doesn't, seem, to, …]

feature_vector: [0.1 -1.3 0.23 … -0.74]

Prediction: 3.0

Page 27: Combining Machine Learning Frameworks with Apache Spark

Extract Features

[Diagram: Load data → Tokenizer → Hashed Term Frequency → Linear regression → Evaluate]

Review: This product doesn't seem to be made to last… Rating: 2

words: [this, product, doesn't, seem, to, …]

feature_vector: [0.1 -1.3 0.23 … -0.74]

Prediction: 3.0
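The two feature-extraction stages, tokenizing and hashed term frequencies, can be illustrated in plain Python. This is a rough sketch of the hashing idea only, not MLlib's implementation: MLlib provides `Tokenizer` and `HashingTF` in `pyspark.ml.feature`, with their own hash function and sparse output. The helper names, the CRC32 hash, and the vector size of 16 are assumptions chosen to keep the example short and deterministic.

```python
import zlib


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def hashed_term_frequencies(words, num_features=16):
    """Count words into a fixed-size vector, using a hash to pick the slot."""
    vector = [0.0] * num_features
    for word in words:
        # CRC32 is used because it is stable across runs; different words
        # may collide in the same slot, which the technique accepts.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


review = "This product doesn't seem to be made to last"
words = tokenize(review)
features = hashed_term_frequencies(words)
```

The vector length is fixed in advance, so no vocabulary has to be built or shared between machines before the features are computed.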

Page 28: Combining Machine Learning Frameworks with Apache Spark

Our ML workflow

[Diagram: Feature Extraction and Model Training wrapped in Cross Validation]

regularization parameter: {0.0, 0.1, ...}

Page 29: Combining Machine Learning Frameworks with Apache Spark

Cross validation

[Diagram: Cross Validation, with Feature Extraction feeding Model #1 Training, Model #2 Training, Model #3 Training, ... and Best Model]
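Cross validation over a grid of regularization parameters can be sketched as follows. In MLlib this is the role of `CrossValidator` and `ParamGridBuilder` in `pyspark.ml.tuning`; the code below is only a plain-Python illustration on a one-dimensional ridge regression, with made-up noise-free data and a hypothetical grid `[0.0, 0.1, 1.0]` (the slide elides the values after 0.1).

```python
def fit_ridge(xs, ys, reg):
    """Closed-form 1-D ridge regression without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + reg)


def cross_val_mse(xs, ys, reg, k=3):
    """Mean squared error over k folds; fold f holds out indices f, f+k, ..."""
    n = len(xs)
    squared_error = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        w = fit_ridge(train_x, train_y, reg)
        squared_error += sum((w * xs[i] - ys[i]) ** 2 for i in held_out)
    return squared_error / n


# Made-up data following y = 3x exactly, so no regularization is needed.
xs = [float(i) for i in range(1, 10)]
ys = [3.0 * x for x in xs]

reg_grid = [0.0, 0.1, 1.0]  # hypothetical candidate values
scores = {reg: cross_val_mse(xs, ys, reg) for reg in reg_grid}
best_reg = min(scores, key=scores.get)
```

Every (fold, parameter) pair is an independent fit, which is what makes this step easy to distribute.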

Page 30: Combining Machine Learning Frameworks with Apache Spark

Example


Page 31: Combining Machine Learning Frameworks with Apache Spark

MLlib in production: ML Persistence

Page 32: Combining Machine Learning Frameworks with Apache Spark

A data scientist’s wish list:

• Run original code on a production environment
• Use distributed data sources
• Use familiar APIs and libraries
• Distribute ML workload piece by piece
• Only distribute as needed
• Easily switch between local & distributed settings

Page 33: Combining Machine Learning Frameworks with Apache Spark

DataFrame-based API for MLlib

a.k.a. “Pipelines” API, with utilities for constructing ML Pipelines

In 2.0, the DataFrame-based API will become the primary API for MLlib.
• Voted by community
• org.apache.spark.ml, pyspark.ml

The RDD-based API will enter maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib

Page 34: Combining Machine Learning Frameworks with Apache Spark

Why ML persistence?

Data Science:
• Prototype (Python/R)
• Create model

Software Engineering:
• Re-implement model for production (Java)
• Deploy model

Page 35: Combining Machine Learning Frameworks with Apache Spark

Why ML persistence?

Data Science:
• Prototype (Python/R)
• Create Pipeline
– Extract raw features
– Transform features
– Select key features
– Fit multiple models
– Combine results to make prediction

Software Engineering:
• Re-implement Pipeline for production (Java)
– Extra implementation work
– Different code paths
– Synchronization overhead
• Deploy Pipeline

Page 36: Combining Machine Learning Frameworks with Apache Spark

With ML persistence...

Data Science:
• Prototype (Python/R)
• Create Pipeline
• Persist model or Pipeline: model.save("s3n://...")

Software Engineering:
• Load Pipeline (Scala/Java): Model.load("s3n://…")
• Deploy in production

Page 37: Combining Machine Learning Frameworks with Apache Spark

ML persistence status

[Diagram: Model and Pipeline, each shown as Unfitted (“recipe”) and Fitted (“result”); example stages: Text preprocessing, Feature generation, Generalized Linear Regression, Model tuning; legend: Supported in MLlib’s RDD-based API]

Page 38: Combining Machine Learning Frameworks with Apache Spark

ML persistence status

Near-complete coverage in all Spark language APIs
• Scala & Java: complete (29 feature transformers, 21 models)
• Python: complete except for 2 algorithms
• R: complete for existing APIs

Single underlying implementation of models

Exchangeable data format
• JSON for metadata
• Parquet for model data (coefficients, etc.)
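The split between metadata and model data can be illustrated with a small save/load round trip. This is only a sketch of the idea: the directory layout, file names, and helper functions are hypothetical and are not MLlib's actual on-disk format, and the model data is written as JSON instead of Parquet so that the example needs nothing beyond the standard library.

```python
import json
import os
import tempfile


def save_model(path, class_name, params, coefficients):
    """Write metadata and model data as two separate files under `path`."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "metadata.json"), "w") as f:
        json.dump({"class": class_name, "params": params}, f)
    # MLlib uses Parquet for model data; JSON is a stand-in here.
    with open(os.path.join(path, "data.json"), "w") as f:
        json.dump({"coefficients": coefficients}, f)


def load_model(path):
    """Read back the metadata and the coefficients written by save_model."""
    with open(os.path.join(path, "metadata.json")) as f:
        metadata = json.load(f)
    with open(os.path.join(path, "data.json")) as f:
        data = json.load(f)
    return metadata, data["coefficients"]


model_dir = os.path.join(tempfile.mkdtemp(), "linear_model")
save_model(model_dir, "LinearRegressionModel", {"regParam": 0.1}, [0.5, -1.25, 2.0])
metadata, coefficients = load_model(model_dir)
```

Because both files use language-neutral formats, a model written from one language API can be read from another, which is the point of the exchangeable format.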

Page 39: Combining Machine Learning Frameworks with Apache Spark

A data scientist’s wish list:

• Run original code on a production environment
– Directly apply learned pipelines
– Use MLlib as export format

• Use distributed data sources
– Built-in Spark conversions

• Use familiar APIs and libraries
• Distribute ML workload piece by piece
– Easy to distribute the most common ML tasks

Page 40: Combining Machine Learning Frameworks with Apache Spark

What’s next?

Prioritized items on the 2.1 roadmap JIRA (SPARK-15581):
• Critical feature completeness for the DataFrame-based API
– Multiclass logistic regression
– Statistics
• Python API parity & R API expansion
• Scaling & speed tuning for key algorithms: trees & ensembles

GraphFrames
• Release for Spark 2.0
• Speed improvements (join elimination, connected components)

Page 41: Combining Machine Learning Frameworks with Apache Spark

Get started

• Get involved via roadmap JIRA (SPARK-15581) + mailing lists
• Download notebook for this talk: http://dbricks.co/1UfvAH9
• ML persistence blog post: http://databricks.com/blog/2016/05/31

Try out the Apache Spark 2.0 preview release: http://databricks.com/try