Apache® Spark™ MLlib 2.x: migrating ML workloads to DataFrames


Webinar Logistics


About the speaker: Joseph Bradley

Joseph Bradley is a Software Engineer and Apache Spark Committer & PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.


About the speaker: Jules S. Damji

Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies building large-scale distributed systems.


Databricks

Founded by the creators of Apache Spark in 2013

Share of Spark code contributed by Databricks in 2014: 75%

Data → Value

Created Databricks on top of Spark to make big data simple.

Apache Spark Engine

Spark Core

Spark Streaming, Spark SQL, MLlib, GraphX

Unified engine across diverse workloads & environments
• Scale out, fault tolerant
• Python, Java, Scala, & R APIs
• Standard libraries


NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO

Source: Slide 5 of Spark Community Update

Outline
• Intro to MLlib in 2.x
• Migrating an ML workload to DataFrames
• ML persistence
• Roadmap ahead during 2.x

Intro to MLlib in 2.x

A bit of MLlib history

Spark 0.8: RDD-based API
Fast, scale-out ML

Challenges
• Expressing complex workflows
• Integrating with DataFrames
• Developing Java, Python & R APIs

Spark 1.2: DataFrame-based API (a.k.a. “Spark ML”)

Major improvements
• ML Pipelines with automated tuning
• Native DataFrame integration
• Standard API across languages

See Xiangrui Meng’s original design & prototype in SPARK-3530.


MLlib trajectory

[Figure: commits per release, Spark v0.8 through v2.0 (y-axis from 0 to 1000). Annotations: Scala/Java API, Python API, R API, DataFrame-based API for MLlib, Primary API for MLlib.]

DataFrame-based API for MLlib

DataFrames are the standard ML dataset type.

Uniform APIs for algorithms, hyperparameters, etc.

Pipelines provide utilities for constructing ML workflows and automating hyperparameter tuning.

Learn more about ML Pipelines:
http://spark-summit.org/2015/events/practical-machine-learning-pipelines-with-mllib-2
http://docs.databricks.com/spark/latest/mllib/binary-classification-mllib-pipelines.html
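To make "automating hyperparameter tuning" concrete, here is a minimal plain-Python sketch of grid search with k-fold cross-validation. It is not Spark code: in Spark ML this role is played by tuning utilities such as CrossValidator and TrainValidationSplit. The toy one-weight ridge model and all names below (fit_ridge, k_folds, cross_validate, reg_param) are hypothetical, chosen only for illustration.

```python
from itertools import product


def fit_ridge(rows, reg_param):
    """Fit y ~ w * x with an L2 penalty and return the weight w (toy model)."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + reg_param)


def mse(w, rows):
    """Mean squared error of the one-weight model on the given rows."""
    return sum((y - w * x) ** 2 for x, y in rows) / len(rows)


def k_folds(rows, k):
    """Yield (train, validation) splits; fold i holds out every k-th row starting at i."""
    for i in range(k):
        valid = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, valid


def cross_validate(rows, param_grid, k=3):
    """Try every parameter combination; return the best one and all averaged errors."""
    names = sorted(param_grid)
    results = []
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        errors = [mse(fit_ridge(train, **params), valid)
                  for train, valid in k_folds(rows, k)]
        results.append((sum(errors) / len(errors), params))
    best_error, best_params = min(results, key=lambda r: r[0])
    return best_params, results


# Noiseless toy data following y = 2x, so no regularization should win.
data = [(float(x), 2.0 * x) for x in range(1, 10)]
best, results = cross_validate(data, {"reg_param": [0.0, 1.0, 10.0]})
print(best)
```

The point of the sketch is the shape of the loop: a grid of candidate settings, a fit per fold, and one averaged metric per setting, which is the bookkeeping a tuning utility takes off your hands.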


DataFrame-based API for MLlib

In 2.0, the DataFrame-based API became the primary MLlib API.
• Voted by community
• org.apache.spark.ml, pyspark.ml

The RDD-based API is in maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib

Migrating an ML workload to DataFrames

Why migrate to DataFrames?

DataFrames

DataFrames & Datasets are the new “core” API for Spark.
• Data sources & ETL
• Latest performance improvements (Catalyst & Tungsten)
• Structured Streaming

Why migrate to DataFrames?

Language APIs

Standardized across Scala, Java, Python, and R
• Python & R match Scala/Java performance
• Cross-language persistence (saving/loading models)

Why migrate to DataFrames?

Pipelines

Specify complex ML workflows
• Chain together Transformers, Estimators, & Models
• Automated hyperparameter tuning
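The chaining of Transformers, Estimators, and Models can be sketched in plain Python. This is a toy model of the contract, not Spark's implementation: an Estimator's fit() returns a Model, a Model is itself a Transformer, and fitting a Pipeline yields a fitted pipeline of Transformers. Rows are plain dicts standing in for DataFrame rows, and the stage classes (Tokenizer, KeywordClassifier, KeywordModel) are hypothetical.

```python
class Transformer:
    """Maps rows to rows, typically by adding a column."""
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """Learns from rows and returns a fitted Transformer (a Model)."""
    def fit(self, rows):
        raise NotImplementedError


class Tokenizer(Transformer):
    """Adds a 'words' column by splitting the 'text' column."""
    def transform(self, rows):
        return [dict(r, words=r["text"].lower().split()) for r in rows]


class KeywordClassifier(Estimator):
    """Learns the words that occur only in positively labeled rows."""
    def fit(self, rows):
        pos = {w for r in rows if r["label"] == 1 for w in r["words"]}
        neg = {w for r in rows if r["label"] == 0 for w in r["words"]}
        return KeywordModel(pos - neg)


class KeywordModel(Transformer):
    """Fitted result: predicts 1 if any learned keyword is present."""
    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [dict(r, prediction=int(any(w in self.keywords for w in r["words"])))
                for r in rows]


class Pipeline(Estimator):
    """Unfitted chain of stages."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)      # Estimator becomes a Model
            fitted.append(stage)
            rows = stage.transform(rows)     # feed output to the next stage
        return PipelineModel(fitted)


class PipelineModel(Transformer):
    """Fitted chain: every stage is now a Transformer."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"text": "spark is fast", "label": 1},
         {"text": "slow batch job", "label": 0}]
model = Pipeline([Tokenizer(), KeywordClassifier()]).fit(train)
preds = model.transform([{"text": "Spark rocks"}, {"text": "slow job"}])
print([p["prediction"] for p in preds])
```

Because the whole chain is a single Estimator, one fit() call trains every stage in order, and the same fitted object can be reused on new data.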


Demo migration

Convert a notebook from the RDD-based API to the DataFrame-based API.

Key points
• Work with single models or complex Pipelines
• Incremental migration
• Many benefits: simpler APIs, SQL integration, tuning
• A few gotchas (linear algebra types)

Warning: Demo for experts!

Demo recap: migration process

Separate 2 migrations:
• Spark 1.6 → 2.0
• RDDs → DataFrames

Migrate ML APIs: spark.mllib → spark.ml
• Gotcha: a few naming changes (from standardizing algorithm APIs)
  • Certain Param and model methods
  • run() → fit()
• Tips:
  • Use explainParams()
  • Compare the API docs if you hit issues!

Migrate data APIs: RDDs → DataFrames
• Tip: Get familiar with conversion syntax in both directions.

Demo recap: migration process

Debugging runtime errors
• Gotcha: Lazy evaluation in Pipelines means bugs appear later than expected.
  • Tip: Check intermediate results.
• Gotcha: Vector (and Matrix) types differ between spark.mllib and spark.ml.
  • Relevant for Spark 1.6 → 2.0 migration
  • Tip: Watch for buried errors: MatchError and mentions of “vector”
  • Tip: Use helper methods for conversion
    • org.apache.spark.mllib.linalg.Vector.asML
    • org.apache.spark.mllib.linalg.Vectors.fromML
    • http://spark.apache.org/docs/latest/ml-guide.html#migration-guide
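The lazy-evaluation gotcha above can be illustrated with an analogy in plain Python, using generators rather than Spark. Defining the transformations succeeds even though one record is bad, and the failure surfaces only when the results are consumed, much as a DataFrame transformation only runs once an action asks for results. The names parse and raw are hypothetical.

```python
from itertools import islice


def parse(record):
    """Convert a raw string to a float; raises ValueError on bad input."""
    return float(record)


raw = ["1.0", "2.5", "oops", "4.0"]

# Defining the lazy steps succeeds: nothing has been parsed yet.
parsed = map(parse, raw)
scaled = (x * 10 for x in parsed)

# The bad record is only hit when the results are consumed, far from the
# line that introduced the bug.
try:
    total = sum(scaled)
    error = None
except ValueError as exc:
    error = str(exc)

# Tip from the slide: check intermediate results on a small sample early.
preview = list(islice(map(parse, raw), 2))

print(error)
print(preview)
```

Materializing a small sample after each step, as in the preview line, is the plain-Python counterpart of checking intermediate results in a Pipeline.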


Future benefits of migration

Currently
ML training is implemented on RDDs.

Goal
Port implementation to DataFrames.
Benefit from DataFrame optimizations (Catalyst, Tungsten).

[Diagram: Spark stack with Core (RDDs), Spark SQL (DataFrames, Datasets, SQL), and MLlib.]

Future benefits of migration

Status
First published implementation in GraphFrames (Spark package for graph processing)

Ongoing work
DataFrame improvements for iterative algorithms: checkpointing, improved caching, and more.

[Diagram: Spark stack with Core (RDDs), Spark SQL (DataFrames, Datasets, SQL), and MLlib.]

ML persistence

Why ML persistence?

Data Science
• Prototype (Python/R)
• Create model

Software Engineering
• Re-implement model for production (Java)
• Deploy model

Why ML persistence?

Data Science
• Prototype (Python/R)
• Create Pipeline
  • Extract raw features
  • Transform features
  • Select key features
  • Fit multiple models
  • Combine results to make prediction

Software Engineering
• Re-implement Pipeline for production (Java)
  • Extra implementation work
  • Different code paths
  • Synchronization overhead
• Deploy Pipeline

With ML persistence...

Data Science
• Prototype (Python/R)
• Create Pipeline
• Persist model or Pipeline: model.save("s3n://...")

Software Engineering
• Load Pipeline (Scala/Java): Model.load("s3n://…")
• Deploy in production

ML persistence status

[Diagram: Pipeline stages (text preprocessing, feature generation, random forest) with model tuning; rows for Model and Pipeline, columns for Unfitted (“recipe”) and Fitted (“result”).]

ML persistence status

Near-complete coverage in all Spark language APIs
• Scala & Java: complete
• Python: complete except for 2 algorithms
• R: complete for existing APIs

Single underlying implementation of models

Exchangeable data format
• JSON for metadata
• Parquet for model data (coefficients, etc.)
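The idea behind an exchangeable format, human-readable metadata stored next to the model data, can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not Spark's on-disk layout: the directory structure, file names, and metadata fields are hypothetical, and the coefficients are written as JSON purely as a stand-in for Parquet so the sketch needs no extra dependencies.

```python
import json
import os
import tempfile
import time


def save_model(path, class_name, uid, params, coefficients):
    """Write metadata and model data into separate subfolders of `path`."""
    os.makedirs(os.path.join(path, "metadata"))
    os.makedirs(os.path.join(path, "data"))
    metadata = {
        "class": class_name,                      # which model type to rebuild
        "timestamp": int(time.time() * 1000),
        "uid": uid,
        "paramMap": params,                       # hyperparameters used
    }
    with open(os.path.join(path, "metadata", "part-00000"), "w") as f:
        json.dump(metadata, f)
    # Stand-in for Parquet: plain JSON keeps this sketch stdlib-only.
    with open(os.path.join(path, "data", "coefficients.json"), "w") as f:
        json.dump(coefficients, f)


def load_model(path):
    """Read back the metadata and model data written by save_model."""
    with open(os.path.join(path, "metadata", "part-00000")) as f:
        metadata = json.load(f)
    with open(os.path.join(path, "data", "coefficients.json")) as f:
        coefficients = json.load(f)
    return metadata, coefficients


with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "lr_model")
    save_model(model_dir, "example.LogisticRegressionModel", "logreg_0001",
               {"regParam": 0.1, "maxIter": 10}, [0.5, -1.25])
    loaded_meta, loaded_coef = load_model(model_dir)

print(loaded_meta["class"], loaded_coef)
```

Because nothing in the saved files is tied to one language runtime, a loader written in another language only needs a JSON parser and a reader for the data format.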


Demo: ML persistence
• Can persist single models & complex workflows
• Easy to move models across Spark deployments
• Share models across teams & languages

ML persistence: pending issues

Python tuning: not yet implemented
• CrossValidator, TrainValidationSplit

R format: incompatible with Python/Java/Scala
• Issue: R wrappers are all special Pipelines.
• Working towards a fix
• Workaround: Load underlying PipelineModel from subfolder in saved model directory.

Backwards compatibility: WIP in SPARK-15573

ML persistence blog post: http://databricks.com/blog/2016/05/31

Roadmap during 2.x

Goals for MLlib in 2.x

Major initiatives
• ML persistence: saving & loading models & Pipelines
• Complete feature parity for DataFrame-based API. Missing items:
  • Frequent Pattern Mining
  • Certain methods in models
  • Developer APIs

Other important improvements
• Generalized Linear Models
• Python & R API parity
• Speed & scalability improvements

For an overview of MLlib in 2.0, see
http://spark-summit.org/2016/events/apache-spark-mllib-20-preview-data-science-and-production

Coming in 2.1

Multiclass logistic regression (SPARK-7159)

Locality sensitive hashing (SPARK-5992)

More ML in SparkR (SPARK-16442)
• ALS
• Isotonic Regression
• Multilayer Perceptron Classifier
• Random Forest
• Gaussian Mixture Model
• LDA
• Multiclass Logistic Regression
• Gradient Boosted Trees

Various speed & scalability improvements
• Random Forest, Naive Bayes, LDA, Gaussian Mixture, and others

Spark 2.1 status: Release candidates are under QA.

For release schedule, see http://spark.apache.org/versioning-policy.html

Get started

Get involved in the community
• Events & news: https://sparkhub.databricks.com/
• User mailing list: http://spark.apache.org/community.html

Get involved in development
• Dev mailing list: http://spark.apache.org/community.html
• JIRA: http://issues.apache.org/jira/browse/SPARK
• Contribute: http://spark.apache.org/contributing.html

Try out Apache Spark for free on Databricks Community Edition!
http://databricks.com/try

Many thanks to the Apache Spark community!

https://spark-summit.org/east-2017/

Thank you!
Twitter: @jkbatcmu