Best practices for productionizing Apache Spark MLlib models

Transcript
Page 1: Best practices for productionizing Apache Spark MLlib models

Best practices for productionizing Apache Spark MLlib models

Joseph Bradley March 7, 2018 Strata San Jose

Page 2

About me

Joseph Bradley • Software engineer at Databricks • Apache Spark committer & PMC member • Ph.D. in Machine Learning from Carnegie Mellon

Page 3

About Databricks

TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009

PRODUCT: Unified Analytics Platform

MISSION: Making Big Data Simple

Try for free today: databricks.com

Page 4

Apache Spark Engine

Spark Core

Spark Streaming

Spark SQL MLlib GraphX

Unified engine across diverse workloads & environments

•  Scale out, fault tolerant
•  Python, Java, Scala, & R APIs
•  Standard libraries

Page 5

MLlib’s success

•  Apache Spark integration simplifies:
   •  Deployment
   •  ETL
   •  Integration into complete analytics pipelines with SQL & streaming

•  Scalability & speed

•  Pipelines for featurization, modeling & tuning

Page 6

MLlib’s success

•  1000s of commits
•  100s of contributors
•  10,000s of users (on Databricks alone)
•  Many production use cases

Page 7

End goal: Data-driven applications

Challenge: smooth deployment to production

Page 8

Productionizing Machine Learning

Data Science / ML End Users

Prediction Servers

models results

Page 9

Deployment Options for MLlib

Latency requirement (from ~10 ms up to ~1 day):

•  Low-latency / Real-time (~10–100 ms): Spark-less highly-available prediction server
•  Streaming (~1 s to 1 min): Spark Structured Streaming
•  Batch (~1 hour to 1 day): Spark batch processing

Page 10

Challenges of Productionizing

Data Science / ML End Users

Prediction Servers

models results

Serialize

Deserialize

Make predictions

Page 11

Challenge: working across teams

Data Science / ML End Users

Prediction Servers

models results

Serialize

Deserialize

Make predictions

Page 12

Challenge: featurization logic

Data Science / ML End Users

Prediction Servers

models results

Serialize

Deserialize

Make predictions

Feature Logic ↓

Feature Logic ↓

Feature Logic ↓

Model

Page 13

ML Pipelines

Original dataset

Feature extraction

Predictive model

Text                    | Label
I bought the game...    | 4
Do NOT bother try...    | 1
this shirt is aweso...  | 5
never got it. Seller... | 1
I ordered this to...    | 3

Page 14

ML Pipelines: featurization

Original dataset

Feature extraction

Predictive model

Text                    | Label | Words              | Features
I bought the game...    | 4     | “i”, “bought”,...  | [1, 0, 3, 9, ...]
Do NOT bother try...    | 1     | “do”, “not”,...    | [0, 0, 11, 0, ...]
this shirt is aweso...  | 5     | “this”, “shirt”    | [0, 2, 3, 1, ...]
never got it. Seller... | 1     | “never”, “got”     | [1, 2, 0, 0, ...]
I ordered this to...    | 3     | “i”, “ordered”     | [1, 0, 0, 3, ...]

Page 15

ML Pipelines: model

Text                    | Label | Words              | Features           | Prediction | Probability
I bought the game...    | 4     | “i”, “bought”,...  | [1, 0, 3, 9, ...]  | 4          | 0.8
Do NOT bother try...    | 1     | “do”, “not”,...    | [0, 0, 11, 0, ...] | 2          | 0.6
this shirt is aweso...  | 5     | “this”, “shirt”    | [0, 2, 3, 1, ...]  | 5          | 0.9
never got it. Seller... | 1     | “never”, “got”     | [1, 2, 0, 0, ...]  | 1          | 0.7
I ordered this to...    | 3     | “i”, “ordered”     | [1, 0, 0, 3, ...]  | 4          | 0.7

Original dataset

Feature extraction

Predictive model

Page 16

Challenge: various environments

Data Science / ML

Prediction Servers

models, results

Make predictions → results

Make predictions → results

Make predictions → results

Page 17

Summary of challenges

Sharing models across teams and across systems & environments, while maintaining identical behavior both now and in the future:

•  model & pipeline persistence/export
•  dev, staging, prod
•  including featurization
•  versioning & compatibility

Page 18

Production architectures for ML

Page 19

Architecture A: batch

Pre-compute predictions using Spark and serve from a database

Recurring batch: Train ALS Model → Save Offers to NoSQL → Ranked Offers

Serving: Send Email Offers to Customers; Display Ranked Offers in Web / Mobile

E.g.: content recommendation

Page 20

Architecture B: streaming

Score in Spark Streaming + use an API with cached predictions

Streaming: Web Activity Logs → Compute Features → Run Prediction → Cached Predictions

Serving: API Check → Kill User’s Login Session

E.g.: monitoring web sessions

Page 21

Architecture C: sub-second

Train with Spark and score outside of Spark

Train Model in Spark → Save Model to S3 / HDFS → Copy Model to Production

Prediction Server or Application: New Data → Predictions

E.g.: card swipe fraud detection

Page 22

Solving challenges with Apache Spark MLlib

Page 23

MLlib solutions by architecture

A: Batch scoring            | B: Streaming                 | C: Sub-second
•  ML Pipelines             | •  ML Pipelines (Spark 2.3)  | •  3rd-party solutions
•  Spark SQL & custom logic | •  Spark SQL & custom logic  |

Page 24

A: Batch scoring in Spark

ML Pipelines cover most featurization and modeling
•  Save models and Pipelines via ML persistence: pipeline.save(path)

Simple to add custom logic
•  Spark SQL
   –  Save workflows via notebooks, JARs, and Jobs
•  Custom ML Pipeline components
   –  Save via ML persistence

Page 25

B: Streaming scoring in Spark

As of Apache Spark 2.3, same as batch:
•  ML Pipelines cover most featurization + modeling
•  Simple to add custom logic
   •  Spark SQL
   •  Custom ML Pipeline components

But be aware of critical updates in Spark 2.3!

Page 26

Scoring with Structured Streaming in Spark 2.3

Some existing Pipelines will need fixes:
•  OneHotEncoder → OneHotEncoderEstimator
•  VectorAssembler: sometimes needs a VectorSizeHint

RFormula has been updated & works out of the box.

Page 27

(demo of streaming)

Page 28

The nitty gritty

One-hot encoding
•  (Spark 2.2) Transformer: stateless transform of the DataFrame.
•  (Spark 2.3) Estimator: records categories during fitting and uses the same categories during scoring.
•  Important fix for both batch & streaming!

Feature vector assembly (including in RFormula)
•  (Spark 2.2) Vector size sometimes inferred from data
•  (Spark 2.3) Add a size hint to the Pipeline when needed

Page 29

C: Sub-second scoring

For REST APIs and embedded applications

Requirements:
•  Lightweight deployment (no Spark dependency)
•  Milliseconds for prediction (no SparkSession or Spark jobs)

Several 3rd-party solutions exist:
•  Databricks Model Export
•  MLeap
•  PMML and PFA
•  H2O

Page 30

Lessons from Databricks Model Export

Most engineering work is in testing
•  Identical behavior in MLlib and in exported models
   –  Including in complex Pipelines
•  Automated testing to catch changes in MLlib
•  Backwards-compatibility tests

Backwards compatibility & stability guarantees are critical
•  Added explicit guarantees to the MLlib docs: https://spark.apache.org/docs/latest/ml-pipeline.html#backwards-compatibility-for-ml-persistence

Page 31

Summary

A: Batch scoring            | B: Streaming                 | C: Sub-second
•  ML Pipelines             | •  ML Pipelines (Spark 2.3)  | •  3rd-party solutions
•  Spark SQL & custom logic | •  Spark SQL & custom logic  |

Additional challenges outside the scope of this talk:
•  Feature and model management
•  Monitoring
•  A/B testing

Page 32

Resources

Overview of productionizing Apache Spark ML models

Webinar with Richard Garris: http://go.databricks.com/apache-spark-mllib-2.x-how-to-productionize-your-machine-learning-models

Batch scoring

Apache Spark docs: https://spark.apache.org/docs/latest/ml-pipeline.html#ml-persistence-saving-and-loading-pipelines

Streaming scoring

Guide and example notebook: https://tinyurl.com/y7bk5plu

Sub-second scoring

Webinar with Sue Ann Hong: https://www.brighttalk.com/webcast/12891/268455/productionizing-apache-spark-mllib-models-for-real-time-prediction-serving

Page 33

Aside: new in Apache Spark 2.3

https://databricks.com/blog/2018/02/28
•  Fixes for ML scoring in Structured Streaming (this talk)
•  ImageSchema and image utilities to enable Deep Learning use cases on Spark
•  Python API improvements for developing custom algorithms
•  And much more! Available now in Databricks Runtime 4.0!

Page 34

Blog post: http://dbricks.co/2sK35XT

Available from O’Reilly: http://shop.oreilly.com/product/0636920034957.do

Page 35

https://databricks.com/careers

Page 36

Thank You! Questions?