Best practices for productionizing Apache Spark MLlib models


About me
Joseph Bradley
• Software engineer at Databricks
• Apache Spark committer & PMC member
• Ph.D. in Machine Learning from Carnegie Mellon
About Databricks
Started Spark project (now Apache Spark) at UC Berkeley in 2009
Product: Unified Analytics Platform
Apache Spark Engine
Unified engine across diverse workloads & environments:
• Scale out, fault tolerant
• Python, Java, Scala, & R APIs
• Standard libraries
MLlib’s success
• Scalability & speed
• Pipelines for featurization, modeling & tuning
• Many production use cases (on Databricks alone)
End goal: Data-driven applications
[Diagram: prediction servers exchange models & results across a latency spectrum — 10 ms, 100 ms, 1 s (low-latency, real-time), 1 min (streaming), up to 1 hour and 1 day (batch)]
Text                    | Label
Do NOT bother try...    | 1
this shirt is aweso...  | 5
never got it. Seller... | 1
I ordered this to...    | 3
ML Pipelines: featurization
Original dataset → Feature extraction → Predictive model
Text                    | Label | Words               | Features
I bought the game...    | 4     | “i”, “bought”, ...  | [1, 0, 3, 9, ...]
Do NOT bother try...    | 1     | “do”, “not”, ...    | [0, 0, 11, 0, ...]
this shirt is aweso...  | 5     | “this”, “shirt”     | [0, 2, 3, 1, ...]
never got it. Seller... | 1     | “never”, “got”      | [1, 2, 0, 0, ...]
I ordered this to...    | 3     | “i”, “ordered”      | [1, 0, 0, 3, ...]
ML Pipelines: model
Text                    | Label | Words               | Features           | Prediction | Probability
I bought the game...    | 4     | “i”, “bought”, ...  | [1, 0, 3, 9, ...]  | 4          | 0.8
Do NOT bother try...    | 1     | “do”, “not”, ...    | [0, 0, 11, 0, ...] | 2          | 0.6
this shirt is aweso...  | 5     | “this”, “shirt”     | [0, 2, 3, 1, ...]  | 5          | 0.9
never got it. Seller... | 1     | “never”, “got”      | [1, 2, 0, 0, ...]  | 1          | 0.7
I ordered this to...    | 3     | “i”, “ordered”      | [1, 0, 0, 3, ...]  | 4          | 0.7
Challenge: sharing models across teams and across systems & environments (dev, staging, prod) while maintaining identical behavior, both now and in the future.
Solution: model & pipeline persistence/export.
Architecture A: batch
Pre-compute predictions using Spark and serve from a database.
E.g.: content recommendation:
Train ALS Model → Save Offers to NoSQL → Send Email Offers to Customers / Mobile
Architecture B: streaming
Score in Spark Streaming + use an API with cached predictions.
E.g.: monitoring web sessions:
Web Activity Logs → Spark Streaming scoring → API check by Prediction Server or Application
Architecture C: sub-second
Train with Spark and score outside of Spark:
Train Model in Spark → export model → score New Data outside of Spark
Solving challenges with Apache Spark MLlib
MLlib solutions by architecture
• ML Pipelines
• Spark SQL &
A: Batch scoring in Spark
• ML Pipelines cover most featurization and modeling
  – Save models and Pipelines via ML persistence: pipeline.save(path)
• Simple to add custom logic
  – Spark SQL
  – Save workflows via notebooks, JARs, and Jobs
• Custom ML Pipeline components
  – Save via ML persistence
B: Streaming scoring in Spark
As of Apache Spark 2.3, same as batch:
• ML Pipelines cover most featurization + modeling
• Simple to add custom logic: Spark SQL, custom ML Pipeline components
But be aware of critical updates in Spark 2.3!
Scoring with Structured Streaming in Spark 2.3
Some existing Pipelines will need fixes:
• OneHotEncoder → OneHotEncoderEstimator
• VectorAssembler sometimes needs a VectorSizeHint
RFormula has been updated & works out of the box.
(demo of streaming)
The nitty gritty
Feature vector assembly (including in RFormula):
• (Spark 2.2) Vector size sometimes inferred from data
• (Spark 2.3) Add size hint to the Pipeline when needed
C: Sub-second scoring
For REST APIs and embedded applications.
Requirements:
• Lightweight deployment (no Spark dependency)
• Milliseconds for prediction (no SparkSession or Spark jobs)
Several 3rd-party solutions exist:
• Databricks Model Export
• MLeap
• PMML and PFA
• H2O
Lessons from Databricks Model Export
Most engineering work is in testing:
• Identical behavior in MLlib and in exported models, including in complex Pipelines
• Automated testing to catch changes in MLlib
• Backwards compatibility tests
Backwards compatibility & stability guarantees are critical.
• Added explicit guarantees to MLlib docs: https://spark.apache.org/docs/latest/ml-pipeline.html#backwards-compatibility-for-ml-persistence
• ML Pipelines
• Spark SQL &
• 3rd-party solutions
Additional challenges outside the scope of this talk:
• Feature and model management
• Monitoring
• A/B testing
Resources
Overview of productionizing Apache Spark ML models — webinar with Richard Garris: http://go.databricks.com/apache-spark-mllib-2.x-how-to-productionize-your-machine-learning-models
Sub-second scoring: https://databricks.com/blog/2018/02/28
• Fixes for ML scoring in Structured Streaming (this talk)
• ImageSchema and image utilities to enable Deep Learning use cases on Spark
• Python API improvements for developing custom algorithms
• And much more! Available now in Databricks Runtime 4.0: http://dbricks.co/2sK35XT