Spark and machine learning in microservices architecture
Spark and Machine Learning in Microservices Architecture
by Stepan Pushkarev CTO of Hydrosphere.io
About
Mission: Accelerate Machine Learning to Production
Opensource Products:
- Mist: Spark Compute as a Service
- ML Lambda: ML Function as a Service
- Sonar: Data and ML Monitoring
Business Model: Subscription services and hands-on consulting
Spark/Kafka Users here?
Machine Learning nerds here?
Stateless Microservices are well studied
Data, reporting and machine learning architectures are different
● Raw SQL / HiveQL / SQL on Hadoop
● Data warehouse / Data Lake centric
● Script-driven: ./bin/spark-submit
● Automated with Cron and/or Workflow Managers
● Hosted Notebooks culture
● Traditionally offline / for internal users
● File system aware (HDFS, S3)
● Defined by all-inclusive Hadoop distributions
Agenda
- Data Pipelines on Microservices
- ML Functions as low-latency prediction services
Part 1: Data Pipeline Intuition
Need to transform source data into the desired shape
Evolution: Script-driven development
[Diagram: a Python script feeding a SQL script]
Evolution: More Scripts
[Diagram: a growing chain of alternating Python and SQL scripts]
Evolution: Move to Hive/Spark
[Diagram: a mixed chain of Spark, Hive, Python, and SQL scripts]
Evolution: DAG schedulers
Typical Python/SQL/Hive/Spark Task:
1) Read data
2) Transform
3) Save to a temp table for the next task
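As a rough sketch of this pattern (paths, the column name, and the filter are all illustrative), each task looks something like:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("task2").getOrCreate()
import spark.implicits._

val raw = spark.read.parquet("/shared/task1_out")             // 1) read what Task 1 left behind
val cleaned = raw.filter($"amount" > 0)                       // 2) transform
cleaned.write.mode("overwrite").parquet("/shared/task2_out")  // 3) save temp state for the next task

Every task hard-codes knowledge of the shared folder, which is exactly the state problem listed next.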
Problem: Unmanageable State in a Shared Folder
- Data flow is not managed; DAG scheduling is a different concern.
- Who is responsible for schema migration: Task 1, Task 2, or the Manager?
- Which folder should Task 1 write to, and Task 2 read from?
- How to manage folders/resources between parallel sessions?
- When and how to clean up the shared folder? Another cleanup pipeline?
- How to check that a data batch has arrived and is valid?
- How to unit test it?
- How to handle errors?
Statesafe Pipelines
1. Get rid of the Workflow Manager!
2. Turn black-box tasks and scripts into microservices.
3. Use Avro data contracts between stages. Data is also an API to be standardized, versioned, and validated.
4. Segregate black-box tasks into (read), (process), and (write) services.
5. Keep the state in the shared folder/topic/session manageable by the framework rather than by data engineers.
6. Abstract the engineer from data transport and provide a pure function to deal with (sketched below).
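A minimal sketch of points 5 and 6, with all names invented for illustration: the framework owns transport and state, and the engineer supplies only a pure function.

import org.apache.spark.sql.{DataFrame, SparkSession}

object Framework {
  // Business logic is a pure DataFrame transformation: no paths, no cleanup.
  type Stage = DataFrame => DataFrame

  // The framework reads the previous stage's output, applies the pure function,
  // and writes the input of the next stage. Avro files carry their schema, so
  // the data contract travels with the data (assumes the spark-avro package).
  def run(spark: SparkSession, inPath: String, outPath: String)(stage: Stage): Unit = {
    val in = spark.read.format("avro").load(inPath)
    stage(in).write.format("avro").save(outPath)
  }
}

// The data engineer writes only this:
val wordCounts: Framework.Stage = df => df.groupBy("word").count()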
Statesafe Data Pipeline
Zoom in: Data Transport between stages
Statesafe Stage
Zoom in: Spark Stage Details
Zoom in: Statesafe API for Business Logic
On-demand batch pipelines
- Cannot be pre-calculated
- On-demand parametrized jobs
- Interactive
- Require large-scale processing

Use cases:
- Reporting
- Simulation (pricing, bank stress testing, taxi rides)
- Forecasting (ad campaigns, energy savings, others)
- Ad-hoc analytics tools for business users
Bad Practice: Database as API
[Diagram: the web app sets a flag and parameters in the database to request a report and polls for the result; a worker polls for new tasks, executes the reporting job, marks it complete, and saves the result back]
Batch prediction / reporting services
From Vanilla Spark to Spark Compute as a Service
./bin/spark-submit
- Spark Sessions Pool
- REST API Framework
- Data API Framework
- Infrastructure Integration (EMR, Hortonworks, etc.)
Spark Compute as a Service
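As a hypothetical sketch of the idea (deliberately not the actual Mist API): a long-running service keeps a pool of SparkSessions and exposes parametrized jobs over REST, instead of forking spark-submit per run.

import org.apache.spark.sql.SparkSession

// A job is a parametrized function over a session the service manages.
trait SparkJob[P, R] { def run(spark: SparkSession, params: P): R }

case class ReportParams(date: String, region: String)

object SalesReport extends SparkJob[ReportParams, Long] {
  def run(spark: SparkSession, p: ReportParams): Long = {
    import spark.implicits._
    spark.read.parquet(s"/data/sales/${p.date}")  // illustrative path
      .filter($"region" === p.region)
      .count()
  }
}

// The service would map POST /jobs/sales-report {"date": "2017-06-01", "region": "EU"}
// to SalesReport.run with a session taken from the pool and return the result as JSON.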
DEMO
Part 2: Machine Learning
[Diagram: a data scientist trains a model from data on the cluster; how the model gets to the web app is the open question]
Machine Learning: training + serving (deployment)
[Diagram: a Training (Estimation) pipeline of preprocess stages followed by a train stage, and a Prediction pipeline that reuses the same preprocess stages in front of the fitted model]
import org.apache.spark.ml.PipelineModel

// Tuple1 builds the single-column DataFrame; a bare Seq[String] would not.
val test = spark.createDataFrame(Seq(Tuple1("spark hadoop"), Tuple1("hadoop learning"))).toDF("text")
val model = PipelineModel.load("/tmp/spark-model")
model.transform(test).collect()
Even this simple scoring still needs a Spark runtime: ./bin/spark-submit …
https://issues.apache.org/jira/browse/SPARK-16365
https://issues.apache.org/jira/browse/SPARK-13944
[Diagram: the data scientist exports the trained model from the cluster to the web app via PMML, PFA, or MLeap]
- Yet another Format Lock
- Code & state duplication
- Limited extensibility
- Inconsistency
- Extra moving parts
[Diagram: the web app calls the model through an API exposed by the cluster]
- Needs Spark running
- High latency, low throughput
[Diagram: the model and its libs are packaged behind an API in a Docker container, served next to the web app]
Spark ML Local Serving Library: https://github.com/Hydrospheredata/spark-ml-serving
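The underlying idea, in a deliberately tiny sketch (this is not the spark-ml-serving API, just an illustration that scoring a fitted model is plain math once training is done):

import org.apache.spark.ml.linalg.{Vector, Vectors}

// A fitted logistic regression reduced to its parameters: scoring needs
// no SparkContext and no cluster, only arithmetic over mllib-local vectors.
case class LocalLogReg(coefficients: Vector, intercept: Double) {
  def predictProbability(features: Vector): Double = {
    val margin = (0 until features.size).map(i => features(i) * coefficients(i)).sum + intercept
    1.0 / (1.0 + math.exp(-margin))  // sigmoid
  }
}

val model = LocalLogReg(Vectors.dense(0.3, -1.2), 0.1)
model.predictProbability(Vectors.dense(1.0, 0.5))  // served from any JVM process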
Model Container
Zoo: Models - Runtimes - Standards
API & Logistics
- HTTP/1.1, HTTP/2, gRPC
- Kafka, Flink, Kinesis
- Protobuf, Avro
- Service Discovery
- Pipelining
- Tracing
- Monitoring
- Autoscaling
- Versioning
- A/B, Canary
- Testing
- CPU, GPU
Proposed Architecture
UX: Train anywhere and deploy as a Function
UX: Models and Applications
Applications provide public endpoints for the models and for compositions of the models.
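Purely as an illustration of "composition" (types and logic invented here): if every served model is a function from a feature map to a feature map, an application is just their composition behind one public endpoint.

type ModelService = Map[String, Any] => Map[String, Any]

// Two toy stages standing in for deployed models.
val preprocess: ModelService = in =>
  in + ("tokens" -> in("text").toString.split(" ").toSeq)
val classify: ModelService = in =>
  in + ("label" -> (if (in("tokens").asInstanceOf[Seq[String]].contains("spark")) 1 else 0))

// The application chains them; the composite is what gets the public endpoint.
val application: ModelService = preprocess andThen classify
application(Map("text" -> "spark hadoop"))  // Map(text -> ..., tokens -> ..., label -> 1)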
UX: Streaming Applications + Batching
UX: Pipelines, Ensembles, and Best-SLA Applications
Demo!!!
Thank you
Looking for:
- Feedback
- Advisors, mentors & partners
- Pilots and early adopters
Stay in touch
- @hydrospheredata
- https://github.com/Hydrospheredata
- https://hydrosphere.io/