Spark and machine learning in microservices architecture

Spark and Machine Learning in Microservices Architecture by Stepan Pushkarev CTO of Hydrosphere.io


Page 1: Spark and machine learning in microservices architecture

Spark and Machine Learning in Microservices Architecture

by Stepan Pushkarev CTO of Hydrosphere.io

Page 2: Spark and machine learning in microservices architecture

About

Mission: Accelerate Machine Learning to Production

Open-source products:

- Mist: Spark Compute as a Service

- ML Lambda: ML Function as a Service

- Sonar: Data and ML Monitoring

Business Model: Subscription services and hands-on consulting

Page 3: Spark and machine learning in microservices architecture

Spark/Kafka Users here?

Machine Learning nerds here?

Page 4: Spark and machine learning in microservices architecture

Stateless Microservices are well studied

Page 5: Spark and machine learning in microservices architecture

Data, reporting and machine learning architectures are different

● Raw SQL / HiveQL / SQL on Hadoop

● Data warehouse / data lake centric

● Script-driven: ./bin/spark-submit

● Automated with Cron and/or Workflow Managers

● Hosted Notebooks culture

● Traditionally offline / for internal users

● File system aware (HDFS, S3)

● Defined by all-inclusive Hadoop distributions

Page 6: Spark and machine learning in microservices architecture

Agenda

- Data Pipelines on Microservices

- ML Functions as low latency prediction services

Page 7: Spark and machine learning in microservices architecture

Part 1: Data Pipeline Intuition

Need to transform Source Data into desired shape

Page 8: Spark and machine learning in microservices architecture

Evolution: Script-driven development

Python Script

SQL Script

Page 9: Spark and machine learning in microservices architecture

Evolution: More Scripts

Python Script

SQL Script

Python Script

SQL Script

Python Script

SQL Script

Page 10: Spark and machine learning in microservices architecture

Evolution: Move to Hive/Spark

Spark Script

Hive Script

Spark Script

Python Script

Hive Script

SQL Script

Page 11: Spark and machine learning in microservices architecture

Evolution: DAG schedulers

Page 12: Spark and machine learning in microservices architecture

Typical Python/SQL/Hive/Spark Task

1) Read data

2) Transform

3) Save to temp table for the next task

Page 13: Spark and machine learning in microservices architecture

Problem: Unmanageable State in Shared folder

- Data flow is not managed. DAG scheduling is a separate concern.

- Who is responsible for schema migration? Task 1, Task 2 or the Manager?

- Which folder should Task 1 write to and Task 2 read from?

- How to manage folders/resources between parallel sessions?

- When and how to clean up the shared folder? Another cleanup pipeline?

- How to check that a data batch has arrived and is valid?

- How to unit test it?

- How to handle errors?

Page 14: Spark and machine learning in microservices architecture

Statesafe Pipelines

1. Get rid of the Workflow Manager!

2. Turn black box tasks and scripts into microservices.

3. Use Avro data contracts between stages. Data is also an API to be standardized, versioned and validated.

4. Segregate black box tasks into (read), (process) and (write) services.

5. Keep the state in the shared folder/topic/session managed by the framework rather than by data engineers.

6. Abstract the engineer away from data transport and provide a pure function to work with.
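
A minimal sketch of points 3 to 6, in plain Scala with no Spark or Avro dependency. Every name here (Contract, StatesafeStage, run, StatesafeDemo) is hypothetical and only illustrates the idea: the framework reads, validates each batch against a versioned contract and writes, while the engineer supplies a pure process function. A real setup would presumably validate against Avro schemas rather than the simple field-set check assumed below.

```scala
// Hypothetical names; an illustration, not the API of any Hydrosphere product.
// A versioned data contract: the fields every record in a batch must carry.
final case class Contract(name: String, version: Int, fields: Set[String])

object StatesafeStage {
  type Record = Map[String, String]

  // Reject a batch as soon as one record lacks a field required by the contract.
  def validate(contract: Contract, batch: Seq[Record]): Either[String, Seq[Record]] =
    batch.indexWhere(r => !contract.fields.subsetOf(r.keySet)) match {
      case -1 => Right(batch)
      case i  => Left(s"${contract.name} v${contract.version}: record $i is missing fields")
    }

  // The framework owns read and write; the engineer supplies only `process`.
  // Returns the number of records written, or the first validation error.
  def run(
      read: () => Seq[Record],
      in: Contract,
      process: Seq[Record] => Seq[Record],
      out: Contract,
      write: Seq[Record] => Unit
  ): Either[String, Int] =
    for {
      input  <- validate(in, read())
      output <- validate(out, process(input))
    } yield { write(output); output.size }
}

object StatesafeDemo {
  import StatesafeStage._

  val rides  = Contract("rides", 1, Set("id", "km"))
  val priced = Contract("priced-rides", 1, Set("id", "price"))

  // Pure business logic: no folders, topics or sessions in sight.
  def price(batch: Seq[Record]): Seq[Record] =
    batch.map(r => Map("id" -> r("id"), "price" -> (r("km").toDouble * 2).toString))

  // Runs the stage against an in-memory source and sink, which is also how it can be unit tested.
  def runWith(source: Seq[Record]): (Either[String, Int], Seq[Record]) = {
    val sink   = scala.collection.mutable.ListBuffer.empty[Record]
    val result = run(() => source, rides, price, priced, batch => { sink ++= batch; () })
    (result, sink.toList)
  }
}
```

Because the transport is passed in, the same price function can be exercised with an in-memory source and sink, and a batch that violates the input contract never reaches the business logic or the sink.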

Page 15: Spark and machine learning in microservices architecture

Statesafe Data Pipeline

Page 16: Spark and machine learning in microservices architecture

Zoom in: Data Transport between stages

Page 17: Spark and machine learning in microservices architecture

Statesafe Stage

Page 18: Spark and machine learning in microservices architecture

Zoom in: Spark Stage Details

Page 19: Spark and machine learning in microservices architecture

Zoom in: Statesafe API for Business Logic

Page 20: Spark and machine learning in microservices architecture

Zoom in: Statesafe API for Business Logic

Page 21: Spark and machine learning in microservices architecture

On-demand batch pipelines

- Could not be pre-calculated

- On-demand parametrized jobs

- Interactive

- Require large scale processing

- Reporting

- Simulation (pricing, bank stress testing, taxi rides)

- Forecasting (ad campaign, energy savings, others)

- Ad-hoc analytics tools for business users

Page 22: Spark and machine learning in microservices architecture

Bad Practice: Database as API

- Set a flag and parameters to build a report

- Poll for new tasks

- Execute reporting job

- Mark job as complete & save result

- Poll for result

Page 23: Spark and machine learning in microservices architecture

Batch prediction / reporting services

Page 24: Spark and machine learning in microservices architecture

From Vanilla Spark to Spark Compute as a Service

./bin/spark-submit

- Spark Sessions Pool

- REST API Framework

- Data API Framework

- Infrastructure Integration (EMR, Hortonworks, etc)
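
One way to picture the step from ./bin/spark-submit to a service is a pool of warm sessions behind a job API. The sketch below is an assumption-laden illustration, not Mist's API: Session is a hypothetical stand-in for a running Spark session (so the code runs without Spark), and SessionPool, ComputeService and ComputeDemo are made-up names.

```scala
import java.util.concurrent.ArrayBlockingQueue
import scala.concurrent.{Await, ExecutionContext, Future, blocking}
import scala.concurrent.duration._

// Hypothetical stand-in for a warm Spark session; no Spark dependency here.
final class Session(val id: Int)

// Sessions are started once and reused, so a request does not pay job startup cost.
final class SessionPool(size: Int) {
  private val idle = new ArrayBlockingQueue[Session](size)
  (1 to size).foreach(i => idle.put(new Session(i)))

  // Borrow a session (blocking while none is free), run the job, always return the session.
  def withSession[A](job: Session => A): A = {
    val s = idle.take()
    try job(s) finally idle.put(s)
  }
}

// What a REST handler could call: a parametrized job that returns a result,
// instead of a script that leaves its output in a shared folder or table.
final class ComputeService(pool: SessionPool)(implicit ec: ExecutionContext) {
  def submit[A](params: Map[String, String])(job: (Session, Map[String, String]) => A): Future[A] =
    Future(blocking(pool.withSession(s => job(s, params))))
}

object ComputeDemo {
  def run(): String = {
    import ExecutionContext.Implicits.global
    val service = new ComputeService(new SessionPool(2))
    val report  = service.submit(Map("region" -> "EU")) { (_, p) => s"report for ${p("region")}" }
    Await.result(report, 5.seconds)
  }
}
```

The caller gets a Future for the result rather than setting a flag in a database and polling for completion.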

Page 25: Spark and machine learning in microservices architecture

Spark Compute as a Service

DEMO

Page 26: Spark and machine learning in microservices architecture

Part 2: Machine Learning

Page 27: Spark and machine learning in microservices architecture

Diagram labels: cluster, data, model, data scientist, ?, web app

Page 28: Spark and machine learning in microservices architecture

Machine Learning: training + serving (deployment)

Page 29: Spark and machine learning in microservices architecture

Training (Estimation) pipeline

Diagram labels: pipeline: preprocess, preprocess, train

Page 30: Spark and machine learning in microservices architecture

Prediction Pipeline

Diagram labels: pipeline: preprocess, preprocess

Page 31: Spark and machine learning in microservices architecture

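import org.apache.spark.ml.PipelineModel

// `spark` is an existing SparkSession (e.g. in spark-shell). Tuple1 makes each string a one-column row.
val test = spark.createDataFrame(Seq(Tuple1("spark hadoop"), Tuple1("hadoop learning"))).toDF("text")

// Load the fitted pipeline saved by the training job.
val model = PipelineModel.load("/tmp/spark-model")

model.transform(test).collect()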

Page 32: Spark and machine learning in microservices architecture

./bin/spark-submit …

Page 33: Spark and machine learning in microservices architecture

https://issues.apache.org/jira/browse/SPARK-16365

https://issues.apache.org/jira/browse/SPARK-13944

Page 34: Spark and machine learning in microservices architecture

Diagram labels: cluster, data, model, data scientist, web app, PMML, PFA, MLeap

- Yet another Format Lock

- Code & state duplication

- Limited extensibility

- Inconsistency

- Extra moving parts

Page 35: Spark and machine learning in microservices architecture

Diagram labels: cluster, data, model, data scientist, web app, API, API

- Needs Spark Running

- High latency, low throughput

Page 36: Spark and machine learning in microservices architecture

Diagram labels: cluster, data, model, data scientist, web app, docker, API, libs, model

Spark ML Local Serving Library: https://github.com/Hydrospheredata/spark-ml-serving
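
To make the idea of local serving concrete, here is a minimal sketch in plain Scala of applying an already-fitted text pipeline to a single row without a Spark session. It does not use or reproduce the spark-ml-serving API; LocalServing, its stage functions, the vocabulary and the coefficients are all assumptions invented for the illustration.

```scala
// Hypothetical sketch: a fitted pipeline as plain functions over one row.
object LocalServing {
  type Row   = Map[String, Any]
  type Stage = Row => Row // one fitted pipeline stage applied to one row

  // Split the "text" column into lower-case tokens.
  val tokenizer: Stage = row =>
    row + ("words" -> row("text").toString.toLowerCase.split("\\s+").toList)

  // Count occurrences of each vocabulary term; the vocabulary is fitted state.
  def countVectorizer(vocab: Map[String, Int]): Stage = row => {
    val counts = Array.fill(vocab.size)(0.0)
    row("words").asInstanceOf[Seq[String]].foreach(w => vocab.get(w).foreach(i => counts(i) += 1.0))
    row + ("features" -> counts.toVector)
  }

  // Logistic regression with already-trained coefficients.
  def logisticRegression(weights: Vector[Double], intercept: Double): Stage = row => {
    val x           = row("features").asInstanceOf[Vector[Double]]
    val margin      = x.zip(weights).map { case (a, b) => a * b }.sum + intercept
    val probability = 1.0 / (1.0 + math.exp(-margin))
    row + ("probability" -> probability) + ("prediction" -> (if (probability > 0.5) 1.0 else 0.0))
  }

  // A "pipeline model" is just the stages applied in order.
  def pipeline(stages: Stage*): Stage = stages.reduce(_ andThen _)
}

object LocalServingDemo {
  import LocalServing._

  // Made-up fitted state: vocabulary indices and coefficients.
  val model: Stage = pipeline(
    tokenizer,
    countVectorizer(Map("spark" -> 0, "hadoop" -> 1, "learning" -> 2)),
    logisticRegression(Vector(1.5, -0.5, 0.25), intercept = 0.0)
  )

  def predict(text: String): Double =
    model(Map("text" -> text))("prediction").asInstanceOf[Double]
}
```

With these made-up coefficients, LocalServingDemo.predict("spark hadoop") yields 1.0 and predict("hadoop learning") yields 0.0, each as an ordinary in-process function call that could sit behind an API inside a container.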

Page 37: Spark and machine learning in microservices architecture

Model Container

Page 38: Spark and machine learning in microservices architecture

Zoo: Models - Runtimes - Standards

Page 39: Spark and machine learning in microservices architecture

API & Logistics

- HTTP/1.1, HTTP/2, gRPC

- Kafka, Flink, Kinesis

- Protobuf, Avro

- Service Discovery

- Pipelining

- Tracing

- Monitoring

- Autoscaling

- Versioning

- A/B, Canary

- Testing

- CPU, GPU
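
As one example from this list, a canary rollout between two model versions can be sketched as deterministic, percentage-based routing. This is only an assumed illustration in plain Scala: CanaryRouter, the Model type and the bucket scheme are hypothetical and not taken from any product mentioned in the talk.

```scala
// Hypothetical sketch of canary routing between a stable and a canary model version.
object CanaryRouter {
  type Model = String => String

  // Route about `canaryPercent` percent of keys to the canary.
  // The bucket depends only on the key, so one caller always sees the same version.
  def route(stable: Model, canary: Model, canaryPercent: Int)(key: String, input: String): String = {
    val bucket = java.lang.Math.floorMod(key.hashCode, 100) // always in 0..99
    if (bucket < canaryPercent) canary(input) else stable(input)
  }
}

object CanaryDemo {
  val v1: CanaryRouter.Model = in => s"v1:$in"
  val v2: CanaryRouter.Model = in => s"v2:$in"

  def call(percent: Int, key: String): String =
    CanaryRouter.route(v1, v2, percent)(key, "x")
}
```

Setting the percentage to 0 or 100 turns the same router into a plain version switch, which keeps rollout and rollback a configuration change rather than a redeploy.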

Page 40: Spark and machine learning in microservices architecture

Proposed Architecture

Page 41: Spark and machine learning in microservices architecture

UX: Train anywhere and deploy as a Function

Page 42: Spark and machine learning in microservices architecture

UX: Models and Applications

Applications provide public endpoints for the models and compositions of the models.
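
The relationship between models and applications can be sketched as follows. This is a hedged illustration in plain Scala under assumed names (ModelApps, Application, the two toy models and the endpoint paths are all hypothetical): a model is a function over a payload, and an application binds an endpoint to one model or to several applied in order.

```scala
// Hypothetical sketch: applications as named endpoints over model compositions.
object ModelApps {
  type Payload = Map[String, Double]
  type Model   = Payload => Payload

  // A public endpoint backed by one model, or by several applied in order.
  final case class Application(endpoint: String, stages: List[Model]) {
    def serve(request: Payload): Payload =
      stages.foldLeft(request)((data, model) => model(data))
  }

  // Two toy "models", assumed purely for illustration.
  val scale: Model = p => p + ("scaled" -> (p("x") / 10))
  val score: Model = p => p + ("score" -> (p("scaled") * 2))

  // The same model can be reused by several applications.
  val apps: Map[String, Application] = List(
    Application("/scale", List(scale)),
    Application("/score", List(scale, score))
  ).map(a => a.endpoint -> a).toMap
}
```

For instance, ModelApps.apps("/score").serve(Map("x" -> 5.0)) returns the request enriched with scaled = 0.5 and score = 1.0, while "/scale" exposes the first model on its own.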

Page 43: Spark and machine learning in microservices architecture

UX: Streaming Applications + Batching

Page 44: Spark and machine learning in microservices architecture

UX: Pipelines, Ensembles and BestSLA Applications

Page 45: Spark and machine learning in microservices architecture

Demo!!!

Page 46: Spark and machine learning in microservices architecture

Thank you

Looking for

- Feedback

- Advisors, mentors & partners

- Pilots and early adopters

Stay in touch

- @hydrospheredata

- https://github.com/Hydrospheredata

- https://hydrosphere.io/
