
Spark and Machine Learning in Microservices Architecture

by Stepan Pushkarev, CTO of Hydrosphere.io

Mission: Accelerate Machine Learning to Production

Open source products:
- Mist: Spark Compute as a Service
- ML Lambda: ML Function as a Service
- Sonar: Data and ML Monitoring

Business Model: Subscription services and hands-on consulting

About

Spark/Kafka Users here?

Machine Learning nerds here?

Stateless Microservices are well studied

Data, reporting and machine learning architectures are different

● Raw SQL / HiveQL / SQL on Hadoop

● Data warehouse / data lake centric

● Script-driven: ./bin/spark-submit

● Automated with Cron and/or Workflow Managers

● Hosted Notebooks culture

● Traditionally offline / for internal users

● File system aware (HDFS, S3)

● Defined by all-inclusive Hadoop distributions

- Data Pipelines on Microservices

- ML Functions as low-latency prediction services

Agenda

Part 1: Data Pipeline Intuition

Need to transform source data into the desired shape

Evolution: script-driven development

[Diagram: a Python script feeding a SQL script.]

Evolution: More Scripts

[Diagram: a growing chain of alternating Python and SQL scripts.]

Evolution: Move to Hive/Spark

[Diagram: a tangled mix of Spark, Hive, Python and SQL scripts.]

Evolution: DAG schedulers

A typical Python/SQL/Hive/Spark task:
1) Read data
2) Transform
3) Save to a temp table for the next task

Problem: Unmanageable State in a Shared Folder

- Data flow is not managed; DAG scheduling is a different concern.
- Who is responsible for schema migration: Task 1, Task 2, or the manager?
- Which folder should Task 1 write to and Task 2 read from?
- How to manage folders/resources between parallel sessions?
- When and how to clean up the shared folder? With another cleanup pipeline?
- How to check that a data batch has arrived and is valid?
- How to unit test it?
- How to handle errors?

Statesafe Pipelines

1. Get rid of the workflow manager!
2. Turn black-box tasks and scripts into microservices.
3. Use Avro data contracts between stages. Data is also an API to be standardized, versioned and validated.
4. Segregate black-box tasks into (read), (process) and (write) services.
5. Keep the state in the shared folder/topic/session manageable by the framework rather than by data engineers.
6. Abstract the engineer from data transport and provide a pure function to deal with (sketched below).
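A minimal sketch of points 3-6, with hypothetical names (Stage, CountVisits, Framework.run): the engineer supplies a pure function between contract-typed records, and the framework owns transport, validation and cleanup. In a real pipeline the case classes would be generated from versioned Avro schemas.

// Versioned data contracts between stages (stand-ins for Avro-generated types).
case class ClickEvent(userId: String, url: String, ts: Long)
case class UserVisit(userId: String, visits: Int)

// A stage is a pure function from one contract to another; no I/O inside.
trait Stage[In, Out] {
  def process(records: Seq[In]): Seq[Out]
}

object CountVisits extends Stage[ClickEvent, UserVisit] {
  def process(records: Seq[ClickEvent]): Seq[UserVisit] =
    records.groupBy(_.userId)
      .map { case (user, events) => UserVisit(user, events.length) }
      .toSeq
}

// The framework, not the engineer, decides where records are read from and
// written to, validates them against the contract, and cleans up shared state.
object Framework {
  def run[In, Out](stage: Stage[In, Out], input: Seq[In]): Seq[Out] =
    stage.process(input) // transport, validation and cleanup elided here

  def main(args: Array[String]): Unit = {
    val batch = Seq(ClickEvent("u1", "/a", 1L), ClickEvent("u1", "/b", 2L))
    println(run(CountVisits, batch)) // List(UserVisit(u1,2))
  }
}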

Statesafe Data Pipeline

Zoom in: Data Transport between stages

Statesafe Stage

Zoom in: Spark Stage Details

Zoom in: Statesafe API for Business Logic


On-demand batch pipelines

- Cannot be pre-calculated
- On-demand parametrized jobs
- Interactive
- Require large-scale processing

Examples:
- Reporting
- Simulation (pricing, bank stress testing, taxi rides)
- Forecasting (ad campaigns, energy savings, others)
- Ad-hoc analytics tools for business users

Bad Practice: Database as API

[Diagram: the web app sets a flag and report parameters in a shared database; a worker polls for new tasks, executes the reporting job, marks it complete and saves the result; the web app polls for the result.]
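In code, the anti-pattern looks something like the sketch below (the table, columns and connection string are hypothetical). The database schema becomes an implicit, unversioned API between two services, and both sides burn cycles polling it.

import java.sql.DriverManager

object PollingWorker {
  def main(args: Array[String]): Unit = {
    // Requires a JDBC driver on the classpath; details are illustrative.
    val conn = DriverManager.getConnection("jdbc:postgresql://db/reports", "app", "secret")
    while (true) {
      // Poll for tasks flagged by the web app.
      val rs = conn.createStatement()
        .executeQuery("SELECT id, params FROM report_jobs WHERE status = 'NEW' LIMIT 1")
      if (rs.next()) {
        val id = rs.getLong("id")
        // ... execute the reporting job using rs.getString("params") ...
        val done = conn.prepareStatement(
          "UPDATE report_jobs SET status = 'DONE', result = ? WHERE id = ?")
        done.setString(1, "s3://reports/" + id)
        done.setLong(2, id)
        done.executeUpdate() // the web app, in turn, polls for status = 'DONE'
      }
      Thread.sleep(5000)
    }
  }
}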

Batch prediction / reporting services

From vanilla Spark to Spark Compute as a Service

./bin/spark-submit

- Spark Sessions Pool
- REST API Framework
- Data API Framework
- Infrastructure Integration (EMR, Hortonworks, etc.)

Spark Compute as a Service
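From the caller's side, the shift is from shelling out to ./bin/spark-submit to a single HTTP call against a service that keeps a pool of warm Spark sessions. A sketch, with an illustrative endpoint and payload rather than Mist's actual API:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object SubmitJob {
  def main(args: Array[String]): Unit = {
    // A parametrized, on-demand job request instead of a spark-submit invocation.
    val body = """{"function": "sales-report", "params": {"region": "EU", "date": "2018-01-21"}}"""
    val request = HttpRequest.newBuilder()
      .uri(URI.create("http://spark-service:2004/v2/jobs")) // hypothetical endpoint
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()
    val response = HttpClient.newHttpClient()
      .send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // job id and status returned by the service
  }
}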

DEMO

Part 2: Machine Learning

[Diagram: a data scientist trains a model on cluster data; a question mark sits between the model and the web app.]

Machine Learning: training + serving (deployment)

[Diagram: the training (estimation) pipeline chains preprocess stages into a train stage; the prediction pipeline reuses the same preprocess stages in front of the fitted model.]
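For context, a minimal training-side sketch in standard Spark ML (the concrete stages are illustrative, not taken from the slides) that would produce the /tmp/spark-model loaded below:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TrainPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("train").getOrCreate()
    import spark.implicits._

    val training = Seq(("spark hadoop", 1.0), ("cats and dogs", 0.0)).toDF("text", "label")

    // preprocess -> preprocess -> train: the estimation pipeline shape above
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
    model.write.overwrite().save("/tmp/spark-model") // reloaded by the prediction side
    spark.stop()
  }
}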

import org.apache.spark.ml.PipelineModel
import spark.implicits._

// Load the fitted pipeline and score two test sentences.
val test = Seq("spark hadoop", "hadoop learning").toDF("text")
val model = PipelineModel.load("/tmp/spark-model")
model.transform(test).collect()

./bin/spark-submit …

https://issues.apache.org/jira/browse/SPARK-16365

https://issues.apache.org/jira/browse/SPARK-13944

[Diagram: the trained model is exported from the cluster and shipped to the web app in a serialized format.]

PMML / PFA / MLeap

- Yet another Format Lock

- Code & state duplication

- Limited extensibility

- Inconsistency

- Extra moving parts

[Diagram: the web app calls an API exposed by the Spark cluster to get predictions from the model.]

- Needs Spark running
- High latency, low throughput

[Diagram: the model and its libraries are packaged with an API into a Docker container that the web app calls directly.]

Spark ML Local Serving Library: https://github.com/Hydrospheredata/spark-ml-serving

Model Container
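The idea in miniature (a concept sketch, not the spark-ml-serving API; the weights and endpoint are made up): the fitted model's state is baked into the container next to a small scoring function behind an HTTP endpoint, so serving needs no SparkContext at all.

import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress

object ModelContainer {
  // Stand-in for locally loaded model state (e.g. logistic regression weights).
  val weights = Map("spark" -> 1.2, "hadoop" -> 0.7)
  val bias = -0.5

  def predict(text: String): Double = {
    val score = text.split("\\s+").map(w => weights.getOrElse(w, 0.0)).sum + bias
    1.0 / (1.0 + math.exp(-score)) // sigmoid
  }

  def main(args: Array[String]): Unit = {
    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/predict", (exchange: HttpExchange) => {
      val text = new String(exchange.getRequestBody.readAllBytes())
      val body = s"""{"probability": ${predict(text)}}"""
      exchange.sendResponseHeaders(200, body.length.toLong)
      exchange.getResponseBody.write(body.getBytes)
      exchange.close()
    })
    server.start() // try: curl -d "spark hadoop" localhost:8080/predict
  }
}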

Zoo: Models - Runtimes - Standards

API & Logistics

- HTTP/1.1, HTTP/2, gRPC

- Kafka, Flink, Kinesis

- Protobuf, Avro

- Service Discovery

- Pipelining

- Tracing

- Monitoring

- Autoscaling

- Versioning

- A/B, Canary (see the sketch after this list)

- Testing

- CPU, GPU
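Two of these concerns, versioning plus A/B or canary routing, fit in a few lines (all names below are hypothetical): requests are split between two versions of the same model by a weighted coin flip.

import scala.util.Random

case class ModelEndpoint(name: String, version: Int, url: String)

class ABRouter(stable: ModelEndpoint, candidate: ModelEndpoint, trafficToCandidate: Double) {
  // Send trafficToCandidate (e.g. 0.05 for a canary) of requests to the new version.
  def route(): ModelEndpoint =
    if (Random.nextDouble() < trafficToCandidate) candidate else stable
}

object RouterDemo {
  def main(args: Array[String]): Unit = {
    val v7 = ModelEndpoint("fraud", 7, "http://fraud-v7:8080/predict")
    val v8 = ModelEndpoint("fraud", 8, "http://fraud-v8:8080/predict")
    val router = new ABRouter(v7, v8, trafficToCandidate = 0.05)
    val sample = (1 to 1000).map(_ => router.route().version)
    println(s"v8 share: ${sample.count(_ == 8) / 1000.0}") // roughly 0.05
  }
}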

Proposed Architecture

UX: Train anywhere and deploy as a Function

UX: Models and Applications

Applications provide public endpoints for the models and for compositions of the models.
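One way to picture a composition (a sketch with hypothetical names, not Hydrosphere's API): each model version is a function, and an application chains them behind a single public endpoint.

case class ModelVersion[A, B](name: String, version: Int, run: A => B) {
  def andThen[C](next: ModelVersion[B, C]): A => C = run.andThen(next.run)
}

object FraudApp {
  val preprocess = ModelVersion[String, Seq[String]]("preprocess", 3, _.split("\\s+").toSeq)
  val classifier = ModelVersion[Seq[String], Double]("fraud-classifier", 7, ws => ws.length.toDouble)

  // The application exposes one endpoint over the composed models.
  val endpoint: String => Double = preprocess.andThen(classifier)

  def main(args: Array[String]): Unit =
    println(endpoint("wire transfer 9999 usd")) // 4.0
}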

UX: Streaming Applications + Batching

UX: Pipelines, Assemblies and Best-SLA Applications

Demo!!!

Thank you

Looking for:
- Feedback
- Advisors, mentors & partners
- Pilots and early adopters

Stay in touch

- @hydrospheredata

- https://github.com/Hydrospheredata

- https://hydrosphere.io/

- spushkarev@hydrosphere.io