Cirrus: A Serverless Framework for Andrew Zhang, Randy ... · Serverless Frameworks Machine...

Cirrus: A Serverless Framework for End-to-end ML Workflows Joao Carreira, Pedro Fonseca, Alexey Tumanov, Andrew Zhang, Randy Katz

Page 1:

Cirrus: A Serverless Framework for End-to-end ML Workflows

Joao Carreira, Pedro Fonseca, Alexey Tumanov, Andrew Zhang, Randy Katz

Page 2:

Machine Learning

Page 8:

End-to-end ML workflows

● Modern end-to-end ML workflows are complex
● ML workflows consist of 3 heterogeneous stages: Dataset Preprocessing, Model Training, Hyperparameter Tuning

ML workflows are interactive and iterative

Page 9:

Provisioning ML workflows

Provisioning ML workflows is challenging

● Complex infrastructure management detracts from ML work
● Resource waste due to overprovisioning of resources

Hard to accurately estimate resource demands of each stage

Data scientists have limited systems expertise

Page 10:

Serverless computing

[Diagram: Input, Code, Output, AWS S3]

Page 11:

Serverless computing

[Diagram: Input, Code, Output]

Page 12:

Serverless computing benefits

Tight provisioning of resources
● Fine-grained resources
● Fine-grained billing
● High elasticity

Simplifying infrastructure management
● Automatic resource configuration / provisioning / maintenance

Page 13:

Challenges of serverless

● Small local memory and storage
● Short-lived and unpredictable launch times
● Low bandwidth and no P2P communication
● Lack of fast shared storage
● Limited lambda package size

Page 16:

Existing approaches

Short-lived and unpredictable launch times

Serverless Frameworks (PyWren)
● Limited package size: download dependencies from S3
● No fast storage: high-latency communication through S3
● Unpredictable launch: stragglers

Machine Learning Frameworks
● Small memory: unable to launch runtimes in lambdas
● No P2P communication, unpredictable launch: no ring/tree reduces, no driver-to-worker communication, precludes MPI

Page 17:

Cirrus: a framework for serverless end-to-end ML workflows

Page 18:

Cirrus: design principles

1) Addressing serverless challenges

Challenges:
● Low memory
● Limited package size
● No fast storage
● No P2P communication
● Short lifetimes and unpredictable launch

Mechanisms:
● Ultra-lightweight runtime + data prefetching
● High-perf. data store (parameter-server and KV)
● Robust handling of lambda termination
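
One way to picture "robust handling of lambda termination" is a worker that records its progress and stops cleanly before its time budget runs out, so that a relaunched worker can resume. The following is a minimal sketch under stated assumptions, not Cirrus's implementation: `FakeContext`, `run_worker`, `run_until_done`, and the checkpoint dict are hypothetical names. Only `get_remaining_time_in_millis()` mirrors a method of the AWS Lambda Python context object.

```python
class FakeContext:
    """Stand-in for the AWS Lambda context; each call 'uses up' some time."""

    def __init__(self, budget_ms, cost_per_call_ms):
        self.remaining = budget_ms
        self.cost = cost_per_call_ms

    def get_remaining_time_in_millis(self):
        self.remaining -= self.cost
        return max(self.remaining, 0)


def run_worker(context, checkpoint, total_batches, safety_margin_ms=100):
    """Process minibatches until done or until the time budget is nearly spent.

    Returns True when all batches are processed, False when the worker
    stopped early and should be relaunched from the checkpoint.
    """
    while checkpoint["next_batch"] < total_batches:
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            return False  # exit cleanly; progress is already in the checkpoint
        # ... compute and push a gradient for this batch here ...
        checkpoint["next_batch"] += 1
    return True


def run_until_done(total_batches, budget_ms=1000, cost_ms=150):
    """Hypothetical manager loop: relaunch workers until the task finishes."""
    checkpoint = {"next_batch": 0}
    launches = 0
    while True:
        launches += 1
        worker_context = FakeContext(budget_ms, cost_ms)
        if run_worker(worker_context, checkpoint, total_batches):
            return launches, checkpoint["next_batch"]


launches, processed = run_until_done(total_batches=20)
```

With these illustrative numbers each simulated worker handles six batches before stopping, so the 20-batch task completes on the fourth launch.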

Page 19:

Cirrus: design principles

2) Achieving benefits for end-to-end ML

Benefits:
● Tight provisioning of resources
● Simplifying infrastructure management

Mechanisms:
● Per-stage fine-grained variable agile scalability
● High-level API supports end-to-end ML

Page 20:

Cirrus architecture (client side)

[Diagram: client side (stateful)]
● Data scientist
● Client frontend: Dashboard, Python API; Preproc., Training, Tuning
● Create/Stop Task
● Client backend: Task Scheduler, Lambda Manager

Page 21:

Cirrus Dashboard

Page 25:

Cirrus architecture (server side)

[Diagram: server side (stateless)]
● Cirrus runtime: Data Iterator API, Minibatch Buffer; Sparse LR, Mat. Fact., LDA; Data store client API
● Runtime to data store calls: put(gradient), get(model), put/get key
● Data store: PS API (Models; SGD, Adagrad, Momentum), Key-value API (Key-values)
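
The `put(gradient)` / `get(model)` interaction on this slide can be illustrated with a toy in-memory store that applies the update rule on the store side. This is only a sketch under assumptions: the class `ParameterServer`, its method names, and the dict-based sparse representation are hypothetical and are not taken from the Cirrus code base.

```python
import math


class ParameterServer:
    """Toy in-memory store: holds the model and applies updates server-side."""

    def __init__(self, learning_rate=0.1, optimizer="sgd"):
        self.weights = {}   # feature index -> weight (sparse model)
        self.grad_sq = {}   # feature index -> sum of squared gradients (Adagrad)
        self.lr = learning_rate
        self.optimizer = optimizer

    def get_model(self, keys):
        """Return only the requested weights; missing ones default to 0.0."""
        return {k: self.weights.get(k, 0.0) for k in keys}

    def put_gradient(self, gradient):
        """Apply a sparse gradient (dict of index -> value) to the model."""
        for k, g in gradient.items():
            if self.optimizer == "adagrad":
                self.grad_sq[k] = self.grad_sq.get(k, 0.0) + g * g
                step = self.lr * g / (math.sqrt(self.grad_sq[k]) + 1e-8)
            else:  # plain SGD
                step = self.lr * g
            self.weights[k] = self.weights.get(k, 0.0) - step


ps = ParameterServer(learning_rate=0.5)
ps.put_gradient({3: 1.0, 7: -2.0})   # a worker pushes a sparse gradient
model = ps.get_model([3, 7, 9])      # another worker pulls the weights it needs
```

Keeping the optimizer state in the store means stateless workers only need to send gradients and request the weights they use.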

Page 26:

Cirrus evaluation

1. Cirrus provides benefits by specializing both for serverless and end-to-end ML
2. We show that Cirrus outperforms a state-of-the-art serverless system: PyWren

Page 27:

Evaluation setup

1. Deployment: AWS Lambdas (3GB of mem.)
2. Benchmark: async. distributed SGD Sparse Logistic Regression task
3. Dataset: Criteo Dataset (a dataset of display ads)
4. PyWren:
   a. Baseline: iterative synchronous SGD training using AWS S3 to store gradients and model
   b. + 3 incremental optimizations
5. Cirrus: 2 modes (with/without prefetching)
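
For readers unfamiliar with the benchmark, here is a minimal sketch of one SGD step for sparse logistic regression. It is not the evaluation code: the function names and the dict-based sparse format are assumptions made for illustration. The point it shows is that the gradient is nonzero only at the example's active features, so both the gradient and the update are sparse.

```python
import math


def sparse_lr_gradient(weights, features, label):
    """Gradient of the logistic loss for one example.

    weights:  dict index -> weight
    features: dict index -> value (only nonzero features are stored)
    label:    0 or 1
    """
    z = sum(weights.get(i, 0.0) * v for i, v in features.items())
    prediction = 1.0 / (1.0 + math.exp(-z))
    # d(loss)/d(w_i) = (prediction - label) * x_i, nonzero only where x_i is
    return {i: (prediction - label) * v for i, v in features.items()}


def sgd_step(weights, gradient, learning_rate):
    """In-place SGD update that touches only the indices in the gradient."""
    for i, g in gradient.items():
        weights[i] = weights.get(i, 0.0) - learning_rate * g


weights = {}
example = {2: 1.0, 40: 2.0}   # two active features out of many
grad = sparse_lr_gradient(weights, example, label=1)
sgd_step(weights, grad, learning_rate=0.1)
```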

Page 28:

Cirrus outperforms vanilla serverless

Synchronous SGD training suffers from stragglers

[Plot: test loss]
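
The straggler effect can be made concrete with a small arithmetic sketch. The timings below are invented for illustration and are not measurements from the talk. In synchronous training every round lasts as long as its slowest worker, so the total time is the sum of per-round maxima.

```python
def synchronous_time(round_times):
    """Total wall time when every round waits for the slowest worker."""
    return sum(max(times) for times in round_times)


def per_worker_busy_time(round_times):
    """Time each worker actually spends computing across all rounds."""
    n_workers = len(round_times[0])
    return [sum(times[w] for times in round_times) for w in range(n_workers)]


# Illustrative numbers only: seconds per round for 3 workers,
# with a different worker straggling in each round.
rounds = [
    [1, 1, 4],
    [1, 5, 1],
    [3, 1, 1],
]
sync_total = synchronous_time(rounds)   # 4 + 5 + 3 = 12
busy = per_worker_busy_time(rounds)     # [5, 7, 6]
```

Here the synchronous run takes 12 seconds although no worker computes for more than 7. Without a barrier between rounds, each worker would be limited only by its own busy time.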

Page 29:

Cirrus outperforms vanilla serverless

● Multiple SGD iterations on each lambda invocation
● Asynchronous SGD

[Plot: test loss]

Page 30:

Cirrus outperforms vanilla serverless

Sparse gradients and training data prefetching

[Plot: test loss]
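
Training data prefetching can be sketched as a background thread that fills a bounded buffer while the training loop consumes from it. This is a minimal illustration, not the Cirrus runtime: `prefetching_iterator` and `fetch_minibatch` are hypothetical names, and `fetch_minibatch(i)` stands in for whatever call loads batch `i` from remote storage.

```python
import queue
import threading


def prefetching_iterator(fetch_minibatch, num_batches, buffer_size=4):
    """Yield minibatches while a background thread fetches the next ones.

    fetch_minibatch(i) is a hypothetical function that loads batch i from
    remote storage; the bounded queue plays the role of a minibatch buffer.
    """
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for i in range(num_batches):
            buffer.put(fetch_minibatch(i))   # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()                 # blocks only if the buffer is empty
        if batch is done:
            return
        yield batch


batches = list(prefetching_iterator(lambda i: [i, i + 1], num_batches=5))
```

The bounded queue keeps memory use fixed, which matters when local memory is small, while still overlapping data fetches with computation.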

Page 31:

Cirrus outperforms vanilla serverless

Replace AWS S3 with high-performance store (Redis)

[Plot: test loss; annotation: +700x updates/sec]

Page 32:

Cirrus outperforms vanilla serverless

Cirrus without training data prefetching

[Plot: test loss; annotation: 10x faster]

Page 33:

Cirrus outperforms vanilla serverless

Cirrus with training data prefetching

[Plot: test loss; annotations: 10x faster, 10x faster]

Page 34:

Conclusion

1. End-to-end ML workflows:
   a. time-consuming infrastructure management
   b. resource overprovisioning
2. Cirrus -- serverless end-to-end ML framework:
   a. simplify deployment of ML workflows
   b. per-stage provisioning of resources
3. Cirrus outperforms existing serverless solutions by specializing for serverless and ML

Page 35:

Thank you!

github.com/ucbrise/cirrus @jccarreira