Prediction Serving · 2018-01-04 · Prediction Serving. Prediction Serving Learning Systems...


Page 1

Joseph E. Gonzalez

Asst. Professor, UC Berkeley

[email protected]

Co-founder, GraphLab (now Turi Inc.)

[email protected]

what happens after learning?

Prediction Serving

Page 2

Prediction Serving

Learning Systems

Graph Systems, Graph Frames

Time Series, Frequency Domain Analytics Systems

Cluster Management, Multi-Task Learning for Job Scheduling

Cross-Cloud Perf. Estimation

Page 3

Outline

Active Collaborators: Daniel Crankshaw, Xin Wang, Michael Franklin, Ion Stoica

Page 4

[Diagram labels: Big Data, Training, Big Model (Learning)]

Timescale: minutes to days

Systems: offline and batch optimized

Heavily studied ... major focus of the AMPLab

Page 5

[Diagram labels: Big Data, Training, Big Model (Learning); Query, Decision, Application, ? (Inference)]

Page 6

[Diagram labels: Big Data, Training, Big Model (Learning); Query, Decision, Application (Inference)]

Timescale: ~10 milliseconds

Systems: online and latency optimized

Less studied …

Page 7

[Diagram labels: Big Data, Training, Big Model (Learning); Query, Decision, Application (Inference); Feedback]

Page 8

[Diagram labels: Big Data, Training (Learning); Decision, Application (Inference); Feedback]

Timescale: hours to weeks

Systems: combination of systems

Less studied …

Page 9

[Diagram labels: Big Data, Training, Big Model (Learning); Query, Decision, Application (Inference); Feedback]

Responsive (~10ms)

Adaptive (~1 second)

Page 10

Responsive (~10ms)

Adaptive (~1 second)

VELOX Model Serving System [CIDR’15]

Daniel Crankshaw, Peter Bailis, Haoyuan Li, Zhao Zhang, Joseph Gonzalez, Michael J. Franklin, Ali Ghodsi, and Michael I. Jordan

Key Insight:

Decompose models into fast and slow changing components

Page 11

[Diagram labels: Big Data, Training (Learning); Query, Decision, Application (Inference); Feedback]

Page 12

[Diagram labels: Big Data, Training (Learning); Query, Decision, Application (Inference); Feedback; Slow; Slow Changing Model; Fast Changing Model]

Page 13

Hybrid Offline + Online Learning

Update the user weights online:
• Simple to train + more robust model
• Address rapidly changing user statistics

Update feature functions offline using batch solvers:
• Leverage high-throughput systems (TensorFlow)
• Exploit slow change in population statistics
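The split above can be sketched in a few lines. This is only an illustration under assumptions that are not in the slides: `features` is a hypothetical, fixed stand-in for the offline-trained feature function, each user has a small weight vector, and feedback triggers a single SGD step on squared error. The update rule Velox actually uses may differ.

```python
from collections import defaultdict

DIM = 3
LEARNING_RATE = 0.1  # assumed step size, chosen for illustration


def features(item):
    """Hypothetical slow-changing feature function (would be retrained offline)."""
    return [1.0, float(item % 2), float(item % 3) / 2.0]


# Fast-changing component: one small weight vector per user.
user_weights = defaultdict(lambda: [0.0] * DIM)


def predict(user, item):
    f = features(item)
    return sum(w * x for w, x in zip(user_weights[user], f))


def observe(user, item, label):
    """Online partial update: one SGD step on squared error, user weights only."""
    f = features(item)
    error = predict(user, item) - label
    user_weights[user] = [
        w - LEARNING_RATE * error * x for w, x in zip(user_weights[user], f)
    ]


before = abs(predict("alice", 4) - 1.0)
for _ in range(50):
    observe("alice", 4, 1.0)
after = abs(predict("alice", 4) - 1.0)
```

Only the per-user vector changes on feedback; the feature function is untouched, which is what keeps the online step cheap.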

Page 14

Common modeling structure

Matrix Factorization [diagram labels: Users, Items]

Deep Learning [diagram label: Input]

Ensemble Methods

Page 15

Velox Online Learning for Recommendations (Simulated News Rec.)

[Plot: Error (0 to 0.6) vs. Examples (0 to 30)]

Partial Updates: 0.4 ms

Retraining: 7.1 seconds

>4 orders-of-magnitude faster adaptation
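The ">4 orders-of-magnitude" figure follows directly from the two timings on this slide; a quick check of the arithmetic, using only those two numbers:

```python
import math

partial_update_s = 0.4e-3  # 0.4 ms, from the slide
retraining_s = 7.1         # 7.1 seconds, from the slide

speedup = retraining_s / partial_update_s
orders_of_magnitude = math.log10(speedup)
print(f"speedup: {speedup:.0f}x, {orders_of_magnitude:.2f} orders of magnitude")
```

The ratio is roughly 17,750, a little over four orders of magnitude.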

Page 16

[Diagram labels: Big Data, Training (Learning); Query, Decision, Application (Inference); Feedback; Slow; Slow Changing Model; Fast Changing Model per user]

Page 17

[Diagram labels: Big Data, Training (Learning); Query, Decision, Application (Inference); Feedback; Slow; Slow Changing Model; Fast Changing Model per user; Velox]

Page 18

VELOX: the Missing Piece of BDAS

[Diagram: Berkeley Data Analytics Stack (BDAS). Labels: HDFS, S3, …; Mesos; Tachyon; Spark; Spark Streaming; Spark SQL; BlinkDB; GraphX; Graph Frames; MLLib; Keystone ML; Learning]

Page 19

VELOX: the Missing Piece of BDAS

[Diagram: Berkeley Data Analytics Stack (BDAS). Labels: HDFS, S3, …; Mesos; Tachyon; Spark; Spark Streaming; Spark SQL; BlinkDB; GraphX; Graph Frames; MLLib; Keystone ML; Learning; Management and Serving; Velox]

Page 20

VELOX: the Missing Piece of BDAS

[Diagram: Berkeley Data Analytics Stack (BDAS). Labels: HDFS, S3, …; Mesos; Tachyon; Spark; Spark Streaming; Spark SQL; BlinkDB; GraphX; Graph Frames; MLLib; Keystone ML; Learning; Management and Serving; Velox]

Page 21

VELOX Architecture

[Diagram labels: Content Rec.; Fraud Detection; Velox; Single JVM Instance; Spark; MLLib; Keystone ML]

Page 22

VELOX Architecture

[Diagram labels: Content Rec.; Fraud Detection; Personal Asst.; Robotic Control; Machine Translation; Velox; Single JVM Instance; Spark; MLLib; Keystone ML; Create; VW; Caffe]

Page 23

VELOX as a Middle Layer Arch?

[Diagram labels: Content Rec.; Fraud Detection; Personal Asst.; Robotic Control; Machine Translation; Velox; Spark MLLib; Keystone ML; Create; VW; Caffe]

Generalize?

Page 24

Clipper Generalizes Velox Across ML Frameworks

[Diagram labels: Content Rec.; Fraud Detection; Personal Asst.; Robotic Control; Machine Translation; Clipper; Create; VW; Caffe]

Page 25

[Diagram labels: Clipper; Create; VW; Caffe]

Key Insight:

The challenges of prediction serving can be addressed between end-user applications and machine learning frameworks

As a result, Clipper is able to:
• hide complexity by providing a common prediction interface
• bound latency and maximize throughput through approximate caching and adaptive batching
• enable robust online learning and personalization through generalized split-model correction policies

without modifying machine learning frameworks or end-user applications

Page 26

Clipper Design Goals

Low and bounded latency predictions: interactive applications need reliable latency objectives

Up-to-date and personalized predictions across models and frameworks: generalize the split model decomposition

Optimize throughput for performance under heavy load: a single query can trigger many predictions

Simplify deployment: serve models using the original code and systems

Page 27

Clipper Architecture

[Diagram labels: Content Rec.; Fraud Detection; Personal Asst.; Robotic Control; Machine Translation; Clipper; Create; VW; Caffe]

Page 28

Clipper Architecture

[Diagram labels: Applications; Predict; Observe; RPC/REST Interface; Clipper; Create; VW; Caffe]

Page 29

Clipper Architecture

[Diagram labels: Applications; Predict; Observe; RPC/REST Interface; Clipper; Model Wrapper (MW) ×4; RPC ×4; Caffe]

Page 30

Clipper Architecture

[Diagram labels: Applications; Predict; Observe; RPC/REST Interface; Clipper; Model Wrapper (MW) ×4; RPC ×4; Caffe]

Model Abstraction Layer: Provide a common interface to models while bounding latency and maximizing throughput.

Correction Layer: Improve accuracy through ensembles, online learning, and personalization.

Page 31

Clipper Architecture

[Diagram labels: Applications; Predict; Observe; RPC/REST Interface; Clipper; Model Wrapper (MW) ×4; RPC ×4; Caffe]

Correction Layer: Correction Policy

Model Abstraction Layer: Approximate Caching, Adaptive Batching

Page 32

[Diagram labels: Correction Layer (Correction Policy); Model Abstraction Layer (Approximate Caching, Adaptive Batching); Model Wrapper (MW) ×4; RPC ×4; Caffe]

Model Abstraction Layer

Provides a unified generic prediction API across frameworks

Reduce Latency: Approximate Caching

Increase Throughput: Adaptive Batching

Simplify Deployment: RPC + Model Wrapper

Page 33

[Diagram labels: Correction Layer (Correction Policy); Model Abstraction Layer (Approximate Caching, Adaptive Batching); Model Wrapper (MW) ×4; RPC ×4; Caffe]

Provide a common interface to models while bounding latency and maximizing throughput.

Page 34

[Diagram labels: Correction Layer (Correction Policy); Model Abstraction Layer (Approximate Caching, Adaptive Batching); Model Wrapper (MW) ×4; RPC ×4; Caffe]

Common Interface Simplifies Deployment:
• Evaluate models using original code & systems
• Models run in separate processes: resource isolation
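A minimal sketch of what a common prediction interface could look like. Every name here (`ModelWrapper`, `predict_batch`, the two toy wrappers, `serve`) is hypothetical and is not Clipper's actual API; the sketch also runs the wrappers in-process, whereas the slides describe each wrapper as a separate process reached over RPC.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class ModelWrapper(ABC):
    """Hypothetical common interface; not Clipper's actual API."""

    @abstractmethod
    def predict_batch(self, inputs: Sequence[Sequence[float]]) -> List[float]:
        """Return one prediction per input, in order."""


class LinearModelWrapper(ModelWrapper):
    """Stand-in for a wrapper around one framework's model."""

    def __init__(self, weights: Sequence[float]):
        self.weights = list(weights)

    def predict_batch(self, inputs):
        return [sum(w * x for w, x in zip(self.weights, row)) for row in inputs]


class ThresholdModelWrapper(ModelWrapper):
    """Stand-in for a wrapper around a different framework's model."""

    def __init__(self, threshold: float):
        self.threshold = threshold

    def predict_batch(self, inputs):
        return [1.0 if sum(row) > self.threshold else 0.0 for row in inputs]


def serve(wrappers: Dict[str, ModelWrapper], batch) -> Dict[str, List[float]]:
    """The serving layer only ever sees the common interface."""
    return {name: wrapper.predict_batch(batch) for name, wrapper in wrappers.items()}


results = serve(
    {"linear": LinearModelWrapper([0.5, 0.5]), "threshold": ThresholdModelWrapper(1.0)},
    [[1.0, 1.0], [0.0, 0.5]],
)
```

Because the serving layer depends only on `predict_batch`, a new framework can be added by writing one more wrapper, with no change to the caller.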

Page 35

[Diagram labels: Correction Layer (Correction Policy); Model Abstraction Layer (Approximate Caching, Adaptive Batching); Model Wrapper (MW) ×6; RPC ×6; Caffe]

Common Interface Simplifies Deployment:
• Evaluate models using original code & systems
• Models run in separate processes: resource isolation, scale-out

Problem: frameworks optimized for batch processing, not latency

Page 36

Adaptive Batching to Improve Throughput

A single page load may generate many queries

Why batching helps:
• Hardware acceleration
• Helps amortize system overhead

Optimal batch depends on:
• hardware configuration
• model and framework
• system load

Clipper Solution: be as slow as allowed…
• Application specifies latency objective
• Clipper uses TCP-like tuning algorithm to increase latency up to the objective
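One way to read "TCP-like tuning" is additive increase, multiplicative decrease (AIMD) on the batch size. The slide does not spell out the algorithm, so the sketch below is an assumption: `measure_latency_ms` is a hypothetical callback, and the step size, backoff factor, and toy latency model are illustrative values.

```python
def tune_batch_size(measure_latency_ms, latency_objective_ms, steps=100,
                    additive_step=1, backoff=0.5):
    """AIMD search for the largest batch size that meets the latency objective.

    measure_latency_ms(batch_size) is a hypothetical callback that runs one
    batch and returns its latency in milliseconds.
    """
    batch_size = 1
    best = 1
    for _ in range(steps):
        latency = measure_latency_ms(batch_size)
        if latency <= latency_objective_ms:
            best = max(best, batch_size)
            batch_size += additive_step                      # additive increase
        else:
            batch_size = max(1, int(batch_size * backoff))   # multiplicative decrease
    return best


# Toy latency model (an assumption): 2 ms fixed overhead plus 0.5 ms per query.
def toy_latency(batch_size):
    return 2.0 + 0.5 * batch_size


best_batch = tune_batch_size(toy_latency, latency_objective_ms=20.0)
```

With this toy model the tuner settles on a batch of 36 queries, the largest size whose latency stays within the 20 ms objective.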

Page 37

[Plot: TensorFlow Conv. Net (GPU). Throughput (Queries Per Second) and Latency (ms) vs. Batch Sizes (Queries); annotations: Latency Deadline, Optimal Batch Size]

Page 38

Comparison to TensorFlow Serving

Takeaway: Clipper is able to match the average latency of TensorFlow Serving while reducing tail latency (2x) and improving throughput (2x)

Page 39

Approximate Caching to Reduce Latency

Opportunity for caching: popular items may be evaluated frequently

Need for approximation: high-dimensional and continuous-valued queries have a low cache hit rate

Clipper Solution: Approximate Caching (apply locality sensitive hash functions)

[Diagram labels: Bag-of-Words Model; Images; Cache Hit; Cache Miss; Cache Hit; Error]
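A minimal sketch of an approximate cache keyed by a locality sensitive hash. The hash family (random hyperplanes), the number of bits, and the class name are assumptions made for illustration; the slides only say that locality sensitive hash functions are applied.

```python
import random


class ApproximateCache:
    """Prediction cache keyed by a random-hyperplane LSH signature.

    Illustrative only: the hash family, bit count, and class name are
    assumptions, not details taken from the slides.
    """

    def __init__(self, dim, num_bits=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_bits)]
        self.store = {}

    def _signature(self, x):
        # One bit per hyperplane: which side of the plane x falls on.
        return tuple(
            sum(p_i * x_i for p_i, x_i in zip(plane, x)) >= 0.0
            for plane in self.planes
        )

    def get_or_compute(self, x, model):
        key = self._signature(x)
        if key in self.store:
            return self.store[key], True   # hit, possibly approximate
        value = model(x)
        self.store[key] = value
        return value, False                # miss


calls = []


def slow_model(x):
    calls.append(x)
    return sum(x)


cache = ApproximateCache(dim=3)
first, hit1 = cache.get_or_compute([1.0, 2.0, 3.0], slow_model)
# A positive multiple lies on the same side of every hyperplane, so it collides.
second, hit2 = cache.get_or_compute([2.0, 4.0, 6.0], slow_model)
```

The second query is served from the cache with the first query's answer (6.0 rather than the exact 12.0). That is the trade the slide labels "Cache Hit" and "Error": fewer model evaluations in exchange for approximation error.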

Page 40

Clipper Architecture

[Diagram labels: Applications; Predict; Observe; RPC/REST Interface; Clipper; Model Wrapper (MW) ×4; RPC ×4; Caffe]

Correction Layer: Correction Policy

Model Abstraction Layer: Approximate Caching, Adaptive Batching

Page 41

Clipper: Correction Layer (Correction Policy)

Goal:

Maximize accuracy through ensembles, online learning, and personalization

Generalize the split-model insight from Velox to achieve:
• robust predictions by combining multiple models & frameworks
• online learning and personalization by correcting and personalizing predictions in response to feedback
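As a rough illustration of combining models and correcting the combination from feedback, the sketch below keeps one weight per model and shrinks it in proportion to that model's squared error (a multiplicative-weights update). The slides do not specify the policy, so the class name, update rule, and learning rate are all assumptions.

```python
import math


class WeightedEnsemblePolicy:
    """Hypothetical correction policy: multiplicative-weights ensemble."""

    def __init__(self, model_names, eta=0.5):
        self.weights = {name: 1.0 for name in model_names}
        self.eta = eta  # assumed learning rate

    def predict(self, model_outputs):
        # Weighted average of the per-model predictions.
        total = sum(self.weights.values())
        return sum(self.weights[m] * y for m, y in model_outputs.items()) / total

    def observe(self, model_outputs, label):
        # Feedback: down-weight each model according to its squared error.
        for m, y in model_outputs.items():
            self.weights[m] *= math.exp(-self.eta * (y - label) ** 2)


policy = WeightedEnsemblePolicy(["good", "bad"])
outputs = {"good": 0.9, "bad": 0.1}
before = policy.predict(outputs)
for _ in range(20):
    policy.observe(outputs, label=1.0)
after = policy.predict(outputs)
```

After repeated feedback the combined prediction moves from the plain average toward the model that has been more accurate, without retraining either underlying model.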

Page 42

[Diagram labels: Big Data (Learning); Application (Inference); Feedback; Slow; Slow Changing Model; Fast Changing User Model; Velox]

Page 43

[Diagram labels: Big Data (Learning); Application (Inference); Feedback; Slow; Slow Changing Model; Fast Changing User Model; Caffe; Clipper]

Page 44

[Diagram labels: Clipper; Caffe; Slow Changing Model; Fast Changing User Model]

Correction Policy

Improves prediction accuracy by:
• Incorporating real-time feedback
• Managing personalization
• Combining models & frameworks: enables frameworks to compete

Page 45

Improved Prediction Accuracy (ImageNet)

System      Model          Error Rate   #Errors
Caffe       VGG            13.05%       6525
Caffe       LeNet          11.52%       5760
Caffe       ResNet         9.02%        4512
TensorFlow  Inception v3   6.18%        3088

sequence of pre-trained state-of-the-art models

Page 46

Improved Prediction Accuracy

System      Model          Error Rate   #Errors
Caffe       VGG            13.05%       6525
Caffe       LeNet          11.52%       5760
Caffe       ResNet         9.02%        4512
TensorFlow  Inception v3   6.18%        3088
Clipper     Ensemble       5.86%        2930

5.2% relative improvement in prediction accuracy!
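The 5.2% figure can be reproduced from the error rates in the table. The evaluation-set size of 50,000 images is not stated on the slide; it is inferred here from the first row (6525 / 0.1305) and is used only to check that the rates and error counts agree.

```python
rates = {  # error rate in percent, from the table
    "VGG": 13.05, "LeNet": 11.52, "ResNet": 9.02,
    "Inception v3": 6.18, "Clipper Ensemble": 5.86,
}
errors = {  # error counts, from the table
    "VGG": 6525, "LeNet": 5760, "ResNet": 4512,
    "Inception v3": 3088, "Clipper Ensemble": 2930,
}
n_images = 50_000  # inferred, not stated on the slide

# Largest gap between a stated rate and the rate implied by its error count.
max_rate_gap = max(abs(100 * errors[m] / n_images - rates[m]) for m in rates)

# Relative reduction in error rate versus the best single model.
best_single = rates["Inception v3"]
relative_improvement_pct = 100 * (best_single - rates["Clipper Ensemble"]) / best_single
```

The relative improvement is measured against the best single model (Inception v3): (6.18 - 5.86) / 6.18 is about 5.2%.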

Page 47

Cost of Ensembles

[Diagram labels: Clipper; Caffe; Slow Changing Model; Fast Changing User Model]

Increased Load
• Solutions: Caching and Batching
• Load-shedding: correction policy can prioritize frameworks

Stragglers (e.g., framework fails to meet SLO)
• Solution: Anytime predictions
• Correction policy must render predictions with missing inputs
• e.g., built-in correction policies substitute expected value
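A minimal sketch of an anytime prediction: when a model has not answered by the deadline, the policy substitutes that model's expected value and still returns a combined result. The function name, the weighted-average combination, and the example numbers are assumptions for illustration.

```python
def anytime_ensemble(predictions, expected_values, weights):
    """Weighted average of model outputs; a missing output (absent or None)
    is replaced by that model's expected value."""
    total = 0.0
    for name, weight in weights.items():
        value = predictions.get(name)
        if value is None:  # straggler: no answer before the deadline
            value = expected_values[name]
        total += weight * value
    return total / sum(weights.values())


weights = {"a": 1.0, "b": 1.0, "c": 2.0}
expected = {"a": 0.5, "b": 0.4, "c": 0.6}

# Model "a" answered in time, "b" returned nothing, "c" is missing entirely.
combined = anytime_ensemble({"a": 0.8, "b": None}, expected, weights)
```

The result is always available at the deadline; accuracy degrades only to the extent that the substituted expected values differ from what the late models would have returned.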

Page 48

Anytime Predictions

[Diagram labels: Clipper; Caffe; Slow Changing Model; Fast Changing User Model; Application; 20ms ✓]

Page 49

Anytime Predictions

[Diagram labels: Caffe; Slow Changing Model; Fast Changing User Model; ✓ ✓]

Page 50

Evaluation of Throughput Under Heavy Load

[Plot: Accuracy vs. Throughput (queries per second)]

Takeaway: Clipper is able to gracefully degrade accuracy to maintain availability under heavy load.

Page 51

Coarsening + Anytime Predictions

[Plot labels: No Coarsening; Overly Coarsened; Better; Best; More Features; Approx. Expectation; Coarser Hash]

Page 52

Conclusion

[Diagram labels: Clipper; Create; VW; Caffe]

Clipper sits between applications and ML frameworks to
• simplify deployment
• bound latency and increase throughput
• enable real-time learning and personalization

across machine learning frameworks

Page 53

[Diagram labels: Big Data; Training; Model; Application; Decision; Query; Feedback; VELOX; Clipper; Create; VW; Caffe]

Page 54

Ongoing & Future Research Directions

Serving and updating RL models

Bandit techniques in correction policies (collaboration with MSR)

Splitting inference across the cloud and the client to reduce latency and bandwidth requirements

Secure model evaluation on the client (model DRM)