
Clipper: A Low-Latency Online Prediction Serving System

Daniel Crankshaw, Xin Wang, Giulio Zhou, Michael J. Franklin, Joseph E. Gonzalez, Ion Stoica

{crankshaw,xinw,gzhou123,franklin,jegonzal,istoica}@berkeley.edu

Abstract

Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment.

In this paper, we introduce Clipper, the first general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluate Clipper on four common machine learning benchmark datasets and demonstrate its ability to meet the latency, accuracy, and throughput demands of online serving applications. Finally, we compare Clipper to the TensorFlow Serving system and demonstrate comparable prediction throughput and latency on a range of models while enabling new functionality, improved accuracy, and robustness.

1 Introduction

The past few years have seen an explosion of applications driven by machine learning, including recommendation systems [27, 59], voice assistants [17, 25, 54], and ad-targeting [3, 26]. These applications depend on two stages of machine learning: training and inference. Training is the process of building a model from data (e.g., movie ratings). Inference is the process of using the model to make a prediction given an input (e.g., predict a user's rating for a movie). While training is often computationally expensive, requiring multiple passes over potentially large datasets, inference is often assumed to be inexpensive. Conversely, while it is acceptable for training to take hours to days to complete, inference must run in real time, often on orders of magnitude more queries, and is typically part of user-facing applications.

For example, consider a news agency that wants to personalize its content presentation. Ideally, such an application should be able to recommend articles at interactive latencies (<100ms) [63], scale to large and growing user populations, sustain the throughput demands of flash crowds driven by breaking news, and provide accurate predictions as the news cycle and reader interests evolve.

Figure 1: The Clipper Architecture

The challenges of developing these applications differ between the training and inference stages. On the training side, developers must choose from a bewildering array of machine learning frameworks with diverse APIs, models, algorithms, and hardware requirements. Furthermore, they may often need to migrate between models and frameworks as new, more accurate techniques are developed. Once trained, models must be deployed to a prediction serving system to provide low-latency predictions at scale.

Unlike model development, which is supported by sophisticated infrastructure, theory, and systems, model deployment and prediction serving have received relatively little attention. Developers must cobble together the necessary pieces from various systems components, and must integrate and support inference across multiple, evolving frameworks, all while coping with ever-increasing demands for scalability and responsiveness. As a result, the deployment, optimization, and maintenance of machine learning applications is difficult and error-prone.

To address these challenges, we propose Clipper, a layered architecture (Figure 1) that dramatically reduces the complexity of implementing a prediction serving stack and achieves three highly desirable properties of a prediction serving system: low latency, high throughput, and improved accuracy. Clipper is divided into two layers: (1) the model abstraction layer, and (2) the model selection layer. The first layer exposes a common API that abstracts away the heterogeneity of existing ML frameworks. Consequently, applications using the Clipper API need to be written only once; the machine learning frameworks can be modified or swapped below Clipper without any application changes. The model selection layer sits above the model abstraction layer and dynamically selects and combines predictions across competing models to provide more accurate and robust predictions.

To achieve low-latency, high-throughput predictions, Clipper implements a range of optimization techniques. At the model abstraction layer, Clipper caches predictions on a per-framework basis and implements adaptive batching to maximize throughput given a query latency target. At the model selection layer, Clipper implements techniques to improve both prediction accuracy and latency. To improve prediction accuracy, Clipper exploits bandit and ensemble methods to robustly select and combine predictions from multiple models and estimate prediction uncertainty. In addition, Clipper is able to adapt the model selection independently for each user or session. Finally, to improve latency, the model selection layer implements a straggler mitigation technique to render predictions without waiting for slow models. Because of this layered design, neither the end-user applications nor the underlying machine learning frameworks need to be modified to take advantage of these optimizations.

We implemented Clipper in Rust and added support for several of the most widely used machine learning frameworks: Apache Spark MLlib [40], Scikit-Learn [52], Caffe [31], TensorFlow [1], and HTK [62]. While these frameworks span multiple application domains, programming languages, and system requirements, each was added using fewer than 25 lines of code.

We evaluate Clipper using four common machine learning benchmark datasets and demonstrate that Clipper is able to render low and bounded latency predictions (<20ms), scale to many deployed models even across machines, quickly select and adapt to the best combination of models, and dynamically trade off accuracy and latency under heavy query load. We compare Clipper to the Google TensorFlow Serving system [58], a tightly coupled industrial-grade prediction serving system. We demonstrate that Clipper's modular design and broad functionality impose minimal performance cost, achieving prediction throughput and latency comparable to TensorFlow Serving while supporting substantially more functionality. In summary, our key contributions are:

• A layered architecture that abstracts away the complexity associated with serving predictions in existing machine learning frameworks.

• A set of novel techniques to reduce and bound latency while maximizing prediction throughput across machine learning frameworks.

• A model selection layer that enables real-time selection and composition of models to provide robust and accurate predictions in dynamic applications.

Figure 2: Machine Learning Lifecycle

2 Applications and Challenges

The machine learning life-cycle (Figure 2) can be divided into two distinct phases: training and inference. Training is the process of estimating a model from data. Training is often computationally expensive, requiring multiple passes over large datasets, and can take hours or even days to complete [10, 28, 42]. Much of the innovation in systems for machine learning has focused on model training with the development of systems like Apache Spark [64], the Parameter Server [38], PowerGraph [24], and Adam [13].

A wide range of machine learning frameworks have been developed to address the challenges of training. Many specialize in particular models (e.g., TensorFlow [1] for deep learning, Vowpal Wabbit [34] for linear models) and applications (e.g., Caffe [31] for computer vision, HTK [62] for speech recognition). Typically, these frameworks leverage advances in parallel and distributed systems to scale the training process.

Inference is the process of evaluating a model to render a prediction. When compared to training, inference is generally assumed to be relatively inexpensive. As a consequence, there is little systems work addressing the process of inference, and most machine learning frameworks provide only basic support for offline batch inference. In this paper we focus on the less studied but increasingly important challenges of inference.

2.1 Application Workloads

To illustrate the challenges of inference and provide a benchmark on which to evaluate Clipper, we describe two canonical real-world applications of machine learning: object recognition and speech recognition.

Object Recognition

Advances in deep learning have spurred rapid progress in computer vision, especially in object recognition problems: the task of identifying and labeling the objects in a picture. Object recognition models form an important building block in many computer vision applications ranging from image search to self-driving cars.

As users interact with these applications, they provide feedback about the accuracy of the predictions, either by explicitly labeling images (e.g., tagging a user in an image) or implicitly by indicating whether the provided prediction was correct (e.g., clicking on a suggested image in a search). Incorporating this feedback quickly can be essential to eliminating failing models and providing a more personalized experience for users.

Benchmark Applications: We use the well studied MNIST [35], CIFAR-10 [32], and ImageNet [50] datasets to evaluate increasingly difficult object-recognition tasks with correspondingly larger inputs. For each dataset, the prediction task requires identifying the correct label for an image based on its pixel values. MNIST is a common baseline dataset used to demonstrate the potential of a new algorithm or technique, and both deep learning and more classical machine learning models perform well on MNIST. On the other hand, for CIFAR-10 and ImageNet, deep learning significantly outperforms other methods. By using three different datasets, we evaluate Clipper's performance when serving models that have a wide variety of computational requirements and accuracies.

Automatic Speech Recognition

Another successful application of machine learning is automatic speech recognition. A speech recognition model is a function from a spoken audio signal to the corresponding sequence of words. Speech recognition models can be relatively large [9] and are often composed of many complex sub-models trained using specialized speech recognition frameworks (e.g., HTK [62]). Speech recognition models are also often personalized to individual users to accommodate variations in dialect and accent.

In most applications, inference is done online as the user speaks. Providing real-time predictions is essential to user experience [4] and enables new applications like real-time translation [55]. However, inference in speech models can be costly [9], requiring the evaluation of large tensor products in convolutional neural networks.

As users interact with speech services, they provide implicit signal about the quality of the speech predictions which can be used to identify the dialect. Incorporating this feedback quickly can improve user experience by allowing us to choose models specialized for a user's dialect.

Benchmark Application: To evaluate the benefit of personalization and online model selection on a dataset with real user data, we built a speech recognition service with the widely used TIMIT speech corpus [23] and the HTK [62] machine learning framework. This dataset consists of voice recordings for 630 speakers in eight dialects of English. We randomly drew users from the test corpus and simulated their interaction with our speech recognition service using their pre-recorded speech data.

2.2 Challenges

Motivated by the above applications, we outline the key challenges of prediction serving and describe how Clipper addresses these challenges.

Complexity of Deploying Machine Learning

There is a large and growing number of machine learning frameworks [1, 7, 12, 15, 31]. Each framework has its own strengths and weaknesses, and many are optimized for specific models or application domains (e.g., computer vision). Thus, there is no dominant framework, and often multiple frameworks may be used for a single application (e.g., speech recognition and computer vision in automatic captioning). Although common model exchange formats have been proposed [48, 49], they have never achieved widespread adoption because of the rapid and fundamental changes in state-of-the-art techniques. Furthermore, machine learning frameworks are often developed by and for machine learning experts and are therefore heavily optimized towards model development rather than model deployment. As a consequence, application developers are forced to accept reduced accuracy by forgoing the use of a model well-suited to the task or to incur the substantially increased complexity of integrating and supporting multiple machine learning frameworks.

Solution: Clipper introduces a model abstraction layer that isolates applications from variability in machine learning frameworks (Section 4) and simplifies the process of deploying a new model or framework.

Prediction Latency and Throughput

The prediction latency is the time it takes to render a prediction given a query. Because prediction serving is often on the critical path, predictions must both be fast and have bounded tail latencies to meet service level objectives [63]. While simple linear models are fast, more sophisticated and often more accurate models such as support vector machines, random forests, and deep neural networks are much more computationally intensive and can have substantial latencies (50-100ms) [12] (see Figure 11 for details). In many cases accuracy can be improved by combining models, but at the expense of stragglers and increased tail latencies. Finally, most machine learning frameworks are optimized for offline batch processing and not single-input prediction latency. Moreover, the low and bounded latency demands of interactive applications are often at odds with the design goals of machine learning frameworks.

The computational cost of sophisticated models can substantially impact prediction throughput. For example, a relatively fast neural network which is able to render 100 predictions per second is still orders of magnitude slower than a modern web server. While batching prediction requests can substantially improve throughput by exploiting optimized BLAS libraries, SIMD instructions, and GPU acceleration, it can also adversely affect prediction latency. Finally, under heavy query load it is often preferable to marginally degrade accuracy rather than substantially increase latency or lose availability [3, 22].

Solution: Clipper automatically and adaptively batches prediction requests to maximize the use of batch-oriented system optimizations in machine learning frameworks while ensuring that prediction latency objectives are still met (Section 4.3). In addition, Clipper employs straggler mitigation techniques to reduce and bound tail latency, enabling model developers to experiment with complex models without affecting serving latency (Section 5.2.2).

Model Selection

Model development is an iterative process producing many models reflecting different feature representations, modeling assumptions, and machine learning frameworks. Typically, developers must decide which of these models to deploy based on offline evaluation using stale datasets or engage in costly online A/B testing. When predictions can influence future queries (e.g., content recommendation), offline evaluation techniques can be heavily biased by previous modeling results. Alternatively, A/B testing techniques [2] have been shown to be statistically inefficient, requiring data to grow exponentially in the number of candidate models. The resulting choice of model is typically static and therefore susceptible to changes in model performance due to factors such as feature corruption or concept drift. In some cases the best model may differ depending on the context (e.g., user or region) in which the query originated. Finally, predictions from more than one model can often be combined in ensembles to boost prediction accuracy and provide more robust predictions through confidence estimates.

Solution: Clipper leverages adaptive online model selection and ensembling techniques to incorporate feedback and automatically select and combine predictions from models that can span multiple machine learning frameworks.

2.3 Experimental Setup

Because we include microbenchmarks of many of Clipper's features as we introduce them, we present the experimental setup now. For each of the three object recognition benchmarks, the prediction task is predicting the correct label given the raw pixels of an unlabeled image as input. We used a variety of models on each of the object recognition benchmarks. For the speech recognition benchmark, the prediction task is predicting the phonetic transcription of the raw audio signal. For this benchmark, we used the HTK Speech Recognition Toolkit [62] to learn Hidden Markov Models whose outputs are sequences of phonemes representing the transcription of the sound. Details about each dataset are presented in Table 1.

Dataset        Type   Size   Features    Labels
MNIST [35]     Image  70K    28x28       10
CIFAR [32]     Image  60K    32x32x3     10
ImageNet [50]  Image  1.26M  299x299x3   1000
Speech [23]    Sound  6300   5 sec.      39

Table 1: Datasets: The collection of real-world benchmark datasets used in the experiments.

Unless otherwise noted, all experiments were conducted on a single server. All machines used in the experiments contain 2 Intel Haswell-EP CPUs and 256 GB of RAM running Ubuntu 14.04 on Linux 4.2.0. TensorFlow models were executed on an NVIDIA Tesla K20c GPU with 5 GB of GPU memory and 2496 cores. In the scaling experiment described in Figure 6, the servers in the cluster were connected with both a 10Gbps and a 1Gbps network. For each network, all the servers were located on the same switch. Both network configurations were investigated.

3 System Architecture

Clipper is divided into model selection and model abstraction layers (see Figure 1). The model abstraction layer is responsible for providing a common prediction interface, resource isolation, and optimizing the query workload for batch-oriented machine learning frameworks. The model selection layer is responsible for dispatching queries to one or more models and combining their outputs based on feedback to improve accuracy, estimate uncertainty, and provide robust predictions.

Before presenting the detailed Clipper system design, we first describe the path of a prediction request through the system. Applications issue prediction requests to Clipper through application-facing REST or RPC APIs. Prediction requests are first processed by the model selection layer. Based on properties of the prediction request and recent feedback, the model selection layer dispatches the prediction request to one or more of the models through the model abstraction layer.

The model abstraction layer first checks the prediction cache for the query before assigning the query to an adaptive batching queue associated with the desired model. The adaptive batching queue constructs batches of queries that are tuned for the machine learning framework and model. A cross-language RPC is used to send the batch of queries to a model container hosting the model in its native machine learning framework. To simplify deployment, we host each model container in a separate Docker container. After evaluating the model on the batch of queries, the predictions are sent back to the model abstraction layer, which populates the prediction cache and returns the results to the model selection layer.

The model selection layer then combines one or more of the predictions to render a final prediction and confidence estimate and replies to the end-user application. Any feedback the application collects about the quality of the predictions is sent back to the model selection layer through the same application-facing REST/RPC interface. The model selection layer joins this feedback with the corresponding predictions in the prediction cache to improve how it selects and combines future predictions.

We now present the model abstraction layer and the model selection layer in greater detail.

4 Model Abstraction Layer

The Model Abstraction Layer provides a common interface across machine learning frameworks. The model abstraction layer (Figure 1) is composed of a prediction cache, an adaptive query-batching component, and a set of model containers connected to Clipper via a lightweight RPC system. This modular architecture enables caching and batching mechanisms to be shared across frameworks while also improving scaling to more concurrent models and simplifying the addition of new frameworks.

4.1 Overview

At the top of the model abstraction layer is the prediction cache (Section 4.2). The prediction cache provides a partial pre-materialization mechanism for frequent queries and accelerates the adaptive model selection techniques described in Section 5 by enabling efficient joins between recent predictions and feedback.

The batching component (Section 4.3) sits below the prediction cache and translates point queries into mini-batches that are dynamically resized for each model container to maximize throughput. Once a mini-batch is constructed for a given model, it is dispatched via the RPC system to the container for evaluation.

Models deployed in Clipper are each encapsulated within their own lightweight container (Section 4.4), communicating with Clipper through an RPC mechanism that provides a uniform interface to Clipper and simplifies the deployment of new models. The lightweight RPC system minimizes the overhead of the container-based architecture and simplifies cross-language integration.

In the following sections we describe each of these components in greater detail and discuss some of the key algorithmic innovations associated with each.

4.2 Caching

For many applications (e.g., content recommendation), predictions concerning popular items are requested frequently. By maintaining a prediction cache, Clipper can serve these frequent queries without evaluating the model. This substantially reduces latency and system load by eliminating the additional cost of model evaluation.

In addition, caching in Clipper serves an important role in model selection (Section 5). To select models intelligently, Clipper needs to join the original predictions with any feedback it receives. Since feedback is likely to return soon after predictions are rendered [39], even infrequent queries can benefit from caching. For example, even with a small ensemble of four models (a random forest, logistic regression model, and linear SVM trained in Scikit-Learn and a linear SVM trained in Spark), prediction caching increased feedback processing throughput in Clipper by 1.6x, from 6707 to 10928 observations per second.

The prediction cache acts as a function cache for the generic prediction function:

Predict(m: ModelId, x: X) -> y: Y

that takes a model id m along with the query x and computes the corresponding model prediction y. The cache exposes a simple non-blocking request and fetch API. When a prediction is needed, the request function is invoked, which notifies the cache to compute the prediction if it is not already present and returns a boolean indicating whether the entry is in the cache. The fetch function checks the cache and returns the query result if present.
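For illustration, the following minimal Rust sketch shows the shape of this non-blocking request/fetch API using a single-threaded in-memory map. The key and value types (CacheKey, f64 outputs) are illustrative assumptions; Clipper's actual cache is concurrent and adds the eviction policy described below.

use std::collections::HashMap;

#[derive(Hash, PartialEq, Eq, Clone)]
struct CacheKey {
    model_id: u32,
    input: Vec<u8>, // serialized query
}

enum Entry {
    Pending,    // requested but not yet computed
    Ready(f64), // computed prediction (illustrative output type)
}

struct PredictionCache {
    entries: HashMap<CacheKey, Entry>,
}

impl PredictionCache {
    fn new() -> Self {
        PredictionCache { entries: HashMap::new() }
    }

    // Returns true if the prediction is already cached; otherwise marks the
    // entry as pending so the batching layer will schedule the computation.
    fn request(&mut self, key: CacheKey) -> bool {
        match self.entries.get(&key) {
            Some(Entry::Ready(_)) => true,
            Some(Entry::Pending) => false,
            None => {
                self.entries.insert(key, Entry::Pending);
                false
            }
        }
    }

    // Non-blocking fetch: returns the result only if it is present.
    fn fetch(&self, key: &CacheKey) -> Option<f64> {
        match self.entries.get(key) {
            Some(Entry::Ready(y)) => Some(*y),
            _ => None,
        }
    }

    // Invoked by the model abstraction layer when a batch completes.
    fn complete(&mut self, key: CacheKey, y: f64) {
        self.entries.insert(key, Entry::Ready(y));
    }
}

fn main() {
    let mut cache = PredictionCache::new();
    let key = CacheKey { model_id: 1, input: vec![1, 2, 3] };
    assert!(!cache.request(key.clone())); // miss: schedules the computation
    cache.complete(key.clone(), 0.87);
    assert_eq!(cache.fetch(&key), Some(0.87));
}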

Clipper employs an LRU eviction policy for the prediction cache, using the standard CLOCK [16] cache eviction algorithm. With an adequately sized cache, frequent queries will not be evicted and the cache serves as a partial pre-materialization mechanism for hot items. However, because adaptive model selection occurs above the cache in Clipper, changes in predictions due to model selection do not invalidate cache entries.

4.3 Batching

The Clipper batching component transforms the stream of prediction queries received by Clipper into batches of queries that more closely match the workload assumptions made by machine learning frameworks, while simultaneously amortizing RPC and system overheads. Batching improves throughput, but it does so at the expense of increased latency by requiring all queries in the batch to complete before returning a single prediction.


Figure 3: Model Container Latency Profiles: We measured the batching latency profile of several models trained on the MNIST benchmark dataset. The models were trained using Scikit-Learn (SKLearn) or Spark and were chosen to represent several of the most widely used types of models. The No-Op container measures the system overhead of the model containers and RPC system.

We exploit an explicitly stated latency service level objective (SLO) to increase latency in exchange for substantially improved throughput. By allowing users to specify a latency objective, Clipper is able to tune batched query evaluation to maximize throughput while still meeting the latency requirements of interactive applications. For example, requesting predictions in sufficiently large batches can improve throughput by up to 26x (the Scikit-Learn SVM in Figure 4) while meeting a 20ms latency SLO.

Batching increases throughput via two mechanisms. First, batching amortizes the cost of RPC calls and internal framework overheads such as copying inputs to GPU memory. Second, batching enables machine learning frameworks to exploit existing data-parallel optimizations by performing batch inference on many inputs simultaneously (e.g., by using the GPU or BLAS acceleration).

As the model selection layer dispatches queries for model evaluation, they are placed on queues associated with model containers. Each model container has its own adaptive batching queue tuned to the latency profile of that container and a corresponding thread to process predictions. Predictions are processed in batches by removing as many queries as possible from a queue, up to the maximum batch size for that model container, and sending the queries as a single batch prediction RPC to the container for evaluation. Clipper imposes a maximum batch size to ensure that latency objectives are met and to avoid excessively delaying the first queries in the batch.
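A minimal sketch of this per-container batching loop is shown below: drain up to the container's maximum batch size from its queue and dispatch the result as one batch. The Query type and the dispatch_batch function are stand-ins for Clipper's internal types and batch prediction RPC.

use std::collections::VecDeque;

struct Query {
    id: u64,
    input: Vec<f64>,
}

// Placeholder for the batch prediction RPC to the model container.
fn dispatch_batch(batch: &[Query]) {
    println!("dispatching batch of {} queries", batch.len());
}

fn process_queue(queue: &mut VecDeque<Query>, max_batch_size: usize) {
    while !queue.is_empty() {
        // Remove as many queries as possible, up to the maximum batch size.
        let take = queue.len().min(max_batch_size);
        let batch: Vec<Query> = queue.drain(..take).collect();
        dispatch_batch(&batch);
    }
}

fn main() {
    let mut queue: VecDeque<Query> = (0u64..10)
        .map(|id| Query { id, input: vec![0.0; 4] })
        .collect();
    process_queue(&mut queue, 4); // dispatches batches of sizes 4, 4, and 2
}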

Frameworks that leverage GPU acceleration such as TensorFlow often enforce static batch sizes to maintain a consistent data layout across evaluations of the model. These frameworks typically encode the batch size directly into the model definition in order to fully exploit GPU parallelism. When rendering fewer predictions than the batch size, the input must be padded to reach the defined size, reducing model throughput without any improvement in prediction latency. Careful tuning of the batch size should be done to maximize inference performance, but this tuning must be done offline and is fixed by the time a model is deployed.

However, most machine learning frameworks can efficiently process variable-sized batches at serving time. Yet differences between the framework implementation and the choice of model and inference algorithm can lead to orders of magnitude variation in model throughput and latency. As a result, the latency profile (the expected time to evaluate a batch of a given size) varies substantially between model containers. For example, in Figure 3 we see that the maximum batch size that can be executed within a 20ms latency SLO differs by 241x between the linear SVM, which does a very simple vector-vector multiply to perform inference, and the kernel SVM, which must perform a sequence of expensive nearest-neighbor calculations to evaluate the kernel. As a consequence, the linear SVM can achieve throughput of nearly 30,000 qps while the kernel SVM is limited to 200 qps under this SLO. Moreover, instead of requiring application developers to manually tune the batch size for each new model, Clipper employs a simple adaptive batching scheme to dynamically find and adapt the maximum batch size.

4.3.1 Dynamic Batch Size

We define the optimal batch size as the batch size that maximizes throughput subject to the constraint that the batch evaluation latency is under the target SLO. To automatically find the optimal maximum batch size for each model container, we employ an additive-increase-multiplicative-decrease (AIMD) scheme. Under this scheme, we additively increase the batch size by a fixed amount until the latency to process a batch exceeds the latency objective. At this point, we perform a small multiplicative backoff, reducing the batch size by 10%. Because the optimal batch size does not fluctuate substantially, we use a much smaller backoff constant than other additive-increase, multiplicative-decrease schemes [14].

Early performance measurements (Figure 3) suggested a stable linear relationship between batch size and latency across several of the modeling frameworks. As a result, we also explored the use of quantile regression to estimate the 99th-percentile (P99) latency as a function of batch size and set the maximum batch size accordingly. We compared the two approaches on a range of commonly used Spark and Scikit-Learn models in Figure 4. Both strategies provide significant performance improvements over the baseline strategy of no batching, achieving up to a 138x throughput increase in the case of the Scikit-Learn linear SVM, demonstrating the performance gains that batching provides. While the two batching strategies perform nearly identically, the AIMD scheme is significantly simpler and easier to tune. Furthermore, the ongoing adaptivity of the AIMD strategy makes it robust to changes in the throughput capacity of a model (e.g., during a garbage collection pause in Spark). As a result, Clipper employs the AIMD scheme as the default.

Figure 4: Comparison of Dynamic Batching Strategies
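The following Rust sketch illustrates the AIMD adaptation described above. The additive step, the 10% backoff, and the simulated per-batch latency in main are illustrative assumptions rather than Clipper's tuned constants.

struct AimdBatchSizer {
    max_batch_size: usize,
    slo_micros: u64,
    additive_step: usize,
}

impl AimdBatchSizer {
    // Called after each batch with the observed processing latency.
    fn update(&mut self, observed_latency_micros: u64) {
        if observed_latency_micros > self.slo_micros {
            // Latency objective exceeded: multiplicative backoff of 10%.
            let reduced = (self.max_batch_size as f64 * 0.9) as usize;
            self.max_batch_size = reduced.max(1);
        } else {
            // Still under the SLO: additively probe a larger batch size.
            self.max_batch_size += self.additive_step;
        }
    }
}

fn main() {
    let mut sizer = AimdBatchSizer {
        max_batch_size: 1,
        slo_micros: 20_000, // a 20ms latency SLO
        additive_step: 2,
    };
    for _ in 0..500 {
        // Assume (for the sketch) a batch of size b costs 200us + 45us per query.
        let simulated_latency = 200 + 45 * sizer.max_batch_size as u64;
        sizer.update(simulated_latency);
    }
    println!("adapted max batch size: {}", sizer.max_batch_size);
}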

4.3.2 Delayed Batching

Under moderate or bursty loads, the batching queue may contain fewer queries than the maximum batch size when the next batch is ready to be dispatched. For some models, briefly delaying the dispatch to allow more queries to arrive can significantly improve throughput under bursty loads. Similar to the motivation for Nagle's algorithm [45], the gain in efficiency is a result of the ratio of the fixed cost of sending a batch to the variable cost of increasing the size of a batch.

In Figure 5, we compare the gain in efficiency (measured as increased throughput) from delayed batching for two models. Delayed batching provides no increase in throughput for the Spark SVM because Spark is already relatively efficient at processing small batch sizes and can keep up with the moderate input workload using batches much smaller than the optimal batch size. In contrast, the Scikit-Learn SVM has a high fixed cost for processing a batch but employs BLAS libraries to do efficient parallel inference on many inputs at once. As a consequence, a 2ms batch delay provides a 3.3x improvement in throughput and allows the Scikit-Learn model container to keep up with the throughput demand while remaining well below the 10-20ms latency objectives needed for interactive applications.

Figure 5: Increase in update throughput from caching predictions. (Panels plot throughput, latency (µs), and batch size against the batch wait timeout (µs) for the Spark SVM and Scikit-Learn SVM.)
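A minimal sketch of the delayed-batching decision is shown below: hold a partially full batch until either a full batch is available or the configured batch delay has elapsed. The 2ms delay mirrors the experiment above; the function signature itself is an illustrative assumption.

use std::time::{Duration, Instant};

fn should_dispatch(
    queued: usize,
    max_batch_size: usize,
    oldest_enqueue_time: Instant,
    batch_delay: Duration,
) -> bool {
    // Dispatch immediately when a full batch is available; otherwise wait
    // until the oldest queued query has been delayed for `batch_delay`.
    queued >= max_batch_size || oldest_enqueue_time.elapsed() >= batch_delay
}

fn main() {
    let enqueued_at = Instant::now();
    let delay = Duration::from_micros(2_000); // e.g., a 2ms batch delay
    // With only 3 of 128 queries queued and no delay elapsed, keep waiting.
    println!("{}", should_dispatch(3, 128, enqueued_at, delay));
}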

4.4 Model Containers

Model containers encapsulate the diversity of machine learning frameworks and model implementations within a uniform “narrow waist” remote prediction API. To add a new type of model to Clipper, model builders only need to implement the standard batch prediction interface in Listing 1. Clipper includes language-specific bindings with RPC implementations for C++, Java, and Python. The model container implementations for most of the models in this paper required only a few lines of code.

interface Predictor {
    def predict_batch(inputs: List<X>) -> List<List<Y>>;
}

Listing 1: Common Batch Prediction Interface for Model Containers. Called through the RPC interface to compute the predictions for a batch of inputs. The return type is a nested list because each input may produce multiple outputs.
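To make the interface concrete, the following compilable Rust sketch shows a container implementing a predict_batch method analogous to Listing 1, using Vec in place of List. The NoOpContainer mirrors the no-op container used to measure system overhead; a real container would invoke its native framework on the whole batch.

trait Predictor<X, Y> {
    // Compute predictions for a batch of inputs. Each input may produce
    // multiple outputs, hence the nested return type.
    fn predict_batch(&self, inputs: &[X]) -> Vec<Vec<Y>>;
}

struct NoOpContainer;

impl Predictor<Vec<f64>, f64> for NoOpContainer {
    fn predict_batch(&self, inputs: &[Vec<f64>]) -> Vec<Vec<f64>> {
        // Return a constant prediction per input without evaluating a model.
        inputs.iter().map(|_| vec![1.0]).collect()
    }
}

fn main() {
    let container = NoOpContainer;
    let batch = vec![vec![0.1, 0.2], vec![0.3, 0.4]];
    let predictions = container.predict_batch(&batch);
    assert_eq!(predictions.len(), 2);
}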

To achieve process isolation, each model is managed in a separate Docker container. By placing models in separate containers, Clipper ensures that variability in the performance and stability of relatively immature state-of-the-art machine learning frameworks does not directly interfere with the overall availability of the Clipper serving system. Furthermore, CPU- or memory-intensive machine learning frameworks can be replicated across multiple machines or given access to specialized hardware (e.g., GPUs), freeing resources for Clipper or other deployed containers.

4.4.1 Container Replica Scaling

Clipper supports replicating model containers, both locally and across a cluster, to improve throughput and leverage additional hardware accelerators. If a model container has more than one replica associated with it, requests are randomly load balanced across all replicas using the power of two choices algorithm [41]. Because different replicas can have different performance characteristics, particularly when spread across a cluster, Clipper performs adaptive batching independently for each replica.
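A minimal sketch of the power of two choices rule is shown below: sample two replicas and route the query to the one with the shorter queue. Passing the two sampled indices as arguments (rather than drawing them from an RNG) is an illustrative simplification.

fn choose_replica(queue_lengths: &[usize], r1: usize, r2: usize) -> usize {
    // r1 and r2 are two independently sampled replica indices.
    let a = r1 % queue_lengths.len();
    let b = r2 % queue_lengths.len();
    if queue_lengths[a] <= queue_lengths[b] { a } else { b }
}

fn main() {
    let queue_lengths = vec![12, 3, 7, 9];
    // With samples 0 and 2, replica 2 wins because its queue is shorter.
    assert_eq!(choose_replica(&queue_lengths, 0, 2), 2);
}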

In Figure 6 we demonstrate the linear throughput scaling that Clipper can achieve by replicating model containers across a cluster. With a four-node GPU cluster connected through a 10Gbps Ethernet switch, Clipper gets a 3.95x throughput increase, from 19,500 qps when using a single model container running on a local GPU to 77,120 qps when using four replicas each running on a different machine. Because the model containers in this experiment are expensive and run on the GPU, GPU throughput is the bottleneck and Clipper's RPC system can easily saturate the GPUs. However, when the cluster is connected through a 1Gbps switch, the aggregate throughput of the GPUs is higher than 1Gbps and so the network becomes saturated when replicating to a second remote machine. As machine-learning applications begin to consume increasingly bigger inputs, scaling from hand-crafted features to large images, audio signals, or even video, the network will continue to be a bottleneck to scaling out prediction serving applications. This suggests the need for research into efficient networking strategies for remote predictions on large inputs.

Figure 6: Scaling the Model Abstraction Layer to Remote GPUs. The solid lines refer to the aggregate throughput of all the model replicas and the dashed lines refer to the per-replica average throughput. The first GPU in this cluster is located on the same machine that Clipper is running on. Subsequent GPUs are each located on separate machines on the same switch.

5 Model Selection Layer

The Model Selection Layer uses feedback to dynamically select one or more of the deployed models and combine their outputs to provide more accurate and robust predictions. By allowing many candidate models to be deployed simultaneously and relying on feedback to adaptively determine the best model or combination of models, the model selection layer simplifies the deployment process and provides a mechanism to automatically compensate for failing models. By combining multiple predictions from different models, the model selection layer boosts accuracy and estimates prediction confidence.

While there are a wide range of techniques for model selection and composition, most can be reduced to a simple select, combine, and observe API. We capture this API in the model selection policy (Listing 2), which governs the behavior of the model selection layer and allows users to introduce new model selection techniques.

The model selection policy (Listing 2) defines four essential functions as well as a few basic types.

trait SelectionPolicy<S, X, Y> {
    fn init() -> S;
    fn select(s: S, x: X) -> List<ModelId>;
    fn combine(s: S, x: X, pred: List<(ModelId, Y)>) -> (Y, f64);
    fn observe(s: S, x: X, feedback: Y, pred: List<(ModelId, Y)>) -> S;
}

Listing 2: Model Selection Policy Interface in Rust

In addition to the query and prediction types X and Y, the state type S encodes the learned state of the selection algorithm. The init function returns an initial instance of the selection policy state. We isolate the selection policy state and require an initialization function to enable Clipper to instantiate multiple instances of the selection policy for fine-grained contextualized model selection (Section 5.3). The select and combine functions are responsible for choosing which models to query and how to combine the results. In addition, the combine function can provide a prediction confidence score (see Section 5.2.1). Finally, the observe function is used to update the state S based on feedback from front-end applications.
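The following compilable sketch restates the interface with concrete stand-in types (Vec in place of List, f64 outputs, and a fixed model count) and implements a toy policy that queries every deployed model and averages the results. It is only meant to show how the four functions fit together; Clipper's real policies (Exp3 and Exp4) are described next.

type ModelId = u32;
type Input = Vec<f64>;
type Output = f64;
type State = Vec<f64>; // per-model bookkeeping (unused by this toy policy)

const NUM_MODELS: usize = 3; // illustrative number of deployed models

trait SelectionPolicy {
    fn init() -> State;
    fn select(s: &State, x: &Input) -> Vec<ModelId>;
    fn combine(s: &State, x: &Input, pred: &[(ModelId, Output)]) -> (Output, f64);
    fn observe(s: State, x: &Input, feedback: Output, pred: &[(ModelId, Output)]) -> State;
}

struct QueryAllPolicy;

impl SelectionPolicy for QueryAllPolicy {
    fn init() -> State {
        vec![1.0; NUM_MODELS]
    }

    fn select(_s: &State, _x: &Input) -> Vec<ModelId> {
        (0..NUM_MODELS as ModelId).collect() // dispatch to every model
    }

    fn combine(_s: &State, _x: &Input, pred: &[(ModelId, Output)]) -> (Output, f64) {
        // Average the available predictions and report the fraction of
        // models that responded as a crude confidence score.
        let avg = pred.iter().map(|(_, y)| *y).sum::<f64>() / pred.len() as f64;
        (avg, pred.len() as f64 / NUM_MODELS as f64)
    }

    fn observe(s: State, _x: &Input, _feedback: Output, _pred: &[(ModelId, Output)]) -> State {
        s // this toy policy does not learn from feedback
    }
}

fn main() {
    let s = QueryAllPolicy::init();
    let x: Input = vec![0.2, 0.8];
    let models = QueryAllPolicy::select(&s, &x);
    let preds: Vec<(ModelId, Output)> = models.iter().map(|&m| (m, 0.5)).collect();
    let (y, conf) = QueryAllPolicy::combine(&s, &x, &preds);
    println!("prediction {:.2} with confidence {:.2}", y, conf);
    let _s = QueryAllPolicy::observe(s, &x, 1.0, &preds);
}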

In the current implementation of Clipper we provide two generic model selection policies based on robust bandit algorithms developed by Auer et al. [6]. These algorithms span a trade-off between computation overhead and accuracy. The single model selection policy (Section 5.1) leverages the Exp3 algorithm to optimally select the best model based on noisy feedback. The ensemble model selection policy (Section 5.2) is based on the Exp4 algorithm, which adaptively combines the predictions to improve prediction accuracy and estimate confidence.

5.1 Single Model Selection Policy

We can cast the model-selection process as a multi-armed bandit problem [44]. The multi-armed bandit problem¹ refers to the task of optimally choosing between k possible actions (e.g., models), each with a stochastic reward (e.g., feedback). Because only the reward for the selected action can be observed, solutions to the multi-armed bandit problem must address the trade-off between exploring possible actions and exploiting the estimated best action.

There are numerous algorithms for the multi-armed bandit problem with a wide range of trade-offs. In this work we use the simple randomized Exp3 [6] algorithm, which makes few assumptions about the problem setting and has strong optimality guarantees. The Exp3 algorithm associates a weight $s_i = 1$ with each of the $k$ deployed models and then randomly selects model $i$ with probability $p_i = s_i / \sum_{j=1}^{k} s_j$. For each prediction $\hat{y}$, Clipper observes a loss $L(y, \hat{y}) \in [0,1]$ with respect to the true value $y$ (e.g., the fraction of words a speech recognition model transcribed correctly). The Exp3 algorithm then updates the weight $s_i \leftarrow s_i \exp(-\eta L(y, \hat{y}) / p_i)$ corresponding to the selected model $i$. The constant $\eta$ determines how quickly Clipper responds to feedback.

¹ The term bandits refers to pull-lever slot machines found in casinos.
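For illustration, the following Rust sketch implements the selection and update rules as stated above: sample model i with probability p_i and scale the selected weight by exp(-eta * L / p_i). The deterministic pseudo-random sequence and the per-model losses are illustrative stand-ins for a real RNG and real feedback.

fn select_model(weights: &[f64], r: f64) -> usize {
    // r is a uniform sample in [0, 1).
    let total: f64 = weights.iter().sum();
    let mut cumulative = 0.0;
    for (i, w) in weights.iter().enumerate() {
        cumulative += *w / total;
        if r < cumulative {
            return i;
        }
    }
    weights.len() - 1
}

fn observe_feedback(weights: &mut [f64], chosen: usize, loss: f64, eta: f64) {
    // Exponential update applied only to the selected model's weight.
    let total: f64 = weights.iter().sum();
    let p = weights[chosen] / total;
    weights[chosen] *= (-eta * loss / p).exp();
}

fn main() {
    let mut weights = vec![1.0_f64; 3]; // three deployed models
    let typical_loss = [0.8, 0.4, 0.1]; // model 2 is the most accurate
    let mut r = 0.4321_f64;
    for _ in 0..500 {
        let i = select_model(&weights, r);
        observe_feedback(&mut weights, i, typical_loss[i], 0.1);
        r = (r * 7919.0 + 0.2137).fract(); // toy stand-in for a uniform RNG
    }
    println!("learned weights: {:?}", weights);
}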

The Exp3 algorithm provides several benefits over manual experimentation and A/B testing, two common ways of performing model selection in practice. Exp3 is both simple and robust, scaling well to model selection over a large number of models. It is a lightweight algorithm that requires only a single model evaluation for each prediction and thus performs well under heavy loads. And Exp3 has strong theoretical guarantees that ensure it will quickly converge to an optimal solution.

5.2 Ensemble Model Selection Policies

It is a well-known result in machine learning [8, 11, 30, 44] that prediction accuracy can be improved by combining predictions from multiple models. Rather than select individual models, the ensemble model selection policies adaptively combine the predictions from all available models to improve accuracy.

In Clipper we use linear ensemble methods which compute a weighted average of the base model predictions. In Figure 7, we show the prediction error rate of linear ensembles on two benchmarks. In both cases linear ensembles are able to marginally reduce the overall error rate. In the ImageNet benchmark, the ensemble formulation achieves a 5.2% relative reduction in the error rate simply by combining off-the-shelf models (Table 2).

Framework    Model           Size (Layers)
Caffe        VGG [53]        13 Conv. and 3 FC
Caffe        GoogLeNet [56]  96 Conv. and 5 FC
Caffe        ResNet [29]     151 Conv. and 1 FC
Caffe        CaffeNet [21]   5 Conv. and 3 FC
TensorFlow   Inception [57]  6 Conv., 1 FC, and 3 Incept.

Table 2: Deep Learning Models: The set of deep learning models used to evaluate the ImageNet ensemble selection policy.

Figure 7: Ensemble Prediction Accuracy: The linear ensembles are composed of five computer vision models (Table 2) applied to the CIFAR and ImageNet benchmarks. The 4-agree and 5-agree groups correspond to ensemble predictions in which the queries have been separated by the ensemble prediction confidence (four or five models agree) and the width of each bar defines the proportion of examples in that category.

There are many methods for estimating the ensemble weights, including linear regression, boosting [44], and bandit formulations. We adopt the bandit approach and use the Exp4 algorithm [6] to learn the weights. Exp4 confers many of the same guarantees as Exp3 while improving prediction accuracy as the number of models increases. Unlike Exp3, Exp4 constructs a weighted combination of all of the base model predictions and updates weights based on the individual model prediction error.
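As a rough illustration (not the full Exp4 algorithm, which handles expert advice and importance weighting), the sketch below combines base model predictions with a weighted average and exponentially down-weights each model according to its own loss on the observed feedback. The learning rate and the clamped squared-error loss are assumptions for the example.

fn combine(weights: &[f64], predictions: &[f64]) -> f64 {
    // Weighted average of the base model predictions.
    let total: f64 = weights.iter().sum();
    weights
        .iter()
        .zip(predictions)
        .map(|(w, y)| (w / total) * y)
        .sum()
}

fn observe(weights: &mut [f64], predictions: &[f64], truth: f64, eta: f64) {
    for (w, y) in weights.iter_mut().zip(predictions) {
        // Per-model loss in [0, 1]; here a clamped squared error.
        let loss = ((y - truth) * (y - truth)).min(1.0);
        *w *= (-eta * loss).exp();
    }
}

fn main() {
    let mut weights = vec![1.0_f64; 3];
    // Model 0 is biased high, model 1 is mediocre, model 2 is accurate.
    for _ in 0..200 {
        let predictions = [0.9, 0.6, 0.52];
        let truth = 0.5;
        let _combined = combine(&weights, &predictions);
        observe(&mut weights, &predictions, truth, 0.5);
    }
    println!("ensemble weights: {:?}", weights);
}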

To evaluate how the model selection policies perform in the presence of changes in deployed model accuracy, we simulated a model failure while receiving real-time prediction feedback. Using the CIFAR dataset we trained five different Caffe models with varying levels of accuracy. Using a simulated run of 20K sequential queries with immediate feedback, we injected a model failure in the best-performing model (model 5) after 5K queries and then allowed model 5 to recover after 10K queries.

In Figure 8 we plot the cumulative average error rate for each of the five base models as well as the single (Exp3) and ensemble (Exp4) model selection policies. In the first 5K queries the model selection policies quickly converge to an error rate near that of the best performing model (model 5). When we degrade the predictions from model 5, its cumulative error rate spikes. The model selection policies are able to quickly mitigate the consequences of the spike in errors by learning to divert queries to the other models. When model 5 recovers after 10K queries, the model selection policies also begin to improve by gradually sending queries back to model 5.

Figure 8: Behavior of Exp3 and Exp4 Under Model Failure: After 5K queries the lowest-error model suffers a failure, and after 10K queries it recovers. Exp3 and Exp4 quickly compensate for the failure and achieve lower error than any static model selection.

5.2.1 Robust Predictions

By evaluating predictions from multiple competing models concurrently, we can obtain an estimator of the confidence in our predictions. In settings where models have high variance or are trained on random samples from the training data (e.g., bagging [44]), agreement in model predictions is an indicator of prediction confidence. When evaluating the combine function in the ensemble selection policy, we compute a measure of confidence by calculating the number of models that agree with the final prediction. End-user applications can use this confidence score to decide whether to rely on the prediction. If we only consider predictions where multiple models agree, we can substantially reduce the error rate (see Figure 7) while declining to predict a small fraction of queries.

5.2.2 Straggler Mitigation

While the ensemble model selection policy can improve prediction accuracy and help quantify uncertainty, it introduces additional system costs. As we increase the size of the ensemble, the cost of rendering a prediction increases. Fortunately, we can compensate for the increased prediction cost by scaling out the model abstraction layer. Unfortunately, as we add model containers we increase the chance of stragglers adversely affecting tail latencies.

To evaluate the cost of stragglers, we deployed ensembles of increasing size and measured the resulting prediction latency (Figure 9a) under moderate query load. Even with small ensembles we observe the effect of stragglers on the P99 tail latency, which rises sharply to well beyond the 20ms latency objective. As the size of the ensemble increases beyond 10 and the system becomes more heavily loaded, stragglers begin to affect the mean latency.

Figure 9: Increase in stragglers from bigger ensembles: The (a) latency and (b) missing predictions when using the ensemble model selection policy on Scikit-Learn random forest models applied to MNIST. (c) The prediction accuracy of the TIMIT speech recognition benchmark as a function of the number of stragglers. As the size of an ensemble grows, the cost of blocking a response to the application until all predictions are available grows substantially. Instead, Clipper enforces bounded latency predictions and transforms the latency cost of waiting for stragglers into a reduction in accuracy from using a smaller ensemble.

To address stragglers, Clipper introduces a simple best-effort straggler-mitigation strategy. For each query the model selection layer maintains a latency deadline determined by the latency SLO. At the latency deadline the combine function of the model selection policy is invoked with the subset of the predictions that are available. The model selection policy must render a final prediction using only the available base model predictions and communicate the potential loss in accuracy in its confidence rating. Currently, we substitute missing predictions with their average value and define the confidence as the fraction of models that agree on the prediction.

The best-effort straggler-mitigation strategy prevents model container tail latencies from propagating to front-end applications by maintaining the latency objective as additional models are deployed. However, the straggler mitigation strategy reduces the size of the ensemble. In Figure 9b we plot the reduction in ensemble size and find that, up to the capacity of the machine (roughly 10 models), most of the predictions arrive by the latency deadline. In Figure 9c we plot the consequence of stragglers on the prediction accuracy for the TIMIT speech recognition task. Notably, the loss of three predictions at random results in less than a 5% decrease in accuracy.
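The combine-at-deadline behavior described above can be sketched as follows: substitute stragglers with the average of the predictions that arrived and report confidence as the fraction of ensemble members that agree with the final prediction. The numeric output type and the agreement tolerance are illustrative assumptions.

fn combine_at_deadline(arrived: &[Option<f64>], tolerance: f64) -> (f64, f64) {
    let available: Vec<f64> = arrived.iter().filter_map(|p| *p).collect();
    let fill = available.iter().sum::<f64>() / available.len() as f64;

    // Substitute missing (straggling) predictions with the average of the
    // available ones, then combine with a plain average.
    let combined: f64 =
        arrived.iter().map(|p| p.unwrap_or(fill)).sum::<f64>() / arrived.len() as f64;

    // Confidence: fraction of ensemble members whose prediction agrees with
    // the combined value; stragglers cannot count as agreeing.
    let agreeing = available
        .iter()
        .filter(|&&y| (y - combined).abs() <= tolerance)
        .count();
    (combined, agreeing as f64 / arrived.len() as f64)
}

fn main() {
    // A five-model ensemble in which two stragglers missed the deadline.
    let arrived = [Some(0.9), Some(1.0), None, Some(1.1), None];
    let (prediction, confidence) = combine_at_deadline(&arrived, 0.15);
    println!("prediction {:.2} with confidence {:.2}", prediction, confidence);
}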

5.3 Contextualization

In many prediction tasks the accuracy of a particular model may depend heavily on context. For example, in speech recognition a model trained for one dialect may perform well for some users and poorly for others. However, selecting the right model or composition of models can be difficult and is best accomplished in the model selection layer through feedback. To support context-specific model selection, the model selection layer can be configured to instantiate a unique model selection state for each user, context, or session. The context-specific session state is managed in an external database system. In our current implementation we use Redis.
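The per-context state lookup can be sketched as follows, with an in-process map standing in for the external store (Redis) that holds session state; the key and state types are illustrative.

use std::collections::HashMap;

type UserId = u64;
type SelectionState = Vec<f64>; // e.g., per-model weights for this user

struct ContextualSelection {
    num_models: usize,
    per_user_state: HashMap<UserId, SelectionState>,
}

impl ContextualSelection {
    // Fetch the user's selection state, initializing it on first contact.
    fn state_for(&mut self, user: UserId) -> &mut SelectionState {
        let n = self.num_models;
        self.per_user_state.entry(user).or_insert_with(|| vec![1.0; n])
    }
}

fn main() {
    let mut ctx = ContextualSelection {
        num_models: 4,
        per_user_state: HashMap::new(),
    };
    // Each user's feedback updates only that user's weights.
    ctx.state_for(42)[0] *= 0.5;
    ctx.state_for(7);
    println!("user 42 weights: {:?}", ctx.state_for(42));
}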

To demonstrate the potential gains from personalized model selection we hosted a collection of TIMIT [23] voice recognition models, each trained for a different dialect. We then evaluated (Figure 10) the prediction error rates using a single model trained on all dialects, the users' reported dialect model, and the Clipper ensemble selection policy. We observe that the ensemble selection policy is able to quickly identify a combination of models that outperforms even the users' designated dialect model.

6 System Comparison

In addition to the microbenchmarks presented in Section 4 and Section 5, we compared Clipper's performance to TensorFlow Serving and evaluated latency and throughput on three object recognition benchmarks.

TensorFlow Serving [58] is a recently released prediction serving system created by Google to accompany their TensorFlow machine learning training framework. Similar to Clipper, TensorFlow Serving is designed for serving machine learning models in production environments and provides a high-performance prediction API to simplify deploying new algorithms and experimenting with new models without modifying frontend applications. TensorFlow Serving supports general TensorFlow models with GPU acceleration through direct integration with the TensorFlow machine learning framework and tightly couples the model and serving components in the same process.

TensorFlow Serving also employs batching to accelerate prediction serving. Batch sizes in TensorFlow Serving are static, and it relies on a purely timeout-based mechanism to avoid starvation. TensorFlow Serving does not explicitly incorporate prediction latency objectives, which must instead be met by manually tuning the batch size. Furthermore, TensorFlow Serving was designed to serve one model at a time and therefore does not directly support feedback, dynamic model selection, or composition.

To better understand the performance overheads introduced by Clipper's layered architecture and decoupled model containers, we compared the serving performance of Clipper and TensorFlow Serving on three TensorFlow object recognition deep networks of varying computational cost: a 4-layer convolutional neural network trained on the MNIST dataset [43], the 8-layer AlexNet [33] architecture trained on CIFAR-10 [32], and Google's 22-layer Inception-v3 network [57] trained on ImageNet. We implemented two Clipper model containers for each TensorFlow model, one that calls TensorFlow from the more standard and widely used Python API and one that calls TensorFlow from the more efficient C++ API. All models were run on a GPU using hand-tuned batch sizes (MNIST: 512, CIFAR: 128, ImageNet: 16) to maximize the throughput of TensorFlow Serving.
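
For concreteness, the sketch below approximates what one of the Python model containers in this experiment might look like; it uses the TensorFlow 1.x-style Python Session API of the time, and the single batch-prediction method is a stand-in for Clipper's RPC plumbing, which is omitted.

```python
import tensorflow as tf

class TFModelContainer:
    # Hypothetical sketch of a Python model container wrapping a frozen
    # TensorFlow graph; the actual container RPC interface is omitted.
    def __init__(self, graph_def_path, input_tensor_name, output_tensor_name):
        graph_def = tf.GraphDef()
        with open(graph_def_path, "rb") as f:
            graph_def.ParseFromString(f.read())
        self.graph = tf.Graph()
        with self.graph.as_default():
            tf.import_graph_def(graph_def, name="")
        self.sess = tf.Session(graph=self.graph)
        self.inputs = self.graph.get_tensor_by_name(input_tensor_name)    # e.g. "images:0"
        self.outputs = self.graph.get_tensor_by_name(output_tensor_name)  # e.g. "logits:0"

    def predict_batch(self, batch):
        # One Session.run call per adaptively sized batch sent by Clipper.
        return self.sess.run(self.outputs, feed_dict={self.inputs: batch})
```

The C++ containers in the comparison play the same role through TensorFlow's C++ API, avoiding the Python overhead that the results below attribute to the Python containers.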

Despite Clipper's modular design, we are able to achieve comparable throughput to TensorFlow Serving across all three models (Figure 11).

Figure 11: TensorFlow Serving Comparison. Comparison of peak throughput and latency (p99 latencies shown in error bars) on three TensorFlow models of varying inference cost: (a) MNIST, (b) CIFAR-10, (c) ImageNet. Each panel reports throughput and mean latency (ms) for TensorFlow Serving, Clipper TF-C++, and Clipper TF-Python. TF-C++ uses TensorFlow's C++ API and TF-Python the Python API.

The Python model containers suffer a 15-18% performance hit compared to the throughput of TensorFlow Serving, but the C++ model containers achieve nearly identical performance. This suggests that the high-level Python API for TensorFlow imposes a significant performance cost in the context of low-latency prediction serving, but that Clipper does not impose any additional performance degradation. By achieving comparable performance across this range of models, we have demonstrated that, through careful design and implementation of the system, the modular architecture and substantially broader set of features in Clipper do not come at the cost of reduced performance on core prediction-serving tasks.

7 Limitations

While Clipper attempts to address many challenges in the context of prediction serving, there are a few key limitations when compared to other designs like TensorFlow Serving. Most of these limitations follow directly from the design of the Clipper architecture, which assumes models are below Clipper in the software stack and thus are treated as black-box components.

Clipper does not optimize the execution of the models within their respective machine learning frameworks. Slow models will remain slow when served from Clipper. In contrast, TensorFlow Serving is tightly integrated with model evaluation, and hence is able to leverage GPU acceleration and compilation techniques to speed up inference on models created with TensorFlow.

Similarly, Clipper does not manage the training or retraining of the base models within their respective frameworks. As a consequence, if all models are out-of-date or inaccurate, Clipper will be unable to improve accuracy beyond what can be accomplished through ensembles.


8 Related Work

Perhaps the closest projects to Clipper are LASER [3], Velox [18], and TensorFlow Serving [58]. The LASER system was developed at LinkedIn to support linear models for ad-targeting applications. Velox is a Berkeley research project to study personalized prediction serving with Apache Spark. TensorFlow Serving is the open-source prediction serving system developed by Google for TensorFlow models. In our experiments we only compare against TensorFlow Serving, because LASER is not publicly available, and the currently released prototype of Velox is incomplete and has limited functionality.

All three systems propose mechanisms to address latency and throughput. Both LASER and Velox utilize caching at various levels in their systems. In addition, LASER uses a straggler mitigation strategy to address slow feature evaluation. Neither LASER nor Velox discusses batching. Conversely, TensorFlow Serving does not discuss caching and instead leverages batching and hardware acceleration to improve throughput.

LASER and Velox both exploit a form of model decomposition to incorporate feedback and achieve contextualization. The model formulation used in LASER and Velox is similar to the linear ensembles in Clipper. However, LASER does not incorporate feedback in real time, and Velox does not support bandits or cross-framework learning. Moreover, the techniques used for online learning and contextualization in both of these systems are captured in the more general Clipper selection policy. In contrast, TensorFlow Serving has no mechanism to achieve personalization or adapt to real-time feedback.

Finally, LASER, Velox, and TensorFlow Serving are all vertically integrated; they focus on serving predictions from a single model or framework. In contrast, Clipper supports a wide range of machine learning models and frameworks and simultaneously addresses latency, throughput, and accuracy in a single serving system.

Application Specific Prediction Serving: There has been considerable prior work in application- and model-specific prediction serving. Much of this work has focused on content recommendation, including video recommendation [19], ad-targeting [26, 39], and product recommendations [37]. Outside of content recommendation, there has been recent success in speech recognition [36, 54] and internet-scale resource allocation [22]. While many of these applications require real-time predictions, the solutions described are highly application-specific and tightly coupled to the model and workload characteristics. As a consequence, much of this work solves the same systems challenges in different application areas. In contrast, Clipper is a general-purpose system capable of serving many of these applications.

Parameter Server: There has been considerable work in the learning systems community on parameter-servers [5, 20, 38, 61]. While parameter-servers do focus on reduced latency and caching, they do so in the context of model training. In particular, they are a specialized type of key-value store used to coordinate updates to model parameters in a distributed training system. They are not typically used to serve predictions.

General Serving Systems: The high-performance serving architecture of Clipper draws from prior work on highly concurrent serving systems [46, 47, 51, 60]. The division of functionality into vertical stages introduced by [60] is similar to the division of Clipper's architecture into independent layers. Notably, while the dominant cost in data-serving systems tends to be I/O, in prediction serving it is computation. This changes both physical resource allocation and batching and latency-hiding strategies.

9 Conclusion

In this work we identified three key challenges of prediction serving: latency, throughput, and accuracy, and proposed a new layered architecture that addresses these challenges by interposing between end-user applications and existing machine learning frameworks.

As an instantiation of this architecture, we introduced the Clipper prediction serving system. Clipper isolates end-user applications from the variability and diversity in machine learning frameworks by providing a common prediction interface. As a consequence, new machine learning frameworks and models can be introduced without modifying end-user applications.

We addressed the challenges of prediction serving latency and throughput within the Clipper Model Abstraction Layer. The model abstraction layer lifts caching and adaptive batching strategies above the machine learning frameworks to achieve up to a 26x improvement in throughput while maintaining strict bounds on tail latency and providing mechanisms to scale serving across a cluster. We addressed the challenges of accuracy in the Clipper Model Selection Layer. The model selection layer enables many models to be deployed concurrently and then dynamically selects and combines predictions from each model to render more robust, accurate, and contextualized predictions while mitigating the cost of stragglers.

We evaluated Clipper using four standard machine learning benchmark datasets spanning computer vision and speech recognition applications. We demonstrated Clipper's capacity to bound latency, scale heavy workloads across nodes, and provide accurate, robust, and contextual predictions. We compared Clipper to Google's TensorFlow Serving system and achieved parity on throughput and latency performance, demonstrating that the modular container-based architecture and substantial additional functionality in Clipper can be achieved with minimal performance penalty.


References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] A. Agarwal, S. Bird, M. Cozowicz, L. Hoang, J. Langford, S. Lee, J. Li, D. Melamed, G. Oshri, O. Ribas, et al. A multiworld testing decision service. arXiv preprint arXiv:1606.03966, 2016.

[3] D. Agarwal, B. Long, J. Traupman, D. Xin, and L. Zhang. LASER: A scalable response prediction platform for online advertising. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM '14, pages 173–182, 2014.

[4] S. Agarwal and J. R. Lorch. Matchmaking for online games and other latency-sensitive p2p systems. In ACM SIGCOMM Computer Communication Review, volume 39, pages 315–326. ACM, 2009.

[5] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In WSDM, pages 123–132, 2012.

[6] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, Jan. 2003.

[7] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), volume 4, page 3. Austin, TX, 2010.

[8] C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, Secaucus, NJ, USA, 2006.

[9] C. Chelba, D. Bikel, M. Shugrina, P. Nguyen, and S. Kumar. Large scale language modeling in automatic speech recognition. arXiv preprint arXiv:1210.8440, 2012.

[10] J. Chen, R. Monga, S. Bengio, and R. Jozefowicz. Revisiting distributed synchronous SGD. arXiv.org, Apr. 2016.

[11] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. arXiv.org, Mar. 2016.

[12] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

[13] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. Project Adam: Building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 571–582, 2014.

[14] D.-M. Chiu and R. Jain. Analysis of the increase and decrease algorithms for congestion avoidance in computer networks. Comput. Netw. ISDN Syst., 17(1):1–14, June 1989.

[15] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A MATLAB-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.

[16] F. J. Corbato. A paging experiment with the Multics system. 1968.

[17] Microsoft Cortana. https://www.microsoft.com/en-us/mobile/experiences/cortana/.

[18] D. Crankshaw, P. Bailis, J. E. Gonzalez, H. Li, Z. Zhang, M. J. Franklin, A. Ghodsi, and M. I. Jordan. The missing piece in complex analytics: Low latency, scalable model management and serving with Velox. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems, 2015.

[19] J. Davidson, B. Liebald, J. Liu, P. Nandy, T. Van Vleet, U. Gargi, S. Gupta, Y. He, M. Lambert, B. Livingston, and D. Sampath. The YouTube video recommendation system. RecSys, pages 293–296, 2010.

[20] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1223–1231. Curran Associates, Inc., 2012.

[21] J. Donahue. CaffeNet. https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet.

[22] A. Ganjam, F. Siddiqui, J. Zhan, X. Liu, I. Stoica, J. Jiang, V. Sekar, and H. Zhang. C3: Internet-scale control plane for video quality optimization. NSDI, pages 131–144, 2015.

[23] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM, 1993.

[24] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. OSDI '12, pages 17–30, 2012.

[25] Google Now. https://www.google.com/landing/now/.

[26] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. ICML, pages 13–20, 2010.

[27] H2O. http://www.h2o.ai.

[28] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[29] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[30] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv.org, Mar. 2015.


[31] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.

[32] A. Krizhevsky and G. Hinton. CIFAR-10 dataset. https://www.cs.toronto.edu/~kriz/cifar.html.

[33] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[34] J. Langford, L. Li, and A. Strehl. Vowpal Wabbit online learning project, 2007.

[35] Y. LeCun, C. Cortes, and C. J. Burges. MNIST handwritten digit database. 1998.

[36] X. Lei, A. W. Senior, A. Gruenstein, and J. Sorensen. Accurate and compact large vocabulary speech recognition on mobile devices. INTERSPEECH, pages 662–665, 2013.

[37] R. Lerallut, D. Gasselin, and N. Le Roux. Large-scale real-time product recommendation at Criteo. In the 9th ACM Conference, pages 232–232, New York, New York, USA, 2015. ACM Press.

[38] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, Broomfield, CO, Oct. 2014. USENIX Association.

[39] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica. Ad click prediction: a view from the trenches. In KDD '13, page 1222, 2013.

[40] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu, J. Freeman, D. Tsai, M. Amde, S. Owen, D. Xin, R. Xin, M. J. Franklin, R. Zadeh, M. Zaharia, and A. Talwalkar. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research, 17(34):1–7, 2016.

[41] M. Mitzenmacher. The power of two choices in randomized load balancing. IEEE Trans. Parallel Distrib. Syst., 12(10):1094–1104, Oct. 2001.

[42] V. Mnih, N. Heess, A. Graves, et al. Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pages 2204–2212, 2014.

[43] Deep MNIST for Experts. https://www.tensorflow.org/versions/r0.10/tutorials/mnist/pros/index.html.

[44] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

[45] J. Nagle. Congestion control in IP/TCP internetworks. SIGCOMM Comput. Commun. Rev., 14(4):11–17, Oct. 1984.

[46] nginx [engine x]. http://nginx.org/en/.

[47] V. S. Pai, P. Druschel, and W. Zwaenepoel. Flash: An efficient and portable Web server. USENIX Annual Technical Conference, General Track, pages 199–212, 1999.

[48] Portable Format for Analytics (PFA). http://dmg.org/pfa/index.html.

[49] PMML 4.2. http://dmg.org/pmml/v4-2-1/GeneralStructure.html.

[50] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[51] D. C. Schmidt. Reactor: An object behavioral pattern for concurrent event demultiplexing and dispatching. 1995.

[52] Scikit-learn: machine learning in Python. http://scikit-learn.org.

[53] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[54] Apple Siri. http://www.apple.com/ios/siri/.

[55] Skype real time translator. https://www.skype.com/en/features/skype-translator/.

[56] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

[57] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.

[58] TensorFlow Serving. https://tensorflow.github.io/serving.

[59] Turi. https://turi.com.

[60] M. Welsh, D. E. Culler, and E. A. Brewer. SEDA: An architecture for well-conditioned, scalable internet services. SOSP, pages 230–243, 2001.

[61] E. P. Xing, Q. Ho, W. Dai, J.-K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu. Petuum: A new platform for distributed machine learning on big data. In SIGKDD, KDD '15, pages 1335–1344. ACM, 2015.

[62] S. J. Young, G. Evermann, M. J. F. Gales, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. C. Woodland. The HTK Book, version 3.4. Cambridge University Engineering Department, Cambridge, UK, 2006.

[63] J.-M. Yun, Y. He, S. Elnikety, and S. Ren. Optimal aggregation policy for reducing tail latency of web search. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '15, pages 63–72, New York, NY, USA, 2015. ACM.

[64] M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.
