
Open access to the Proceedings of the 2018 USENIX Annual Technical Conference is sponsored by USENIX.

Metis: Robustly Tuning Tail Latencies of Cloud Systems

Zhao Lucis Li, USTC; Chieh-Jan Mike Liang, Microsoft Research; Wenjia He, USTC; Lianjie Zhu, Wenjun Dai, and Jin Jiang, Microsoft Bing Ads; Guangzhong Sun, USTC

https://www.usenix.org/conference/atc18/presentation/li-zhao

This paper is included in the Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC ’18).

July 11–13, 2018 • Boston, MA, USA

ISBN 978-1-939133-02-1


Metis: Robustly Tuning Tail Latencies of Cloud Systems

Zhao Lucis Li⋆‡  Chieh-Jan Mike Liang‡  Wenjia He⋆‡  Lianjie Zhu◦  Wenjun Dai◦  Jin Jiang◦  Guangzhong Sun⋆

⋆USTC  ‡Microsoft Research  ◦Microsoft Bing Ads

Abstract

Tuning configurations is essential for operating modern cloud systems, but the difficulty arises from the cloud system's diverse workloads, large system scale, and vast parameter space. Building on previous space exploration efforts of searching for the optimal system configuration, we argue that cloud systems introduce challenges to the robustness of auto-tuning. First, performance metrics such as tail latencies can be sensitive to non-trivial noises. Second, while treating target systems as a black box promotes applicability, it complicates the goal of balancing exploitation and exploration. To this end, Metis is an auto-tuning service used by several Microsoft services, and it implements customized Bayesian optimization to robustly improve auto-tuning: (1) diagnostic models to find potential data outliers for re-sampling, and (2) a mixture of acquisition functions to balance exploitation, exploration and re-sampling. This paper uses Bing Ads key-value store clusters as the running example – compared to weeks of manual tuning by human experts, production results show that Metis reduces the overall tuning time by 98.41%, while reducing the 99-percentile latency by another 3.43%.

1 Introduction

For many web-scale cloud systems, the main evaluation metric is the tail latency (e.g., 99 and 99.9-percentile latencies) of serving a request [9]. While tail latencies seem rare, the probability of a user request experiencing the tail latency in an end-to-end system can be high, especially since most web-scale applications employ a multi-stage architecture.

Chieh-Jan Mike Liang is the corresponding author. This work was done when Zhao Lucis Li and Wenjia He were interns at Microsoft Research.

Imagine a web-scale application with a five-stage processing pipeline: the probability that a request encounters at least one 99-percentile latency is ∼5%. Furthermore, tail latencies can be more than 10 times higher than the average latency [9].
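The ∼5% figure follows from assuming the five stages hit their 99-percentile latencies independently:

$$P(\text{at least one stage hits its tail}) = 1 - 0.99^{5} \approx 0.049 \approx 5\%.$$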

Recently, advances in machine learning have spawned strong interest in automating system customizations, where auto-tuning system configurations of parameters is a popular scenario [2, 3, 20]. Notably, cloud systems exhibit two motivating characteristics for auto-tuning. First, the overhead of operating cloud systems is increasingly large, due to the increasingly dynamic and variable system workloads, and the scale of modern cloud systems. In many read-intensive web applications such as news sites and advertisement networks, data queries are tied to end-user personal interests and current contexts, which can exhibit temporal dynamics. Furthermore, even within one cloud system, there can be several components that individually impose different data requirements and handle different types of metadata and user data. Second, as cloud systems become more complicated, both the design space and the parameter space grow significantly. The optimal system configuration is beyond what human operators can efficiently and effectively reason about, and the cost to naïvely benchmark all possible system configurations is exorbitant.

Bayesian optimization (BO) with Gaussian processes (GP) has emerged as a powerful black-box optimization framework for system customizations [2, 3, 19]. Mathematically, we model the configuration-vs-performance space by regressing over data points already collected, i.e., system configurations benchmarked. This regression model allows us to estimate the global optimum, or the best-performing system configuration. While collecting more data points can improve the regression model, benchmarking one system configuration of parameters can be time-consuming due to system warm-up and stabilization. Fortunately, BO offers a way to iteratively build up the training data by suggesting system configurations to benchmark, with the goal of maximally improving the regression model's accuracy.

1.1 Contributions

In this paper, we report the design, implementation, and deployment of Metis, an auto-tuning service used by several Microsoft services for robust system tuning. While Metis is inspired by previous efforts to leverage BO to train the GP regression model, we address the following factors to improve the robustness of optimizing system customizations.

First, since the GP regression model is trained with data points from benchmarking some system configurations, how well these data points capture the configuration-vs-performance space's global optimum determines the auto-tuning effectiveness. At the same time, we should avoid unnecessarily over-sampling the space, as system benchmarking can be resource-intensive and time-consuming. To this end, at each iteration, BO's strategy for picking the next system configuration to benchmark should balance exploitation (i.e., regions with a high probability of containing the optimum) and exploration (i.e., regions with high uncertainty of containing the optimum). Specifically, inadequate exploration reduces the chance of moving away from a local optimum, and inadequate exploitation impacts the efficiency of identifying the global optimum. In contrast to related efforts that rely on simple-to-implement strategies [2, 3], Metis strongly decouples exploitation and exploration, so that it can independently evaluate each action's expected improvement and anticipated regret.

Second, the theory behind BO and GP mostly assumes that the data collection is reasonably noise-free, or susceptible to only Gaussian noise. Unfortunately, many system performance metrics (e.g., tail latencies) are sensitive to non-Gaussian or unstructured noise sources in the wild, e.g., CPU scheduling and OS updates. In contrast to related efforts [2, 3], not only does Metis consider exploitation and exploration, but it also weighs the benefit of re-sampling existing data points. The key enabler is the diagnostic model that Metis creates to identify potential outliers.

To demonstrate the usefulness of Metis for real-world system black-box optimizations, this paper showcases one of our production deployments as the running example – optimizing tail latencies of Microsoft Bing Ads key-value (KV) stores. We present measurements and experiences from two KV clusters that handle ∼4.14 billion and ∼3.18 billion key-value queries per day, respectively. Compared to weeks of manual tuning by human experts, production results show that Metis reduces the overall tuning time by 98.41%, while reducing the 99-percentile latency by another 3.43%.

2 Background and Motivation

As a black-box optimization service, Metis tunes the configuration of several Microsoft services and networking infrastructure. This section describes one deployment as the running example in this paper – the Microsoft Bing Ads key-value (KV) store cluster, BingKV. Then, we motivate optimally customizing cloud systems.

2.1 Running Example: BingKV

Individual stages of the ads query processing pipeline, e.g., selection and ranking, host separate KV store clusters. Within a cluster, each KV server is responsible for one non-overlapping dataset partition. Therefore, each server can experience a different workload, as defined by the frequency distribution and the size distribution of the top queried key-value objects. Manually tuning each server is difficult, as a cluster can have thousands of servers.

The configuration space consists of parameters of the local caching mechanism, which decides what KV objects should be cached in the in-memory cache or served from the persistent data store. The main evaluation metric is the tail latency of serving key queries after the in-memory cache is full. We describe these parameters and their typical value ranges next.

First, both the recency and the frequency distribution are well-known foundations for cache eviction. BingKV supports NumCacheLevels (1 - 10) levels of LRU (Least Recently Used) or LFU (Least Frequently Used) based caches – a cached object can move up a level if it has been queried CachePromotionThreshold (1 - 1,000) times. Since cached objects at higher levels have been queried more than those at lower levels, the locality principle implies that they should not be easily evicted. Therefore, we allow NumInevictableLevels (0 - 9) levels to be specified as being inevictable.

Second, many cache designs incorporate a shadow buffer that stores the key only, instead of the entire key-value object. BingKV implements a shadow buffer with a capacity of ShadowCapacity (1 - 10) MB, to hold keys with an object size larger than AdmissionThreshold (1 - 1,000) bytes. Then, keys in the shadow buffer are moved to the cache only if they have been queried more than ShadowPromotionFreq (1 - 1,000) times.

Finally, CacheCapacity (1 - 10,000) specifies the cache capacity (excluding the shadow buffer) in MB, and the dataset partition on each server is divided into NumShards (1 - 64) shards.
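As a summary of the above, the BingKV tuning space can be written down as a parameter-to-range map. A minimal sketch follows; the parameter names and ranges are taken from this section, but the dictionary layout itself is illustrative rather than Metis's actual input format.

```python
# Illustrative only: BingKV caching parameters and their typical value
# ranges, as described in Section 2.1. The dict layout is hypothetical.
BINGKV_PARAMETER_SPACE = {
    "NumCacheLevels":          (1, 10),      # levels of LRU/LFU caches
    "CachePromotionThreshold": (1, 1_000),   # queries before moving up a level
    "NumInevictableLevels":    (0, 9),       # top levels exempt from eviction
    "ShadowCapacity":          (1, 10),      # shadow buffer size, MB
    "AdmissionThreshold":      (1, 1_000),   # object-size cutoff, bytes
    "ShadowPromotionFreq":     (1, 1_000),   # queries before leaving shadow buffer
    "CacheCapacity":           (1, 10_000),  # in-memory cache size, MB
    "NumShards":               (1, 64),      # dataset partition shards
}
```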

2.2 Impacts of Suboptimal Configurations


Figure 1: Motivating example – suboptimal configurations can impact the system performance. We use BingKV as the running example, and compare the tail latency increase with respect to the optimal system configuration. (Axes: day 1–12 vs. latency increase (%).)

To illustrate the impacts of suboptimal configurations, we empirically measure BingKV's 99-percentile latency under different parameter configurations, and quantify the tail latency increase with respect to the best-performing system configuration. Experiments use two 12-day pre-recorded production workload traces of BingKV, Prod_Trace1 and Prod_Trace2, and replay these workload traces to benchmark system configurations.

Suboptimal configurations can happen when system tuning does not consider the machine-specific setup. To illustrate, we vary the cache capacity between 32 MB and 512 MB – we find the best-performing configuration (e.g., AdmissionThreshold ranging from 1 to 1,000) for each capacity, and then we compare the latency of the best-performing configuration to 100 configurations uniformly sampled from the parameter space. For Prod_Trace1, results show that the average 99-percentile latency increase can range from 15.34% (for the 256-MB cache) to 22.74% (for the 512-MB cache), with more than a 34.14% increase in the worst case. For Prod_Trace2, the average 99-percentile latency increase can range from 10.21% (for the 32-MB cache) to 13.89% (for the 64-MB cache), with more than a 13% increase in the worst case.

Suboptimal configurations can also happen when system tuning does not consider temporal dynamics. For each day of Prod_Trace1, we compare the best-performing configuration to that of the other 11 days. Figure 1 shows that the average 99-percentile latency increase can be more than 5.58% (day 5).

2.3 Strawman Solutions

Manual tuning. Manual tuning can leverage administrators' knowledge and prior experience about the target system. Unfortunately, as modern systems get larger and more complicated, reasoning about the system behavior in a high-dimensional configuration space becomes increasingly difficult. Furthermore, real-world workloads can have dynamics across machines and time, and manual tuning does not scale. To illustrate this argument, our experience shows that manually tuning a subset of BingKV can take weeks.

Figure 2: Architecture of Metis service.

And, in some cases, it is not certain whether the manually tuned configuration is indeed the best-performing one.

Naïve Bayesian optimization. OtterTune [2] and CherryPick [3] demonstrated the potential of Bayesian optimization in adaptively finding the best-performing configuration for databases and data analytics systems, respectively. Both approaches adopt common BO selection strategies such as Expected Improvement (EI). While these strategies are easy to implement, they do not consider factors impacting the robustness of tuning system customizations: the balance of exploitation and exploration [14], and data outliers.

3 Our Approach

Given a workload of a system, Metis has the objective of predicting the best-performing configuration by selectively exploring the configuration-vs-performance space. Through system benchmarking, Metis can collect data points that describe system inputs (e.g., system workload and parameter values) and outputs (e.g., performance metrics of interest) for training its model. When Metis does not change its prediction of the best-performing configuration with additional data points, we say the model has converged. Maximizing the prediction accuracy and minimizing the model convergence time are the two evaluation metrics for Metis. The former relates to the auto-tuning correctness, and the latter relates to the service scalability.

This section first outlines the usage flow of Metis. Then, we formulate auto-tuning as an optimization problem, and highlight practical challenges to motivate the customizations proposed in subsequent sections.

3.1 Architectural Overview

Figure 2 shows the architecture of Metis, containing components for model training and system tuning.

Model training. The Progressive Sampling Engine solves the optimization problem of robustly constructing the predictive regression model, which models the configuration-vs-performance space. Each model corresponds to one performance metric, and the model dimensionality depends on the number of parameters.


Users start by providing the Progressive Sampling Engine their requirements (e.g., parameters and the metric of interest) and workload traces (e.g., production or synthesized traces). The engine bootstraps BO by picking a set of system configurations as bootstrapping trials, for the Trial Manager to benchmark. Benchmarking can happen in simulators or on real machines. Then, using the performance metrics collected as the training data points, the engine picks the subsequent system configuration of parameters to benchmark. Ideally, each iteration should improve the regression model's estimation of the space's global optimum. We formulate this iterative process as an optimization problem in the next subsection (c.f. §3.2).

When a stopping criterion is met, the Progressive Sampling Engine stops collecting more data points, and it outputs the GP regression model trained so far. Stopping criteria can include the training time budget, among others. As we show in §6, the system benchmarking time dominates training, rather than the model computation time.

System tuning. The Tuning Manager periodically receives status reports from individual servers of the target system. These status reports contain current workload characterizations and logged performance metrics. If performance degradation is detected (e.g., a significant increase in the tail latency), the Tuning Manager uses the workload characterization to select the nearest regression model. We use DNNs to classify workloads, which minimizes the burden of weighing workload features for classification.

3.2 Optimization Problem Formulation

Formally, for a workload w, we want to find the k-dimensional configuration c* (representing k system parameters) that has the highest probability of being the best-performing configuration, c_w. The overall problem objective is as follows:

$$c^{*} = \operatorname*{argmax}_{c \,\in\, \mathit{configs}} \; P(c = c_{w} \mid w)$$

Given that the k-dimensional parameter space can be too large for exhaustive searches, Metis opts for the regression model to predict with a limited amount of data points collected. While more training data will help reduce the regression uncertainty, the training objective should also consider the training overhead:

$$\text{minimize} \; \sum_{w \,\in\, \mathit{workloads}} \mathit{confidence\_interval}(c_{w}) \quad \text{subject to} \quad \sum T(c) \leq T_{\mathit{budget}}$$

where T(c) denotes the time necessary to sample a configuration c, and T_budget denotes the maximum amount of time allocated for model training.

Predictive regression model formulation. Regression allows Metis to model the expected configuration-vs-performance space, with data points already collected from benchmarking some configurations. The regression model captures the conditional distribution of a target performance metric given a system configuration. We pick the Gaussian process (GP) [13] as the model, which extends multivariate Gaussian distributions to infinite dimensionality, such that observations for an unsampled data point are assumed to follow a multivariate Gaussian distribution. In other words, assuming a stochastic function f where every input x has an output f(x), each f(x) is defined by a mean µ and a standard deviation σ of a Gaussian distribution.

GP exhibits a number of desirable properties. First, it does not assume a certain mathematical relationship between model inputs and outputs, e.g., a linear relationship. Second, for any x, GP can return the expected value of f(x) and the uncertainty (i.e., standard deviation).
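For reference, these two quantities are the mean and variance of the standard GP posterior [13]. With benchmarked configurations X, observations y, kernel k, K = k(X, X), and noise variance σ_n²:

$$f(x) \mid X, y \;\sim\; \mathcal{N}\big(\mu(x), \sigma^{2}(x)\big), \qquad \mu(x) = k(x, X)\,[K + \sigma_n^{2} I]^{-1} y, \qquad \sigma^{2}(x) = k(x, x) - k(x, X)\,[K + \sigma_n^{2} I]^{-1} k(X, x)$$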

Optimization framework. A proven approach to train the GP model is Bayesian optimization (BO) [17]. At each iteration, based on the current GP model, BO selects a system configuration to benchmark next to further train the GP model. The selected configuration is expected to maximally improve the accuracy of predicting the global optimum of the configuration-vs-performance space. The logic behind selecting the next trial is implemented by BO's acquisition function, and its design traditionally aims to balance exploitation (i.e., sampling regions with a high probability of containing the optimum) and exploration (i.e., sampling regions with high uncertainty). The model converges when BO believes that the GP model has sufficiently captured the configuration-vs-performance space's global optimum, or the best-performing system configuration in our case.

BO conceptually realizes a form of progressive sampling, and it is suitable for scenarios where collecting the system performance metrics of a single trial is resource-intensive or time-consuming (e.g., waiting for a system to warm up and stabilize).

3.3 Practical Challenges of Robustness

Running BO with GP requires the following practicalconsiderations for robust system tuning.

Sampling the configuration-vs-performance space without strong assumptions on the target system behavior. Given that system behavior can be difficult to reason about properly, Metis tries to learn the configuration-vs-performance space from selective sampling. In other words, based on system configurations already benchmarked, Metis selects the next system configuration to benchmark, which is expected to maximally improve the regression accuracy.


To this end, we believe that existing efforts leave the following two gaps to fill.

First, balancing exploitation and exploration is still a non-trivial problem. Many proposed acquisition functions evaluate the potential improvement of a trial [19], and the community has shown their limitations in balancing exploitation and exploration [4, 17]. Expected Improvement (EI) is a widely popular choice of acquisition function. It improves upon Maximum Probability of Improvement by estimating the best-case improvement with both the mean and the uncertainty around a sampling point. However, Ryzhov et al. [14] showed that EI allocates only O(log n) samples to regions that are expected to be suboptimal, where n is the total number of trials. In other words, n needs to be significantly large for EI to balance exploitation and exploration.

Second, bootstrapping trials refer to the set of sampling points that initialize BO. Since BO decides the next trial with the GP model trained on past trial data points, prior data samples influence BO's posterior expectations. The main requirement for selecting bootstrapping trials should be exploration. Random sampling is a simple approach to pick bootstrapping trials, but it requires a large amount of sampling points to ensure coverage [12]. This is not ideal for time-consuming trials, where some systems need time to stabilize (e.g., system cache warm-up).

Handling non-Gaussian data noise. Most theoretical work on BO with GP assumes the trial data collected are noise-free (i.e., perfectly reproducible experiments), or susceptible to only Gaussian noise [17]. Unfortunately, many system metrics such as tail latencies are sensitive to a wide range of noise sources in the real world, such as background daemons, local resource sharing, and networking variability. This is different from reproducible metrics such as classification accuracy. Since real-world data noise sources can be heterogeneous and non-trivial to model, the auto-tuning service should consider the benefits of removing potential outliers.

4 Improving Auto-Tuning Robustness

Given the challenges discussed in §3.3, we now present our customizations to Bayesian optimization to improve tuning robustness.

4.1 Bootstrapping Trials

To select sampling points to bootstrap Bayesian optimization for a given workload, Metis exercises Latin Hypercube Sampling (LHS) [12]. LHS is a type of stratified sampling. According to some assumed probability distribution, LHS divides the range of each of M parameters into I equally probable intervals, and randomly selects only one single data point from each interval. Since each interval of each parameter is selected only once, the number of bootstrapping trials chosen by LHS is exactly I, regardless of M. Furthermore, the maximum number of possible combinations for bootstrapping trials is only $(\prod_{i=0}^{I-1}(I-i))^{M-1} = (I!)^{M-1}$.

LHS offers several desirable properties for bootstrapping Bayesian optimization. First, it has been shown that, compared to random sampling, stratified sampling can reduce the sample size required to reach the same conclusion [12]. Second, compared to another well-known sampling technique, Monte Carlo, LHS allows one to obtain a stable output with fewer samples [7]. And, while quasi-Monte Carlo can be an alternative approach, its output is a low-discrepancy sequence which is not random, but uniformly deterministic [6]. Third, as LHS does not control the sampling of combinations of dimensions, the number of samples picked by LHS is agnostic to the dimensionality of the parameter space.

A well-known concern with LHS is that it may exhibit a higher memory consumption, in order to keep track of parameter intervals that have already been sampled. However, we note that only a few sampling points are necessary to bootstrap Metis, and our current running system sets I to 5.
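A minimal numpy sketch of the LHS step described above, assuming a uniform distribution over each parameter's range; the helper function and its parameters are illustrative, not Metis's code:

```python
import numpy as np

def latin_hypercube(n_intervals, bounds, seed=0):
    """Divide each parameter's range into n_intervals equally probable slices,
    draw one point per slice, then shuffle the slice order per dimension."""
    rng = np.random.default_rng(seed)
    n_dims = len(bounds)
    # One uniform draw inside each of the n_intervals slices of [0, 1).
    unit = (np.arange(n_intervals) + rng.random((n_dims, n_intervals))) / n_intervals
    for d in range(n_dims):          # decorrelate dimensions
        rng.shuffle(unit[d])
    lo = np.array([b[0] for b in bounds], dtype=float)
    hi = np.array([b[1] for b in bounds], dtype=float)
    return lo + unit.T * (hi - lo)   # shape: (n_intervals, n_dims)

# Example: I = 5 bootstrapping trials over two illustrative BingKV parameters.
bootstrap_trials = latin_hypercube(5, [(1, 10_000), (1, 1_000)])
```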

4.2 Customized Acquisition Function

After obtaining the bootstrapping trials with LHS, Metis then runs Bayesian optimization to iteratively train the Gaussian process model. At each iteration, BO outputs the system configuration that is expected to offer the most improvement to the GP model, in terms of predicting the best-performing system configuration. This output represents the system configuration that Metis should sample in the next iteration of training. To produce this output, Metis customizes the acquisition function to balance three goals: exploitation, exploration, and re-sampling.

Our customized acquisition function works as a mixture of three separate sub-acquisition functions (c.f. §4.2.1) corresponding to each goal. At each iteration of BO, based on the GP model constructed so far, individual sub-acquisition functions compute the next system configuration to sample to maximize their own goal. Then, the acquisition function evaluates these candidates in terms of the improvement of the expected best-performing system configuration, and selects the candidate with the highest gain to sample next (c.f. §4.2.2).

Compared to related efforts, our acquisition function exhibits two differences. First, by introducing re-sampling as a possible action, we allow Metis to re-sample a trial whose previous results might be susceptible to non-Gaussian noise.


Second, Metis decouples all three actions and runs separate acquisition functions for each action. This decoupling allows Metis to independently evaluate the potential information gain and regret aversion from taking each action in the next iteration.

4.2.1 Sub-acquisition functions

Metis decouples exploitation, exploration, and re-sampling into three separate sub-acquisition functions. Given past trials, each sub-acquisition function outputs a system configuration that maximizes its own goal. These system configurations become candidates for the acquisition function.

Exploration. This sub-acquisition function looks for the sampling point whose expected observation has the highest uncertainty. Formally, in the GP regression model, this sampling point exhibits the largest confidence interval.

Exploitation. This sub-acquisition function looks for the sampling point whose expected observation would most likely be the system optimum. While one simple approach is to consider the absolute observation of a sampling point as expected by the GP regression model, its accuracy can be impacted by GP's uncertainty at that sampling point. Therefore, Metis takes an approach inspired by TPE [5], and tries to estimate the probability of a sampling point giving a near-optimum observation.

The exploitation sub-acquisition function works as follows. It first separates the past trials into two groups: the first group has the best observations, and the second group has the rest. Then, we construct two Gaussian mixture models (GMMs) to describe each group. With these two GMMs, we can evaluate the likelihood of a system configuration, c, being in either group, i.e., P1(c) and P2(c). The ratio of these two likelihoods, P1(c)/P2(c), gives a score of how likely a sampling point is to be in the first group rather than the second.
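A minimal sketch of this step with scikit-learn's GaussianMixture; the split quantile and GMM component counts are illustrative assumptions, since the paper does not specify them:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def exploitation_candidate(configs, latencies, candidates, best_quantile=0.2):
    """Score candidates by how likely they belong to the group of
    best-performing past trials (TPE-style likelihood ratio)."""
    configs = np.asarray(configs, float)
    latencies = np.asarray(latencies, float)
    candidates = np.asarray(candidates, float)
    cutoff = np.quantile(latencies, best_quantile)
    best, rest = configs[latencies <= cutoff], configs[latencies > cutoff]
    # One GMM per group; the component counts here are arbitrary choices.
    g_best = GaussianMixture(n_components=min(2, len(best))).fit(best)
    g_rest = GaussianMixture(n_components=min(2, len(rest))).fit(rest)
    # score_samples returns log-likelihoods, so the ratio P1(c)/P2(c)
    # becomes a difference in log space.
    scores = g_best.score_samples(candidates) - g_rest.score_samples(candidates)
    return candidates[np.argmax(scores)]
```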

Re-sampling. This sub-acquisition function looks for outliers in past trials, and suggests trials that should be re-sampled (i.e., system configurations that should be re-benchmarked).

The re-sampling sub-acquisition function works by creating a diagnostic model. To examine a data point, the diagnostic model quantifies the difference between the measured value and the expected value. To do so, the diagnostic model is a GP regression model trained without the examined data point. Then, the diagnostic model can calculate the expected value and the 95% confidence interval. If the measured value falls outside the confidence interval, it is probable that the examined data point is an outlier.

Figure 3: Components of Metis service implementation.

The sub-acquisition function repeats this evaluation for all past trials, and it selects the trial that is farthest from the expected confidence interval.
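A sketch of the diagnostic model, under the assumptions that the 95% interval is taken as mean ± 1.96σ of a leave-one-out GP fit and that the kernel mirrors the one named in §5; the helper itself is illustrative, not Metis's code:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def resampling_candidate(configs, latencies):
    """Flag the past trial whose measurement deviates most from what a GP
    trained on all *other* trials would predict (leave-one-out diagnosis)."""
    X, y = np.asarray(configs, float), np.asarray(latencies, float)
    worst_idx, worst_dev = None, 0.0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        gp = GaussianProcessRegressor(kernel=Matern(nu=1.5) + WhiteKernel(),
                                      normalize_y=True).fit(X[mask], y[mask])
        mu, sigma = gp.predict(X[i:i + 1], return_std=True)
        # Distance outside the 95% confidence interval, in latency units.
        deviation = abs(y[i] - mu[0]) - 1.96 * sigma[0]
        if deviation > worst_dev:
            worst_idx, worst_dev = i, deviation
    return worst_idx  # None if every trial falls inside its interval
```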

4.2.2 Evaluation of Candidates

Taking the sub-acquisition function outputs as candidate configurations, our acquisition function computes their information gain with respect to how the prediction of the optimal configuration changes. Conceptually, for each candidate, the GP model bounds the expected observation with a confidence interval. The upper and lower bounds of this interval can help us estimate the potential information gain (if we were to actually select the corresponding candidate for benchmarking).

The acquisition function starts by predicting the currently best-performing system configuration, c_cur_best; this step searches for the sampling point with the lowest expected observation in the GP regression model. Then, to evaluate a candidate, c_candidate, we add its lower-bound confidence interval (i.e., best case) to the GP regression model, and predict the would-be best-performing system configuration, c_new_best. The difference between the expected observations of c_cur_best and c_new_best is the improvement (or 0 if the improvement is negative). Then, we repeat the process for the upper-bound confidence interval (i.e., worst case).

Finally, the acquisition function outputs the c_candidate with the highest sum of the improvements calculated with the lower-bound and the upper-bound confidence intervals.
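The following sketch makes this evaluation concrete, under assumptions the text leaves open: the interval is taken as mean ± 1.96σ, the hypothetical observation is appended and the GP refit, and "best" means the lowest predicted latency over a finite grid of configurations. It is a sketch of the idea, not Metis's actual implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

def pick_next_trial(X, y, candidates, grid):
    """Choose among sub-acquisition candidates by summing best-case and
    worst-case improvement to the predicted optimum (illustrative sketch)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    candidates, grid = np.asarray(candidates, float), np.asarray(grid, float)
    kernel = Matern(nu=1.5) + WhiteKernel()
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
    cur_best = gp.predict(grid).min()      # expected observation of c_cur_best

    def would_be_best(c, hypothetical_obs):
        # Refit with the hypothetical observation added, then re-predict the best.
        gp_h = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(
            np.vstack([X, c]), np.append(y, hypothetical_obs))
        return gp_h.predict(grid).min()

    gains = []
    for c in candidates:
        mu, sigma = gp.predict(c.reshape(1, -1), return_std=True)
        lower, upper = mu[0] - 1.96 * sigma[0], mu[0] + 1.96 * sigma[0]
        gain = max(0.0, cur_best - would_be_best(c, lower))   # best case
        gain += max(0.0, cur_best - would_be_best(c, upper))  # worst case
        gains.append(gain)
    return candidates[int(np.argmax(gains))]
```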

5 Implementation

Figure 3 illustrates the Metis system components implemented. Our current implementation is in Python, and this language choice allows us to use the popular Scikit-learn library for Gaussian process regression [16]. We choose the summation of Matern (3/2) and white noise as the covariance kernel [19].
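A minimal setup matching that kernel choice in scikit-learn; hyperparameters are left at library defaults here, as the deployment's actual settings are not given:

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern, WhiteKernel

# Matern with nu=1.5 is the Matern(3/2) kernel; WhiteKernel models observation
# noise. Their sum is the covariance kernel described above.
kernel = Matern(nu=1.5) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)  # normalize_y is an illustrative choice

# Usage: fit on benchmarked configurations, then query mean and uncertainty.
# gp.fit(benchmarked_configs, observed_tail_latencies)
# mean, std = gp.predict(unsampled_configs, return_std=True)
```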

Separation of logic and execution. Metis consists of a centralized web service and client stubs that sit on the servers hosting the target system. Network communications happen over TCP, and the message format is JSON. Conceptually, the client stub hooks up to the target system, and it abstracts away system-specific interfaces and execution from the web service.


While an alternative implementation is to place both the logic and the execution in a local service on each server, the separation provides several benefits. First, the computation-intensive logic of Metis does not contend for resources on production servers. Second, being centralized, Metis has a global view of all status reports from all servers. This allows Metis to continuously improve the predictive model with new data points.

Metis web service. The web service trains one predictive regression model for each workload trace. These traces can be recorded in the production environment, or synthesized with tools such as YCSB [8]. The Trial Manager replays each workload trace in simulators or on real servers, and each replay represents one trial benchmarking one selected system configuration. Then, the training dataset consists of system configurations and the corresponding performance benchmarks. In the case of BingKV, we built a simulator wrapping the production KV code binaries to run trials.

From the status reports uploaded by client stubs, if the Tuning Manager detects that a server is experiencing a performance degradation above some user-specified thresholds (e.g., tail latencies have increased by more than 10% in the last six hours), it computes a new configuration and sends a command to the server's client stub.

Metis client stubs. The client stub deals with system-specific interfaces, (1) to periodically upload recent system performance measurements and workload characterizations, and (2) to execute configuration change commands from the web service.

Most web-scale cloud systems already have an extensive logging mechanism for continuous performance monitoring, and Metis client stubs periodically retrieve relevant performance measurements from the same logging mechanism. While different systems have different workload characterization features, these features are either already available in the logging mechanism, or require additional code instrumentation. In the case of BingKV, our workload features are the size and query frequency of the top queried KV objects, and the incoming query traffic rate.
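For illustration only, a BingKV status report might carry fields like the following; the actual JSON schema is not published, and every field name here is hypothetical:

```python
# Hypothetical status-report payload (serialized as JSON over TCP).
status_report = {
    "server_id": "bingkv-0042",
    "window": "last_6_hours",
    "workload": {
        "top_object_sizes_bytes": [512, 480, 455],      # top queried KV objects
        "top_object_query_freqs": [10543, 9821, 9560],
        "incoming_qps": 38000,
    },
    "metrics": {
        "p99_latency_us": 2130,
        "cache_hit_rate": 0.87,
    },
}
```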

Upon receiving a configuration change command, Metis does not try to aggressively re-configure servers. Instead, it employs a guard time following a re-configuration to allow the target system to warm up and stabilize. Other than a time duration, this guard time can also be defined in terms of system state, e.g., cache occupancy.

Approach          Selection of Configs    Modeling     DO
Random            Random                  –            No
iTuned [10]       LHS + EI                BO w/ GP     No
CherryPick [3]    Random + EI             BO w/ GP     No
TPE [5]           Random + EI             GMM          No
Metis             LHS + Customized        BO w/ GP     Yes

Table 1: This table highlights differences among comparison baselines and Metis. The "DO" column shows whether data outlier removal is considered.

6 Evaluation

Our major results include: (1) compared to baselines, Metis picks better configurations 84.67% of the time in the presence of low data noise. (2) In the presence of data outliers, Metis has a 56.77% higher chance of picking a better configuration. (3) Metis has a faster convergence time in searching for the best-performing configuration.

6.1 Methodology

We use the Bing Ads KV store (c.f. §2) as the target system. Our testbed machines have two 2.1 GHz CPUs (with 16 cores), 16 GB RAM, and a 256 GB SSD. These machines host the target system binaries, and simulate configurable key-value query arrival rates (or queries per second) for the given workload trace. We log performance metrics with asynchronous I/Os to minimize any artifact introduced by logging.

Datasets and workload traces. We obtained two 12-day key-value query traces, Prod_Trace1 and Prod_Trace2, from two Bing Ads KV store clusters, BingKV_1 and BingKV_2. These two real-world traces exhibit the following characteristics – the average size of the top 500 frequently queried KV objects can differ by up to 9.49% from one week to another, and that of Prod_Trace1 is approximately two times larger than that of Prod_Trace2.

For more controlled experiments, we also synthesized three workload traces containing ∼0.5 million query requests. Our workload generation tool is based on the Yahoo! Cloud Serving Benchmark (YCSB) [8]. However, while YCSB considers the key-value frequency distribution, it does not consider the key-value size distribution. To fill this gap, given a list of unique keys from YCSB (sorted in descending order of frequency), we assign key-value sizes according to some distributions. For Synth_Trace1, Synth_Trace2 and Synth_Trace3, we fix their size distribution to be Zipf, and we generate keys following the frequency distributions of Zipf, normal, and linear, respectively. Doing so allows us to examine different frequency distributions.
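A sketch of the size-assignment step under stated assumptions: sizes are drawn from a truncated Zipf distribution and paired with the frequency-sorted YCSB keys. Whether sizes correlate with query rank, and the exact distribution parameters, are not given in the paper, so the helper and its parameters below are purely illustrative.

```python
import numpy as np

def assign_value_sizes(sorted_keys, zipf_a=2.0, max_size_bytes=65536, seed=0):
    """Draw one value size per YCSB key from a truncated Zipf distribution
    (illustrative parameters; sizes drawn independently of query rank)."""
    rng = np.random.default_rng(seed)
    sizes = np.minimum(rng.zipf(zipf_a, size=len(sorted_keys)), max_size_bytes)
    return dict(zip(sorted_keys, sizes))
```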

Comparison baselines.


Figure 4: Chance that each approach (Metis, iTuned, CherryPick, RandomSearch, TPE) picks the best configuration, for each workload trace. Each experiment allows each approach to run 25 trials, and we repeat the experiment 30 times. (Y-axis: winning rate (%).)

In addition to random search, we take state-of-the-art BO-driven approaches including iTuned [10], CherryPick [3], and TPE [5]. We configure these comparison baselines with their recommended settings, e.g., the Matern(5/2) kernel for CherryPick. At each iteration, these approaches output the configuration they expect to have the lowest tail latency. Table 1 lists their differences. For the ground truth, we brute-force all possible configurations for Prod_Trace1.

6.2 Effectiveness of Metis

This section quantifies the likelihood that the system configuration selected by Metis outperforms those of other approaches. This relates to the auto-tuning correctness. We also discuss factors including time budgets (i.e., the number of trials allowed) and data outliers.

Metis has a higher chance of picking system configurations that outperform those of other approaches. Assuming a fixed time budget, we allow each approach to run 25 trials. We then rank the approaches by the 99-percentile latency of their expectedly best-performing system configurations. We clean-install the machine to ensure minimal noise in measurements, and evaluate the case of data outliers later in this section. Experiments are repeated 30 times, to compensate for the randomness in some approaches. For random search, we take the best of all 25 trials for evaluation. We note that TPE requires at least 20 bootstrapping trials (from random sampling), and we allocate five bootstrapping trials to CherryPick, iTuned and Metis.

Figure 4 summarizes results when system benchmarks, or training data, are relatively noise-free. It shows that Metis has a higher chance of picking the winning configuration, i.e., the system configuration that outperforms those selected by the comparison baselines. Depending on the workload trace, Metis can outperform 48.97% (for Synth_Trace2) to 84.67% (for Prod_Trace2) of the time.

Furthermore, taking a closer look at the cases where Metis failed to win, we note that Metis's 99-percentile latency is within 1% of the winner's.

Figure 5: Chance that each approach picks the best configuration for Prod_Trace1, while changing the number of sampling points allowed (iterations 6–25; y-axis: winning rate (%)).

In addition, Metis observes the lowest average difference in 99-percentile latency from the optimum: this difference is 0.44% for Metis, which is 67.16% lower than CherryPick's.

Metis has a faster convergence time than other approaches. Regardless of the parameter tuning approach, the effectiveness in finding the optimal configuration should ideally increase with the number of data points already sampled, or configurations benchmarked. One natural question is how Metis converges to the optimal system configuration, as compared to other approaches. To this end, we try to answer two questions: (1) at each iteration, what is the likelihood that Metis selects the system configuration that outperforms those of other approaches? (2) How fast does Metis converge to the optimal system configuration?

Figure 5 shows that the chance for Metis to pick the winning configuration is high at all iterations. We repeat this experiment 30 times, due to the randomness in some approaches. This result is desirable for auto-tuning under a limited time budget, as system benchmarks can consume a long time. In addition, the figure shows that, as the number of sampled data points increases, Metis is able to better model the configuration-vs-performance space, which increases the likelihood of picking the winning configuration. Furthermore, looking at the configurations that CherryPick and iTuned pick, we observe that they lean towards exploitation. On the other hand, Metis independently evaluates the potential information gain from exploitation and exploration.

Following Figure 5, Figure 6 illustrates how each approach converges to the global optimum of the configuration-vs-performance space. The figure plots the average 99-percentile latency of the selected configurations over 30 repeated experiments. Compared to other approaches, Metis has a faster convergence time, as it is able to benchmark system configurations that iteratively improve its model of the configuration-vs-performance space. We note that the effectiveness of TPE significantly improves after 20 iterations, as its Gaussian mixture models require many training data points.


Figure 6: The search path visualizes the tail latency (µs) of system configurations that BO-driven approaches select at each iteration (iterations 6–25).

Figure 7: Chance that each approach picks the winning configuration for Prod_Trace1. The test machine has intermittent background activities that result in data outliers. The red line marks when Metis has sufficient data points for re-sampling. (X-axis: iterations of sampling 6–25; y-axis: winning rate (%).)

Finally, we also note that Figure 6 suggests that iTuned and CherryPick have a higher chance of exploiting a local optimum, and thus their acquisition functions would require more iterations to find the global optimum.

Metis maintains a high chance of picking the winning configuration in the presence of data outliers. Building on the discussion so far, we now demonstrate the robustness of Metis to data outliers. We repeat the previous experiment where we allow each approach to iteratively select 25 trials to benchmark. The machine is a production Bing Ads server, which runs intermittent background activities for software updates, monitoring, and periodic database updates. The resulting noise in the measurements cannot be reasonably modeled by a normal distribution. Unlike the comparison baselines, Metis proactively looks for potential data outliers after a sufficient amount of data points is collected, or 10 in our case.

Figure 7 illustrates that Metis has a higher chance of picking the optimal configuration in the presence of data outliers. The red line shows when Metis stops marking all sampled data points as potential data outliers (due to insufficient data collected). The effectiveness of Metis improves after the red line. The figure also shows the effectiveness of Metis without re-sampling; removing outliers improves Metis's winning rate by 33.33%.

Figure 8: Impact of training data size on the training time of GP models. (X-axis: training data size 0–500; y-axis: training time (sec).)

6.3 Overhead of Metis

It is known that benchmarking configurations can be time-consuming due to system warm-up and stabilization. Another potential source of overhead is the predictive regression model – both in training and in configuration selection. To quantify this overhead, we randomly generate the training data for individual experiment runs, and repeat each experiment 50 times.

Model training time. This is the time for Metis to fit a Gaussian process model over all sampled data points. We note that the community has shown that training the Gaussian process model can exhibit a complexity of O(N^3 + N^2 D) [21], where N and D are the number and the dimensionality of training data points, respectively. This suggests that the number of training data points largely determines the training time. Our goal is to understand whether the training complexity limits Metis's practicality in the real world. Figure 8 shows that the training time increases with the number of sampled data points (for a training data set with a dimensionality of 20, and parameter values ranging between 1 and 1,000). When the number of sampled data points increases to 500, the training time reaches 12.33 seconds. However, this training time is still practical in real-world deployments.

Configuration selection time. This is the time for Metis to predict the sampling point that is expected to maximize the given acquisition function. It has been shown that the time complexity of prediction with the Gaussian process model is O(N^2 + ND) [21], where N and D are the number and the dimensionality of training data points, respectively. When the training data size increases to 500, predicting the observation of an unsampled point takes ∼10.25 msec. This time suggests that the predictive model is feasible for real-world usage.

7 Case Study: BingKV

With a year of operational experience with Metis, we have gathered observations and lessons learned from adopting auto-tuning in Microsoft services. This section presents measurements and experiences from our running example, the Bing Ads KV store clusters (c.f. §2.1).


Figure 9: Performance comparison between the previous LRU-based data store and the Metis-tuned BingKV, based on 14-day hourly performance logs. (a) 99-percentile latency reduction (%). (b) Cache hit rate increase (%).

The KV store evolved from an LRU-based design, to expert-tuned BingKV, to Metis-tuned BingKV. We study measurements from two KV store clusters – BingKV_1 (∼700 servers handling ∼4.14 billion key queries per day) and BingKV_2 (∼1,700 servers handling ∼3.18 billion key queries per day). Each server in these clusters has an in-memory cache capacity of 512 MB, and an SSD as the persistent data store.

Metis-tuned BingKV reduces the 99-percentile query lookup latency by an average of 20.4%, as compared to LRU. We start by comparing the Metis-based KV store with the previous LRU-based KV store, under the Bing Ads BingKV_1 production workload. The comparison is based on 14-day hourly logs of the 99-percentile query lookup latency and the cache hit rate (CHR). Figure 9 presents results for BingKV_1 – Figure 9a shows that Metis helps to reduce the 99-percentile latency by 20.4% on average (with a standard deviation of 8.4), and Figure 9b shows a CHR improvement of 60.6% (with a standard deviation of 11.7). This translates to a 22.76% reduction in disk I/O reads.

Metis-tuned BingKV reports a 3.43% lower 99-percentile latency, as compared to expert-tuned BingKV. Our human expert is one of the lead programmers of the Bing Ads key-value store, with years of experience operating the system. As it is infeasible for the human expert to continuously tune the KV store, this subsection focuses on picking a single best-performing configuration. We note that the human expert did not have any time budget limitations, and manual tuning took about four weeks in a testing environment. The next subsection delves into the system tuning time comparison.

Perf. metric             Difference (BingKV_1)    Difference (BingKV_2)
99-percentile latency    -2.99%                   -3.43%
Cache hit rate           +2.43%                   +0.49%

Table 2: Performance comparison of the expert-tuned and Metis-tuned BingKV-based data stores, under 14-day Bing Ads workloads. Metis-tuned configurations outperform expert-tuned configurations, while reducing the tuning time from weeks to hours.

Figure 10: On average, Metis reduces the configuration tuning time by 98.41% for BingKV. For illustration, this figure shows a one-dimensional case (latency (ms) vs. AdmissionThreshold) of the sampling points selected by Metis (b), as compared to brute-force (a).

Table 2 shows that Metis-tuning outperforms human-tuning under two weeks of production traffic: (1) the Metis-tuned configuration reports a 2.99% and 3.43% lower 99-percentile latency for BingKV_1 and BingKV_2, respectively. (2) The Metis-tuned configuration reports a 2.43% and 0.49% higher CHR for BingKV_1 and BingKV_2, respectively.

Metis reduces the overall tuning time by 98.41%, as compared to manual tuning by human experts. While humans are typically guided by intuition based on their knowledge of the system, much of the manual tuning process mostly resembles random search. In fact, the problem is exacerbated as the dimensionality of the parameter space increases, especially when parameters have dependencies. For reference, in one scenario deployment, manual tuning took about 3 weeks, while Metis tuning took about 8 hours (including sampling, modeling and prediction).

To illustrate how predictive modeling reduces the tuning time, Figure 10 shows Gaussian process regression for a one-dimensional case.


GP regression is able to estimate the impact of AdmissionThreshold without sampling the entire parameter space. We note that the dotted lines in Figure 10b mark the 95% confidence interval for each point, which can guide both the space exploration and the stopping criteria. Finally, we note that GP regression operations are negligible as compared to system benchmarking – training typically takes ∼1.02 sec, and predicting the expected value of an unsampled data point takes ∼0.53 msec.

8 Related Work

Black-box parameter tuning services. Google Vizier [11] is an ongoing research project, and it supplies core capabilities to optimize hyper-parameters of machine learning models across Google. Similar to Google Vizier, SigOpt [18] is a startup that tunes hyper-parameters of ML models, but little technical detail has been disclosed. Google Slicer [1] is a sharding service that dynamically generates the optimal resource assignment, but its goals are not parameter space exploration and sampling. Like Metis, BestConfig [23] uses stratified sampling. However, it heavily leans towards exploitation, as it assumes a high probability of finding better-performing configurations around the currently best-performing one.

In contrast, Metis focuses on providing a robust auto-tuning algorithm for cloud systems, and addresses challenges that systems introduce to the optimization problem. Metis has been used to tune parameters of Microsoft services and networking infrastructure.

Parameter tuning with Bayesian optimization. Metis builds on previous efforts that demonstrate the potential of applying Bayesian optimization and Gaussian processes to auto-tuning. iTuned [10] uses Latin Hypercube Sampling (LHS) to sample the parameter space, and models the parameter space with Gaussian process models. OtterTune [2] uses a combination of supervised and unsupervised machine learning methods to reduce the parameter dimension, characterize observed workloads, and recommend configurations. CherryPick [3] follows a similar approach of BO and GP in selecting the best-performing cloud configuration for a given machine learning workload.

TPE [5] uses Bayesian optimization with a Gaussian mixture model, instead of a Gaussian process. It is suitable for cases where there are some dependencies among parameters. However, since TPE trains with a subset of the training data, it needs a larger amount of collected data to train effectively. Smart Hill-Climbing [22] improves on Latin Hypercube Sampling, but the GP-based approach has been shown to outperform it [10].

Metis introduces customizations to the framework of BO with GP. These include the diagnostic model to find potential data outliers for re-sampling, and a mixture of acquisition functions to balance exploitation, exploration and re-sampling.

9 Discussion

We now discuss limitations concerning the applicabilityof Metis to systems in general.

Support of different system parameter types. Some types of system parameters can be non-trivial to model with regression models. First, categorical parameters take on one of a fixed number of non-integer values, such as booleans. Since categorical parameters are not continuous in nature, it can be difficult to model the relationship among their possible values. While categorical parameters are out of scope for this paper, our current implementation conceptually treats each categorical value as a new target system. Second, some systems have multi-step parameters, where one single configuration requires the system to go through a specific sequence of value changes for one or more parameters. Metis does not currently support multi-step parameters.

Costs of changing system configurations. Applying configuration changes can incur costs for some systems. First, server reboots might be necessary after a configuration change, thus causing service interruptions. To handle this case, system administrators can decide to push a configuration change only if the new configuration is predicted to offer a certain level of performance improvement (e.g., a 10% latency reduction). Administrators can also bound the cost of reconfiguration, e.g., by performing reconfigurations gradually over time, or by bounding the parameter space exploration by the distance from the currently running configuration. Second, mis-predictions can result in system performance degradation. Fortunately, Gaussian processes offer two ways to gain insights regarding uncertainties – GP offers a confidence interval for each prediction, and a log-marginal likelihood score [15] to quantify the model's fitness with respect to the training dataset.

10 Conclusion

This paper reports the design, implementation, and deployment of Metis, an auto-tuning service used by several Microsoft services for robust system tuning. We demonstrate the effectiveness of Metis with controlled experiments and real-world system workloads. Furthermore, real-world deployments show that Metis can significantly reduce the system tuning time.


References

[1] Adya, A., Myers, D., Howell, J., Elson, J., Meek, C., Khemani, V., Fulger, S., Gu, P., Bhuvanagiri, L., Hunter, J., Peon, R., Kai, L., Shraer, A., Merchant, A., and Lev-Ari, K. Slicer: Auto-Sharding for Datacenter Applications. In OSDI (2016), USENIX.

[2] Aken, D. V., Pavlo, A., Gordon, G. J., and Zhang, B. Automatic Database Management System Tuning Through Large-scale Machine Learning. In SIGMOD (2017), ACM.

[3] Alipourfard, O., Liu, H. H., Chen, J., Venkataraman, S., Yu, M., and Zhang, M. CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics. In NSDI (2017), USENIX.

[4] Azimi, J. Bayesian Optimization (BO).

[5] Bergstra, J., Bardenet, R., Bengio, Y., and Kegl, B. Algorithms for Hyper-Parameter Optimization. In NIPS (2011).

[6] Caflisch, R. E. Monte Carlo and Quasi-Monte Carlo Methods. In Acta Numerica (1988).

[7] Chu, L., Cursi, E. S. D., Hami, A. E., and Eid, M. Reliability Based Optimization with Metaheuristic Algorithms and Latin Hypercube Sampling Based Surrogate Models. In Applied and Computational Mathematics (2015).

[8] Cooper, B. F., Silberstein, A., Tam, E., Ramakrishnan, R., and Sears, R. Benchmarking Cloud Serving Systems with YCSB. In SoCC (2010), ACM.

[9] Dean, J., and Barroso, L. A. The Tail at Scale. In Communications of the ACM (2013), ACM.

[10] Duan, S., Thummala, V., and Babu, S. Tuning Database Configuration Parameters with iTuned. In VLDB (2009), ACM.

[11] Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., and Sculley, D. Google Vizier: A Service for Black-Box Optimization. In KDD (2017), ACM.

[12] McKay, M. D., Beckman, R. J., and Conover, W. J. A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code. In American Statistical Association and American Society for Quality (2000).

[13] Rasmussen, C. E., and Williams, C. K. I. Gaussian Processes for Machine Learning. The MIT Press.

[14] Ryzhov, I. O. On the Convergence Rates of Expected Improvement Methods. In Operations Research (2014).

[15] Schirru, A., Pampuri, S., Nicolao, G. D., and McLoone, S. Efficient Marginal Likelihood Computation for Gaussian Process Regression. In arXiv:1110.6546 (2011).

[16] Scikit-learn. GaussianProcessRegressor. http://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html.

[17] Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. Taking the Human Out of the Loop: A Review of Bayesian Optimization. Proceedings of the IEEE (2016).

[18] SigOpt. SigOpt. http://sigopt.com/.

[19] Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian Optimization of Machine Learning Algorithms. In NIPS (2012).

[20] Venkataraman, S., Yang, Z., Franklin, M., Recht, B., and Stoica, I. Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. In NSDI (2016), USENIX.

[21] Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep Kernel Learning. In AISTATS (2016).

[22] Xi, B., Liu, Z., Raghavachari, M., Xia, C. H., and Zhang, L. A Smart Hill-Climbing Algorithm for Application Server Configuration. In WWW (2004).

[23] Zhu, Y., Liu, J., Guo, M., Bao, Y., Ma, W., Liu, Z., Song, K., and Yang, Y. BestConfig: Tapping the Performance Potential of Systems via Automatic Configuration Tuning. In SoCC (2017), ACM.
