Model-based Resource Allocation in the Public Cloud

Transcript of Model-based Resource Allocation in the Public Cloud

Page 1:

Neeraja J. Yadwadkar Postdoc, Stanford University

January 22nd, 2019

Model-based Resource Allocation in the Public Cloud

Page 2:

Traditional Resource Management Techniques

Task

Physical Servers

Virtual Servers

Private Cloud

Public Cloud

Model

Data

Actions

Data-Driven Models for Resource Management

We need to extract insights from this data to derive effective actions: Data-Driven Models

Data is only as important as the actions it enables!

Page 3:

Uncertainty

Cost of Training

Challenges: Data-Driven Models

Research Goal: Achieve faster and more predictable performance while reducing cost, by building data-driven models

Wrangler [SoCC 2014]: Modeling Uncertainty

PARIS [SoCC 2017]: Learning to Generalize from Benchmarks

Multi-Task Learning for Efficient Training [SDM 2015] [JMLR 2016]

Cloud-Hosted Systems

Distributed Systems

Machine Learning

Prev. talk

Prev. talk

Predictive Scheduling Resource Allocation

This talk

Page 4:

This talk

• PARIS: Selecting the Best VM across Multiple Public Clouds: A Data-Driven Performance Modeling Approach

• ClustQ: Online Covariate Clustering for Efficient Retraining and Data Exploration

Page 5:

Workload A

Deploying a workload to the Cloud…

m1.xlarge m1.large

m1.medium

m2.xlarge

m2.2xlarge

m2.4xlarge

c1.xlarge

c1.medium

What VM type should I use for my workload?

The answer is workload-specific and depends on cost and performance goals

A1  A2

A3  A4  A5

F2

F4

F8 D11v2

D12v2  D13v2

n1-standard-1

n1-standard-4

n1-highmem-2 n1-highcpu-8

f1-micro

Page 6:

How do we choose the best VM?

Rules of thumb?

#1: Smaller is cheaper

#2: Bigger is better

#3: Similar configurations imply similar performance

Page 7:

Smaller isn’t always cheaper

Example: A Video Encoding Task

[Figure: Runtime (seconds) and Total Cost (cents) for VM types listed in order of increasing hourly cost: m1.large, c3.xlarge, m3.xlarge, c3.2xlarge, m2.2xlarge, m2.4xlarge]

#1: Smaller is cheaper

Page 8:

Building Apache Giraph

Bigger isn’t always better

#2: Bigger is better

Page 9:

Similar configurations may not always imply similar performance

YCSB-benchmarks Workload A

#3: Similar configurations imply similar performance

Page 10:

To select the best VM, we desire a solution that is:

Useful: Enable informed cost-performance trade-off decisions

Cost Efficient

Accurate

Page 11:

Specify cost/performance goals

VMI VMII

Run user-workload task

… VMk

Run on all VM types? Trivial, but expensive!

Useful: Enable informed cost-performance trade-off decisions

Cost Efficient ✗

Accurate ✓

VM Types

Key Ingredient: Cost-Perf Trade-off Map

Page 12:

Attempting to learn:
• VM type behavior, and
• Workload behavior

However, learning them simultaneously makes it expensive…

Our Proposal: PARIS

VMI VMII

Run user-workload task

… VMk

Run on all VM types? Trivial, but expensive!

Cost Efficient ✗

Accurate ✓

VM Types

Page 13:

Attempting to learn:
• VM type behavior, and
• Workload behavior

However, learning them simultaneously makes it expensive…

Our Proposal: PARIS

VMI VMII

Run user-workload task

… VMk

Run on all VM types? Trivial, but expensive!

Cost Efficient ✗

Accurate ✓

VM Types

Key Insight: De-couple learning of VM types and workloads

Page 14:

Learn Workload behaviour

Learn VM Type behaviour

Attempting to learn:
• VM type behavior, and
• Workload behavior

However, learning them simultaneously makes it expensive…

Our Proposal: PARIS

Key Insight: De-couple learning of VM types and workloads

Page 15:

VM1

Extensive benchmarking to model relationship between VM types

Light-weight fingerprinting to model the relationship between user workloads and benchmark workloads

Our Proposal: PARIS

VM2 VM100…

Run Benchmark Workloads

Run User -Workload

Cost Efficient   Accurate   VM Types

Key Insight: De-couple learning of VM types and workloads

Page 16:

Our Proposal: PARIS

VM1 VM2 VM100…

Extensive VM Benchmarking (Offline)

Light-weight fingerprinting (Online)

Model

Fingerprint

User workload

Fingerprint Generator

Benchmark workloads

Profiler

Model-Builder

Reference VMs: VM1, VM2

Specify cost/performance goals

Profiled Data

Key Insight: De-couple learning of VM types and workloads

Cost-Perf Trade-off Map

Page 17:

PARIS’ Offline VM Benchmarking Phase

VM1 VM2 VM100…

Extensive VM Benchmarking (Offline)

Benchmark workloads

Profiler

Run benchmark workloads with diverse resource requirements on all the VM types

The profiler records performance using a range of metrics, along with the resources utilized (a sketch of one such sample follows the list below)

CPU: cpu_idle, cpu_system, cpu_user, CPU utilization, …

Network: bytes_in, bytes_out, …

Memory: mem_buffers, mem_cached, mem_free, mem_shared, …

Disk: swap_free, swap_total, disk_free, disk_total, I/O utilization, …

System-level: number of waiting, running, terminated, and blocked threads; average load; …

Page 18:

VM1 VM2 VM100…

Extensive VM Benchmarking (Offline)

Benchmark workloads

Profiler

PARIS’ Offline Benchmarking Phase

Profiled Data: utilization counters and observed performance on all VM types

For each VM type (VM1, VM2, VM3, …, VMn), the benchmark workloads (offline) yield one table:

Config (Allocated) | Utilized Resources | Perf. Metric
c1                 | r1                 | d1
c2                 | r2                 | d2
c3                 | r3                 | d3
…                  | …                  | …
cn                 | rn                 | dn

c1: allocated configuration [#vCPU, Mem (GiB), Storage, …]
r1: utilization counters on VM1 [CPU_seconds, PhysicalMem, VirtualMemory, BytesSent, …]
d1: observed performance

Page 19:

VM1 VM2 VM100…

Extensive VM Benchmarking (Offline)

Benchmark workloads

Profiler

Profiled Data

Utilization counters and observed performance on all VM types

Utilization counters and observed performance on reference VMs

Light-weight fingerprinting (Online)

Fingerprint

User workload

Fingerprint Generator

VM1 VM2

PARIS’ Online Fingerprinting Phase

User workload (Online), run on the reference VMs:

Config (Allocated) | Utilized Resources | Perf. Metric
c1                 | r1                 | d1
c2                 | r2                 | d2

Page 20:

Building PARIS' Data-Driven Models

Benchmark workloads (Offline), for each VM type (VM1, VM2, VM3, …, VMn):

Config (Allocated) | Utilized Resources | Perf. Metric
c1                 | r1                 | d1
c2                 | r2                 | d2
c3                 | r3                 | d3
…                  | …                  | …
cn                 | rn                 | dn

User workload (Online), on the reference VMs:

Config (Allocated) | Utilized Resources | Perf. Metric
c1                 | r1                 | d1
c2                 | r2                 | d2

Fingerprint: (c1, r1, d1, c2, r2, d2)

Learn:   g: {Benchmark Data} → Performance
Predict: g: {ck, fingerprint (c1, r1, d1, c2, r2, d2)} → predicted dk
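A rough sketch of the online fingerprinting step, assuming the user task is run once on each of two reference VM types and its counters plus observed performance are concatenated into one vector (the counter subset, reference types, and helper names are illustrative):

```python
import numpy as np

# Illustrative subset of counter keys; the real fingerprint uses the full profiler schema.
COUNTER_KEYS = ["cpu_user", "cpu_system", "mem_free", "bytes_in", "bytes_out", "disk_free"]

def fingerprint(task_runs: dict) -> np.ndarray:
    """Concatenate counters + observed performance from each reference-VM run.

    task_runs maps a reference VM type (e.g. "m3.large", "c4.xlarge") to
    {"counters": {...}, "perf": float}, collected by running the user task there.
    """
    parts = []
    for vm_type in sorted(task_runs):            # fixed order -> consistent feature layout
        run = task_runs[vm_type]
        parts.extend(run["counters"][k] for k in COUNTER_KEYS)
        parts.append(run["perf"])
    return np.asarray(parts, dtype=float)
```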

Page 21:

Building PARIS’ Data-Driven Models

Regression Trees and Random Forest

Linear models did not perform well on our datasets:
• Discontinuities in performance across resource configs and workload characteristics
• E.g., hitting a memory wall

We need techniques suitable for data with such discontinuities:

Learn (for each VM type, for each benchmark workload):
g: {c3, fingerprint (c1, r1, d1, c2, r2, d2)} → d3

Predict:
g: {ck, fingerprint (c1, r1, d1, c2, r2, d2)} → predicted dk
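A hedged sketch of how such per-VM-type random-forest predictors could be fit with off-the-shelf tools (the scikit-learn usage and the data layout are my assumptions, not the paper's exact implementation):

```python
from sklearn.ensemble import RandomForestRegressor

def train_paris_models(offline_data):
    """offline_data[vm_type] = (X, y), where each row of X is a benchmark task's
    fingerprint (reference-VM counters + observed performance) and y is that
    task's observed performance on vm_type."""
    models = {}
    for vm_type, (X, y) in offline_data.items():
        rf = RandomForestRegressor(n_estimators=100, random_state=0)
        rf.fit(X, y)
        models[vm_type] = rf
    return models

def predict_all(models, user_fp):
    """Predict the user workload's performance metric on every candidate VM type."""
    return {vm: float(m.predict([user_fp])[0]) for vm, m in models.items()}
```

At selection time, predict_all yields one predicted value per candidate VM type, which is what feeds the cost-performance trade-off map.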

Page 22:

How accurate are PARIS’ predictions?

Mean Latency Prediction

[Excerpt from: Selecting the Best VM across Multiple Public Clouds: PARIS. SoCC '17, September 24–27, 2017, Santa Clara, CA, USA]

Workload                                               Number of tasks   Time (hours)
Cloud-hosted compression (Benchmark set)               740               112
Cloud-hosted video encoding (Query set)                12983             433
Serving-style YCSB workloads D, B, A (Benchmark set)   1830              2
Serving-style new YCSB workloads (Query set)           62494             436

Table 1: Details of the workloads used and dataset collected for PARIS' offline (benchmark) and online (query) phases.

Workload   Operations                          Example Application
D          Read latest: 95/5 reads/inserts     Status updates
B          Read mostly: 95/5 reads/writes      Photo tagging
A          Update heavy: 50/50 reads/writes    Recording user actions

Table 2: Serving benchmark workloads we used from YCSB. We did not use the read-only Workload C, as our benchmark set covers read-mostly and read-latest workloads.

Serving-style workloads: We used four common cloud serving datastores: Aerospike, MongoDB, Redis, and Cassandra. These systems provide read and write access to the data, for tasks like serving a web page or querying a database. For querying these systems, we used multiple workloads from the YCSB framework [25]. We used the core workloads [11], which represent different mixes of read/write operations, request distributions, and data sizes. Table 2 shows the benchmark serving workloads we used in the offline phase of PARIS. For testing PARIS' models, we implemented new realistic serving workloads by varying the read/write/scan/insert proportions and request distribution, for a larger number of operations than the benchmark workloads [10].

Dataset details: Table 1 shows the number of tasks executed in the offline phase (benchmark set) and the corresponding amount of time spent. Also shown are the workloads and the number of query tasks used for online evaluation (query set).

Metrics for evaluating model predictions: We use the same error metrics for our predictions of different performance metrics. We measured actual performance by running a task on the different VM types as ground truth, and computed the percentage RMSE (Root Mean Squared Error), relative to the actual performance:

$$\%\,\text{Relative RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{p_i - a_i}{a_i}\right)^2} \times 100$$

where $N$ is the number of query tasks, and $p_i$ and $a_i$ are the predicted and actual performance of the task respectively, in terms of the user-specified metric. We want the % Relative RMSE to be as low as possible.

RMSE is a standard metric in regression, but is scale-dependent: an RMSE of 10 ms in runtime prediction is very bad if the true runtime is 1 ms, but might be acceptable if the true runtime is 1000 ms. Expressing the error as a percentage of the actual value mitigates this issue.
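For concreteness, the error metric above can be computed as follows (a direct transcription of the formula, not code from the paper):

```python
import numpy as np

def relative_rmse_percent(predicted, actual):
    """% Relative RMSE = sqrt(mean(((p_i - a_i) / a_i)^2)) * 100."""
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(((p - a) / a) ** 2)) * 100.0)

# e.g. relative_rmse_percent([110, 95], [100, 100]) -> about 7.9
```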

6.3 Prediction accuracy of PARIS

We first evaluate PARIS' prediction accuracy by comparing PARIS' predictions to the actual performance we obtained as ground truth by exhaustively running the same user-provided task on all VM types. We evaluated PARIS on both AWS and Azure for (a) video encoding tasks using runtime as the target performance metric, and (b) serving-type OLTP workloads using latency and throughput as the performance metrics.

[Figure 8: Prediction error (% Relative RMSE) of Baseline1, Baseline2, and PARIS on AWS and Azure. Panels: (a) mean runtime, (b) p90 runtime, (c) mean latency, (d) p90 latency, (e) mean throughput, (f) p90 throughput. a, b: Runtime prediction for video encoding workload tasks; c-f: Latency and throughput prediction for serving-style, latency- and throughput-sensitive OLTP workloads. The error bars show the standard deviation across different combinations of reference VMs used.]

Overall Prediction Error: Figure 8 compares PARIS' predictions to those from Baseline1 and Baseline2 for the mean and 90th percentile runtime, latency, and throughput. Results are averaged across different choices of reference VMs, with standard deviations shown as error bars.

PARIS reduces errors by a factor of 2 compared to Baseline1, and by a factor of 4 compared to Baseline2. Note that the cost of all three approaches is the same, corresponding to running the user task on a few reference VMs. This large reduction is because the nonlinear effects of resource availability on performance (such as hitting a memory wall) cannot be captured by linear interpolation (Baseline2) or averaging (Baseline1).

To better understand why Baseline2 gets such a high error for some VM types, we looked at how predictions by Baseline2 varied with the different resources of the target VMs (number of CPUs, memory, disk). In one case, when using m3.large and c4.2xlarge as our…

90th Percentile Latency Prediction

PARIS reduces errors by a factor of 4 compared to Baseline2

Page 23:

How robust are PARIS’ predictions?

Is this Accuracy good enough?

PARIS maintains accuracy irrespective of
• The choice and number of reference VM types
• The set of benchmark workloads
• The choice of regressor and hyperparameters

- Predicting mean and p90
- Other metrics, such as latency and throughput
- Other baselines

More in the paper

Page 24:

The Cost-Performance Trade-off

[Figure: Cost-performance trade-off map. For each VM type: predicted latency (in seconds) for a user-specified representative user-workload task, alongside the ground-truth distribution of actual latencies observed for new query tasks of the same user workload, and the estimated total cost (in cents) for the corresponding task.]

Users can define policies for selecting a VM based on this trade-off.

A sample policy reduced the cost by about 45%!
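The 45% figure above comes from the talk; the policy below is only an illustrative sketch of the kind of rule a user could define on top of the trade-off map (the latency goal and input dictionaries are hypothetical):

```python
def choose_vm(predicted_p90_latency, cost_per_task, latency_goal_s=30.0):
    """predicted_p90_latency, cost_per_task: dicts keyed by VM type.
    Return the cheapest VM type meeting the latency goal, else the fastest one."""
    feasible = [vm for vm, lat in predicted_p90_latency.items() if lat <= latency_goal_s]
    if feasible:
        return min(feasible, key=lambda vm: cost_per_task[vm])
    return min(predicted_p90_latency, key=predicted_p90_latency.get)
```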

Page 25:

PARIS: Conclusion

PARIS: a system that allows users to choose the right VM type for meeting their performance goals and cost constraints through accurate and economical performance estimation

Key insight: PARIS decouples the characterization of VM types and workloads

Accurate and robust performance prediction that leads to cost savings for users:
- Across cloud providers
- Across different workloads: batch and serving workloads
- Multiple metrics of interest: runtime, latency, throughput, and their p90 values

Page 26:

This talk

• PARIS: Selecting the Best VM across Multiple Public Clouds: A Data-Driven Performance Modeling Approach

• ClustQ: Online Covariate Clustering for Efficient Retraining and Data Exploration

Page 27:

Systems use models…


Scheduling…
Predicting task execution times…
Resource allocation/re-allocation…
Understanding application behaviour…

Page 28:

Systems use models…

But… Can we build the models once, deploy them, and forget about them?

Scheduling…
Predicting task execution times…
Resource allocation/re-allocation…
Understanding application behaviour…

Page 29:

Models: Deploy and forget?

Time

Examples: Query execution times on
- a hosted service that keeps getting patched
- an increased size of a database…
- overloaded servers…

Domain Shift

Page 30:

Models: Deploy and forget?

Nth-Time-window

Examples:
- Online shopping behavior of customers
- Change of a user's interest while following an online news stream

+ (N+1)st-Time-window

Concept Drift

Page 31:

Models: Deploy and forget?

Nth-Time-window + (N+1)st-Time-window

Concept Drift

Time

Domain Shift

Page 32:

Data → Model → Prediction/Action

Updating the Models: Feedback loop!

…we collect and potentially adversely affect our straggler prediction models. For example, if in the data collection phase, we use a sufficiently intelligent scheduler that eliminates all stragglers, we will be unable to learn which configurations result in stragglers and the resulting model will perform poorly. Alternatively, if the scheduler manages to prevent any stragglers due to memory contention, for instance, such stragglers will also be absent from the data we collect and will be an error mode for the model we build.

To address this bias, one might consider disabling intelligent scheduling altogether during the data collection process, thereby assigning tasks randomly. The resulting execution would likely contain substantially more stragglers, but would also result in costly, poor cluster utilization. Even worse, because many nodes would be in unlikely overloaded states with substantial and unrealistic contention, we might introduce new bias or spurious correlations not present in a standard scheduler-managed setting. Another consideration to keep in mind is resource usage.

As described above, our primary goal in building these straggler-aware schedulers is to reduce the overall job completion time. If we use a naive scheduler to collect data while we run real tasks, this scheduler will likely place tasks poorly, resulting in many costly stragglers, and thus increase our overall costs. We would like to deploy our intelligent scheduler as soon as possible. Moreover, our goal is to optimize the job completion times both across the data collection phase and the model deployment phase.

In summary, the problem we are interested in is to figure out how we can deploy model-aware scheduling while at the same time collecting data required to build better models, without sacrificing reduction in resource consumption.

These considerations naturally parallel the classic multi-armed bandits setting [22] in theoretical machine learning, where an agent has to choose one out of k actions, and it has to learn what the best action is while at the same time minimizing the total cost of exploring different actions. While the theoretical machine learning community has made substantial progress in algorithm development and analysis for the bandits setting, this research has not made its way into the very real problems of model-based schedulers for straggler avoidance (and any other system deploying machine learning models in real-life systems settings). In the following sections we take one such model-based scheduler, demonstrate that sample bias is a very real problem, and explore how adapting simple strategies inspired by the bandits framework leads to substantial gains in end-to-end performance.

3. WRANGLER

Wrangler predicts stragglers based on cluster resource usage counters and then uses these predictions to inform scheduling decisions. Figure 1 describes the architecture of Wrangler, which consists of two main components: the model-builder and the predictive scheduler. We first describe how the model builder learns to predict straggler behavior, and then detail how these predictions are incorporated into the predictive scheduler.

[Figure 1: Architecture of Wrangler: the Model-Builder learns to predict straggler-causing situations and informs the Predictive Scheduler about them, with the aim of avoiding stragglers. Wrangler's architecture employs a feedback loop for collecting new data for retraining the straggler prediction models.]

3.1 Features and labels

To predict whether scheduling a task at a particular node will lead to straggler behavior, Wrangler uses the resource usage counters of the underlying node. It collects these resource usage counters just before the task is launched on the node; thus they represent the state of the node at the time of start of execution of the task.

The resource usage counters we collect are based on the conclusions of prior work on stragglers. Dean and Ghemawat [13] suggest that stragglers could arise as a result of contention for various system resources (e.g., CPU, memory, local disk, network bandwidth). Zaharia et al. [26] further found that stragglers often result from faulty hardware and system misconfiguration. Finally, Ananthanarayanan et al. [5] report that the dynamically changing resource contention patterns on an underlying node could give rise to stragglers. Based on these findings, we collected the performance counters for CPU, memory, disk, network, and other operating-system-level statistics describing the degree of concurrency before launching a task on a node. The counters we collected span multiple broad categories as follows:

1. CPU utilization: CPU idle time, system and user time, and speed of the CPU, etc.

2. Network utilization: Number of bytes sent and received, statistics of remote reads and writes, statistics of RPCs, etc.

3. Disk utilization: The local read and write statistics from the datanodes, amount of free space, etc.

4. Memory utilization: Amount of virtual and physical memory available, amount of buffer space, cache space, and shared memory space available, etc.

5. System-level features: Number of threads in different states (waiting, running, terminated, blocked, etc.), memory statistics at the system level.

In total, we collect 107 distinct features characterizing the state of the machine.

To simplify notation, we index the execution of a particular task by $i$ and define $S_n$ as the set of tasks executed on node $n$. Before executing task $i \in S_n$ on node $n$, we collect the resource usage counters described above on node $n$ to form the feature vector $x_i \in \mathbb{R}^{107}$. We rescale each feature described above by…

Profiling, Collaborative Filtering based Resource Selection and Allocation, Execution

Quasar, ASPLOS'14    Wrangler, SoCC'14

Page 33:

Data → Model → Prediction/Action

Updating the Models: Feedback loop!

But… Is that enough? What can go wrong?


Page 34:

Data → Model → Prediction/Action

Sample Bias due to Feedback loop!

Influence system’s decisions

Biased

Two types of biases:
• Label imbalance
• Bias in data distribution

Page 35:

Data → Model → Prediction/Action

Sample Bias due to Feedback loop!

Influence system’s decisions

Biased


Two questions:

Q. I: When to update models?

Q. II: How to update the models efficiently?

Page 36:

Ways to counter bias

Explore vs. Exploit trade-off

This setting leads to an explore-exploit trade-off. Because each time step only gives feedback about one action (the action taken), there is an incentive to try and explore multiple actions to learn the best one. However, taking a suboptimal action reduces the total reward, and so there is a conflicting incentive to keep taking the best action according to current estimates. Basic strategies to address this trade-off are based on a combination of taking the action with the highest expected reward most of the time and occasionally exploring a randomly selected action. We consider four variants:

1. ε-greedy: The best action is chosen with probability 1 − ε, and a random action chosen uniformly at random is taken with probability ε.

2. ε-first: For the first few time steps, actions are taken uniformly at random. This is followed by a pure exploitation phase.

3. ε-decreasing: Similar to ε-greedy, except that ε is decreased over time.

4. UCB (Upper confidence bounds): For every action $a$, the current expected value of the action $\mu(a)$ and the current uncertainty $\sigma(a)$ are maintained. At every time step, the action taken is $\arg\max_a(\mu(a) + \beta\sigma(a))$, where $\beta$ is a hyperparameter. Thus, this strategy looks for actions that are either high value or highly uncertain. This allows the model to quickly get feedback on actions it is least certain of.

We can draw a parallel to our setting as follows. Our agent is the model-based scheduler, and each time step corresponds to when it must assign a task to a node. The contextual vector is the resource usage counters of the node. The set of actions that our model-based scheduler can take is either to allow the scheduling assignment to go through, or to delay the scheduling (if straggler behavior is predicted).

However, our actions are not symmetric in the information they provide. The "delay" action provides no feedback at all. As such, it doesn't make sense to "explore" by taking this action, because taking this action amounts to essentially discarding the data point. Thus, an exploration phase only makes sense when straggler behavior is predicted and so the best action is to delay, in which case to "explore" means to simply go ahead with the scheduling.

A second difference arises because our setting is not online. Updating models is expensive and so we cannot afford to train at every time step. Instead, we collect data for several time steps and then retrain in the background.

A third, and rather subtle, difference is the fact that in our setting, prior scheduling decisions can in fact affect future data points x owing to overloading of the cluster, for instance, while in contextual bandits, the context vectors x seen at each time step do not depend on previously taken actions. While we do not explicitly model this dependence on prior decisions, this difference means that the policy we follow also impacts the feature vectors x we see and record.

Finally, instead of starting with a completely random initial model, we start with a basic model that is trained using data collected from a naive scheduler (i.e., one that does not have a straggler prediction model) in an "offline" phase. In the real system, this initial offline exploration phase will only happen once, and any subsequent retraining of the model will use data collected while the model-based scheduler is running. This initial offline phase lasts for a fixed time period $T_{\text{offline}}$.

After this initial explore phase, we train our straggler prediction models. We then collect more data and use it to retrain models. This forms the second phase and lasts another period $T_{\text{online}}$. We consider four strategies for the second phase:

1. No-explore: This is the baseline strategy where we run tasks with Wrangler's predictive scheduler as-is and simultaneously collect data.

2. ε-greedy: We run tasks with our predictive scheduler. Whenever our model predicts with a high confidence that running a task on a node will cause straggler behavior, with probability ε, we ignore this prediction and launch the task on the node anyway. We record the node's resource usage counters at the time the task is launched, and when the task is finished, we record if it became a straggler. ε is kept fixed. At the end of this phase, the collected data is pooled together with data from the initial explore phase and is used to retrain the models.

3. ε-decreasing: We divide this phase into four different parts. In each part we run tasks with our predictive scheduler, with straggler predictions getting ignored with probability ε as above. This epsilon is kept high (=0.9) in the first quarter, and decreased by 0.2 in each quarter. At the end of each quarter, the models are retrained with all data collected till that point. This strategy can allow us to smoothly interpolate between the initial exploration phase and the final exploitation phase, and thus between the initial training distribution and the final testing distribution, with each intermediate model being trained on the mistakes of its predecessor.

4. Confidence-based exploration: Our model predicts a confidence or probability p that the task will be a straggler, and Wrangler's original scheduler schedules the task if p < t for some threshold t, and otherwise delays the task. We modify this scheduler so that if p > t, we nevertheless go ahead and schedule the task with probability (1 − p)/(1 − t). This probability starts at 1 when p is at the threshold t, and gradually drops to 0 as our model gets more confident that the task will be a straggler. This has two advantages. One, we are unlikely to schedule if we are very certain that the task will be a straggler, which means we are less at risk of producing large delays in our exploration. Two, we are sampling more data points from the region in which the model is less certain. These are thus examples on which the classifier is less confident.

At the end of this phase the models are frozen and then deployed. Figure 3 shows this setup.

6. EXPERIMENTAL RESULTS

Next, we experiment with these proposed strategies using Wrangler, and see if they provide any gains in the form of improved scheduling.
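As a rough illustration of strategies 2 and 4 above, the per-task decision a model-based scheduler could apply might look like this sketch (the function name and scheduler hook are assumptions; the 0.7 default mirrors the threshold mentioned in the ClustQ excerpt later in this talk):

```python
import random

def should_schedule(p_straggler: float, t: float = 0.7, epsilon: float = 0.0) -> bool:
    """Decide whether to launch a task on a node despite the model's prediction.

    p_straggler: model's predicted probability that the task becomes a straggler.
    Below the threshold t, always schedule. Above it, explore either uniformly
    (epsilon-greedy) or with probability (1 - p) / (1 - t) (confidence-based)."""
    if p_straggler < t:
        return True                       # model says it's safe: exploit
    if epsilon > 0.0:
        return random.random() < epsilon  # epsilon-greedy exploration
    explore_prob = (1.0 - p_straggler) / (1.0 - t)
    return random.random() < explore_prob  # confidence-based exploration
```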


Page 37:

Improved accuracy with Exploration


ClustQ: Efficient retraining for models deployed in systems

ues on the JobScheduling dataset. These TP and TNrates are shown in the top half of Figure 1a. Then for thoseparticular configurations, we plot the corresponding num-ber of labels queried, in the bottom half of Figure 1a. ForJobScheduling dataset, ClustQ clearly outperformsthe baseline strategies with (TP, TN) rates of (85%, 83%)while querying merely 36 data points. The no-explore strat-egy where we do not query for labels at all once the initialmodel is built, has poor accuracies, and shows a skew in theTP and TN rates. ✏-greedy achieves slightly lower TP andTN rates compared to ClustQ, while querying 17x extra la-bels. ✏-decreasing and the confidence-based strategies queryabout 41% and 64% fewer labels than ✏-greedy respectively,but have a comparatively reduced TN rate.

Now we focus on the hyperparameter sets for these strate-gies that explored the least number of data points. Figure 1bshows the corresponding TP, TN rates for all the strategiesin this setting. ClustQ queries significantly fewer labelswhile achieving better TP-TN rates than the other strategiessimilar to the earlier discussion. On JobScheduling

dataset, ✏-greedy with ✏=0.1 queries as few as 69 labels, butachieves much reduced TN rate, and introduces a high skewin the overall TP-TN rate.

In summary, for JobScheduling dataset, ClustQ

achieves better values of TP and TN rates compared to theother strategies while querying significantly fewer labels.ClustQ triggers asking for the label of a data point basedon how sparsely explored the region in the vicinity of thenew point is. In the JobScheduling dataset, the featurevectors (comprised of resource utilization statistics) indicatethe state of the machine. Similar feature vectors seen inthe past are likely to estimate a straggler causing behav-ior. Also, there could be multiple sources of feature vectorsbased on the changes in the workload submission pattern, orthe execution environment on the machines changing overtime. ClustQ allows us to maintain multiple models thatare built using data belonging to different clusters formedover time. This allows us to reduce the number of datapoints queried for labels to only those that indicate a shiftthat is unseen previously. However, the other strategies, be-ing unaware of such changes, query for labels uniformly atrandom, causing a high number of labels queried. Note thatin this instance, asking for a label corresponds to creating astraggler which incurs cost.

No-explore and ✏-greedy: Figure 2 shows the % TP and %TN values for JobScheduling dataset using no-exploreand ✏-greedy strategies. We used 6 different values of ✏,ranging from 0 to 0.9. We note that the no-explore baselinehas a low TN rate. From Figure 2, we see that without anyquery for labels, the models tend to get biased toward oneof the classes. As we query some labels over time, we seethat the classifier gets more balanced and achieves improved

Figure 2. Number of labels queried vs Accuracy (%TP and %TN)for the six different settings of ✏ for the ✏-greedy strategy onJobScheduling dataset. ✏ was set to the following 6 values: 0,0.1, 0.3, 0.5, 0.7, and 0.9

prediction accuracies.

✏-decreasing: For JobScheduling dataset, ✏-decreasing achieves (%TP, %TN) values of (83.29%,78.01%) on the validation set while querying for 368 labels(=44.3% of the total test data points). These results suggestthat gradually decreasing querying of labels with frequentretraining is a good strategy, probably because over timethe model learns to predict on harder and harder examplesaccurately. However, we also note that these models achievehigh prediction accuracies, at the cost of large number oflabels queried.

Confidence-based: On the JobScheduling dataset, the confidence-based query strategy improves the (TP, TN) rates to (84.7%, 65.92%), compared to the no-explore strategy's (89.44%, 49.53%). However, ε-greedy with ε = 0.3 achieves a better TN rate while querying slightly fewer data points (205, see Figure 2). This suggests that the confidence-based strategy needs its threshold hyperparameter tuned to perform well. In our experiments, we chose the threshold of 0.7 that was shown to work best for this workload in Wrangler (Yadwadkar et al., 2014).

6.4 Problem 2: Predicting performance in the cloud

The top half of Figure 3a shows the best R² score that each strategy achieves on the CloudPerf dataset. The bottom half shows the corresponding number of queried labels. We note a trend similar to the results on the JobScheduling dataset shown in Figure 1: ClustQ improves its prediction accuracy over time as new data becomes available, while querying as few labels as feasible. Figure 3b shows the R² values for each strategy for the hyperparameter configurations that result in the least number of labels queried. We see that ClustQ achieves better prediction accuracy, R² = 0.87.

Page 38: Model-based Resource Allocation in the Public Cloud

Ways to counter bias

Explore vs Exploit trade-off

We need:
• Smarter exploration
• Generalizable (w.r.t. type of models)

This setting leads to an explore-exploit trade-off. Because each time step only gives feedback about one action (the action taken), there is an incentive to explore multiple actions to learn which one is best. However, taking a suboptimal action reduces the total reward, so there is a conflicting incentive to keep taking the best action according to the current estimates. Basic strategies address this trade-off by taking the action with the highest expected reward most of the time and occasionally exploring a randomly selected action. We consider four variants (a minimal sketch of these selection rules follows the list):

1. ε-greedy: The best action is chosen with probability 1 − ε, and an action chosen uniformly at random is taken with probability ε.

2. ε-first: For the first few time steps, actions are taken uniformly at random. This is followed by a pure exploitation phase.

3. ε-decreasing: Similar to ε-greedy, except that ε is decreased over time.

4. UCB (Upper confidence bounds): For every action a, the current expected value μ(a) and the current uncertainty σ(a) are maintained. At every time step, the action taken is argmax_a (μ(a) + β σ(a)), where β is a hyperparameter. Thus, this strategy looks for actions that are either high value or highly uncertain, which lets the model quickly get feedback on the actions it is least certain of.
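To make these selection rules concrete, here is a minimal, illustrative sketch in Python; the reward estimates, uncertainty values, ε settings, and β are placeholders, not values used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_greedy(mu, eps):
    # With probability eps pick a uniformly random action, otherwise the best estimate.
    return int(rng.integers(len(mu))) if rng.random() < eps else int(np.argmax(mu))

def eps_decreasing(mu, eps0, t):
    # Same rule as eps-greedy, but epsilon decays over time (one common choice: eps0 / (1 + t)).
    return eps_greedy(mu, eps0 / (1.0 + t))

def ucb(mu, sigma, beta):
    # Pick the action maximizing mu(a) + beta * sigma(a):
    # favors actions that are either high value or highly uncertain.
    return int(np.argmax(mu + beta * sigma))

mu = np.array([0.2, 0.5, 0.4])      # current expected rewards (placeholders)
sigma = np.array([0.3, 0.05, 0.4])  # current per-action uncertainty (placeholders)
print(eps_greedy(mu, eps=0.1), eps_decreasing(mu, eps0=0.9, t=10), ucb(mu, sigma, beta=1.0))
```

(ε-first corresponds to calling eps_greedy with ε = 1 for the first few time steps and ε = 0 afterwards.)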

We can draw a parallel to our setting as follows. Our agent is the model-based scheduler, and each time step corresponds to when it must assign a task to a node. The context vector is the node's resource usage counters. The actions our model-based scheduler can take are either to allow the scheduling assignment to go through, or to delay the scheduling (if straggler behavior is predicted).

However, our actions are not symmetric in the information they provide. The "delay" action provides no feedback at all. As such, it does not make sense to "explore" by taking this action, because doing so amounts to essentially discarding the data point. Thus, an exploration phase only makes sense when straggler behavior is predicted and the best action is to delay; in that case, to "explore" means to simply go ahead with the scheduling.

A second difference arises because our setting is not online. Updating models is expensive, so we cannot afford to retrain at every time step. Instead, we collect data for several time steps and then retrain in the background.

A third, and rather subtle, difference is that in our setting prior scheduling decisions can affect future data points x, owing, for instance, to overloading of the cluster, whereas in contextual bandits the context vectors x seen at each time step do not depend on previously taken actions. While we do not explicitly model this dependence on prior decisions, it means that the policy we follow also affects the feature vectors x we see and record.

Finally, instead of starting with a completely random initial model, we start with a basic model trained on data collected from a naive scheduler (i.e., one without a straggler prediction model) in an "offline" phase. In the real system, this initial offline exploration phase happens only once, and any subsequent retraining of the model uses data collected while the model-based scheduler is running. This initial offline phase lasts for a fixed time period T_offline.

After this initial explore phase, we train our straggler prediction models. We then collect more data and use it to retrain the models. This forms the second phase and lasts another period T_online. We consider four strategies for the second phase:

1. No-explore: This is the baseline strategy, where we run tasks with Wrangler's predictive scheduler as-is and simultaneously collect data.

2. ε-greedy: We run tasks with our predictive scheduler. Whenever the model predicts with high confidence that running a task on a node will cause straggler behavior, with probability ε we ignore this prediction and launch the task on the node anyway. We record the node's resource usage counters at the time the task is launched and, when the task finishes, whether it became a straggler. ε is kept fixed. At the end of this phase, the collected data is pooled with the data from the initial explore phase and used to retrain the models.

3. ε-decreasing: We divide this phase into four parts. In each part we run tasks with our predictive scheduler, with straggler predictions ignored with probability ε as above. ε is kept high (0.9) in the first quarter and decreased by 0.2 in each subsequent quarter. At the end of each quarter, the models are retrained with all data collected up to that point. This strategy lets us smoothly interpolate between the initial exploration phase and the final exploitation phase, and thus between the initial training distribution and the final testing distribution, with each intermediate model trained on the mistakes of its predecessor.

4. Confidence-based exploration: Our model predicts a confidence or probability p that the task will be a straggler, and Wrangler's original scheduler schedules the task if p < t for some threshold t, and otherwise delays the task. We modify this scheduler so that if p > t, we nevertheless go ahead and schedule the task with probability (1 − p)/(1 − t). This probability starts at 1 when p is at the threshold t and gradually drops to 0 as the model becomes more confident that the task will be a straggler. This has two advantages. First, we are unlikely to schedule a task when we are very certain it will be a straggler, so our exploration is less at risk of producing large delays. Second, we sample more data points from the region in which the model is less certain, i.e., exactly the examples on which the classifier is least confident. (A minimal sketch of this decision rule appears after this list.)
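The following is a minimal sketch of how the ε-greedy and confidence-based exploration decisions could be implemented; the function name, interface, and example numbers are illustrative assumptions, not our production code.

```python
import random

def explore_decision(p, t, eps=None):
    """Decide whether to schedule a task the model flags as a likely straggler.
    p: predicted straggler probability; t: the scheduler's threshold.
    If eps is given, use the epsilon-greedy rule; otherwise use the
    confidence-based rule that schedules with probability (1 - p) / (1 - t).
    (Illustrative sketch; names and interface are assumptions.)"""
    if p < t:
        return "schedule"            # model does not predict a straggler
    if eps is not None:              # epsilon-greedy exploration
        return "schedule" if random.random() < eps else "delay"
    prob = (1.0 - p) / (1.0 - t)     # confidence-based exploration
    return "schedule" if random.random() < prob else "delay"

# Just above the threshold we explore almost always; near-certain stragglers
# are almost never scheduled.
print(explore_decision(p=0.72, t=0.7))
print(explore_decision(p=0.99, t=0.7))
print(explore_decision(p=0.90, t=0.7, eps=0.3))
```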

At the end of this phase the models are frozen and then deployed. Figure 3 shows this setup.

6. EXPERIMENTAL RESULTS

Next, we experiment with these proposed strategies using Wrangler and see whether they provide any gains in the form of improved scheduling.


38

Page 39: Model-based Resource Allocation in the Public Cloud

39

Our solution: Clustering-based Query (ClustQ)

[Slide diagram: Initial Training Set, Test Set, Datapoints Queue, Predict class, Exploit vs. Explore]

Page 40: Model-based Resource Allocation in the Public Cloud

Our solution: Clustering-based Query (ClustQ)

40

[Slide diagram: Initial Training Set, Test Set, Datapoints Queue, Predict class, Exploit vs. Explore]

Page 41: Model-based Resource Allocation in the Public Cloud

41

Evaluation

3 distinct applications:

1. an intelligent job scheduler trying to schedule jobs on machines in the face of an evolving cluster,

2. a performance estimator for cloud-based services dealing with changing interference patterns, and

3. a facial-features-based gender classifier dealing with changing fashion trends.

Page 42: Model-based Resource Allocation in the Public Cloud


ClustQ: Efficient retraining for models deployed in systems

(a) For the best TP, TN rates. (b) For the least number of labels queried.

Figure 1. Prediction accuracies (TP, TN) achieved by all strategies and the corresponding number of labels queried on the JobScheduling dataset (TP: True Positives, TN: True Negatives).

2. Number of data points queried for labels: We aim to reduce this number while improving the prediction accuracy.

Before evaluating ClustQ using the three real-world datasets, we discuss how it handles dataset shifts using synthetically generated data.

6.2 Experiment With Synthetic Data

To gain insights into ClustQ, Algorithm 1, we devised a toy example with 2-D isotropic Gaussian clusters.

Data generation: We selected 10 different means μ_i, set σ = 0.15, and, for each cluster i, drew between 150 and 400 samples x_j from multivariate Gaussians parameterized by (μ_i, σ²I). To generate labels, we chose a line as a classifier for each cluster and assigned labels to data points depending on which side of the line they lie. We set 80% of the starting cluster, C0, as the training set. We formed a test data stream from the rest of the data by sampling points from the clusters without replacement, with probability proportional to the size of a true cluster.
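A minimal sketch of this data-generation procedure is shown below, assuming placeholder means, a fixed random seed, and a randomly drawn labeling line per cluster; the exact values used in our experiments differ.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 0.15
means = rng.uniform(0, 5, size=(10, 2))        # 10 cluster means (placeholders)

points, labels, cluster_ids = [], [], []
for i, mu in enumerate(means):
    n = int(rng.integers(150, 401))            # 150 to 400 samples per cluster
    x = rng.normal(mu, sigma, size=(n, 2))     # isotropic Gaussian: covariance sigma^2 * I
    w, b = rng.normal(size=2), rng.normal()    # a per-cluster linear classifier (a line)
    y = (x @ w + b > 0).astype(int)            # label = side of the line the point lies on
    points.append(x); labels.append(y); cluster_ids.append(np.full(n, i))

X = np.concatenate(points); y = np.concatenate(labels); c = np.concatenate(cluster_ids)
print(X.shape, y.mean())
```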

Evaluation process and metrics: For each cluster, we learn a binary classifier. We then create a stream of data points by sampling from the different clusters, run ClustQ, and check that it starts new clusters when the data stream shifts to a new cluster (as indicated by the ground truth we have). We measure: (a) shift-detection recall R_shift, the fraction of data-stream shifts for which we create a new cluster, and (b) shift-detection precision P_shift, the fraction of created clusters that correspond to an actual data-stream shift. We also compare the accuracy of the multiple per-cluster models to the baseline of building a single model, and we evaluate how well our clusters match the true clusters by measuring cluster purity.

Results on synthetic data: The clustering algorithm achieves R_shift = 1, i.e., it created new clusters for 100% of the data-stream shifts, and P_shift = 0.82, i.e., 9 of 11 created clusters were correctly initialized. Although a few extraneous clusters are created, they comprise a small fraction of the data and have little effect on overall cluster purity. We measure cluster purity using the normalized mutual information (NMI) metric: the mutual information between the algorithm's clusters and the true classes, normalized by the entropy of the clusters and classes. Our algorithm obtained a score of 0.98, indicating that the recovered clusters were of high purity.
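The sketch below illustrates how these metrics can be computed, using scikit-learn's NMI implementation for cluster purity; the tolerance-based matching of shifts to cluster creations and all example numbers are assumptions for illustration, not our exact bookkeeping.

```python
from sklearn.metrics import normalized_mutual_info_score

def shift_detection_scores(true_shift_steps, created_cluster_steps, tol=0):
    """Shift-detection recall and precision from event time steps
    (illustrative sketch; the tolerance-based matching is an assumption)."""
    if not true_shift_steps or not created_cluster_steps:
        return 0.0, 0.0
    matched = [s for s in true_shift_steps
               if any(abs(s - c) <= tol for c in created_cluster_steps)]
    correct = [c for c in created_cluster_steps
               if any(abs(s - c) <= tol for s in true_shift_steps)]
    recall = len(matched) / len(true_shift_steps)
    precision = len(correct) / len(created_cluster_steps)
    return recall, precision

print(shift_detection_scores([100, 250, 400], [100, 250, 320, 400], tol=5))

# Cluster purity via normalized mutual information between the algorithm's
# cluster assignments and the ground-truth cluster ids.
true_ids = [0, 0, 1, 1, 2, 2]
found_ids = [0, 0, 1, 1, 1, 2]
print(normalized_mutual_info_score(true_ids, found_ids))
```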

As baselines, we trained two linear SVMs: one on the given training set and another on the training set combined with the test set, which obtained (TP, TN) rates of (0.56, 0.47) and (0.52, 0.49), respectively. ClustQ trains multiple models on subsets of the clustered data points and achieves improved (TP, TN) rates of (0.81, 0.74).

These results emphasize that our algorithm detects shifts and queries labels for data points that lie far away from previously seen data points. By creating new clusters and building new models for their members, we learn specialized models that are more accurate than a single model.

Next, we present the evaluation on the three real-world problem instances described in Section 5.

6.3 Problem 1: Scheduling Jobs in a Cluster

We first compare the different strategies with respect to the evaluation metrics described earlier, and then present more detailed results for each strategy. The strategies have different sets of hyperparameters, and we did a grid search over a range of them, except for ε-decreasing and confidence-based query, which used the setup explained in Section 6.1. As a first result, we pick the hyperparameter set for each strategy that achieves the best TP and TN values.

Evaluation: Job Scheduling Problem

Page 43: Model-based Resource Allocation in the Public Cloud

Current status, Limitations, and Next Steps

43

• Working on proving theoretical guarantees for ClustQ

• Plan to extend ClustQ to deal with concept drift

Page 44: Model-based Resource Allocation in the Public Cloud

Thank you!

Ø PARIS:Selecting the Best VM across Multiple Public Clouds: A Data-Driven Performance Modeling Approach

Ø ClustQ: Online Covariate Clustering for Efficient Retraining and Data Exploration