Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof...

24
Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski, Przemysław Zych, Przemek Broniek, Jarek Kuśmierek, Paweł Nowak, Beata Strack, Piotr Witusowski, Steven Hand, John Wilkes (Google) EuroSys 2020 April 2020

Transcript of Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof...

Page 1: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Autopilot: workload autoscaling at GoogleKrzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski, Przemysław Zych, Przemek Broniek, Jarek Kuśmierek, Paweł Nowak, Beata Strack, Piotr Witusowski, Steven Hand, John Wilkes (Google)

EuroSys 2020April 2020

Page 2: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Google runs in containers

In any given week, we

launch over two billion

containers across

Google.

Page 3: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Resource limits are crucial to isolate workloads

container limit:max amount of CPU/mema container can use

container usage:CPU/mem used

container slack:CPU/mem wasted

Page 4: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Borg, our scheduler, packs containers to machines by resource limits.

image source: http://dx.doi.org/10.1145/2741948.2741964 [Verma et al., EuroSys’15]

machines

Page 5: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Limits are fine-grained: CPU in milli-coresmemory in bytes

Source: http://dx.doi.org/10.1145/2741948.2741964 [Verma et al., EuroSys’15]

Page 6: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + ConfidentialWe pack containers to machines by limits.

So, precise limits are crucial for efficiency and reliability.

6

limit

machine capacity

container B

container C

container A

Page 7: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + ConfidentialWe pack containers to machines by limits.

So, precise limits are crucial for efficiency and reliability.

7precise limits

good!

limit

limit

usage

machine capacity

container B

container C

container A

Page 8: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + ConfidentialWe pack containers to machines by limits.

So, precise limits are crucial for efficiency and reliability.

out-of-resources crash 8precise

limits

good!

limit

resource arewasted

(underallocated machine)

bad! bad!

limit

usage

requested usage

container limit

machine capacity

container B

container C

container A

Page 9: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Autopilot acts as a controller for Borg limits.

Autopilot continuously adjusts resource limits:

CPU/Mem limits for containers (vertical scaling),

number of replicas (horizontal scaling).

container limitscontainer counts container

limitsstart/stop

containers

Page 10: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Autopilot Recommenders

Page 11: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Moving window recommenders

● Exponentially-decaying samples

(half-life of 48 hours)

● Compute statistics over the

samples, e.g. 95%ile

● add a safety margin

time

resources

usage

98%ile

Page 12: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Moving window recommenders

● Exponentially-decaying samples

(half-life of 48 hours)

● Compute statistics over the

samples, e.g. 95%ile

● add a safety margin

time

resources

usage

exponential decay

98%ile

Page 13: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Moving window recommenders

● Exponentially-decaying samples

(half-life of 48 hours)

● Compute statistics over the

samples, e.g. 95%ile

● add a safety margin

time

resources

usage

exponential decay

safety margin

limit

Page 14: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

● Each model is an arg-max

algorithm picking a limit value

● Each model is parametrized by

the decay rate and the safety

margin.

● The recommender picks the

model performing the best over a

longer time period.

Machine learning recommenders

decay rate

limitmodel nmodel 1 model 2 …….

Page 15: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Evaluation: Observational study of production jobsFocus on memory

Page 16: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Autopilot efficiency - reduction of slack

relative slack: (av_limit - 95%ile usage) / (av_limit)

16

limit(t)

usage(t)

slack(t)

absolute slack:∫ slack(t) dt = ∫ limit(t) dt - ∫ usage(t) dt

unit: capacity of a single (largish) machine

av_limitaverage

limit during the day

95%ile usage during the day

(av_limit - 95%ile usage)

Page 17: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

A random sample of 5000 jobs in each

category.

Cum

ulat

ive

dist

ribut

ion

func

tion

Autopiloted jobs have significantly smaller relative slack.

Relative slack: (av_limit - 95%ile usage) / av_limit

av_limit

(av_limit - 95%ile usage)

better worse

Page 18: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Autopiloted jobs save significant capacity.

Cum

ulat

ive

dist

ribut

ion

func

tion

Absolute slack [machines]

Absolute slack

A random sample of 5000 jobs in each

category.

betterworse

Page 19: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

When jobs migrate to Autopilot, their slack is significantly reduced.

A random sample of 500

jobs that migrated to

autopilot in a certain

month, m0.

CDFs for slack for 2 months

before and after migration

reduction of relative slack

Cum

ulat

ive

dist

ribut

ion

func

tion

betterworse

Page 20: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Autopilot Reliability: how frequent are out-of-memory errors.

We count terminations of

containers.

We weight the number of

terminations by the average

number of containers of a job.

out-of-resources crash

requested usage

container limit

machine capacity

Page 21: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

Autopilot reduces the frequency of out-of-memory events.

OOMs are rare: 99.5% of

autopiloted jobs have no

OOMs.C

umul

ativ

e di

strib

utio

n fu

nctio

n

better

worse

Page 22: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

DevOps:Autopiloted jobs account for over 48% of Google’s fleet-wide resource usage.

Page 23: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + ConfidentialAutopilot’s dynamic limits could help to keep the job running despite bugs.

Page 24: Autopilot: workload autoscaling at Google...Autopilot: workload autoscaling at Google Krzysztof Rzadca (Google & University of Warsaw, Poland), Paweł Findeisen, Jacek Świderski,

Proprietary + Confidential

1. Efficient scheduling requires fine-grained control of jobs’ limits

2. Humans are bad at setting the limits precisely.

3. Autopilot uses past usage to drive future limits

4. Autopilot reduces relative slack by 2x

...and it reduces the number of jobs severely impacted by OOMs 10x

Autopilot: workload autoscaling at Google