From 0 to Capacity Planning


Page 1: From 0 to Capacity Planning

From Zero To Capacity Planning

Page 2: From 0 to Capacity Planning

@Randommood

INES Sombra

Page 3: From 0 to Capacity Planning

Globally distributed and Highly available

Page 4: From 0 to Capacity Planning

Why capacity planning?

Or a journey of discovery and ingenuity

Page 5: From 0 to Capacity Planning

The views reflected in this talk are not to be considered a reflection of the skills of my coworkers, who are extremely nice human beings and way better at capacity planning than I am.

😜

NOT A monitoring person💀

🚨🚨

Page 6: From 0 to Capacity Planning

INSTRUMENT

MONITOR & ALERT

PLAN & PREDICT

The Road to Capacity planning

?

Page 7: From 0 to Capacity Planning

Findings

Books

0

Day One

Some Learning

Our Discoveries

Rituals & Myths

Asking Around

Bringing it Home

our Path today

Checking The Edge

Page 8: From 0 to Capacity Planning

zero… Oh shit!

Page 9: From 0 to Capacity Planning

a convenient “situation”

Handles State

Many Clients

Other systems depend on this service to be: up, healthy, and available!

A bit F*cked

Page 10: From 0 to Capacity Planning

Our World

Edge Core✨ ✨

Page 11: From 0 to Capacity Planning

a Fastly POP

Page 12: From 0 to Capacity Planning

I Rule the Edge!

Evaluates weekly global POPs performance & makes projections

Publishes capacity performance report in clear location

Plans for our physical capacity & transit capacity

Meet Catharine

Page 13: From 0 to Capacity Planning

Planning Our Capacity

Some metrics:
- Network Capacity (Gb)
- Ordered Network Capability (Gb)
- Planned Network Capacity (Gb)
- RPS Capacity (k)
- Network peak (Gb)
- RPS peak (k)
- Site CPU Peak (%)
- Network Utilization (%)

Over 30%: flagged, Over 70%: Red status
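The thresholds above can be sketched as a small status check. This is only an illustration: the slide does not say which metric the 30% and 70% limits apply to, so the sketch assumes they apply to Network Utilization (%) and that utilization is peak divided by capacity. The function names `utilization_pct` and `status` are hypothetical.

```python
def utilization_pct(peak_gb: float, capacity_gb: float) -> float:
    """Network utilization as a percentage (assumed: peak / capacity)."""
    if capacity_gb <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * peak_gb / capacity_gb


def status(util_pct: float, flag_at: float = 30.0, red_at: float = 70.0) -> str:
    """Map a utilization percentage to the statuses named on the slide."""
    if util_pct > red_at:
        return "red"
    if util_pct > flag_at:
        return "flagged"
    return "ok"


if __name__ == "__main__":
    # Hypothetical POP with 100 Gb of capacity and a 40 Gb peak.
    print(status(utilization_pct(peak_gb=40, capacity_gb=100)))
```

Both limits are strict ("over"), so a site sitting at exactly 30% is not flagged in this sketch.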

Page 14: From 0 to Capacity Planning

Edge Insights

Our ability to correctly plan for capacity is critical to our bottom line

Capacity doesn’t just involve hardware; software optimizations matter

People affect capacity

Page 15: From 0 to Capacity Planning

Hitting The Books

Page 16: From 0 to Capacity Planning

Defining Capacity planning

Measuring, planning, & managing system growth

Determines what your system needs & when

From the observation of actual traffic. Use current performance as baseline.

Must happen regardless of what you might optimize

Page 17: From 0 to Capacity Planning

Allspaw's Wisdom

From The Art of Capacity Planning 👈

[Flowchart] We have to be this fast & reliable (X per second & Y% uptime) ⇒ measure how fast/reliable we are ⇒ are we right now?

Yes! ⇒ figure out how to stay fast/reliable enough

No! ⇒ change / add / remove hardware, software, architecture

Page 18: From 0 to Capacity Planning

System’s Ceiling: critical level of a resource that cannot be crossed without failure. Find yours

Another form of Capacity Planning: Controlled load testing

Predictions: ceilings + historical data

Allspaw's Wisdom
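One way to read "ceilings + historical data" is to fit a trend to past peaks and ask when it crosses the ceiling. The sketch below assumes evenly spaced samples (say, weekly peaks) and a straight-line trend; neither assumption comes from the talk, and the name `periods_until_ceiling` is hypothetical. Treat the result as a rough guess, not a forecast.

```python
from typing import Optional, Sequence


def periods_until_ceiling(samples: Sequence[float], ceiling: float) -> Optional[float]:
    """Estimate how many sampling periods remain before a linear trend
    through `samples` reaches `ceiling`.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line has already reached the ceiling.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    # Ordinary least-squares fit of y = intercept + slope * x, x = 0..n-1.
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    if slope <= 0:
        return None
    crossing_x = (ceiling - intercept) / slope
    return max(0.0, crossing_x - (n - 1))


if __name__ == "__main__":
    # Hypothetical weekly RAM peaks (GB) against a 70 GB ceiling.
    print(periods_until_ceiling([10, 20, 30, 40], ceiling=70))
```

A straight line is the simplest choice here; real traffic has spikes and seasonality that this ignores.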

Page 19: From 0 to Capacity Planning

Allspaw's Wisdom

System architecture can affect your ability to add capacity

Identify & track your application’s metrics

Tying metrics to user behavior is helpful

If you don’t have ways to measure your current capacity you can’t plan

Page 20: From 0 to Capacity Planning

Little’s Law & Capacity planning

L = λW

Capacity (L), Throughput (λ), and Latency (W)

Applies to stable systems

Use this information to better understand our workload and to define constraints
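As a worked example of L = λW, the sketch below computes how many requests are in flight for a given throughput and latency, and the throughput a fixed concurrency limit allows. It reads L as the average number of requests in the system, which the slide labels "Capacity". The numbers and function names are illustrative assumptions, and the relation only holds for a stable system, as noted above.

```python
def in_flight(throughput_rps: float, latency_s: float) -> float:
    """L = λW: average number of requests in the system."""
    return throughput_rps * latency_s


def max_throughput(concurrency_limit: float, latency_s: float) -> float:
    """λ = L / W: throughput sustainable at a given concurrency limit."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return concurrency_limit / latency_s


if __name__ == "__main__":
    # 200 requests/s at 50 ms each keeps about 10 requests in flight.
    print(in_flight(throughput_rps=200, latency_s=0.05))
    # 100 worker slots at 250 ms per request sustain about 400 requests/s.
    print(max_throughput(concurrency_limit=100, latency_s=0.25))
```

The second function shows why slowness eats capacity: if latency doubles while the concurrency limit stays fixed, sustainable throughput halves.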

Page 21: From 0 to Capacity Planning

Literature Insights

Possible to have plenty of capacity and a slow site nonetheless

Projections & curve fitting are guesses

Keep track of API calls & their rate

Always gonna be spikes & hiccups. Take the bad with the good & plan for it

Page 22: From 0 to Capacity Planning

Rituals & Myths

Page 23: From 0 to Capacity Planning

Crowdsourcing Capacity planning

Page 24: From 0 to Capacity Planning

Crowdsourcing Capacity planning

Page 25: From 0 to Capacity Planning

Industry Insights

Hard to extrapolate general advice into something applicable for my situation

Simplicity & ability to reason are the only things I could trust

Confusing community stance on the ROI of capacity planning

Page 26: From 0 to Capacity Planning

Findings & Putting things in practice

Page 27: From 0 to Capacity Planning

steps followed

Step One: Discovery

Documented system architecture & request lifecycle

Formalized: clients, SLAs, & operational requirements

Step Two: Gauging & Planning

Confirmed constraints & determined strategy

Parallelized capacity & optimizations tasks

Organized a team

Page 28: From 0 to Capacity Planning

REQUEST flow

[Diagram: Edge and Core. Components shown: APP / API (two), LB (two), COORDINATOR A, COORDINATOR B, COORDINATOR C 🐤, and caches CACHE LON, CACHE DFW, CACHE FRA, CACHE LAX, CACHE AMS, CACHE SYD.]

Page 29: From 0 to Capacity Planning

steps followed

Step Three: Capacity expansion

Doubled RAM: our constrained resource

Horizontally scaled to 3 servers + 1 canary

Step Four: re-Evaluation

Start process again

Tons of tuning left to do. We know we have suboptimal configs!

Page 30: From 0 to Capacity Planning

System Before

Page 31: From 0 to Capacity Planning

System After

Page 32: From 0 to Capacity Planning

System Before System After

Page 33: From 0 to Capacity Planning

System Before System After

Page 34: From 0 to Capacity Planning

Unexpected Challenges

Our goal when adding capacity was no service disruption.

Localhost is the goddamn devil

Gap from metric/graph to insight can be huge

Slowness is the nemesis of distributed systems

Page 35: From 0 to Capacity Planning

The Oprah Problem

Developing operational insights into a non-owned system under pressure is not great

Use playbooks, debug.md, rotations, & rollout owners

Proactivity and clarity are your best tools

Everyone gets more capacity!

Page 36: From 0 to Capacity Planning

Some Insights

Anything API driven ought to carry a rate limit - We can easily DDOS ourselves!

Monitor and alert on expensive API actions

Mind your system dependencies: practice defensive system design & architecture

CAPACITY PLANNING

ALERTING

MONITORING
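The talk does not say how rate limits were implemented. As one common approach, a token bucket can cap an API's request rate and charge more for expensive actions. The sketch below is a minimal single-process version under that assumption; the class name `TokenBucket` and its parameters are hypothetical, and the clock is injectable so the behavior can be tested without sleeping.

```python
import time
from typing import Callable


class TokenBucket:
    """Minimal token-bucket rate limiter (single process, not thread-safe)."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s      # tokens added per second
        self.burst = burst          # maximum tokens the bucket can hold
        self.clock = clock
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if the call may proceed."""
        now = self.clock()
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_s=5, burst=10)
    # A cheap call costs 1 token; a hypothetical expensive action costs 5.
    print(bucket.allow(), bucket.allow(cost=5))
```

Counting how often `allow` returns False for expensive actions gives a natural signal to monitor and alert on.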

Page 37: From 0 to Capacity Planning

Some Findings

Capacity tied to murky organizational structure is both good & bad (but mostly bad)

Mind your error descriptions! Cheeky today ⇒ misleading tomorrow!

Page 38: From 0 to Capacity Planning

Finding my system’s ceiling is still tricky

Services owned by engineers means you need to level up on Ops skills

Back to re-evaluate setup to get more out of this new capacity

Performance testing ought to be done on the core’s side (& edge)

My Insights

Page 39: From 0 to Capacity Planning

TL;DR

Capacity planning

Is a process, not a one-time event

Pushes you to better understand your system, its capacity & its boundaries - that is good!

Proactivity is best

Distributed systems

Request lifecycle gets tricky

System boundaries, dependencies & SLAs must be discussed

Your system’s capacity may bound other systems’ capacity

Page 40: From 0 to Capacity Planning

github.com/Randommood/ZerotoCapacityPlanning

Special Thanks to: Catharine Strauss, Alan Kasindorf, Matt Whiteley, Caitie McCaffrey, Thom Mahoney, Mike O’Neill, Devon O’Dell, Katherine Daniels, Nathan Taylor, Bruce Spang, and Greg Bako

Thank you!

Page 41: From 0 to Capacity Planning

github.com/Randommood/ZerotoCapacityPlanning