What To Monitor For Black Friday / Cyber Monday


What To Monitor For Black Friday / Cyber Monday

Baron Schwartz - VividCortex

Purpose Of This Webinar

● Why talk about Black Friday and Cyber Monday?

○ Isn’t it just jumping on buzzwords and fluff?

● What kinds of apps are affected?

● What companies don’t see peaks?

● What can we learn from this topic?

Themes

● What could go wrong?

● Capacity planning

● Detecting latent issues early

● Understanding technology-specific limits

What Could Go Wrong?

The Voice Of Your Peers

● Disk space capacity

● Disk I/O capacity / IO wait

● CPU versus query latency

● Mutex bottlenecks / waits

● History list length / VACUUM

● DDoS attacks from botnets

● “Legit DoS” from buying bots

● Noisy neighbors / shared cust

Example from Shopify

“Shopify uses Rails, which creates a lot of connections to shard masters. At peak times this has the potential to consume a lot of extra memory. I keep a close eye on this.”

-- Sergio Roysen, Shopify

Example from a Hosted E-Commerce Platform

“We’ve had issues where some customers fell victim to Drupal Commerce inserting/updating a lot of records for each order. It worked fine during normal operation, but failed during Black Friday and after.”

-- Anonymous DBA

Capacity Planning

Capacity

● Ability to serve desired workload within performance tolerances

● Soft limits on capacity

● Hard limits on capacity

Desired Workload

● Workload = both the user population and their requests

● Do you know what to expect?

Trending and Projections

● Use long-term metrics to project based on historicals

● Use forecasting methods such as Holt-Winters for metrics with trend and seasonality
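
As a rough illustration of the forecasting bullet above, here is a minimal sketch of additive Holt-Winters (triple exponential smoothing) in plain Python. The function name, the simple initialization, and the smoothing parameters are assumptions made for illustration, not something from the webinar; in practice you would tune alpha/beta/gamma or use a statistics library.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, horizon):
    """Forecast `horizon` future points with additive Holt-Winters.

    series     -- evenly spaced historical samples, starting at the
                  beginning of a season (assumption of this sketch)
    season_len -- number of samples per season (e.g. 24 for hourly data
                  with a daily cycle)
    alpha, beta, gamma -- smoothing factors for level, trend, seasonality
    """
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    first = series[:season_len]
    second = series[season_len:2 * season_len]

    # Simple initialization: level is the mean of the first season, trend is
    # the average per-step change between the first two seasons, and the
    # seasonal terms are the first season's deviations from its mean.
    level = sum(first) / season_len
    trend = (sum(second) - sum(first)) / season_len ** 2
    seasonal = [x - level for x in first]

    for i in range(season_len, len(series)):
        value = series[i]
        s = seasonal[i % season_len]
        prev_level = level
        level = alpha * (value - s) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[i % season_len] = gamma * (value - level) + (1 - gamma) * s

    n = len(series)
    return [
        level + (h + 1) * trend + seasonal[(n + h) % season_len]
        for h in range(horizon)
    ]
```

For example, `holt_winters_additive(hourly_qps, 24, 0.3, 0.1, 0.2, 48)` would project two days ahead from an hourly queries-per-second history (`hourly_qps` is a hypothetical list of samples).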

Key Resources

● Compute resources: CPU, memory, IO (network/disk)

● Models of the user population (e.g. connections, sessions)

● Consider the application, the database, and the OS

Hard Limits

● Configured limits

○ Example: max_connections

○ Example: size of redo log

● Inherent limits

○ Example: network bandwidth

Soft Limits

Many resources have “burstable” capacity or will degrade gradually

● Example: the redo log

● Example: NIC buffers

Others will degrade as you approach capacity

● Example: latency under load

Application/Architecture Features

● Does your app degrade gracefully?

● Can you do load shedding?

● Is backpressure built-in? At what tier?

● Do you have feature flags?

● Do you know your most expensive features?
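
To make load shedding and feature flags concrete, here is a minimal sketch of an in-process load shedder. Everything in it (the `LoadShedder` class, `handle_request`, the `expensive_features_enabled` flag, the "503 overloaded" response) is hypothetical and only illustrates the idea: refuse new work once too many requests are in flight, and let a flag turn off expensive features.

```python
import threading


class LoadShedder:
    """Rejects new work once too many requests are already in flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Admit the request only if we are below the configured limit.
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle_request(shedder, work, expensive_features_enabled=True):
    """Run `work(flag)` unless the shedder says we are overloaded."""
    if not shedder.try_acquire():
        # Failing fast keeps queues short instead of letting latency grow.
        return "503 overloaded"
    try:
        return work(expensive_features_enabled)
    finally:
        shedder.release()
```

Rejecting at the application tier is only one option; the same idea can live in a proxy or connection pool, which is what the "at what tier?" question is getting at.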

How Much Runway Do We Have?

Use models/simulations to estimate what % of capacity you’re consuming now

Use forecasting to project what you will need to handle for peaks

Are you going to make it?

The Universal Scalability Law

● Simple, fast, real model of capacity under load

● Black-box, easy to measure for

● Gives an idea what % capacity you have used

● Download our ebook and Excel workbook to learn more and do your own modeling
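
For reference, the Universal Scalability Law models throughput X at concurrency N as X(N) = λN / (1 + σ(N − 1) + κN(N − 1)), where λ is the throughput at N = 1, σ is the contention coefficient, and κ is the coherency (crosstalk) coefficient. The sketch below evaluates that formula and the concurrency at which throughput peaks, N* = sqrt((1 − σ) / κ). The function names and the coefficient values in the usage note are made up for illustration; real coefficients come from fitting the model to your own measurements.

```python
import math


def usl_throughput(n, lam, sigma, kappa):
    """Predicted throughput at concurrency n under the USL."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_concurrency(sigma, kappa):
    """Concurrency at which predicted throughput is highest."""
    return math.sqrt((1 - sigma) / kappa)


def fraction_of_peak_used(n, lam, sigma, kappa):
    """Predicted throughput at n as a fraction of the predicted maximum."""
    n_max = usl_peak_concurrency(sigma, kappa)
    peak = usl_throughput(n_max, lam, sigma, kappa)
    return usl_throughput(n, lam, sigma, kappa) / peak
```

With hypothetical coefficients such as `lam=1000, sigma=0.05, kappa=0.0004`, `fraction_of_peak_used(16, 1000, 0.05, 0.0004)` gives a rough sense of how much of the modeled maximum a server running at concurrency 16 is already delivering.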

Actual Customer Server with USL Model

Another Real Server with USL Model

Query Latency vs Disk Utilization

● Queueing theory explains why latency spikes at high utilization

● In database servers, it’s often IO that’s the bottleneck, not CPU

● The spike is highly nonlinear

● See our queueing theory ebook

● Real customer screenshot -->
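
The nonlinear spike can be seen with the simplest queueing model. The sketch below assumes a single-server M/M/1 queue, where mean residence time is R = S / (1 − ρ) for service time S and utilization ρ. That model choice is an assumption for illustration; real disks and databases are more complicated, but the shape of the curve is similar.

```python
def mm1_residence_time(service_time, utilization):
    """Mean time in system (queueing + service) for an M/M/1 queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1 - utilization)


if __name__ == "__main__":
    service_time = 0.005  # hypothetical 5 ms per I/O request
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        latency_ms = mm1_residence_time(service_time, rho) * 1000
        print(f"utilization {rho:.0%}: mean latency {latency_ms:.1f} ms")
```

Going from 50% to 90% utilization multiplies mean latency by five in this model, and going from 90% to 99% multiplies it by ten again.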

Micro-Stalls

All systems stall constantly

When conditions are right, small problems become big, again nonlinearly

If you have 1-second pauses/freezes, would you know it?

VividCortex’s Adaptive Fault Detection algorithm is specifically for this use case
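
One naive way to answer the "would you know it?" question is to look for gaps in the stream of query completion timestamps. The sketch below is only that: a simple gap check with a hypothetical `find_stalls` function. It is not the Adaptive Fault Detection algorithm mentioned above, and it assumes the server is busy enough that a one-second silence is unusual.

```python
def find_stalls(completion_times, min_gap=1.0):
    """Return (start_time, gap_seconds) for every silence >= min_gap.

    completion_times -- timestamps (in seconds) at which queries finished
    """
    ordered = sorted(completion_times)
    return [
        (earlier, later - earlier)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier >= min_gap
    ]
```

Averaged per-minute metrics would hide a stall like this entirely, which is why per-second (or finer) data matters here.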

Are You Gonna Make It?

If you know your X factor and it looks like you’re going to fall short on capacity to serve it, you could have a problem.

Next steps? Load simulation / load testing could be a good idea.
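
A back-of-the-envelope version of this check, assuming "X factor" means the multiple of your normal peak load that you expect on the big day (that reading, the function names, and the 80% safety margin are all assumptions of this sketch):

```python
def projected_utilization(current_peak, x_factor, capacity):
    """Expected peak load as a fraction of estimated capacity.

    current_peak and capacity must use the same unit, e.g. queries/second.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return current_peak * x_factor / capacity


def will_make_it(current_peak, x_factor, capacity, safety_margin=0.8):
    """True if the projected peak stays within the chosen safety margin."""
    return projected_utilization(current_peak, x_factor, capacity) <= safety_margin
```

For example, a server that peaks at 2,000 queries per second today, with an estimated capacity of 10,000, has room for a 3x day under an 80% margin but not for a 5x day. The capacity estimate itself is the hard part, which is where modeling and load testing come in.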

Detecting Latent Problems

Latent Problems

● These are the problems that existed long before

● They manifest when they are least convenient

● Together with other issues, they become jointly sufficient to cause problems/outages

● (There’s no single root cause)

Errors You Haven’t Yet Noticed

● Check your error logs! Was the last restart clean?

● Are there errors/warnings in the logs?

● Are there any crashes, failures, restarts, etc. you didn’t know about?

Servers with Reboot Risk

● Do you have any servers that have accumulated config drift?

● When’s the last time each server was rebooted?

● Are your servers immutable?

● Is this the first Black Friday with your current hardware, current DB version, etc.?

Replication Delay

● Replication works fine until it doesn’t. Then it can’t catch up.

● What’s the “catch-up slope” from small delays in replication?
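
One way to put a number on the catch-up slope is to sample replication lag over time and fit a line through the samples. The sketch below does that with ordinary least squares; the function names and the sample format are assumptions, and how you collect lag samples depends on your database.

```python
def lag_slope(samples):
    """Least-squares slope of lag over time.

    samples -- list of (timestamp_seconds, lag_seconds) pairs
    Returns the change in lag per second of wall-clock time
    (negative means the replica is catching up).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / len(samples)
    mean_lag = sum(lag for _, lag in samples) / len(samples)
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must span more than one timestamp")
    numer = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples)
    return numer / denom


def seconds_to_catch_up(samples):
    """Estimated time until lag reaches zero, or None if it is not shrinking."""
    slope = lag_slope(samples)
    if slope >= 0:
        return None
    latest_lag = max(samples)[1]  # lag at the most recent timestamp
    return latest_lag / -slope
```

A slope close to zero under normal load is a warning sign: it means the replica has almost no spare apply capacity, so a delay picked up during the peak may never be recovered.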

Database-Specific Stuff

● Idle-in-trx sessions

● Locks/mutexes that escalate

○ Per-page locks

○ SELECT FOR UPDATE

○ Buffer pool mutexes

● Overhead of per-XYZ stuff (per-connection overhead…)

● Background worker tasks

○ VACUUM

○ InnoDB buffer pool purge and history list maintenance

Knowing Your Workload

Workload Analytics Is The Killer App

● Do you have “new” query types?

● Are queries gradually ramping?

● What’s different now versus last week or last month?
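
These questions amount to diffing two workload snapshots. A minimal sketch, assuming you already have per-query-type execution counts for two comparable time windows (the dictionary format, the query names in the usage note, and the 2x ramp threshold are hypothetical):

```python
def compare_workloads(before, after, ramp_factor=2.0):
    """Compare two {query_type: execution_count} snapshots.

    Returns (new_query_types, ramping_query_types), both sorted.
    A query type is "ramping" if its count grew by at least ramp_factor.
    """
    new_types = sorted(q for q in after if q not in before)
    ramping = sorted(
        q
        for q in after
        if q in before and before[q] > 0 and after[q] / before[q] >= ramp_factor
    )
    return new_types, ramping
```

For example, comparing `{"select_cart": 100, "insert_order": 50}` from last week against `{"select_cart": 120, "insert_order": 150, "select_coupon": 30}` from this week flags `select_coupon` as new and `insert_order` as ramping.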

In Conclusion...

● Try to understand/forecast your capacity requirements

● Try to understand/forecast your headroom

● Look for latent problems

● Sweep the floors so nobody trips on stuff

Thanks! Questions?

● Baron Schwartz

[email protected]

● @xaprb