What To Monitor For Black Friday / Cyber Monday
Uploaded by vividcortex
Purpose Of This Webinar
● Why talk about Black Friday and Cyber Monday?
○ Isn’t it just jumping on buzzwords and fluff?
● What kinds of apps are affected?
● What companies don’t see peaks?
● What can we learn from this topic?
Themes
● What could go wrong?
● Capacity planning
● Detecting latent issues early
● Understanding technology-specific limits
The Voice Of Your Peers
● Disk space capacity
● Disk I/O capacity / IO wait
● CPU versus query latency
● Mutex bottlenecks / waits
● History list length / VACUUM
● DDoS attacks from botnets
● “Legit DoS” from buying bots
● Noisy neighbors / shared cust
Example from Shopify
“Shopify uses Rails, which creates a lot of connections to shard masters. At peak times this has the potential to consume a lot of extra memory. I keep a close eye on this.”
-- Sergio Roysen, Shopify
Example from a Hosted E-Commerce Platform
“We’ve had issues where some customers fell victim to Drupal Commerce inserting/updating a lot of records for each order. It worked fine during normal operation, but failed during Black Friday and after.”
-- Anonymous DBA
Capacity
● Ability to serve desired workload within performance tolerances
● Soft limits on capacity
● Hard limits on capacity
Desired Workload
● Workload = both the user population and their requests
● Do you know what to expect?
Trending and Projections
● Use long-term metrics to project future demand from historical data
● Use forecasting methods such as Holt-Winters for metrics with trend and seasonality
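The forecasting bullet above can be illustrated with a minimal sketch of additive Holt-Winters smoothing. It assumes evenly spaced samples and at least two full seasons of history. The function name, the initialization scheme, and the smoothing parameters are illustrative assumptions, not something the webinar prescribes.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, horizon):
    """Forecast `horizon` points ahead with additive Holt-Winters smoothing.

    alpha, beta, gamma are the level, trend, and seasonal smoothing
    factors, each between 0 and 1 (hypothetical values; tune on real data).
    """
    m = season_len
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")

    first, second = series[:m], series[m:2 * m]
    # Simple initialization: level is the mean of the first season, trend is
    # the average per-step change between the first two seasons.
    level = sum(first) / m
    trend = (sum(second) - sum(first)) / (m * m)
    seasonal = [y - level for y in first]

    for t in range(m, len(series)):
        y = series[t]
        prev_level = level
        s = seasonal[t % m]
        level = alpha * (y - s) + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (y - level) + (1 - gamma) * s

    n = len(series)
    # Step h ahead lands on time index n + h - 1, which selects the season slot.
    return [level + h * trend + seasonal[(n + h - 1) % m]
            for h in range(1, horizon + 1)]


# Hypothetical metric with a repeating 4-step season.
history = [10, 20, 30, 20] * 3
forecast = holt_winters_additive(history, season_len=4,
                                 alpha=0.5, beta=0.5, gamma=0.5, horizon=4)
```

For production use, a maintained library implementation is usually a better choice than hand-rolled smoothing; the sketch only shows how level, trend, and seasonality combine.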
Key Resources
● Compute resources: CPU, memory, IO (network/disk)
● Models of the user population (e.g. connections, sessions)
● Consider the application, the database, and the OS
Hard Limits
● Configured limits
○ Example: max_connections
○ Example: size of redo log
● Inherent limits
○ Example: network bandwidth
Soft Limits
Many resources have “burstable” capacity or will degrade gradually
● Example: the redo log
● Example: NIC buffers
Others will degrade as you approach capacity
● Example: latency under load
Application/Architecture Features
● Does your app degrade gracefully?
● Can you do load shedding?
● Is backpressure built-in? At what tier?
● Do you have feature flags?
● Do you know your most expensive features?
How Much Runway Do We Have?
Use models/simulations to estimate what % of capacity you’re consuming now
Use forecasting to project what you will need to handle for peaks
Are you going to make it?
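As a rough illustration of the runway question, the sketch below compares today's peak load with a projected peak. The function name, the `peak_multiplier` parameter, and the numbers are hypothetical; the maximum capacity would come from a model or load test, and the multiplier from a forecast.

```python
def runway(current_peak_load, max_capacity, peak_multiplier):
    """Summarize headroom now and at the projected peak.

    All loads must use the same unit (for example, queries per second).
    """
    if max_capacity <= 0:
        raise ValueError("max_capacity must be positive")
    projected_load = current_peak_load * peak_multiplier
    return {
        "used_now": current_peak_load / max_capacity,
        "used_at_peak": projected_load / max_capacity,
        # Load that cannot be served if the projection holds.
        "shortfall": max(0.0, projected_load - max_capacity),
    }


# Hypothetical numbers: 400 qps today, 1000 qps capacity, 3x expected peak.
report = runway(400, 1000, 3)
```

A `used_at_peak` value above 1.0 means the projected peak exceeds estimated capacity.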
The Universal Scalability Law
● Simple, fast, real model of capacity under load
● Black-box, easy to measure for
● Gives an idea what % capacity you have used
● Download our ebook and Excel workbook to learn more and do your own modeling
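A minimal sketch of the Universal Scalability Law, assuming its usual three-parameter form: throughput at load N is lam * N / (1 + sigma * (N - 1) + kappa * N * (N - 1)). The coefficient values below are hypothetical; in practice they are fitted to measured (load, throughput) pairs.

```python
import math


def usl_throughput(n, lam, sigma, kappa):
    """Throughput predicted by the USL at load n (e.g. concurrency).

    lam: throughput at n = 1; sigma: contention; kappa: coherency cost.
    """
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak_load(sigma, kappa):
    """Load at which predicted throughput is highest."""
    return math.sqrt((1 - sigma) / kappa)


def capacity_used(n, lam, sigma, kappa):
    """Fraction of predicted peak throughput delivered at load n.

    Only meaningful as "% used" while n is below the peak load.
    """
    peak = usl_throughput(usl_peak_load(sigma, kappa), lam, sigma, kappa)
    return usl_throughput(n, lam, sigma, kappa) / peak


# Hypothetical fitted coefficients.
LAM, SIGMA, KAPPA = 100.0, 0.05, 0.0001
n_star = usl_peak_load(SIGMA, KAPPA)
used = capacity_used(20, LAM, SIGMA, KAPPA)
```

The peak location follows from setting the derivative of the throughput formula to zero, which gives N = sqrt((1 - sigma) / kappa).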
Query Latency vs Disk Utilization
● Queueing theory explains why latency spikes at high utilization
● In database servers, it’s often IO that’s the bottleneck, not CPU
● The spike is highly nonlinear
● See our queueing theory ebook
[Figure: real customer screenshot]
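The nonlinear spike can be illustrated with the simplest queueing model, a single-server M/M/1 queue, where mean response time is service time divided by (1 - utilization). Real disks and databases are not M/M/1 systems, so treat this as a sketch of the shape of the curve rather than a prediction; the 10 ms service time is an assumed value.

```python
def mm1_response_time(service_time, utilization):
    """Mean response time of an M/M/1 queue at the given utilization."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1 - utilization)


# Assumed 10 ms service time; watch latency grow as utilization rises.
SERVICE_TIME = 0.010
curve = {u: mm1_response_time(SERVICE_TIME, u)
         for u in (0.5, 0.8, 0.9, 0.95, 0.99)}
```

Going from 50% to 90% utilization multiplies mean latency by five in this model, and going from 90% to 99% multiplies it by ten again.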
Micro-Stalls
All systems stall constantly
When conditions are right, small problems become big, again nonlinearly
If you have 1-second pauses/freezes, would you know it?
VividCortex’s Adaptive Fault Detection algorithm is specifically for this use case
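One simple way to notice one-second freezes is to record a heartbeat at a fixed interval and look for gaps. The sketch below is a generic gap detector and is not VividCortex's Adaptive Fault Detection algorithm; the function name, interval, and tolerance are hypothetical.

```python
def find_stalls(heartbeats, interval, tolerance=0.5):
    """Return (start_time, gap) pairs where heartbeats arrived late.

    heartbeats: sorted timestamps in seconds from a periodic probe.
    A gap counts as a stall when it exceeds interval + tolerance.
    """
    stalls = []
    for prev, curr in zip(heartbeats, heartbeats[1:]):
        gap = curr - prev
        if gap > interval + tolerance:
            stalls.append((prev, gap))
    return stalls


# Hypothetical probe that should fire every second but froze after t=2.
stalls = find_stalls([0, 1, 2, 5, 6], interval=1)
```

This only works if samples are taken at least once per second; metrics averaged over a minute hide stalls of this size.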
Are You Gonna Make It?
If you know your X factor and it looks like you’re going to fall short on capacity to serve it, you could have a problem.
Next steps? Load simulation / load testing could be a good idea.
Latent Problems
● These are the problems that existed long before
● They manifest when they are least convenient
● Together with other issues, they become jointly sufficient to cause problems/outages
● (There’s no single root cause)
Errors You Haven’t Yet Noticed
● Check your error logs! Was the last restart clean?
● Are there errors/warnings in the logs?
● Are there any crashes, failures, restarts, etc you didn’t know about?
Servers with Reboot Risk
● Do you have any servers that have accumulated config drift?
● When’s the last time each server was rebooted?
● Are your servers immutable?
● Is this the first Black Friday with your current hardware, current DB version, etc?
Replication Delay
● Replication works fine until it doesn’t. Then it can’t catch up.
● What’s the “catch-up slope” from small delays in replication?
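One way to put a number on the catch-up slope is to fit a line through recent replication delay samples. The sketch below uses an ordinary least-squares slope; the function names and sample values are hypothetical, and how the delay is sampled depends on the database.

```python
def catch_up_slope(samples):
    """Least-squares slope of replication delay over time.

    samples: (timestamp_seconds, delay_seconds) pairs.
    A negative slope means the replica is catching up.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_mean = sum(t for t, _ in samples) / len(samples)
    d_mean = sum(d for _, d in samples) / len(samples)
    num = sum((t - t_mean) * (d - d_mean) for t, d in samples)
    den = sum((t - t_mean) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("timestamps must not all be equal")
    return num / den


def seconds_to_catch_up(samples):
    """Estimated time until delay reaches zero, or None if not catching up."""
    slope = catch_up_slope(samples)
    if slope >= 0:
        return None
    return samples[-1][1] / -slope


# Hypothetical samples: delay shrinking by 30 s every minute.
samples = [(0, 120), (60, 90), (120, 60)]
slope = catch_up_slope(samples)
eta = seconds_to_catch_up(samples)
```

A slope near zero or above it during a modest delay is the warning sign: the replica has no spare capacity to recover from a larger one.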
Database-Specific Stuff
● Idle-in-transaction sessions
● Locks/mutexes that escalate
○ Per-page locks
○ SELECT FOR UPDATE
○ Buffer pool mutexes
● Overhead of per-XYZ stuff (per-connection overhead…)
● Background worker tasks
○ VACUUM
○ InnoDB buffer pool purge and history list maintenance
Workload Analytics Is The Killer App
● Do you have “new” query types?
● Are queries gradually ramping?
● What’s different now versus last week or last month?
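The questions above can be approximated by grouping queries into fingerprints and comparing two time windows. This is a crude sketch: the regex normalization is an assumption that handles only simple literals, and the `ramp_ratio` threshold is hypothetical.

```python
import re
from collections import Counter


def fingerprint(sql):
    """Crudely normalize a query so similar statements group together."""
    s = sql.strip().lower()
    s = re.sub(r"'[^']*'", "?", s)   # replace quoted string literals
    s = re.sub(r"\b\d+\b", "?", s)   # replace standalone numbers
    return re.sub(r"\s+", " ", s)    # collapse whitespace


def compare_workloads(before, after, ramp_ratio=2.0):
    """Return (new_types, ramping) fingerprints between two query samples."""
    old = Counter(fingerprint(q) for q in before)
    new = Counter(fingerprint(q) for q in after)
    new_types = sorted(set(new) - set(old))
    ramping = sorted(fp for fp in new
                     if fp in old and new[fp] >= ramp_ratio * old[fp])
    return new_types, ramping


# Hypothetical query samples from last week and this week.
last_week = ["SELECT * FROM orders WHERE id = 1"]
this_week = [
    "select * from orders where id = 2",
    "SELECT * FROM orders WHERE id = 3",
    "SELECT * FROM carts WHERE user = 'bob'",
]
new_types, ramping = compare_workloads(last_week, this_week)
```

Comparing counts alone misses queries that became slower without becoming more frequent, so a fuller version would also track latency per fingerprint.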
In Conclusion...
● Try to understand/forecast your capacity requirements
● Try to understand/forecast your headroom
● Look for latent problems
● Sweep the floors so nobody trips on stuff