Monitoring at a SaaS Startup
Tradeoffs and Tools
Bridget Kromhout
8thbridge.com
small social commerce startup
acquired in the last week by Fluid, Inc.
small dev team
I am the ops team
twisty maze of little shell scripts
bespoke artisanal monitoring
difficult to modify; doesn’t scale
http://www.pcgameshardware.de/screenshots/1280x1024/2007/07/CA01.jpg
New Relic
pros: nice graphs, application-level view, good error analysis
cons: slow to update, many false-positive alerts, high prices (better now)
Motivating Change
http://99designs.com/illustrations/contests/illustration-pagerduty-161025/entries
Nagios: as hideous as you remember
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
“Horrendous interface”
“Well, it’s more “old” than anything else. At least everything is in the same place as you left it because it’s been the same since 1912.”
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
-- @murphy_slaw (via @lozzd)
HBase: monitor all the ports?!?
hbck: the HBase consistency checker
nagios -> bash script -> parsing output of hbck
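The chain above (Nagios runs a script, the script parses hbck output) can be sketched roughly as follows. This is a Python illustration rather than the bash script from the talk. It assumes that the hbck report contains a final "Status: OK" or "Status: INCONSISTENT" line and that the `hbase` launcher is on the PATH; both should be checked against your HBase version. The function names are hypothetical.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_hbck(output):
    """Map hbck text output to a (nagios_exit_code, message) pair.

    Assumption: the report carries a line such as "Status: OK" or
    "Status: INCONSISTENT" near the end.
    """
    for line in reversed(output.splitlines()):
        line = line.strip()
        if line.startswith("Status:"):
            status = line.split(":", 1)[1].strip()
            if status == "OK":
                return OK, "HBCK OK"
            return CRITICAL, "HBCK CRITICAL: status is %s" % status
    return UNKNOWN, "HBCK UNKNOWN: no Status line in hbck output"


def run_check(timeout_seconds=300):
    """Run `hbase hbck` and return (exit_code, message) for Nagios."""
    try:
        proc = subprocess.run(
            ["hbase", "hbck"],  # assumes `hbase` is on PATH for the Nagios user
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return UNKNOWN, "HBCK UNKNOWN: %s" % exc
    return classify_hbck(proc.stdout)
```

A wrapper script would print the message and exit with the returned code, which is all Nagios needs from a plugin.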
http://www.ymc.ch/en/how-to-monitor-hbase-health-by-nagios
adding alert after alert after...
http://modiinhub.com/wp-content/uploads/2014/02/logo-mongodb-tagline.png
MMS (MongoDB Monitoring Service)
“cyber” monday: 1988 called; wants its word back.
the rewards of hubris
MMS showed the issue, but we weren't alerting on it; we didn't understand the global write lock
If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving
yet, just in case it decides to make a run for it. -- @indec
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
Graphite & StatsD
➔ Graphite
  ◆ Store and visualize time-series data
  ◆ http://graphite.readthedocs.org/
➔ StatsD
  ◆ Measure everything! (Timers, counters, events, …)
  ◆ https://github.com/etsy/statsd/
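To make "measure everything" concrete, here is a minimal sketch of what an application sends to StatsD: one small UDP datagram per measurement, in the form `<name>:<value>|<type>`. The metric names are made up for illustration, and the host and port are assumptions (8125 is StatsD's default).

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one StatsD datagram: <name>:<value>|<type>."""
    return "%s:%s|%s" % (name, value, metric_type)


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; no reply is expected from StatsD."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


# Hypothetical metric names, one per StatsD type.
counter = statsd_line("checkout.completed", 1, "c")      # count one event
timer = statsd_line("checkout.render_time", 320, "ms")   # duration in milliseconds
gauge = statsd_line("queue.depth", 42, "g")              # current value
```

Because the transport is UDP, instrumented code does not block or fail when the StatsD daemon is down, which is what makes it cheap to instrument liberally.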
Where we were
➔ Graphite 0.9.9 (wanted 0.9.12)
  ◆ over 2 years old
  ◆ missing new features (consolidateBy!)
➔ StatsD was newish, but…
  ◆ hand-rolled
  ◆ running in a screen session
  ◆ on a special snowflake box
Community cookbooks?
➔ Graphite ones good, but…
  ◆ focus on Apache (we use nginx)
  ◆ we haven’t moved to Chef 11 (gasp!)
➔ StatsD
  ◆ https://github.com/librato/statsd-cookbook
  ◆ launches daemons via upstart
  ◆ generates config file based on attributes
Graphite cookbook (Part 1)
➔ Install in a virtualenv (django, uwsgi, nginx)
➔ Dependencies recommended
  ◆ https://github.com/graphite-project/graphite-web/blob/master/requirements.txt
➔ libcairo2-dev package on Ubuntu 12.04 LTS
➔ install graphite’s 3 parts via pip
Graphite cookbook (Part 2)
➔ graphite-web
  ◆ Django app, renders graphs
➔ whisper
  ◆ fixed-size database for storing time-series data
  ◆ like RRD
➔ carbon
  ◆ carbon-cache.py - stores data
  ◆ carbon-aggregator.py - buffers, then stores
  ◆ carbon-relay.py - for sharding/replication
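"Fixed-size" means a whisper file is allocated in full when the metric is created, with its size determined by the retention schedule. The sketch below estimates that size. The byte counts (16-byte file header, 12 bytes per archive header, 12 bytes per datapoint) reflect my understanding of whisper's on-disk layout and should be treated as assumptions, and the retention string is a made-up example, not a setting from the talk. Both halves of each retention are assumed to carry a unit suffix.

```python
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

# Assumed whisper layout sizes, in bytes.
METADATA_BYTES = 16      # file header
ARCHIVE_INFO_BYTES = 12  # one header per archive
POINT_BYTES = 12         # 4-byte timestamp + 8-byte double


def to_seconds(token):
    """Convert a token such as "10s" or "7d" to seconds."""
    return int(token[:-1]) * UNITS[token[-1]]


def parse_retentions(spec):
    """Turn "10s:6h,1m:7d" into [(seconds_per_point, points), ...]."""
    archives = []
    for part in spec.split(","):
        precision, duration = part.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(duration) // step))
    return archives


def whisper_file_size(spec):
    """Estimate the on-disk size of a whisper file for a retention spec."""
    archives = parse_retentions(spec)
    header = METADATA_BYTES + ARCHIVE_INFO_BYTES * len(archives)
    return header + sum(points * POINT_BYTES for _, points in archives)


size = whisper_file_size("10s:6h,1m:7d")  # hypothetical retention schedule
```

The file never grows after creation; old points are overwritten in place, which is the property it shares with RRD.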
when in doubt: tcpdump is your friend
http://blog.johngoulah.com/2012/10/looking-under-the-covers-of-statsd/
carbon-aggravator (between 0.9.10 & 0.9.12)
# If set true, metric received will be forwarded to
# DESTINATIONS in addition to
# the output of the aggregation rules. If set false
# the carbon-aggregator will
# only ever send the output of aggregation.
FORWARD_ALL = True
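The effect of this setting is easy to miss, so here is a toy model of it. This is not carbon's code: it is a simplified sketch of one flush of a single sum rule, with hypothetical metric names, showing which datapoints would be sent on to DESTINATIONS under each value of FORWARD_ALL as described in the comment above.

```python
def aggregator_output(received, rule_inputs, output_name, forward_all):
    """Toy model of one flush of a single "sum" aggregation rule.

    received: list of (metric_name, value) pairs seen in the interval.
    rule_inputs: set of metric names the rule matches.
    output_name: name of the aggregated metric.
    forward_all: mirrors the FORWARD_ALL setting.
    """
    matched = [value for name, value in received if name in rule_inputs]
    sent = []
    if forward_all:
        # Originals pass through untouched alongside the aggregate.
        sent.extend(received)
    if matched:
        sent.append((output_name, sum(matched)))
    return sent


# Hypothetical datapoints from two web hosts.
received = [("web1.requests", 3), ("web2.requests", 4), ("web1.cpu", 0.5)]
rule_inputs = {"web1.requests", "web2.requests"}

only_aggregates = aggregator_output(received, rule_inputs, "all.requests", False)
everything = aggregator_output(received, rule_inputs, "all.requests", True)
```

With the flag off, metrics that match no rule (such as `web1.cpu` here) never reach storage, which looks exactly like data silently disappearing.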
Carbonate
whisper-fill.py
backfill datapoints between whisper files
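The idea behind backfilling can be sketched on plain dictionaries. This is an illustration of the concept only, not whisper-fill.py's implementation, which works on archives inside whisper files. Here a series is a hypothetical mapping of timestamp to value, with None marking a gap.

```python
def backfill(dst, src):
    """Return a copy of dst with gaps filled from src.

    dst, src: dicts mapping timestamp -> value, where None marks a gap.
    Values already present in dst always win.
    """
    filled = dict(dst)
    for ts, value in src.items():
        if value is not None and filled.get(ts) is None:
            filled[ts] = value
    return filled


# Hypothetical series: dst has a hole at t=10 and no point at t=30.
dst = {0: 1.0, 10: None, 20: 3.0}
src = {0: 9.0, 10: 2.0, 30: 4.0}
merged = backfill(dst, src)
```

Only the missing points are copied, so data that was recorded correctly in the destination is never overwritten by the source.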
2am: sudden drop-off
8am: look at graphs: ?!?!
10am: and we’re back.
What’s next?
❏ finds real problems
❏ actionable alerting
❏ usable by all
❏ …?
the ideal monitoring solution...
http://www.quickmeme.com/img/f5/f512ff9bee084263df5571d3c81388019dcb063173e1dbcfa2babac9274576b6.jpg
What we’re actually using now
StatsD
Application-level error analysis
Alarms for autoscaling
Timers & counters
Log & host-level
Hadoop & HBase visualization
MongoDB graphs
Time-series data graphing
client-side plugins
External uptime checks
on-call rotation/alerting
Threshold-based alarms
Dashboard
Discuss!
Twitter: @bridgetkromhout
Email: [email protected]