Monitoring at a SaaS Startup: Tradeoffs and Tools


  • date post

    11-May-2015
  • Category

    Technology

description

I gave this talk at MinneBar 2014: http://sessions.minnestar.org/sessions/162 When I joined a SaaS startup already in progress as their first ops hire, what monitoring existed was a twisty maze of half-measures. The devteam dreaded oncall, and our Mean Time To Lost Sleep was way too low. Improving visibility into our infrastructure and application performance required trying new tools and changing how we thought about what we were measuring. Join me for a tragicomic journey from the vale of blissful ignorance through the straits of Nagios and into the mountains of Graphite. We'll talk tools and pitfalls, missteps and dead ends, and everything we haven't yet done but should. Tools covered will include Nagios, StatsD, Graphite, and Sentry, with some digressions into others such as NewRelic and MMS.

Transcript of Monitoring at a SaaS Startup: Tradeoffs and Tools

  • 1. Monitoring at a SaaS Startup: Tradeoffs and Tools. Bridget Kromhout

  • 2. 8thbridge.com
    small social commerce startup
    acquired in the last week by Fluid, Inc.
    small devteam
    I am the ops team

  • 3. twisty maze of little shell scripts
    bespoke artisanal monitoring
    difficult to modify; doesn't scale
    http://www.pcgameshardware.de/screenshots/1280x1024/2007/07/CA01.jpg

  • 4. New Relic
    pros: nice graphs; application-level view; good error analysis
    cons: slow to update; many false-positive alerts; high prices (better now)

  • 5. Motivating Change
    http://99designs.com/illustrations/contests/illustration-pagerduty-161025/entries

  • 6. : as hideous as you remember

  • 7. https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
    "Horrendous interface": Well, it's more old than anything else. At least everything is in the same place as you left it, because it's been the same since 1912.

  • 8. Sensu has so many moving parts that I wouldn't be able to sleep at night unless I set up a Nagios instance to make sure they were all running. -- @murphy_slaw (via @lozzd)

  • 9. HBase: monitor all the ports?!?
    hbck: the HBase consistency checker
    nagios -> bash script -> parsing output of hbck
    http://www.ymc.ch/en/how-to-monitor-hbase-health-by-nagios

  • 10. adding alert after alert after...

  • 11. http://modiinhub.com/wp-content/uploads/2014/02/logo-mongodb-tagline.png

  • 12. MMS (MongoDB Monitoring Service)

  • 13. cyber monday: 1988 called; wants its word back.
    the rewards of hubris
    MMS showed the issue, but we weren't alerting on it
    didn't understand the global write lock

  • 14. If it moves, we track it. Sometimes we'll draw a graph of something that isn't moving yet, just in case it decides to make a run for it. -- @indec
    http://codeascraft.com/2011/02/15/measure-anything-measure-everything/

  • 15. Graphite & StatsD
    Graphite: Store and visualize time-series data. http://graphite.readthedocs.org/
    StatsD: Measure everything! (Timers, counters, events, ...) https://github.com/etsy/statsd/

  • 16. Where we were
    Graphite 0.9.9 (wanted 0.9.12)
    over 2 years old
    missing new features (Consolidate by!)
    StatsD was newish, but hand-rolled
    running in a screen session on a special snowflake box

  • 17. Community cookbooks?
    Graphite: ones good, but focus on Apache (we use nginx); we haven't moved to Chef 11 (gasp!)
    StatsD: https://github.com/librato/statsd-cookbook launches daemons via upstart; generates config file based on attributes

  • 18. Graphite cookbook (Part 1)
    Install in a virtualenv (django, uwsgi, nginx)
    Dependencies recommended: https://github.com/graphite-project/graphite-web/blob/master/requirements.txt
    libcairo2-dev package on Ubuntu 12.04 LTS
    install Graphite's 3 parts via pip

  • 19. Graphite cookbook (Part 2)
    graphite-web: Django app, renders graphs
    whisper: fixed-size database for storing time-series data, like RRD
    carbon:
    carbon-cache.py - stores data
    carbon-aggregator.py - buffers, then stores
    carbon-relay.py - for sharding/replication

  • 20. when in doubt: tcpdump is your friend
    http://blog.johngoulah.com/2012/10/looking-under-the-covers-of-statsd/

  • 21. carbon-aggravator (between 0.9.10 & 0.9.12)
    # If set true, metric received will be forwarded to DESTINATIONS in addition to
    # the output of the aggregation rules. If set false the carbon-aggregator will
    # only ever send the output of aggregation.
    FORWARD_ALL = True

  • 22. Carbonate
    whisper-fill.py: backfill datapoints between whisper files

  • 23. 2am: sudden drop-off
    8am: look at graphs: ?!?!
    10am: and we're back.

  • 24. What's next?

  • 25. the ideal monitoring solution...
    finds real problems
    actionable alerting
    usable by all
    ?
    http://www.quickmeme.com/img/f5/f512ff9bee084263df5571d3c81388019dcb063173e1dbcfa2babac9274576b6.jpg

  • 26. What we're actually using now
    StatsD; Application-level error analysis; Alarms for autoscaling; Timers & counters; Log & host-level; Hadoop & HBase visualization; MongoDB Graphs; Time-series data graphing; client-side plugins; External uptime checks; oncall rotation/alerting; Threshold-based alarms; Dashboard

  • 27. Discuss!
    Twitter: @bridgetkromhout
    Email: bridget@kromhout.org
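
A note on the StatsD and Graphite slides: the "tcpdump is your friend" advice works because both tools speak plain text on the wire. StatsD accepts UDP datagrams of the form `bucket:value|type`, and carbon's plaintext listener accepts lines of the form `path value timestamp`. The sketch below is not code from the talk. The helper names and the metric names are made up for illustration, and the ports (8125 for StatsD, 2003 for carbon) are the usual defaults, assumed here rather than taken from the slides.

```python
import socket
import time


def statsd_counter(bucket, value=1, sample_rate=1.0):
    """Format a StatsD counter datagram, e.g. 'app.logins:1|c'."""
    packet = f"{bucket}:{value}|c"
    if sample_rate < 1.0:
        # A sampled counter tells StatsD to scale the count back up.
        packet += f"|@{sample_rate}"
    return packet


def statsd_timer(bucket, millis):
    """Format a StatsD timer datagram; the value is in milliseconds."""
    return f"{bucket}:{millis}|ms"


def graphite_line(path, value, timestamp=None):
    """Format one line of carbon's plaintext protocol."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"


def send_udp(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget send to a StatsD daemon (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names, printed rather than sent.
    print(statsd_counter("app.logins"))
    print(statsd_timer("app.checkout.render", 320))
    print(graphite_line("stats.app.logins", 3, 1400000000), end="")
```

Because the payloads are ASCII, capturing UDP traffic on the StatsD port shows exactly these strings, which makes it easy to check whether the application, StatsD, or carbon is the component dropping data.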