Monitoring at a SaaS Startup: Tradeoffs and Tools

I gave this talk at MinneBar 2014: When I joined a SaaS startup already in progress as their first ops hire, what monitoring existed was a twisty maze of half-measures. The devteam dreaded oncall, and our Mean Time To Lost Sleep was way too low. Improving visibility into our infrastructure and application performance required trying new tools and changing how we thought about what we were measuring. Join me for a tragicomic journey from the vale of blissful ignorance through the straits of Nagios and into the mountains of Graphite. We'll talk tools and pitfalls, missteps and dead ends, and everything we haven't yet done but should. Tools covered will include Nagios, StatsD, Graphite, and Sentry, with some digressions into others such as New Relic and MMS.

Transcript of Monitoring at a SaaS Startup: Tradeoffs and Tools

  • 1. Monitoring at a SaaS Startup: Tradeoffs and Tools. Bridget Kromhout

  • 2. small social commerce startup; acquired in the last week by Fluid, Inc.; small devteam; I am the ops team
  • 3. twisty maze of little shell scripts: bespoke artisanal monitoring; difficult to modify; doesn't scale
  • 4. New Relic. pros: nice graphs; application-level view; good error analysis. cons: slow to update; many false-positive alerts; high prices (better now)
  • 5. Motivating Change
  • 6. [Nagios]: as hideous as you remember
  • 7. "Horrendous interface" "Well, it's more old than anything else. At least everything is in the same place as you left it because it's been the same since 1912."
  • 8. "Sensu has so many moving parts that I wouldn't be able to sleep at night unless I set up a Nagios instance to make sure they were all running." -- @murphy_slaw (via @lozzd)
  • 9. HBase: monitor all the ports?!? hbck: the HBase consistency checker. nagios -> bash script -> parsing output of hbck
  • 10. adding alert after alert after...
  • 11.
  • 12. MMS (MongoDB Monitoring Service)
  • 13. cyber monday: 1988 called; wants its word back. the rewards of hubris: MMS showed the issue, but we weren't alerting on it; didn't understand the global write lock
  • 14. "If it moves, we track it. Sometimes we'll draw a graph of something that isn't moving yet, just in case it decides to make a run for it." -- @indec
  • 15. Graphite & StatsD. Graphite: Store and visualize time-series data. StatsD: Measure everything! (Timers, counters, events, ...)
  • 16. Where we were. Graphite 0.9.9 (wanted 0.9.12): over 2 years old; missing new features (Consolidate by!). StatsD: was newish, but hand-rolled; running in a screen session on a special snowflake box
  • 17. Community cookbooks? Graphite: ones good, but focus on Apache (we use nginx); we haven't moved to Chef 11 (gasp!). StatsD: launches daemons via upstart; generates config file based on attributes
  • 18. Graphite cookbook (Part 1): Install in a virtualenv (django, uwsgi, nginx). Dependencies: recommended web/blob/master/requirements.txt; libcairo2-dev package on Ubuntu 12.04 LTS. Install Graphite's 3 parts via pip
  • 19. Graphite cookbook (Part 2). graphite-web: Django app, renders graphs. whisper: fixed-size database for storing time-series data, like RRD. carbon: stores data (carbon-cache); buffers, then stores (carbon-aggregator); for sharding/replication (carbon-relay)
  • 20. when in doubt: tcpdump is your friend
  • 21. carbon-aggravator (between 0.9.10 & 0.9.12): # If set true, metric received will be forwarded to DESTINATIONS in addition to the output of the aggregation rules. If set false the carbon-aggregator will only ever send the output of aggregation. FORWARD_ALL = True
  • 22. Carbonate: backfill datapoints between whisper files
  • 23. 2am: sudden drop-off. 8am: look at graphs: ?!?! 10am: and we're back.
  • 24. What's next?
  • 25. the ideal monitoring solution...: finds real problems; actionable alerting; usable by all; ?
  • 26. What we're actually using now: StatsD Application-level error analysis Alarms for autoscaling Timers & counters Log & host-level Hadoop & HBase visualization MongoDB Graphs Time-series data graphing client-side plugins External uptime checks oncall rotation/alerting Threshold-based alarms Dashboard
  • 27. Discuss! Twitter: @bridgetkromhout Email:
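
The "measure everything" slide mentions StatsD timers and counters. As a rough sketch of what that looks like on the wire, the Python below builds StatsD's plain-text datagrams (`name:value|c` for a counter, `name:value|ms` for a timer) and sends them over UDP. The address, the metric names, and the helper names (`format_counter`, `format_timer`, `send`, `timed`) are illustrative assumptions, not anything from the talk.

```python
import socket
import time
from contextlib import contextmanager

# Assumption: a StatsD daemon listening on its default UDP port, locally.
STATSD_ADDR = ("127.0.0.1", 8125)


def format_counter(name, delta=1):
    """Build a StatsD counter datagram, e.g. 'checkout.orders:1|c'."""
    return f"{name}:{delta}|c"


def format_timer(name, millis):
    """Build a StatsD timer datagram, e.g. 'checkout.latency:250|ms'."""
    return f"{name}:{millis}|ms"


def send(payload, addr=STATSD_ADDR):
    """Fire-and-forget one datagram over UDP; no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("ascii"), addr)


@contextmanager
def timed(name, emit=send):
    """Time the enclosed block and emit the duration as a timer metric."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = int((time.monotonic() - start) * 1000)
        emit(format_timer(name, elapsed_ms))
```

Wrapping a code path in `with timed("checkout.latency"):` and calling `send(format_counter("checkout.orders"))` is roughly all an application needs to do; because the transport is UDP, a missing or slow StatsD daemon does not block the caller.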
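
The HBase slide describes a chain of nagios -> bash script -> parsing output of hbck. The talk's script was bash and is not shown; the following is a hypothetical Python equivalent, assuming hbck prints a final `Status: OK` or `Status: INCONSISTENT` line and that `hbase hbck` is the right command on the host. The exit codes 0 to 3 are the standard Nagios plugin convention; the function names and messages are made up for illustration.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(hbck_output):
    """Map hbck output to (exit_code, message).

    Assumption: hbck ends its report with a line such as 'Status: OK'
    or 'Status: INCONSISTENT'. The last Status line seen wins.
    """
    status = None
    for line in hbck_output.splitlines():
        line = line.strip()
        if line.startswith("Status:"):
            status = line.split(":", 1)[1].strip()
    if status == "OK":
        return OK, "HBCK OK"
    if status is None:
        return UNKNOWN, "HBCK UNKNOWN - no Status line in hbck output"
    return CRITICAL, f"HBCK CRITICAL - status {status}"


def check_hbase(cmd=("hbase", "hbck"), timeout=300):
    """Run hbck (command is an assumption) and classify what it printed."""
    try:
        result = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A check that cannot run should report UNKNOWN, not OK.
        return UNKNOWN, f"HBCK UNKNOWN - {exc}"
    return classify(result.stdout)
```

A wrapper script would print the message and call `sys.exit(code)` with the returned code, which is how Nagios reads the result. Treating a missing Status line as UNKNOWN rather than OK is a deliberate choice in this sketch, so a broken check does not look healthy.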
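
For the "carbon stores data" and "tcpdump is your friend" slides, it can help to know what a datapoint looks like in transit. A minimal sketch, assuming carbon's plaintext listener on its default TCP port 2003: each datapoint is one line of `metric.path value timestamp`. The host, metric name, and function names here are placeholders, not values from the talk.

```python
import socket
import time

# Assumption: carbon-cache (or a relay/aggregator) accepting the
# plaintext protocol on the default port, locally.
CARBON_ADDR = ("127.0.0.1", 2003)


def format_datapoint(path, value, timestamp=None):
    """Build one plaintext-protocol line: '<path> <value> <timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_datapoints(lines, addr=CARBON_ADDR, timeout=5):
    """Open one TCP connection and write all lines to carbon."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection(addr, timeout=timeout) as sock:
        sock.sendall(payload)
```

Because the format is plain text, lines like `ops.test.heartbeat 1 1400000000` are readable directly in a packet capture, which makes it easier to see whether a metric is reaching carbon at all before debugging aggregation rules.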