Why Visibility into Your Stack Matters
Why visibility into your stack matters
or, Do you see it all?
Mike Fiedler
Operations
Datadog.com
Twitter: @mikefiedler
GitHub: @miketheman
OpsSchool.org
Chef Community
Roller Derby Referee
Skydiver
©Alex Erde
–CEO calling your cellphone at 03:00
“The site is slow.”
What?
• typical monitoring implementation story
• an alternative approach
(CC BY 2.0) http://www.gotcredit.com/ https://flic.kr/p/6439SA
[Diagram labels: User, LB, Web, Data]
(CC BY 2.0) www.futurealpha.com https://flic.kr/p/8PhF4g
(CC BY 2.0) Aristocrats-hat https://flic.kr/p/6qdTC1
–Attributed to W. Edwards Deming, as quoted in The Elements of Statistical Learning
“In God we trust; all others bring data.”
You want more?
• graphite
• ganglia
• mongodb
• mysql
• influxdb
• socket.io
• datadog
• …
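from bottle import route
import pymongo
import json
# Assumption: `default` is the BSON-aware serializer from bson.json_util.
from bson.json_util import default

db = pymongo.Connection('mongodb://...

@route('/insert/:name')
def insert(name):
    doc = {'name': name}
    # Upsert: create the word if it is missing, otherwise bump its count.
    db.words.update(doc, {"$inc": {"count": 1}}, upsert=True)
    return json.dumps(doc, default=default)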
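from bottle import route
import pymongo
import json
from statsd import statsd
# Assumption: `default` is the BSON-aware serializer from bson.json_util.
from bson.json_util import default

db = pymongo.Connection('mongodb://...

@route('/insert/:name')
def insert(name):
    # The one added step: count every call to this endpoint.
    # increment() is a plain call, not a decorator, so it belongs in the body.
    statsd.increment('wordcount.insert')
    doc = {'name': name}
    db.words.update(doc, {"$inc": {"count": 1}}, upsert=True)
    return json.dumps(doc, default=default)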
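The same idea, counting every call to a handler, can be sketched with no external service at all. This is only an illustration: the `counted` decorator, the in-memory `COUNTS` store, and the `statsd_line` helper are hypothetical names, not part of the statsd client used above. `statsd_line` renders a counter update in the StatsD plain-text wire format (`name:value|c`).

```python
import functools
from collections import Counter

# Hypothetical in-memory store standing in for a metrics backend.
COUNTS = Counter()


def counted(metric):
    """Return a decorator that increments `metric` on every call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            COUNTS[metric] += 1
            return func(*args, **kwargs)
        return wrapper
    return decorator


def statsd_line(metric, value=1):
    """Format a counter update in the StatsD wire format."""
    return f"{metric}:{value}|c"


@counted("wordcount.insert")
def insert(name):
    # Stand-in for the database upsert in the handler above.
    return {"name": name}
```

Wrapping the handler keeps the counting separate from the business logic, at the cost of one more layer to read through when debugging.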
Time is a Cruel Master
(CC BY-SA 2.0) https://www.flickr.com/theilr/ https://flic.kr/p/8MC5YM
Have
• systems
• applications
• services
• developers
• operators
• customers
Polyglot Platforms
Complex Systems
Disparate Locations
Information Overload
–CEO calling your cellphone at 03:00
“The site is slow.”
(CC BY 2.0) www.futurealpha.com https://flic.kr/p/8PhF4g
Does this matter?
Top-down
• work metrics
• resource metrics
• events
Work Metrics
throughput (rps), success/error, performance (latency)
Resource Metrics
utilization (%busy), saturation (queued), errors, availability
Events
change/build/deploy, alerts, anything notable
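As a rough illustration of what the work metrics amount to in practice, the sketch below summarizes one time window of requests into throughput, error rate, and a latency percentile. The `Request` record and the `work_metrics` function are hypothetical names chosen for this example, and the nearest-rank p95 is just one common way to summarize latency.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def work_metrics(requests, window_seconds):
    """Summarize throughput, error rate, and p95 latency for one window."""
    total = len(requests)
    if total == 0:
        return {"rps": 0.0, "error_rate": 0.0, "p95_ms": None}
    errors = sum(1 for r in requests if not r.ok)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank percentile: the smallest sample that has at least
    # 95% of all samples at or below it.
    rank = math.ceil(0.95 * total)
    return {
        "rps": total / window_seconds,
        "error_rate": errors / total,
        "p95_ms": latencies[rank - 1],
    }
```

Resource metrics (utilization, saturation) would be gathered per host or per component rather than per request, so they are left out of this sketch.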
Trend resource metrics,
notify on changes
Wake people up when
work metrics go awry
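One way to read the two rules above is as a routing policy: a work metric out of bounds pages someone, while a resource metric out of bounds only sends a notification. The sketch below assumes a simple "value above threshold is bad" check; `route_alert` and the `"page"` / `"notify"` labels are hypothetical names for illustration.

```python
def route_alert(metric_kind, value, threshold):
    """Decide how loudly to react to a metric crossing its threshold.

    Returns "page" for work metrics, "notify" for resource metrics,
    and None when the value is within bounds.
    """
    if metric_kind not in ("work", "resource"):
        raise ValueError(f"unknown metric kind: {metric_kind!r}")
    if value <= threshold:
        return None
    return "page" if metric_kind == "work" else "notify"
```

For example, `route_alert("work", 0.12, 0.05)` (a 12% error rate against a 5% limit) would page, while `route_alert("resource", 93, 90)` (disk 93% full against a 90% limit) would only notify.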
Slice and Dice
exploration and aggregation
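Slicing and dicing usually depends on metric points carrying tags, so the same data can be regrouped along any dimension after the fact. A minimal sketch, assuming points are plain dicts with a `value` and a `tags` mapping (the `aggregate_by` function and the sample data are hypothetical):

```python
from collections import defaultdict


def aggregate_by(points, tag_key, reducer=sum):
    """Group metric points by one tag and reduce each group's values."""
    groups = defaultdict(list)
    for point in points:
        # Points missing the tag are collected under "untagged".
        key = point["tags"].get(tag_key, "untagged")
        groups[key].append(point["value"])
    return {tag: reducer(values) for tag, values in groups.items()}


points = [
    {"value": 120, "tags": {"host": "web1", "region": "us-east"}},
    {"value": 80, "tags": {"host": "web2", "region": "us-east"}},
    {"value": 50, "tags": {"host": "web3", "region": "eu-west"}},
]

by_region = aggregate_by(points, "region")
by_host_max = aggregate_by(points, "host", reducer=max)
```

Passing a different `reducer` (for instance `max` or `min`) switches the aggregation without changing how the points are grouped.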
Set-and-Forget
Just-In-Time
Information
Does it scale?
Customer Stats
• AdRoll, ~2m transactions/second
• SimpleReach, ~7b measurements/day
• MercadoLibre, ~18k hosts monitored
• AirBnB, 3000+ monitors defined
–CEO calling your cellphone at 03:00
“The site is slow.”
–You
“Thanks. We know, and are already investigating.”
–You, because you never got that call in the first place due to
proactive data collection and alerting.
“[silence]”
Questions?
–M. Fiedler, Twitter: @mikefiedler
“If you don’t measure, you won’t know.”