Post on 16-Mar-2018
DATA-DRIVEN POSTMORTEMSJASON YEE, DATADOG @GITBISECT
“THE ONLY REAL MISTAKE IS THE ONE FROM WHICH WE LEARN NOTHING.”- Henry Ford
TW: @gitbisect @datadoghq
@gitbisect Technical Writer/Evangelist “Docs & Talks” Travel Hacker & Whiskey Hunter
@datadoghq SaaS-based monitoring Trillions of data points per day http://jobs.datadoghq.com
“The problems we work on at Datadog are hard and often don't have obvious, clean-cut solutions, so it's useful to cultivate your troubleshooting skills, no matter what role you work in.”
Internal Datadog Developer Guide
TW: @gitbisect @datadoghq
BLAMELESS POSTMORTEMS
TW: @gitbisect @datadoghq
DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES
WHAT IS DEVOPS? ▸ Culture
▸ Automation
▸Metrics
▸ Sharing
TW: @gitbisect @datadoghq
TW: @gitbisect @datadoghq
TW: @gitbisect @datadoghq
DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES
OUR FOCUS AREA ▸ Culture
▸ Sharing
TW: @gitbisect @datadoghq
TW: @gitbisect @datadoghq
CULTURE & SHARING RESOURCES
BLAMELESS POSTMORTEMS▸Blameless Postmortems by John Allspaw
http://bit.ly/etsy-blameless
▸The Human Side of Postmortems by Dave Zwieback
http://bit.ly/human-postmortem
TW: @gitbisect @datadoghq
METRICSCULTURE & SHARING ARE GREAT, BUT WHAT ABOUT
TW: @gitbisect @datadoghq
TW: @gitbisect @datadoghq
COLLECTING DATA IS CHEAP; NOT HAVING IT WHEN YOU NEED IT CAN BE EXPENSIVE
SO INSTRUMENT ALL THE THINGS!
TW: @gitbisect @datadoghq
4 QUALITIES OF GOOD METRICSNOT ALL METRICS ARE EQUAL
TW: @gitbisect @datadoghq
1. WELL UNDERSTOOD
1. WELL UNDERSTOOD
1. WELL UNDERSTOOD
TW: @gitbisect @datadoghq
2. SUFFICIENT GRANULARITY
1 second Peak 46%
1 minute Peak 36%
5 minutes Peak 12%
1 second Peak 46%
1 minute Peak 36%
5 minutes Peak 12%
1 second Peak 46%
1 minute Peak 36%
5 minutes Peak 12%
3. TAGGED & FILTERABLE
TW: @gitbisect @datadoghq
Query Based Monitoring“What’s the average throughput of application:nginx per version ?”
“How many requests per second is my role:accounting-app running application:postgresql hosted in region:us-west-1 compared to region:us-east-1?”
TW: @gitbisect @datadoghq
4. LONG-LIVED
TW: @gitbisect @datadoghq
METRICS 101
HOW LONG?▸ AWS Cloudwatch: Up to 15months at 1h granularity
▸ MS Azure Monitoring Service: Up to 90d at 1d granularity
▸ Google Stackdriver: Up to 6 weeks at 1m granularity
▸ Datadog: Up to 15months at 1s granularity
TW: @gitbisect @datadoghq
TW: @gitbisect @datadoghq
TW: @gitbisect @datadoghq
TW: @gitbisect @datadoghq
TW: @gitbisect @datadoghq
P.S. - June 1! Mark your calendar!
RECURSE UNTIL YOU FIND THE TECHNICAL CAUSES
TW: @gitbisect @datadoghq
There is no singular “Root Cause”
HUMAN ELEMENTTECHNICAL ISSUES HAVE NON-TECHNICAL CAUSES
TW: @gitbisect @datadoghq
IF YOU’RE STILL RESPONDING TO THE INCIDENT, IT’S NOT TIME FOR A POSTMORTEM
TW: @gitbisect @datadoghq
HUMAN DATA
DATA COLLECTION: WHO?▸ Everyone!
▸ Responders
▸ Identifiers
▸ Affected Users
TW: @gitbisect @datadoghq
HUMAN DATA
DATA COLLECTION: WHAT?▸Their perspective
▸What they did
▸What they thought
▸Why they thought/did it
TW: @gitbisect @datadoghq
“WRITING IS NATURE’S WAY OF LETTING YOU KNOW HOW SLOPPY YOUR THINKING IS.”
RICHARD GUINDON
TW: @gitbisect @datadoghq
TELLING STORIES
“A PICTURE IS WORTH A THOUSAND WORDS” - ANCIENT PROVERB
TW: @gitbisect @datadoghq
HUMAN DATA
DATA COLLECTION: WHEN?▸ As soon as possible.
▸Memory drops sharply within 20 minutes
▸ Susceptibility to “false memory” increases
▸Get your project managers involved!
TW: @gitbisect @datadoghq
HUMAN DATA
DATA SKEW/CORRUPTION▸ Stress
▸ Sleep deprivation
▸ Burnout
TW: @gitbisect @datadoghq
HUMAN DATA
DATA SKEW/CORRUPTION▸ Blame
▸ Fear of punitive action
TW: @gitbisect @datadoghq
HUMAN DATA
DATA SKEW/CORRUPTION▸ Bias
▸ Anchoring
▸ Hindsight
▸Outcome
▸ Availability (Recency)
▸ Bandwagon Effect
TW: @gitbisect @datadoghq
HOW WE DO POSTMORTEMS AT DATADOG
TW: @gitbisect @datadoghq
DATADOG POSTMORTEMS
A FEW NOTES▸ Postmortems emailed to company wide
▸ Scheduled recurring postmortem meetings
TW: @gitbisect @datadoghq
DATADOG’S POSTMORTEM TEMPLATE (1/5)
SUMMARY: WHAT HAPPENED?▸Describe what happened here at a high-level --
think of it as an abstract in a scientific paper.
▸What was the impact on customers?
▸What was the severity of the outage?
▸What components were affected?
▸What ultimately resolved the outage?
TW: @gitbisect @datadoghq
TW: @gitbisect @datadoghq
TW: @gitbisect @datadoghq
DATADOG’S POSTMORTEM TEMPLATE (2/5)
HOW WAS THE OUTAGE DETECTED?▸We want to make sure we detected the issue
early and would catch the same issue if it were to repeat.
▸Did we have a metric that showed the outage?
▸Was there a monitor on that metric?
▸ How long did it take for us to declare an outage?
TW: @gitbisect @datadoghq
TW: @gitbisect @datadoghq
TW: @gitbisect @datadoghq
DATADOG’S POSTMORTEM TEMPLATE (3/5)
HOW DID WE RESPOND?▸Who was the incident owner & who else was
involved?
▸ Slack archive links and timeline of events!
▸What went well?
▸What didn’t go so well?
TW: @gitbisect @datadoghq
*Names changed
TW: @gitbisect @datadoghq
CHATOPS ARCHIVES FTW!
*Names changed
TW: @gitbisect @datadoghq
*Names changed
TRACK LEARNINGS AS YOU GO
TW: @gitbisect @datadoghq
DATADOG’S POSTMORTEM TEMPLATE (4/5)
WHY DID IT HAPPEN?▸Deep dive into the cause
▸ Examples from this incident:
▸ http://bit.ly/dd-statuspage
▸ http://bit.ly/alq-postmortem
TW: @gitbisect @datadoghq
DATADOG’S POSTMORTEM TEMPLATE (5/5)
HOW DO WE PREVENT IT IN THE FUTURE?▸ Link to Github issues and Trello cards
▸Now?
▸Next?
▸ Later?
▸ Follow up notes
TW: @gitbisect @datadoghq
*Names changed
TW: @gitbisect @datadoghq
DATADOG’S POSTMORTEM TEMPLATE
RECAP:▸What happened (summary)?
▸ How did we detect it?
▸ How did we respond?
▸Why did it happen (deep dive)?
▸ Actionable next steps!
TW: @gitbisect @datadoghq
KEEP LEARNING
MORE RESOURCES▸ Postmortem Template
http://bit.ly/postmortem-template
▸ Post-Incident Reviews by Jason Hand http://bit.ly/post-incident-review
TW: @gitbisect @datadoghq
QUESTIONS?LET’S TALK!@GITBISECT
@DATADOGHQ