Webinar - Data driven postmortems - Jason Yee

Post on 16-Mar-2018

76 views 1 download

Transcript of Webinar - Data driven postmortems - Jason Yee

DATA-DRIVEN POSTMORTEMSJASON YEE, DATADOG @GITBISECT

“THE ONLY REAL MISTAKE IS THE ONE FROM WHICH WE LEARN NOTHING.”- Henry Ford

TW: @gitbisect @datadoghq

@gitbisect Technical Writer/Evangelist “Docs & Talks” Travel Hacker & Whiskey Hunter

@datadoghq SaaS-based monitoring Trillions of data points per day http://jobs.datadoghq.com

“The problems we work on at Datadog are hard and often don't have obvious, clean-cut solutions, so it's useful to cultivate your troubleshooting skills, no matter what role you work in.”

Internal Datadog Developer Guide

TW: @gitbisect @datadoghq

BLAMELESS POSTMORTEMS

TW: @gitbisect @datadoghq

DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES

WHAT IS DEVOPS? ▸ Culture

▸ Automation

▸Metrics

▸ Sharing

TW: @gitbisect @datadoghq

TW: @gitbisect @datadoghq

TW: @gitbisect @datadoghq

DAMON EDWARDS & JOHN WILLIS - DEVOPSDAY LOS ANGELES

OUR FOCUS AREA ▸ Culture

▸ Sharing

TW: @gitbisect @datadoghq

TW: @gitbisect @datadoghq

CULTURE & SHARING RESOURCES

BLAMELESS POSTMORTEMS▸Blameless Postmortems by John Allspaw

http://bit.ly/etsy-blameless

▸The Human Side of Postmortems by Dave Zwieback

http://bit.ly/human-postmortem

TW: @gitbisect @datadoghq

METRICSCULTURE & SHARING ARE GREAT, BUT WHAT ABOUT

TW: @gitbisect @datadoghq

TW: @gitbisect @datadoghq

COLLECTING DATA IS CHEAP; NOT HAVING IT WHEN YOU NEED IT CAN BE EXPENSIVE

SO INSTRUMENT ALL THE THINGS!

TW: @gitbisect @datadoghq

4 QUALITIES OF GOOD METRICSNOT ALL METRICS ARE EQUAL

TW: @gitbisect @datadoghq

1. WELL UNDERSTOOD

1. WELL UNDERSTOOD

1. WELL UNDERSTOOD

TW: @gitbisect @datadoghq

2. SUFFICIENT GRANULARITY

1 second Peak 46%

1 minute Peak 36%

5 minutes Peak 12%

1 second Peak 46%

1 minute Peak 36%

5 minutes Peak 12%

1 second Peak 46%

1 minute Peak 36%

5 minutes Peak 12%

3. TAGGED & FILTERABLE

TW: @gitbisect @datadoghq

Query Based Monitoring“What’s the average throughput of application:nginx per version ?”

“How many requests per second is my role:accounting-app running application:postgresql hosted in region:us-west-1 compared to region:us-east-1?”

TW: @gitbisect @datadoghq

4. LONG-LIVED

TW: @gitbisect @datadoghq

METRICS 101

HOW LONG?▸ AWS Cloudwatch: Up to 15months at 1h granularity

▸ MS Azure Monitoring Service: Up to 90d at 1d granularity

▸ Google Stackdriver: Up to 6 weeks at 1m granularity

▸ Datadog: Up to 15months at 1s granularity

TW: @gitbisect @datadoghq

TW: @gitbisect @datadoghq

TW: @gitbisect @datadoghq

TW: @gitbisect @datadoghq

TW: @gitbisect @datadoghq

P.S. - June 1! Mark your calendar!

RECURSE UNTIL YOU FIND THE TECHNICAL CAUSES

TW: @gitbisect @datadoghq

There is no singular “Root Cause”

HUMAN ELEMENTTECHNICAL ISSUES HAVE NON-TECHNICAL CAUSES

TW: @gitbisect @datadoghq

IF YOU’RE STILL RESPONDING TO THE INCIDENT, IT’S NOT TIME FOR A POSTMORTEM

TW: @gitbisect @datadoghq

HUMAN DATA

DATA COLLECTION: WHO?▸ Everyone!

▸ Responders

▸ Identifiers

▸ Affected Users

TW: @gitbisect @datadoghq

HUMAN DATA

DATA COLLECTION: WHAT?▸Their perspective

▸What they did

▸What they thought

▸Why they thought/did it

TW: @gitbisect @datadoghq

“WRITING IS NATURE’S WAY OF LETTING YOU KNOW HOW SLOPPY YOUR THINKING IS.”

RICHARD GUINDON

TW: @gitbisect @datadoghq

TELLING STORIES

“A PICTURE IS WORTH A THOUSAND WORDS” - ANCIENT PROVERB

TW: @gitbisect @datadoghq

HUMAN DATA

DATA COLLECTION: WHEN?▸ As soon as possible.

▸Memory drops sharply within 20 minutes

▸ Susceptibility to “false memory” increases

▸Get your project managers involved!

TW: @gitbisect @datadoghq

HUMAN DATA

DATA SKEW/CORRUPTION▸ Stress

▸ Sleep deprivation

▸ Burnout

TW: @gitbisect @datadoghq

HUMAN DATA

DATA SKEW/CORRUPTION▸ Blame

▸ Fear of punitive action

TW: @gitbisect @datadoghq

HUMAN DATA

DATA SKEW/CORRUPTION▸ Bias

▸ Anchoring

▸ Hindsight

▸Outcome

▸ Availability (Recency)

▸ Bandwagon Effect

TW: @gitbisect @datadoghq

HOW WE DO POSTMORTEMS AT DATADOG

TW: @gitbisect @datadoghq

DATADOG POSTMORTEMS

A FEW NOTES▸ Postmortems emailed to company wide

▸ Scheduled recurring postmortem meetings

TW: @gitbisect @datadoghq

DATADOG’S POSTMORTEM TEMPLATE (1/5)

SUMMARY: WHAT HAPPENED?▸Describe what happened here at a high-level --

think of it as an abstract in a scientific paper.

▸What was the impact on customers?

▸What was the severity of the outage?

▸What components were affected?

▸What ultimately resolved the outage?

TW: @gitbisect @datadoghq

TW: @gitbisect @datadoghq

TW: @gitbisect @datadoghq

DATADOG’S POSTMORTEM TEMPLATE (2/5)

HOW WAS THE OUTAGE DETECTED?▸We want to make sure we detected the issue

early and would catch the same issue if it were to repeat.

▸Did we have a metric that showed the outage?

▸Was there a monitor on that metric?

▸ How long did it take for us to declare an outage?

TW: @gitbisect @datadoghq

TW: @gitbisect @datadoghq

TW: @gitbisect @datadoghq

DATADOG’S POSTMORTEM TEMPLATE (3/5)

HOW DID WE RESPOND?▸Who was the incident owner & who else was

involved?

▸ Slack archive links and timeline of events!

▸What went well?

▸What didn’t go so well?

TW: @gitbisect @datadoghq

*Names changed

TW: @gitbisect @datadoghq

CHATOPS ARCHIVES FTW!

*Names changed

TW: @gitbisect @datadoghq

*Names changed

TRACK LEARNINGS AS YOU GO

TW: @gitbisect @datadoghq

DATADOG’S POSTMORTEM TEMPLATE (4/5)

WHY DID IT HAPPEN?▸Deep dive into the cause

▸ Examples from this incident:

▸ http://bit.ly/dd-statuspage

▸ http://bit.ly/alq-postmortem

TW: @gitbisect @datadoghq

DATADOG’S POSTMORTEM TEMPLATE (5/5)

HOW DO WE PREVENT IT IN THE FUTURE?▸ Link to Github issues and Trello cards

▸Now?

▸Next?

▸ Later?

▸ Follow up notes

TW: @gitbisect @datadoghq

*Names changed

TW: @gitbisect @datadoghq

DATADOG’S POSTMORTEM TEMPLATE

RECAP:▸What happened (summary)?

▸ How did we detect it?

▸ How did we respond?

▸Why did it happen (deep dive)?

▸ Actionable next steps!

TW: @gitbisect @datadoghq

KEEP LEARNING

MORE RESOURCES▸ Postmortem Template

http://bit.ly/postmortem-template

▸ Post-Incident Reviews by Jason Hand http://bit.ly/post-incident-review

TW: @gitbisect @datadoghq

QUESTIONS?LET’S TALK!@GITBISECT

@DATADOGHQ

SLIDES: bit.ly/cm-postmortems