Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

Page 1: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

Anatomy of a real-life incident

Alex Solomon, CTO & Co-Founder @ PagerDuty

Page 2: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

THIS IS A TRUE STORY

The events in this presentation took place in San Francisco and Toronto on January 6, 2017

In the interest of brevity, some details have been omitted

Page 3: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

The Services

• Web2Kafka Service: publishes change events from the web monolith to Kafka for other services to consume

• Incident Log Entries Service: stores log entries for incidents

• Underlying stack: Docker, Mesos / Marathon, Linux kernel

Page 4: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

The People

Major incident response principal roles

• Eric: Incident Commander

• Peter: Scribe

• Ken: Deputy

• Luke: Communications Liaison

Subject Matter Experts (SMEs)

• David: Core on-call

• Cees: Core eng

• Evan: SRE on-call

• Renee: IM People on-call

• Zayna: Mobile on-call

• JD: IM Data on-call

• Priyam: EM on-call

Page 5: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

The Incident

Page 6: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

[3:21 PM] David:SME

!ic page

Officer URL:Chat BOT

🚨 Paging Incident Commander(s)
✔ Eric has been paged.
✔ Ken has been paged.
✔ Peter has been paged.
Incident triggered in the following service: https://pd.pagerduty.com/services/PERDDFI

David:SME

web2kafka is down, and I'm not sure what's going on

David kicked off the major incident process

[3:21 PM] Eric:IC

Taking IC

Eric took the IC role (he was IC primary on-call)

Page 7: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

The Incident Commander

• The Wartime General: decision maker during a major incident

• GOAL: drive the incident to resolution quickly and effectively

• Gather data: ask subject matter experts to diagnose and debug various aspects of the system

• Listen: collect proposed repair actions from SMEs

• Decide: decide on a course of action

• Act (via delegation): once a decision is made, ask the team to act on it. IC should always delegate all diagnosis and repair actions to the rest of the team.

Page 8: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

Priyam:SME

I’m here from EM

Evan:SME

lmk if you need SRE
sounds like IHM might be down too

Ken:DEPUTY

@renee, please join the call

[3:22 PM]

Ken took the deputy role

Other SMEs joined

Page 9: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

The Deputy (backup IC)

• The Sidekick: right hand person for the IC

• Monitor the status of the incident

• Be prepared to page other people

• Provide regular updates to business and/or exec stakeholders

Page 10: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

Peter:SCRIBE

I am now the scribe
Eric: Looking to find Mesos experts
Evan: Looking for logs & dashboards

Zayna:SME

seeing a steady rise in crashes in Android app around trigger incident log entries

[3:24 PM]

JD:SME

No ILEs will be generated due to LES not being able to query web2kafka

[3:25 PM]

Eric: David, what have you looked at?
David: trolling logs, see errors
David: tried restarting, doesn’t help

[3:23 PM] Ken:DEPUTY

Notifications are still going out, subject lines are filled in but not email bodies (they use ILEs)

Renee:SME

Peter becomes the scribe

Discussing customer-visible impact of the incident

Ken is both deputy and scribe

Page 11: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

The Scribe

• The Record-keeper

• Add notes to the chatroom when findings are determined or significant actions are taken

• Add TODOs to the room that indicate follow-ups for later (generally after the incident)

• Monitor tasks assigned by the IC to other team members, remind the IC to follow-up

Page 12: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

Renee:SME

Can’t expand incident details

Luke:CUST LIAISON

suggested tweet: `There is currently an issue affecting the incident log entries component of our web application causing the application to display errors. We are actively investigating.`

[3:29 PM]

David: No ILEs can be created
Renee: no incident details, error msg in the UI

[3:27 PM] Peter:SCRIBE

Eric: Comms rep on the phone? Luke
Eric to Luke: Please compose a tweet

Peter:SCRIBE

Eric: What’s the customer impact?

[3:26 PM] Peter:SCRIBE

Luke to tweet

Peter:SCRIBE

IC asked the customer liaison to write a msg to customers

Msg was sent out to customers

Page 13: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

The Communications Liaison

• The link to the customer

• Monitor customer and business impact

• Provide regular updates to customers (and/or to customer-facing folks in the business)

• (Optional) Provide regular updates to stakeholders

Page 14: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

Cees:SME

I’m away from any laptops, just arrived at a pub for dinner.

[3:36 PM]

@cees Would you join us on the bridge? We have a few Mesos questions

Eric:IC

Evan: might need to kick new hardware if system is actually unreachable.
Evan: slave01 is reachable
David: slave02 is not reachable.
David: slave03 is not reachable.
David: only 3 slaves for mesos
Eric: We are down to only one host
Evan: Seeing some stuff. Memory exhaustion.

[3:37 PM] Peter:SCRIBE

TODO: Create a runbook for mesos to stop the world and start again

Peter:SCRIBE

David added Cees to the incident Eric: Is there a runbook for mesos? David: Yes, but not for this issue.

[3:34 PM] Peter:SCRIBE

Scribe captured a TODO to record & remember a follow-up that should happen after the incident is resolved

We paged a Mesos expert who is not on-call

The Mesos expert joined the chat

Page 15: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

David: Only 3 slaves in that cluster, we have another cluster in us-west-1
Eric: Two options: kick more slaves or restart marathon

[3:38 PM] Peter:SCRIBE

Evan: OOM killer has kicked in on slave01

[3:39 PM] Peter:SCRIBE

Eric: Stop slaves in west2, startup web2kafka in west1
Evan: slave02 is alive!
Eric: Waiting 2 minutes

[3:47 PM] Peter:SCRIBE

David: Consider bringing up another cluster? Cees: Should be trivial

[3:44 PM] Peter:SCRIBE

Eric to evan: please reboot slave02 and slave03

[3:41 PM] Peter:SCRIBE

Restart slaves first

Cees:SME

slave01 is now down

[3:42 PM] Evan:SME

They are considering bringing up another Mesos cluster in west1

slave02 is back up after reboot, so they hold off on flipping to west1

Noticed that oom-killer killed the docker process on slave01

Page 16: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

Evan: Slave02 is quiet.
Evan: Slave02 is trying to start, exiting with code 137

[3:49 PM] Peter:SCRIBE

Evan: Slave02 is quiet.
Evan: 137 means it’s being killed by OOM, OOM is killing docker containers continuously

Peter:SCRIBE

[3:53 PM] Proposed Action: David is going to configure marathon to allow more memory

Peter:SCRIBE

[3:54 PM] Proposed Action: Evan to force reboot slave01

Peter:SCRIBE

[3:56 PM] David: Web2kafka appears to be running
Eric: Looks like all things are running
Renee: Things are fine with notifications
JD: LES is seeing progress

Peter:SCRIBE

[3:55 PM] Customer impact: there are 4 tickets so far and 2 customers chatting with us, which is another 2 tickets

Luke:CUST LIAISON

They realized the problem: oom-killer is killing the docker containers over and over
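
The exit code 137 that Evan reported follows the common shell convention of 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL), which is what the oom-killer sends. A minimal sketch of that decoding, assuming a POSIX host; the helper name `describe_exit_code` is hypothetical and not part of the talk:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit status using the shell convention 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited normally with status {code}"


# 137 - 128 = 9, which is SIGKILL on Linux.
print(describe_exit_code(137))
```

SIGKILL alone does not prove an out-of-memory kill, since anything can send it. Docker records an `OOMKilled` flag in the container state shown by `docker inspect`, and the kernel log shows oom-killer activity, which is how the team confirmed it here.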

The resolution action was to redeploy web2kafka with a higher cgroup/Docker memory limit: 2GB (vs 512 MB before)
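
The talk does not show the actual deployment config, so the following is only a sketch of what such a change could look like as a Marathon app definition, where `mem` is expressed in MiB. The app id, CPU share, instance count, and image name are hypothetical placeholders; only the 512 and 2048 values come from the slide.

```python
import json


def web2kafka_app(mem_mb: int) -> dict:
    """Build a Marathon-style app definition; only `mem` differs between deploys."""
    return {
        "id": "/web2kafka",   # hypothetical app id
        "instances": 1,       # hypothetical
        "cpus": 0.5,          # hypothetical
        "mem": mem_mb,        # memory limit in MiB
        "container": {
            "type": "DOCKER",
            # Hypothetical image name, for illustration only.
            "docker": {"image": "example/web2kafka:latest"},
        },
    }


before = web2kafka_app(512)   # limit before the incident
after = web2kafka_app(2048)   # limit after the fix (2 GB)

print(json.dumps(after, indent=2))
```

Keeping the definition in one function makes the fix a one-argument change, which is easy to review during an incident.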

The customer liaison provided an update on the customer impact

The system is recovering

Page 17: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

The Punchline

• Root cause

• Increase in traffic caused web2kafka to increase its memory usage

• This caused the Linux oom-killer to kill the process

• Then, mesos / marathon immediately restarted it, it ramped up memory again, oom-killer killed it, and so on.

• After doing this restart-kill cycle multiple times, we hit a race-condition bug in the Linux kernel causing a kernel panic and killing the host

• Other services running on the host were impacted, notably LES
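
The restart-kill cycle described above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not a model of the real service: the peak demand of 1024 MB, the growth rate, and the number of time steps are hypothetical values chosen only to show why a 512 MB limit loops forever while a 2048 MB limit does not.

```python
def simulate(limit_mb: int, peak_mb: int, growth_mb: int = 64, ticks: int = 60) -> int:
    """Count OOM kills over `ticks` time steps for a process that ramps up to `peak_mb`."""
    kills = 0
    used = 0
    for _ in range(ticks):
        # Memory ramps toward what the traffic demands.
        used = min(used + growth_mb, peak_mb)
        if used > limit_mb:
            kills += 1  # oom-killer terminates the process...
            used = 0    # ...and the scheduler restarts it immediately
    return kills


# Hypothetical demand of 1024 MB against the old and new limits.
print("kills with 512 MB limit:", simulate(limit_mb=512, peak_mb=1024))
print("kills with 2048 MB limit:", simulate(limit_mb=2048, peak_mb=1024))
```

With the lower limit the process is killed every time it ramps past 512 MB, and each kill is one more chance to hit the kernel race condition mentioned above. With the higher limit the demand fits and the loop never starts.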

Page 18: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

Summary

• Incident Command

• The most important role, crucial to fast decision making and action!

• Takes practice and experience

• Deputy

• The right-hand person for the IC, can step in and take over Incident Command for long-running incidents

• Responsible for business & exec stakeholder communications, allowing the technical team to focus on incident resolution

• Scribe

• Essential for providing context in the chatroom and tracking follow-ups & action items (for example, the IC saying “Evan, do X, report back in 5 min”)

• Produces step-by-step documentation, which is very helpful for constructing the timeline later (in the post-mortem)

• Communications liaison

• Essential for tracking customer impact and communicating status to customers

Page 19: Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty

The End

Alex Solomon

CTO & Co-Founder @ PagerDuty

The PagerDuty Incident Response process and training is open-source: https://response.pagerduty.com