Anatomy of a real-life incident -Alex Solomon, CTO and Co-Founder of PagerDuty


Alex Solomon, CTO & Co-Founder @ PagerDuty

THIS IS A TRUE STORY

The events in this presentation took place in San Francisco and Toronto on January 6, 2017

In the interest of brevity, some details have been omitted

The Services

Web2Kafka Service: publishes change events from the web monolith to Kafka for other services to consume

Incident Log Entries Service: stores log entries for incidents

Docker

Mesos / Marathon

Linux Kernel

The People

Eric: Incident Commander

Peter: Scribe

Ken: Deputy

Luke: Communications Liaison

Major incident response principal roles

David: Core on-call

Cees: Core eng

Evan: SRE on-call

Renee: IM People on-call

Zayna: Mobile on-call

JD: IM Data on-call

Priyam: EM on-call

Subject Matter Experts (SMEs)

The Incident

[3:21 PM] David:SME

!ic page

Officer URL:Chat BOT

🚨 Paging Incident Commander(s)
✔ Eric has been paged.
✔ Ken has been paged.
✔ Peter has been paged.
Incident triggered in the following service: https://pd.pagerduty.com/services/PERDDFI

David:SME

web2kafka is down, and I'm not sure what's going on

kicked off the major incident process

[3:21 PM] Eric:IC

Taking IC

Eric took the IC role (he was IC primary on-call)

The Incident Commander

• The Wartime General: decision maker during a major incident

• GOAL: drive the incident to resolution quickly and effectively

• Gather data: ask subject matter experts to diagnose and debug various aspects of the system

• Listen: collect proposed repair actions from SMEs

• Decide: decide on a course of action

• Act (via delegation): once a decision is made, ask the team to act on it. IC should always delegate all diagnosis and repair actions to the rest of the team.

Priyam:SME

I’m here from EM

Evan:SME

lmk if you need SRE. sounds like IHM might be down too

Ken:DEPUTY

@renee, please join the call

[3:22 PM]

Ken took the deputy role

Other SMEs joined

The Deputy (backup IC)

• The Sidekick: right hand person for the IC

• Monitor the status of the incident

• Be prepared to page other people

• Provide regular updates to business and/or exec stakeholders

Peter:SCRIBE

I am now the scribe
Eric: Looking to find Mesos experts
Evan: Looking for logs & dashboards

Zayna:SME

seeing a steady rise in crashes in Android app around trigger incident log entries

[3:24 PM]

JD:SME

No ILEs will be generated due to LES not being able to query web2kafka

[3:25 PM]

Eric: David, what have you looked at?
David: trolling logs, see errors
David: tried restarting, doesn’t help

[3:23 PM] Ken:DEPUTY

Notifications are still going out, subject lines are filled in but not email bodies (they use ILEs)

Renee:SME

Peter becomes the scribe

Discussing customer-visible impact of the incident

Ken is both deputy and scribe

The Scribe

• The Record-keeper

• Add notes to the chatroom when findings are determined or significant actions are taken

• Add TODOs to the room that indicate follow-ups for later (generally after the incident)

• Monitor tasks assigned by the IC to other team members, remind the IC to follow-up
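The scribe duties listed above (notes, TODOs, and tracking delegated tasks) can be pictured as a small data structure. This is only an illustrative sketch: `ScribeLog`, `Task`, and their methods are hypothetical names, not part of any PagerDuty tool, and the 5-minute report-back window is an assumed value.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Task:
    owner: str
    action: str
    due: datetime  # when the owner is expected to report back


@dataclass
class ScribeLog:
    notes: list = field(default_factory=list)  # (timestamp, text) pairs
    todos: list = field(default_factory=list)  # follow-ups for after the incident
    tasks: list = field(default_factory=list)  # work delegated by the IC

    def note(self, when: datetime, text: str) -> None:
        self.notes.append((when, text))

    def todo(self, text: str) -> None:
        self.todos.append(text)

    def assign(self, when: datetime, owner: str, action: str, minutes: int) -> None:
        self.tasks.append(Task(owner, action, when + timedelta(minutes=minutes)))

    def overdue(self, now: datetime) -> list:
        """Tasks whose report-back time has passed, so the IC can be reminded."""
        return [t for t in self.tasks if t.due <= now]

    def timeline(self) -> list:
        """Notes in chronological order, e.g. for the post-mortem."""
        return sorted(self.notes)


log = ScribeLog()
start = datetime(2017, 1, 6, 15, 53)
log.note(start, "Proposed Action: David is going to configure marathon to allow more memory")
log.todo("Create a runbook for mesos to stop the world and start again")
log.assign(start, "Evan", "force reboot slave01", minutes=5)  # 5 minutes is an assumption
print(log.overdue(start + timedelta(minutes=6)))
```

Keeping timestamps on every note is what makes it possible to sort entries into a timeline afterwards, even when they were recorded out of order.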

Renee:SME

Can’t expand incident details

Luke:CUST LIAISON

suggested tweet: `There is currently an issue affecting the incident log entries component of our web application causing the application to display errors. We are actively investigating.`

[3:29 PM]

David: No ILEs can be created
Renee: no incident details, error msg in the UI

[3:27 PM] Peter:SCRIBE

Eric: Comms rep on the phone? Luke
Eric to Luke: Please compose a tweet

Peter:SCRIBE

Eric: What’s the customer impact?

[3:26 PM] Peter:SCRIBE

Luke to tweet

Peter:SCRIBE

IC asked the customer liaison to write a msg to customers

Msg was sent out to customers

The Communications Liaison

• The link to the customer

• Monitor customer and business impact

• Provide regular updates to customers (and/or to customer-facing folks in the business)

• (Optional) Provide regular updates to stakeholders

Cees:SME

I’m away from any laptops, just arrived at a pub for dinner.

[3:36 PM]

@cees Would you join us on the bridge? We have a few Mesos questions

Eric:IC

Evan: might need to kick new hardware if system is actually unreachable.
Evan: slave01 is reachable
David: slave02 is not reachable.
David: slave03 is not reachable.
David: only 3 slaves for mesos
Eric: We are down to only one host
Evan: Seeing some stuff. Memory exhaustion.


TODO: Create a runbook for mesos to stop the world and start again

Peter:SCRIBE

David added Cees to the incident
Eric: Is there a runbook for mesos?
David: Yes, but not for this issue.


Scribe captured a TODO to record & remember a follow-up that should happen after the incident is resolved

We paged a Mesos expert who is not on-call

The Mesos expert joined the chat

David: Only 3 slaves in that cluster, we have another cluster in us-west-1
Eric: Two options: kick more slaves or restart marathon

Evan: OOM killer has kicked in on slave01

Eric: Stop slaves in west2, startup web2kafka in west1
Evan: slave02 is alive!
Eric: Waiting 2 minutes

David: Consider bringing up another cluster?
Cees: Should be trivial

Eric to Evan: please reboot slave02 and slave03


Restart slaves first

Cees:SME

slave01 is now down

[3:42 PM] Evan:SME

They are considering bringing up another Mesos cluster in west1

slave02 is back up after reboot, so they hold off on flipping to west1

Noticed that oom-killer killed the docker process on slave01

Evan: Slave02 is quiet.
Evan: Slave02 is trying to start, exiting with code 137

Evan: Slave02 is quiet.
Evan: 137 means it’s being killed by OOM, OOM is killing docker containers continuously
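Evan's reading of exit code 137 follows the common shell and container convention that a status above 128 means "128 plus the number of the signal that terminated the process"; 137 is 128 + 9, and signal 9 is SIGKILL, which is what the Linux oom-killer sends. A minimal sketch of that decoding (the function name is hypothetical, and SIGKILL can also come from sources other than the oom-killer, so kernel logs are still needed to confirm):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit status using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited on its own with status {code}"


print(describe_exit_code(137))
```

On Linux this prints `terminated by SIGKILL`.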

Peter:SCRIBE

[3:53 PM] Proposed Action: David is going to configure marathon to allow more memory

Peter:SCRIBE

[3:54 PM] Proposed Action: Evan to force reboot slave01

Peter:SCRIBE

[3:56 PM]
David: Web2kafka appears to be running
Eric: Looks like all things are running
Renee: Things are fine with notifications
JD: LES is seeing progress

Peter:SCRIBE

[3:55 PM] Customer impact: there are 4 tickets so far and 2 customers chatting with us, which is another 2 tickets

Luke:CUST LIAISON

They realized the problem: oom-killer is killing the docker containers over and over

The resolution action was to redeploy web2kafka with a higher cgroup/Docker memory limit: 2 GB (vs 512 MB before)
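As a rough illustration of what such a redeploy changes: a Marathon app definition carries a `mem` field (in megabytes) that, for Docker tasks, ends up as the container's memory limit. The sketch below only builds the updated definition and does not call Marathon. The app id, image name, `cpus`, and `instances` values are assumptions; only the 512 and 2048 figures come from the incident.

```python
import copy
import json

# Hypothetical app definition; only the "mem" values reflect the incident.
app = {
    "id": "/web2kafka",
    "instances": 1,
    "cpus": 1.0,
    "mem": 512,  # memory per instance, in MB
    "container": {
        "type": "DOCKER",
        "docker": {"image": "example/web2kafka:latest"},  # placeholder image
    },
}


def with_memory_limit(app_def: dict, mem_mb: int) -> dict:
    """Return a copy of the app definition with a new memory limit."""
    updated = copy.deepcopy(app_def)
    updated["mem"] = mem_mb
    return updated


updated = with_memory_limit(app, 2048)  # 2 GB, as in the resolution
print(json.dumps(updated, indent=2))
```

Working on a copy keeps the previous definition around, which makes it easy to compare or roll back.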

The customer liaison provided an update on the customer impact

The system is recovering

The Punchline

• Root cause

• Increase in traffic caused web2kafka to increase its memory usage

• This caused the Linux oom-killer to kill the process

• Then, mesos / marathon immediately restarted it, it ramped up memory again, oom-killer killed it, and so on.

• After doing this restart-kill cycle multiple times, we hit a race-condition bug in the Linux kernel causing a kernel panic and killing the host

• Other services running on the host were impacted, notably LES
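The restart-kill cycle in the root cause can be illustrated with a toy model: if the process needs more memory than its limit and is restarted immediately with no backoff, every attempt ends in another kill, and raising the limit is what breaks the loop. The function and the 900 MB demand figure below are purely hypothetical; only the 512 MB and 2 GB limits come from the incident.

```python
def count_oom_kills(limit_mb: int, peak_demand_mb: int, attempts: int) -> int:
    """Count kills when a task is restarted immediately after each OOM kill."""
    kills = 0
    for _ in range(attempts):
        if peak_demand_mb <= limit_mb:
            break  # the task fits under the limit and keeps running
        kills += 1  # usage crossed the limit: killed, then restarted
    return kills


# Hypothetical demand of 900 MB under the old and new limits
print(count_oom_kills(limit_mb=512, peak_demand_mb=900, attempts=10))
print(count_oom_kills(limit_mb=2048, peak_demand_mb=900, attempts=10))
```

Under these assumptions the first call reports a kill on every attempt and the second reports none.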

Summary

• Incident Command

• The most important role, crucial to fast decision making and action!

• Takes practice and experience

• Deputy

• The right-hand person for the IC, can step in and take over Incident Command for long-running incidents

• Responsible for business & exec stakeholder communications, allowing the technical team to focus on incident resolution

• Scribe

• Essential for providing context in the chatroom and tracking follow-ups & action items (for example, the IC saying “Evan, do X, report back in 5 min”)

• Produces step-by-step documentation, which is very helpful for constructing the timeline later (in the post-mortem)

• Communications liaison

• Essential for tracking customer impact and communicating status to customers

The End

Alex Solomon

CTO & Co-Founder @ PagerDuty

[email protected]

The PagerDuty Incident Response process and training is open-source: https://response.pagerduty.com
