Root Cause Analysis - skendric.com


2013-11-05 Root Cause Analysis Beginner | LISA 2013 | Stuart Kendrick

Root Cause Analysis – Beginner


A Hands-on Tutorial

Your Pre-Flight Check List

1. Write your first name on the card stock, display prominently

2. Locate the courseware on the USB stick

3. Grab the latest version of the slide deck, dated 2013-11-05

http://www.skendric.com/seminar/rca/Root-Cause-Analysis-Beginner-Deck.pdf

4. Configure Wireshark columns (see p.5 of this presentation)

5. Introduce yourself to your potential teammates: figure out who will play which roles

6. Examine the diagrams on the walls

Copyright Stuart Kendrick ©2013 All Rights Reserved

Workshop Outline

• Introduction
• Example Case
• Split into Small Groups
• Case Studies
o Remote Office Bumps
o Many Applications Crash
• Tips & Tools
• Wrap-up

Introduction

Mechanics

Me and My Biases

What is Root Cause Analysis?

How Does This Class Work?

Recommendations

Mechanics

We use Google Docs … you don’t need an account: I will provide links

Ask questions whenever you want

9:00 – 10:30 Class
10:30 – 11:00 Break
11:00 – 12:30 Class
12:30 – 13:30 Lunch
13:30 – 15:00 Class
15:00 – 15:30 Break
15:30 – 16:30 Class
16:30 – 17:00 Wrap-up

Your Laptop
• has Internet connectivity
• can display & search PDF, PNG, TXT, XLS
• Wireshark configured per next slide

Configure Wireshark Columns

• Use a recent version of Wireshark … 1.10.0 at a minimum – I recommend the latest and greatest
• If you are an experienced Wireshark user, feel free to ignore this and use your favorite column choices
• If you are really experienced and prefer a different analyzer, feel free to use it

You really want Delta time displayed

And Custom (tcp.stream) will be helpful
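The slide configures these two columns in the Wireshark GUI. As a rough command-line equivalent, the sketch below builds a tshark invocation that prints the same two fields for every packet. It assumes tshark (the command-line tool that ships with Wireshark) is installed; the extra fields (frame.number, ip.src, ip.dst), the function name, and the file name are illustrative choices, not part of the slide.

```python
from typing import List


def tshark_columns_command(capture_path: str) -> List[str]:
    """Build a tshark argument list that prints delta time and TCP stream per packet."""
    fields = [
        "frame.number",
        "frame.time_delta_displayed",  # time since the previous displayed packet
        "tcp.stream",                  # index of the TCP conversation
        "ip.src",
        "ip.dst",
    ]
    # -r reads a capture file, -T fields selects field output,
    # -E header=y prints a header row, -e adds one field per column.
    cmd = ["tshark", "-r", capture_path, "-T", "fields", "-E", "header=y"]
    for field in fields:
        cmd += ["-e", field]
    return cmd


if __name__ == "__main__":
    # "example.pcap" is a hypothetical file name, not one from the courseware.
    print(" ".join(tshark_columns_command("example.pcap")))
```

The function only assembles the command; pass the list to subprocess.run if you want to execute it against one of the traces on the USB stick.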

Me

Multi-disciplinary IT trouble-shooter / Root Cause Analysis
http://www.skendric.com

sbk@cornella | student | 1981
stuart@cpvax5 (Science Applications Inc) | programmer
[email protected] | desktop / server
[email protected] | server / network
[email protected] | multidisciplinary | 1993
stuart.kendrick {at} isi lon dot com | sustaining engineer | 2014

IT Architect | ITIL Problem Manager | Problem Analyst | Device Monitoring | Transport

Geeky Highlights
PL/1 on IBM mainframes | Cornell University | Ithaca | 1981
FORTRAN on CRAY-1 | SAIC | San Diego | 1984
Terak, DisplayWriter, IBM PC, Macintosh | Cornell University | Ithaca | 1985
Netware, Corvus Omninet, TCP-IP / IPX / AppleTalk | Cornell University | Ithaca | 1988
AppleShare, QuickMail, Farallon, NRC, Cisco, Sniffers | Cornell Medical College | Manhattan | 1991
Solaris, Windows, Linux, Perl, SNMP, Wireshark, Cisco, Fluke | FHCRC | Seattle | 1993
OneFS | EMC Isilon | Seattle | 2013

You

You are a mid-level engineer
• Perhaps you function as a sys admin, network engineer, database admin, or developer
• Perhaps you support desktops and want to expand into another space
• Perhaps you work for a small outfit and are a jack-of-all-trades

You look at logs regularly when tackling a problem, perhaps you’ve even looked at packet traces, though without nearly as much success as you would like. You’re curious about how things work and you’re persistent: you beat your head against a problem, trying to solve it from various angles.

You are here because you want a chance to tackle problems on your own and then receive coaching on techniques for analyzing packet traces, extracting insights from performance charts, correlating log entries from multiple devices.

Or … perhaps you are a people or process person – resource manager, project manager, ITIL Problem Manager. You don’t have the skills to analyze bits & bytes, but you want to practice a problem solving methodology. You’ll help keep your team on track, coordinating subject matter experts, bringing the results together for reports to the larger class.

Caveats

• I do not claim to be good at trouble-shooting
• I do not claim to know how to teach trouble-shooting
• I am not the smartest or fastest guy on the block

However …

• I have ~30 years of experience in this business
• I have trained under gurus
• I have accumulated a grab bag of tips which you may find useful
• I have converted real-world events into these case studies
• The result is a set of puzzle-solving labs which I predict you’ll enjoy

After all, it is more fun to trouble-shoot someone else’s issues …

My World View

I have made a ceaseless effort not to ridicule, not to bewail, not to scorn human actions, but to understand them.
--Baruch Spinoza

Anything worth doing is worth doing badly.
--Marshall Rosenberg

The first principle is that you must not fool yourself -- and you are the easiest person to fool.
--Richard Feynman

Doubt is uncomfortable; certainty is absurd.
--Voltaire

The goal of education is to make up for the shortcomings in our instinctive ways of thinking about the physical and social world.
--Steven Pinker

Confidence & Knowledge

[Chart. Axes: Confidence (Doubt to Certainty) and Knowledge (Little to Lots). Labels: Newbie, Jedi]

Ignorance more frequently begets confidence than does knowledge. --Charles Darwin

Music to My Ears

As I age, I increasingly value the following from myself and my colleagues:

• I don’t know
• I made a mistake
• Here’s how I will clean up the mess I made

I predict that you will follow many blind avenues during RCAs … I wish you success in keeping shoshin, aka, beginner’s mind, as you wander along your path …

My Biases

Science is not truth; it is, instead, a method for diminishing ignorance.
--J.M. Adovasio, Olga Soffer, Jake Page

A scientific theory accurately describes a large class of observations and makes definite predictions about future observations that are falsifiable, i.e., could be disproven by observation.
--Derived from Stephen Hawking

Credible explanations grow from the combined testimony of three more or less independent, mutually reinforcing sources -- explanatory theory, empirical evidence, and rejection of competing alternative explanations.
--Edward Tufte

I recommend Tufte’s day-long seminar, as an introduction to critical thinking --sk


Quantum Mechanics

http://xkcd.com/1240/

What is Root Cause Analysis?

Any structured approach for identifying the contributors to an IT service disruption

There is no such thing as a Root Cause … nevertheless, Root Cause Analysis remains a useful tool

RCA is not complete until we’ve applied the fix and verified that the problem is resolved

Business reality: competing priorities distract us from completing RCAs

Most folks use the term RCA to refer to a post-mortem process … I use the term in its ITIL sense, tightly bound to Problem Management

How Complex Systems Fail – Richard Cook
A Few Thoughts on Uptime – me

Why No Root Cause?

Why do I claim there is no such thing as a Root Cause? Consider the server which goes down; your monitoring system pages you; you investigate. Turns out the power supply died – you replace the power supply, the server reboots, everyone is happy again. Then, you notice that the second power supply is dead, too. Turns out your monitoring system wasn’t checking power supplies when the first one fried a few months ago. Why wasn’t your monitoring system checking power supplies? Because it can’t – and upgrading to the newer version which can costs time & money – your management looked at the costs, weighed the risks, and decided to spend your time and those dollars on upgrading the aging e-mail server, which was close to collapse. Why doesn’t your department have enough staff and money to upgrade both the e-mail server and the monitoring server? Because management has to juggle the costs of IT against the costs of core business requirements – both of which look critical from different vantage points.

So what’s the Root Cause? A failed power supply? An inadequate monitoring system? Insufficient process in your leadership’s prioritization tactics, that they let the aging e-mail system stumble along for far too long? Insufficient resources to meet both core business requirements and IT requirements? Not enough market for your product, which is why you don’t have sufficient resources to meet both sets of needs?

Still not convinced? Why have you lost two power supplies across as many months? Because your local utility is straining to meet demand in your area and frequently inflicts brownouts, which age power supplies prematurely. Why hasn’t the utility beefed up capacity in your area? Because that would cost money, and politicians are reluctant to approve the rate increases necessary to support an expansion, given current voter sentiment. Why are voters annoyed at politicians? … Reality is complex: There is no such thing as Root Cause …

How Do Techs Fix Issues?

Oh boy, that’s a big question. But let’s take a stab at answering it. A tech might start asking themselves, or the person reporting the problem, questions similar to the following:

• What makes you think there is an issue?
• What are you expecting that you’re not getting?
• Has it ever performed well?
• What changed recently? Software or hardware? Load?
• Can it be expressed in terms of latency or run time?
• Does the problem affect other people or applications?
• What is the environment? What software and hardware is used? Versions? Configuration?
• …

Most issues get fixed somewhere during the process of asking these questions and uncovering the answers …

Anti-Patterns

As the issue resists resolution, less skilled techs will start employing less effective approaches.

Street Lamp Method
The student comes across his professor on the Arts Quad at night, down on his hands & knees, staring at the sidewalk. “What are you doing, sir?” “Looking for my car keys”. The student joins the professor but after looking unsuccessfully in widening circles, asks him “Do you recall precisely where you were when you dropped the keys?” “Yes, over there, in the middle of the quad” points the professor, toward the dimly perceived middle of the grassy acre. “Well, why are you looking here?” asks the student. “Because the light is better here” responds the professor.

More formally:
1. List available tools
2. Examine the output of each one, looking for clues
3. Purchase more tools
4. Goto #1

Use The Force, Luke
“I know that we are experiencing a broadcast storm … you should check your {switch | router | firewall | server | client | application | whatever-belongs-to-some-other-group}”

I enjoyed Star Wars … but it was fiction … that distinction is hard for human brains to make. --sk

The USE Method

The issue typically gets escalated to a more experienced tech. I have yet to be satisfied with an account of what an experienced human does when engaging in their field of expertise. That said, here is one way to express what might be happening.

For every Resource, check Utilization, Saturation and Errors.

Intended to be used early in a performance investigation, to identify systemic bottlenecks.

Terminology definitions:
• Resource: all physical server functional components (CPUs, disks, busses, …)
• Utilization: the average time that the resource was busy servicing work
• Saturation: the degree to which the resource has extra work which it can’t service, often queued
• Error: the count of error events

Stuart’s version:
1. Scan the logs, looking for error messages (Errors)
2. Are requests waiting in queues? (Saturation)
3. How busy are the boxes? (Utilization)

I am cribbing from Brendan Gregg: http://dtrace.org/blogs/brendan/2012/02/29/the-use-method
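As a minimal sketch of the checklist on this slide, the code below walks a list of resources and reports Errors, then Saturation, then Utilization for each one, in the order of Stuart’s version. The ResourceSample structure, the resource names, and the 80% busy threshold are assumptions made for illustration; in practice the numbers would come from whatever logs and counters your systems expose.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceSample:
    name: str             # hypothetical resource label, e.g. "disk0"
    error_count: int      # count of error events
    queue_length: int     # work waiting that the resource cannot yet service
    busy_fraction: float  # 0.0-1.0, average time the resource was busy


def use_findings(samples: List[ResourceSample],
                 busy_threshold: float = 0.8) -> List[str]:
    """For every resource, check Errors, then Saturation, then Utilization."""
    findings = []
    for s in samples:
        if s.error_count > 0:
            findings.append(f"{s.name}: errors ({s.error_count})")
        if s.queue_length > 0:
            findings.append(f"{s.name}: saturation (queue {s.queue_length})")
        if s.busy_fraction >= busy_threshold:  # threshold is an assumed cut-off
            findings.append(f"{s.name}: utilization ({s.busy_fraction:.0%})")
    return findings


if __name__ == "__main__":
    # Made-up numbers, purely to show the shape of the output.
    samples = [
        ResourceSample("cpu", error_count=0, queue_length=0, busy_fraction=0.35),
        ResourceSample("disk0", error_count=2, queue_length=5, busy_fraction=0.93),
    ]
    for line in use_findings(samples):
        print(line)
```

An empty result does not prove the resources are healthy; it only means none of the three checks tripped on the data supplied.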

But Not Today

Most problems get solved using any number of techniques, a few of which I sketched in the previous slides

But that’s not what I will be pushing you to do today

I will be pushing you to employ a methodology called Rapid Problem Resolution (RPR) ®

RPR is an evidence-based process … it is a heavy process … it is a sledgehammer. Sledgehammers are generally overkill …

But for a certain class of problems – the ones which have defeated experienced techs for weeks, months, or years – sledgehammers offer plenty of value

The case studies in this class belong to that class of problems

I will push you to employ RPR. You may resist. That’s OK

The official goal of this class is to introduce you to RPR

Rapid Problem Resolution ®

This workshop borrows heavily from the Rapid Problem Resolution® methodology codified by Paul Offord of Advance7, which fits into ITIL’s Problem Management schema.

I’ve slashed Advance7’s 19-step approach into 9 steps. This makes the methodology less effective but teachable in a single day. And suitable for smaller RCAs.

RPR is not a silver bullet. It is merely a tool for your tool bag, like ping, top, PerfMon …

There are no silver bullets.

Life is pain, Highness. Anyone who says differently is selling something.
--The Man in Black

RCA Methodology

Derived from the Rapid Problem Resolution® methodology

1. Understand the Symptoms
2. Pick One Symptom
3. Draw the Diagram
4. Design Capture Plan
5. Capture Diagnostic Data
6. Analyze Captured Data
7. Identify Fix
8. Implement Fix
9. Verify Fix

[Diagram labels: Phase 1, Phase 2, Phase 3]

Notes on the Nine Steps

1. Humans want instant gratification: we start trouble-shooting before we understand the problem. Resist that urge.
2. It is natural to want to fix everything fast – myself, I rarely succeed when I try. Be particularly wary of thrashing: jumping from one symptom to another. Pick One Symptom, One Symptom only, and stick to it.
3. It is common to start trouble-shooting before understanding the environment. Draw the Diagram and Sit with the User. You may discover that you didn’t understand the Symptom, in which case, start over.
4. As you learn more about the Environment and make mistakes in your capture methodology, you’ll cycle through Steps #4-#6 numerous times. This is normal. As you become more experienced, you’ll spend more time on #3 and fewer times cycling through #4-#6.
5. If the problem is intermittent, you can spend a lot of time waiting here. That is reality.
6. Naturally, you need time to think about the data you capture.
7. At some point, you exit the #4-#6 loop because you think you understand what is happening and you have identified a fix.
8. You apply the fix.
9. Key step: verify that your fix actually works. If it doesn’t, start over.
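The control flow in these notes (a loop through Steps #4-#6, then a fix, then a verification that can send you back to the start) can be sketched as follows. This is only an illustration of the flow, not the RPR methodology itself: run_rca, the analyze and verify callbacks, and the round and attempt limits are hypothetical names and values introduced here.

```python
from typing import Callable, List, Optional

STEPS = [
    "Understand the Symptoms",
    "Pick One Symptom",
    "Draw the Diagram",
    "Design Capture Plan",
    "Capture Diagnostic Data",
    "Analyze Captured Data",
    "Identify Fix",
    "Implement Fix",
    "Verify Fix",
]


def run_rca(analyze: Callable[[int], Optional[str]],
            verify: Callable[[str], bool],
            max_capture_rounds: int = 10,
            max_attempts: int = 3) -> List[str]:
    """Return the trail of steps visited while following the nine-step flow."""
    trail: List[str] = []
    for _ in range(max_attempts):
        trail += STEPS[0:3]              # Steps 1-3: symptoms, one symptom, diagram
        fix = None
        for round_no in range(1, max_capture_rounds + 1):
            trail += STEPS[3:6]          # Steps 4-6: plan, capture, analyze
            fix = analyze(round_no)      # None means "not understood yet"
            if fix is not None:
                break                    # exit the #4-#6 loop
        if fix is None:
            continue                     # never understood the data: start over
        trail += STEPS[6:9]              # Steps 7-9: identify, implement, verify
        if verify(fix):
            return trail                 # fix verified: done
    return trail                         # attempts exhausted without a verified fix


if __name__ == "__main__":
    # Hypothetical callbacks: understanding arrives on the third capture round.
    trail = run_rca(analyze=lambda n: "candidate fix" if n == 3 else None,
                    verify=lambda fix: True)
    print(len(trail), "steps visited; last step:", trail[-1])
```

The limits exist only so the sketch terminates; real RCAs end when the fix is verified or when the business decides to stop.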

RCA Roles & Responsibilities

The Problem Solving Group (aka RCA Team) consists of the Facilitator, the Problem Analyst, and one or more Subject Matter Experts

Who | What

Facilitator (often a Problem Manager)
Accountable for
o Owns the RCA
o Acquire resources
o Use and execute the methodology
o Communicate within the team
o Report & escalate to leadership
o Schedule meetings

Problem Analyst (often a senior engineer)
Responsible for
o Unify & synthesize information from SMEs
o Keep team on track technically
o Breadth & depth

Subject Matter Experts
Responsible for
o Strong fundamental knowledge of area
o Facilitating access
o Capturing data
o Analyzing

SME Desirable Characteristics
Skills / Predilections
o Problem solving skills
o Inquiring mind – passion for understanding how things work
o Determination & stamina – pursuing a tough problem can be wearing
o T-shaped – broad background in IT with specialization in one or two particular areas

Process-oriented person
Sees the forest, not the trees
Respected / trusted by SMEs
Like getting their hands dirty

Draw the Diagram / Design Capture Plan

• Who talks to whom?
• Where to insert probes?
• Where to gather logs / debug output?

[Diagram labels: Fibre Channel Switch; Request; Response; (DNS, LDAP, NIS …)]

How Does This Class Work?

We will work through case studies – real situations drawn from my experience at FHCRC – alternating between small group and seminar style sessions.

Typically, we will oscillate in 15-30 minute increments – spending 15-30 minutes together as a class, working privately in our small groups for 15-30 minutes, coming together for 15-30 minutes …

Course materials on the USB stick include packet traces, log extracts, trending charts, ‘show’ output from clients, servers, switches/routers, storage systems, captured during the actual RCA.

Expectations

Whirlwind tour: At the Hutch, we typically spent weeks of an RCA team’s time on these cases – in this workshop, we will just taste each experience, merely touching on key points – we will not have time to dig through any of them in detail.

Variable expertise: As a group, we differ wildly in our expertise – some of us have never seen Wireshark before, have never touched an Ethernet switch or a storage array. I will play to a range of levels: sometimes you may be bored, sometimes you may be drowning.

We will not finish: I do not expect to reach all the case studies. We may not even get through the first one – it contains a lot of material – all depends on where your curiosity leads us.

More Expectations

Detours: Using your questions as cues, I will stop the flow of the course and explore related topics: how striping affects the performance of arrays, how the TCP Window works, how to perform a particular function in Wireshark.

Contribute: If you have expertise to contribute, please speak up – group dialogue contributes to learning.

Methodology: I will be a stickler for the RPR Methodology and will attempt to push you into following it, following each step in order. Naturally, you may choose to resist. I’m OK with dissent and rebellion – you know yourself better than I do – if you’ll learn better doing things differently, ignore me + blaze your own trail.

Great Expectations

Red Herrings: I will include data and clues which are irrelevant to solving the problem … that’s what happened to us, so I intend to share the pain.

Misinformation: When I am wearing a hat, I may give you inaccurate information, based on the limitations of the person whose role I am playing. When I am bare-headed, I am playing the role of the instructor and will try to describe reality as accurately as I know how.

Chaos: I am trying to recreate the fog of war, the confusion of a real-world situation: practicing ways to bring order from chaos is a deep lesson of this class.

Recommendations

Embarrass me: I make mistakes – find them and point them out. I’d rather feel embarrassed and learn than feel comfortable and remain ignorant.

Embarrass yourself: Take risks, ask dumb questions, reveal your ignorance. If you don’t understand my answer, ask again. This is your laboratory, a safe place for you to learn. Ex ignorantia ad sapientiam, E luce ad tenebras.

Data: The USB stick contains data – packet traces, ‘show’ output, screen shots – as you work through the scenario and ask for data, I will point you to the relevant directory. If you get stuck, feel free to poke around.

Results Folders: The USB stick also contains the answers to the case studies in folders named Results. I recommend avoiding the Results folder until we’re done for the day.

Wave me down: If you are stuck and thrashing, wave me down – I’m happy to assess where you are and offer you direction to get you unstuck.

Questions?

We are about to walk through the Example Case.

Questions up to this point?

Example Case

1. Understand the Problem

2. Choose One Symptom

3. Draw the Diagram

4. Design Capture Plan

5. Capture Diagnostic Data

6. Analyze Captured Data

7. Identify Fix

8. Implement Fix

9. Verify Fix

Results

Walk Through an Example Case

Server Disconnects Telnet Client

The End-User (Angie) keeps getting disconnected from the Server (Daffy). This has been going on for a while; Angie has a high-profile job and a high-profile boss; management has spun up a Root Cause Analysis team and assigned you and a Desktop Tech (Bob) to the team. Bob explains to you that he has been working the issue for several weeks, that a Router is causing the problem, and that he needs help finding and fixing the Router.

We start with 15 minutes together focused on Methodology Step #1: Understand the Symptoms

Questions for the Desktop Tech

You: What do you know about Angie?
Bob: She is a power user located in the Fairview Building, runs Windows XP and the Attachmate Reflection terminal emulator.

You: What do you know about the Server?
Bob: It is a Unix server called Daffy located in the Yale data center and run by the Sys Admin Rick.

You: How long has the problem been occurring?
Bob: Several weeks, happens multiple times per day, no pattern.

#1 Understand the Symptoms

Questions for End-User

You: When did this start?
Angie: It has happened for years, but I didn’t bother to report it because, until several weeks ago, I hardly used Daffy. Now, I spend all day in it, and the problem is really annoying.

You: What do you notice?
Angie: Multiple times per day, I get disconnected and have to log back in.

You: See any patterns?
Angie: Not really. Sometimes I’m typing along and get disconnected. Sometimes, I turn back to my machine or unhide Reflection and see that I’ve been disconnected.

#1 Understand the Symptoms

Questions for End-User

You: What do you do with this application?
Angie: I enter data into the FALCON database. The forms from which I acquire the data are irregular – requires a lot of interpretation. Sometimes, I spend time looking up related cases in other databases or calling relevant people on the phone for input. Sometimes, I just type like a mad woman. Sometimes, I run reports – it’s really annoying when a report takes half an hour to run and I get disconnected just before it finishes, because then I have to re-run the report.

#1 Understand the Symptoms


Questions for End-User

You: When you’re typing like a mad woman, how long before you get disconnected?

Angie: I figure I get 45 minutes. That’s my guess – I figure I get disconnected every 45 minutes. I might be wrong about that – I haven’t timed it or anything. But if I’ve been logged in for half an hour or so and need to run a report, I generally wait until I get disconnected, log back in, and then run the report immediately.
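Angie says she has not timed the disconnects, so one way to turn her 45-minute guess into a measurement is to record login and disconnect times and compute session lengths. The sketch below does that arithmetic. The timestamp format, the function name, and the sample values are invented for illustration; they are not data from the case.

```python
from datetime import datetime
from statistics import median
from typing import List, Tuple

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed format for hand-recorded times


def session_minutes(sessions: List[Tuple[str, str]]) -> List[float]:
    """Return the length in minutes of each (login, disconnect) pair."""
    lengths = []
    for login, disconnect in sessions:
        started = datetime.strptime(login, TIME_FORMAT)
        ended = datetime.strptime(disconnect, TIME_FORMAT)
        lengths.append((ended - started).total_seconds() / 60)
    return lengths


if __name__ == "__main__":
    # Placeholder values only: replace with times recorded by the user.
    recorded = [
        ("2013-11-05 09:00", "2013-11-05 09:45"),
        ("2013-11-05 10:00", "2013-11-05 11:30"),
    ]
    lengths = session_minutes(recorded)
    print("session lengths (min):", lengths)
    print("median (min):", median(lengths))
```

If the measured lengths cluster tightly around one value, that points toward a timer somewhere in the path; if they scatter widely, the 45-minute figure is probably an impression rather than a pattern.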

#1 Understand the Symptoms


Questions for the Sys Admin

You: What can you tell me about Angie’s problem?
Rick: Got me. It can’t be my server: Daffy has about 40 users and 10 developers, and Angie is the only person reporting this problem. They all use the Reflection SSH client.

You: What can you tell me about Daffy?
Rick: It is an HP Alpha server running OpenVMS located here in the D5 data center. It runs the Ingres database manager. Angie uses the FALCON database: everyone uses FALCON; it’s the most popular database we offer.

#1 Understand the Symptoms


Questions for the Sys Admin

You: How often does Angie have this problem?
Rick: Seems to me that Angie gets disconnected every hour or two; I’ve checked the server configuration – I haven’t configured a timeout: everyone gets unlimited access as long as they want.

You: What do your logs say?
Rick: Not much. Angie has called me plenty of times, right after getting disconnected, but all the Alpha logs say is: “Username angie: Client disconnected”

#1 Understand the Symptoms


Split: If this were a real case, we would split into our small groups. You have 15 minutes.

Choose: Your first task in small group is to select one and only one symptom on which to focus. In this example, it’s pretty easy – there’s only one symptom. In future cases, this task will be harder – there will be many symptoms. Generally, I recommend picking either the easiest to analyze, the easiest to replicate, or the most costly to the business.

Phrase: Find a precise way to phrase the symptom. Example: Angie gets intermittently disconnected from Daffy.

#2 Pick One Symptom


This will involve asking IT staff technical questions about the environment – this is where I start swapping hats (End-User, HelpDesk, Desktop, Sys Admin, Network, Database, Security, Vendor, Manager …), depending on the group to which you address the question

Ideally, the Ops staff already have this diagram and keep it updated as they make changes … but in my experience, only the most mature shops manage this

Sometimes, we identify the cause during the process of diagramming!

There’s a lot of experience & judgment here – what to include, what not to include

Focus on the components which surround the Symptom you have picked and how they relate to one another: dependencies.

If you solve a problem without drawing a diagram, you got lucky.

#3 Draw the Diagram


Diagram for Example Case


This is done in small group; you have 15 minutes. In this step, you figure out how you’ll gather the data you identified in the previous step.

Typically, you will want to gather logs and/or metrics from applications and operating systems as well as insert sniffers

As much as possible, I will also support your performing ‘show’ commands, grepping through logs, trending parameters across time, rebooting devices …

#4 Design Capture Plan


Example Data Capture Plan

1. Plug sniffers into lf-esx and d5sr-esx and SPAN Angie’s port and Daffy’s port, filtering on Angie’s IP address

2. Enable debug tracing on Angie’s copy of Reflection, gather both syslog and Ingres logs on Daffy

3. Validate capture set-up by asking Angie to ssh into Daffy, then verifying that we can see Angie’s login in all logs and packet traces

4. Sit with Angie and watch her work for a day, precisely recording the times when she gets disconnected

5. While we’re waiting: Gather ‘show port’ output from Angie’s and Daffy’s switch ports plus version and configuration information (idle timer setting) from Reflection
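
Step 4 of the plan above produces a list of disconnect times. As a minimal sketch of what to do with them, the Python below turns such a list into the gaps between consecutive disconnects, which is one way to test Angie's "every 45 minutes" guess against Rick's "every hour or two". The function name and the timestamps are invented for illustration; they are not data from the case.

```python
from datetime import datetime

def intervals_minutes(stamps):
    """Return the gaps, in minutes, between consecutive timestamps."""
    times = sorted(datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps)
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(times, times[1:])]

# Hypothetical observations written down while sitting with the user
# (made up for this sketch, NOT taken from the case)
observed = [
    "2013-03-25 09:00:00",
    "2013-03-25 09:47:30",
    "2013-03-25 10:31:00",
]

gaps = intervals_minutes(observed)
print(gaps)
```

Regular gaps would point toward a timer somewhere in the path; irregular gaps would point elsewhere.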

#4 Design Capture Plan


This is done as a class. The instructor executes each group’s Diagnostic Capture Plan and returns the resulting information.

Each group benefits from hearing the results of every group’s Diagnostic Capture Plan.

Typically 15 minutes.

In this example, the instructor returns:
• Reflection debug trace
• Packet Captures
• Logs
• Angie & Daffy’s Ethernet port statistics
• Reflection Version & Settings (idle timer)

#5 Capture Diagnostic Data


Angie’s Ethernet Port Stats

lf-esx#sh ver

[…]

lf-esx uptime is 3 years, 3 weeks, 5 days, 12 hours, 44 minutes

[…]

lf-esx#sh int Fa2/19

FastEthernet2/19 is up, line protocol is up (connected)

Hardware is Fast Ethernet Port, address is 0011.21f5.46c2 (bia 0011.21f5.46c2)

MTU 1500 bytes, BW 100000 Kbit, DLY 100 usec,

reliability 255/255, txload 1/255, rxload 1/255

Encapsulation ARPA, loopback not set

Keepalive set (10 sec)

Full-duplex, 100Mb/s, link type is auto, media type is 10/100BaseTX

input flow-control is unsupported output flow-control is unsupported

ARP type: ARPA, ARP Timeout 04:00:00

Last input 00:00:19, output never, output hang never

Last clearing of "show interface" counters never

Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0

Queueing strategy: fifo

Output queue: 0/40 (size/max)

5 minute input rate 0 bits/sec, 0 packets/sec

5 minute output rate 4000 bits/sec, 6 packets/sec

161282073 packets input, 48475519613 bytes, 0 no buffer

Received 2004674 broadcasts (1689326 multicasts)

0 runts, 0 giants, 0 throttles

0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored

0 input packets with dribble condition detected

831253443 packets output, 116132425387 bytes, 0 underruns

0 output errors, 0 collisions, 0 interface resets

0 babbles, 0 late collision, 0 deferred

1 lost carrier, 0 no carrier

0 output buffer failures, 0 output buffers swapped out

lf-esx#


If the error counters were high, perhaps we have a bad NIC | cable | switch port … but they are zero or close enough. Rule out bad physical layer
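
When there are many ports to check, reading the counters by eye gets tedious. The sketch below, in Python, pulls the "number name" pairs out of the error lines shown above and reports only the non-zero ones. The function names are hypothetical, and the parsing assumes the comma-separated layout of this particular output; other platforms format their counters differently.

```python
import re

# Error-counter lines copied from the 'sh int Fa2/19' output above
SAMPLE = """\
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 output errors, 0 collisions, 0 interface resets
0 babbles, 0 late collision, 0 deferred
1 lost carrier, 0 no carrier
"""

def error_counters(text):
    """Map each counter name to its integer value."""
    counters = {}
    for line in text.splitlines():
        for field in line.split(","):
            match = re.fullmatch(r"\s*(\d+)\s+(.+?)\s*", field)
            if match:
                counters[match.group(2)] = int(match.group(1))
    return counters

def nonzero_counters(text):
    """Return only the counters that have incremented."""
    return {name: value
            for name, value in error_counters(text).items() if value}

print(nonzero_counters(SAMPLE))
```

One lost carrier across three years of uptime is what the slide calls "close enough" to zero.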


Packet Trace


daffy = ingress

Angie abruptly hangs up (TCP RST) on Daffy (aka Ingress). Looks like Angie initiated the disconnect


Reflection Settings


Reading the manual tells us that Connection Setting Timeout is an Idle Timer. And that a choice of ‘0’ for this timer means ‘unlimited’, i.e. never disconnect, no matter how long the user remains idle. Dang, we really wanted to see a setting of, oh, 60 minutes here


Application Version


Research tells us that the latest patch level for Reflection 14 is v14.0.7. And the latest version for this train of Reflection is 14.1.188 SP1. Angie is running an old version


Back to small group; you have 30 minutes to analyze the data you have acquired and

In real life, you will likely cycle through Steps #4 - #6 multiple times.

Feel free to continue to #7 Identify Fix when you are ready.

Your team consults together … hmm …
• The Ethernet port shows trivial errors, so that looks fine.
• The packet trace shows Angie initiating the disconnect
• Reflection settings show an unlimited idle timer
• We’re running an old version of Reflection … probably full of bugs

#6 Analyze Captured Data


At some point, you believe you’ve identified the cause; now you can develop a fix.

#7 Identify Fix


Your team says:
We know that Attachmate has shipped numerous updates to Reflection – the latest version is 14.0.7. We propose to upgrade Angie’s copy to the latest version.


We reconvene as a class. Each group proposes its fix, and the instructor reports the results of the fixes.

In this example, Bob doesn’t want to upgrade – he wants to keep all his users at the same revision.

Instead, he uninstalls and re-installs Reflection.

#8 Implement Fix


We remain regrouped as a class and review the results of the Fixes. In this case, Angie runs for a week without any disconnects.

Bob doesn’t want to invest more time into this, so we quit.

Ideally, we would re-image Angie’s machine and verify that the problem returned … as scientists, we realize that we have demonstrated correlation, but not cause and effect.

#9 Verify Fix


We declare the Problem resolved, with an undefined Root Cause – something related to Angie’s local Application configuration which got reset when the Application was re-installed. We have no explanation for why this affected only Angie and not any of the other 55 users.

In a perfect world, we would re-image Angie’s machine and verify that the problem returned … in the real world, we did not implement that last step of RPR, which requires that we Verify the Fix …

As a Problem Manager, you are responsible for ensuring that management hears the risk they have adopted by skipping this step.

Results


End of Example Case

• For the rest of our day we cycle between small group and large group

• In large group, you ask questions; in small group, you analyze

• I am available for questions and coaching during both

Questions about the mechanics of what we will be doing?

Questions about the 9 step RCA process?


#1 Split
In a moment, you will split into groups of 3-6 people

#2 Assign Roles
I recommend assigning roles & responsibilities, e.g.

Facilitator: Tracks who is doing what, spokesperson
Problem Analyst: Big picture
Subject Matter Experts: Sys admin, network, storage …

Successful teams divide & conquer the material …
Ideally, one person per role …

#3 Pick Name
Pick a cool name for your group; write it on one of the name plates

You have 5 minutes – go

Split into Small Groups


Remote Office Bumps (morning)
Many Applications Crash (afternoon)


Case Studies


The users at a clinic on First Hill, established in the mid-1990s, have intermittently reported various issues – Outlook appointments vanishing, printing slowness (takes minutes to hours for print jobs to appear), browser-based applications malfunctioning, faxing problems, scratchy quality on voice calls, “Not responding” in the application menu bar … “the computer is slow”. Over the years, we’ve gradually upgraded their WAN connection from BRI (128Kb/s) to PRI (1.544 Mb/s) to bonded PRI (3.588Mb/s) to Metro Ethernet (10Mb/s). And we’ve gradually upgraded their workstations through versions of Windows, Office, and browsers, replacing their PCs along the way. The upgrades have helped but have not eliminated the problems.

The research project behind the clinic has landed a new grant, which will allow them to expand their status from a Clinical Research Site to a Clinical Trials Unit – this will translate into more staff, more equipment, more participants volunteering for their studies.

Management is concerned that the expansion will exacerbate the already unreliable quality of the IT services available at this location and figures that upgrading the WAN circuit to 100 Mb, while expensive, will fix the problems. But before they sign a three year contract, they want a sanity-check: Will upgrading this circuit resolve the issues?


Remote Office Bumps


Welcome to the first meeting of the Cabrini Tower PSG; today is Friday January 11th 2013.

We start the RPR Methodology working together as a class.

Step #1: Understand the Symptoms
What questions do you want to ask of the various constituencies?

Database, Desktop, Helpdesk, Manager, Network, Server, Storage, User, Vendor …

After Understand the Symptoms, we will separate into small groups and proceed with:

Step #2: Choose One Symptom


Set the Stage


Large group | Small group

1. Understand the Symptoms
2. Choose One Symptom
3. Draw the Diagram
4. Design Capture Plan
5. Capture Diagnostic Data
6. Analyze Captured Data
7. Identify Fix
8. Implement Fix
9. Verify Fix

RCA Methodology
Derived from Advance7’s Rapid Problem Resolution® methodology


Phase 1

Phase 2

Phase 3


What aspect of the process would you like to review?

What section of a diagram or graph would you like to explore?

What hunk of data would you like to re-examine?

Which link in the chain of reasoning doesn’t make sense to you?

Additional questions?

This is your opportunity to consolidate your learning.


Q&A


PGP

http://xkcd.com/1181/


This is the last week in November 2005. Earlier this year, we bought a mass storage device – a BlueArc Titan NAS head named Indigo sitting in front of 14 TB of Fibre Channel, SATA, and ATA attached disk trays. We have been migrating home + shared directories for two divisions (~1200 staff) from a flock of aging DAS-equipped file servers onto Indigo, along with scratch space for the MIS group.

The experience has been rocky. Starting in June, an OS memory leak caused key processes to hang and sometimes even froze the head, both requiring reboots to fix. A controller fried, requiring emergency downtime for replacement. A controller firmware bug mangled a volume, leading to data loss. We have been applying hot fixes, firmware upgrades, and OS upgrades every few weeks. Starting in August, users began reporting crashing applications – notably Outlook, although Word and Excel and other applications hang as well. The crashes are intermittent – some days are fine, some days are bad. The MIS group’s Tidal jobs fail regularly.

Backups are slow and sometimes don’t complete – we aren’t meeting our 24 hour Recovery Point Objective, and we have no confidence that we can meet our 48 hour Recovery Time Objective. Sometimes even simple file copies are slow!


Many Applications Crash I


The storage team was convinced that antivirus scanning was causing the application crashes and has worked with BlueArc for months to resolve this, finally disabling AV over Thanksgiving. However, the intermittent application crashes continued this week.

The local BlueArc team visited a few days ago and identified the Catalyst 4000 Ethernet switches as the likely culprits: “The Catalyst 4003 servicing the backup systems dates to 1998; the Catalyst 4006 servicing the Titan itself dates to 2000 – they are getting overwhelmed by traffic.”

The remaining ~1500 users who have not migrated to Indigo are watching with dismay – currently, they are unaffected, scattered as they are between small NetApp NAS heads and a flock of aging file servers.

Management has made every Sunday night in December available to you for Indigo downtime – just ask.


Many Applications Crash II


Welcome to the first meeting of the BlueHeat PSG; today is Friday December 2nd.

We start the RPR Methodology working together as a class.

Step #1: Understand the Symptoms
What questions do you want to ask of the various constituencies?

Database, Desktop, Helpdesk, Manager, Network, Server, Storage, User, Vendor …

After Understand the Symptoms, we will separate into small groups and proceed with:

Step #2: Choose One Symptom


Set the Stage


Large group | Small group

1. Understand the Symptoms
2. Choose One Symptom
3. Draw the Diagram
4. Design Capture Plan
5. Capture Diagnostic Data
6. Analyze Captured Data
7. Identify Fix
8. Implement Fix
9. Verify Fix

RCA Methodology
Derived from the Rapid Problem Resolution® methodology


Phase 1

Phase 2

Phase 3


What aspect of the process would you like to review?

What section of a diagram or graph would you like to explore?

What hunk of data would you like to re-examine?

Which link in the chain of reasoning doesn’t make sense to you?

Additional questions?

This is your opportunity to consolidate your learning.


Q&A


The Mother of All Suspicious Files

http://xkcd.com/1247/


Tech Flash: Sanity-Checking Throughput Claims

Sometimes, I’ll hear a tech claim that we need a fatter WAN pipe, because the file copy | backup job | database synchronization | whatever is slow:

“I’m only getting 400 MB/hour to Chicago: we need to rent a fatter network pipe.”

Well, we geeks often confuse ourselves when translating between bits per second and bytes per second … this whole performance zone is a popular place for error … and fatter WAN pipes are expensive. Let’s sanity check this claim.

Name               Bit Rate      Effective Data Rate*
Vanilla Ethernet   10 Mb/s       1 MB/s
Fast Ethernet      100 Mb/s      10 MB/s
Gigabit Ethernet   1000 Mb/s     100 MB/s
Ten Gig Ethernet   10000 Mb/s    1000 MB/s (aka 1 GB/s)

Assume we have a 100Mb/s pipe to Chicago: would buying a fatter one help?

*These numbers constitute an easily remembered rule-of-thumb: well-tuned clients/servers can actually deliver 10-15% better than this
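
The arithmetic behind the sanity check can be sketched in a few lines of Python. The function name is made up for this example, and it assumes decimal units (1 MB = 1,000,000 bytes, 8 bits per byte), which is close enough for a rule-of-thumb comparison.

```python
def mb_per_hour_to_mbit_per_sec(megabytes_per_hour):
    """Convert a transfer rate in MB/hour to Mb/s (decimal units assumed)."""
    return megabytes_per_hour * 8 / 3600

observed_mbit = mb_per_hour_to_mbit_per_sec(400)   # the tech's measurement
pipe_mbit = 100                                    # the existing pipe to Chicago
utilisation = observed_mbit / pipe_mbit

# Rule-of-thumb capacity of the same pipe, from the table: 10 MB/s
rule_of_thumb_mb_per_hour = 10 * 3600

print(f"{observed_mbit:.2f} Mb/s of {pipe_mbit} Mb/s ({utilisation:.1%})")
print(f"pipe could carry roughly {rule_of_thumb_mb_per_hour} MB/hour")
```

At under 1 Mb/s on a 100 Mb/s circuit, the transfer is using about one percent of the pipe, so raw bandwidth is unlikely to be the limit and the bottleneck is worth looking for elsewhere.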


Tips & Tools

Wiggly Charts are Overrated

Validate the Diagram

Rich Pingers

Draw the Pie

When to Use Full-Blown RPR

Musings on IT Architecture


Wiggly Charts are Overrated

My Solexa run failed last night shortly before midnight. You can see that Fred’s switch port was extremely busy then, far busier than usual, and you can also see the IO spike which happened at the same time. Therefore, Fred needs a 10GigE NIC and faster disks.

Your task:
• Think of ways to support this
• Think of ways to refute it


Validate the Diagram

Send a sample transaction from one end of the infrastructure to another, capturing along the way. If your diagram is accurate, you’ll see the transaction at every single capture point.

If you don’t see that transaction … then you know your diagram is inaccurate: return to Draw the Diagram

Once you’ve validated the diagram, you are positioned to capture the pathology you’re investigating

Sample transaction: Write a test file, update a database record, send a Rich Ping … in each case, include easy-to-spot ASCII


TextPing
http://www.packetiq.com/Tools/PacketIQ-TextPing.aspx


Command Line

Send a TCP port 2049 frame to server.company.com
host> echo "Starting NFS Mount now --marker" | nc -4 -w 1 server.company.com 2049
C:\Temp> echo "Starting NFS Mount now --marker" | ncat -4 -w 1 server.company.com 2049
For Windows, install the open source ncat utility http://www.insecure.org, part of the Nmap distribution

Send a UDP port 666 frame to server.company.com
host> echo "Starting app now --marker" | nc -4 -w 1 -u server.company.com 666
C:\Temp> echo "Starting app now --marker" | ncat -4 -w 1 -u server.company.com 666

Create a file; the name of the file will appear in Wireshark’s Summary screen
host> touch /mnt/whatever/slowness-starting-now--marker.txt
C:\Temp> copy /y nul z:slowness-starting-now--marker.txt

Drop the message into /var/log/syslog on loghost
host> logger -l loghost.company.com slowness starting now --marker
C:\Temp> logger -l loghost.company.com slowness starting now --marker
For Windows, install the freeware logger utility http://www.monitorware.com/logger

Drop the message into the Web server’s logs:
host> wget http://www.company.com/slowness-starting-now--marker.html
C:\Temp> wget http://www.company.com/slowness-starting-now--marker.html
For Windows, install the open source GNU wget utility
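
Where neither nc nor ncat is installed, the same marker trick can be sketched with Python's standard socket module. The two function names below are hypothetical, and the host, ports, and marker text in the commented usage are the placeholders from the commands above.

```python
import socket

def send_udp_marker(host, port, text):
    """Send one UDP datagram whose payload is an easy-to-spot ASCII marker."""
    payload = text.encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload

def send_tcp_marker(host, port, text, timeout=1.0):
    """Open a TCP connection, send the marker, then close the connection."""
    payload = text.encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
    return payload

# Example usage (placeholders, uncomment and adjust for your environment):
# send_udp_marker("server.company.com", 666, "Starting app now --marker")
# send_tcp_marker("server.company.com", 2049, "Starting NFS Mount now --marker")
```

The UDP variant needs no listener on the far end, since the datagram crosses the wire either way. The TCP variant only delivers its payload if something accepts the connection, although the SYN itself still shows up in the trace.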


Ping

In a pinch, you can use ping, manually maintaining a written table associating ping packet length to message (-n and -l are the Windows ping options; Linux ping uses -c for count and -s for payload size):

host> ping -n 1 -l 101 server.company.com
host> ping -n 1 -l 102 server.company.com
host> ping -n 1 -l 103 server.company.com

Ping Packet Length   Event
101 bytes            Mounting file system
102 bytes            Starting application
103 bytes            Slowness beginning now

Or, depending on your filters, ping a fake host … the ping won’t show up in the trace, but the failed DNS query will:
host> ping www.slowness-starting-now--marker.com
C:\Temp> ping www.slowness-starting-now--marker.com


Send-UDP-Msg
http://www.skendric.com/app

vishnu> ./send-udp-msg -m "This is a test ping" rhino1 rhino2 rhino3

vishnu>

Or, if you want to show off, write your own … here’s mine


Many problems are intermittent – you set your debugs and packet captures going and then wait hours/days/weeks for the issue to recur. How might one capture across such long time frames?

Ring Buffer

Most capture utilities will produce a ring buffer of files. In this example, dumpcap writes the captured frames to a file named in the following way:

server-side_00001_20130325120842.pcap

where the first field is a serial number and the second field encodes the date/time of the start of this capture. After the file reaches 50,000 kB (dumpcap’s filesize value is in kB, so roughly 50 MB), it gets started on the next file:

server-side_00002_20130325120958.pcap

where the second field again encodes the start time of this capture. Dumpcap will repeat 10,000 times, whereupon it will start deleting the oldest files, in order to limit the number of files to 10,000.

Windows
dumpcap -i 1 -b files:10000 -b filesize:50000 -f "ip host 10.1.2.3 and not ip host 192.168.20.30" -w c:\temp\cabrini\server-side.pcap

Linux
/usr/sbin/dumpcap -i eth1 -b files:10000 -b filesize:50000 -f "ip host 10.1.2.3 and not ip host 192.168.20.30" -w /home/skendric/cabrini/server-side.pcap &

Long-Term Captures - CLI


Or perhaps you prefer the GUI

Long-Term Captures - GUI


2013-11-05 Root Cause Analysis | Sharkfest 2013 | Stuart Kendrick 97

Extract PacketsOK, so now you have 10,000 files, and you want to look at an incident which occurred between noon and 1:00pm on 2013-03-25

The names of the files allow you to focus on the window in question, so you copy those to a working directory. But that can still be a lot of files, in a busy environment. Perhaps you realize that you only care about DNS frames. I write littlescripts to extract the interesting packets and merge them into a single file.

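Windows
@echo off
setlocal ENABLEDELAYEDEXPANSION
REM Step 1: filter each capture file down to DNS traffic
if not exist c:\temp\cabrini\extract mkdir c:\temp\cabrini\extract
cd /d c:\temp\cabrini
FOR /F %%a IN ('dir /b *.pcap') DO (
  echo Processing %%a
  REM -Y applies a display filter while reading; older tshark releases used -R
  tshark -r %%a -Y "udp.port==53 or tcp.port==53" -w extract\%%~na-filtered.pcap
)
REM Step 2: build the list of filtered files, then merge them into one capture
cd /d c:\temp\cabrini\extract
set filelist=
FOR /F %%a IN ('dir /b *.pcap') DO (
  set "filelist=!filelist! %%a"
)
mergecap -w c:\temp\cabrini\server-side-extract.pcap %filelist%

Linux
http://www.skendric.com/problem/rca/extract-frames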
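Picking the files that cover the incident window can also be automated. The following Python sketch is an illustration, not the author's script. It assumes the ring-buffer naming shown earlier (a 14-digit start time before the extension) and treats each file as covering the interval from its own start until the next file starts. The function names and the first and last file names in the example are hypothetical.

```python
from datetime import datetime


def start_time(name):
    """Read the 14-digit start timestamp that precedes the .pcap extension."""
    stamp = name.rsplit("_", 1)[1].split(".")[0]
    return datetime.strptime(stamp, "%Y%m%d%H%M%S")


def files_in_window(names, window_start, window_end):
    """Return the files whose capture interval overlaps the incident window.

    A file covers the time from its own start until the next file starts;
    the newest file is treated as open-ended.
    """
    ordered = sorted(names, key=start_time)
    selected = []
    for index, name in enumerate(ordered):
        begins = start_time(name)
        if index + 1 < len(ordered):
            ends = start_time(ordered[index + 1])
        else:
            ends = datetime.max
        if begins < window_end and ends > window_start:
            selected.append(name)
    return selected


if __name__ == "__main__":
    names = [
        "server-side_00000_20130325115830.pcap",  # hypothetical
        "server-side_00001_20130325120842.pcap",
        "server-side_00002_20130325120958.pcap",
        "server-side_00003_20130325130210.pcap",  # hypothetical
    ]
    noon = datetime(2013, 3, 25, 12, 0, 0)
    one_pm = datetime(2013, 3, 25, 13, 0, 0)
    for name in files_in_window(names, noon, one_pm):
        print(name)
```

The file that started just before noon is included because it still holds frames captured after noon; the file that started after 1:00pm is skipped.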


Client / Network / Server Pie

Contribution to the Problem
Client (user, application, memory, CPU, disk …): 5 seconds
Network (switches, routers, firewalls, VPN tunnels …): 120 seconds
Server (application, OS, memory, CPU, storage …): 15 seconds

I find drawing the CNS Pie useful when analyzing performance issues: how much does the Client contribute to the Problem? The Network? The Server?
http://www.skendric.com/app
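The pie is simple arithmetic: each component's share is its time divided by the total. A minimal Python sketch using the figures from this slide (5 s client, 120 s network, 15 s server) follows; the function name `cns_shares` is hypothetical.

```python
def cns_shares(client_s, network_s, server_s):
    """Return each component's share of the total delay, as a percentage."""
    total = client_s + network_s + server_s
    if total <= 0:
        raise ValueError("total time must be positive")
    return {
        "client": 100.0 * client_s / total,
        "network": 100.0 * network_s / total,
        "server": 100.0 * server_s / total,
    }


if __name__ == "__main__":
    for component, percent in cns_shares(5, 120, 15).items():
        print(f"{component}: {percent:.1f}%")
```

In this example the network accounts for about 86% of the 140 seconds, so that is where to look first.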


Rule-of-thumb

Application architecture: 1,000x
• SQL, query optimization, caching, system calls

Server & Storage Configuration: 100x
• Disk striping, spindle tiering, paging, NFS tuning

Application fine-tuning: 2-10x
• Threads, asynchronous I/O

Kernel tuning: less than 2x
Caveats:
• If kernel bottleneck is present, then 10-100x
• Kernel can be a binary performance gate

Source: Hal Stern, Marc Staveley, System & Network Performance Tuning, LISA 2007 (Version 3.10, Copyright 1994-2007)


Tuning Potential


Advance7 is a consulting outfit which helps customers resolve critical Problems – they put an analyst at your site to coordinate your staff plus vendors to fix the issue, using the RPR methodology.

They designed RPR to work against Grey Problems.

Most Problems are not Grey … unless the Problem is Grey, RPR is overkill.

So what are Grey Problems?

The following slides are cribbed from Advance7 materials -- full credit to Paul Offord & colleagues.


Rapid Problem Resolution ®


The Grey Problem

Q1: Single Incident, Technology Known
• Change-related Cause
• Hardware Failure
• Software Failure
• Misconfiguration
• Operations Error

Q2: Recurring Problem, Technology Known
• Intermittent Hardware Failure
• Known Error
• Intermittent Software Failure

Q3: Single Incident, Technology Unknown
• User Error
• Operations Error
• Rare Software or Hardware Error

Q4: Recurring Problem, Technology Unknown
• Intermittent Application Error
• Poor Application Logic
• Transient Overload
• Intermittent Infrastructure Error
• Incorrect Failover Operation

The majority of issues that are passed to 2nd and 3rd line technical support teams are investigated in a straightforward manner. The nature of the issue or an indication from a monitoring system identifies the failing component and the issue is allocated to the correct technical support team. Q1: the bulk of support work falls into this area. Q2 is harder but tends to be resolved by experienced support staff. Q3 is tough; we tend not to solve these.

An intermittent response-time or error issue is not so easily handled due to its transient nature. Not only does the cause sneak under the radar of monitoring systems, but investigation often starts after the issue has passed, making it impossible to use many of the tools available. The result is a recurring problem where the causing technology is unknown: Q4, aka the Grey Problem. The Rapid Problem Resolution methodology targets Q4.


Grey Problem Characteristics
Because the causing technology is unknown, a grey problem will bounce between Technical Support Teams as each in turn produces evidence (often in the form of a health check) to prove that their technology is not to blame.

Typical characteristics of a grey problem
• An ever-growing number of people become involved
• Long meetings to discuss what might be the cause
• Support people shy away from becoming involved
• Repeated changes with no clear reason or objective

Consequences of grey problems
• An ever-growing backlog of problems
• A fog that hinders the investigation of other, more urgent problems
• A growing pool of problems that escalate into Major Incidents as patterns of use and business priorities change
• Wasted IT budget as money is spent on poorly targeted upgrades
• Barriers to integration due to concerns about the stability of component systems
• Loss of confidence and satisfaction with the IT department
• Pressure to outsource IT services
• Reduced customer satisfaction
• Higher costs as the business adjusts to accommodate the problem
• Higher IT staffing costs

[Diagram: a grey problem (?) passing from the Service Desk through Incident Management and Problem Management, then bouncing among the Server, Desktop, Network, and Application teams and their vendors]


The RPR Methodology


1.2 Choose One Symptom
Not doing this is the single largest reason I’ve thrashed in my RCA career.

1.4 Draw the Diagram & Sit with the User
If I can’t draw it, I don’t understand it. Seeing leads to Understanding.

2.2 Definitive Diagnostic Data
Insert capture gear at critical points along the path, synchronize time using a distinctive transaction, and capture data simultaneously from all points while replicating the pathology.
Hardest to implement; most likely to make you successful.
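One way to apply the "synchronize time using a distinctive transaction" step is to note when the same marker transaction appears at each capture point and use the differences as clock offsets. The Python sketch below is an assumption about how that could be done, not a description of the RPR tooling. The function names and timestamps are hypothetical, and the marker's transit time between capture points is ignored.

```python
def clock_offsets(marker_seen_at, reference):
    """Offset (seconds) of each capture point's clock relative to the reference.

    marker_seen_at maps a capture point to the local timestamp at which it
    recorded the distinctive transaction. Transit time between points is
    ignored, which is an approximation.
    """
    base = marker_seen_at[reference]
    return {point: seen - base for point, seen in marker_seen_at.items()}


def align(timestamp, point, offsets):
    """Convert a timestamp from one capture point onto the reference clock."""
    return timestamp - offsets[point]


if __name__ == "__main__":
    # Hypothetical timestamps (seconds) of the marker at three capture points
    marker = {"client": 100.0, "firewall": 102.25, "server": 99.5}
    offsets = clock_offsets(marker, reference="client")
    print(offsets)
    # A frame the firewall capture stamped at 250.0 s, on the client's clock
    print(align(250.0, "firewall", offsets))
```

Once every capture shares one timeline, a delay that shows up between two capture points can be attributed to the segment between them.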


Key Elements of RPR


Full methodology
http://www.skendric.com/problem/rca/RPR-RCA-Methodology.pdf

Checklist
http://www.advance7.com/misc/rpm_wb.html

Manual
http://www.advance7.com/information/publications


RPR References


• Business stakeholders want features
• We IT geeks love to turn every fancy knob

The result is complexity
Complexity is the enemy of uptime … and the raison d’être for RCA

Insights from our gurus
Increasingly, people seem to misinterpret complexity as sophistication, which is baffling - the incomprehensible should cause suspicion rather than admiration. --Niklaus Wirth

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. --Brian Kernighan

KISS


Musings on Architecture
http://www.skendric.com/philosophy/uptime


Folks sometimes ask me how I learn this stuff
Key contributors to my path: Mentorship, Failure, Independent Study, Training

Mentorship
I have had the good fortune to work for highly skilled people who have mentored me.
• During 1991-1993, I joined my boss on Saturdays … worked as his gofer boy … he used the opportunity to train me
• In the mid-2000s, we hired Mike Pennacchi to coach us. Mike came on-site once/month for a half-day; we brought whatever problem was troubling us to the session; Mike would not solve it for us … rather, he would coach us through solving it. We did this for ~three years before budget contraction interfered

Failure
I have had the good fortune to work for bosses who believe that we learn through mistakes … “Fail early and often” … I’ve learned a lot this way

Independent Study
I set aside a slot every week (mostly!) to practice what I’ve learned, push myself to learn something new … these days, Sunday mornings

Training
I have had the good fortune to work for bosses with training budgets … I typically spend a couple weeks per year in classes … I occasionally augment this using my own shekels


Musings on Skill


Humans (all living creatures) are wired for fast-twitch: our nervous systems respond rapidly to the unusual, not to the mundane. So a bomb explodes, kills three people: that fires our adrenalin … but the annual toll of smoking (~500,000 per year in the US alone) drifts past our consciousness without a quiver.

Of course we’re wired this way … that’s how we stayed alive on the African savannah: by paying attention to the howl of the hyena, rather than to the gradual constriction of our arteries

But the result is that we have trouble paying attention to slow-twitch threat … to saving for retirement or a rainy day, daily exercise, spending time with our family … investing in the power grid, roads, bridges, anything which seems like a long way off …

Ditto with IT – we focus on the glitzy new projects, ignore the underpinnings … until the infrastructure breaks catastrophically … that drama fires our nervous systems, then we pay attention (for a while)

I don’t have a solution for this design flaw (trade-off) in our brains … but it does keep me employed, as a Problem Manager and a Problem Analyst. If we maintained our infrastructure (shrank technical debt), many of our RCAs would not occur.

Musings on Nervous Systems


It has been said that man is a rational animal. All my life I have been searching for evidence which could support this.
--Bertrand Russell

Your brain will be predisposed to certain answers and will cling to them, blinding you to reality

Definitive Data Capture is RPR’s effort to counteract this tendency

I wish you success in scrabbling for rationality


This is Hard


Insight


http://xkcd.com/1215/


On-Line Resources
Rapid Problem Resolution by Paul Offord
LinkedIn Protocol Analysis & Troubleshooting Group
Old Comm Guy http://www.lovemytool.com

Trouble-shooting & Training Outfits Based Here (will travel for $$)
James Baxter http://www.packetiq.com Daytona Beach, FL
Tony Fortunato http://www.thetechfirm.com Toronto, Canada
Chris Greer http://www.packetpioneer.com Central America
Paul Offord http://www.advance7.com London (international)
Mike Pennacchi http://www.nps-llc.com Seattle, WA
Ray Tompkins http://www.gearbit.com Austin, TX
…

Conferences
Sharkfest http://www.sharkfest.org Berkeley, CA

Follow-up
stuart.kendrick.sea {at} gee mail dot com

Thank you