
2013-11-05 Root Cause Analysis Intermediate | LISA 2013 | Stuart Kendrick 1

Root Cause Analysis – Intermediate

A Hands On Tutorial

Your Pre-Flight Check List

1. Write your first name on the card stock, display prominently

2. Locate the course files on your USB stick

3. Grab the latest version of the slide deck, dated 2013-11-05

http://www.skendric.com/seminar/rca/Root-Cause-Analysis-Advanced-Deck.pdf

4. Configure Wireshark columns (see p.5 of this presentation)

5. Introduce yourself to your neighbors (teammates): figure out who will play which roles

6. Read printed materials at your table, examine the diagrams on the walls

Copyright Stuart Kendrick ©2013 All Rights Reserved


Introduction
Example Case
Split into Small Groups
Case Studies
    HPC Cluster Woes
    Storage Stumbles
Tips & Tools
Wrap-up


Workshop Outline


Introduction

Mechanics

Me and My Biases

What is Root Cause Analysis?

How Does This Class Work?

Recommendations


Mechanics

We use Google Docs … you don’t need an account: I will provide links

9:00 – 10:30 Class
10:30 – 11:00 Break
11:00 – 12:30 Class
12:30 – 13:30 Lunch
13:30 – 15:00 Class
15:00 – 15:30 Break
15:30 – 16:30 Class
16:30 – 17:00 Wrap-up

Ask questions whenever you want

Your Laptop
• has Internet connectivity
• can display & search PDF, PNG, TXT, XLS
• has grep or similar
• Wireshark configured per next slide


Configure Wireshark Columns

• Use a recent version of Wireshark … 1.10.0 at a minimum – I recommend the latest and greatest
• If you are an experienced Wireshark user, feel free to ignore this and use your favorite column choices
• If you are really experienced and prefer a different analyzer, feel free to use it

You really want Delta time displayed. And Custom (tcp.stream) will be helpful.

Multi-disciplinary IT trouble-shooter / Root Cause Analysis
http://www.skendric.com

sbk@cornella | student | 1981
stuart@cpvax5 (Science Applications Inc) | programmer
[email protected] | desktop / server
[email protected] | server / network
[email protected] | multidisciplinary | 1993
stuart.kendrick {at} isi lon dot com | sustaining engineer | 2014

IT Architect | ITIL Problem Manager | Problem Analyst | Device Monitoring | Transport

Geeky Highlights
PL/1 on IBM mainframes | Cornell University | Ithaca | 1981
FORTRAN on CRAY-1 | SAIC | San Diego | 1984
Terak, DisplayWriter, IBM PC, Macintosh | Cornell University | Ithaca | 1985
Netware, Corvus Omninet, TCP-IP / IPX / AppleTalk | Cornell University | Ithaca | 1988
AppleShare, QuickMail, Farallon, NRC, Cisco, Sniffers | Cornell Medical College | Manhattan | 1991
Solaris, Windows, Linux, Perl, SNMP, Wireshark, Cisco, Fluke | FHCRC | Seattle | 1993
OneFS | EMC Isilon | Seattle | 2013

Me


You are a senior engineer with a decade or more experience in the industry
Perhaps you function as a sys admin, network engineer, database admin, or developer
Perhaps you work for a large outfit and function as an ITIL Problem Analyst
Perhaps you work for a small outfit and are a jack-of-all-trades

In any case, you are T-shaped: you have a strong fundamental knowledge in one or two areas and have expertise (possibly rusting!) across a range of technologies

Problem solving skills     You enjoy difficulty
Inquiring mind             Passion for understanding how things work
Determination & stamina    Pursuing a tough problem can be wearing
T-shaped                   Broad background in IT with specialization in one or two areas

You are here because you want to practice skills in small group, rather than listen to a lecture

Or … perhaps you are a people or process person – resource manager, project manager, ITIL Problem Manager. You don’t have the skills to analyze bits & bytes, but you want to practice a problem solving methodology. You’ll help keep your team on track, coordinating subject matter experts, bringing the results together for reports to the larger class.

Or … perhaps you are a junior engineer, jumping into the lake with bigger kids, knowing you’ll be out of your depth, hoping to learn from the experience nevertheless. I’m OK with this … but realize that you’ll be inhaling water … wave me down as needed


You

• I do not claim to be good at trouble-shooting
• I do not claim to know how to teach trouble-shooting
• I am not the smartest or fastest guy on the block

However …

• I have ~30 years experience in this business
• I have trained under gurus
• I have accumulated a grab bag of tips which you may find useful
• I have converted real-world events into these case studies
• The result is a set of puzzle-solving labs which I predict you’ll enjoy

After all, it is more fun to trouble-shoot someone else’s issues …


Caveats

I have made a ceaseless effort not to ridicule, not to bewail, not to scorn human actions, but to understand them.

--Baruch Spinoza

Anything worth doing is worth doing badly.

--Marshall Rosenberg

The first principle is that you must not fool yourself -- and you are the easiest person to fool.

--Richard Feynman

Doubt is uncomfortable; certainty is absurd.

--Voltaire

The goal of education is to make up for the shortcomings in our instinctive ways of thinking about the physical and social world.

--Steven Pinker


My World View


Confidence & Knowledge

[Figure: Confidence (vertical axis, from Doubt to Certainty) plotted against Knowledge (horizontal axis, from Little to Lots), with the labels Newbie and Jedi]

Ignorance more frequently begets confidence than does knowledge. --Charles Darwin

As I age, I increasingly value the following from myself and my colleagues:

• I don’t know
• I made a mistake
• Here’s how I will clean up the mess I made

I predict that you will follow many blind avenues during RCAs … I wish you success in keeping shoshin, aka, beginner’s mind, as you wander along your path …


Music to My Ears

Science is not truth; it is, instead, a method for diminishing ignorance.

--J.M. Adovasio, Olga Soffer, Jake Page

A scientific theory accurately describes a large class of observations and makes definite predictions about future observations that are falsifiable, i.e. could be disproven by observation.

--Derived from Stephen Hawking

Credible explanations grow from the combined testimony of three more or less independent, mutually reinforcing sources -- explanatory theory, empirical evidence, and rejection of competing alternative explanations.

--Edward Tufte

I recommend Tufte’s day-long seminar as an introduction to critical thinking --sk

My Biases

http://www.edwardtufte.com/tufte/courses



Quantum Mechanics

http://xkcd.com/1240/

Any structured approach for identifying the contributors to an IT service disruption

There is no such thing as a Root Cause … nevertheless, Root Cause Analysis remains a useful tool

RCA is not complete until we’ve applied the fix and verified that the problem is resolved

Business reality: competing priorities distract us from completing RCAs

Most folks use the term RCA to refer to a post-mortem process … I use the term in its ITIL sense, tightly bound to Problem Management

How Complex Systems Fail – Richard Cook
A Few Thoughts on Uptime – me


What is Root Cause Analysis?

http://www.ctlab.org/documents/How Complex Systems Fail.pdf

http://www.skendric.com/philosophy/uptime/A-Few-Thoughts-on-Uptime.pdf

Why do I claim there is no such thing as a Root Cause? Consider the server which goes down; your monitoring system pages you; you investigate. Turns out the power supply died – you replace the power supply, the server reboots, everyone is happy again. Then, you notice that the second power supply is dead, too. Turns out your monitoring system wasn’t checking power supplies when the first one fried a few months ago. Why wasn’t your monitoring system checking power supplies? Because it can’t – and upgrading to the newer version which can costs time & money – your management looked at the costs, weighed the risks, and decided to spend your time and those dollars on upgrading the aging e-mail server, which was close to collapse. Why doesn’t your department have enough staff and money to upgrade both the e-mail server and the monitoring server? Because management has to juggle the costs of IT against the costs of core business requirements – both of which look critical from different vantage points.

So what’s the Root Cause? A failed power supply? An inadequate monitoring system? Insufficient process in your leadership’s prioritization tactics, that they let the aging e-mail system stumble along for far too long? Insufficient resources to meet both core business requirements and IT requirements? Not enough market for your product, which is why you don’t have sufficient resources to meet both sets of needs?

Still not convinced? Why have you lost two power supplies across as many months? Because your local utility is straining to meet demand in your area and frequently inflicts brownouts, which age power supplies prematurely. Why hasn’t the utility beefed up capacity in your area? Because that would cost money, and politicians are reluctant to approve the rate increases necessary to support an expansion, given current voter sentiment. Why are voters annoyed at politicians? … Reality is complex: There is no such thing as Root Cause …


Why No Root Cause?

Oh boy, that’s a big question. But let’s take a stab at answering it. A tech might start asking themselves, or the person reporting the problem, questions similar to the following:

• What makes you think there is an issue?
• What are you expecting that you’re not getting?
• Has it ever performed well?
• What changed recently? Software or hardware? Load?
• Can it be expressed in terms of latency or run time?
• Does the problem affect other people or applications?
• What is the environment? What software and hardware is used? Versions? Configuration?
• …

Most issues get fixed somewhere during the process of asking these questions and uncovering the answers …


How Do Techs Fix Issues?

As the issue resists resolution, less skilled techs will start employing less effective approaches.

Street Lamp Method

The student comes across his professor on the Arts Quad at night, down on his hands & knees, staring at the sidewalk. “What are you doing, sir?” “Looking for my car keys”. The student joins the professor but after looking unsuccessfully in widening circles, asks him “Do you recall precisely where you were when you dropped the keys?” “Yes, over there, in the middle of the quad” points the professor, toward the dimly perceived middle of the grassy acre. “Well, why are you looking here?” asks the student. “Because the light is better here” responds the professor.

More formally:
1. List available tools
2. Examine the output of each one, looking for clues
3. Purchase more tools
4. Goto #1

Use The Force, Luke

“I know that we are experiencing a broadcast storm … you should check your {switch | router | firewall | server | client | application | whatever-belongs-to-some-other-group}”

I enjoyed Star Wars … but it was fiction … that distinction is hard for human brains to make. --sk


Anti-Patterns

The issue typically gets escalated to a more experienced tech. I have yet to be satisfied with an account of what an experienced human does when working in their field of expertise. That said, here is one way to express what might be happening.

For every Resource, check Utilization, Saturation and Errors.

Intended to be used early in a performance investigation, to identify systemic bottlenecks.

Terminology definitions:
• Resource: all physical server functional components (CPUs, disks, busses, …)
• Utilization: the average time that the resource was busy servicing work
• Saturation: the degree to which the resource has extra work which it can’t service, often queued
• Error: the count of error events

Stuart’s version:
1. Scan the logs, looking for error messages (Errors)
2. Are requests waiting in queues? (Saturation)
3. How busy are the boxes? (Utilization)

I am cribbing from Brendan Gregg: http://dtrace.org/blogs/brendan/2012/02/29/the-use-method
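The checklist above can be sketched in a few lines of Python. This is a minimal illustration, not Brendan Gregg's tooling: the Resource class, the use_findings function, the 0.9 busy threshold, and the sample readings are all hypothetical, invented here to show the order of checks in Stuart's version (Errors, then Saturation, then Utilization).

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """One physical component (CPU, disk, bus ...) with its three USE readings."""
    name: str
    utilization: float  # fraction of time busy, 0.0 to 1.0
    saturation: int     # queued work the resource could not yet service
    errors: int         # count of error events


def use_findings(resources, busy_threshold=0.9):
    """Return (resource name, reason) pairs, checking Errors, Saturation, Utilization in that order."""
    findings = []
    for r in resources:
        if r.errors > 0:
            findings.append((r.name, "errors"))
        elif r.saturation > 0:
            findings.append((r.name, "saturation"))
        elif r.utilization >= busy_threshold:  # threshold is an assumption, tune per site
            findings.append((r.name, "utilization"))
    return findings


# Hypothetical readings, invented for illustration only.
sample = [
    Resource("cpu0", utilization=0.35, saturation=0, errors=0),
    Resource("disk1", utilization=0.97, saturation=12, errors=0),
    Resource("eth0", utilization=0.10, saturation=0, errors=3),
]
print(use_findings(sample))
```

Each resource is reported once, under the first check it fails, which keeps the list short enough to read early in an investigation.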


The USE Method

http://www.brendangregg.com/



But Not Today

Most problems get solved using any number of techniques, a few of which I sketched in the previous slides

But that’s not what I will be pushing you to do today

I will be pushing you to employ a methodology called Rapid Problem Resolution (RPR) ®

RPR is an evidence-based process … it is a heavy process … it is a sledgehammer. Sledgehammers are generally overkill …

But for a certain class of problems – the ones which have defeated experienced techs for weeks, months, or years – sledgehammers offer plenty of value

The case studies in this class belong to that class of problems

I will push you to employ RPR. You may resist. That’s OK

The official goal of this class is to introduce you to RPR

This workshop borrows heavily from the Rapid Problem Resolution® methodology codified by Paul Offord of Advance7, which fits into ITIL’s Problem Management schema.

I’ve slashed Advance7’s 19 step approach into 9 steps. This makes the methodology less effective but teachable in a single day. And suitable for smaller RCAs.

RPR is not a silver bullet. It is merely a tool for your tool bag, like ping, top, PerfMon …

There are no silver bullets.

Life is pain, Highness. Anyone who says differently is selling something.

--The Man in Black


Rapid Problem Resolution ®

Derived from the Rapid Problem Resolution® methodology

1. Understand the Symptoms
2. Choose One Symptom
3. Draw the Diagram
4. Design Capture Plan
5. Capture Diagnostic Data
6. Analyze Captured Data
7. Identify Fix
8. Implement Fix
9. Verify Fix

RCA Methodology


Phase 1

Phase 2

Phase 3


Notes on the Nine Steps

1. Humans want instant gratification: we start trouble-shooting before we understand the problem. Resist that urge.
2. Natural desire to want to fix everything fast – myself, I rarely succeed when I try. Be particularly wary of thrashing: jumping from one symptom to another. Pick One Symptom, One Symptom only, and stick to it.
3. Common to start trouble-shooting before understanding the environment. Draw the Diagram and Sit with the User. You may discover that you didn’t understand the Symptom, in which case, start over.
4. As you learn more about the Environment and make mistakes in your capture methodology, you’ll cycle through Steps #4-#6 numerous times. This is normal. As you become more experienced, you’ll spend more time on #3 and fewer times cycling through #4-#6.
5. If the problem is intermittent, you can spend a lot of time waiting here. That is reality.
6. Naturally, you need time to think about the data you capture.
7. At some point, you exit the #4-#6 loop because you think you understand what is happening and you have identified a fix.
8. You apply the fix.
9. Key step: verify that your fix actually works. If it doesn’t, start over.
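The control flow in these notes can be sketched in Python. This is an illustration only: the function run_rca and its two callback parameters are hypothetical names, and the sketch models just two of the jumps described above (cycling through #4-#6 until the cause is understood, and starting over when the fix fails verification). It does not model starting over from Step #3.

```python
STEPS = [
    "Understand the Symptoms", "Choose One Symptom", "Draw the Diagram",
    "Design Capture Plan", "Capture Diagnostic Data", "Analyze Captured Data",
    "Identify Fix", "Implement Fix", "Verify Fix",
]


def run_rca(cause_understood, fix_verified, max_steps=100):
    """Walk the nine steps and return the list of step numbers visited.

    cause_understood(n): called after the n-th pass through Step 6;
        returning True exits the #4-#6 loop.
    fix_verified(): called at Step 9; returning False restarts at Step 1.
    max_steps: safety limit so a never-verified fix cannot loop forever.
    """
    visited = []
    step = 1
    passes = 0
    while len(visited) < max_steps:
        visited.append(step)
        if step == 6:
            passes += 1
            step = 7 if cause_understood(passes) else 4
        elif step == 9:
            if fix_verified():
                return visited
            step = 1
        else:
            step += 1
    return visited


# Example: the cause becomes clear on the second pass through Step 6,
# and the fix verifies on the first attempt.
path = run_rca(lambda n: n >= 2, lambda: True)
print(path)
print([STEPS[s - 1] for s in path][-1])
```

The printed path makes the cost of each extra capture cycle visible: every pass through the loop adds three more steps before a fix can even be proposed.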

RCA Roles & Responsibilities

Who: Facilitator (often a Problem Manager)
Accountable for:
o Owns the RCA
o Acquire resources
o Use and execute the methodology
o Communicate within the team
o Report & escalate to leadership
o Schedule meetings

Who: Problem Analyst (often a senior engineer)
Responsible for:
o Unify & synthesize information from SMEs
o Keep team on track technically
o Breadth & depth

Who: Subject Matter Experts
Responsible for:
o Strong fundamental knowledge of area
o Facilitating access
o Capturing data
o Analyzing

SME Desirable Characteristics
Skills / Predilections:
o Problem solving skills
o Inquiring mind – passion for understanding how things work
o Determination & stamina – pursuing a tough problem can be wearing
o T-shaped – broad background in IT with specialization in one or two particular areas

The Problem Solving Group (aka RCA Team) consists of the Facilitator, the Problem Analyst, and one or more Subject Matter Experts

Process-oriented person

Sees the forest, not the trees

Respected / trusted by SMEs

Like getting their hands dirty


Draw the Diagram
Design Capture Plan

[Diagram: components exchanging Request and Response traffic, including a Fibre Channel Switch and supporting services (DNS, LDAP, NIS …)]

Who talks to whom?
Where to insert probes?
Where to gather logs / debug output?

We will work through case studies – real situations drawn from my experience at FHCRC – alternating between small group and seminar style sessions.

Typically, we will oscillate in 15-30 minute increments – spending 15-30 minutes together as a class, working privately in our small groups for 15-30 minutes, coming together for 15-30 minutes …

Course materials on the USB stick include packet traces, log extracts, trending charts, ‘show’ output from clients, servers, switches/routers, storage systems, captured during the actual RCA.


How Does This Class Work?

Whirlwind tour: At the Hutch, we typically spent weeks of an RCA team’s time on these cases – in this workshop, we will just taste each experience, merely touching on key points – we will not have time to dig through any of them in detail.

Variable expertise: As a group, we differ wildly in our expertise – some of us have never seen Wireshark before, have never touched an Ethernet switch or a storage array. I will play to a range of levels: sometimes you may be bored, sometimes you may be drowning.

We will not finish: I do not expect to reach all the case studies. We may not even get through the first one – it contains a lot of material – all depends on where your curiosity leads us.


Expectations

Detours: Using your questions as cues, I will stop the flow of the course and explore related topics: how striping affects the performance of arrays, how TCP Window works, how to perform a particular function in Wireshark.

Contribute: If you have expertise to contribute, please speak up – group dialogue contributes to learning.

Methodology: I will be a stickler for the RPR Methodology and will attempt to push you into following it, following each step in order. Naturally, you may choose to resist. I’m OK with dissent and rebellion – you know yourself better than I do – if you’ll learn better doing things differently, ignore me + blaze your own trail.


More Expectations

Red Herrings: I will include data and clues which are irrelevant to solving the problem … that’s what happened to us, so I intend to share the pain.

Misinformation: When I am wearing a hat, I may give you inaccurate information, based on the limitations of the person whose role I am playing. When I am bare-headed, I am playing the role of the instructor and will try to describe reality as accurately as I know how.

Chaos: I am trying to recreate the fog of war, the confusion of a real-world situation: practicing ways to bring order from chaos is a deep lesson of this class


Great Expectations

Embarrass me: I make mistakes – find them and point them out. I’d rather feel embarrassed and learn than feel comfortable and remain ignorant.

Embarrass yourself: Take risks, ask dumb questions, reveal your ignorance. If you don’t understand my answer, ask again. This is your laboratory, a safe place for you to learn. Ex ignorantia ad sapientiam, e luce ad tenebras.

Data: The USB stick contains data – packet traces, ‘show’ output, screen shots – as you work through the scenario and ask for data, I will point you to the relevant directory. If you get stuck, feel free to poke around.

Results Folders: The USB stick also contains the answers to the case studies in folders named Results. I recommend avoiding the Results folder until we’re done for the day.

Wave me down: If you are stuck and thrashing, wave me down – I’m happy to assess where you are and offer you direction to get you unstuck


Recommendations

We are about to walk through the Example Case.

Questions up to this point?


Questions?


Example Case

1. Understand the Symptoms

2. Choose One Symptom

3. Draw the Diagram

4. Design Capture Plan

5. Capture Diagnostic Data

6. Analyze Captured Data

7. Identify Fix

8. Implement Fix

9. Verify Fix

Results

Server Disconnects Telnet Client

The End-User (Angie) keeps getting disconnected from the Server (Ingres). This has been going on for a while; Angie has a high-profile job and a high-profile boss; management has spun up a Root Cause Analysis team and assigned you and a Desktop Tech (Bob) to the team. Bob explains to you that he has been working the issue for several weeks, that a Router is causing the problem, and that he needs help finding and fixing the Router.

We start with 15 minutes together focused on Methodology Step #1: Understand the Symptoms


Walk Through an Example Case

Questions for the Desktop Tech

You: What do you know about Angie?
Bob: She is a power user located in the Fairview Building, runs Windows XP and the Attachmate Reflection terminal emulator.

You: What do you know about the Server?
Bob: It is a Unix server called Ingres located in the Yale data center and run by the Sys Admin Rick.

You: How long has the problem been occurring?
Bob: Several weeks, happens multiple times per day, no pattern.


#1 Understand the Symptoms

Questions for End-User

You: When did this start?
Angie: It has happened for years, but I didn’t bother to report it because, until several weeks ago, I hardly used Ingres. Now, I spend all day in it, and the problem is really annoying.

You: What do you notice?
Angie: Multiple times per day, I get disconnected and have to log back in.

You: See any patterns?
Angie: Not really. Sometimes I’m typing along and get disconnected. Sometimes, I turn back to my machine or unhide Reflection and see that I’ve been disconnected.




You: What do you do with this application?
Angie: I enter data into the FALCON database. The forms from which I acquire the data are irregular – requires a lot of interpretation. Sometimes, I spend time looking up related cases in other databases or calling relevant people on the phone for input. Sometimes, I just type like a mad woman. Sometimes, I run reports – it’s really annoying when a report takes half an hour to run and I get disconnected just before it finishes, because then I have to re-run the report.




You: When you’re typing like a mad woman, how long before you get disconnected?
Angie: I figure I get 45 minutes. That’s my guess – I figure I get disconnected every 45 minutes. I might be wrong about that – I haven’t timed it or anything. But if I’ve been logged in for half an hour or so and need to run a report, I generally wait until I get disconnected, log back in, and then run the report immediately.



Questions for the Sys Admin

You: What can you tell me about Angie’s problem?
Rick: Got me. It can’t be my server: Ingres has about 40 users and 10 developers, and Angie is the only person reporting this problem. They all use the Reflection SSH client.

You: What can you tell me about Ingres?
Rick: It is an HP Alpha server running OpenVMS located here in the D5 data center. It runs the Ingres database manager (can you tell by its name?) Angie uses the FALCON database: everyone uses FALCON; it’s the most popular database we offer.



Questions for the Sys Admin

You: How often does Angie have this problem?
Rick: Seems to me that Angie gets disconnected every hour or two; I’ve checked the server configuration – I haven’t configured a timeout: everyone gets unlimited access as long as they want.

You: What do your logs say?
Rick: Not much. Angie has called me plenty of times, right after getting disconnected, but all the Alpha logs say is: “Username angie: Client disconnected”



Split: If this were a real case, we would split into our small groups. You have 15 minutes.

Choose: Your first task in small group is to select one and only one symptom on which to focus. In this example, it’s pretty easy – there’s only one symptom. In future cases, this task will be harder – there will be many symptoms. Generally, I recommend picking either the easiest to analyze, the easiest to replicate, or the most costly to the business.

Phrase: Find a precise way to phrase the symptom. Example: Angie gets intermittently disconnected from Ingres.


#2 Choose One Symptom

This will involve asking IT staff technical questions about the environment – this is where I start swapping hats (End-User, HelpDesk, Desktop, Sys Admin, Network, Database, Security, Vendor, Manager …), depending on the group to which you address the question

Ideally, the Ops staff already have this diagram and keep it updated as they make changes … but in my experience, only the most mature shops manage this

Sometimes, we identify the cause during the process of diagramming!

There’s a lot of experience & judgment here – what to include, what not to include

Focus on the components which surround the Symptom you have picked and how they relate to one another: dependencies.

If you solve a problem without drawing a diagram, you got lucky.


#3 Draw the Diagram


Diagram for Example Case

This is done in small group; you have 15 minutes. In this step, you figure out how you’ll gather the data you identified in the previous step.

Typically, you will want to gather logs and/or metrics from applications and operating systems as well as insert sniffers

As much as possible, I will also support your performing ‘show’ commands, grepping through logs, trending parameters across time, rebooting devices …


#4 Design Capture Plan

Example Data Capture Plan

1. Plug sniffers into lf-esx and d5sr-esx and SPAN Angie’s port and Daffy’s port, filtering on Angie’s IP address

2. Enable debug tracing on Angie’s copy of Reflection, gather both syslog and Ingres logs on Daffy

3. Validate capture set-up by asking Angie to ssh into Daffy, then verifying that we can see Angie’s login in all logs and packet traces

4. Sit with Angie and watch her work for a day, precisely recording the times when she gets disconnected

5. While we’re waiting: Gather ‘show port’ output from Angie’s and Daffy’s switch ports plus version and configuration information (idle timer setting) from Reflection
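Item 4 of the plan above produces a list of disconnect times, which lets us test Angie's guess of "every 45 minutes" against Rick's "every hour or two". A minimal Python sketch follows; the function name disconnect_intervals and the sample timestamps are hypothetical, invented for illustration, and are not data from the case.

```python
from datetime import datetime
from statistics import mean


def disconnect_intervals(timestamps):
    """Return the minutes between consecutive disconnects.

    timestamps: 'HH:MM:SS' strings, all recorded on the same day.
    """
    times = sorted(datetime.strptime(t, "%H:%M:%S") for t in timestamps)
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(times, times[1:])]


# Hypothetical observations, invented for illustration; not data from the case.
observed = ["09:12:00", "10:14:30", "11:13:30", "13:40:00"]
gaps = disconnect_intervals(observed)
print(gaps)
print(round(mean(gaps), 1))
```

If the gaps cluster tightly around one value, a timer somewhere is a reasonable suspect; widely scattered gaps point elsewhere. Comparing each disconnect time against what Angie was doing (typing, idle, running a report) is the natural next step.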


#4 Design Capture Plan

This is done as a class. The instructor executes each group’s Diagnostic Capture Plan and returns the resulting information.

Each group benefits from hearing the results of every group’s Diagnostic Capture Plan.

Typically 15 minutes.

In this example, the instructor returns:
• Reflection debug trace
• Packet Captures
• Logs
• Angie & Daffy’s Ethernet port statistics
• Reflection Version & Settings (idle timer)


#5 Capture Diagnostic Data

Angie’s Ethernet Port Stats

lf-esx#sh ver

[…]

lf-esx uptime is 3 years, 3 weeks, 5 days, 12 hours, 44 minutes

[…]

lf-esx#sh int Fa2/19

FastEthernet2/19 is up, line protocol is up (connected)

Hardware is Fast Ethernet Port, address is 0011.21f5.46c2 (bia 0011.21f5.46c2)

MTU 1500 bytes, BW 100000 Kbit, DLY 100 usec,

reliability 255/255, txload 1/255, rxload 1/255

Encapsulation ARPA, loopback not set

Keepalive set (10 sec)

Full-duplex, 100Mb/s, link type is auto, media type is 10/100BaseTX

input flow-control is unsupported output flow-control is unsupported

ARP type: ARPA, ARP Timeout 04:00:00

Last input 00:00:19, output never, output hang never

Last clearing of "show interface" counters never

Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0

Queueing strategy: fifo

Output queue: 0/40 (size/max)

5 minute input rate 0 bits/sec, 0 packets/sec

5 minute output rate 4000 bits/sec, 6 packets/sec

161282073 packets input, 48475519613 bytes, 0 no buffer

Received 2004674 broadcasts (1689326 multicasts)

0 runts, 0 giants, 0 throttles

0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored

0 input packets with dribble condition detected

831253443 packets output, 116132425387 bytes, 0 underruns

0 output errors, 0 collisions, 0 interface resets

0 babbles, 0 late collision, 0 deferred

1 lost carrier, 0 no carrier

0 output buffer failures, 0 output buffers swapped out

lf-esx#
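Scanning output like the above by eye is error-prone when there are many ports to check. The following Python sketch pulls a few error counters out of 'show interface' text with regular expressions. The function name parse_counters and the choice of counters are assumptions made for illustration; the excerpt is copied from the output above, and other platforms may word these lines differently.

```python
import re

# Counter names as they appear in the output above; the selection is illustrative.
COUNTERS = ["input errors", "CRC", "frame", "runts", "giants",
            "output errors", "collisions", "late collision", "lost carrier"]


def parse_counters(show_int_text):
    """Extract the selected counters from 'show interface' text as {name: int}."""
    found = {}
    for name in COUNTERS:
        match = re.search(r"(\d+) " + re.escape(name) + r"\b", show_int_text)
        if match:
            found[name] = int(match.group(1))
    return found


# Excerpt copied from the FastEthernet2/19 output above.
excerpt = """
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 output errors, 0 collisions, 0 interface resets
0 babbles, 0 late collision, 0 deferred
1 lost carrier, 0 no carrier
"""
nonzero = {name: count for name, count in parse_counters(excerpt).items() if count}
print(nonzero)
```

Because the counters have never been cleared and the switch has been up for three years, a single lost carrier is a small number; the sketch only surfaces it, and judging whether it matters remains a human call.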


If the error counters were high, perhaps we have a bad NIC | cable | switch port … but they are zero or close enough. Rule out bad physical layer

Packet Trace


daffy = Ingres

Angie abruptly hangs up (TCP RST) on Daffy (aka Ingres). Looks like Angie initiated the disconnect

Example-Case/client-shuts-down-ssh-session-trace.png
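The question "who hung up on whom?" can also be answered from a text export of the trace rather than a screenshot. The Python sketch below assumes the capture was exported with something like `tshark -r trace.pcap -T fields -E separator=, -e frame.time_relative -e ip.src -e ip.dst -e tcp.flags.reset`, where trace.pcap is a hypothetical file name. The function name first_reset and the sample rows are likewise invented for illustration, and the addresses are documentation addresses, not the hosts in this case.

```python
import csv
import io


def first_reset(rows_text):
    """Return (time, src, dst) for the first packet with TCP RST set, or None.

    rows_text: comma-separated lines holding frame.time_relative, ip.src,
    ip.dst, tcp.flags.reset. The flag may be exported as 1/0 or True/False
    depending on the tshark version, so both forms are accepted.
    """
    for row in csv.reader(io.StringIO(rows_text)):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        time, src, dst, rst = row
        if rst.strip() in ("1", "True"):
            return (float(time), src, dst)
    return None


# Hypothetical rows, invented for illustration only.
sample = (
    "0.000,192.0.2.10,192.0.2.20,0\n"
    "12.500,192.0.2.20,192.0.2.10,0\n"
    "2700.125,192.0.2.10,192.0.2.20,1\n"
)
print(first_reset(sample))
```

The source address of the first RST identifies the side that tore down the connection, which is the same conclusion the screenshot supports for Angie's client.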


Reflection Settings


Reading the manual tells us that Connection Setting Timeout is an Idle Timer. And that a choice of ‘0’ for this timer means ‘unlimited’, i.e. never disconnect, no matter how long the user remains idle. Dang, we really wanted to see a setting of, oh, 60 minutes here

Application Version


Research tells us that the latest patch level for Reflection 14 is v14.0.7. And the latest version for this train of Reflection of 14.1.188 SP1. Angie is running an old version

Back to small group; you have 30 minutes to analyze the data you have acquired and

In real life, you will likely cycle through Steps #4 - #6 multiple times.

Feel free to continue to #7 Identify Fix when you are ready.

Your team consults together … hmm …
• The Ethernet port shows trivial errors, so that looks fine.
• The packet trace shows Angie initiating the disconnect
• Reflection settings show an unlimited idle timer
• We’re running an old version of Reflection … probably full of bugs

#6 Analyze Captured Data


At some point, you believe you’ve identified the cause; now you can develop a fix.

#7 Identify Fix


Your team says: We know that Attachmate has shipped numerous updates to Reflection – the latest version is 14.0.7. We propose to upgrade Angie’s copy to the latest version.

We reconvene as a class. Each group proposes its fix, and the instructor reports the results of the fixes.

In this example, Bob doesn’t want to upgrade – he wants to keep all his users at the same revision.

Instead, he uninstalls and re-installs Reflection.


#8 Implement Fix

We remain regrouped as a class and review the results of the Fixes. In this case, Angie runs for a week without any disconnects.

Bob doesn’t want to invest more time into this, so we quit.

Ideally, we would re-image Angie’s machine and verify that the problem returned … as scientists, we realize that we have demonstrated correlation, but not cause and effect.


#9 Verify Fix

We declare the Problem resolved, with an undefined Root Cause – something related to Angie’s local Application configuration which got reset when the Application was re-installed. We have no explanation for why this only affected Angie and not any of the other 55 users.

In a perfect world, we would re-image Angie’s machine and verify that the problem returned … in the real world, we did not implement that last step of RPR, which requires that we Verify the Fix …

As a Problem Manager, you are responsible for ensuring that management hears the risk they have adopted by skipping this step.


Results

• For the rest of our day we cycle between small group and large group

• In large group, you ask questions; in small group, you analyze

• I am available for questions and coaching during both

Questions about the mechanics of what we will be doing?

Questions about the 9 step RCA process?


End of Example Case


#1 Split
In a moment, you will split into groups of 3-6 people

#2 Assign Roles
I recommend assigning roles & responsibilities, e.g.

Facilitator               Tracks who is doing what, spokesperson
Problem Analyst           Big picture
Subject Matter Experts    Sys admin, network, storage …

Successful teams divide & conquer the material …
Ideally, one person per role …

#3 Pick Name
Pick a cool name for your group; write it on one of the name plates

You have 5 minutes – go

Split into Small Groups


HPC Cluster Woes (morning)
Storage Stumbles (afternoon)


Case Studies

Researchers submit tissue samples to the Genome Sequencing Shared Resource. A full sequencing run takes days – the sequencers dump the resulting files, typically dozens to hundreds of gigabytes, onto the server Fred. From there, users run custom code on the High Performance Computing (HPC) cluster Hyrax to analyze the results. They write their own code, typically in a mix of Perl, Python, and R, and tweak this code regularly, as they explore various avenues of inquiry. The cluster has a handful of heavy users (daily or weekly), plus several dozen light users (monthly).

The scheduler behind Hyrax submits jobs to the nodes which comprise the cluster, keeping track of various parameters, like how many nodes a given researcher owns (condo-model), which nodes are already busy, how much time a given job has already consumed, and so on. Some jobs finish in minutes, some take hours, others take days or even weeks -- this is normal.

A few of the nodes are unusual: they are large memory nodes, typically equipped with 64 GB of RAM plus fast processors; they are named RhinoX and OrcaX (e.g. rhino1, rhino2, rhino3 … orca1, orca2, orca3 …)


HPC Cluster Woes I

In the summer of 2011, the Hutch hired a promising young researcher, Robert Bradley (aka rbradley), who had recently completed postdoctoral work at MIT. Bradley analyzes alternative splicing, a process by which a single gene contributes to producing multiple protein isoforms – a normal event in cells and one which plays an important role in various diseases, including cancers. Bradley’s work makes heavy use of large memory HPC machines.

By September, Bradley had transferred his data and code from MIT to the Hutch; almost immediately, he started encountering problems. Interactive ssh sessions to Rhino/Orca stall, sometimes for seconds, minutes, perhaps even hours. Nodes hang for minutes at a time, with no progress on the job. Nodes crash and must be rebooted. Jobs crash and must be restarted.

This is not the kind of service we want to offer anyone, much less a new recruit.

Management input:
You cannot talk to End-Users, rbradley in particular.
You can ask their Desktop Support staff questions, and they will answer as best they can.
Scheduling downtime on any of the storage systems is hard.


HPC Cluster Woes II

Welcome to the first meeting of the Rhino PSG, on Wednesday November 2nd 2011. The meeting kicks off with the Ops team delivering their briefing. Read it. Understand nls

We start the RPR Methodology working together as a class.

Step #1: Understand the Problem
What questions do you want to ask of the various constituencies?

Database, Desktop, Helpdesk, Manager, Network, Server, Storage, User, Vendor …

After Understand the Problem, we will separate into small groups and proceed with:

Step #2: Choose One Symptom


Set the Stage


Large group | Small group


RCA Methodology
Derived from Advance7’s Rapid Problem Resolution® methodology


Phase 1

Phase 2

Phase 3

What aspect of the process would you like to review?

What section of a diagram or graph would you like to explore?

What hunk of data would you like to re-examine?

Which link in the chain of reasoning doesn’t make sense to you?

Additional questions?

This is your opportunity to consolidate your learning.


Q&A


PGP


During the first decade of the new century, our storage silos start to multiply – Hitachi, IBM, Compaq, Dell, NetApp … in January 2010, after a year-long project, we go live with Consolidated Storage, an attempt to reduce staff / capital costs and meet future storage needs via a single system. Consolidated Storage consists of a clustered NetApp V3170 providing SMB v1/v2, NFS v3/v4, and iSCSI access to a backend 3Par T800, containing 528 SATA drives (both 1 and 2TB). The T800 is a wide-striped system, meaning that every LUN it offers has been striped across all 528 drives. We estimated that the ~600TB usable space on CS would last us until mid-2012. By March 2010, almost all that space has been allocated …

Consolidated Storage services vColo, our VMware farm (~600 guests residing on ~7 hosts), along with hundreds of servers (HPC nodes, database, custom applications) and thousands of desktop clients (home and shared directories).

By the summer of 2010, CPU utilization on Tungsten-A pegs, at which point performance problems severely impact daily use. We convert Tungsten-B from Standby to Active. Early in 2011, CPU on both systems pegs; we launch an emergency project to purchase another NetApp onto which we offload a particularly IO intensive application.


Storage Stumbles - Background

Severity: Major (unplanned)
Start: Wednesday, March 23, 2011 16:31
Stop: -
Duration: ongoing
Scope: 3Par storage and systems reliant on it (NetApps, vColo, others)

Description: The 3Par system experienced a drive failure, which caused a large latency spike. One of the NetApp heads subsequently lost access to the 3Par and initiated a failover to the other NetApp head. We are currently consulting with the vendors involved in order to determine next steps.

Service/User Impact: A number of systems have been impacted, including Zimbra, various Internet Services web servers, Outlook Web Access (partial), many others.

Technician/IT Operations Group performing work: xxx xxx, Center IT, InfraOps
----------

Cleaning this up took a week. After much discussion, NetApp determined that our storage admin had followed an incorrect procedure some months earlier, in retiring several LUNs. This left ‘ghost’ traces behind which, for unknown reasons, triggered the latency spike and subsequent head failover.


Storage Stumbles – First Event

Severity: Major (unplanned)
Start: Tuesday, January 10, 2012 12:55
Stop: ongoing
Duration: ongoing
Scope: Consolidated Storage (Tungsten-A & Tungsten-B)

Description: Tungsten-A failed over to Tungsten-B due to a triple disk failure on the 3Par disk system. InfraOps is looking into the issue and will be working with system owners across the Center to get their machines up and running. We currently have a call open to NetApp. Right now all resources are running on Tungsten-B except for PHSDATA Aggregate 1 and ADMHOME. There may be a performance degradation since all the resources are running on a single head. When Tungsten-A is back to a stable condition, we will be scheduling a give-back of resources to that head. In the meantime we strongly suggest that all systems owners turn off any non-critical guests as this will help alleviate the load.

Service/User Impact: All services running on Tungsten-B have been impacted by the failover, including Zimbra, the Enterprise SQL clusters, EMS ... Tungsten-B cannot see a couple of disks, therefore there are resources that will be affected – known resources are PHSDATA Aggregate 1 and ADMHOME directories.

Technician/IT Operations Group performing work: xxx xxx, Center IT, InfraOps


Storage Stumbles – Second Event

Severity: Major (unplanned)
Start: Wednesday, February 01, 2012 16:30
Stop: ongoing
Duration: ongoing
Scope: Consolidated Storage (Tungsten-A)

Description: Tungsten-A failed over to Tungsten-B due to a disk failure on the 3Par disk system. Infrastructure Operations is looking into the issue and will be working with system owners across the Center to get their machines up and running.

Service/User Impact: All services running on Tungsten-A have been impacted by the failover. CIT is currently working to resolve the issue and will keep communications open as we go forward.

Technician/IT Operations Group performing work: xxx xxx, Center IT, InfraOps


Storage Stumbles – Third Event


Storage Stumbles – Ops Team Input

The latency-sensitive VMs (typically old versions of SuSE for which we do not know how to adjust disk timeout parameters) running in vColo regularly complain about disk access, flag their file systems as read-only, and require reboots … this has been going on since the summer of 2010, when we moved chunks of vColo to Tungsten.

Our Telemetry charts show read/write latency spikes on the T800 whenever it fries a disk … and it fries far more disks than any other system at the Center.

Review your printed copy of Ops Team Briefing Storage Stumbles, along with the contents of the Diagrams folder (that folder includes the Timeline which management wants you to build – we’ve done that for you).

See Storage-Stumbles/Report-to-Management.ppt for the format in which management likes to see reports. Remember, IT mgmt is mostly composed of business folks, not technologists. You may want to produce two reports: one aimed at us – where you get to explain all the cool stuff – and the other report aimed at mgmt.

Make the mgmt report the Fisher Price version: keep it simple, speak their language, one page only.


Storage Stumbles – Mgmt Direction

You have the CIO’s attention: These events have knocked out most IT services company-wide for hours.

The CIO and his team expect our systems to behave predictably – your job is to figure out why they don’t. For example, if Consolidated Storage is suffering so badly, why do some systems float through the event without issue while others crash and require days of recovery? Why does Cobalt fry so many disks per year while our other systems don’t lose any?

1. Sanity-check Cobalt disk failure rate against industry averages
2. Build a Timeline for the week surrounding the Incident
3. Explain why Cobalt disk maintenance seems to trigger Tungsten failovers
4. Explain why Tungsten failovers do not go smoothly
5. Explain why different clients/services behave differently
6. Explain the impact of Tungsten CPU utilization on our failover capabilities
7. Describe how we fumbled the communication to end-users
8. Propose next steps


Your Mission

There is way too much in this case study for you to tackle over the next few hours

Focus on one, maybe two, of the eight tasks; produce a one-page report for mgmt
Feeling frisky? Produce a 1-2 page report aimed at the tech-savvy managers

Each task below is followed by a hint and the course-files folder to consult.

1. Sanity-check Cobalt disk failure rate against industry averages
   Lots of googling & reading
   Folder: Misc/History-of-Cobalt-Frying-Disks

2. Build a Timeline for the week surrounding the Incident
   Ops team has done that already
   Folder: Diagrams

3. Explain why Cobalt disk failure triggers Tungsten failovers
   Complex, requires a sophisticated understanding of SCSI
   Folder: Incidents

4. Explain why Tungsten failovers do not go smoothly
   Hard but interesting
   Folder: Incidents

5. Explain why different clients/services behave differently
   Involves a rich understanding of various clients, protocols, and HA schemes
   Folder: -

6. Explain the impact of Tungsten CPU utilization on our failover capabilities
   I predict that you’ll learn a thing or two about ONTAP
   Folder: Misc/Tungsten-Struggles

7. Describe how we fumbled the communication to end-users
   Let the Ops Team do that

8. Propose next steps
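For task 1, one way to frame the comparison is an annualized failure rate (AFR): failed drives divided by drive-years in service. The sketch below is only an illustration of that arithmetic. The drive count of 528 comes from the T800 description in the background (assuming Cobalt is that array); the failure count is a made-up placeholder, to be replaced with the real number from the course files, and the function name is hypothetical.

```python
def annualized_failure_rate(failed_drives: int, total_drives: int, years: float) -> float:
    """Return failures per drive-year (0.05 means 5% of drives fail per year)."""
    if total_drives <= 0 or years <= 0:
        raise ValueError("total_drives and years must be positive")
    return failed_drives / (total_drives * years)


DRIVES = 528                 # drive count from the Consolidated Storage background
hypothetical_failures = 20   # placeholder only; NOT a figure from the case study

afr = annualized_failure_rate(hypothetical_failures, DRIVES, years=1.0)
print(f"AFR: {afr:.1%}")
```

Compare the resulting percentage against whatever published drive-reliability figures your googling turns up; the sketch deliberately makes no claim about what the industry average is.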

Welcome to the first meeting of the Storage Stumbles PSG, today is January 12, 2012. The meeting kicks off with the Ops team delivering their briefing. Read it.

We start the RPR Methodology working together as a class.

Step #1: Understand the Problem
What questions do you want to ask of the various constituencies?

Database, Desktop, Helpdesk, Manager, Network, Server, Storage, User, Vendor …

After Understand the Problem, we will separate into small groups and proceed with:

Step #2: Choose One Symptom


Set the Stage


Large group | Small group


RCA Methodology
Derived from Advance7’s Rapid Problem Resolution® methodology


Phase 1

Phase 2

Phase 3

What aspect of the process would you like to review?

What section of a diagram or graph would you like to explore?

What hunk of data would you like to re-examine?

Which link in the chain of reasoning doesn’t make sense to you?

Additional questions?

This is your opportunity to consolidate your learning.


Q&A



Real Programmers

2013-04-14

Tech Flash: Sanity-Checking Throughput Claims

Sometimes, I’ll hear a tech claim that we need a fatter WAN pipe, because the file copy | backup job | database synchronization | whatever is slow:

“I’m only getting 400 MB/hour to Chicago: we need to rent a fatter network pipe.”

Well, we geeks often confuse ourselves when translating between bits per second and bytes per second … this whole performance zone is a popular place for error … and fatter WAN pipes are expensive. Let’s sanity check this claim.

Name                 Bit Rate      Effective Data Rate*
Vanilla Ethernet     10Mb/s        1MB/s
Fast Ethernet        100Mb/s       10MB/s
Gigabit Ethernet     1000Mb/s      100MB/s
Ten Gig Ethernet     10000Mb/s     1000MB/s (aka 1GB/s)

Assume we have a 100Mb/s pipe to Chicago: would buying a fatter one help?

*These numbers constitute an easily remembered rule-of-thumb: well-tuned clients/servers can actually deliver 10-15% better than this
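The sanity check can be sketched as a few lines of arithmetic: convert the claimed 400 MB/hour into MB/s and Mb/s, then compare it with the rule-of-thumb rate for a 100Mb/s pipe. The function names are hypothetical; the rates come from the table above.

```python
# Rule-of-thumb effective data rates from the table, in MB/s.
EFFECTIVE_MB_PER_S = {
    "Vanilla Ethernet": 1,
    "Fast Ethernet": 10,
    "Gigabit Ethernet": 100,
    "Ten Gig Ethernet": 1000,
}


def mb_per_hour_to_mb_per_s(mb_per_hour: float) -> float:
    """Megabytes per hour -> megabytes per second (3600 seconds per hour)."""
    return mb_per_hour / 3600.0


def mb_per_s_to_mbit_per_s(mb_per_s: float) -> float:
    """Megabytes per second -> megabits per second (8 bits per byte)."""
    return mb_per_s * 8.0


def utilization(observed_mb_per_s: float, pipe_mb_per_s: float) -> float:
    """Fraction of the pipe's rule-of-thumb capacity actually in use."""
    return observed_mb_per_s / pipe_mb_per_s


observed = mb_per_hour_to_mb_per_s(400)        # the tech's claim
pipe = EFFECTIVE_MB_PER_S["Fast Ethernet"]     # our 100Mb/s pipe to Chicago

print(f"Observed: {observed:.3f} MB/s = {mb_per_s_to_mbit_per_s(observed):.2f} Mb/s")
print(f"Pipe utilization: {utilization(observed, pipe):.1%}")
```

By this arithmetic the job moves roughly 0.11 MB/s, about 1% of what the existing pipe can carry by rule of thumb, which suggests the bottleneck probably lies somewhere other than raw bandwidth.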


Tips & Tools

Wiggly Charts are Overrated

Validate the Diagram

Rich Pingers

Draw the Pie

When to Use Full-Blown RPR

Musings on IT Architecture


Wiggly Charts are Overrated

My Solexa run failed last night shortly before midnight. You can see that Fred’s switch port was extremely busy then, far busier than usual, and you can also see the IO spike which happened at the same time. Therefore, Fred needs a 10GigE NIC and faster disks.

Your task:
• Think of ways to support this
• Think of ways to refute it


Validate the Diagram

Send a sample transaction from one end of the infrastructure to another, capturing along the way. If your diagram is accurate, you’ll see the transaction at every single capture point

If you don’t see that transaction … then you know your diagram is inaccurate: return to Draw the Diagram

Once you’ve validated the diagram, you are positioned to capture the pathology you’re investigating

Sample transaction: Write a test file, update a database record, send a Rich Ping … in each case, include easy-to-spot ASCII
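A crude way to check each capture point for the marker, sketched here under some loud assumptions: the transaction travels in cleartext, the ASCII marker fits inside a single packet, and the capture files are small enough to read into memory. Capture files store packet bytes verbatim, so a plain byte search is enough for a first pass. The function names and the capture-point names are hypothetical.

```python
from pathlib import Path


def marker_seen(capture_path: str, marker: bytes) -> bool:
    """True if the raw bytes of the capture file contain the ASCII marker."""
    return marker in Path(capture_path).read_bytes()


def missing_capture_points(capture_files: dict, marker: bytes) -> list:
    """Return the capture points whose files do NOT contain the marker.

    capture_files maps a capture-point name (e.g. "client-tap") to a file path.
    An empty result means the marker showed up everywhere, i.e. the diagram held.
    """
    return [point for point, path in capture_files.items()
            if not marker_seen(path, marker)]
```

If the list comes back non-empty, those are the spots where the diagram and reality disagree; Wireshark's Find Packet remains the better tool for a closer look.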


TextPing
http://www.packetiq.com/Tools/PacketIQ-TextPing.aspx


Command Line

Send a TCP port 2049 frame to server.company.com
host> echo "Starting NFS Mount now --marker" | nc -4 -w 1 server.company.com 2049
C:\Temp> echo "Starting NFS Mount now --marker" | ncat -4 -w 1 server.company.com 2049
For Windows, install the open source ncat utility http://www.insecure.org, part of the Nmap distribution

Send a UDP port 666 frame to server.company.com
host> echo "Starting app now --marker" | nc -4 -w 1 -u server.company.com 666
C:\Temp> echo "Starting app now --marker" | ncat -4 -w 1 -u server.company.com 666

Create a file; the name of the file will appear in Wireshark’s Summary screen
host> touch /mnt/whatever/slowness-starting-now--marker.txt
C:\Temp> copy /y nul z:slowness-starting-now--marker.txt

Drop the message into /var/log/syslog on loghost (the remote-host flag varies by logger implementation; -n is the util-linux spelling)
host> logger -n loghost.company.com -- "slowness starting now --marker"
C:\Temp> logger -l loghost.company.com slowness starting now --marker
For Windows, install the freeware logger utility http://www.monitorware.com/logger

Drop the message into the Web server’s logs:
host> wget http://www.company.com/slowness-starting-now--marker.html
C:\Temp> wget http://www.company.com/slowness-starting-now--marker.html
For Windows, install the open source GNU wget utility


Ping

In a pinch, you can use ping, manually maintaining a written table associating ping packet length to message:

C:\Temp> ping -n 1 -l 101 server.company.com
host> ping -c 1 -s 101 server.company.com

Ping Packet Length    Event
101 bytes             Mounting file system
102 bytes             Starting application
103 bytes             Slowness beginning now

Or, depending on your filters, ping a fake host … the ping won’t show up in the trace, but the failed DNS query will:

host> ping www.slowness-starting-now--marker.com
C:\Temp> ping www.slowness-starting-now--marker.com


Send-UDP-Msg

vishnu> ./send-udp-msg -m "This is a test ping" rhino1 rhino2 rhino3

vishnu>

Or write your own … here’s mine: http://www.skendric.com/app
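If you do write your own, a minimal sketch might look like the following. This is not the send-udp-msg script itself, just an illustration of the idea: one UDP datagram carrying an easy-to-spot ASCII marker to each host. The function name is hypothetical, and port 666 is borrowed from the UDP example on the Command Line slide.

```python
import socket


def send_udp_marker(message: str, hosts, port: int = 666) -> int:
    """Send one UDP datagram containing the ASCII marker to each host.

    Returns the number of datagrams handed to the network stack. UDP is
    fire-and-forget, so this says nothing about whether anyone received them.
    """
    payload = message.encode("ascii")
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for host in hosts:
            sock.sendto(payload, (host, port))
            sent += 1
    return sent


# Example (hosts are placeholders):
#   send_udp_marker("slowness starting now --marker", ["rhino1", "rhino2", "rhino3"])
```

Nothing needs to listen on the far end; the point is only that the frame crosses your capture points with text you can search for.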


Many problems are intermittent – you set your debugs and packet captures going and then wait hours/days/weeks for the issue to reoccur. How might one capture across such long time frames?

Ring Buffer

Most capture utilities will produce a ring buffer of files. In this example, dumpcap writes the captured frames to a file named in the following way:

server-side_00001_20130325120842.pcap

where the first field is a serial number and the second field encodes the date/time at which this capture file was started. After it has captured 50,000 kB (roughly 50 MB; dumpcap’s filesize value is specified in kilobytes), it gets started on the next file:

server-side_00002_20130325120958.pcap

where the serial number has incremented and the timestamp records the start time of this file. Dumpcap will repeat until it has written 10,000 files, whereupon it will start deleting the oldest files, in order to limit the number of files to 10,000.

Windows
dumpcap -i 1 -b files:10000 -b filesize:50000 -f "ip host 10.1.2.3 and not ip host 192.168.20.30" -w c:\temp\cabrini\server-side.pcap

Linux
/usr/sbin/dumpcap -i eth1 -b files:10000 -b filesize:50000 -f "ip host 10.1.2.3 and not ip host 192.168.20.30" -w /home/skendric/cabrini/server-side.pcap &
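Before launching a long-term capture, it is worth checking how much disk the ring buffer can consume once it fills. A quick sketch of that arithmetic, assuming dumpcap's filesize unit is 1 kB = 1000 bytes (older releases may count 1024); the function name is hypothetical.

```python
def ring_buffer_bytes(files: int, filesize_kb: int, bytes_per_kb: int = 1000) -> int:
    """Worst-case disk footprint of a full ring buffer, in bytes."""
    return files * filesize_kb * bytes_per_kb


total = ring_buffer_bytes(files=10_000, filesize_kb=50_000)
print(f"Full ring buffer: about {total / 1e9:.0f} GB")
```

With the parameters shown above that works out to roughly 500 GB, so make sure the target volume has the room, or shrink files: or filesize: accordingly.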

Long-Term Captures - CLI


Or perhaps you prefer the GUI

Long-Term Captures - GUI


2013-11-05 Root Cause Analysis | Sharkfest 2013 | Stuart Kendrick 101

Extract Packets

OK, so now you have 10,000 files, and you want to look at an incident which occurred between noon and 1:00pm on 2013-03-25

The names of the files allow you to focus on the window in question, so you copy those to a working directory. But that can still be a lot of files, in a busy environment. Perhaps you realize that you only care about DNS frames. I write little scripts to extract the interesting packets and merge them into a single file.

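Windows

@echo off
setlocal ENABLEDELAYEDEXPANSION
mkdir c:\temp\cabrini\extract
cd /d c:\temp\cabrini
REM Extract the DNS frames from each capture file (tshark 1.10 and later: -Y is the single-pass display filter)
FOR /F %%a IN ('dir /b *.pcap') DO (
  echo Processing %%a
  tshark -r %%a -Y "udp.port==53 or tcp.port==53" -w extract\%%a-filtered.pcap
)
cd /d c:\temp\cabrini\extract
REM Build a space-separated list of the filtered files
set filelist=
FOR /F %%a IN ('dir /b *.pcap') DO (
  set "filelist=!filelist! %%a"
)
REM Merge the filtered files into a single capture
mergecap -w c:\temp\cabrini\server-side-extract.pcap %filelist%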

Linux
http://www.skendric.com/problem/rca/extract-frames
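The first step, picking the ring-buffer files that cover the incident window, can be sketched from the file names alone. This is an illustration rather than the extract-frames script: it assumes the serial_timestamp naming shown on the Ring Buffer slide, and it treats each file as covering the time from its own start until the next file's start. The function names are hypothetical.

```python
import re
from datetime import datetime

# Matches names like server-side_00001_20130325120842.pcap
NAME_RE = re.compile(r"_(\d+)_(\d{14})\.pcap$")


def capture_start(filename: str):
    """Return the start time encoded in the file name, or None if it doesn't match."""
    m = NAME_RE.search(filename)
    if m is None:
        return None
    return datetime.strptime(m.group(2), "%Y%m%d%H%M%S")


def files_in_window(filenames, start: datetime, end: datetime):
    """Return the files that may hold packets captured between start and end."""
    stamped = sorted((capture_start(f), f) for f in filenames
                     if capture_start(f) is not None)
    selected = []
    for i, (began, name) in enumerate(stamped):
        next_began = stamped[i + 1][0] if i + 1 < len(stamped) else None
        if began > end:
            continue          # file started after the window closed
        if next_began is not None and next_began <= start:
            continue          # file was finished before the window opened
        selected.append(name)
    return selected
```

For the incident on the slide you would call files_in_window(names, datetime(2013, 3, 25, 12, 0), datetime(2013, 3, 25, 13, 0)) and then hand the result to tshark and mergecap as above.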

Client / Network / Server Pie

[Pie chart: Contribution to the Problem]
Client (user, application, memory, CPU, disk …): 5s
Network (switches, routers, firewalls, VPN tunnels …): 120s
Server (application, OS, memory, CPU, storage …): 15s

I find drawing the CNS Pie useful when analyzing performance issues: How much does the Client contribute to the Problem? The Network? The Server?
http://www.skendric.com/app
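Turning the three measurements into pie slices is simple arithmetic; a sketch using the 5s / 120s / 15s figures from the chart (the function name is hypothetical):

```python
def cns_shares(client_s: float, network_s: float, server_s: float) -> dict:
    """Return each tier's share of the total delay, as a percentage."""
    total = client_s + network_s + server_s
    if total <= 0:
        raise ValueError("total delay must be positive")
    parts = (("Client", client_s), ("Network", network_s), ("Server", server_s))
    return {name: 100.0 * secs / total for name, secs in parts}


shares = cns_shares(client_s=5, network_s=120, server_s=15)
for name, pct in shares.items():
    print(f"{name:8s} {pct:5.1f}%")
```

In this example the Network slice comes to roughly 86% of the 140 seconds, which tells you where to point the capture gear first.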

Rule-of-thumb

Application architecture: 1,000x
• SQL, query optimization, caching, system calls

Server & Storage Configuration: 100x
• Disk striping, spindle tiering, paging, NFS tuning

Application fine-tuning: 2-10x
• Threads, asynchronous I/O

Kernel tuning: less than 2x
Caveats:
• If kernel bottleneck is present, then 10-100x
• Kernel can be a binary performance gate

Version 3.10, Copyright 1994-2007 Hal Stern, Marc Staveley, System & Network Performance Tuning, LISA 2007


Tuning Potential

Advance7 is a consulting outfit which helps customers resolve critical Problems – they put an analyst at your site to coordinate your staff plus vendors to fix the issue, using the RPR methodology.

They designed RPR to work against Grey Problems.

Most Problems are not Grey … unless the Problem is Grey, RPR is overkill.

So what are Grey Problems?

The following slides are cribbed from Advance7 materials -- full credit to Paul Offord & colleagues.


Rapid Problem Resolution ®


The Grey Problem

[Quadrant chart: columns are Single Incident and Recurring Problem; rows are Technology Known and Technology Unknown]

Q1: Single Incident, Technology Known
• Change-related Cause
• Hardware Failure
• Software Failure
• Misconfiguration
• Operations Error

Q2: Recurring Problem, Technology Known
• Intermittent Hardware Failure
• Known Error
• Intermittent Software Failure

Q3: Single Incident, Technology Unknown
• User Error
• Operations Error
• Rare Software or Hardware Error

Q4: Recurring Problem, Technology Unknown
• Intermittent Application Error
• Poor Application Logic
• Transient Overload
• Intermittent Infrastructure Error
• Incorrect Failover Operation

The majority of issues that are passed to 2nd and 3rd line technical support teams are investigated in a straightforward manner. The nature of the issue or an indication from a monitoring system identifies the failing component and the issue is allocated to the correct technical support team. Q1: the bulk of support work falls into this area. Q2 is harder but tends to be resolved by experienced support staff. Q3 is tough; we tend not to solve these.

An intermittent response-time or error issue is not so easily handled due to its transient nature. Not only does the cause sneak under the radar of monitoring systems, but investigation often starts after the issue has passed, making it impossible to use many of the tools available. The result is a recurring problem where the causing technology is unknown: Q4, aka the Grey Problem. The Rapid Problem Resolution methodology targets Q4.

Q1 Q2

Q3 Q4


Grey Problem Characteristics

Because the causing technology is unknown, a grey problem will bounce between Technical Support Teams as each in turn produces evidence (often in the form of a health check) to prove that their technology is not to blame.

Typical characteristics of a grey problem
• An ever-growing number of people become involved
• Long meetings to discuss what might be the cause
• Support people shy away from becoming involved
• Repeated changes with no clear reason or objective

Consequences of grey problems
• An ever-growing backlog of problems
• A fog that hinders the investigation of other, more urgent problems
• A growing pool of problems that escalate into Major Incidents as patterns of use and business priorities change
• Wasted IT budget as money is spent on poorly targeted upgrades
• Barriers to integration due to concerns about the stability of component systems
• Loss of confidence and satisfaction with the IT department
• Pressure to outsource IT services
• Reduced customer satisfaction
• Higher costs as the business adjusts to accommodate the problem
• Higher IT staffing costs

[Diagram: Service Desk, Incident Management, and Problem Management sit above four support teams (Server, Desktop, Network, Application), each backed by its Vendor; a "?" marks the problem that no team owns]


The RPR Methodology

1.2 Choose One Symptom
The single largest reason I’ve thrashed in my RCA career.

1.4 Draw the Diagram & Sit with the User
If I can’t draw it, I don’t understand it & Seeing leads to Understanding

2.2 Definitive Diagnostic Data
Insert capture gear at critical points along the path, synchronize time using a distinctive transaction, capture data simultaneously from all points while replicating the pathology.
Hardest to implement; most likely to make you successful.


Key Elements of RPR

Full methodology
http://www.skendric.com/problem/rca/RPR-RCA-Methodology.pdf

Checklist
http://www.advance7.com/misc/rpm_wb.html

Manual
http://www.advance7.com/information/publications


RPR References


• Business stakeholders want features
• We IT geeks love to turn every fancy knob

The result is complexity
Complexity is the enemy of uptime … and the raison d’être for RCA

Insights from our gurus
Increasingly, people seem to misinterpret complexity as sophistication, which is baffling - the incomprehensible should cause suspicion rather than admiration. --Niklaus Wirth

Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. --Brian Kernighan

KISS


Musings on Architecture
http://www.skendric.com/philosophy/uptime

Folks ask me how I learn this stuff
Key contributors to my path: Mentorship, Failure, Independent Study, Training

Mentorship
I have had the good fortune to work for highly skilled people who have mentored me.
• During 1991-1993, I joined my boss on Saturdays … worked as his gofer boy … he used the opportunity to train me
• In the mid-2000s, we hired Mike Pennacchi to coach us. Mike came on-site once/month for a half-day; we brought whatever problem was troubling us to the session; Mike would not solve it for us … rather, he would coach us through solving it. We did this for ~three years before budget contraction interfered

Failure
I have had the good fortune to work for bosses who believe that we learn through mistakes … “Fail early and often” … I’ve learned a lot this way

Independent Study
I set aside a slot every week (mostly!) to practice what I’ve learned, push myself to learn something new … typically a weekend morning

Training
I have had the good fortune to work for bosses with training budgets … I typically spend a couple weeks per year in classes … I occasionally augment this using my own shekels


Musings on Skill

Humans (all living creatures) are wired for fast-twitch: our nervous systems respond rapidly to the unusual, not to the mundane. So a bomb explodes, kills three people: that fires our adrenalin … but the annual toll of smoking (~500,000 per year in the US alone) drifts past our consciousness without a quiver.

Of course we’re wired this way … that’s how we stayed alive on the African savannah: by paying attention to the howl of the hyena, rather than to the gradual constriction of our arteries

But the result is that we have trouble paying attention to slow-twitch threat … to saving for retirement or a rainy day, daily exercise, spending time with our family … investing in the power grid, roads, bridges, anything which seems like a long way off …

Ditto with IT – we focus on the glitzy new projects, ignore the underpinnings … until the infrastructure breaks catastrophically … that drama fires our nervous systems, then we pay attention (for a while)

I don’t have a solution for this design flaw (trade-off) in our brains … but it does keep me employed, as a Problem Manager and a Problem Analyst. If we maintained our infrastructure (shrank technical debt), many of our RCAs would not occur


Musings on Nervous Systems

It has been said that man is a rational animal
All my life I have been searching for evidence which could support this

--Bertrand Russell

Your brain will be predisposed to certain answers and will cling to them, blinding you to reality

Definitive Data Capture is RPR’s effort to counteract this tendency

I wish you success in scrabbling for rationality


This is Hard


Insight


On-Line Resources
Rapid Problem Resolution by Paul Offord http://www.amazon.com/Rpr-Problem-Diagnosis-Method-Professionals/dp/1447844432
LinkedIn Protocol Analysis & Troubleshooting Group http://www.linkedin.com/groups?gid=1116847&trk=hb_side_g
Old Comm Guy http://www.lovemytool.com

Trouble-shooting & Training Outfits Based Here (will travel for $$)
James Baxter http://www.packetiq.com Daytona Beach, FL
Tony Fortunato http://www.thetechfirm.com Toronto, Canada
Chris Greer http://www.packetpioneer.com Central America
Paul Offord http://www.advance7.com London (international)
Mike Pennacchi http://www.nps-llc.com Seattle, WA
Ray Tompkins http://www.gearbit.com Austin, TX
…

Conferences
Sharkfest http://www.sharkfest.org Berkeley, CA

Follow-up
stuart.kendrick.sea {at} gee mail dot com

Thank you
