2013-11-05 Root Cause Analysis Beginner | LISA 2013 | Stuart Kendrick
Root Cause Analysis – Beginner
A Hands-on Tutorial
Your Pre-Flight Check List
1. Write your first name on the card stock, display prominently
2. Locate the courseware on the USB stick
3. Grab the latest version of the slide deck, dated 2013-11-05
http://www.skendric.com/seminar/rca/Root-Cause-Analysis-Beginner-Deck.pdf
4. Configure Wireshark columns (see p.5 of this presentation)
5. Introduce yourself to your potential teammates: figure out who will play which roles
6. Examine the diagrams on the walls
Copyright Stuart Kendrick ©2013 All Rights Reserved
Introduction
Example Case
Split into Small Groups
Case Studies
• Remote Office Bumps
• Many Applications Crash
Tips & Tools
Wrap-up
Workshop Outline
Introduction
Mechanics
Me and My Biases
What is Root Cause Analysis?
How Does This Class Work?
Recommendations
Mechanics
We use Google Docs … you don’t need an account: I will provide links
9:00 – 10:30 Class
10:30 – 11:00 Break
11:00 – 12:30 Class
12:30 – 13:30 Lunch
13:30 – 15:00 Class
15:00 – 15:30 Break
15:30 – 16:30 Class
16:30 – 17:00 Wrap-up
Ask questions whenever you want
Your Laptop
• has Internet connectivity
• can display & search PDF, PNG, TXT, XLS
• Wireshark configured per next slide
Configure Wireshark Columns
• Use a recent version of Wireshark … 1.10.0 at a minimum – I recommend the latest and greatest
• If you are an experienced Wireshark user, feel free to ignore this and use your favorite column choices
• If you are really experienced and prefer a different analyzer, feel free to use it
You really want Delta time displayed
And Custom (tcp.stream) will be helpful
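To make the two recommended columns concrete: a delta-time column shows the gap since the previous displayed packet, and tcp.stream numbers each TCP conversation. The following is a minimal Python sketch of that arithmetic, not Wireshark's own code. The packet tuples and the function name delta_times are invented for illustration.

```python
def delta_times(packets, stream=None):
    """Return the gap in seconds since the previous displayed packet.

    packets: list of (timestamp_seconds, tcp_stream_index) tuples (made-up data).
    stream:  if given, only packets from that stream count as displayed,
             similar in spirit to filtering on a single tcp.stream value.
    """
    shown = [t for t, s in packets if stream is None or s == stream]
    if not shown:
        return []
    # The first displayed packet has no predecessor, so its delta is 0.
    return [0.0] + [round(b - a, 6) for a, b in zip(shown, shown[1:])]


# Hypothetical capture: two interleaved conversations (streams 0 and 1).
packets = [(0.0, 0), (0.25, 1), (0.5, 0), (2.5, 0), (3.0, 1)]
print(delta_times(packets))            # gaps across all packets
print(delta_times(packets, stream=0))  # gaps within stream 0 only
```

A long gap inside a single stream, which is easy to miss when conversations are interleaved, is the kind of clue these two columns are meant to surface.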
Multi-disciplinary IT trouble-shooter / Root Cause Analysis
http://www.skendric.com
sbk@cornella | student | 1981
stuart@cpvax5 (Science Applications Inc) | programmer
[email protected] | desktop / server
[email protected] | server / network
[email protected] | multidisciplinary | 1993
stuart.kendrick {at} isi lon dot com | sustaining engineer | 2014
IT Architect | ITIL Problem Manager | Problem Analyst | Device Monitoring | Transport
Geeky Highlights
PL/1 on IBM mainframes | Cornell University | Ithaca | 1981
FORTRAN on CRAY-1 | SAIC | San Diego | 1984
Terak, DisplayWriter, IBM PC, Macintosh | Cornell University | Ithaca | 1985
Netware, Corvus Omninet, TCP-IP / IPX / AppleTalk | Cornell University | Ithaca | 1988
AppleShare, QuickMail, Farallon, NRC, Cisco, Sniffers | Cornell Medical College | Manhattan | 1991
Solaris, Windows, Linux, Perl, SNMP, Wireshark, Cisco, Fluke | FHCRC | Seattle | 1993
OneFS | EMC Isilon | Seattle | 2013
Me
You are a mid-level engineer
Perhaps you function as a sys admin, network engineer, database admin, or developer
Perhaps you support desktops and want to expand into another space
Perhaps you work for a small outfit and are a jack-of-all-trades
You look at logs regularly when tackling a problem; perhaps you’ve even looked at packet traces, though without nearly as much success as you would like. You’re curious about how things work and you’re persistent: you beat your head against a problem, trying to solve it from various angles.
You are here because you want a chance to tackle problems on your own and then receive coaching on techniques for analyzing packet traces, extracting insights from performance charts, and correlating log entries from multiple devices.
Or … perhaps you are a people or process person – resource manager, project manager, ITIL Problem Manager. You don’t have the skills to analyze bits & bytes, but you want to practice a problem solving methodology. You’ll help keep your team on track, coordinating subject matter experts, bringing the results together for reports to the larger class.
You
• I do not claim to be good at trouble-shooting
• I do not claim to know how to teach trouble-shooting
• I am not the smartest or fastest guy on the block
However …
• I have ~30 years experience in this business
• I have trained under gurus
• I have accumulated a grab bag of tips which you may find useful
• I have converted real-world events into these case studies
• The result is a set of puzzle-solving labs which I predict you’ll enjoy
After all, it is more fun to trouble-shoot someone else’s issues …
Caveats
I have made a ceaseless effort not to ridicule, not to bewail, not to scorn human actions, but to understand them.
--Baruch Spinoza
Anything worth doing is worth doing badly.
--Marshall Rosenberg
The first principle is that you must not fool yourself -- and you are the easiest person to fool.
--Richard Feynman
Doubt is uncomfortable; certainty is absurd.
--Voltaire
The goal of education is to make up for the shortcomings in our instinctive ways of thinking about the physical and social world.
--Steven Pinker
My World View
Confidence & Knowledge
[Chart. Vertical axis: Confidence, from Doubt to Certainty. Horizontal axis: Knowledge, from Little to Lots. Labels: Newbie, Jedi.]
Ignorance more frequently begets confidence than does knowledge. --Charles Darwin
As I age, I increasingly value the following from myself and my colleagues:
• I don’t know
• I made a mistake
• Here’s how I will clean up the mess I made
I predict that you will follow many blind avenues during RCAs … I wish you success in keeping shoshin, aka, beginner’s mind, as you wander along your path …
Music to My Ears
Science is not truth; it is, instead, a method for diminishing ignorance.
--J.M. Adovasio, Olga Soffer, Jake Page
A scientific theory accurately describes a large class of observations and makes definite predictions about future observations that are falsifiable, i.e. could be disproven by observation.
--Derived from Stephen Hawking
Credible explanations grow from the combined testimony of three more or less independent, mutually reinforcing sources -- explanatory theory, empirical evidence, and rejection of competing alternative explanations.
--Edward Tufte
I recommend Tufte’s day-long seminar, as an introduction to critical thinking --sk
My Biases
Quantum Mechanics
http://xkcd.com/1240/
Any structured approach for identifying the contributors to an IT service disruption
There is no such thing as a Root Cause … nevertheless, Root Cause Analysis remains a useful tool
RCA is not complete until we’ve applied the fix and verified that the problem is resolved
Business reality: competing priorities distract us from completing RCAs
Most folks use the term RCA to refer to a post-mortem process … I use the term in its ITIL sense, tightly bound to Problem Management
How Complex Systems Fail – Richard Cook
A Few Thoughts on Uptime – me
What is Root Cause Analysis?
Why No Root Cause?
Why do I claim there is no such thing as a Root Cause? Consider the server which goes down; your monitoring system pages you; you investigate. Turns out the power supply died – you replace the power supply, the server reboots, everyone is happy again. Then, you notice that the second power supply is dead, too. Turns out your monitoring system wasn’t checking power supplies when the first one fried a few months ago. Why wasn’t your monitoring system checking power supplies? Because it can’t – and upgrading to the newer version which can costs time & money – your management looked at the costs, weighed the risks, and decided to spend your time and those dollars on upgrading the aging e-mail server, which was close to collapse. Why doesn’t your department have enough staff and money to upgrade both the e-mail server and the monitoring server? Because management has to juggle the costs of IT against the costs of core business requirements – both of which look critical from different vantage points.
So what’s the Root Cause? A failed power supply? An inadequate monitoring system? Insufficient process in your leadership’s prioritization tactics, that they let the aging e-mail system stumble along for far too long? Insufficient resources to meet both core business requirements and IT requirements? Not enough market for your product, which is why you don’t have sufficient resources to meet both sets of needs?
Still not convinced? Why have you lost two power supplies across as many months? Because your local utility is straining to meet demand in your area and frequently inflicts brownouts, which age power supplies prematurely. Why hasn’t the utility beefed up capacity in your area? Because that would cost money, and politicians are reluctant to approve the rate increases necessary to support an expansion, given current voter sentiment. Why are voters annoyed at politicians? … Reality is complex: There is no such thing as Root Cause …
Oh boy, that’s a big question. But let’s take a stab at answering it. A tech might start asking themselves, or the person reporting the problem, questions similar to the following:
• What makes you think there is an issue?
• What are you expecting that you’re not getting?
• Has it ever performed well?
• What changed recently? Software or hardware? Load?
• Can it be expressed in terms of latency or run time?
• Does the problem affect other people or applications?
• What is the environment? What software and hardware is used? Versions? Configuration?
• …
Most issues get fixed somewhere during the process of asking these questions and uncovering the answers …
How Do Techs Fix Issues?
As the issue resists resolution, less skilled techs will start employing less effective approaches.
Street Lamp Method
The student comes across his professor on the Arts Quad at night, down on his hands & knees, staring at the sidewalk. “What are you doing, sir?” “Looking for my car keys”. The student joins the professor but after looking unsuccessfully in widening circles, asks him “Do you recall precisely where you were when you dropped the keys?” “Yes, over there, in the middle of the quad” points the professor, toward the dimly perceived middle of the grassy acre. “Well, why are you looking here?” asks the student. “Because the light is better here” responds the professor.
More formally:
1. List available tools
2. Examine the output of each one, looking for clues
3. Purchase more tools
4. Goto #1
Use The Force, Luke
“I know that we are experiencing a broadcast storm … you should check your {switch | router | firewall | server | client | application | whatever-belongs-to-some-other-group}”
I enjoyed Star Wars … but it was fiction … that distinction is hard for human brains to make. --sk
Anti-Patterns
The issue typically gets escalated to a more experienced tech. I have yet to be satisfied with an account of what an experienced human does when engaging in their field of expertise. That said, here is one way to express what might be happening.
For every Resource, check Utilization, Saturation and Errors.
Intended to be used early in a performance investigation, to identify systemic bottlenecks.
Terminology definitions:
• Resource: all physical server functional components (CPUs, disks, busses, …)
• Utilization: the average time that the resource was busy servicing work
• Saturation: the degree to which the resource has extra work which it can’t service, often queued
• Error: the count of error events
Stuart’s version:
1. Scan the logs, looking for error messages (Errors)
2. Are requests waiting in queues? (Saturation)
3. How busy are the boxes? (Utilization)
I am cribbing from Brendan Gregg: http://dtrace.org/blogs/brendan/2012/02/29/the-use-method
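As a rough illustration of the check order in Stuart's version (Errors first, then Saturation, then Utilization), here is a minimal Python sketch. It is not Brendan Gregg's tooling. The Resource class, the sample numbers, and the 0.8 busy threshold are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    utilization: float  # fraction of time busy, 0.0 to 1.0
    queued: int         # requests waiting that the resource cannot service yet
    errors: int         # count of error events


def use_check(resources, busy_threshold=0.8):
    """Return findings in the order Errors, Saturation, Utilization.

    busy_threshold is an arbitrary illustrative cut-off, not a standard value.
    """
    findings = []
    findings += [(r.name, "errors", r.errors) for r in resources if r.errors > 0]
    findings += [(r.name, "saturation", r.queued) for r in resources if r.queued > 0]
    findings += [(r.name, "utilization", r.utilization)
                 for r in resources if r.utilization >= busy_threshold]
    return findings


# Made-up sample values, for illustration only.
resources = [
    Resource("cpu", utilization=0.95, queued=0, errors=0),
    Resource("disk", utilization=0.40, queued=12, errors=0),
    Resource("nic", utilization=0.10, queued=0, errors=3),
]
for finding in use_check(resources):
    print(finding)
```

Listing errors first mirrors the idea that an explicit error message is usually the cheapest clue to interpret; the busy-ness numbers come last because they need the most context.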
The USE Method
Most problems get solved using any number of techniques, a few of which I sketched in the previous slides
But that’s not what I will be pushing you to do today
I will be pushing you to employ a methodology called Rapid Problem Resolution (RPR) ®
RPR is an evidence-based process … it is a heavy process … it is a sledgehammer. Sledgehammers are generally overkill …
But for a certain class of problems – the ones which have defeated experienced techs for weeks, months, or years – sledgehammers offer plenty of value
The case studies in this class belong to that class of problems
I will push you to employ RPR. You may resist. That’s OK
The official goal of this class is to introduce you to RPR
But Not Today
This workshop borrows heavily from the Rapid Problem Resolution® methodology codified by Paul Offord of Advance7, which fits into ITIL’s Problem Management schema.
I’ve slashed Advance7’s 19-step approach into 9 steps. This makes the methodology less effective but teachable in a single day, and suitable for smaller RCAs.
RPR is not a silver bullet. It is merely a tool for your tool bag, like ping, top, PerfMon …
There are no silver bullets.
Life is pain, Highness. Anyone who says differently is selling something.
--The Man in Black
Rapid Problem Resolution ®
Derived from the Rapid Problem Resolution® methodology
1. Understand the Symptoms
2. Pick One Symptom
3. Draw the Diagram
4. Design Capture Plan
5. Capture Diagnostic Data
6. Analyze Captured Data
7. Identify Fix
8. Implement Fix
9. Verify Fix
RCA Methodology
Phase 1
Phase 2
Phase 3
Notes on the Nine Steps
1. Humans want instant gratification: we start trouble-shooting before we understand the problem. Resist that urge.
2. Natural desire to want to fix everything fast – myself, I rarely succeed when I try. Be particularly wary of thrashing: jumping from one symptom to another. Pick One Symptom, One Symptom only, and stick to it.
3. Common to start trouble-shooting before understanding the environment. Draw the Diagram and Sit with the User. You may discover that you didn’t understand the Symptom, in which case, start over.
4. As you learn more about the Environment and make mistakes in your capture methodology, you’ll cycle through Steps #4-#6 numerous times. This is normal. As you become more experienced, you’ll spend more time on #3 and fewer times cycling through #4-#6.
5. If the problem is intermittent, you can spend a lot of time waiting here. That is reality.
6. Naturally, you need time to think about the data you capture.
7. At some point, you exit the #4-#6 loop because you think you understand what is happening and you have identified a fix.
8. You apply the fix.
9. Key step: verify that your fix actually works. If it doesn’t, start over.
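The control flow described in these notes (repeat Steps #4-#6 until a fix emerges, then start over if verification fails) can be sketched in Python. This is only an illustration of the loop structure, not part of RPR itself. The function run_rca, its parameters, and the callbacks analyze and verify are hypothetical names invented for this sketch.

```python
def run_rca(analyze, verify, max_attempts=3, max_cycles=5):
    """Walk the nine steps, looping #4-#6 and restarting if the fix fails.

    analyze(attempt, cycle) returns a proposed fix, or None to keep capturing.
    verify(fix) returns True when the problem is actually resolved.
    Both callbacks and both limits are illustrative assumptions.
    """
    log = []
    for attempt in range(1, max_attempts + 1):
        log += ["1 Understand the Symptoms", "2 Pick One Symptom", "3 Draw the Diagram"]
        fix = None
        for cycle in range(1, max_cycles + 1):
            log.append(f"4-6 Capture and analyze (cycle {cycle})")
            fix = analyze(attempt, cycle)
            if fix is not None:
                break  # exit the #4-#6 loop: we think we understand the cause
        if fix is None:
            continue  # no fix identified this time around; start over
        log += [f"7 Identify Fix: {fix}", "8 Implement Fix"]
        if verify(fix):
            log.append("9 Verify Fix: problem resolved")
            return log
        log.append("9 Verify Fix: problem persists, start over")
    return log


# Hypothetical run: the third capture cycle yields a fix, and it holds.
steps = run_rca(lambda attempt, cycle: "re-install the client" if cycle == 3 else None,
                lambda fix: True)
print("\n".join(steps))
```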
RCA Roles & Responsibilities
Who: Facilitator (often a Problem Manager)
What: Accountable for
o Owns the RCA
o Acquire resources
o Use and execute the methodology
o Communicate within the team
o Report & escalate to leadership
o Schedule meetings
Who: Problem Analyst (often a senior engineer)
What: Responsible for
o Unify & synthesize information from SMEs
o Keep team on track technically
o Breadth & depth
Who: Subject Matter Experts
What: Responsible for
o Strong fundamental knowledge of area
o Facilitating access
o Capturing data
o Analyzing
SME Desirable Characteristics
Skills / Predilections
o Problem solving skills
o Inquiring mind – passion for understanding how things work
o Determination & stamina – pursuing a tough problem can be wearing
o T-shaped – broad background in IT with specialization in one or two particular areas
The Problem Solving Group (aka RCA Team) consists of the Facilitator, the Problem Analyst, and one or more Subject Matter Experts
Process-oriented person
Sees the forest, not the trees
Respected / trusted by SMEs
Like getting their hands dirty
Draw the Diagram
Design Capture Plan
[Diagram labels: Fibre Channel Switch; Request; Response; (DNS, LDAP, NIS …)]
Who talks to whom?
Where to insert probes?
Where to gather logs / debug output?
We will work through case studies – real situations drawn from my experience at FHCRC – alternating between small group and seminar style sessions.
Typically, we will oscillate in 15-30 minute increments – spending 15-30 minutes together as a class, working privately in our small groups for 15-30 minutes, coming together for 15-30 minutes …
Course materials on the USB stick include packet traces, log extracts, trending charts, ‘show’ output from clients, servers, switches/routers, storage systems, captured during the actual RCA.
How Does This Class Work?
Whirlwind tour: At the Hutch, we typically spent weeks of an RCA team’s time on these cases – in this workshop, we will just taste each experience, merely touching on key points – we will not have time to dig through any of them in detail.
Variable expertise: As a group, we differ wildly in our expertise – some of us have never seen Wireshark before, have never touched an Ethernet switch or a storage array. I will play to a range of levels: sometimes you may be bored, sometimes you may be drowning.
We will not finish: I do not expect to reach all the case studies. We may not even get through the first one – it contains a lot of material – it all depends on where your curiosity leads us.
Expectations
Detours: Using your questions as cues, I will stop the flow of the course and explore related topics: how striping affects the performance of arrays, how the TCP window works, how to perform a particular function in Wireshark.
Contribute: If you have expertise to contribute, please speak up – group dialogue contributes to learning.
Methodology: I will be a stickler for the RPR Methodology and will attempt to push you into following it, following each step in order. Naturally, you may choose to resist. I’m OK with dissent and rebellion – you know yourself better than I do – if you’ll learn better doing things differently, ignore me and blaze your own trail.
More Expectations
Red Herrings: I will include data and clues which are irrelevant to solving the problem … that’s what happened to us, so I intend to share the pain.
Misinformation: When I am wearing a hat, I may give you inaccurate information, based on the limitations of the person whose role I am playing. When I am bare-headed, I am playing the role of the instructor and will try to describe reality as accurately as I know how.
Chaos: I am trying to recreate the fog of war, the confusion of a real-world situation: practicing ways to bring order from chaos is a deep lesson of this class.
Great Expectations
Embarrass me: I make mistakes – find them and point them out. I’d rather feel embarrassed and learn than feel comfortable and remain ignorant.
Embarrass yourself: Take risks, ask dumb questions, reveal your ignorance. If you don’t understand my answer, ask again. This is your laboratory, a safe place for you to learn. Ex ignorantia ad sapientiam, e luce ad tenebras.
Data: The USB stick contains data – packet traces, ‘show’ output, screen shots – as you work through the scenario and ask for data, I will point you to the relevant directory. If you get stuck, feel free to poke around.
Results Folders: The USB stick also contains the answers to the case studies in folders named Results. I recommend avoiding the Results folder until we’re done for the day.
Wave me down: If you are stuck and thrashing, wave me down – I’m happy to assess where you are and offer you direction to get you unstuck.
Recommendations
We are about to walk through the Example Case.
Questions up to this point?
Questions?
Example Case
1. Understand the Problem
2. Choose One Symptom
3. Draw the Diagram
4. Design Capture Plan
5. Capture Diagnostic Data
6. Analyze Captured Data
7. Identify Fix
8. Implement Fix
9. Verify Fix
Results
Server Disconnects Telnet Client
The End-User (Angie) keeps getting disconnected from the Server (Daffy). This has been going on for a while; Angie has a high-profile job and a high-profile boss; management has spun up a Root Cause Analysis team and assigned you and a Desktop Tech (Bob) to the team. Bob explains to you that he has been working the issue for several weeks, that a Router is causing the problem, and that he needs help finding and fixing the Router.
We start with 15 minutes together focused on Methodology Step #1: Understand the Symptoms
Walk Through an Example Case
Questions for the Desktop Tech
You: What do you know about Angie?
Bob: She is a power user located in the Fairview Building, runs Windows XP and the Attachmate Reflection terminal emulator.
You: What do you know about the Server?
Bob: It is a Unix server called Daffy located in the Yale data center and run by the Sys Admin Rick.
You: How long has the problem been occurring?
Bob: Several weeks, happens multiple times per day, no pattern.
#1 Understand the Symptoms
Questions for End-User
You: When did this start?
Angie: It has happened for years, but I didn’t bother to report it because, until several weeks ago, I hardly used Daffy. Now, I spend all day in it, and the problem is really annoying.
You: What do you notice?
Angie: Multiple times per day, I get disconnected and have to log back in.
You: See any patterns?
Angie: Not really. Sometimes I’m typing along and get disconnected. Sometimes, I turn back to my machine or unhide Reflection and see that I’ve been disconnected.
#1 Understand the Symptoms
Questions for End-User
You: What do you do with this application?
Angie: I enter data into the FALCON database. The forms from which I acquire the data are irregular – requires a lot of interpretation. Sometimes, I spend time looking up related cases in other databases or calling relevant people on the phone for input. Sometimes, I just type like a mad woman. Sometimes, I run reports – it’s really annoying when a report takes half an hour to run and I get disconnected just before it finishes, because then I have to re-run the report.
#1 Understand the Symptoms
Questions for End-User
You: When you’re typing like a mad woman, how long before you get disconnected?
Angie: I figure I get 45 minutes. That’s my guess – I figure I get disconnected every 45 minutes. I might be wrong about that – I haven’t timed it or anything. But if I’ve been logged in for half an hour or so and need to run a report, I generally wait until I get disconnected, log back in, and then run the report immediately.
#1 Understand the Symptoms
Questions for the Sys Admin
You: What can you tell me about Angie’s problem?
Rick: Got me. It can’t be my server: Daffy has about 40 users and 10 developers, and Angie is the only person reporting this problem. They all use the Reflection SSH client.
You: What can you tell me about Daffy?
Rick: It is an HP Alpha server running OpenVMS located here in the D5 data center. It runs the Ingres database manager. Angie uses the FALCON database: everyone uses FALCON; it’s the most popular database we offer.
#1 Understand the Symptoms
Questions for the Sys Admin
You: How often does Angie have this problem?
Rick: Seems to me that Angie gets disconnected every hour or two; I’ve checked the server configuration – I haven’t configured a timeout: everyone gets unlimited access as long as they want.
You: What do your logs say?
Rick: Not much. Angie has called me plenty of times, right after getting disconnected, but all the Alpha logs say is:
“Username angie: Client disconnected”
#1 Understand the Symptoms
Split: If this were a real case, we would split into our small groups. You have 15 minutes.
Choose: Your first task in small group is to select one and only one symptom on which to focus. In this example, it’s pretty easy – there’s only one symptom. In future cases, this task will be harder – there will be many symptoms. Generally, I recommend picking either the easiest to analyze, the easiest to replicate, or the most costly to the business.
Phrase: Find a precise way to phrase the symptom. Example:
Angie gets intermittently disconnected from Daffy.
#2 Pick One Symptom
This will involve asking IT staff technical questions about the environment – this is where I start swapping hats (End-User, HelpDesk, Desktop, Sys Admin, Network, Database, Security, Vendor, Manager …), depending on the group to which you address the question
Ideally, the Ops staff already have this diagram and keep it updated as they make changes … but in my experience, only the most mature shops manage this
Sometimes, we identify the cause during the process of diagramming!
There’s a lot of experience & judgment here – what to include, what not to include
Focus on the components which surround the Symptom you have picked and how they relate to one another: dependencies.
If you solve a problem without drawing a diagram, you got lucky.
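One way to capture "who talks to whom" in machine-readable form is a small dependency graph, sketched here in Python. The node names lf-esx and d5sr-esx are the switches named in the example capture plan, but the links between nodes are assumptions made up for this sketch, since the actual diagram is not reproduced in this text. The function name path_between is likewise hypothetical.

```python
from collections import deque

# Hypothetical links, for illustration only; the real diagram may differ.
LINKS = {
    "angie-pc": ["lf-esx"],
    "lf-esx": ["angie-pc", "d5sr-esx"],
    "d5sr-esx": ["lf-esx", "daffy"],
    "daffy": ["d5sr-esx"],
}


def path_between(links, start, goal):
    """Breadth-first search for the shortest chain of components between two nodes."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in links.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no known dependency chain connects the two


# Every hop on this path is a candidate place for a probe or a log pull.
print(path_between(LINKS, "angie-pc", "daffy"))
```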
#3 Draw the Diagram
Diagram for Example Case
This is done in small group; you have 15 minutes. In this step, you figure out how you’ll gather the data you identified in the previous step.
Typically, you will want to gather logs and/or metrics from applications and operating systems as well as insert sniffers
As much as possible, I will also support your performing ‘show’ commands, grepping through logs, trending parameters across time, rebooting devices …
#4 Design Capture Plan
Example Data Capture Plan
1. Plug sniffers into lf-esx and d5sr-esx and SPAN Angie’s port and Daffy’s port, filtering on Angie’s IP address
2. Enable debug tracing on Angie’s copy of Reflection, gather both syslog and Ingres logs on Daffy
3. Validate capture set-up by asking Angie to ssh into Daffy, then verifying that we can see Angie’s login in all logs and packet traces
4. Sit with Angie and watch her work for a day, precisely recording the times when she gets disconnected
5. While we’re waiting: Gather ‘show port’ output from Angie’s and Daffy’s switch ports plus version and configuration information (idle timer setting) from Reflection
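Item 4 of the plan produces a list of disconnect times. A minimal Python sketch of turning such a list into intervals follows, which would let the team test the end-user's guess of a disconnect every 45 minutes. The timestamps below are invented for illustration, and disconnect_intervals is a hypothetical helper name.

```python
from datetime import datetime


def disconnect_intervals(times, fmt="%H:%M"):
    """Return the minutes elapsed between consecutive recorded disconnects."""
    stamps = [datetime.strptime(t, fmt) for t in times]
    return [int((b - a).total_seconds() // 60) for a, b in zip(stamps, stamps[1:])]


# Made-up observations from one day of sitting with the user.
observed = ["09:05", "09:50", "11:20", "12:05"]
print(disconnect_intervals(observed))
```

If the intervals cluster around one value, a timer is a plausible suspect; if they scatter, the guess of a fixed period is probably wrong.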
#4 Design Capture Plan
This is done as a class. The instructor executes each group’s Diagnostic Capture Plan and returns the resulting information.
Each group benefits from hearing the results of every group’s Diagnostic Capture Plan.
Typically 15 minutes.
In this example, the instructor returns:
• Reflection debug trace
• Packet Captures
• Logs
• Angie & Daffy’s Ethernet port statistics
• Reflection Version & Settings (idle timer)
#5 Capture Diagnostic Data
Angie’s Ethernet Port Stats
lf-esx#sh ver
[…]
lf-esx uptime is 3 years, 3 weeks, 5 days, 12 hours, 44 minutes
[…]
lf-esx#sh int Fa2/19
FastEthernet2/19 is up, line protocol is up (connected)
Hardware is Fast Ethernet Port, address is 0011.21f5.46c2 (bia 0011.21f5.46c2)
MTU 1500 bytes, BW 100000 Kbit, DLY 100 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 100Mb/s, link type is auto, media type is 10/100BaseTX
input flow-control is unsupported output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:19, output never, output hang never
Last clearing of "show interface" counters never
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 4000 bits/sec, 6 packets/sec
161282073 packets input, 48475519613 bytes, 0 no buffer
Received 2004674 broadcasts (1689326 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 input packets with dribble condition detected
831253443 packets output, 116132425387 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 babbles, 0 late collision, 0 deferred
1 lost carrier, 0 no carrier
0 output buffer failures, 0 output buffers swapped out
lf-esx#
If the error counters were high, perhaps we have a bad NIC | cable | switch port … but they are zero or close enough. Rule out bad physical layer
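Scanning interface output by eye works for one port; for many ports, a small parser helps. The Python sketch below pulls a handful of error counters out of text shaped like the output above. The choice of counters and the function name error_counters are assumptions for illustration, and the regular expressions are only known to fit the lines shown here, not every device or software version.

```python
import re

# Excerpt of the counter lines from the interface output shown above.
SAMPLE = """\
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 output errors, 0 collisions, 0 interface resets
     1 lost carrier, 0 no carrier
"""

# An illustrative subset of counters worth a first look.
COUNTERS = ("input errors", "CRC", "output errors", "collisions", "lost carrier")


def error_counters(show_int_output):
    """Map each counter name found in the text to its integer value."""
    found = {}
    for name in COUNTERS:
        match = re.search(r"(\d+) " + re.escape(name) + r"\b", show_int_output)
        if match:
            found[name] = int(match.group(1))
    return found


counters = error_counters(SAMPLE)
nonzero = {name: value for name, value in counters.items() if value}
print(counters)
print(nonzero)  # anything left here deserves a closer look
```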
Packet Trace
daffy = ingress
Angie abruptly hangs up (TCP RST) on Daffy (aka Ingress). Looks like Angie initiated the disconnect
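The reasoning in this callout is: find the first packet in the conversation that tears it down, and see who sent it. A minimal Python sketch of that logic follows. The packet list is invented for illustration and is not taken from the course trace, and who_hung_up is a hypothetical helper name.

```python
def who_hung_up(packets):
    """Return the sender of the first packet carrying a RST or FIN flag.

    packets: list of (sender, set_of_tcp_flags) tuples in capture order.
    Returns None if the conversation is never torn down in the capture.
    """
    for sender, flags in packets:
        if "RST" in flags or "FIN" in flags:
            return sender
    return None


# Made-up conversation: normal traffic, then an abrupt reset.
trace = [
    ("angie", {"PSH", "ACK"}),
    ("daffy", {"ACK"}),
    ("daffy", {"PSH", "ACK"}),
    ("angie", {"RST"}),
]
print(who_hung_up(trace))
```

Knowing which side sent the reset narrows the search, though it does not by itself explain why that side gave up.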
Reflection Settings
Reading the manual tells us that Connection Setting Timeout is an Idle Timer. And that a choice of ‘0’ for this timer means ‘unlimited’, i.e. never disconnect, no matter how long the user remains idle.
Dang, we really wanted to see a setting of, oh, 60 minutes here
Application Version
Research tells us that the latest patch level for Reflection 14 is v14.0.7. And the latest version for this train of Reflection is 14.1.188 SP1. Angie is running an old version
Back to small group; you have 30 minutes to analyze the data you have acquired and
In real life, you will likely cycle through Steps #4 - #6 multiple times.
Feel free to continue to #7 Identify Fix when you are ready.
Your team consults together … hmm …
• The Ethernet port shows trivial errors, so that looks fine.
• The packet trace shows Angie initiating the disconnect
• Reflection settings show an unlimited idle timer
• We’re running an old version of Reflection … probably full of bugs
#6 Analyze Captured Data
At some point, you believe you’ve identified the cause; now you can develop a fix.
#7 Identify Fix
2013-11-05 Root Cause Analysis Beginner | LISA 2013 | Stuart Kendrick 51
Your team says: We know that Attachmate has shipped numerous updates to Reflection – the latest version is 14.0.7. We propose to upgrade Angie’s copy to the latest version.
We reconvene as a class. Each group proposes its fix, and the instructor reports the results of the fixes.
In this example, Bob doesn’t want to upgrade – he wants to keep all his users at the same revision.
Instead, he uninstalls and re-installs Reflection.
#8 Implement Fix
We remain regrouped as a class and review the results of the Fixes. In this case, Angie runs for a week without any disconnects.
Bob doesn’t want to invest more time into this, so we quit.
Ideally, we would re-image Angie’s machine and verify that the problem returned … as scientists, we realize that we have demonstrated correlation, but not cause and effect.
#9 Verify Fix
We declare the Problem resolved, with an undefined Root Cause – something related to Angie’s local Application configuration which got reset when the Application was re-installed. We have no explanation for why this affected only Angie and none of the other 55 users.
In a perfect world, we would re-image Angie’s machine and verify that the problem returned … in the real world, we did not implement that last step of RPR, which requires that we Verify the Fix …
As a Problem Manager, you are responsible for ensuring that management hears the risk they have adopted by skipping this step.
Results
End of Example Case
• For the rest of our day we cycle between small group and large group
• In large group, you ask questions; in small group, you analyze
• I am available for questions and coaching during both
Questions about the mechanics of what we will be doing?
Questions about the 9 step RCA process?
#1 Split
In a moment, you will split into groups of 3-6 people
#2 Assign Roles
I recommend assigning roles & responsibilities, e.g.
Facilitator: Tracks who is doing what, spokesperson
Problem Analyst: Big picture
Subject Matter Experts: Sys admin, network, storage …
Successful teams divide & conquer the material …
Ideally, one person per role …
#3 Pick Name
Pick a cool name for your group; write it on one of the name plates
You have 5 minutes – go
Split into Small Groups
Remote Office Bumps (morning)
Many Applications Crash (afternoon)
Case Studies
The users at a clinic on First Hill, established in the mid-1990s, have intermittently reported various issues – Outlook appointments vanishing, slow printing (print jobs take minutes to hours to appear), browser-based applications malfunctioning, faxing problems, scratchy quality on voice calls, “Not responding” in the application menu bar … “the computer is slow”. Over the years, we’ve gradually upgraded their WAN connection from BRI (128Kb/s) to PRI (1.544 Mb/s) to bonded PRI (3.588Mb/s) to Metro Ethernet (10Mb/s). And we’ve gradually upgraded their workstations through versions of Windows, Office, and browsers, replacing their PCs along the way. The upgrades have helped but have not eliminated the problems.
The research project behind the clinic has landed a new grant, which will allow them to expand their status from a Clinical Research Site to a Clinical Trials Unit – this will translate into more staff, more equipment, more participants volunteering for their studies.
Management is concerned that the expansion will exacerbate the already unreliable quality of the IT services available at this location and figures that upgrading the WAN circuit to 100 Mb, while expensive, will fix the problems. But before they sign a three year contract, they want a sanity-check: Will upgrading this circuit resolve the issues?
Remote Office Bumps
Welcome to the first meeting of the Cabrini Tower PSG; today is Friday January 11th 2013.
We start the RPR Methodology working together as a class.
Step #1: Understand the Symptoms
What questions do you want to ask of the various constituencies?
Database, Desktop, Helpdesk, Manager, Network, Server, Storage, User, Vendor …
After Understand the Symptoms, we will separate into small groups and proceed with:
Step #2: Choose One Symptom
Set the Stage
Large group | Small group
1. Understand the Symptoms
2. Choose One Symptom
3. Draw the Diagram
4. Design Capture Plan
5. Capture Diagnostic Data
6. Analyze Captured Data
7. Identify Fix
8. Implement Fix
9. Verify Fix
RCA Methodology
Derived from Advance7’s Rapid Problem Resolution® methodology
Phase 1
Phase 2
Phase 3
What aspect of the process would you like to review?
What section of a diagram or graph would you like to explore?
What hunk of data would you like to re-examine?
Which link in the chain of reasoning doesn’t make sense to you?
Additional questions?
This is your opportunity to consolidate your learning.
Q&A
PGP
http://xkcd.com/1181/
This is the last week in November 2005. Earlier this year, we bought a mass storage device – a BlueArc Titan NAS head named Indigo sitting in front of 14 TB of Fibre Channel, SATA, and ATA attached disk trays. We have been migrating home + shared directories for two divisions (~1200 staff) from a flock of aging DAS-equipped file servers onto Indigo, along with scratch space for the MIS group.
The experience has been rocky. Starting in June, an OS memory leak caused key processes to hang and sometimes even froze the head, both requiring reboots to fix. A controller fried, requiring emergency downtime for replacement. A controller firmware bug mangled a volume, leading to data loss. We have been applying hot fixes, firmware upgrades, and OS upgrades every few weeks. Starting in August, users began reporting crashing applications – notably Outlook, although Word, Excel, and other applications hang as well. The crashes are intermittent: some days are fine, some days are bad. The MIS group’s Tidal jobs fail regularly.
Backups are slow and sometimes don’t complete – we aren’t meeting our 24 hour Recovery Point Objective, and we have no confidence that we can meet our 48 hour Recovery Time Objective. Sometimes even simple file copies are slow!
Many Applications Crash I
The storage team was convinced that antivirus scanning was causing the application crashes and has worked with BlueArc for months to resolve this, finally disabling AV over Thanksgiving. However, the intermittent application crashes continued this week.
The local BlueArc team visited a few days ago and identified the Catalyst 4000 Ethernet switches as the likely culprits: “The Catalyst 4003 servicing the backup systems dates to 1998; the Catalyst 4006 servicing the Titan itself dates to 2000 – they are getting overwhelmed by traffic.”
The remaining ~1500 users who have not migrated to Indigo are watching with dismay – currently, they are unaffected, scattered as they are between small NetApp NAS heads and a flock of aging file servers.
Management has made every Sunday night in December available to you for Indigo downtime – just ask.
Many Applications Crash II
Welcome to the first meeting of the BlueHeat PSG; today is Friday December 2nd.
We start the RPR Methodology working together as a class.
Step #1: Understand the Symptoms
What questions do you want to ask of the various constituencies?
Database, Desktop, Helpdesk, Manager, Network, Server, Storage, User, Vendor …
After Understand the Symptoms, we will separate into small groups and proceed with:
Step #2: Choose One Symptom
Set the Stage
Large group | Small group
1. Understand the Symptoms
2. Choose One Symptom
3. Draw the Diagram
4. Design Capture Plan
5. Capture Diagnostic Data
6. Analyze Captured Data
7. Identify Fix
8. Implement Fix
9. Verify Fix
RCA Methodology
Derived from the Rapid Problem Resolution® methodology
Phase 1
Phase 2
Phase 3
What aspect of the process would you like to review?
What section of a diagram or graph would you like to explore?
What hunk of data would you like to re-examine?
Which link in the chain of reasoning doesn’t make sense to you?
Additional questions?
This is your opportunity to consolidate your learning.
Q&A
The Mother of All Suspicious Files
http://xkcd.com/1247/
Tech Flash: Sanity-Checking Throughput Claims
Sometimes, I’ll hear a tech claim that we need a fatter WAN pipe, because the file copy | backup job | database synchronization | whatever is slow:
“I’m only getting 400 MB/hour to Chicago: we need to rent a fatter network pipe.”
Well, we geeks often confuse ourselves when translating between bits per second and bytes per second … this whole performance zone is a popular place for error … and fatter WAN pipes are expensive. Let’s sanity check this claim.
Name               Bit Rate       Effective Data Rate*
Vanilla Ethernet   10 Mb/s        1 MB/s
Fast Ethernet      100 Mb/s       10 MB/s
Gigabit Ethernet   1000 Mb/s      100 MB/s
Ten Gig Ethernet   10000 Mb/s     1000 MB/s (aka 1 GB/s)
Assume we have a 100Mb/s pipe to Chicago: would buying a fatter one help?
*These numbers constitute an easily remembered rule-of-thumb: well-tuned clients/servers can actually deliver 10-15% better than this
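One way to work the Chicago claim through, as a rough sketch: convert the quoted 400 MB/hour into per-second units and compare it against the rule-of-thumb table above. The function names are mine, invented for illustration, and the 10:1 ratio is just the table's rule of thumb, not a measured figure.

```python
def mb_per_hour_to_mbit_per_s(mb_per_hour):
    """Convert megabytes per hour into megabits per second."""
    mb_per_s = mb_per_hour / 3600.0   # 3600 seconds in an hour
    return mb_per_s * 8               # 8 bits per byte


def effective_mb_per_s(link_mbit_per_s):
    """Rule of thumb from the table: every 10 Mb/s of bit rate carries ~1 MB/s of data."""
    return link_mbit_per_s / 10.0


observed_mb_per_hour = 400   # the tech's claim
link_mbit_per_s = 100        # the assumed pipe to Chicago

observed_mb_per_s = observed_mb_per_hour / 3600.0
utilisation = observed_mb_per_s / effective_mb_per_s(link_mbit_per_s)

print(f"Observed: {observed_mb_per_s:.3f} MB/s "
      f"({mb_per_hour_to_mbit_per_s(observed_mb_per_hour):.2f} Mb/s)")
print(f"Share of the pipe's rule-of-thumb capacity: {utilisation:.1%}")
```

Under these assumptions the copy is using roughly 1% of what the existing pipe can carry, so a fatter pipe is unlikely to be the fix; the bottleneck is probably elsewhere (latency, application behavior, disk).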
Tips & Tools
Wiggly Charts are Overrated
Validate the Diagram
Rich Pingers
Draw the Pie
When to Use Full-Blown RPR
Musings on IT Architecture
Wiggly Charts are Overrated
My Solexa run failed last night shortly before midnight. You can see that Fred’s switch port was extremely busy then, far busier than usual, and you can also see the IO spike which happened at the same time. Therefore, Fred needs a 10GigE NIC and faster disks.
Your task:
• Think of ways to support this
• Think of ways to refute it
Validate the Diagram
Send a sample transaction from one end of the infrastructure to another, capturing along the way. If your diagram is accurate, you’ll see the transaction at every single capture point
If you don’t see that transaction … then you know your diagram is inaccurate: return to Draw the Diagram
Once you’ve validated the diagram, you are positioned to capture the pathology you’re investigating
Sample transaction: Write a test file, update a database record, send a Rich Ping … in each case, include easy-to-spot ASCII
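A minimal sketch of the validation check itself, assuming you have already pulled the raw bytes of each capture file into memory (for example with pathlib's Path.read_bytes()) and that the marker travels unencrypted. The capture-point names and the function name are hypothetical.

```python
MARKER = b"slowness-starting-now--marker"  # the easy-to-spot ASCII string


def points_missing_marker(captures, marker=MARKER):
    """Return the capture points whose data does not contain the marker.

    captures maps a capture-point name to the raw bytes captured there.
    An empty result means the sample transaction showed up everywhere.
    """
    return sorted(name for name, data in captures.items() if marker not in data)


# Illustrative data only: the firewall capture never saw the transaction
captures = {
    "client": b"...GET /slowness-starting-now--marker.html...",
    "firewall": b"...unrelated traffic...",
    "server": b"...GET /slowness-starting-now--marker.html...",
}
print(points_missing_marker(captures))
```

Any capture point this reports is a hint that the diagram is wrong at that hop (or that the capture filter there is too tight), which sends you back to Draw the Diagram.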
TextPing
http://www.packetiq.com/Tools/PacketIQ-TextPing.aspx
Command Line
Send a TCP port 2049 frame to server.company.com
host> echo "Starting NFS Mount now --marker" | nc -4 -w 1 server.company.com 2049
C:\Temp> echo "Starting NFS Mount now --marker" | ncat -4 -w 1 server.company.com 2049
For Windows, install the open source ncat utility (http://www.insecure.org), part of the Nmap distribution
Send a UDP port 666 frame to server.company.com
host> echo "Starting app now --marker" | nc -4 -w 1 -u server.company.com 666
C:\Temp> echo "Starting app now --marker" | ncat -4 -w 1 -u server.company.com 666
Create a file; the name of the file will appear in Wireshark’s Summary screen
host> touch /mnt/whatever/slowness-starting-now--marker.txt
C:\Temp> copy /y nul z:slowness-starting-now--marker.txt
Drop the message into /var/log/syslog on loghost (the util-linux logger takes -n for the remote host; the Windows utility takes -l)
host> logger -n loghost.company.com "slowness starting now --marker"
C:\Temp> logger -l loghost.company.com "slowness starting now --marker"
For Windows, install the freeware logger utility http://www.monitorware.com/logger
Drop the message into the Web server’s logs:
host> wget http://www.company.com/slowness-starting-now--marker.html
C:\Temp> wget http://www.company.com/slowness-starting-now--marker.html
For Windows, install the open source GNU wget utility
Ping
In a pinch, you can use ping, manually maintaining a written table associating ping packet length to message:
host> ping -n 1 -l 101 server.company.com
host> ping -n 1 -l 102 server.company.com
host> ping -n 1 -l 103 server.company.com
Ping Packet Length   Event
101 bytes            Mounting file system
102 bytes            Starting application
103 bytes            Slowness beginning now
Or, depending on your filters, ping a fake host … the ping won’t show up in the trace, but the failed DNS query will:
host> ping www.slowness-starting-now--marker.com
C:\Temp> ping www.slowness-starting-now--marker.com
Send-UDP-Msg
http://www.skendric.com/app
vishnu> ./send-udp-msg -m "This is a test ping" rhino1 rhino2 rhino3
vishnu>
Or, if you want to show off, write your own … here’s mine
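If you would like a starting point for rolling your own, here is a minimal sketch in Python. It is not the send-udp-msg tool linked above; the function name and the default port (9, the traditional discard port) are my assumptions, chosen only for illustration.

```python
import socket


def send_udp_marker(message, hosts, port=9):
    """Send the same ASCII marker to each host over UDP.

    Returns the number of datagrams handed to the network stack.
    UDP is fire-and-forget: nothing here confirms delivery, but the
    marker text will still be visible in a packet trace along the path.
    """
    payload = message.encode("ascii")
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for host in hosts:
            sock.sendto(payload, (host, port))
            sent += 1
    return sent
```

Usage would look something like send_udp_marker("This is a test ping", ["rhino1", "rhino2", "rhino3"]), mirroring the command-line example above.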
Many problems are intermittent – you set your debugs and packet captures going and then wait hours/days/weeks for the issue to reoccur. How might one capture across such long time frames?
Ring Buffer
Most capture utilities will produce a ring buffer of files. In this example, dumpcap writes captured packets to a file named in the following way:
server-side_00001_20130325120842.pcap
where the first field is a serial number and the second field encodes the date/time of the start of this capture. After the file reaches 50,000 kB (dumpcap counts filesize in kilobytes, so about 50 MB), it gets started on the next file:
server-side_00002_20130325120958.pcap
where the serial number has advanced and the timestamp records the start time of this capture. Dumpcap will repeat until it has written 10,000 files, whereupon it will start deleting the oldest files, in order to limit the number of files to 10,000.
Windows
dumpcap -i 1 -b files:10000 -b filesize:50000 -f "ip host 10.1.2.3 and not ip host 192.168.20.30" -w c:\temp\cabrini\server-side.pcap
Linux
/usr/sbin/dumpcap -i eth1 -b files:10000 -b filesize:50000 -f "ip host 10.1.2.3 and not ip host 192.168.20.30" -w /home/skendric/cabrini/server-side.pcap &
Long-Term Captures - CLI
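Before launching a capture like this, it is worth checking how much disk the ring buffer can consume once it fills. A quick sketch, assuming dumpcap's filesize value is in kilobytes (as its manual describes) and treating a kilobyte as 1000 bytes; the function name is mine.

```python
def ring_buffer_bytes(files, filesize_kb, bytes_per_kb=1000):
    """Worst-case disk footprint of a full dumpcap ring buffer, in bytes."""
    return files * filesize_kb * bytes_per_kb


total = ring_buffer_bytes(files=10_000, filesize_kb=50_000)
print(f"Full ring buffer: about {total / 1e9:.0f} GB")
```

With the values used on this slide, that works out to roughly 500 GB, so size the capture volume accordingly or shrink one of the two numbers.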
Or perhaps you prefer the GUI
Long-Term Captures - GUI
Extract Packets
OK, so now you have 10,000 files, and you want to look at an incident which occurred between noon and 1:00pm on 2013-03-25
The names of the files allow you to focus on the window in question, so you copy those to a working directory. But that can still be a lot of files, in a busy environment. Perhaps you realize that you only care about DNS frames. I write little scripts to extract the interesting packets and merge them into a single file.
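Windows
@echo off
rem Delayed expansion lets !filelist! grow inside the FOR loop
setlocal ENABLEDELAYEDEXPANSION
mkdir c:\temp\cabrini\extract
cd /d c:\temp\cabrini
rem Pass 1: keep only the DNS frames from each capture file
rem (-Y applies a display filter in a single pass; tshark 1.10 or later)
FOR /F %%a IN ('dir /b *.pcap') DO (
echo Processing %%a
tshark -r %%a -Y "udp.port==53 or tcp.port==53" -w extract\%%a-filtered.pcap
)
cd /d c:\temp\cabrini\extract
rem Pass 2: build a space-separated list of the filtered files
set "filelist="
FOR /F %%a IN ('dir /b *.pcap') DO (
set "filelist=!filelist! %%a"
)
rem Merge the filtered files into a single capture
mergecap -w c:\temp\cabrini\server-side-extract.pcap %filelist%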
Linux
http://www.skendric.com/problem/rca/extract-frames
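The first step, picking the ring-buffer files which cover the incident window, can also be scripted. A sketch, assuming the dumpcap naming pattern shown earlier (name_serial_YYYYMMDDhhmmss.pcap) and assuming each file runs until the next file begins; the function names are mine.

```python
import re
from datetime import datetime

# Matches the serial number and start timestamp in a dumpcap ring-buffer name
NAME_RE = re.compile(r"_(\d{5})_(\d{14})\.pcap$")


def capture_start(filename):
    """Return the capture start time encoded in the filename, or None."""
    match = NAME_RE.search(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(2), "%Y%m%d%H%M%S")


def files_in_window(filenames, start, end):
    """Return the files which may hold packets captured in [start, end)."""
    stamped = sorted(
        (capture_start(name), name)
        for name in filenames
        if capture_start(name) is not None
    )
    selected = []
    for i, (began, name) in enumerate(stamped):
        # A file covers the span from its own start until the next file starts
        ended = stamped[i + 1][0] if i + 1 < len(stamped) else datetime.max
        if began < end and ended > start:
            selected.append(name)
    return selected
```

Note that the file which began just before noon is included too, since it may still have been filling when the incident started.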
Client / Network / Server Pie
Contribution to the Problem
Client (user, application, memory, CPU, disk …): 5 seconds
Network (switches, routers, firewalls, VPN tunnels …): 120 seconds
Server (application, OS, memory, CPU, storage …): 15 seconds
I find drawing the CNS Pie useful when analyzing performance issues: How much does the Client contribute to the Problem? The Network? The Server?
http://www.skendric.com/app
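The arithmetic behind the pie is simple enough to sketch. Using the example figures from this slide (5, 120, and 15 seconds), each tier's slice is its time divided by the total; the function name is mine.

```python
def contribution_shares(seconds_by_tier):
    """Return each tier's fraction of the total elapsed time."""
    total = sum(seconds_by_tier.values())
    return {tier: secs / total for tier, secs in seconds_by_tier.items()}


shares = contribution_shares({"Client": 5, "Network": 120, "Server": 15})
for tier, share in shares.items():
    print(f"{tier:8s} {share:6.1%}")
```

In this example the Network slice dominates (120 of 140 seconds), which tells you where to spend your analysis time first.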
Rule-of-thumb
Application architecture: 1,000x
• SQL, query optimization, caching, system calls
Server & Storage Configuration: 100x
• Disk striping, spindle tiering, paging, NFS tuning
Application fine-tuning: 2-10x
• Threads, asynchronous I/O
Kernel tuning: less than 2x
- Caveats:
• If kernel bottleneck is present, then 10-100x
• Kernel can be a binary performance gate
Version 3.10
Copyright 1994-2007 Hal Stern, Marc Staveley, System & Network Performance Tuning, LISA 2007
Tuning Potential
Advance7 is a consulting outfit which helps customers resolve critical Problems – they put an analyst at your site to coordinate your staff plus vendors to fix the issue, using the RPR methodology.
They designed RPR to work against Grey Problems.
Most Problems are not Grey … unless the Problem is Grey, RPR is overkill.
So what are Grey Problems?
The following slides are cribbed from Advance7 materials -- full credit to Paul Offord & colleagues.
Rapid Problem Resolution ®
The Grey Problem
Columns: Single Incident | Recurring Problem
Rows: Technology Known | Technology Unknown

Q1 (Single Incident, Technology Known)
• Change-related Cause
• Hardware Failure
• Software Failure
• Misconfiguration
• Operations Error

Q2 (Recurring Problem, Technology Known)
• Intermittent Hardware Failure
• Known Error
• Intermittent Software Failure

Q3 (Single Incident, Technology Unknown)
• User Error
• Operations Error
• Rare Software or Hardware Error

Q4 (Recurring Problem, Technology Unknown)
• Intermittent Application Error
• Poor Application Logic
• Transient Overload
• Intermittent Infrastructure Error
• Incorrect Failover Operation

The majority of issues that are passed to 2nd and 3rd line technical support teams are investigated in a straightforward manner. The nature of the issue or an indication from a monitoring system identifies the failing component and the issue is allocated to the correct technical support team. Q1: the bulk of support work falls into this area. Q2 is harder but tends to be resolved by experienced support staff. Q3 is tough; we tend not to solve these.
An intermittent response-time or error issue is not so easily handled due to its transient nature. Not only does the cause sneak under the radar of monitoring systems, but investigation often starts after the issue has passed, making it impossible to use many of the tools available. The result is a recurring problem where the causing technology is unknown: Q4, aka the Grey Problem. The Rapid Problem Resolution methodology targets Q4.
Grey Problem Characteristics
Because the causing technology is unknown, a grey problem will bounce between Technical Support Teams as each in turn produces evidence (often in the form of a health check) to prove that their technology is not to blame.
Typical characteristics of a grey problem
• An ever-growing number of people become involved
• Long meetings to discuss what might be the cause
• Support people shy away from becoming involved
• Repeated changes with no clear reason or objective
Consequences of grey problems
• An ever-growing backlog of problems
• A fog that hinders the investigation of other, more urgent problems
• A growing pool of problems that escalate into Major Incidents as patterns of use and business priorities change
• Wasted IT budget as money is spent on poorly targeted upgrades
• Barriers to integration due to concerns about the stability of component systems
• Loss of confidence and satisfaction with the IT department
• Pressure to outsource IT services
• Reduced customer satisfaction
• Higher costs as the business adjusts to accommodate the problem
• Higher IT staffing costs
[Figure labels: Service Desk; Incident Management; Problem Management; Server, Desktop, Network, Application; Vendor (x4); “?”]
The RPR Methodology
1.2 Choose One Symptom
The single largest reason I’ve thrashed in my RCA career.
1.4 Draw the Diagram & Sit with the User
If I can’t draw it, I don’t understand it & Seeing leads to Understanding
2.2 Definitive Diagnostic Data
Insert capture gear at critical points along the path, synchronize time using a distinctive transaction, capture data simultaneously from all points while replicating the pathology.
Hardest to implement; most likely to make you successful.
Key Elements of RPR
Full methodology
http://www.skendric.com/problem/rca/RPR-RCA-Methodology.pdf
Checklist
http://www.advance7.com/misc/rpm_wb.html
Manual
http://www.advance7.com/information/publications
RPR References
• Business stake holders want features
• We IT geeks love to turn every fancy knob
The result is complexity
Complexity is the enemy of uptime … and the raison d’être for RCA
Insights from our gurus
Increasingly, people seem to misinterpret complexity as sophistication, which is baffling - the incomprehensible should cause suspicion rather than admiration. --Niklaus Wirth
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. --Brian Kernighan
KISS
Musings on Architecture
http://www.skendric.com/philosophy/uptime
Folks sometimes ask me how I learn this stuff
Key contributors to my path: Mentorship, Failure, Independent Study, Training
Mentorship
I have had the good fortune to work for highly skilled people who have mentored me.
• During 1991-1993, I joined my boss on Saturdays … worked as his gofer boy … he used the opportunity to train me
• In the mid-2000s, we hired Mike Pennacchi to coach us. Mike came on-site once/month for a half-day; we brought whatever problem was troubling us to the session; Mike would not solve it for us … rather, he would coach us through solving it. We did this for ~three years before budget contraction interfered
Failure
I have had the good fortune to work for bosses who believe that we learn through mistakes … “Fail early and often” … I’ve learned a lot this way
Independent Study
I set aside a slot every week (mostly!) to practice what I’ve learned, push myself to learn something new … these days, Sunday mornings
Training
I have had the good fortune to work for bosses with training budgets … I typically spend a couple weeks per year in classes … I occasionally augment this using my own shekels
Musings on Skill
Humans (all living creatures) are wired for fast-twitch: our nervous systems respond rapidly to the unusual, not to the mundane. So a bomb explodes, kills three people: that fires our adrenalin … but the annual toll of smoking (~500,000 per year in the US alone) drifts past our consciousness without a quiver.
Of course we’re wired this way … that’s how we stayed alive on the African savannah: by paying attention to the howl of the hyena, rather than to the gradual constriction of our arteries
But the result is that we have trouble paying attention to slow-twitch threat … to saving for retirement or a rainy day, daily exercise, spending time with our family … investing in the power grid, roads, bridges, anything which seems like a long way off …
Ditto with IT – we focus on the glitzy new projects, ignore the underpinnings … until the infrastructure breaks catastrophically … that drama fires our nervous systems, then we pay attention (for a while)
I don’t have a solution for this design flaw (trade-off) in our brains … but it does keep me employed, as a Problem Manager and a Problem Analyst
If we maintained our infrastructure (shrank technical debt), many of our RCAs would not occur
Musings on Nervous Systems
It has been said that man is a rational animal
All my life I have been searching for evidence which could support this
--Bertrand Russell
Your brain will be predisposed to certain answers and will cling to them, blinding you to reality
Definitive Data Capture is RPR’s effort to counteract this tendency
I wish you success in scrabbling for rationality
This is Hard
Insight
http://xkcd.com/1215/
On-Line Resources
Rapid Problem Resolution by Paul Offord
LinkedIn Protocol Analysis & Troubleshooting Group
Old Comm Guy http://www.lovemytool.com

Trouble-shooting & Training Outfits Based Here (will travel for $$)
James Baxter http://www.packetiq.com Daytona Beach, FL
Tony Fortunato http://www.thetechfirm.com Toronto, Canada
Chris Greer http://www.packetpioneer.com Central America
Paul Offord http://www.advance7.com London (international)
Mike Pennacchi http://www.nps-llc.com Seattle, WA
Ray Tompkins http://www.gearbit.com Austin, TX
…

Conferences
Sharkfest http://www.sharkfest.org Berkeley, CA

Follow-up stuart.kendrick.sea {at} gee mail dot com
Thank you