Root Cause Analysis - skendric.com · 2013-11-05 Root Cause Analysis Intermediate ... Corvus...
2013-11-05 Root Cause Analysis Intermediate | LISA 2013 | Stuart Kendrick 1
Root Cause Analysis – Intermediate
A Hands-On Tutorial
Your Pre-Flight Check List
1. Write your first name on the card stock, display prominently
2. Locate the course files on your USB stick
3. Grab the latest version of the slide deck, dated 2013-11-05
http://www.skendric.com/seminar/rca/Root-Cause-Analysis-Advanced-Deck.pdf
4. Configure Wireshark columns (see p.5 of this presentation)
5. Introduce yourself to your neighbors (teammates): figure out who will play which roles
6. Read printed materials at your table, examine the diagrams on the walls
Copyright Stuart Kendrick ©2013 All Rights Reserved
Workshop Outline
Introduction
Example Case
Split into Small Groups
Case Studies
  HPC Cluster Woes
  Storage Stumbles
Tips & Tools
Wrap-up
Introduction
Mechanics
Me and My Biases
What is Root Cause Analysis?
How Does This Class Work?
Recommendations
Mechanics
We use Google Docs … you don’t need an account: I will provide links
Ask questions whenever you want

Schedule
9:00 – 10:30 Class
10:30 – 11:00 Break
11:00 – 12:30 Class
12:30 – 13:30 Lunch
13:30 – 15:00 Class
15:00 – 15:30 Break
15:30 – 16:30 Class
16:30 – 17:00 Wrap-up

Your Laptop
• has Internet connectivity
• can display & search PDF, PNG, TXT, XLS
• has grep or similar
• has Wireshark configured per next slide
2013-11-05 Root Cause Analysis Intermediate | LISA 2013 | Stuart Kendrick 5
Configure Wireshark Columns
• Use a recent version of Wireshark … 1.10.0 at a minimum – I recommend the latest and greatest
• If you are an experienced Wireshark user, feel free to ignore this and use your favorite column choices
• If you are really experienced and prefer a different analyzer, feel free to use it
You really want Delta time displayed. And Custom (tcp.stream) will be helpful.
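To make the two recommended columns concrete, here is a minimal sketch of what they represent: tcp.stream labels each TCP conversation with an index, and a delta-time column shows the gap since the previous displayed packet (so, with a filter on one stream, the gap within that conversation). The packet tuples and the function name below are made up for illustration; this is not how Wireshark computes its columns internally.

```python
from collections import defaultdict


def per_stream_deltas(packets):
    """Return {stream_index: [seconds since the previous packet in that stream, ...]}.

    `packets` is an iterable of (timestamp_seconds, stream_index) pairs in
    capture order. Both fields are hypothetical stand-ins for a trace export.
    """
    last_seen = {}
    deltas = defaultdict(list)
    for timestamp, stream in packets:
        if stream in last_seen:
            deltas[stream].append(timestamp - last_seen[stream])
        last_seen[stream] = timestamp
    return dict(deltas)


# Made-up packets: two interleaved conversations (streams 0 and 1).
packets = [(0.0, 0), (0.5, 1), (1.0, 0), (4.0, 0), (4.5, 1)]
deltas = per_stream_deltas(packets)
# The longest silence in each conversation is often the interesting number.
longest_gap = {stream: max(gaps) for stream, gaps in deltas.items()}
```

Long gaps inside a single stream are the kind of thing these columns make visible at a glance, which is why they are worth setting up before the case studies.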
Multi-disciplinary IT trouble-shooter / Root Cause Analysis
http://www.skendric.com

sbk@cornella | student | 1981
stuart@cpvax5 (Science Applications Inc) | programmer
[email protected] | desktop / server
[email protected] | server / network
[email protected] | multidisciplinary | 1993
stuart.kendrick {at} isi lon dot com | sustaining engineer | 2014

IT Architect | ITIL Problem Manager | Problem Analyst | Device Monitoring | Transport

Geeky Highlights
PL/1 on IBM mainframes | Cornell University | Ithaca | 1981
FORTRAN on CRAY-1 | SAIC | San Diego | 1984
Terak, DisplayWriter, IBM PC, Macintosh | Cornell University | Ithaca | 1985
Netware, Corvus Omninet, TCP-IP / IPX / AppleTalk | Cornell University | Ithaca | 1988
AppleShare, QuickMail, Farallon, NRC, Cisco, Sniffers | Cornell Medical College | Manhattan | 1991
Solaris, Windows, Linux, Perl, SNMP, Wireshark, Cisco, Fluke | FHCRC | Seattle | 1993
OneFS | EMC Isilon | Seattle | 2013
Me
2014-04-12 Myth-Busting | xxx 2014 | Stuart Kendrick / Chris Shaiman 6
You are a senior engineer with a decade or more of experience in the industry
Perhaps you function as a sys admin, network engineer, database admin, or developer
Perhaps you work for a large outfit and function as an ITIL Problem Analyst
Perhaps you work for a small outfit and are a jack-of-all-trades
In any case, you are T-shaped: you have a strong fundamental knowledge in one or two areas and have expertise (possibly rusting!) across a range of technologies
Problem solving skills: You enjoy difficulty
Inquiring mind: Passion for understanding how things work
Determination & stamina: Pursuing a tough problem can be wearing
T-shaped: Broad background in IT with specialization in one or two areas
You are here because you want to practice skills in a small group, rather than listen to a lecture
Or … perhaps you are a people or process person – resource manager, project manager, ITIL Problem Manager. You don’t have the skills to analyze bits & bytes, but you want to practice a problem solving methodology. You’ll help keep your team on track, coordinating subject matter experts, bringing the results together for reports to the larger class.
Or … perhaps you are a junior engineer, jumping into the lake with bigger kids, knowing you’ll be out of your depth, hoping to learn from the experience nevertheless. I’m OK with this … but realize that you’ll be inhaling water … wave me down as needed
You
• I do not claim to be good at trouble-shooting
• I do not claim to know how to teach trouble-shooting
• I am not the smartest or fastest guy on the block
However …
• I have ~30 years experience in this business
• I have trained under gurus
• I have accumulated a grab bag of tips which you may find useful
• I have converted real-world events into these case studies
• The result is a set of puzzle-solving labs which I predict you’ll enjoy
After all, it is more fun to trouble-shoot someone else’s issues …
Caveats
I have made a ceaseless effort not to ridicule, not to bewail, not to scorn human actions, but to understand them.
--Baruch Spinoza
Anything worth doing is worth doing badly.
--Marshall Rosenberg
The first principle is that you must not fool yourself -- and you are the easiest person to fool.
--Richard Feynman
Doubt is uncomfortable; certainty is absurd.
--Voltaire
The goal of education is to make up for the shortcomings in our instinctive ways of thinking about the physical and social world.
--Steven Pinker
My World View
Confidence & Knowledge
[Figure: Confidence (Doubt to Certainty) plotted against Knowledge (Little to Lots), running from Newbie to Jedi]
Ignorance more frequently begets confidence than does knowledge. --Charles Darwin
As I age, I increasingly value the following from myself and my colleagues:
• I don’t know
• I made a mistake
• Here’s how I will clean up the mess I made
I predict that you will follow many blind avenues during RCAs … I wish you success in keeping shoshin, aka, beginner’s mind, as you wander along your path …
Music to My Ears
Science is not truth; it is, instead, a method for diminishing ignorance.
--J.M. Adovasio, Olga Soffer, Jake Page
A scientific theory accurately describes a large class of observations and makes definite predictions about future observations, predictions which are falsifiable, i.e. could be disproven by observation.
--Derived from Stephen Hawking
Credible explanations grow from the combined testimony of three more or less independent, mutually reinforcing sources -- explanatory theory, empirical evidence, and rejection of competing alternative explanations.
--Edward Tufte
I recommend Tufte’s day-long seminar, as an introduction to critical thinking --sk
My Biases
Quantum Mechanics
http://xkcd.com/1240/
Any structured approach for identifying the contributors to an IT service disruption
There is no such thing as a Root Cause … nevertheless, Root Cause Analysis remains a useful tool
RCA is not complete until we’ve applied the fix and verified that the problem is resolved
Business reality: competing priorities distract us from completing RCAs
Most folks use the term RCA to refer to a post-mortem process … I use the term in its ITIL sense, tightly bound to Problem Management
How Complex Systems Fail – Richard Cook
A Few Thoughts on Uptime – me
What is Root Cause Analysis?
Why do I claim there is no such thing as a Root Cause? Consider the server which goes down; your monitoring system pages you; you investigate. Turns out the power supply died – you replace the power supply, the server reboots, everyone is happy again. Then, you notice that the second power supply is dead, too. Turns out your monitoring system wasn’t checking power supplies when the first one fried a few months ago. Why wasn’t your monitoring system checking power supplies? Because it can’t – and upgrading to the newer version, which can, costs time & money – your management looked at the costs, weighed the risks, and decided to spend your time and those dollars on upgrading the aging e-mail server, which was close to collapse. Why doesn’t your department have enough staff and money to upgrade both the e-mail server and the monitoring server? Because management has to juggle the costs of IT against the costs of core business requirements – both of which look critical from different vantage points.
So what’s the Root Cause? A failed power supply? An inadequate monitoring system? Insufficient process in your leadership’s prioritization tactics, that they let the aging e-mail system stumble along for far too long? Insufficient resources to meet both core business requirements and IT requirements? Not enough market for your product, which is why you don’t have sufficient resources to meet both sets of needs?
Still not convinced? Why have you lost two power supplies across as many months? Because your local utility is straining to meet demand in your area and frequently inflicts brownouts, which age power supplies prematurely. Why hasn’t the utility beefed up capacity in your area? Because that would cost money, and politicians are reluctant to approve the rate increases necessary to support an expansion, given current voter sentiment. Why are voters annoyed at politicians? … Reality is complex: There is no such thing as Root Cause …
Why No Root Cause?
Oh boy, that’s a big question. But let’s take a stab at answering it. A tech might start asking themselves, or the person reporting the problem, questions similar to the following:
• What makes you think there is an issue?
• What are you expecting that you’re not getting?
• Has it ever performed well?
• What changed recently? Software or hardware? Load?
• Can it be expressed in terms of latency or run time?
• Does the problem affect other people or applications?
• What is the environment? What software and hardware is used? Versions? Configuration?
• …
Most issues get fixed somewhere during the process of asking these questions and uncovering the answers …
How Do Techs Fix Issues?
As the issue resists resolution, less skilled techs will start employing less effective approaches.
Street Lamp Method
The student comes across his professor on the Arts Quad at night, down on his hands & knees, staring at the sidewalk. “What are you doing, sir?” “Looking for my car keys”. The student joins the professor but after looking unsuccessfully in widening circles, asks him “Do you recall precisely where you were when you dropped the keys?” “Yes, over there, in the middle of the quad” points the professor, toward the dimly perceived middle of the grassy acre. “Well, why are you looking here?” asks the student. “Because the light is better here” responds the professor.
More formally:
1. List available tools
2. Examine the output of each one, looking for clues
3. Purchase more tools
4. Goto #1
Use The Force, Luke
“I know that we are experiencing a broadcast storm … you should check your {switch | router | firewall | server | client | application | whatever-belongs-to-some-other-group}”
I enjoyed Star Wars … but it was fiction … that distinction is hard for human brains to make. --sk
Anti-Patterns
The issue typically gets escalated to a more experienced tech. I have yet to be satisfied with an account of what an experienced human does when engaging in their field of expertise. That said, here is one way to express what might be happening.
For every Resource, check Utilization, Saturation and Errors.
Intended to be used early in a performance investigation, to identify systemic bottlenecks.
Terminology definitions:
• Resource: all physical server functional components (CPUs, disks, busses, …)
• Utilization: the average time that the resource was busy servicing work
• Saturation: the degree to which the resource has extra work which it can’t service, often queued
• Error: the count of error events
Stuart’s version:
1. Scan the logs, looking for error messages (Errors)
2. Are requests waiting in queues? (Saturation)
3. How busy are the boxes? (Utilization)
I am cribbing from Brendan Gregg: http://dtrace.org/blogs/brendan/2012/02/29/the-use-method
The USE Method
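As a rough illustration of "for every Resource, check Utilization, Saturation and Errors", the sketch below walks a list of resources in the Errors, Saturation, Utilization order. The `Resource` fields, the sample numbers, and the 0.8 busy threshold are all assumptions made for this example; real values would come from whatever your logs and monitoring tools report.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    utilization: float  # fraction of time busy, 0.0 to 1.0 (assumed unit)
    queue_length: int   # work waiting to be serviced, a stand-in for saturation
    errors: int         # count of error events


def use_check(resources, busy_threshold=0.8):
    """Return (resource, check, value) findings, asking about errors first."""
    findings = []
    for r in resources:
        if r.errors > 0:
            findings.append((r.name, "errors", r.errors))
        if r.queue_length > 0:
            findings.append((r.name, "saturation", r.queue_length))
        if r.utilization >= busy_threshold:
            findings.append((r.name, "utilization", r.utilization))
    return findings


# Made-up measurements for three resources on one server.
resources = [
    Resource("cpu", utilization=0.35, queue_length=0, errors=0),
    Resource("disk", utilization=0.92, queue_length=4, errors=0),
    Resource("nic", utilization=0.10, queue_length=0, errors=12),
]
findings = use_check(resources)
```

The output is only a list of places to look next; deciding whether a busy disk or an erroring NIC matters for the symptom at hand still takes judgment.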
But Not Today
Most problems get solved using any number of techniques, a few of which I sketched in the previous slides
But that’s not what I will be pushing you to do today
I will be pushing you to employ a methodology called Rapid Problem Resolution (RPR) ®
RPR is an evidence-based process … it is a heavy process … it is a sledgehammer. Sledgehammers are generally overkill …
But for a certain class of problems – the ones which have defeated experienced techs for weeks, months, or years – sledgehammers offer plenty of value
The case studies in this class belong to that class of problems
I will push you to employ RPR. You may resist. That’s OK
The official goal of this class is to introduce you to RPR
This workshop borrows heavily from the Rapid Problem Resolution® methodology codified by Paul Offord of Advance7, which fits into ITIL’s Problem Management schema.
I’ve slashed Advance7’s 19-step approach into 9 steps. This makes the methodology less effective, but teachable in a single day and suitable for smaller RCAs.
RPR is not a silver bullet. It is merely a tool for your tool bag, like ping, top, PerfMon …
There are no silver bullets.
Life is pain, Highness. Anyone who says differently is selling something.
--The Man in Black
Rapid Problem Resolution ®
Derived from the Rapid Problem Resolution® methodology
1. Understand the Symptoms
2. Choose One Symptom
3. Draw the Diagram
4. Design Capture Plan
5. Capture Diagnostic Data
6. Analyze Captured Data
7. Identify Fix
8. Implement Fix
9. Verify Fix
RCA Methodology
Phase 1
Phase 2
Phase 3
Notes on the Nine Steps
1. Humans want instant gratification: we start trouble-shooting before we understand the problem. Resist that urge.
2. We naturally want to fix everything fast – myself, I rarely succeed when I try. Be particularly wary of thrashing: jumping from one symptom to another. Pick One Symptom, One Symptom only, and stick to it.
3. It is common to start trouble-shooting before understanding the environment. Draw the Diagram and Sit with the User. You may discover that you didn’t understand the Symptom, in which case, start over.
4. As you learn more about the Environment and make mistakes in your capture methodology, you’ll cycle through Steps #4-#6 numerous times. This is normal. As you become more experienced, you’ll spend more time on #3 and fewer times cycling through #4-#6.
5. If the problem is intermittent, you can spend a lot of time waiting here. That is reality.
6. Naturally, you need time to think about the data you capture.
7. At some point, you exit the #4-#6 loop because you think you understand what is happening and you have identified a fix.
8. You apply the fix.
9. Key step: verify that your fix actually works. If it doesn’t, start over.
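The control flow described in these notes (an inner loop over Steps #4-#6, and a restart when verification fails) can be sketched as below. This is only an illustration of the ordering: the `analyze` and `verify` callables, the attempt limits, and the sample fix string are hypothetical stand-ins for the human work, not part of RPR itself.

```python
UNDERSTAND = ["1 Understand the Symptoms", "2 Choose One Symptom", "3 Draw the Diagram"]
CAPTURE = ["4 Design Capture Plan", "5 Capture Diagnostic Data", "6 Analyze Captured Data"]
FIX = ["7 Identify Fix", "8 Implement Fix", "9 Verify Fix"]


def run_rca(analyze, verify, max_attempts=3, max_cycles=5):
    """Walk the nine steps and return (fix, trail of steps visited).

    analyze(attempt, cycle) returns a proposed fix, or None if the captured
    data did not explain the symptom yet. verify(fix) returns True when the
    symptom is gone. Both are supplied by the caller.
    """
    trail = []
    for attempt in range(max_attempts):
        trail.extend(UNDERSTAND)
        fix = None
        for cycle in range(max_cycles):   # the #4-#6 loop
            trail.extend(CAPTURE)
            fix = analyze(attempt, cycle)
            if fix is not None:
                break
        if fix is None:
            continue                      # still no explanation: start over
        trail.extend(FIX)
        if verify(fix):
            return fix, trail
    return None, trail


# Hypothetical run: the second capture cycle yields a fix, and it verifies.
fix, trail = run_rca(
    analyze=lambda attempt, cycle: "reinstall client" if cycle == 1 else None,
    verify=lambda proposed: True,
)
```

In this made-up run the trail shows Steps #4-#6 twice before Steps #7-#9 appear once, which matches the note that cycling through the capture loop is normal.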
RCA Roles & Responsibilities

The Problem Solving Group (aka RCA Team) consists of the Facilitator, the Problem Analyst, and one or more Subject Matter Experts

Who: Facilitator (often a Problem Manager)
What: Accountable for
o Owns the RCA
o Acquire resources
o Use and execute the methodology
o Communicate within the team
o Report & escalate to leadership
o Schedule meetings

Who: Problem Analyst (often a senior engineer)
What: Responsible for
o Unify & synthesize information from SMEs
o Keep team on track technically
o Breadth & depth

Who: Subject Matter Experts
What: Responsible for
o Strong fundamental knowledge of area
o Facilitating access
o Capturing data
o Analyzing

SME Desirable Characteristics: Skills / Predilections
o Problem solving skills
o Inquiring mind – passion for understanding how things work
o Determination & stamina – pursuing a tough problem can be wearing
o T-shaped – broad background in IT with specialization in one or two particular areas

Process-oriented person
Sees the forest, not the trees
Respected / trusted by SMEs
Like getting their hands dirty
Draw the Diagram
Design Capture Plan
[Diagram: Request and Response paths between components, including a Fibre Channel Switch and supporting services (DNS, LDAP, NIS …)]
Who talks to whom?
Where to insert probes?
Where to gather logs / debug output?
We will work through case studies – real situations drawn from my experience at FHCRC – alternating between small group and seminar style sessions.
Typically, we will oscillate in 15-30 minute increments – spending 15-30 minutes together as a class, working privately in our small groups for 15-30 minutes, coming together for 15-30 minutes …
Course materials on the USB stick include packet traces, log extracts, trending charts, ‘show’ output from clients, servers, switches/routers, storage systems, captured during the actual RCA.
How Does This Class Work?
Whirlwind tour: At the Hutch, we typically spent weeks of an RCA team’s time on these cases – in this workshop, we will just taste each experience, merely touching on key points – we will not have time to dig through any of them in detail.
Variable expertise: As a group, we differ wildly in our expertise – some of us have never seen Wireshark before, have never touched an Ethernet switch or a storage array. I will play to a range of levels: sometimes you may be bored, sometimes you may be drowning.
We will not finish: I do not expect to reach all the case studies. We may not even get through the first one – it contains a lot of material – all depends on where your curiosity leads us.
Expectations
Detours: Using your questions as cues, I will stop the flow of the course and explore related topics: how striping affects the performance of arrays, how TCP Window works, how to perform a particular function in Wireshark.
Contribute: If you have expertise to contribute, please speak up – group dialogue contributes to learning.
Methodology: I will be a stickler for the RPR Methodology and will attempt to push you into following it, following each step in order. Naturally, you may choose to resist. I’m OK with dissent and rebellion – you know yourself better than I do – if you’ll learn better doing things differently, ignore me + blaze your own trail.
More Expectations
Red Herrings: I will include data and clues which are irrelevant to solving the problem … that’s what happened to us, so I intend to share the pain.
Misinformation: When I am wearing a hat, I may give you inaccurate information, based on the limitations of the person whose role I am playing. When I am bare-headed, I am playing the role of the instructor and will try to describe reality as accurately as I know how.
Chaos: I am trying to recreate the fog of war, the confusion of a real-world situation: practicing ways to bring order from chaos is a deep lesson of this class
Great Expectations
Embarrass me: I make mistakes – find them and point them out. I’d rather feel embarrassed and learn than feel comfortable and remain ignorant.
Embarrass yourself: Take risks, ask dumb questions, reveal your ignorance. If you don’t understand my answer, ask again. This is your laboratory, a safe place for you to learn. Ex ignorantia ad sapientiam, e luce ad tenebras.
Data: The USB stick contains data – packet traces, ‘show’ output, screen shots – as you work through the scenario and ask for data, I will point you to the relevant directory. If you get stuck, feel free to poke around.
Results Folders: The USB stick also contains the answers to the case studies in folders named Results. I recommend avoiding the Results folder until we’re done for the day.
Wave me down: If you are stuck and thrashing, wave me down – I’m happy to assess where you are and offer you direction to get you unstuck
Recommendations
We are about to walk through the Example Case.
Questions up to this point?
Questions?
Example Case
1. Understand the Problem
2. Choose One Symptom
3. Draw the Diagram
4. Design Capture Plan
5. Capture Diagnostic Data
6. Analyze Captured Data
7. Identify Fix
8. Implement Fix
9. Verify Fix
Results
Server Disconnects Telnet Client
The End-User (Angie) keeps getting disconnected from the Server (Ingres). This has been going on for a while; Angie has a high-profile job and a high-profile boss; management has spun up a Root Cause Analysis team and assigned you and a Desktop Tech (Bob) to the team. Bob explains to you that he has been working the issue for several weeks, that a Router is causing the problem, and that he needs help finding and fixing the Router.
We start with 15 minutes together focused on Methodology Step #1: Understand the Symptoms
Walk Through an Example Case
Questions for the Desktop Tech
You: What do you know about Angie?
Bob: She is a power user located in the Fairview Building, runs Windows XP and the Attachmate Reflection terminal emulator.
You: What do you know about the Server?
Bob: It is a Unix server called Ingres located in the Yale data center and run by the Sys Admin Rick.
You: How long has the problem been occurring?
Bob: Several weeks, happens multiple times per day, no pattern.
#1 Understand the Symptoms
Questions for End-User
You: When did this start?
Angie: It has happened for years, but I didn’t bother to report it because, until several weeks ago, I hardly used Ingres. Now, I spend all day in it, and the problem is really annoying.
You: What do you notice?
Angie: Multiple times per day, I get disconnected and have to log back in.
You: See any patterns?
Angie: Not really. Sometimes I’m typing along and get disconnected. Sometimes, I turn back to my machine or unhide Reflection and see that I’ve been disconnected.
#1 Understand the Symptoms
Questions for End-User
You: What do you do with this application?
Angie: I enter data into the FALCON database. The forms from which I acquire the data are irregular – requires a lot of interpretation. Sometimes, I spend time looking up related cases in other databases or calling relevant people on the phone for input. Sometimes, I just type like a mad woman. Sometimes, I run reports – it’s really annoying when a report takes half an hour to run and I get disconnected just before it finishes, because then I have to re-run the report.
#1 Understand the Symptoms
Questions for End-User
You: When you’re typing like a mad woman, how long before you get disconnected?
Angie: I figure I get 45 minutes. That’s my guess – I figure I get disconnected every 45 minutes. I might be wrong about that – I haven’t timed it or anything. But if I’ve been logged in for half an hour or so and need to run a report, I generally wait until I get disconnected, log back in, and then run the report immediately.
#1 Understand the Symptoms
Questions for the Sys Admin
You: What can you tell me about Angie’s problem?
Rick: Got me. It can’t be my server: Ingres has about 40 users and 10 developers, and Angie is the only person reporting this problem. They all use the Reflection SSH client.
You: What can you tell me about Ingres?
Rick: It is an HP Alpha server running OpenVMS located here in the D5 data center. It runs the Ingres database manager (can you tell by its name?) Angie uses the FALCON database: everyone uses FALCON; it’s the most popular database we offer.
#1 Understand the Symptoms
Questions for the Sys Admin
You: How often does Angie have this problem?
Rick: Seems to me that Angie gets disconnected every hour or two; I’ve checked the server configuration – I haven’t configured a timeout: everyone gets unlimited access as long as they want.
You: What do your logs say?
Rick: Not much. Angie has called me plenty of times, right after getting disconnected, but all the Alpha logs say is:
“Username angie: Client disconnected”
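Rick's claim that Angie is the only user affected is something the server logs could confirm or contradict. The sketch below counts disconnect messages per user, assuming every disconnect is logged in the same "Username <name>: Client disconnected" form quoted on this slide. The sample lines, including the login message and the other usernames, are invented for illustration; the real log layout may differ.

```python
import re
from collections import Counter

# Pattern taken from the one log message quoted by the Sys Admin.
DISCONNECT = re.compile(r"Username (\w+): Client disconnected")


def count_disconnects(lines):
    """Count 'Client disconnected' messages per username."""
    counts = Counter()
    for line in lines:
        match = DISCONNECT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented sample lines; only the disconnect wording comes from the case.
sample_log = [
    "Username angie: Client disconnected",
    "Username carol: Login succeeded",
    "Username angie: Client disconnected",
    "Username dave: Client disconnected",
]
counts = count_disconnects(sample_log)
```

If other usernames show up with non-trivial counts, "nobody else is reporting it" and "nobody else is experiencing it" turn out to be different statements.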
#1 Understand the Symptoms
Split: If this were a real case, we would split into our small groups. You have 15 minutes.
Choose: Your first task in small group is to select one and only one symptom on which to focus. In this example, it’s pretty easy – there’s only one symptom. In future cases, this task will be harder – there will be many symptoms. Generally, I recommend picking either the easiest to analyze, the easiest to replicate, or the most costly to the business.
Phrase: Find a precise way to phrase the symptom. Example:
Angie gets intermittently disconnected from Ingres.
#2 Choose One Symptom
This will involve asking IT staff technical questions about the environment – this is where I start swapping hats (End-User, HelpDesk, Desktop, Sys Admin, Network, Database, Security, Vendor, Manager …), depending on the group to which you address the question
Ideally, the Ops staff already have this diagram and keep it updated as they make changes … but in my experience, only the most mature shops manage this
Sometimes, we identify the cause during the process of diagramming!
There’s a lot of experience & judgment here – what to include, what not to include
Focus on the components which surround the Symptom you have picked and how they relate to one another: dependencies.
If you solve a problem without drawing a diagram, you got lucky.
#3 Draw the Diagram
Diagram for Example Case
This is done in small group; you have 15 minutes. In this step, you figure out how you’ll gather the data you identified in the previous step.
Typically, you will want to gather logs and/or metrics from applications and operating systems as well as insert sniffers
As much as possible, I will also support your performing ‘show’ commands, grepping through logs, trending parameters across time, rebooting devices …
#4 Design Capture Plan
Example Data Capture Plan
1. Plug sniffers into lf-esx and d5sr-esx and SPAN Angie’s port and Daffy’s port, filtering on Angie’s IP address
2. Enable debug tracing on Angie’s copy of Reflection, gather both syslog and Ingres logs on Daffy
3. Validate capture set-up by asking Angie to ssh into Daffy, then verifying that we can see Angie’s login in all logs and packet traces
4. Sit with Angie and watch her work for a day, precisely recording the times when she gets disconnected
5. While we’re waiting: Gather ‘show port’ output from Angie’s and Daffy’s switch ports plus version and configuration information (idle timer setting) from Reflection
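Step 4 of the plan above (recording precisely when Angie gets disconnected) pays off when the times are turned into intervals, since Angie guessed 45 minutes and Rick guessed every hour or two. A minimal sketch follows; the clock times are invented for illustration and are not data from the case.

```python
from datetime import datetime


def disconnect_intervals(timestamps, fmt="%H:%M"):
    """Return minutes elapsed between consecutive disconnect times (same day)."""
    times = [datetime.strptime(t, fmt) for t in timestamps]
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(times, times[1:])]


# Hypothetical observations from a day spent sitting with the End-User.
observed = ["09:10", "09:55", "11:40", "12:25"]
intervals = disconnect_intervals(observed)
```

A tight cluster of intervals would point toward a timer somewhere in the path, while a wide scatter would point toward something event-driven; either way, the measured numbers replace two competing guesses.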
#4 Design Capture Plan
This is done as a class. The instructor executes each group’s Diagnostic Capture Plan and returns the resulting information.
Each group benefits from hearing the results of every group’s Diagnostic Capture Plan.
Typically 15 minutes.
In this example, the instructor returns:
• Reflection debug trace
• Packet Captures
• Logs
• Angie’s & Daffy’s Ethernet port statistics
• Reflection Version & Settings (idle timer)
#5 Capture Diagnostic Data
Angie’s Ethernet Port Stats
lf-esx#sh ver
[…]
lf-esx uptime is 3 years, 3 weeks, 5 days, 12 hours, 44 minutes
[…]
lf-esx#sh int Fa2/19
FastEthernet2/19 is up, line protocol is up (connected)
Hardware is Fast Ethernet Port, address is 0011.21f5.46c2 (bia 0011.21f5.46c2)
MTU 1500 bytes, BW 100000 Kbit, DLY 100 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation ARPA, loopback not set
Keepalive set (10 sec)
Full-duplex, 100Mb/s, link type is auto, media type is 10/100BaseTX
input flow-control is unsupported output flow-control is unsupported
ARP type: ARPA, ARP Timeout 04:00:00
Last input 00:00:19, output never, output hang never
Last clearing of "show interface" counters never
Input queue: 0/2000/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/40 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 4000 bits/sec, 6 packets/sec
161282073 packets input, 48475519613 bytes, 0 no buffer
Received 2004674 broadcasts (1689326 multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 input packets with dribble condition detected
831253443 packets output, 116132425387 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 babbles, 0 late collision, 0 deferred
1 lost carrier, 0 no carrier
0 output buffer failures, 0 output buffers swapped out
lf-esx#
If the error counters were high, perhaps we have a bad NIC | cable | switch port … but they are zero or close enough. Rule out bad physical layer.
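The "zero or close enough" reading of the port counters can be checked mechanically. The sketch below pulls a few counters out of lines copied from the `sh int Fa2/19` output above and computes an overall error ratio. The choice of counters and the regular expression are my own assumptions about how to read this particular output; other platforms format these lines differently.

```python
import re

# Lines copied from the switch output shown above.
SAMPLE = """\
161282073 packets input, 48475519613 bytes, 0 no buffer
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
831253443 packets output, 116132425387 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
1 lost carrier, 0 no carrier
"""

COUNTERS = ["packets input", "input errors", "CRC",
            "packets output", "output errors", "collisions", "lost carrier"]


def parse_counters(text):
    """Map each counter name found in the text to its integer value."""
    found = {}
    for name in COUNTERS:
        match = re.search(r"(\d+) " + re.escape(name) + r"\b", text)
        if match:
            found[name] = int(match.group(1))
    return found


counters = parse_counters(SAMPLE)
total_packets = counters["packets input"] + counters["packets output"]
total_errors = counters["input errors"] + counters["output errors"]
error_ratio = total_errors / total_packets
```

With the counters never cleared over roughly three years of uptime, a single lost carrier and no input or output errors is consistent with the slide's conclusion that the physical layer is not the problem.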
Packet Trace
daffy = ingress
Angie abruptly hangs up (TCP RST) on Daffy (aka Ingress). Looks like Angie initiated the disconnect.
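Finding which side sent the reset is the key reading of this trace. As a minimal sketch, the function below returns the first packet carrying the RST flag from a simplified packet list. The tuple layout, host labels, and times are made up for illustration; in practice you would read this straight off the analyzer rather than write code.

```python
def first_reset(packets):
    """Return (time, src, dst) of the first packet with RST set, or None.

    `packets` is a list of (time_seconds, src, dst, flags) tuples in capture
    order, where flags is a set of TCP flag names. The layout is hypothetical.
    """
    for time, src, dst, flags in packets:
        if "RST" in flags:
            return time, src, dst
    return None


# Invented trace: ordinary traffic, then a reset whose source is the client.
trace = [
    (2698.2, "angie", "daffy", {"PSH", "ACK"}),
    (2698.3, "daffy", "angie", {"ACK"}),
    (2700.0, "angie", "daffy", {"RST"}),
]
reset = first_reset(trace)
```

A source address on a reset only shows whose address the packet carries; a device in the path can also emit resets on a host's behalf, which is one reason the capture plan sniffs at both the client end and the server end.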
Reflection Settings
Reading the manual tells us that Connection Setting Timeout is an Idle Timer. And that a choice of ‘0’ for this timer means ‘unlimited’, i.e. never disconnect, no matter how long the user remains idle. Dang, we really wanted to see a setting of, oh, 60 minutes here.
Application Version
Research tells us that the latest patch level for Reflection 14 is v14.0.7. And the latest version for this train of Reflection is 14.1.188 SP1. Angie is running an old version.
Back to small group; you have 30 minutes to analyze the data you have acquired and
In real life, you will likely cycle through Steps #4 - #6 multiple times.
Feel free to continue to #7 Identify Fix when you are ready.
Your team consults together … hmm …
• The Ethernet port shows trivial errors, so that looks fine.
• The packet trace shows Angie initiating the disconnect
• Reflection settings show an unlimited idle timer
• We’re running an old version of Reflection … probably full of bugs
#6 Analyze Captured Data
At some point, you believe you’ve identified the cause; now you can develop a fix.
#7 Identify Fix
Your team says:
We know that Attachmate has shipped numerous updates to Reflection – the latest version is 14.0.7. We propose to upgrade Angie’s copy to the latest version.
We reconvene as a class. Each group proposes its fix, and the instructor reports the results of the fixes.
In this example, Bob doesn’t want to upgrade – he wants to keep all his users at the same revision.
Instead, he uninstalls and re-installs Reflection.
#8 Implement Fix
We remain regrouped as a class and review the results of the Fixes. In this case, Angie runs for a week without any disconnects.
Bob doesn’t want to invest more time into this, so we quit.
Ideally, we would re-image Angie’s machine and verify that the problem returned … as scientists, we realize that we have demonstrated correlation, but not cause and effect.
#9 Verify Fix
We declare the Problem resolved, with an undefined Root Cause: something related to Angie’s local Application configuration, which was reset when the Application was re-installed. We have no explanation for why this affected only Angie and not any of the other 55 users.
In a perfect world, we would re-image Angie’s machine and verify that the problem returned … in the real world, we did not implement that last step of RPR, which requires that we Verify the Fix …
As a Problem Manager, you are responsible for ensuring that management hears the risk they have adopted by skipping this step.
Results
• For the rest of our day we cycle between small group and large group
• In large group, you ask questions; in small group, you analyze
• I am available for questions and coaching during both
Questions about the mechanics of what we will be doing?
Questions about the 9 step RCA process?
End of Example Case
#1 Split
In a moment, you will split into groups of 3-6 people
#2 Assign Roles
I recommend assigning roles & responsibilities, e.g.
Facilitator: Tracks who is doing what, spokesperson
Problem Analyst: Big picture
Subject Matter Experts: Sys admin, network, storage …
Successful teams divide & conquer the material …
Ideally, one person per role …
#3 Pick Name
Pick a cool name for your group; write it on one of the name plates
You have 5 minutes – go
Split into Small Groups
HPC Cluster Woes (morning)
Storage Stumbles (afternoon)
Case Studies
Researchers submit tissue samples to the Genome Sequencing Shared Resource. A full sequencing run takes days – the sequencers dump the resulting files, typically dozens to hundreds of gigabytes, onto the server Fred. From there, users run custom code on the High Performance Computing (HPC) cluster Hyrax to analyze the results. They write their own code, typically in a mix of Perl, Python, and R, and tweak this code regularly, as they explore various avenues of inquiry. The cluster has a handful of heavy users (daily or weekly), plus several dozen light users (monthly).
The scheduler behind Hyrax submits jobs to the nodes which comprise the cluster, keeping track of various parameters, like how many nodes a given researcher owns (condo-model), which nodes are already busy, how much time a given job has already consumed, and so on. Some jobs finish in minutes, some take hours, others take days or even weeks -- this is normal.
A few of the nodes are unusual: they are large memory nodes, typically equipped with 64 GB of RAM plus fast processors; they are named RhinoX and OrcaX (e.g. rhino1, rhino2, rhino3 … orca1, orca2, orca3 …)
HPC Cluster Woes I
In the summer of 2011, the Hutch hired a promising young researcher, Robert Bradley (aka rbradley), who had recently completed postdoctoral work at MIT. Bradley analyzes alternative splicing, a process by which a single gene contributes to producing multiple protein isoforms – a normal event in cells and one which plays an important role in various diseases, including cancers. Bradley’s work makes heavy use of large memory HPC machines.
By September, Bradley had transferred his data and code from MIT to the Hutch; almost immediately, he started encountering problems. Interactive ssh sessions to Rhino/Orca stall, sometimes for seconds, minutes, perhaps even hours. Nodes hang for minutes at a time, with no progress on the job. Nodes crash and must be rebooted. Jobs crash and must be restarted.
This is not the kind of service we want to offer anyone, much less a new recruit.
Management input:
You cannot talk to End-Users, rbradley in particular.
You can ask their Desktop Support staff questions, and they will answer as best they can.
Scheduling downtime on any of the storage systems is hard.
HPC Cluster Woes II
Welcome to the first meeting of the Rhino PSG, on Wednesday November 2nd 2011. The meeting kicks off with the Ops team delivering their briefing. Read it. Understand nls
We start the RPR Methodology working together as a class.
Step #1: Understand the Problem
What questions do you want to ask of the various constituencies?
Database, Desktop, Helpdesk, Manager, Network, Server, Storage, User, Vendor …
After Understand the Problem, we will separate into small groups and proceed with:
Step #2: Choose One Symptom
Set the Stage
Large group | Small group
1. Understand the Symptoms
2. Choose One Symptom
3. Draw the Diagram
4. Design Capture Plan
5. Capture Diagnostic Data
6. Analyze Captured Data
7. Identify Fix
8. Implement Fix
9. Verify Fix
RCA Methodology
Derived from Advance7’s Rapid Problem Resolution® methodology
Phase 1
Phase 2
Phase 3
What aspect of the process would you like to review?
What section of a diagram or graph would you like to explore?
What hunk of data would you like to re-examine?
Which link in the chain of reasoning doesn’t make sense to you?
Additional questions?
This is your opportunity to consolidate your learning.
Q&A
PGP
http://xkcd.com/1181/
During the first decade of the new century, our storage silos start to multiply – Hitachi, IBM, Compaq, Dell, NetApp … in January 2010, after a year-long project, we go live with Consolidated Storage, an attempt to reduce staff / capital costs and meet future storage needs via a single system. Consolidated Storage consists of a clustered NetApp V3170 providing SMB v1/v2, NFS v3/v4, and iSCSI access to a backend 3Par T800, containing 528 SATA drives (both 1 and 2TB). The T800 is a wide-striped system, meaning that every LUN it offers has been striped across all 528 drives. We estimated that the ~600TB usable space on CS would last us until mid-2012. By March 2010, almost all that space has been allocated …
Consolidated Storage services vColo, our VMware farm (~600 guests residing on ~7 hosts), along with hundreds of servers (HPC nodes, database, custom applications) and thousands of desktop clients (home and shared directories).
By the summer of 2010, CPU utilization on Tungsten-A pegs, at which point performance problems severely impact daily use. We convert Tungsten-B from Standby to Active. Early in 2011, CPU on both systems pegs; we launch an emergency project to purchase another NetApp onto which we offload a particularly IO intensive application.
Storage Stumbles - Background
Severity: Major (unplanned)
Start: Wednesday, March 23, 2011 16:31
Stop: -
Duration: ongoing
Scope: 3Par storage and systems reliant on it (NetApps, vColo, others)
Description: The 3Par system experienced a drive failure, which caused a large latency spike. One of the NetApp heads subsequently lost access to the 3Par and initiated a failover to the other NetApp head. We are currently consulting with the vendors involved in order to determine next steps.
Service/User Impact: A number of systems have been impacted, including Zimbra, various Internet Services web servers, Outlook Web Access (partial), many others.
Technician/IT Operations Group performing work: xxx xxx, Center IT, InfraOps
---------------------------------------------------------------------------------
Cleaning this up took a week. After much discussion, NetApp determined that our storage admin had followed an incorrect procedure some months earlier, in retiring several LUNs. This left ‘ghost’ traces behind which, for unknown reasons, triggered the latency spike and subsequent head failover.
Storage Stumbles – First Event
Severity: Major (unplanned)
Start: Tuesday, January 10, 2012 12:55
Stop: ongoing
Duration: ongoing
Scope: Consolidated Storage (Tungsten-A & Tungsten-B)
Description: Tungsten-A failed over to Tungsten-B due to a triple disk failure on the 3Par disk system. InfraOps is looking into the issue and will be working with system owners across the Center to get their machines up and running. We currently have a call open to NetApp. Right now all resources are running on Tungsten-B except for PHSDATA Aggregate 1 and ADMHOME. There may be a performance degradation since all the resources are running on a single head. When Tungsten-A is back to a stable condition, we will be scheduling a give-back of resources to that head. In the meantime we strongly suggest that all systems owners turn off any non-critical guests as this will help alleviate the load.
Service/User Impact: All services running on Tungsten-B have been impacted by the failover, including Zimbra, the Enterprise SQL clusters, EMS ... Tungsten-B cannot see a couple of disks, therefore there are resources that will be affected – known resources are PHSDATA Aggregate 1 and ADMHOME directories.
Technician/IT Operations Group performing work: xxx xxx, Center IT, InfraOps
Storage Stumbles – Second Event
Severity: Major (unplanned)
Start: Wednesday, February 01, 2012 16:30
Stop: ongoing
Duration: ongoing
Scope: Consolidated Storage (Tungsten-A)
Description: Tungsten-A failed over to Tungsten-B due to a disk failure on the 3Par disk system. Infrastructure Operations is looking into the issue and will be working with system owners across the Center to get their machines up and running.
Service/User Impact: All services running on Tungsten-A have been impacted by the failover. CIT is currently working to resolve the issue and will keep communications open as we go forward.
Technician/IT Operations Group performing work: xxx xxx, Center IT, InfraOps
Storage Stumbles – Third Event
Storage Stumbles – Ops Team Input
The latency-sensitive VMs (typically old versions of SuSE for which we do not know how to adjust disk timeout parameters) running in vColo regularly complain about disk access, flag their file systems as read-only, and require reboots … this has been going on since the summer of 2010, when we moved chunks of vColo to Tungsten.
Our Telemetry charts show read/write latency spikes on the T800 whenever it fries a disk … and it fries far more disks than any other system at the Center.
Review your printed copy of Ops Team Briefing Storage Stumbles, along with the contents of the Diagrams folder (that folder includes the Timeline which management wants you to build – we’ve done that for you).
See Storage-Stumbles/Report-to-Management.ppt for the format in which management likes to see reports. Remember, IT mgmt is mostly composed of business folks, not technologists. You may want to produce two reports: one aimed at us – where you get to explain all the cool stuff – and the other report aimed at mgmt.
Make the mgmt report the Fisher Price version: keep it simple, speak their language, one page only.
Storage Stumbles – Mgmt Direction
You have the CIO’s attention: These events have knocked out most IT services company-wide for hours.
The CIO and his team expect our systems to behave predictably – your job is to figure out why they don’t. For example, if Consolidated Storage is suffering so badly, why do some systems float through the event without issue while others crash and require days of recovery? Why does Cobalt fry so many disks per year while our other systems don’t lose any?
1. Sanity-check Cobalt disk failure rate against industry averages
2. Build a Timeline for the week surrounding the Incident
3. Explain why Cobalt disk maintenance seems to trigger Tungsten failovers
4. Explain why Tungsten failovers do not go smoothly
5. Explain why different clients/services behave differently
6. Explain the impact of Tungsten CPU utilization on our failover capabilities
7. Describe how we fumbled the communication to end-users
8. Propose next steps
Your Mission
There is way too much in this case study for you to tackle over the next few hours
Focus on one, maybe two, of the eight tasks, produce a one-page report for mgmt
Feeling frisky? Produce a 1-2 page report aimed at the tech-savvy managers
1. Sanity-check Cobalt disk failure rate against industry averages
   Lots of googling & reading (see Misc/History-of-Cobalt-Frying-Disks)
2. Build a Timeline for the week surrounding the Incident
   Ops team has done that already (see Diagrams)
3. Explain why Cobalt disk failure triggers Tungsten failovers
   Complex, requires a sophisticated understanding of SCSI (see Incidents)
4. Explain why Tungsten failovers do not go smoothly
   Hard but interesting (see Incidents)
5. Explain why different clients/services behave differently
   Involves a rich understanding of various clients, protocols, and HA schemes
6. Explain the impact of Tungsten CPU utilization on our failover capabilities
   I predict that you’ll learn a thing or two about ONTAP (see Misc/Tungsten-Struggles)
7. Describe how we fumbled the communication to end-users
   Let the Ops Team do that
8. Propose next steps
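For task 1, a common way to compare against published figures is the annualized failure rate (AFR): failures divided by drive-years of service. A minimal sketch follows. The 528-drive count comes from the T800 described in the background (which the briefing appears to call Cobalt); the failure count and observation period are placeholders, not real numbers, so substitute what you find in the history folder.

```python
def annualized_failure_rate(failed_drives: int, drive_count: int, years_observed: float) -> float:
    """Failures per drive-year, as a fraction (0.02 means 2% per year)."""
    return failed_drives / (drive_count * years_observed)

# drive_count is from the case background; failed_drives and years_observed
# are hypothetical placeholders chosen only to show the arithmetic.
example = annualized_failure_rate(failed_drives=20, drive_count=528, years_observed=2.0)
print(f"Example AFR: {example:.1%}")
```

This treats the population as constant over the period, which is a simplification if drives were added or retired along the way.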
Welcome to the first meeting of the Storage Stumbles PSG, today is January 12, 2012. The meeting kicks off with the Ops team delivering their briefing. Read it.
We start the RPR Methodology working together as a class.
Step #1: Understand the Problem
What questions do you want to ask of the various constituencies?
Database, Desktop, Helpdesk, Manager, Network, Server, Storage, User, Vendor …
After Understand the Problem, we will separate into small groups and proceed with:
Step #2: Choose One Symptom
Set the Stage
Large group | Small group
1. Understand the Symptoms
2. Choose One Symptom
3. Draw the Diagram
4. Design Capture Plan
5. Capture Diagnostic Data
6. Analyze Captured Data
7. Identify Fix
8. Implement Fix
9. Verify Fix
RCA Methodology
Derived from Advance7’s Rapid Problem Resolution® methodology
Phase 1
Phase 2
Phase 3
What aspect of the process would you like to review?
What section of a diagram or graph would you like to explore?
What hunk of data would you like to re-examine?
Which link in the chain of reasoning doesn’t make sense to you?
Additional questions?
This is your opportunity to consolidate your learning.
Q&A
http://xkcd.com/378/
Real Programmers
Tech Flash: Sanity-Checking Throughput Claims
Sometimes, I’ll hear a tech claim that we need a fatter WAN pipe, because the file copy | backup job | database synchronization | whatever is slow:
“I’m only getting 400 MB/hour to Chicago: we need to rent a fatter network pipe.”
Well, we geeks often confuse ourselves when translating between bits per second and bytes per second … this whole performance zone is a popular place for error … and fatter WAN pipes are expensive. Let’s sanity check this claim.
Name               Bit Rate     Effective Data Rate*
Vanilla Ethernet   10Mb/s       1MB/s
Fast Ethernet      100Mb/s      10MB/s
Gigabit Ethernet   1000Mb/s     100MB/s
Ten Gig Ethernet   10000Mb/s    1000MB/s (aka 1GB/s)
Assume we have a 100Mb/s pipe to Chicago: would buying a fatter one help?
*These numbers constitute an easily remembered rule-of-thumb: well-tuned clients/servers can actually deliver 10-15% better than this
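The arithmetic behind the sanity check can be sketched as follows. It converts the claimed 400 MB/hour into per-second units and compares it with the rule-of-thumb capacity of a 100Mb/s pipe; the variable names are illustrative only.

```python
def mb_per_hour_to_mbit_per_sec(megabytes_per_hour: float) -> float:
    """Convert megabytes per hour to megabits per second (8 bits per byte)."""
    return megabytes_per_hour * 8 / 3600

claimed = 400                       # MB/hour, from the tech's complaint
pipe_mbit = 100                     # Mb/s, the assumed pipe to Chicago
effective_mbyte = pipe_mbit / 10    # rule of thumb from the table: Mb/s / 10 = MB/s

used_mbit = mb_per_hour_to_mbit_per_sec(claimed)
used_mbyte = claimed / 3600
print(f"Observed: {used_mbyte:.2f} MB/s = {used_mbit:.2f} Mb/s")
print(f"Pipe can carry roughly {effective_mbyte:.0f} MB/s")
print(f"Share of pipe in use: {used_mbyte / effective_mbyte:.1%}")
```

The transfer works out to about 0.11 MB/s, roughly 1% of what the pipe can carry, so the bottleneck probably lies somewhere other than raw bandwidth and a fatter pipe is unlikely to help by itself.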
Tips & Tools
Wiggly Charts are Overrated
Validate the Diagram
Rich Pingers
Draw the Pie
When to Use Full-Blown RPR
Musings on IT Architecture
Wiggly Charts are Overrated
My Solexa run failed last night shortly before midnight. You can see that Fred’s switch port was extremely busy then, far busier than usual, and you can also see the IO spike which happened at the same time. Therefore, Fred needs a 10GigE NIC and faster disks.
Your task:
• Think of ways to support this
• Think of ways to refute it
Validate the Diagram
Send a sample transaction from one end of the infrastructure to another, capturing along the way. If your diagram is accurate, you’ll see the transaction at every single capture point
If you don’t see that transaction … then you know your diagram is inaccurate: return to Draw the Diagram
Once you’ve validated the diagram, you are positioned to capture the pathology you’re investigating
Sample transaction: Write a test file, update a database record, send a Rich Ping … in each case, include easy-to-spot ASCII
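Checking whether the marker reached every capture point can be done by hand in Wireshark, or roughly automated. The sketch below simply searches each capture file's raw bytes for the ASCII marker. It assumes the payload travels unencrypted and that the marker fits within a single frame; the function name and file names are hypothetical.

```python
from pathlib import Path

def captures_containing(marker: str, capture_files) -> dict:
    """Map each capture file to True/False: does it contain the ASCII marker?"""
    needle = marker.encode("ascii")
    # Reads each file fully into memory; fine for modest captures.
    return {str(path): needle in Path(path).read_bytes() for path in capture_files}

# Example call (file names are placeholders for your own capture points):
#   seen = captures_containing("slowness-starting-now--marker",
#                              ["client.pcap", "switch.pcap", "server.pcap"])
#   missing = [name for name, found in seen.items() if not found]
```

Any capture point that comes back False is a hint that the diagram, or the capture filter at that point, needs another look.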
TextPing
http://www.packetiq.com/Tools/PacketIQ-TextPing.aspx
Command Line
Send a TCP port 2049 frame to server.company.com
host> echo "Starting NFS Mount now --marker" | nc -4 -w 1 server.company.com 2049
C:\Temp> echo "Starting NFS Mount now --marker" | ncat -4 -w 1 server.company.com 2049
For Windows, install the open source ncat utility http://www.insecure.org, part of the Nmap distribution
Send a UDP port 666 frame to server.company.com
host> echo "Starting app now --marker" | nc -4 -w 1 -u server.company.com 666
C:\Temp> echo "Starting app now --marker" | ncat -4 -w 1 -u server.company.com 666
Create a file, the name of the file will appear in Wireshark’s Summary screen
host> touch /mnt/whatever/slowness-starting-now--marker.txt
C:\Temp> copy /y nul z:slowness-starting-now--marker.txt
Drop the message into /var/log/syslog on loghost
host> logger -l loghost.company.com slowness starting now --marker
C:\Temp> logger -l loghost.company.com slowness starting now --marker
For Windows, install the freeware logger utility http://www.monitorware.com/logger
Drop the message into the Web server’s logs:
host> wget http://www.company.com/slowness-starting-now--marker.html
C:\Temp> wget http://www.company.com/slowness-starting-now--marker.html
For Windows, install the open source GNU wget utility
Ping
In a pinch, you can use ping, manually maintaining a written table associating ping packet length to message:
C:\Temp> ping -n 1 -l 101 server.company.com
C:\Temp> ping -n 1 -l 102 server.company.com
C:\Temp> ping -n 1 -l 103 server.company.com
(-n and -l are the Windows ping options; on Linux the equivalent is ping -c 1 -s 101 server.company.com)
Ping Packet Length   Event
101 bytes            Mounting file system
102 bytes            Starting application
103 bytes            Slowness beginning now
Or, depending on your filters, ping a fake host … the ping won’t show up in the trace, but the failed DNS query will:
host> ping www.slowness-starting-now--marker.com
C:\Temp> ping www.slowness-starting-now--marker.com
Send-UDP-Msg
http://www.skendric.com/app
vishnu> ./send-udp-msg -m "This is a test ping" rhino1 rhino2 rhino3
vishnu>
Or write your own … here’s mine
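If you do want to write your own, the core fits in a few lines. What follows is a minimal sketch in Python, not the send-udp-msg script referenced above: it sends one UDP datagram carrying an ASCII marker to each host. The function name is made up, and the default port 666 is borrowed from the nc example earlier.

```python
import socket

def send_udp_marker(message: str, hosts, port: int = 666) -> int:
    """Send one UDP datagram carrying `message` to each host; return how many were sent."""
    payload = message.encode("ascii")
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for host in hosts:
            # UDP is fire-and-forget: no listener is needed, the frame still
            # crosses the wire and shows up in captures along the path.
            sock.sendto(payload, (host, port))
            sent += 1
    return sent

# Example call (host names are from the slide above):
#   send_udp_marker("This is a test ping", ["rhino1", "rhino2", "rhino3"])
```

A host name that does not resolve raises an exception, which is arguably what you want when validating a diagram.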
Many problems are intermittent – you set your debugs and packet captures going and then wait hours/days/weeks for the issue to reoccur. How might one capture across such long time frames?
Ring Buffer
Most capture utilities can produce a ring buffer of files. In this example, dumpcap writes captured frames to a file named in the following way:
server-side_00001_20130325120842.pcap
where the first field is a serial number and the second field encodes the date/time at which this capture file was started. After it has captured 50,000 kB (dumpcap’s filesize option is expressed in kB, so roughly 50 MB), it gets started on the next file:
server-side_00002_20130325120958.pcap
where the timestamp again records the start time of that file. Dumpcap repeats this until it has written 10,000 files, whereupon it starts deleting the oldest files, in order to limit the number of files to 10,000.
Windows
dumpcap -i 1 -b files:10000 -b filesize:50000 -f "ip host 10.1.2.3 and not ip host 192.168.20.30" -w c:\temp\cabrini\server-side.pcap
Linux
/usr/sbin/dumpcap -i eth1 -b files:10000 -b filesize:50000 -f "ip host 10.1.2.3 and not ip host 192.168.20.30" -w /home/skendric/cabrini/server-side.pcap &
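Before starting a capture that may run for weeks, it is worth checking how much disk the ring buffer can consume. A quick sketch of the arithmetic, assuming dumpcap's filesize value is in kB and treating 1 GB as 10**6 kB; the function name is illustrative.

```python
def ring_buffer_budget_gb(files: int, filesize_kb: int) -> float:
    """Upper bound on disk used by a full ring buffer, in GB (1 GB = 10**6 kB)."""
    return files * filesize_kb / 1_000_000

budget = ring_buffer_budget_gb(10_000, 50_000)   # settings from the commands above
print(f"Ring buffer can grow to about {budget:.0f} GB")
```

With the settings shown, that is about 500 GB, so make sure the target volume has that much room or reduce one of the two values.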
Long-Term Captures - CLI
Or perhaps you prefer the GUI
Long-Term Captures - GUI
Extract Packets
OK, so now you have 10,000 files, and you want to look at an incident which occurred between noon and 1:00pm on 2013-03-25
The names of the files allow you to focus on the window in question, so you copy those to a working directory. But that can still be a lot of files, in a busy environment. Perhaps you realize that you only care about DNS frames. I write little scripts to extract the interesting packets and merge them into a single file.
Windows
@echo off
rem Extract DNS frames from every capture file, then merge them into one file
setlocal ENABLEDELAYEDEXPANSION
mkdir c:\temp\cabrini\extract
cd /d c:\temp\cabrini
FOR /F %%a IN ('dir /b *.pcap') DO (
  echo Processing %%a
  rem -Y applies a display filter in tshark 1.10 and later
  tshark -r %%a -Y "udp.port==53 or tcp.port==53" -w extract\%%a-filtered.pcap
)
cd /d c:\temp\cabrini\extract
rem Build a space-separated list of the filtered files
set filelist=
FOR /F %%a IN ('dir /b *.pcap') DO (
  set "filelist=!filelist! %%a"
)
mergecap -w c:\temp\cabrini\server-side-extract.pcap %filelist%
Linux
http://www.skendric.com/problem/rca/extract-frames
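Picking the files that cover the window of interest can be done from the file names alone. The sketch below assumes the dumpcap naming scheme shown earlier (serial number, then a YYYYMMDDhhmmss start time) and treats each file as covering the span from its own start until the next file starts. The function name and the sample names are illustrative.

```python
import re
from datetime import datetime

NAME_PATTERN = re.compile(r"_(\d+)_(\d{14})\.pcap$")

def files_in_window(names, start: datetime, end: datetime):
    """Return the capture files whose contents may overlap [start, end]."""
    stamped = []
    for name in names:
        match = NAME_PATTERN.search(name)
        if match:
            started = datetime.strptime(match.group(2), "%Y%m%d%H%M%S")
            stamped.append((started, name))
    stamped.sort()
    selected = []
    for index, (started, name) in enumerate(stamped):
        # A file holds frames from its own start until the next file begins.
        next_start = stamped[index + 1][0] if index + 1 < len(stamped) else datetime.max
        if started <= end and next_start > start:
            selected.append(name)
    return selected

example_names = [
    "server-side_00001_20130325115000.pcap",
    "server-side_00002_20130325120842.pcap",
    "server-side_00003_20130325120958.pcap",
    "server-side_00004_20130325130500.pcap",
]
picked = files_in_window(example_names,
                         datetime(2013, 3, 25, 12, 0, 0),
                         datetime(2013, 3, 25, 13, 0, 0))
print(picked)
```

Note that the file started at 11:50 is included, because it was still being written when the window opened at noon.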
Client / Network / Server Pie
Contribution to the Problem
Client    5 seconds     User, application, memory, CPU, disk …
Network   120 seconds   Switches, Routers, Firewalls, VPN Tunnels …
Server    15 seconds    Application, OS, memory, CPU, storage …
I find drawing the CNS Pie useful when analyzing performance issues:
How much does the Client contribute to the Problem? The Network? The Server?
http://www.skendric.com/app
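Turning the three timings into slices of the pie is simple division. A small sketch using the example figures from this slide (5, 120, and 15 seconds); the function name is made up for illustration.

```python
def cns_shares(client_s: float, network_s: float, server_s: float) -> dict:
    """Fraction of total elapsed time attributable to each component."""
    total = client_s + network_s + server_s
    return {
        "Client": client_s / total,
        "Network": network_s / total,
        "Server": server_s / total,
    }

shares = cns_shares(5, 120, 15)
for part, share in shares.items():
    print(f"{part:8s}{share:6.1%}")
```

In this example the network accounts for roughly 86% of the elapsed time, which tells you where to point the capture gear first.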
Rule-of-thumb
Application architecture: 1,000x
• SQL, query optimization, caching, system calls
Server & Storage Configuration: 100x
• Disk striping, spindle tiering, paging, NFS tuning
Application fine-tuning: 2-10x
• Threads, asynchronous I/O
Kernel tuning: less than 2x
- Caveats:
• If kernel bottleneck is present, then 10-100x
• Kernel can be a binary performance gate
Version 3.10
Copyright 1994-2007 Hal Stern, Marc Staveley, System & Network Performance Tuning, LISA 2007
Tuning Potential
Advance7 is a consulting outfit which helps customers resolve critical Problems – they put an analyst at your site to coordinate your staff plus vendors to fix the issue, using the RPR methodology.
They designed RPR to work against Grey Problems.
Most Problems are not Grey … unless the Problem is Grey, RPR is overkill.
So what are Grey Problems?
The following slides are cribbed from Advance7 materials -- full credit to Paul Offord & colleagues.
Rapid Problem Resolution ®
The Grey Problem
                     Single Incident   Recurring Problem
Technology Known     Q1                Q2
Technology Unknown   Q3                Q4
Q1: Single Incident, Technology Known
• Change-related Cause
• Hardware Failure
• Software Failure
• Misconfiguration
• Operations Error
Q2: Recurring Problem, Technology Known
• Intermittent Hardware Failure
• Known Error
• Intermittent Software Failure
Q3: Single Incident, Technology Unknown
• User Error
• Operations Error
• Rare Software or Hardware Error
Q4: Recurring Problem, Technology Unknown
• Intermittent Application Error
• Poor Application Logic
• Transient Overload
• Intermittent Infrastructure Error
• Incorrect Failover Operation
The majority of issues that are passed to 2nd and 3rd line technical support teams are investigated in a straightforward manner. The nature of the issue or an indication from a monitoring system identifies the failing component and the issue is allocated to the correct technical support team. Q1: the bulk of support work falls into this area. Q2 is harder but tends to be resolved by experienced support staff. Q3 is tough; we tend not to solve these.
An intermittent response-time or error issue is not so easily handled due to its transient nature. Not only does the cause sneak under the radar of monitoring systems, but investigation often starts after the issue has passed, making it impossible to use many of the tools available. The result is a recurring problem where the causing technology is unknown: Q4, aka the Grey Problem. The Rapid Problem Resolution methodology targets Q4.
Grey Problem Characteristics
Because the causing technology is unknown, a grey problem will bounce between Technical Support Teams as each in turn produces evidence (often in the form of a health check) to prove that their technology is not to blame.
Typical characteristics of a grey problem
• An ever-growing number of people become involved
• Long meetings to discuss what might be the cause
• Support people shy away from becoming involved
• Repeated changes with no clear reason or objective
Consequences of grey problems
• An ever-growing backlog of problems
• A fog that hinders the investigation of other, more urgent problems
• A growing pool of problems that escalate into Major Incidents as patterns of use and business priorities change
• Wasted IT budget as money is spent on poorly targeted upgrades
• Barriers to integration due to concerns about the stability of component systems
• Loss of confidence and satisfaction with the IT department
• Pressure to outsource IT services
• Reduced customer satisfaction
• Higher costs as the business adjusts to accommodate the problem
• Higher IT staffing costs
[Figure: Service Desk, Incident Management, Problem Management; Server, Desktop, Network, and Application teams, each with a Vendor; a "?" marks the problem bouncing among them]
The RPR Methodology
1.2 Choose One Symptom
The single largest reason I’ve thrashed in my RCA career.
1.4 Draw the Diagram & Sit with the User
If I can’t draw it, I don’t understand it & Seeing leads to Understanding
2.2 Definitive Diagnostic Data
Insert capture gear at critical points along the path, synchronize time using a distinctive transaction, capture data simultaneously from all points while replicating the pathology.
Hardest to implement; most likely to make you successful.
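Synchronizing time with a distinctive transaction boils down to noting when each capture point saw the marker and subtracting the differences. A minimal sketch follows, assuming you have already read the marker's timestamp (in seconds) out of each trace. It ignores propagation delay between points, which is usually small compared with clock skew. The function names and sample values are hypothetical.

```python
def clock_offsets(marker_seen_at: dict, reference: str) -> dict:
    """Seconds by which each capture point's clock runs ahead of the reference point."""
    base = marker_seen_at[reference]
    return {point: seen - base for point, seen in marker_seen_at.items()}

def align(timestamp: float, point: str, offsets: dict) -> float:
    """Translate a timestamp from one capture point onto the reference clock."""
    return timestamp - offsets[point]

# Hypothetical marker timestamps, in seconds, as recorded at each capture point.
marker = {"client": 100.0, "switch": 103.25, "server": 98.5}
offsets = clock_offsets(marker, reference="client")
print(offsets)
print(align(110.25, "switch", offsets))   # the same instant on the client's clock
```

Once every trace is on one clock, you can line up the pathology across capture points instead of guessing which frame corresponds to which.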
Key Elements of RPR
Full methodology
http://www.skendric.com/problem/rca/RPR-RCA-Methodology.pdf
Checklist
http://www.advance7.com/misc/rpm_wb.html
Manual
http://www.advance7.com/information/publications
RPR References
• Business stake holders want features• We IT geeks love to turn every fancy knob
The result is complexity
Complexity is the enemy of uptime … and the raison d’être for RCA
Insights from our gurus
Increasingly, people seem to misinterpret complexity as sophistication, which is baffling - the incomprehensible should cause suspicion rather than admiration. --Niklaus Wirth
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. --Brian Kernighan
KISS
Musings on Architecture
http://www.skendric.com/philosophy/uptime
Folks ask me how I learn this stuff
Key contributors to my path: Mentorship, Failure, Independent Study, Training
Mentorship
I have had the good fortune to work for highly skilled people who have mentored me.
• During 1991-1993, I joined my boss on Saturdays … worked as his gofer boy … he used the opportunity to train me
• In the mid-2000s, we hired Mike Pennacchi to coach us. Mike came on-site once/month for a half-day; we brought whatever problem was troubling us to the session; Mike would not solve it for us … rather, he would coach us through solving it. We did this for ~three years before budget contraction interfered
Failure
I have had the good fortune to work for bosses who believe that we learn through mistakes … “Fail early and often” … I’ve learned a lot this way
Independent Study
I set aside a slot every week (mostly!) to practice what I’ve learned, push myself to learn something new … typically a weekend morning
Training
I have had the good fortune to work for bosses with training budgets … I typically spend a couple weeks per year in classes … I occasionally augment this using my own shekels
Musings on Skill
Humans (all living creatures) are wired for fast-twitch: our nervous systems respond rapidly to the unusual, not to the mundane. So a bomb explodes, kills three people: that fires our adrenalin … but the annual toll of smoking (~500,000 per year in the US alone) drifts past our consciousness without a quiver.
Of course we’re wired this way … that’s how we stayed alive on the African savannah: by paying attention to the howl of the hyena, rather than to the gradual constriction of our arteries
But the result is that we have trouble paying attention to slow-twitch threat … to saving for retirement or a rainy day, daily exercise, spending time with our family … investing in the power grid, roads, bridges, anything which seems like a long way off …
Ditto with IT – we focus on the glitzy new projects, ignore the underpinnings … until the infrastructure breaks catastrophically … that drama fires our nervous systems, then we pay attention (for a while)
I don’t have a solution for this design flaw (trade-off) in our brains … but it does keep me employed, as a Problem Manager and a Problem Analyst. If we maintained our infrastructure (shrank technical debt), many of our RCAs would not occur
Musings on Nervous Systems
It has been said that man is a rational animal.
All my life I have been searching for evidence which could support this
--Bertrand Russell
Your brain will be predisposed to certain answers and will cling to them, blinding you to reality
Definitive Data Capture is RPR’s effort to counteract this tendency
I wish you success in scrabbling for rationality
This is Hard
Insight
http://xkcd.com/1215/
On-Line Resources
Rapid Problem Resolution by Paul Offord
LinkedIn Protocol Analysis & Troubleshooting Group
Old Comm Guy http://www.lovemytool.com
Trouble-shooting & Training Outfits Based Here (will travel for $$)
James Baxter http://www.packetiq.com Daytona Beach, FL
Tony Fortunato http://www.thetechfirm.com Toronto, Canada
Chris Greer http://www.packetpioneer.com Central America
Paul Offord http://www.advance7.com London (international)
Mike Pennacchi http://www.nps-llc.com Seattle, WA
Ray Tompkins http://www.gearbit.com Austin, TX
…
Conferences
Sharkfest http://www.sharkfest.org Berkeley, CA
Follow-up stuart.kendrick.sea {at} gee mail dot com
Thank you