Information Technology - Discover the Root Cause and Develop a solution through structured processes
John Hudson & Matt Fourie
5 November 2012
Go Direct to the Root Cause: itRCA the solution?
“Most incident investigators ask the wrong questions, so do not change your people but change the questions they are asking”
Matt Fourie
Agenda
• Introduction
• Current situation
• Components of a credible approach
• Minimalistic information, being specific and knowledge (wisdom) creation
• The Three critical investigation skills
  1. Service Recovery Analysis
  2. Technical Cause Analysis
  3. Root Cause Analysis
• Client outcomes
• Questions & answers
Some of our recent clients...
• Barclays IT
• ANZ IT Division
• Macquarie ITG
• Unisys
• Polypore IT
• Medtronic IT
• SITA Global
• BT Financial
• Westpac IT
• McDonalds IT
• Queensland Police IT
• Lockheed Martin Space Systems
• SPARQ IT
Thinking Dimensions
• Thinking Dimensions International - operating KEPNERandFOURIE company initiatives for the last 25 years
• Specialising in RCA Methodology for IT Incident and Problem Management
Global Presence
• Americas: Canada, Chile, Peru, USA
• EMEA: Germany, Italy, Netherlands, Poland, Saudi Arabia, South Africa, Spain, Turkey, United Kingdom
• Asia Pacific: Australia, China, India, South Korea, Thailand, Singapore
• Baxter International, Blue Cross Blue Shield, Bosch, Caltex Oil, Carraro, Crown Cork and Seal, Dometic, Electrolux, Federal Judiciary Center, General Dynamics IT, Hollister, Inc, Infineon, BASF, Macquarie Bank IT, BT Financial IT, Stihl, Westpac IT, Maersk, Norfolk Naval Shipyard, Selig, Siemens, SITA, SKF
The Current Dilemma
[Figure: PAST / NOW / FUTURE timeline showing a DEVIATION from the STANDARD, labelled itSRA® (Service Recovery Analysis), itTCA® (Technical Cause Analysis) and itRCA® (Root Cause Analysis)]
The Three Skills…
[Figure: an Incident handled through three skills]
1. itSRA® Service Recovery Analysis: Recovery & Containment Tools & Templates
2. itTCA® Technical Cause Analysis: Technical Cause Process and Techniques
3. itRCA® Root Cause Analysis: Root Cause & FIX Checklist & Templates
Current Default Root Causes
• Hardware
• Software
• “Human Error”
• Environment
Technical Cause
Root Cause
Incisive Thinking
Incident Statement | Technical Cause | Root Cause
Internet Banking Degrading | New browser configuration issue | Integrative testing not done properly
Encrypted “hello” message not returned | ‘Beta’ Certificate used | Policy requirements for “production” environment not adhered to
Incisive Thinking
Incident Statement | Technical Cause | Root Cause
G-Force System Freezing | High volume | Too many users allowed access
G-Force SQL DB thread count exceeding maximum | G-Force program not closing out threads | Vendor implemented an untested program update
Basic phases of problem solving
Procedure for addressing an Incident
1. State the purpose
2. Gather incident/problem detail
3. Evaluate for causes
4. Confirm technical/root cause
   1. Testing
   2. Verifying cause
[Diagram labels: Divergent Thinking, Convergent Thinking, Factual information gathering, Intuitive analysis of own suggestions]
Good RCA…
YOU NEED TO SOLVE AN INCIDENT:
• QUICKLY [Service Recovery]
• ACCURATELY [Technical Cause]
• PERMANENTLY [Root Cause]
Factors in minimalistic approach
Factor | IS | BUT NOT
What
Where
When
How
Why
Who

I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.
I send them over land and sea,
I send them east and west;
But after they have worked for me,
I give them all a rest.
Rudyard Kipling
Extreme Focus With “Specificity”
Specificity Rules
• One object one fault
• Single-minded & simplistic
• Highly focused
• Must find the correct entry point
• Ask a question, expect an answer
Object | Fault
Servers | Not communicating
Data | Not transferred. Sent but not received by receiving servers
Data for Large Outlets | Not received
Sales turnover numbers for Large Outlets | Not received
“The key to success is to be insistent about specificity: the more specific you are, the better your chances to solve an incident.”
KEPNERandFOURIE
Creating Intelligence
DATA (IS) | INFORMATION (BUT NOT) | KNOWLEDGE (WHY NOT)
Internet Banking | Intranet Banking | Different routing, SSL handshake
Slow | Freezing | Volume?
APAC users | USA, UK | ADSL lines
Started Oct 1 | Before | New passwords
Continuous | After 4pm | Different routing
Unexpected Outcomes
• “BUT NOT” clarifies the facts
• Creates a curious “contrast”
• Looking at answers at a “granular level”
• Stimulates deductive reasoning
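A minimal Python sketch of the IS / BUT NOT / WHY NOT contrast, using the Internet Banking values from the Creating Intelligence slide. The `Contrast` class, the `recurring_distinctions` helper and the pairing of each WHY NOT idea with a particular row are assumptions made for illustration; the slide only lists the three columns.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Contrast:
    """One row of the grid: a fact, its contrast, and ideas explaining the gap."""
    is_fact: str
    but_not: str
    why_not: List[str] = field(default_factory=list)


# Values taken from the slide; the row-by-row pairing is an assumption.
GRID = [
    Contrast("Internet Banking", "Intranet Banking", ["Different routing", "SSL handshake"]),
    Contrast("Slow", "Freezing", ["Volume?"]),
    Contrast("APAC users", "USA, UK", ["ADSL lines"]),
    Contrast("Started Oct 1", "Before", ["New passwords"]),
    Contrast("Continuous", "After 4pm", ["Different routing"]),
]


def recurring_distinctions(grid: List[Contrast]) -> List[Tuple[str, int]]:
    """Count how often each WHY NOT idea appears, most frequent first."""
    counts = {}
    for row in grid:
        for idea in row.why_not:
            counts[idea] = counts.get(idea, 0) + 1
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for idea, count in recurring_distinctions(GRID):
        print(f"{count} x {idea}")
```

An idea that recurs across several contrasts, here "Different routing", could be a reasonable first candidate to test, although the slides do not prescribe this ranking.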
Service Recovery [MTR]
FACTOR | IS | BUT NOT
OBJECT | Mobile website access | PC website access
FAULT | Denied - not authorized | Slow/freezing
WHO | Blackberry users | Other Smart phones
WHERE | Asia | ANZ, UK, USA
IMPACT | Customer complaints |
PATTERN | Sporadic | Continuous
REQUIREMENT | ACTIONS TO CONSIDER
WHAT TO RESTORE
WHAT PROBLEMS TO REMOVE
WHO
WHERE
TO WHAT EXTENT
FOR HOW LONG
Service Recovery [MTR]
Statement: Restore website access to customers
Possible Actions:
1. Upload or switch on simple site maintenance page
2. Set up or start up back up service
3. Reroute 20/80 service all to back up service
4. Restrict access to low load tasks only
5. Allow access based on region
Various actions to meet key requirements (columns 1 to 5 are the Possible Actions):
Key Solution Requirements | 1 | 2 | 3 | 4 | 5
1. Provide access to client to at least receive interim non-availability notice | 0 | 3 | 2 | 1 | 3
2. No loss of Data | 3 | 3 | 0 | 0 | 1
3. Should not impact System Performance | 1 | 0 | 3 | 1 | 0
4. ADSL compatible for Asia | 1 | 2 | 0 | 0 | 0
5. Improve reliability | 3 | 0 | 3 | 1 | 1
6. Implementation within the hour | 1 | 3 | 3 | 1 | 2
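The scoring grid for "Restore website access to customers" can be tallied with a short Python sketch. The scores and action names are transcribed from the slide. Summing each action's scores with equal weight per requirement is an assumption, since the slide does not state how the scores are combined. The names `action_totals` and `rank_actions` are hypothetical helpers.

```python
ACTIONS = [
    "Upload or switch on simple site maintenance page",
    "Set up or start up back up service",
    "Reroute 20/80 service all to back up service",
    "Restrict access to low load tasks only",
    "Allow access based on region",
]

# One list per requirement, one score per action (columns 1 to 5 on the slide).
SCORES = {
    "Interim non-availability notice": [0, 3, 2, 1, 3],
    "No loss of data": [3, 3, 0, 0, 1],
    "Should not impact system performance": [1, 0, 3, 1, 0],
    "ADSL compatible for Asia": [1, 2, 0, 0, 0],
    "Improve reliability": [3, 0, 3, 1, 1],
    "Implementation within the hour": [1, 3, 3, 1, 2],
}


def action_totals(scores):
    """Sum each action's scores across all requirements (assumed aggregation)."""
    return [sum(column) for column in zip(*scores.values())]


def rank_actions(actions, scores):
    """Pair each action with its total and list the highest totals first."""
    totals = action_totals(scores)
    return sorted(zip(actions, totals), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for action, total in rank_actions(ACTIONS, SCORES):
        print(f"{total:>2}  {action}")
```

Under this equal-weight assumption, actions 2 and 3 tie on 11 points, so a real decision would still need a tie-breaker such as weighting "Implementation within the hour" more heavily.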
Technical Cause Analysis [TCA - MTTR]
DIMENSION | IS | BUT NOT | WHY NOT
OBJECT
FAULT
USERS
WHERE
TIMING
PATTERN
CYCLE
OBJECT – What object and which other object(s) not?
FAULT – What fault and which other typical faults not?
USERS – Who has the problem and who does not?
WHERE – Where are these users and where could they have been but are not?
TIMING – When did it first happen and when not?
PATTERN – What is the pattern of faults and what could it have been but is not?
CYCLE – In which cycle does the problem occur and in which cycle does it not occur?
Technical Cause Analysis [TCA]
DIMENSION | IS | BUT NOT | WHY NOT
Object | Fireburst V2.0 connection | E-Express, Mango connections | F/B upgrade from V1 to V2, Poor testing issue
Fault | Dropping | Freezing, slow | Time out settings, configuration of drivers
Location of Object | ANZ, USA, UK | Asia | LAN, Proxy server issues, F/Wall rules
Timing | Monday, Sept 2nd with SOB | Any time earlier than Sept 2nd | Java upgrade, Netscape upgrade
Pattern | Continuous | Sporadic, Periodic | Don’t know
Life Cycle | When doing a transaction | “x” time into transaction | Operator error, Code error on a specific page
Phase of Work | Just after logging in | Logging in or out | OS configuration issue, DNS issue
Possible Causes & Testing
1. Proxy server tampered with during the Java upgrade on the LAN: X
2. Java upgrade caused driver incompatibility with Fireburst website V2.0: √ √ X
3. Netscape upgrade caused driver incompatibility with Fireburst website V2.0: √ √ A1 √ √ √ √
A1 - Only if the staff in Asia did not upgrade to Netscape
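The testing step on the Fireburst slide can be sketched in Python. The marks are transcribed from the slide in the order shown. Reading √ as "the cause is consistent with this IS / BUT NOT row", X as "the cause is eliminated" and A1 as "consistent only under assumption A1" is an interpretation, as the slide does not define the symbols. `surviving_causes` is a hypothetical helper name.

```python
# Marks per possible cause: "ok" stands for a tick, "fail" for a cross,
# and any other value is the label of an assumption the cause depends on.
MARKS = {
    "Proxy server tampered with during the Java upgrade on the LAN":
        ["fail"],
    "Java upgrade caused driver incompatibility with Fireburst website V2.0":
        ["ok", "ok", "fail"],
    "Netscape upgrade caused driver incompatibility with Fireburst website V2.0":
        ["ok", "ok", "A1", "ok", "ok", "ok", "ok"],
}

ASSUMPTIONS = {"A1": "Only if the staff in Asia did not upgrade to Netscape"}


def surviving_causes(marks):
    """Drop every cause with a failed test; list the rest with their assumptions."""
    survivors = []
    for cause, results in marks.items():
        if "fail" in results:
            continue  # one contradiction is enough to eliminate a cause
        needed = [mark for mark in results if mark != "ok"]
        survivors.append((cause, needed))
    # Causes that rely on fewer assumptions come first.
    survivors.sort(key=lambda item: len(item[1]))
    return survivors


if __name__ == "__main__":
    for cause, needed in surviving_causes(MARKS):
        print(cause)
        for label in needed:
            print(f"  verify {label}: {ASSUMPTIONS[label]}")
```

Only the Netscape upgrade survives in this reading, and it still depends on verifying assumption A1 before it can be treated as the technical cause.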
A Case of a good thinking process
• Deviation Statement
• Factor Analysis
• Possible causal factors
• Testing the causal hypotheses
• Find the underlying reason(s) for incident
'The truth, if it exists, is in the details'
Bartlett, Familiar Quotations
The Right Starting Point
• Find the technical cause first
• Do 5 Whys to get to the systemic level
• Find the root cause(s)
• Fix the incident/problem for good
“If a team has not solved an incident, the person with the information was not invited”
Chuck Kepner
Four Questions to get Started
• Is the object deviation within the control of your own system? Can you fix the root cause with actions under your control?
• Is the object deviation within the control of your own system? Can you only fix the root cause with the vendor's help?
• Is the technical cause deviation in the vendor's system? Can you only fix the root cause with the vendor's help?
• Is the technical cause deviation in the vendor's system? We would only be able to take avoiding actions.
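The four starting questions above amount to a two-by-two decision. The Python sketch below is one way to express it. The function name `starting_point`, its two boolean parameters and the returned wording are assumptions for illustration, not part of the itRCA method as presented.

```python
def starting_point(in_own_system: bool, vendor_involved_in_fix: bool) -> str:
    """Map the two yes/no answers onto the four situations listed on the slide.

    in_own_system: the deviation lies within the control of your own system.
    vendor_involved_in_fix: the vendor takes part in fixing the root cause.
    """
    if in_own_system and not vendor_involved_in_fix:
        return "Fix the root cause with actions under your control"
    if in_own_system and vendor_involved_in_fix:
        return "Fix the root cause in your own system with the vendor's help"
    if vendor_involved_in_fix:
        return "Fix the root cause in the vendor's system with the vendor's help"
    return "Take avoiding actions only"


if __name__ == "__main__":
    for own in (True, False):
        for vendor in (False, True):
            print(own, vendor, "->", starting_point(own, vendor))
```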
RiskWise
ITRCAMax4
Root Cause Analysis [RCA]
DIMENSION | IS | BUT NOT
APPLICATION
DEVIATION
FUNCTION
WHO
WHERE
TIMING
FREQUENCY
APPLICATION: What application and which other applications not?
DEVIATION: What deviation do we have and which ones not?
FUNCTION: Which job/function/process is involved and which ones not?
USERS: Who has the problem and who does not?
WHERE: Where are these users and where could they have been but are not?
TIMING: When did it first happen and when not?
FREQUENCY: How frequently is the fault occurring?
Root Cause Analysis [RCA]
COMPONENT | CAUSAL FACTORS | CAUSAL ELEMENTS
Decision Making | Process and Collaboration for inputs | Critical stakeholder requirements not consulted for this task; Inadequate authority levels for making good decisions
Implementation issues | Resources and Scope & Definition of project | Poor decision process and documentation for this task; Inadequate standards guiding the decision making; Time Zone difficulties hampering effective decision making
Standard Operating Procedures | Applicability of SOP and Awareness of SOP | Unrealistic time, cost and performance expectations; Poor initial estimation of resources needed for the project; Poor updated approval data making the procedure unclear
Management | Management of Work and Staff | Poor work guidance/coaching for correct performance; Work standards for this task are not enforced; Poor management support in getting this task done
Measurement | KPIs and Roles & Responsibilities | KPI and metrics regarding this output not clear or absent; Poor feedback on this KPI; Duplication and gaps making roles and responsibilities difficult
Support | Internal and External Vendor support | Overuse of the SME causing sub-standard work; Poor continual vendor support for this output
Communications | Clarity of communications and instructions | Continual interruptions in performing the task; Task performance request not properly understood
Work Environment | Task Interference and consequences | Work environment not conducive for the demands of the task; Unrealistic task and performance expectation for this task
Skills | Complexity and applicability | Not having enough experience with similar tasks; No vendor training provided for new product and/or service
Testing Practices | Procedures and requirements | Poor risk analysis and decision pressure during testing; Not all aspects tested and the test was incomplete
Personal | Aptitude and Attitude | Inadequate problem solving ability for this type of task; Incumbent does not follow instructions or Standard Procedure
Testing the Hypothesis
The decision-making process is too cumbersome to allow for own initiative, so the staff member must choose among given alternatives that are not optimal for the situation.
The job incumbent did not get the support needed to do the job under pressure, which added to task interference.
External vendor support for certain technical decisions was not available, which resulted in a less-than-optimal decision.
Final Conclusion and Action Plan:
1.
2.
3.
✗
Additional Resources
“SOLVE IT”: Find a way to solve incidents quickly, accurately and permanently.