Information Technology - Discover the Root Cause and Develop a solution through structured processes

John Hudson & Matt Fourie 5 November 2012 Go Direct to the Root Cause – itRCA the solution?


The presentation was compiled by Thinking Dimensions Global in November 2012 for the ITSMF conference held in London. The content relates to the KEPNERandFOURIE process for dealing with incidents and problems in IT, and in particular to a means of determining the root cause and providing the best solution. The presentation was co-presented by Dr Matthys Fourie and John Hudson of Thinking Dimensions Global.

Transcript of Information Technology - Discover the Root Cause and Develop a solution through structured processes

Page 1: Information Technology - Discover the Root Cause and Develop a solution through structured processes

John Hudson & Matt Fourie, 5 November 2012

Go Direct to the Root Cause – itRCA the solution?

Page 2

Agenda

• Introduction
• Current situation
• Components of a credible approach
• Minimalistic information, being specific and knowledge (wisdom) creation
• The three critical investigation skills
  1. Service Recovery Analysis
  2. Technical Cause Analysis
  3. Root Cause Analysis
• Client outcomes
• Questions & answers

“Most incident investigators ask the wrong questions, so do not change your people but change the questions they are asking”

Matt Fourie

Page 3

Thinking Dimensions

• Thinking Dimensions International - operating KEPNERandFOURIE company initiatives for the last 25 years
• Specialising in RCA Methodology for IT Incident and Problem Management

Some of our recent clients...
• Barclays IT
• ANZ IT Division
• Macquarie ITG
• Unisys
• Polypore IT
• Medtronic IT
• SITA Global
• BT Financial
• Westpac IT
• McDonalds IT
• Queensland Police IT
• Lockheed Martin Space Systems
• SPARQ IT

Page 4

Global Presence

Americas: • Canada • Chile • Peru • USA

EMEA: • Germany • Italy • Netherlands • Poland • Saudi Arabia • South Africa • Spain • Turkey • United Kingdom

Asia Pacific: • Australia • China • India • South Korea • Thailand • Singapore

• Baxter International • Blue Cross Blue Shield • Bosch • Caltex Oil • Carraro • Crown Cork and Seal • Dometic • Electrolux • Federal Judiciary Center • General Dynamics IT • Hollister, Inc • Infineon • BASF • Macquarie Bank IT • BT Financial IT • Stihl • Westpac IT • Maersk • Norfolk Naval Shipyard • Selig • Siemens • SITA • SKF

Page 5

The Current Dilemma

PAST | NOW | FUTURE

itTCA® – TECHNICAL CAUSE ANALYSIS

itSRA® – SERVICE RECOVERY ANALYSIS

itRCA® – ROOT CAUSE ANALYSIS

STANDARD

DEVIATION

Page 6

The Three Skills…

Incident

1. itSRA® (Service Recovery Analysis): Recovery & Containment Tools & Templates
2. itTCA® (Technical Cause Analysis): Technical Cause Process and Techniques
3. itRCA® (Root Cause Analysis): Root Cause & FIX Checklist & Templates

Page 7

Current Default Root Causes

• Hardware

• Software

• “Human Error”

• Environment

Technical Cause

Root Cause

Page 14

Incisive Thinking

Incident Statement | Technical Cause | Root Cause
Internet Banking Degrading | New browser configuration issue | Integrative testing not done properly
Encrypted “hello” message not returned | ‘Beta’ Certificate used | Policy requirements for “production” environment not adhered to

Page 20

Incisive Thinking

Incident Statement | Technical Cause | Root Cause
G-Force System Freezing | High volume | Too many users allowed access
G-Force SQL DB thread count exceeding maximum | G-Force program not closing out threads | Vendor implemented an untested program update

Page 22

Basic phases of problem solving

Divergent Thinking

Convergent Thinking

Procedure for addressing an Incident

1. State the purpose
2. Gather incident/problem detail
3. Evaluate for causes
4. Confirm technical/root cause
   1. Testing
   2. Verifying cause

Factual information gathering

Intuitive analysis of own suggestions

Page 23

Good RCA…

YOU NEED TO SOLVE AN INCIDENT:

• QUICKLY [Service Recovery]

• ACCURATELY [Technical Cause]

• PERMANENTLY [Root Cause]

Page 24

Factors in minimalistic approach

Factor | IS | BUT NOT
What
Where
When
How
Why
Who

I keep six honest serving-men
(They taught me all I knew);
Their names are What and Why and When
And How and Where and Who.
I send them over land and sea,
I send them east and west;
But after they have worked for me,
I give them all a rest.

Rudyard Kipling

Page 25

Extreme Focus With “Specificity”

Specificity Rules
• One object one fault
• Single-minded & simplistic
• Highly focused
• Must find the correct entry point
• Ask a question – expect an answer

Object | Fault
Servers | Not communicating

“The key to success is to be insistent about specificity – the more specific you are the better your chances to solve an incident.”

KEPNERandFOURIE

Page 29

Extreme Focus With “Specificity”

Specificity Rules
• One object one fault
• Single-minded & simplistic
• Highly focused
• Must find the correct entry point
• Ask a question – expect an answer

Object | Fault
Servers | Not communicating
Data | Not transferred. Sent but not received by receiving servers
Data for Large Outlets | Not received
Sales turnover numbers for Large Outlets | Not received

Page 30

Creating Intelligence

DATA (IS) | INFORMATION (BUT NOT) | KNOWLEDGE (WHY NOT)
Internet Banking | Intranet Banking | Different routing, SSL handshake
Slow | Freezing | Volume?
APAC users | USA, UK | ADSL lines
Started Oct 1 | Before | New passwords
Continuous | After 4pm | Different routing

Unexpected Outcomes

• “BUT NOT” clarifies the facts
• Creates a curious “contrast”
• Looking at answers at a “granular level”
• Stimulates deductive reasoning
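The IS / BUT NOT / WHY NOT contrast on this slide can be pictured as a small data structure. The following is a minimal Python sketch, not part of the original presentation: the `Contrast` class and `contrast_question` helper are hypothetical names, and the pairing of entries into rows is assumed from the order in which they appear on the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Contrast:
    """One row of an IS / BUT NOT / WHY NOT table (hypothetical structure)."""
    is_fact: str                 # DATA: what is observed
    but_not: str                 # INFORMATION: what could be observed but is not
    why_not: List[str] = field(default_factory=list)  # KNOWLEDGE: ideas explaining the contrast


def contrast_question(row: Contrast) -> str:
    """Phrase the question that an IS / BUT NOT pair provokes."""
    return f"Why '{row.is_fact}' but not '{row.but_not}'?"


# Row pairing is an assumption based on the slide layout.
rows = [
    Contrast("Internet Banking", "Intranet Banking", ["Different routing", "SSL handshake"]),
    Contrast("Slow", "Freezing", ["Volume?"]),
    Contrast("APAC users", "USA, UK", ["ADSL lines"]),
    Contrast("Started Oct 1", "Before", ["New passwords"]),
    Contrast("Continuous", "After 4pm", ["Different routing"]),
]

for row in rows:
    # Each contrast prompts a question; the WHY NOT ideas are candidate answers.
    print(contrast_question(row), "->", ", ".join(row.why_not))
```

Keeping the three columns together per row makes it harder to record an IS fact without also asking what it is being contrasted with.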

Page 32

Service Recovery [MTR]

FACTOR | IS | BUT NOT
OBJECT | Mobile website access | PC website access
FAULT | Denied – not authorized | Slow/freezing
WHO | Blackberry users | Other smart phones
WHERE | Asia | ANZ, UK, USA
IMPACT | Customer complaints |
PATTERN | Sporadic | Continuous

REQUIREMENT | ACTIONS TO CONSIDER
WHAT TO RESTORE |
WHAT PROBLEMS TO REMOVE |
WHO |
WHERE |
TO WHAT EXTENT |
FOR HOW LONG |

Page 33

Service Recovery [MTR]

Statement: Restore website access to customers

Possible Actions:
1. Upload or switch on simple site maintenance page
2. Set up or start up back up service
3. Reroute 20/80 service all to back up service
4. Restrict access to low load tasks only
5. Allow access based on region

Key Solution Requirements | Action 1 | Action 2 | Action 3 | Action 4 | Action 5
1. Provide access to client to at least receive interim non-availability notice | 0 | 3 | 2 | 1 | 3
2. No loss of Data | 3 | 3 | 0 | 0 | 1
3. Should not impact System Performance | 1 | 0 | 3 | 1 | 0
4. ADSL compatible for Asia | 1 | 2 | 0 | 0 | 0
5. Improve reliability | 3 | 0 | 3 | 1 | 1
6. Implementation within the hour | 1 | 3 | 3 | 1 | 2
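The scoring grid on this slide can be tallied per action. The sketch below is illustrative only: the slide gives the scores but does not say how they are combined, so the unweighted column sum, the reading of higher scores as a better fit, and the helper names `action_totals` and `best_actions` are all assumptions.

```python
from typing import List

actions = [
    "Upload or switch on simple site maintenance page",
    "Set up or start up back up service",
    "Reroute 20/80 service all to back up service",
    "Restrict access to low load tasks only",
    "Allow access based on region",
]

# One row per key solution requirement, one column per action (scores from the slide).
scores = [
    [0, 3, 2, 1, 3],  # 1. Interim non-availability notice
    [3, 3, 0, 0, 1],  # 2. No loss of data
    [1, 0, 3, 1, 0],  # 3. Should not impact system performance
    [1, 2, 0, 0, 0],  # 4. ADSL compatible for Asia
    [3, 0, 3, 1, 1],  # 5. Improve reliability
    [1, 3, 3, 1, 2],  # 6. Implementation within the hour
]


def action_totals(grid: List[List[int]]) -> List[int]:
    """Sum each action's column (assumed: unweighted, higher is better)."""
    return [sum(column) for column in zip(*grid)]


def best_actions(totals: List[int]) -> List[int]:
    """Return the 1-based numbers of all actions sharing the highest total."""
    top = max(totals)
    return [number for number, total in enumerate(totals, start=1) if total == top]


totals = action_totals(scores)
for number, (name, total) in enumerate(zip(actions, totals), start=1):
    print(f"Action {number}: {total:2d}  {name}")
print("Highest total:", best_actions(totals))
```

Under this assumed unweighted sum, actions 2 and 3 tie on 11, so the totals alone would not separate them; the slides do not state how such a tie is resolved.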

Page 36

Technical Cause Analysis [TCA - MTTR]

DIMENSION | IS | BUT NOT | WHY NOT
OBJECT
FAULT
USERS
WHERE
TIMING
PATTERN
CYCLE

OBJECT – What object and which other object(s) not?

FAULT – What fault and which other typical faults not?

USERS – Who has the problem and who does not?

WHERE – Where are these users and where could they have been but are not?

TIMING – When did it first happen and when not?

PATTERN – What is the pattern of faults and what could it have been but is not?

CYCLE – In which cycle does the problem occur and in which cycle does it not occur?

Page 39

Technical Cause Analysis [TCA]

DIMENSION | IS | BUT NOT | WHY NOT
Object | Fireburst V2.0 connection | E-Express, Mango connections | F/B upgrade from V1 to V2, Poor testing issue
Fault | Dropping | Freezing, slow | Time out settings, configuration of drivers
Location of Object | ANZ, USA, UK | Asia | LAN, Proxy server issues, F/Wall rules
Timing | Monday, Sept 2nd with SOB | Any time earlier than Sept 2nd | Java upgrade, Netscape upgrade
Pattern | Continuous | Sporadic, Periodic | Don’t know
Life Cycle | When doing a transaction | “x” time into transaction | Operator error, Code error on a specific page
Phase of Work | Just after logging in | Logging in or out | OS configuration issue, DNS issue

Possible Causes & Testing

1. Proxy server tampered with during the Java upgrade on the LAN: X
2. Java upgrade caused driver incompatibility with Fireburst website V2.0: √ √ X
3. Netscape upgrade caused driver incompatibility with Fireburst website V2.0: √ √ A1 √ √ √ √

A1 - Only if the staff in Asia did not upgrade to Netscape
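The testing marks on this slide can be read as a simple elimination procedure. This is a minimal Python sketch under stated assumptions: that an X eliminates a cause, that √ means the cause is consistent with the facts checked, and that any other mark (such as A1) is an assumption still to be verified. The names `verdict` and `surviving_causes` are hypothetical.

```python
from typing import Dict, List, Tuple

PASS, FAIL = "√", "X"

# Marks as they appear on the slide, listed per possible cause.
test_marks: Dict[str, List[str]] = {
    "1. Proxy server tampered with during the Java upgrade on the LAN": ["X"],
    "2. Java upgrade caused driver incompatibility with Fireburst website V2.0": ["√", "√", "X"],
    "3. Netscape upgrade caused driver incompatibility with Fireburst website V2.0": [
        "√", "√", "A1", "√", "√", "√", "√",
    ],
}


def verdict(marks: List[str]) -> Tuple[str, List[str]]:
    """Classify a cause: eliminated on any X, otherwise surviving with open assumptions."""
    if FAIL in marks:
        return ("eliminated", [])
    # Anything that is neither a pass nor a fail is treated as an assumption to verify.
    assumptions = [mark for mark in marks if mark != PASS]
    return ("survives", assumptions)


def surviving_causes(table: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each surviving cause to the assumptions that still need verifying."""
    survivors = {}
    for cause, marks in table.items():
        status, assumptions = verdict(marks)
        if status == "survives":
            survivors[cause] = assumptions
    return survivors


for cause, assumptions in surviving_causes(test_marks).items():
    print(cause, "| verify:", assumptions or "nothing outstanding")
```

With the slide's marks, only cause 3 survives, and it carries assumption A1, which is the item left to verify before confirming the technical cause.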

Page 41

A Case of a good thinking process

• Deviation Statement

• Factor Analysis

• Possible causal factors

• Testing the causal hypotheses

• Find the underlying reason(s) for incident

'The truth, if it exists, is in the details'

“Bartlett – Familiar Quotations”

Page 42

The Right Starting Point

• Find the technical cause first

• Do 5 Whys to get to the systemic level

• Find the root cause(s)

• Fix the incident/problem for good

“If a team has not solved an incident, the person with the information was not invited”

Chuck Kepner

Page 43

Four Questions to get Started

• Is the object deviation within the control of your own system? Can you fix the root cause with actions under your control?

• Is the object deviation within the control of your own system? Can you only fix the root cause with the vendor's help?

• Is the technical cause deviation in the vendor's system? Can you only fix the root cause with the vendor's help?

• Is the technical cause deviation in the vendor's system? We would only be able to take avoiding actions.

 

RiskWise

ITRCAMax4

ITRCA

Max4

Page 44

Root Cause Analysis [RCA]

DIMENSION | IS | BUT NOT
APPLICATION
DEVIATION
FUNCTION
WHO
WHERE
TIMING
FREQUENCY

APPLICATION: What application and which other applications not?

DEVIATION: What deviation do we have and which ones not?

FUNCTION: Which job/function/process is involved and which ones not?

USERS: Who has the problem and who does not?

WHERE: Where are these users and where could they have been but are not?

TIMING: When did it first happen and when not?

FREQUENCY: How frequently is the fault occurring?

Page 45

Root Cause Analysis [RCA]

COMPONENT | CAUSAL FACTORS | CAUSAL ELEMENTS
Decision Making | Process and Collaboration for inputs | Critical stakeholder requirements not consulted for this task; Inadequate authority levels for making good decisions
Implementation issues | Resources and Scope & Definition of project | Poor decision process and documentation for this task; Inadequate standards guiding the decision making; Time zone difficulties hampering effective decision making
Standard Operating Procedures | Applicability of SOP and Awareness of SOP | Unrealistic time, cost and performance expectations; Poor initial estimation of resources needed for the project; Poor updated approval data making the procedure unclear
Management | Management of Work and Staff | Poor work guidance/coaching for correct performance; Work standards for this task are not enforced; Poor management support in getting this task done
Measurement | KPIs and Roles & Responsibilities | KPI and metrics regarding this output not clear or absent; Poor feedback on this KPI; Duplication and gaps making roles and responsibilities difficult

Page 46

Root Cause Analysis [RCA], continued

COMPONENT | CAUSAL FACTORS | CAUSAL ELEMENTS
Support | Internal and External Vendor support | Overuse of the SME causing sub-standard work; Poor continual vendor support for this output
Communications | Clarity of communications and instructions | Continual interruptions in performing the task; Task performance request not properly understood
Work Environment | Task Interference and consequences | Work environment not conducive to the demands of the task; Unrealistic task and performance expectation for this task
Skills | Complexity and applicability | Not having enough experience with similar tasks; No vendor training provided for new product and/or service
Testing Practices | Procedures and requirements | Poor risk analysis and decision pressure during testing; Not all aspects tested and the test was incomplete
Personal | Aptitude and Attitude | Inadequate problem solving ability for this type of task; Incumbent does not follow instructions or Standard Procedure

Page 49

Testing the Hypothesis

The decision-making process is too cumbersome to allow for own initiative, so the staff member must make a choice from given alternatives, which is not optimal for the situation.

The job incumbent did not get the necessary support to do his job in a pressure situation, adding to task interference.

External vendor support for certain technical decisions was not available, and that resulted in a less optimized decision choice.

Final Conclusion and Action Plan:

1.

2.

3.

Page 50

Additional Resources

“SOLVE IT” – Find a way to solve incidents quickly, accurately and permanently.