

© Erik Hollnagel 2006

Achieving System Safety by Resilience Engineering

Erik Hollnagel

Industrial Safety Chair, École des Mines de Paris, France

E-mail: [email protected]

University of Linköping, Sweden

E-mail: [email protected]


Accidents, incidents

Safety as a non-event

Daily operation

(Status quo)

Unexpected event

Unwanted outcome

Prevention of unwanted events

Protection against unwanted outcomes

SAFE SYSTEM = NOTHING UNWANTED HAPPENS

Reduce likelihood.

Reduce consequences.

Safety management must prevent/protect against both KNOWN and UNKNOWN risks.

Safety management requires THINKING about how accidents can HAPPEN.



Looking at the past: What has happened? (Accident model)

Looking into the future: What may happen? (Risk model)

Accident model: Simple linear. Risk model: Component failures.

Accident model: Complex linear. Risk model: Combination of failures and degraded defences.

Accident model: Non-linear*. Risk model: Performance variability coincidences.

* Outcomes are not proportional to inputs, and cannot be derived from a simple combination of inputs.


Simple, linear cause-effect model

Assumption: Accidents are the (natural) culmination of a series of events or circumstances, which occur in a specific and recognisable order.

Consequence: Accidents are prevented by finding and eliminating possible causes. Safety is ensured by improving the organisation’s ability to respond.

Domino model (Heinrich, 1930)

Hazards-risks: Due to component failures (technical, human, organisational), hence looking for failure probabilities (event tree, PRA/HRA).
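
As a rough illustration of what "looking for failure probabilities" means in an event tree, the sketch below multiplies the frequency of an initiating event by the failure probabilities of the barriers along one path. The function name, the numbers, and the independence assumption are illustrative only and are not taken from the slides.

```python
from math import prod


def sequence_probability(initiating_frequency, barrier_failure_probs):
    """Probability of one event-tree path in which every listed barrier fails.

    Assumes (for illustration) that the barrier failures are independent,
    so their probabilities can simply be multiplied.
    """
    return initiating_frequency * prod(barrier_failure_probs)


# Hypothetical numbers: an initiating event with probability 0.1 per year,
# followed by two barriers that fail with probability 0.01 and 0.05.
p_accident = sequence_probability(0.1, [0.01, 0.05])
print(f"Probability of the accident sequence: {p_accident:.1e} per year")
```

The simple linear view stands or falls with the independence assumption in this calculation; the non-linear view questions exactly that.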



Complex, linear cause-effect model

Assumption: Accidents result from a combination of active failures (unsafe acts) and latent conditions (hazards).

Consequence: Accidents are prevented by strengthening barriers and defences. Safety is ensured by measuring/sampling performance indicators.

Swiss cheese model (Reason, 1990)

Hazards-risks: Due to degradation of components (organisational, human, technical), hence looking for drift, degradation and weaknesses.


Non-linear accident model

Assumption: Accidents result from unexpected combinations (resonance) of normal performance variability.

Consequence: Accidents are prevented by monitoring and damping variability. Safety requires constant ability to anticipate future events.

Functional Resonance Accident Model

Hazards-risks: Emerge from combinations of normal variability (socio-technical system), hence looking for ETTO* and sacrificing decisions.

[Figure: Functional Resonance Accident Model diagram. Each function is drawn as a hexagon with six aspects: Input (I), Precondition (P), Control (C), Output (O), Resource (R), Time (T). Functions shown: Certification, Lubrication, Maintenance oversight, Horizontal stabilizer movement, Jackscrew up-down movement, Aircraft design, Aircraft pitch control, Limiting stabilizer movement, End-play checking, Jackscrew replacement. Connection labels: FAA, Mechanics, High workload, Grease, Interval approvals, Expertise, Equipment, Procedures, Controlled stabilizer movement, Limited stabilizer movement, Aircraft design knowledge, Redundant design, Aircraft, Lubrication, Allowable end-play, Excessive end-play.]

* ETTO = Efficiency-Thoroughness Trade-Off



Safety management and control

[Figure: feedback control loop. Labels: Setpoint, comparator (+/-), Controller and actuating device, Process, Disturbance, Output, Sensor.]

The purpose of safety management is to ensure that nothing unwanted happens.

An SMS must therefore be able to control a dynamic process or organisation to ensure that performance remains within predetermined safety limits.

Key concepts:

Process model (nature of activity)

Measurements (performance indicators, output)

Possibilities for control (means of intervention)

Nature of threats (disturbances, noise)
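
The control loop described above can be sketched as a small simulation. This is a minimal illustration, assuming a proportional controller and a bounded random disturbance; the function name, the gain, and all numeric values are hypothetical and not part of the presentation.

```python
import random


def simulate_feedback(setpoint=1.0, gain=0.5, steps=50, noise=0.05, seed=0):
    """Keep a process output near a setpoint using proportional feedback.

    Each step: the sensor reads the output, the comparator computes the
    deviation from the setpoint, the controller applies a correction
    proportional to that deviation, and a disturbance perturbs the process.
    """
    rng = random.Random(seed)
    output = 0.0
    history = []
    for _ in range(steps):
        measured = output                          # sensor (performance indicator)
        error = setpoint - measured                # comparator (+/-)
        action = gain * error                      # controller and actuating device
        disturbance = rng.uniform(-noise, noise)   # threat / noise
        output = output + action + disturbance     # process
        history.append(output)
    return history


controlled = simulate_feedback()
uncontrolled = simulate_feedback(gain=0.0)
print(f"with feedback:    final output {controlled[-1]:.3f}")
print(f"without feedback: final output {uncontrolled[-1]:.3f}")
```

The four key concepts map onto the code: the update rule is the process model, `measured` is the measurement, `action` is the means of intervention, and `disturbance` is the threat.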


Safety management as feedback control

Process (internal variability)

Environment (external variability)

Safety Management System

Required safety level

Performance

Accident model: simple linear, complex linear, non-linear

Reporting threshold

How can changes be brought about? What are the control options/tools?

Delays in effects? Delays in feedback?

Nature of threats: regular, irregular, unexampled

Performance indicators



Knowing what may happen

There is an infinite number of ways in which something can go wrong. The problem is to find those that are unlikely yet potentially serious.

[Figure: Probability (p) plotted against Consequence, with a Known (safe) region and an Unknown (unsafe) region.]

Requisite imagination: Where is the cut-off point?

Murphy’s law: “Everything that can go wrong sooner or later will go wrong.”

“If there’s more than one way to do a job and one of those ways will end in disaster, then somebody will do it that way.”


Regular threats (Westrum, 2006)

Events that occur so often that the organisation can learn how to respond.

Medication errors that only affect a single patient.

Transportation accidents (collision between vehicles).

Process or component failure (loss of mass, loss of energy).

Regular threats are covered by standard methods (HAZOP, Fault Trees, FMECA, etc.).

Solutions can be based on standard responses, typically elimination or barriers.

Their likelihood and severity (cost) are so high that they must be dealt with.

[Figure: probability (p) against cost, with p = 0.01 marked.]



Irregular threats (Westrum, 2006)

One-off (singular) events, but so many, so rare, and so different that a standard response is impossible.

Apollo 13 moon mission accident.

Epidemics (BSE, H5N1).

Simultaneous loss of main and back-up systems.

Irregular threats are imaginable but usually completely unexpected. They are discounted by standard methods.

Solutions require interaction and improvisation. Standard responses are insufficient.

Their likelihood is so low that defences are not cost effective, even if consequences are serious.

[Figure: probability (p) against cost, with p = 0.01 marked.]


Unexampled events (Westrum, 2006)

Events that are virtually impossible to imagine and which exceed the organisation’s collective experience.

Chernobyl.

New Orleans flooding (2005).

Attack on the WTC (9/11).

Even when unexampled events are imaginable, they are normally discounted as impossible.

Solutions require the ability to cope, i.e., dynamically to self-organize, formulate and monitor responses.

Their likelihood is so low that defences are not viable, even if consequences are catastrophic.

[Figure: probability (p) against cost, with p = 0.01 marked.]



Reactive organisation

Accident

Surprise! Scrambling for action

Activate ready-made plans

Safety planning

Preparing for regular threats

Accident


Interactive (attentive) organisation

Accident

Evaluation, learning


irregular threats

Situation assessment, quick replanning

Occasional health checks using pre-defined indicators

Prepared and alert

Looking for expected situations



Proactive (resilient) organisation

Accident

Alert and observant.

Situation assessment, reorganisation

Constantly self-critical and inquisitive

Evaluation, learning


unexampled events

Alternative ways of functioning


Some examples

Type of organisation: Examples

Reactive (brittle, no resilience): Mont Blanc Tunnel fire (March 26 1999); Swedish government after Tsunami (December 26 2004); Homeland Security and FEMA after Hurricane Katrina (August 29 2005)

Interactive (robust, partial resilience): The aviation industry; Nuclear power plants; Hospitals

Proactive (full resilience): Toyota (as innovative manufacturer); People of London after bombing, July 7 2005; Israeli hospitals (bus bombings)



Success and failure

Failure is normally explained as a breakdown or malfunctioning of a system and/or its components.

This view assumes that success and failure are of a fundamentally different nature.

Individuals and organisations must adjust to the current conditions in everything they do. Because information, resources and time are finite, such adjustments will always be approximate.

Success is due to the ability of organisations, groups and individuals to make these adjustments correctly, in particular to anticipate failures before they occur.

Failure is due to the absence of that ability, either temporarily or permanently.

Safety must encompass strengthening this ability, rather than just avoiding or eliminating failures.


“Surprises” and responses

Type of organisation: Reactive; Interactive (attentive); Proactive (resilient)

Organisation’s view on “surprises”

Disturbances, or disrupting events, which challenge the proper functioning of a process.

Exceptions that must be regimented.

Uncertainty about the future.

A need constantly to update definitions of the difference between success and failure.

A recognition that models and plans are likely to be incomplete or wrong, despite best efforts.

Focus of organisation’s response

Try to keep process under control and ensure people do not exceed given ‘limits.’

Improve ability to detect and to respond when challenged. Prepare routines and plans.

Identify the variability that the organisation should be aware of; ensure ability to cope with these variations.

Search for the boundaries of own assessments in order to learn and revise.



From reactive to proactive control

[Figure: control loop combining anticipatory control (feedforward) and compensatory control (feedback). Labels: Target state (setpoint), comparator (+/-), Process, Sensor, Disturbance, Output.]

The main tool for looking ahead should NOT be to look back

You cannot drive a car by looking in the rear-view mirror!
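
The difference between compensatory and anticipatory control can be illustrated with a small sketch. It assumes a disturbance that is partly or fully predictable; the function name `worst_deviation`, the `forecast` parameter, and the numbers are hypothetical choices for illustration, not something taken from the slides.

```python
def worst_deviation(disturbances, forecast=None, gain=0.5, setpoint=0.0):
    """Return the largest deviation from the setpoint over a run.

    Feedback (compensatory) control reacts to a deviation that has already
    occurred. Feedforward (anticipatory) control acts on a forecast of the
    disturbance before it shows up in the output.
    """
    output = setpoint
    worst = 0.0
    for t, d in enumerate(disturbances):
        feedback = gain * (setpoint - output)
        feedforward = -forecast[t] if forecast is not None else 0.0
        output = output + feedback + feedforward + d
        worst = max(worst, abs(output - setpoint))
    return worst


# A disturbance that switches on halfway through the run.
disturbances = [0.0] * 5 + [1.0] * 5

feedback_only = worst_deviation(disturbances)
with_anticipation = worst_deviation(disturbances, forecast=disturbances)
with_partial_forecast = worst_deviation(
    disturbances, forecast=[0.8 * d for d in disturbances]
)
print(feedback_only, with_partial_forecast, with_anticipation)
```

With feedback alone the output has to drift before anything is corrected; with a forecast, even an imperfect one, the deviation is smaller. The quality of the forecast is the limiting factor, which is why anticipation cannot rest on looking back alone.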


SMS as feedforward control

Environment (external variability)

Anticipation (irregularities, disturbances, threats)

Process (internal variability)

Safety Management System

Safety values and targets

Performance

Accident model: simple linear, complex linear, non-linear

Reporting threshold

How can changes be brought about? What are the control options/tools?

Delays in effects? Delays in feedback?

Nature of threats: regular, irregular, unexampled

Performance indicators

Customers, regulators, …


situation / context



Components of resilience

Anticipation: knowing what to expect

Attention: knowing what to look for

Response: knowing what to do (rational response)

Dynamic developments

Updating

Learning

Knowledge, Competence, Resources


Resilience and safety management

Resilience is the intrinsic ability of an organisation to keep or recover a stable state, thereby allowing it to continue operations after a major mishap or in the presence of continuous stress.

A practice of Resilience Engineering must comprise the following critical components:

Techniques to model and predict the short- and long-term effects of change and decisions on risk.

Tools and methods to improve an organisation’s resilience vis-à-vis the environment.

Ways to analyse, measure and monitor the resilience of organisations in their operating environment.



If you want to know more about RE ...
