© Erik Hollnagel 2006
Achieving System Safety by Resilience Engineering
Erik Hollnagel
Industrial Safety Chair, École des Mines de Paris, France
E-mail: [email protected]
University of Linköping, Sweden
E-mail: [email protected]
Accidents, incidents
Safety as a non-event
Daily operation
(Status quo)
Unwanted outcome
Unexpected event
Prevention of unwanted events
Protection against unwanted outcomes
SAFE SYSTEM = NOTHING UNWANTED HAPPENS
Reduce likelihood.
Reduce consequences.
Safety management must prevent/protect against both KNOWN and UNKNOWN risks.
Safety management requires THINKING about how accidents can HAPPEN.
Looking at the past: What has happened?
Looking into the future: What may happen?

Accident model      Risk model
Simple linear       Component failures
Complex linear      Combination of failures and degraded defences
Non-linear*         Performance variability coincidences

* Outcomes are not proportional to inputs, and cannot be derived from a simple combination of inputs.
Simple, linear cause-effect model
Assumption: Accidents are the (natural) culmination of a series of events or circumstances, which occur in a specific and recognisable order.
Consequence: Accidents are prevented by finding and eliminating possible causes. Safety is ensured by improving the organisation’s ability to respond.
Domino model (Heinrich, 1930)
Hazards-risks: Due to component failures (technical, human, organisational), hence looking for failure probabilities (event tree, PRA/HRA).
Complex, linear cause-effect model
Assumption: Accidents result from a combination of active failures (unsafe acts) and latent conditions (hazards).
Consequence: Accidents are prevented by strengthening barriers and defences. Safety is ensured by measuring/sampling performance indicators.
Swiss cheese model (Reason, 1990)
Hazards-risks: Due to degradation of components (organisational, human, technical), hence looking for drift, degradation and weaknesses.
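One way to make the barrier reasoning concrete is a small calculation. The sketch below is not taken from the slides: it assumes that barriers fail independently, and the barrier names and probabilities are hypothetical. Under that assumption, a hazard becomes an accident only if it passes every barrier, so degrading a single defence raises the overall breach probability.

```python
from math import prod

def breach_probability(hole_probabilities):
    """Probability that a hazard passes every barrier.

    Assumes the barriers fail independently, which is a simplification.
    """
    return prod(hole_probabilities)

# Hypothetical per-barrier failure probabilities (illustrative only).
barriers = {"design": 0.1, "procedures": 0.2, "supervision": 0.5}
baseline = breach_probability(barriers.values())

# A latent condition degrades one defence: procedures drift from 0.2 to 0.8.
degraded = dict(barriers, procedures=0.8)
after_drift = breach_probability(degraded.values())

print(f"baseline breach probability: {baseline:.3f}")
print(f"after drift in one barrier:  {after_drift:.3f}")
```

Looking for drift and degradation, as the slide suggests, corresponds here to watching the individual barrier values rather than only the final outcome.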
Non-linear accident model
Assumption: Accidents result from unexpected combinations (resonance) of normal performance variability.
Consequence: Accidents are prevented by monitoring and damping variability. Safety requires constant ability to anticipate future events.
Hazards-risks: Emerge from combinations of normal variability (socio-technical system), hence looking for ETTO* and sacrificing decisions.
Functional Resonance Accident Model
[Figure: FRAM diagram of a horizontal stabilizer jackscrew example. Each function is drawn as a hexagon with six aspects: I (Input), P (Precondition), C (Control), O (Output), R (Resource), T (Time).
Functions: Certification; Lubrication; Maintenance oversight; Horizontal stabilizer movement; Jackscrew up-down movement; Aircraft design; Aircraft pitch control; Limiting stabilizer movement; End-play checking; Jackscrew replacement.
Labels: FAA; Mechanics; High workload; Grease; Interval approvals; Expertise; Equipment; Procedures; Aircraft; Lubrication; Aircraft design knowledge; Redundant design; Controlled stabilizer movement; Limited stabilizer movement; Allowable end-play; Excessive end-play.]
* ETTO = Efficiency-Thoroughness Trade-Off
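The letters I, P, C, O, R and T in the FRAM diagram are the six aspects of a function: Input, Precondition, Control, Output, Resource and Time. A minimal sketch of how such functions and their couplings could be represented follows. The class and helper names are hypothetical, and although the labels are borrowed from the figure, the connections between them are assumptions made for illustration, not a reading of the original diagram.

```python
from dataclasses import dataclass, field

# Aspects that can receive another function's output.
RECEIVING_ASPECTS = ("input", "precondition", "control", "resource", "time")

@dataclass
class FramFunction:
    """One hexagon: a named function with its six aspects."""
    name: str
    input: set = field(default_factory=set)
    precondition: set = field(default_factory=set)
    control: set = field(default_factory=set)
    output: set = field(default_factory=set)
    resource: set = field(default_factory=set)
    time: set = field(default_factory=set)

def couplings(functions):
    """List (upstream, item, downstream, aspect) where an output feeds another function."""
    links = []
    for up in functions:
        for down in functions:
            if up is down:
                continue
            for aspect in RECEIVING_ASPECTS:
                for item in sorted(up.output & getattr(down, aspect)):
                    links.append((up.name, item, down.name, aspect))
    return links

# Hypothetical wiring: labels come from the figure, the connections are assumed.
checking = FramFunction("End-play checking", output={"Excessive end-play"}, resource={"Equipment"})
replacement = FramFunction("Jackscrew replacement", input={"Excessive end-play"})
oversight = FramFunction("Maintenance oversight", output={"Interval approvals"})
lubrication = FramFunction("Lubrication", control={"Interval approvals"}, resource={"Grease"})

links = couplings([checking, replacement, oversight, lubrication])
for link in links:
    print(link)
```

Representing couplings explicitly makes it possible to ask where variability in one function's output could propagate, which is the kind of question the non-linear model raises.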
Safety management and control
[Figure: feedback control loop. Elements: Setpoint; comparison (+/-); Controller and actuating device; Process; Disturbance; Output; Sensor.]
The purpose of safety management is to ensure that nothing unwanted happens.
An SMS must therefore be able to control a dynamic process or organisation to ensure that performance remains within predetermined safety limits.
Key concepts:
- Process model (nature of activity)
- Measurements (performance indicators, output)
- Possibilities for control (means of intervention)
- Nature of threats (disturbances, noise)
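The control loop can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a model of any real SMS: the process simply accumulates corrections and disturbances, the sensor is ideal, the controller is proportional, and the function name, gain and numbers are all hypothetical.

```python
def simulate_feedback(setpoint, disturbances, gain=0.5, initial_output=0.0):
    """Run a proportional feedback loop and return the output after each step."""
    output = initial_output
    history = []
    for disturbance in disturbances:
        measured = output                  # sensor (assumed ideal, no noise)
        error = setpoint - measured        # comparison point (+/-)
        correction = gain * error          # controller and actuating device
        output = output + correction + disturbance  # process, hit by the disturbance
        history.append(output)
    return history

# A single disturbance of 0.4 arrives at the sixth step.
disturbances = [0.0] * 5 + [0.4] + [0.0] * 10
history = simulate_feedback(setpoint=1.0, disturbances=disturbances)

for step, value in enumerate(history, start=1):
    print(f"step {step:2d}: output = {value:.4f}")
```

The four key concepts map onto the sketch directly: the update rule is the process model, `measured` is the measurement, `correction` is the means of intervention, and `disturbances` stands in for the threats.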
Safety management as feedback control
[Figure: feedback loop. Elements: Required safety level; Safety Management System; Process (internal variability); Environment (external variability); Performance; Performance indicators; Reporting threshold.]
Accident model:
- simple linear
- complex linear
- non-linear
How can changes be brought about? What are the control options/tools?
Delays in effects? Delays in feedback?
Nature of threats:
- regular
- irregular
- unexampled
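The questions about the reporting threshold and about delays can be explored with a variation of a simple feedback loop. The sketch below rests on assumptions that are not in the slides: deviations smaller than the threshold are never reported, reports arrive after a fixed number of steps, and the function name, gain and numbers are hypothetical.

```python
from collections import deque

def simulate_sms(required_level, disturbances, threshold=0.0, delay=0, gain=0.5):
    """Feedback loop in which small deviations go unreported and reports arrive late."""
    performance = required_level
    in_transit = deque([0.0] * delay)      # reports that have not reached the SMS yet
    history = []
    for disturbance in disturbances:
        deviation = required_level - performance
        # Deviations below the reporting threshold are invisible to the SMS.
        report = deviation if abs(deviation) >= threshold else 0.0
        in_transit.append(report)
        delivered = in_transit.popleft()   # the oldest report is acted upon
        performance = performance + gain * delivered + disturbance
        history.append(performance)
    return history

# A slow drift: performance erodes by 0.02 at every step.
drift = [-0.02] * 10
unfiltered = simulate_sms(1.0, drift, threshold=0.0)
filtered = simulate_sms(1.0, drift, threshold=0.1)

print(f"lowest performance, every deviation reported: {min(unfiltered):.3f}")
print(f"lowest performance, threshold of 0.1:         {min(filtered):.3f}")
```

In this toy setting a high reporting threshold lets a slow drift go unnoticed until the deviation is already large. Passing a non-zero `delay` allows the same experiment for delayed feedback.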
Knowing what may happen
There is an infinite number of ways in which something can go wrong. The problem is to find those that are unlikely yet potentially serious.
[Figure: plot of Probability (p) against Consequence, divided into a Known (safe) region and an Unknown (unsafe) region.]
Requisite imagination: Where is the cut-off point?
Murphy’s law: “Everything that can go wrong sooner or later will go wrong.”
“If there’s more than one way to do a job and one of those ways will end in disaster, then somebody will do it that way.”
Regular threats (Westrum, 2006)
Events that occur so often that the organisation can learn how to respond.
Examples:
- Medication errors that only affect a single patient.
- Transportation accidents (collision between vehicles).
- Process or component failure (loss of mass, loss of energy).
Regular threats are covered by standard methods (HAZOP, Fault Trees, FMECA, etc.).
Solutions can be based on standard responses, typically elimination or barriers.
Their likelihood and severity (cost) are so high that they must be dealt with.
[Figure: plot of probability (p) against cost, with p = 0.01 marked.]
Irregular threats (Westrum, 2006)
One-off (singular) events, but so many, so rare, and so different that a standard response is impossible.
Examples:
- Apollo 13 moon mission accident.
- Epidemics (BSE, H5N1).
- Simultaneous loss of main and back-up systems.
Irregular threats are imaginable but usually completely unexpected. They are discounted by standard methods.
Solutions require interaction and improvisation. Standard responses are insufficient.
Their likelihood is so low that defences are not cost effective, even if consequences are serious.
[Figure: plot of probability (p) against cost, with p = 0.01 marked.]
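The cost-effectiveness argument can be read as an expected-value comparison. The sketch below is one such reading, not the author's method: it assumes a defence is worthwhile only if the expected loss it avoids (probability times consequence) exceeds its own cost. The function name and every number are hypothetical.

```python
def defence_is_cost_effective(probability, consequence_cost, defence_cost):
    """Compare the expected loss avoided with the cost of the defence.

    Assumes the defence removes the risk entirely, which is a simplification.
    """
    expected_loss_avoided = probability * consequence_cost
    return expected_loss_avoided > defence_cost

# Hypothetical regular threat: frequent, moderately costly.
regular = defence_is_cost_effective(
    probability=0.1, consequence_cost=1_000_000, defence_cost=50_000)

# Hypothetical irregular threat: very rare, far more costly, same defence cost.
irregular = defence_is_cost_effective(
    probability=0.00001, consequence_cost=100_000_000, defence_cost=50_000)

print(f"defence worthwhile for the regular threat:   {regular}")
print(f"defence worthwhile for the irregular threat: {irregular}")
```

Under this reading a rare event with a hundred times the consequence still fails the test, which is why the slide points to interaction and improvisation rather than pre-built defences.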
Unexampled events (Westrum, 2006)
Events that are virtually impossible to imagine and which exceed the organisation’s collective experience.
Examples:
- Chernobyl.
- New Orleans flooding (2005).
- Attack on the WTC (9/11).
Even when unexampled events are imaginable, they are normally discounted as impossible.
Solutions require the ability to cope, i.e., dynamically to self-organize, formulate and monitor responses.
Their likelihood is so low that defences are not viable, even if consequences are catastrophic.
[Figure: plot of probability (p) against cost, with p = 0.01 marked.]
Reactive organisation
Accident
Surprise! Scrambling for action
Activate ready-made plans
Safety planning: Preparing for regular threats
Accident
Interactive (attentive) organisation
Accident
Evaluation, learning
Safety planning: Preparing for irregular threats
Situation assessment, quick replanning
Occasional health checks using pre-defined indicators
Prepared and alert. Looking for expected situations.
Proactive (resilient) organisation
Accident
Alert and observant.
Situation assessment, reorganisation
Constantly self-critical and inquisitive
Evaluation, learning
Safety planning: Preparing for unexampled events
Alternative ways of functioning
Some examples
Type of organisation: Examples
Reactive (brittle, no resilience):
- Mont Blanc Tunnel fire (March 26 1999)
- Swedish government after Tsunami (December 26 2004)
- Homeland Security and FEMA after Hurricane Katrina (August 29 2005)
Interactive (robust, partial resilience):
- The aviation industry
- Nuclear power plants
- Hospitals
Proactive (full resilience):
- Toyota (as innovative manufacturer)
- People of London after bombing, July 7 2005
- Israeli hospitals (bus bombings)
Success and failure
Failure is normally explained as a breakdown or malfunctioning of a system and/or its components. This view assumes that success and failure are of a fundamentally different nature.
Individuals and organisations must adjust to the current conditions in everything they do. Because information, resources and time are finite, such adjustments will always be approximate.
Success is due to the ability of organisations, groups and individuals to make these adjustments correctly, in particular to anticipate failures before they occur.
Failure is due to the absence of that ability, either temporarily or permanently.
Safety must encompass strengthening this ability, rather than just avoiding or eliminating failures.
“Surprises” and responses
Reactive
- Organisation’s view on “surprises”: Exceptions that must be regimented.
- Focus of organisation’s response: Try to keep process under control and ensure people do not exceed given ‘limits.’
Interactive (attentive)
- Organisation’s view on “surprises”: Disturbances, or disrupting events, which challenge the proper functioning of a process.
- Focus of organisation’s response: Improve ability to detect and to respond when challenged. Prepare routines and plans.
Proactive (resilient)
- Organisation’s view on “surprises”: Uncertainty about the future. A need constantly to update definitions of the difference between success and failure. A recognition that models and plans are likely to be incomplete or wrong, despite best efforts.
- Focus of organisation’s response: Identify the variability that the organisation should be aware of; ensure ability to cope with these variations. Search for the boundaries of own assessments in order to learn and revise.
From reactive to proactive control
[Figure: control loop combining anticipatory control (feedforward) and compensatory control (feedback). Elements: Target state (setpoint); comparison (+/-); Process; Sensor; Disturbance; Output.]
The main tool for looking ahead should NOT be to look back.
You cannot drive a car by looking in the rear-view mirror!
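The difference between compensatory and anticipatory control can be shown with a small simulation. The sketch is illustrative only and rests on assumptions not found in the slides: the process simply accumulates corrections and disturbances, feedback is proportional, feedforward subtracts a forecast of the disturbance, and all names and numbers are hypothetical.

```python
def worst_deviation(setpoint, disturbances, forecast=None, gain=0.5):
    """Return the largest deviation from the setpoint over the run."""
    output = setpoint
    worst = 0.0
    for step, disturbance in enumerate(disturbances):
        # Compensatory control: reacts to a deviation that has already occurred.
        feedback = gain * (setpoint - output)
        # Anticipatory control: acts on what is expected to happen in this step.
        feedforward = -forecast[step] if forecast is not None else 0.0
        output = output + feedback + feedforward + disturbance
        worst = max(worst, abs(setpoint - output))
    return worst

disturbances = [0.0, 0.0, 0.4, 0.0, 0.0]

feedback_only = worst_deviation(1.0, disturbances)
perfect_forecast = worst_deviation(1.0, disturbances, forecast=[0.0, 0.0, 0.4, 0.0, 0.0])
rough_forecast = worst_deviation(1.0, disturbances, forecast=[0.0, 0.0, 0.3, 0.0, 0.0])

print(f"feedback only:    {feedback_only:.2f}")
print(f"perfect forecast: {perfect_forecast:.2f}")
print(f"rough forecast:   {rough_forecast:.2f}")
```

Feedback alone can only respond after the disturbance has taken effect. Even an imperfect forecast reduces the deviation here, and the feedback term then handles the part that anticipation missed.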
SMS as feedforward control
[Figure: the safety management loop extended with anticipation. Elements: Safety values and targets; Safety Management System; Process (internal variability); Environment (external variability); Anticipation (irregularities, disturbances, threats); Performance; Performance indicators; Reporting threshold; Customers, regulators, …]
Accident model:
- simple linear
- complex linear
- non-linear
How can changes be brought about? What are the control options/tools?
Delays in effects? Delays in feedback?
Nature of threats:
- regular
- irregular
- unexampled
Components of resilience
- Anticipation: Knowing what to expect
- Attention: Knowing what to look for
- Response: Knowing what to do (rational response)
Dynamic developments
Updating
Learning
Knowledge, Competence, Resources
Resilience and safety management
Resilience is the intrinsic ability of an organisation to keep or recover a stable state, thereby allowing it to continue operations after a major mishap or in the presence of continuous stress.
A practice of Resilience Engineering must comprise the following critical components:
- Techniques to model and predict the short- and long-term effects of change and decisions on risk.
- Tools and methods to improve an organisation’s resilience vis-à-vis the environment.
- Ways to analyse, measure and monitor the resilience of organisations in their operating environment.
If you want to know more about RE ...