Resilience Engineering and Safety Assessment
Erik Hollnagel
Professor & Industrial Safety Chair
MINES ParisTech — Crisis and Risk Research Centre
Sophia Antipolis, France
E-mail:
[email protected]
© Erik Hollnagel, 2008
Outline of presentation

WHY
Safety and risk come from an engineering tradition, where risks are attributed to unreliable system components, whether human or technological.
Safety assessments usually focus on what can go wrong, and on how such developments can be prevented.
Resilience engineering focuses on how systems can succeed under varying and unpredictable conditions. In resilience engineering, safety assessment therefore focuses on what goes right, as well as on what should have gone right.
How can we know that we are safe?

Accident analysis: explaining and understanding what has happened (actual causes). How can we know what did go wrong? Leads to elimination or reduction of attributed causes.

Risk assessment: predicting what may happen (possible consequences). How can we predict what may go wrong? Leads to elimination or prevention of potential risks.

To achieve freedom from risks, models, concepts and methods must be compatible, and be able to describe ‘reality’ in an adequate fashion.
First there were technical failures

[Figure: percentage of accident causes attributed to technology and equipment, 1960–2005.]
... and technical analysis methods

[Figure: timeline 1900–2010 of technical analysis methods, e.g. HAZOP, FMEA, Fault tree, FMECA.]
How do we know technology is safe?

Design principles: Clear and explicit
Architecture and components: Known
Models: Formal, explicit
Analysis methods: Standardised, validated
Mode of operation: Well-defined (simple)
Structural stability: High (permanent)
Functional stability: High
Then came the “human factor”

[Figure: percentage of attributed causes, 1960–2000, now divided between technology/equipment and human performance.]
... and human factors analysis methods

[Figure: timeline 1900–2010 adding human factors analysis methods (Domino, Root cause, Swiss Cheese, HPES, HEAT, HERA, HCR, THERP, AEB, TRACEr, RCA, ATHEANA, CSNI) to the technical methods (HAZOP, FMEA, Fault tree, FMECA).]
How do we know humans are safe?

Design principles: Unknown, inferred
Architecture and components: Partly known, partly unknown
Models: Mainly analogies
Analysis methods: Ad hoc, unproven
Mode of operation: Vaguely defined, complex
Structural stability: Variable
Functional stability: Usually reliable
Finally, “organisational failures” ...

[Figure: percentage of attributed causes, 1960–2005, now divided among technology/equipment, human performance, and organisation. Which will be the most unreliable component?]
... and organisational analysis methods

[Figure: timeline 1900–2010, now spanning technical, human factors, organisational and systemic methods; additions include TRIPOD, MTO, MORT, CREAM, MERMOS, STEP, AcciMap, STAMP, FRAM.]
How do we know organisations are safe?

Design principles: High-level, programmatic
Architecture and components: Partly known, partly unknown
Models: Semi-formal
Analysis methods: Ad hoc, unproven
Mode of operation: Partly defined, complex
Structural stability: Stable (formal), volatile (informal)
Functional stability: Good, hysteretic (lagging)
Common assumptions

- The system can be decomposed into meaningful elements (components, events).
- The function of each element is bimodal (true/false, work/fail).
- The failure probability of elements can be analysed/described individually.
- The order or sequence of events is predetermined and fixed.
- When combinations occur, they can be described as linear (tractable, non-interacting).
- The influence from context/conditions is limited and quantifiable.
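These assumptions are exactly what classical fault-tree calculations rely on. A minimal sketch, with entirely hypothetical components and failure probabilities:

```python
# Sketch of the "common assumptions": the system is decomposed into
# components, each component is bimodal (works/fails), failure
# probabilities are assessed individually, and combinations are fixed
# and linear (AND/OR gates). All numbers are hypothetical.

def p_and(*probs):
    """Top event occurs only if ALL inputs fail (independent components)."""
    p = 1.0
    for q in probs:
        p *= q
    return p

def p_or(*probs):
    """Top event occurs if ANY input fails (independent components)."""
    p = 1.0
    for q in probs:
        p *= (1.0 - q)
    return 1.0 - p

# Hypothetical component failure probabilities.
pump_fails = 1e-3
valve_fails = 5e-4
operator_error = 1e-2   # the human treated as just another component

# Fixed, predetermined combination: loss of cooling if the pump fails
# AND (the valve fails OR the operator mis-sets it).
p_top = p_and(pump_fails, p_or(valve_fails, operator_error))
print(f"P(top event) = {p_top:.2e}")
```

The point of the sketch is the structure, not the numbers: once the gates and the individual probabilities are fixed, the top-event probability follows mechanically, which is precisely what the revised assumptions later in the presentation call into question.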
Theories and models of the negative

- Accidents are caused by people, due to carelessness, inexperience, and/or wrong attitudes.
- Technology and materials are imperfect, so failures are inevitable.
- Organisations are complex but brittle, with limited memory and an unclear distribution of authority.
Risks as propagation of failures
Decomposable, simple linear models

If accidents happen like this: as the culmination of a chain of events (binary branching), where the component that failed is found by reasoning backwards from the final consequence ...

... then risks can be found like this: from the probability of component failures. Find the probability that something “breaks”, either alone or in simple, logical and fixed combinations.

Human failure is treated at the “component” level.
Risks as combinations of failures
Decomposable, complex linear models

If accidents happen like this: as combinations of failures and conditions ...

... then risks can be found like this: as combinations of active failures and latent conditions. Look for how degraded barriers or defences combined with an active (human) failure, and for the likelihood of weakened defences and their combinations: single failures combined with latent conditions, leading to degradation of barriers and defences.
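The “combinations of failures” view can also be sketched numerically. The barrier probabilities and the factor-of-ten degradation below are invented for illustration:

```python
# Sketch of the "Swiss cheese" reading: an accident requires an active
# failure AND a hole lined up in every defence in its path. Latent
# conditions are modelled (hypothetically) as factors that widen the
# holes, i.e. raise each barrier's failure probability.

def p_accident(p_active, barrier_hole_probs):
    """An active failure propagates only if every barrier fails in line."""
    p = p_active
    for p_hole in barrier_hole_probs:
        p *= p_hole
    return p

p_active = 1e-2                    # hypothetical active (human) failure
barriers = [1e-2, 1e-2, 1e-2]      # three intact defences

# Latent conditions (e.g. deferred maintenance, time pressure) are
# assumed to degrade every barrier by a factor of ten.
degraded = [p * 10 for p in barriers]

print(f"intact barriers:   {p_accident(p_active, barriers):.1e}")
print(f"degraded barriers: {p_accident(p_active, degraded):.1e}")
```

Note that this is still a linear model: the barriers combine multiplicatively and independently, which is exactly the assumption the non-linear view later abandons.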
Learning from when things go right?

P(failure) = 10⁻⁴: for every time that something goes wrong, there will be 9,999 times when something goes right.

Proposition 1: The ways in which things go right are special cases of the ways in which things go wrong, or: successes = failures gone right. The best way to improve system safety is therefore to study how things go wrong, and to generalise from that. Potential data source: 1 case out of 10,000.

Proposition 2: The ways in which things go wrong are special cases of the ways in which things go right, or: failures = successes gone wrong. The best way to improve system safety is therefore to study how things go right, and to generalise from that. Potential data source: 9,999 cases out of 10,000.
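The imbalance between the two data sources can be made concrete with a short simulation. The failure probability is taken from the slide; the sample size and seed are arbitrary:

```python
import random

# Simulate P(failure) = 10^-4: out of every 10,000 events, on average
# 1 goes wrong and 9,999 go right. Sample size is arbitrary.
random.seed(42)
P_FAILURE = 1e-4
N = 1_000_000

failures = sum(1 for _ in range(N) if random.random() < P_FAILURE)
successes = N - failures

print(f"failures:  {failures}")       # ~100 expected
print(f"successes: {successes}")      # ~999,900 expected
print(f"potential data ratio: {successes / max(failures, 1):.0f} : 1")
```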
Success and failure

Failure is normally explained as a breakdown or malfunctioning of a system and/or its components. This view assumes that success and failure are of a fundamentally different nature.

Resilience Engineering recognises that individuals and organisations must adjust to the current conditions in everything they do. Because information, resources and time are always finite, the adjustments will always be approximate.

Success is due to the ability of organisations, groups and individuals to make these adjustments correctly, in particular to anticipate risks correctly before failures and harm occur. Failure can be explained as the absence of that ability, either temporarily or permanently.

Safety can be improved by strengthening that ability, rather than just by avoiding or eliminating failures.
Risks as non-linear combinations
Non-decomposable, non-linear models

[Figure: a Functional Resonance Analysis Model of the jackscrew / horizontal stabilizer example. Functions such as Lubrication, End-play checking, Jackscrew replacement, Interval approvals, FAA maintenance oversight, Certification, Aircraft design and Aircraft pitch control are each described by six aspects (Time, Control, Input, Output, Precondition, Resource) and linked through conditions such as grease, procedures, expertise, high workload, excessive end-play, and limited or controlled stabilizer movement. If accidents happen like this, then risks must be found in the same way.]

Unexpected combinations (resonance) of variability of normal performance. Systems at risk are intractable rather than tractable.

The established assumptions therefore have to be revised.
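A functional resonance model describes each function by six aspects: Input, Output, Precondition, Resource, Time and Control (the I, O, P, R, T and C labels in the diagram). The sketch below shows one way such a description could be held as data, with potential couplings found where one function's Output matches another's upstream aspect; the two functions and their states are taken from the jackscrew example, but the particular state names are illustrative:

```python
# Minimal FRAM-style description: each function has six aspects
# (Input, Output, Precondition, Resource, Time, Control). Potential
# couplings arise wherever one function's Output matches another
# function's Input, Precondition, Resource, Time or Control.
# State names ("lubricated jackscrew", etc.) are illustrative.

from dataclasses import dataclass, field

@dataclass
class Function:
    name: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)
    preconditions: set = field(default_factory=set)
    resources: set = field(default_factory=set)
    times: set = field(default_factory=set)
    controls: set = field(default_factory=set)

def couplings(functions):
    """Return (upstream, downstream, aspect, state) links via Outputs."""
    links = []
    for up in functions:
        for down in functions:
            if up is down:
                continue
            for aspect in ("inputs", "preconditions", "resources",
                           "times", "controls"):
                for state in up.outputs & getattr(down, aspect):
                    links.append((up.name, down.name, aspect, state))
    return links

lubrication = Function("Lubrication",
                       outputs={"lubricated jackscrew"},
                       resources={"grease"},
                       controls={"procedures"})
endplay = Function("End-play checking",
                   preconditions={"lubricated jackscrew"},
                   outputs={"allowable end-play"})

for link in couplings([lubrication, endplay]):
    print(link)
```

The design choice worth noting is that nothing here is a failure mode: the couplings describe how normal performance of one function conditions another, which is where variability can combine and resonate.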
Revised assumptions - 2008

- Systems cannot be decomposed in a meaningful way (there are no natural elements or components).
- System functions are not bimodal; normal performance is, and must be, variable.
- Outcomes are determined by performance variability rather than by (human) failure probability.
- Performance variability is the reason why things go right, but also why they go wrong.
- Some adverse events can be attributed to failures and malfunctions of normal functions, but others are best understood as the result of combinations of variability of normal performance.

[Figure: the functional resonance model of the jackscrew / horizontal stabilizer example, repeated.]

Risk and safety analyses should try to understand the nature of variability of normal performance and use that to identify conditions that may lead to both positive and adverse outcomes.
From the negative to the positive

- Negative outcomes are caused by failures and malfunctions. Safety = reduced number of adverse events. Remedy: eliminate failures and malfunctions as far as possible.
- Safety = ability to respond when something fails. Remedy: improve the ability to respond to adverse events.
- All outcomes (positive and negative) are due to performance variability. Safety = ability to succeed under varying conditions. Remedy: improve resilience.
Resilience and safety management

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations even after a major mishap or in the presence of continuous stress.

A practice of Resilience Engineering / Proactive Safety Management requires that all levels of the organisation are able to:

- Respond to regular and irregular threats in an effective, flexible manner (the actual)
- Anticipate threats, disruptions and destabilizing conditions (the potential)
- Learn from past events, and understand correctly what happened and why (the factual)
- Monitor threats and revise risk models (the critical)
Designing for resilience

- Responding (the actual): knowing what to do, and being capable of doing it.
- Anticipating (the potential): finding out and knowing what to expect.
- Learning (the factual): knowing what has happened.
- Monitoring (the critical): knowing what to look for (attention).

An increased availability and reliability of functioning on all levels will not only improve safety but also enhance control, hence the ability to predict, plan, and produce.
As Low As Reasonably Practicable

- Unacceptable region (intolerable risk): risks must be eliminated or contained at any cost; lower in the region, they will be eliminated or contained if not too costly. (INVEST!)
- ALARP or tolerability region (tolerable risk): risks should, or lower down may, be eliminated or contained or otherwise responded to. (Save rather than invest.)
- Broadly acceptable region (negligible risk): risks might be assessed when feasible. (SAVE!)
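The three ALARP regions can be sketched as a simple classifier. The numeric thresholds below are hypothetical; real ALARP schemes judge tolerability from likelihood and severity together, not from a single probability:

```python
# Sketch of the ALARP regions with invented cut-offs on annual event
# probability. The thresholds are illustrative only; actual tolerability
# criteria depend on severity and on the regulatory regime.

UPPER = 1e-3   # at or above this: unacceptable region
LOWER = 1e-6   # below this: broadly acceptable region

def alarp_region(p_per_year):
    if p_per_year >= UPPER:
        return "unacceptable: eliminate or contain"
    if p_per_year >= LOWER:
        return "ALARP/tolerable: reduce unless cost is grossly disproportionate"
    return "broadly acceptable: monitor, assess when feasible"

for p in (1e-2, 1e-4, 1e-7):
    print(f"{p:.0e}: {alarp_region(p)}")
```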
As high as reasonably practicable

- Responding (the actual): Which events? How were they found? Is the list revised? How is readiness ensured and maintained?
- Anticipating (the potential): What is our “model” of the future? How long do we look ahead? What risks are we willing to take? Who believes what, and why?
- Learning (the factual): What is learned, and when: continuously or event-driven? From what (successes or failures)? How (qualitative, quantitative)? By the individual or by the organisation?
- Monitoring (the critical): How are indicators defined? Lagging or leading? How are they “measured”? Are effects transient or permanent? Who looks where, and when? How, and when, are they revised?
Resilience and safety management

Managing risks of the past: Effective risk management must consider both what went right and what went wrong. Issues: how to learn from accidents, near misses and successes?

Managing risks of the present: Since prevention has its limitations, it is also necessary to monitor the state of the system and/or organisation. This requires an articulated model of leading/lagging indicators and of “weak” signals.

Managing risks of the future: Risk management means taking risks when preparing for future events. This requires a strategy that addresses both safety and business goals, and a practical and realistic way of identifying future risks and threats.
Thanks for your attention Any questions?