Resilience Engineering and Safety Assessment
Erik Hollnagel
Professor & Industrial Safety Chair
MINES ParisTech — Crisis and Risk Research Centre
Sophia Antipolis, France
E-mail: [email protected]

© Erik Hollnagel, 2008

Outline of presentation

WHY

Safety and risk come from an engineering tradition, where risks are attributed to unreliable system components — whether human or technological.

Safety assessments usually focus on what can go wrong, and on how such developments can be prevented.

Resilience engineering focuses on how systems can succeed under varying and unpredictable conditions. In resilience engineering, safety assessments therefore focus on what goes right, as well as on what should have gone right.

How can we know that we are safe?

Accident analysis: explaining and understanding what has happened (actual causes). How can we know what did go wrong? The aim is the elimination or reduction of attributed causes.

Risk assessment: predicting what may happen (possible consequences). How can we predict what may go wrong? The aim is the elimination or prevention of potential risks.

In order to achieve freedom from risks, models, concepts and methods must be compatible and able to describe ‘reality’ in an adequate fashion.

First there were technical failures

[Figure: bar chart of "% attributed cause" by year, 1960–2005, for the category "technology, equipment".]

... and technical analysis methods

[Figure: timeline, 1900–2010, of technical analysis methods, including FMEA, HAZOP, Fault tree and FMECA.]

How do we know technology is safe?

Design principles: Clear and explicit
Architecture and components: Known
Models: Formal, explicit
Analysis methods: Standardised, validated
Mode of operation: Well-defined (simple)
Structural stability: High (permanent)
Functional stability: High

Then came the “human factor”

[Figure: bar chart of "% attributed cause" by year, 1960–2000, now split between "technology, equipment" and "human performance".]

... and human factors analysis methods

[Figure: timeline, 1900–2010, of technical and human factors analysis methods, including Domino, Root cause, HAZOP, FMEA, Fault tree, FMECA, CSNI, THERP, HCR, HPES, HEAT, Swiss Cheese, HERA, AEB, TRACEr, RCA and ATHEANA.]

How do we know humans are safe?

Design principles: Unknown, inferred
Architecture and components: Partly known, partly unknown
Models: Mainly analogies
Analysis methods: Ad hoc, unproven
Mode of operation: Vaguely defined, complex
Structural stability: Variable
Functional stability: Usually reliable

Finally, “organisational failures” ...

Which will be the most unreliable component?

[Figure: bar chart of "% attributed cause" by year, 1960–2005, for "technology, equipment", "human performance" and "organisation", with question marks for the future shares.]

... and organisational analysis methods

[Figure: timeline, 1900–2010, of analysis methods grouped as technical, human factors, organisational and systemic, including Domino, Root cause, HAZOP, FMEA, Fault tree, FMECA, CSNI, MORT, THERP, HCR, AEB, HPES, HEAT, Swiss Cheese, TRIPOD, MTO, HERA, TRACEr, RCA, ATHEANA, CREAM, MERMOS, AcciMap, STEP, STAMP and FRAM.]

How do we know organisations are safe?

Design principles: High-level, programmatic
Architecture and components: Partly known, partly unknown
Models: Semi-formal
Analysis methods: Ad hoc, unproven
Mode of operation: Partly defined, complex
Structural stability: Stable (formal), volatile (informal)
Functional stability: Good, hysteretic (lagging)

Common assumptions

The system can be decomposed into meaningful elements (components, events).
The function of each element is bimodal (true/false, work/fail).
The failure probability of elements can be analysed and described individually.
The order or sequence of events is predetermined and fixed.
When combinations occur, they can be described as linear (tractable, non-interacting).
The influence from context/conditions is limited and quantifiable.
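To make the assumptions concrete, here is a minimal sketch (with hypothetical component names and probabilities, not taken from the slides) of the kind of calculation they license: each element is bimodal, its failure probability is assessed individually, and the elements combine through a fixed series/parallel structure.

```python
# Sketch of the "common assumptions": bimodal components, independent
# failure probabilities, fixed combination logic. Numbers are illustrative.

# Independent per-component failure probabilities.
p_fail = {"pump": 1e-3, "valve": 5e-4, "sensor_a": 2e-3, "sensor_b": 2e-3}

def p_and(*ps):
    """All inputs must fail (redundant elements in parallel)."""
    prob = 1.0
    for p in ps:
        prob *= p
    return prob

def p_or(*ps):
    """Any input failing fails the system (elements in series)."""
    prob = 1.0
    for p in ps:
        prob *= (1.0 - p)
    return 1.0 - prob

# Fixed, predetermined structure: the system fails if the pump or the valve
# fails, or if both redundant sensors fail together.
p_system = p_or(p_fail["pump"], p_fail["valve"],
                p_and(p_fail["sensor_a"], p_fail["sensor_b"]))
print(f"P(system failure) = {p_system:.6f}")
```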

Theories and models of the negative

Accidents are caused by people, due to carelessness, inexperience, and/or wrong attitudes.

Technology and materials are imperfect, so failures are inevitable.

Organisations are complex but brittle, with limited memory and an unclear distribution of authority.

Risks as propagation of failures (decomposable, simple linear models)

If accidents happen like this: the culmination of a chain of events, with binary branching ...

... then risks can be found like this: find the component that failed by reasoning backwards from the final consequence, and find the probability that something “breaks”, either alone or in simple, logical and fixed combinations of component failures.

Human failure is treated at the “component” level.
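As a rough illustration of this "reason backwards from the final consequence" logic, the sketch below enumerates which combinations of component failures could explain a top event in a small fault-tree-like structure. The gate structure and event names are invented for the example.

```python
# Sketch of backward reasoning in a simple linear (fault-tree style) model:
# starting from the top event, enumerate which combinations of component
# failures can explain it. Gate structure and event names are illustrative.

# A node is either a basic event (string) or a gate: ("AND"|"OR", [children]).
tree = ("OR", [
    "pump fails",
    ("AND", ["sensor A fails", "sensor B fails"]),
])

def cut_sets(node):
    """Return the sets of basic events whose joint failure causes the node."""
    if isinstance(node, str):
        return [{node}]
    gate, children = node
    if gate == "OR":                      # any child alone is enough
        return [cs for child in children for cs in cut_sets(child)]
    # AND gate: every child must fail; combine one cut set from each child.
    combined = [set()]
    for child in children:
        combined = [acc | cs for acc in combined for cs in cut_sets(child)]
    return combined

for cs in cut_sets(tree):
    print("Top event explained by:", ", ".join(sorted(cs)))
```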

Risks as combinations of failures (decomposable, complex linear models)

If accidents happen like this: combinations of active failures and latent conditions ...

... then risks can be found like this: look for how degraded barriers or defences combine with an active (human) failure, and estimate the likelihood of weakened defences and their combinations: single failures combined with latent conditions, leading to a degradation of barriers and defences.
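A minimal sketch of the reasoning behind this view, assuming (purely for illustration) independent probabilities for the active failure and for each degraded barrier: the accident likelihood is the chance that the holes in all defences line up with the active failure.

```python
# Sketch of the "combinations of failures" view: an accident requires an
# active failure to coincide with holes in every defence (weakened barriers).
# Probabilities are illustrative and assumed independent for simplicity.

p_active_failure = 1e-2          # e.g. a slip during a task
p_barrier_degraded = [0.05,      # alarm mis-set (latent condition)
                      0.10,      # supervision missing
                      0.02]      # physical barrier bypassed

p_accident = p_active_failure
for p_hole in p_barrier_degraded:
    p_accident *= p_hole         # the holes must line up simultaneously

print(f"P(active failure penetrates all defences) = {p_accident:.2e}")
```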

Learning from when things go right?

P(failure) = 10⁻⁴: for every time that something goes wrong, there will be 9,999 times when something goes right.

Proposition 1: The ways in which things go right are special cases of the ways in which things go wrong (successes = failures gone wrong). The best way to improve system safety is therefore to study how things go wrong, and to generalise from that. Potential data source: 1 case out of 10,000.

Proposition 2: The ways in which things go wrong are special cases of the ways in which things go right (failures = successes gone wrong). The best way to improve system safety is therefore to study how things go right, and to generalise from that. Potential data source: 9,999 cases out of 10,000.
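The proportions on this slide follow from simple arithmetic; the short sketch below just makes the counting explicit for one block of 10,000 operations.

```python
# The arithmetic behind the two propositions: with P(failure) = 10^-4,
# successes outnumber failures by 9,999 to 1 as potential data sources.

p_failure = 1e-4
n_events = 10_000                 # one block of operations, as on the slide

expected_failures = n_events * p_failure          # 1 case to learn from
expected_successes = n_events * (1 - p_failure)   # 9,999 cases to learn from

print(f"Expected failures : {expected_failures:.0f} out of {n_events:,}")
print(f"Expected successes: {expected_successes:.0f} out of {n_events:,}")
```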

Success and failure

Failure is normally explained as a breakdown or malfunctioning of a system and/or its components. This view assumes that success and failure are of a fundamentally different nature. Resilience Engineering recognises that individuals and organisations must adjust to the current conditions in everything they do. Because information, resources and time are always finite, the adjustments will always be approximate. Success is due to the ability of organisations, groups and individuals to make these adjustments correctly, and in particular to anticipate risks before failures and harm occur. Failure can be explained as the absence of that ability — either temporarily or permanently. Safety can be improved by strengthening that ability, rather than just by avoiding or eliminating failures.

Risks as non-linear combinations (non-decomposable, non-linear models)

If accidents happen like this ... then risks can be found like this:

[Figure: two functional resonance analysis model (FRAM) diagrams of an aircraft pitch-control example. Functions such as maintenance oversight (FAA), certification, aircraft design, interval approvals, end-play checking, lubrication, jackscrew replacement, jackscrew up-down movement, limiting stabilizer movement, horizontal stabilizer movement and aircraft pitch control are each described by six aspects (Input, Output, Preconditions, Resources, Time, Control) and coupled through items such as aircraft design knowledge, procedures, mechanics, expertise, equipment, grease, high workload, redundant design, allowable end-play, excessive end-play, limited stabilizer movement and controlled stabilizer movement.]

Unexpected combinations (resonance) of the variability of normal performance.

Systems at risk are intractable rather than tractable.

The established assumptions therefore have to be revised.
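As a rough, illustrative sketch of what a functional description of this kind involves (a toy data structure, not the FRAM method itself), the code below represents two of the functions from the figure by the six aspects (Input, Output, Precondition, Resource, Time, Control) and derives couplings where the output of one function is used by another. The contents of each aspect are simplified assumptions based on the labels in the figure.

```python
# Toy representation of functions with six FRAM-style aspects; couplings
# emerge where the output of one function appears as an aspect of another.
# Function contents are illustrative only.

functions = {
    "Lubrication": {
        "Input": ["maintenance schedule"],
        "Output": ["lubricated jackscrew"],
        "Precondition": [],
        "Resource": ["grease", "mechanics"],
        "Time": ["interval approvals"],
        "Control": ["procedures"],
    },
    "Jackscrew up-down movement": {
        "Input": ["lubricated jackscrew"],
        "Output": ["horizontal stabilizer movement"],
        "Precondition": ["allowable end-play"],
        "Resource": [],
        "Time": [],
        "Control": [],
    },
}

def couplings(funcs):
    """List (upstream, downstream, aspect) links where one function's output
    is used by another function."""
    links = []
    for up, up_aspects in funcs.items():
        for out in up_aspects["Output"]:
            for down, down_aspects in funcs.items():
                if down == up:
                    continue
                for aspect in ("Input", "Precondition", "Resource", "Time", "Control"):
                    if out in down_aspects[aspect]:
                        links.append((up, down, aspect))
    return links

for up, down, aspect in couplings(functions):
    print(f"{up} --({aspect})--> {down}")
```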

Revised assumptions - 2008

Systems cannot be decomposed in a meaningful way (there are no natural elements or components).

System functions are not bimodal, but normal performance is — and must be — variable.

Outcomes are determined by performance variability rather than by (human) failure probability.

Performance variability is the reason why things go right — but also why they go wrong.

Some adverse events can be attributed to failures and malfunctions of normal functions, but others are best understood as the result of combinations of the variability of normal performance.

[Figure: the functional resonance analysis model of the aircraft pitch-control example from the previous slide.]

Risk and safety analyses should try to understand the nature of the variability of normal performance and use that to identify conditions that may lead to both positive and adverse outcomes.
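One way to see what "combinations of the variability of normal performance" can mean is a small Monte Carlo sketch: every function performs within its normal range on every run, yet adverse outcomes still occur when several sources of everyday variability happen to line up. The factors, distributions and threshold below are illustrative assumptions, not a model of any real system.

```python
# Sketch of the revised view: no component "fails", but normal performance
# varies, and occasionally the variability of several functions combines
# (resonates) into an adverse outcome. Parameters are illustrative.
import random

random.seed(1)

def run_once():
    # Each function performs close to normal (1.0) with everyday variability.
    lubrication = random.gauss(1.0, 0.08)
    end_play_check = random.gauss(1.0, 0.08)
    workload_pressure = random.gauss(1.0, 0.08)
    # An adverse outcome occurs only when several sources of variability
    # line up; no single factor crosses a "failure" threshold on its own.
    combined = (2.0 - lubrication) * (2.0 - end_play_check) * workload_pressure
    return combined > 1.35

trials = 100_000
adverse = sum(run_once() for _ in range(trials))
print(f"Adverse outcomes from combined normal variability: {adverse}/{trials}")
```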

From the negative to the positive

Negative outcomes are caused by failures and malfunctions → All outcomes (positive and negative) are due to performance variability.

Safety = a reduced number of adverse events → Safety = the ability to respond when something fails → Safety = the ability to succeed under varying conditions.

Eliminate failures and malfunctions as far as possible → Improve the ability to respond to adverse events → Improve resilience.

Resilience and safety management

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations even after a major mishap or in the presence of continuous stress.

A practice of Resilience Engineering / proactive safety management requires that all levels of the organisation are able to:

Respond to regular and irregular threats in an effective, flexible manner (the actual).
Monitor threats and revise risk models (the critical).
Anticipate threats, disruptions and destabilizing conditions (the potential).
Learn from past events, and understand correctly what happened and why (the factual).

Designing for resilience

Responding (the actual): knowing what to do, and being capable of doing it.
Monitoring (the critical): knowing what to look for (attention).
Anticipating (the potential): finding out and knowing what to expect.
Learning (the factual): knowing what has happened.

An increased availability and reliability of functioning at all levels will not only improve safety but also enhance control, hence the ability to predict, plan, and produce.

As Low As Reasonably Practicable (ALARP)

Unacceptable region (intolerable risk): INVEST! Risks must be eliminated or contained at any cost.

ALARP or tolerability region (tolerable risk): save rather than invest. Risks will be eliminated or contained if not too costly; should be eliminated, contained or otherwise responded to; or may be eliminated, contained or otherwise responded to.

Broadly acceptable region (negligible risk): SAVE! Risks might be assessed when feasible.
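A minimal sketch of the triage logic this slide describes, assuming purely illustrative numeric boundaries for the three regions (actual limits are set by regulators and the organisation, not by this example):

```python
# Sketch of ALARP triage: place an assessed risk in one of the three regions
# and map it to the corresponding response. Boundaries are illustrative.

UPPER_LIMIT = 1e-3   # above this: unacceptable region (intolerable risk)
LOWER_LIMIT = 1e-6   # below this: broadly acceptable region (negligible risk)

def alarp_region(annual_probability: float) -> str:
    if annual_probability >= UPPER_LIMIT:
        return "Unacceptable: must be eliminated or contained at any cost (invest)"
    if annual_probability >= LOWER_LIMIT:
        return ("ALARP/tolerability: should be eliminated, contained or otherwise "
                "responded to, if the cost is not disproportionate")
    return "Broadly acceptable: might be assessed when feasible (save)"

for p in (5e-3, 2e-5, 1e-8):
    print(f"P = {p:.0e}: {alarp_region(p)}")
```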

As high as reasonably practicable

Responding (the actual): Which events? How were they found? Is the list revised? How is readiness ensured and maintained?

Monitoring (the critical): How are indicators defined? Are they lagging or leading? How are they “measured”? Are effects transient or permanent? Who looks where and when? How, and when, are they revised?

Anticipating (the potential): What is our “model” of the future? How long do we look ahead? What risks are we willing to take? Who believes what and why?

Learning (the factual): What do we learn from (successes or failures)? When: continuously or event-driven? How: qualitative or quantitative? By the individual or by the organisation?
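One way to keep these questions usable is to organise them per ability as a simple self-assessment checklist. The sketch below does exactly that; the grouping of the questions under the four abilities follows the reading above, and the wording is lightly condensed.

```python
# Sketch of a self-assessment checklist built from the questions on the
# slide, grouped by the four abilities. Structure and wording are
# paraphrased for illustration.

abilities = {
    "Responding (the actual)": [
        "Which events do we prepare for, and how were they found?",
        "Is the list of events revised?",
        "How is readiness ensured and maintained?",
    ],
    "Monitoring (the critical)": [
        "How are indicators defined, and are they lagging or leading?",
        "How are they measured, and are effects transient or permanent?",
        "Who looks where and when, and how and when are indicators revised?",
    ],
    "Anticipating (the potential)": [
        "What is our model of the future, and how long do we look ahead?",
        "What risks are we willing to take, and who believes what and why?",
    ],
    "Learning (the factual)": [
        "Do we learn from successes as well as failures?",
        "Is learning continuous or event-driven, qualitative or quantitative?",
        "Does the individual or the organisation learn?",
    ],
}

for ability, questions in abilities.items():
    print(ability)
    for q in questions:
        print("  -", q)
```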

Resilience and safety management

Managing risks of the present: since prevention has its limitations, it is necessary also to monitor the state of the system and/or the organisation. This requires an articulated model of leading/lagging indicators and of “weak” signals.

Managing risks of the past: effective risk management must consider both what went right and what went wrong. Issues: how to learn from accidents, near misses and successes?

Managing risks of the future: risk management means taking risks when preparing for future events. This requires a strategy to address both safety and business goals, and a practical and realistic way of identifying future risks and threats.
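To make the leading/lagging distinction concrete, here is a small illustrative sketch; the data, indicator names and formulas are invented for the example and are not taken from the slides.

```python
# Illustrative sketch of lagging vs leading indicators. Data and names are
# invented; the point is only the difference in what is being measured.

incidents_last_year = 4
hours_worked = 250_000
maintenance_tasks = {"done_on_time": 180, "overdue": 35}
audits_planned, audits_done = 12, 7

# Lagging indicator: measures outcomes that have already occurred.
lagging_incident_rate = incidents_last_year / hours_worked * 200_000

# Leading indicators: measure conditions that precede outcomes ("weak" signals).
leading_overdue_fraction = maintenance_tasks["overdue"] / sum(maintenance_tasks.values())
leading_audit_completion = audits_done / audits_planned

print(f"Lagging - incident rate per 200,000 h: {lagging_incident_rate:.2f}")
print(f"Leading - overdue maintenance fraction: {leading_overdue_fraction:.1%}")
print(f"Leading - audit completion:             {leading_audit_completion:.1%}")
```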

Thanks for your attention. Any questions?
