
When an incident occurs in a complex system, we seek to understand what has happened. Identifying the event that triggered the visible consequences is insufficient to explain why a system didn’t behave as intended.
A few principles guide our understanding of incidents.
There is no single “root cause.”
When a complex system exhibits a failure, it’s natural to look at the event that triggered the visible problem. However, when we look at the system as a whole, there are usually feedback loops and protective systems that should have kept that triggering event from happening.
When components of a system interact badly, isolating the problem to a single component may fix the immediate problem, but it leaves latent the underlying potential for those components to interact badly again, resulting in future incidents.
“Human error” is the product of system failure.
During an incident, human operators make decisions based on the information available to them at the time. But when we look back at what happened, our judgments about those decisions are colored by knowledge that was unavailable then.
Cognitive biases come into play in the incident review process: hindsight bias makes it seem as though the consequences “should have been” apparent at the time, and attribution bias makes a decision seem like a reflection of the operator’s personal character.
A wider view seeks to understand how the system appeared at the time of the incident, what feedback about operational actions was available, and the pressures and incentives that are part of the human systems alongside the technical systems.
That viewpoint opens possibilities for systemic improvements that do not lay blame on individuals or rely on “trying harder” to do better next time.
System control failures lead to accidents.
Complex systems normally operate within guardrails that constrain their actions. When the controls that keep the system within those boundaries fail, accidents occur. Looking at systems through the lens of the signals and feedback loops that bind components together is a powerful way to identify what we expected to keep the system in a safe state and what did not work as intended.
Together we have over a decade’s experience analyzing complex incidents and helping teams build models they can use to take effective action and improve system safety. We primarily turn to the methodologies created by Dr. Nancy Leveson, including STPA (System-Theoretic Process Analysis), which provide holistic views of systems, and we are practiced at adapting them to the constraints of working software engineering teams.
If you’re interested in better understanding the incidents in your systems and organizations, please drop us a line!
Written by Rachel Silber and Kevin Riggle. Photo credit: iStock.com/EyeEm Mobile GmbH.