Reflections on Resilience: Digitalization and ‘Errors of the Third Kind’
If you have been working in Financial Services Technology and Architecture for any length of time, the emerging requirements for Digital Operational Resilience (as enshrined within DORA, APRA CPS 230, Singapore’s Financial Services and Markets Act, and Bank of England’s PS21/3) will come as no surprise. The digital transformation of the sector has created a level of interconnectedness, dynamism, and complexity which requires a different approach to business continuity and risk management. Where organizations once used IT systems to automate aspects of their business, institutions and entire financial systems now operate digitally end to end, with software controlling every aspect of the digital economy.
We got here, unfortunately, having experienced a string of incidents resulting from an ‘Error of the Third Kind’ in our approach to continuity. These are errors where we find the right answer for the wrong problem, and when it comes to Business Continuity and Operational Risk Management, we have been making these errors for many years.
For example, back in the late 2000s and early 2010s, while I was working as a Technology Fellow at one of the G-SIBs (Global Systemically Important Banks), we noticed an important development: the incidents which regularly caused the largest impact had nothing to do with classical ‘business continuity’ or disaster scenarios. They were caused by an emergent level of complexity and a lack of transparency into the complex, interconnected systems that delivered Critical Business Services. Our relentless drive to innovate had transformed our business fabric in a way that our isolated bi-annual recovery tests for our most important applications failed to address.
The approach to business continuity then (and for many years) involved the following:
- Planning for a limited set of traditional disaster scenarios (for example, the failure of a data center or office location).
- Periodic risk assessments run manually against static inventories.
- Infrequent testing of the recovery of primary systems delivering a critical business function, without necessarily considering their critical dependencies or the multitude of ways in which complex systems can fail or degrade.
When we looked at the problem back then, it became clear that the ‘new reality’ was that business functions were now delivered by complex, interdependent sets of digital resources (interacting with humans through complex layers of abstraction). These resources changed often and failed for a multitude of reasons, often a combination of causes that were individually trivial and opaque. We realized that we needed to look at resilience differently, in particular in terms of the information required (importantly, its accuracy and currency) and in reducing complexity and variability (by simplifying architectures around a well-tested and well-understood set of patterns).
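To make the point about information accuracy and currency concrete, here is a minimal sketch of the kind of question a dependency inventory needs to answer: which resources does a Critical Business Service ultimately rely on, and which of those inventory records have not been verified recently? Everything in it is an assumption for illustration only: the service names, the `DEPENDS_ON` and `LAST_VERIFIED` structures, and the 30-day freshness threshold are hypothetical and do not describe any particular institution's tooling.

```python
from datetime import date, timedelta

# Hypothetical inventory: each resource maps to its direct dependencies.
DEPENDS_ON = {
    "payments-service": ["payments-api", "ledger-db"],
    "payments-api": ["auth-service", "message-bus"],
    "ledger-db": ["storage-cluster"],
    "auth-service": ["directory-saas"],
    "message-bus": [],
    "storage-cluster": [],
    "directory-saas": [],
}

# Hypothetical record of when each inventory entry was last confirmed.
# "directory-saas" is deliberately missing: it has never been verified.
LAST_VERIFIED = {
    "payments-api": date(2024, 6, 20),
    "ledger-db": date(2024, 6, 25),
    "auth-service": date(2024, 1, 10),
    "message-bus": date(2024, 6, 1),
    "storage-cluster": date(2024, 6, 15),
}


def transitive_dependencies(service, graph):
    """Return every resource the service depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(service, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; this also guards against cycles
        seen.add(node)
        stack.extend(graph.get(node, []))
    seen.discard(service)  # a cycle must not make a service its own dependency
    return seen


def stale_records(resources, last_verified, today, max_age_days=30):
    """Return resources whose record is older than max_age_days or missing."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(r for r in resources if last_verified.get(r, date.min) < cutoff)


if __name__ == "__main__":
    deps = transitive_dependencies("payments-service", DEPENDS_ON)
    print(sorted(deps))
    print(stale_records(deps, LAST_VERIFIED, today=date(2024, 6, 30)))
```

Even in this toy form, the sketch shows why testing the recovery of a primary system in isolation falls short: the service depends on six resources, and two of them are tracked by records that are out of date or missing entirely.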
In the intervening years, we have seen the dominance of cloud computing, continuous application deployment pipelines, exploding third-party dependencies, and microservices. The problems we identified 10 to 20 years ago have become far more challenging, and we have become almost totally dependent upon these digital systems.
So, the move to Operational Resilience is welcome. It teaches us that we need to gain control of our accumulated complexity and manage risk differently. We can also no longer separate business continuity and enterprise risk management from technology operations and resilience. They have become the same thing, which is why board-level governance and accountability for digital operations is a required aspect of every Operational Resilience regulation.
I hope you have found this consideration of the need for Operational Resilience in a digital world insightful. Over the next few articles, I plan to explore complexity, entropy, the human factor, the need to observe and adapt, and resilience initiatives that can have the opposite of their intended effect. I will also reflect on some examples of failures where a focus on systemic resilience would have enabled organizations either to anticipate and avoid the failure or to manage its impact more effectively.
About Marc Woolward:
Marc Woolward is CTO and CISO at vArmour. He has spent his career architecting and operating mission-critical systems within the financial services industry, and now does so for many of our customers across all sectors of critical national infrastructure. Throughout his career, Marc has worked to address resilience requirements within critical infrastructure across networking, cloud, SaaS, middleware, voice communications, and enterprise architecture. This blog series reflects on learnings from a career spent working with resilience in complex, mission-critical systems and references notable research and academic publications. In particular, Marc recommends ‘Resilience Engineering: Concepts and Precepts’ by Hollnagel, Woods and Leveson for its presentation of the wealth of studies into resilience in complex systems.