Reflections on Resilience: The Properties of Resilient Systems and Their Business Benefits

 

Around the world, organizations are scrambling to meet enforcement dates for key operational resilience regulations (see the Resource section at the end of this article for an overview of today’s major regulations). This pressure is driven by the significant financial and reputational penalties that regulators can impose, and by board members and leadership teams who will be held accountable.

However, among the organizations affected by the regulations, we see several notable examples that are approaching their resilience programs as an opportunity to transform their business operations. Specifically, these companies see resilience as a way to accelerate innovation, provide more certainty to their boards, and reduce costs.

In this article we’ll explore the key properties of resilience and how they can significantly benefit organizations, followed by a business case for resilience.

 

The Key Properties of Resilience

As every operational resilience regulation and best practice recognizes, resilience helps organizations deal with both cyber threats and routine non-adversarial operational issues. Of note, the majority of high-profile issues experienced within the global financial services industry have been related to operational failures.

In this section we’ll examine the following four key properties of a resilient system and how each can benefit a business’s operations:

  1. Anticipate and Avoid
  2. Withstand
  3. Recover
  4. Adapt

In fact, in Special Publication 800-160, NIST defines cyber resilience (a close sibling of operational resilience within any digital business) as:

“The ability to anticipate, withstand, recover from, and adapt to adverse conditions, stresses, attacks, or compromises on systems that use or are enabled by cyber resources.” NIST also stresses that “Cyber resiliency constructs address both adversarial and non-adversarial threats from cyber and non-cyber sources.”

 

Property 1: Anticipate and Avoid

The anticipation and avoidance of incidents is the essence of operating resiliently in comparison to earlier paradigms that were centered around incident identification and recovery. At its core, this new approach identifies material risks and corrects material conditions before any harm or cost to the organization occurs.

In operational resilience circles, risk tolerances set boundaries that let the business operate within safe guardrails, limiting the potential impact of any incident and, more importantly, preventing incidents from occurring in the first place. A simple example within financial services is the protection of capital reserves and restriction of capital at risk.

Below are three examples of risk tolerances within the digital and cyber world:

Risk Tolerance | Policy | Business Benefit
Adversarial Risk Tolerance | “No critical dependency of an Important Business Service (IBS) should have a software vulnerability rated Medium Severity or higher.” | By adhering to this policy, organizations can reduce the exposed software attack surface of IBSs so that, if a breach or targeted ransomware attack occurs, there is a lower probability of a service-affecting incident.
Non-Adversarial Risk Tolerance | “Material changes to Important Business Service (IBS) behaviour must be fully tested, reviewed, and approved.” | By adhering to this policy, the organization ensures a baseline level of diligence associated with changes. While errors can still be made, steps have been taken to ensure testing and review occur prior to changes to critical services.
Universal Risk Tolerance | “No Important Business Service (IBS) should be accessed from an unknown, unapproved legacy, or unmanaged system, or any system outside of the production environment.” | Complex enterprises often struggle to understand their entire asset inventory, and to ensure that all systems (many of which are poorly understood legacy systems) are correctly managed. Dependency on, and communication with, systems lacking full production properties and controls can increase the risk of breach or operational failure (due to lack of system knowledge, or simply the failure rates associated with aging hardware). Addressing this issue enterprise-wide can be an intractable problem within an immediate timescale. A more scalable way to mitigate this risk is to ensure that systems interacting with or interdependent with IBSs are prioritized for remediation and monitored closely.
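
As a minimal sketch of how tolerances like these might be monitored, the Python below checks the dependencies of an IBS against the adversarial and universal example policies above. The Dependency record, its field names, the severity ranking, and the sample inventory are all hypothetical and introduced only for illustration; a real program would draw this data from its own asset inventory and vulnerability scanners.

```python
from dataclasses import dataclass, field

# Hypothetical severity ranking; real scanners use their own scales.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass
class Dependency:
    """Hypothetical inventory record for one dependency of an IBS."""
    name: str
    managed: bool          # known, approved, and actively managed?
    environment: str       # e.g. "production" or "test"
    vulnerabilities: list = field(default_factory=list)  # severity labels


def tolerance_breaches(ibs, deps, max_allowed="low"):
    """Return a description of every risk tolerance breach found for an IBS."""
    limit = SEVERITY_RANK[max_allowed]
    breaches = []
    for dep in deps:
        # Adversarial tolerance: no vulnerability rated Medium or higher.
        worst = max((SEVERITY_RANK[s] for s in dep.vulnerabilities), default=0)
        if worst > limit:
            breaches.append(f"{ibs}: {dep.name} has a vulnerability above '{max_allowed}'")
        # Universal tolerance: no unmanaged or non-production systems.
        if not dep.managed or dep.environment != "production":
            breaches.append(f"{ibs}: {dep.name} is unmanaged or outside production")
    return breaches


# Illustrative inventory for a hypothetical "payments" IBS.
inventory = [
    Dependency("ledger-db", managed=True, environment="production", vulnerabilities=["low"]),
    Dependency("legacy-batch", managed=False, environment="production", vulnerabilities=["medium"]),
]
for breach in tolerance_breaches("payments", inventory):
    print(breach)
```

Run on a schedule, a check of this kind would surface a tolerance breach as a finding to remediate before it contributes to an incident, which is the point of the Anticipate and Avoid property.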

Property Summary

Defining and closely monitoring a set of risk tolerances enables organizations to anticipate and avoid a large number of potential incidents, or at least eliminate factors contributing towards them. For this reason, investment in the ability to anticipate issues within a complex environment can potentially deliver the most significant return on investment as losses and disruption can be avoided entirely.

Financial impact of issues addressed by Anticipation and Avoidance strategy: $0

 

Property 2: Withstand

No matter how careful organizations are to anticipate and avoid incidents, they will sometimes still occur.

In today’s asymmetric cyber environment, breaches are increasingly inevitable and physical environments fail from time to time. However, by taking the time to identify resilient design patterns, apply deep defensive controls, and assure that they are applied, organizations can prepare themselves to withstand these incidents, potentially reducing the impact on IBSs to a minimum.

Let’s take a look at examples of measures that are commonly applied to enable an organization to withstand an incident.

Withstand Measure | Description
Resilient Design Pattern | Ensure no IBS function contains a single point of failure within a single failure domain (for example, an availability zone, or AZ). This is critical to ensuring the IBS continues to operate as normal when a datacenter or other failure domain fails.
Defensive Security Control | Deploy Zero Trust network partitioning around all IBSs and their dependencies in order to reduce the ability of a malware event to penetrate and cause impact.
Resilient Staffing Strategy | Identify key role and skill set dependencies and ensure there is sufficient coverage across the organization to recover from incidents of attrition.
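
The resilient design pattern above lends itself to an automated check. The sketch below, written under the assumption that a deployment inventory can be reduced to (function, failure domain) pairs, flags any IBS function whose instances all sit in one failure domain. The function and domain names are hypothetical.

```python
from collections import defaultdict


def single_domain_functions(placements):
    """Return IBS functions whose instances all sit in one failure domain."""
    domains = defaultdict(set)
    for function, domain in placements:
        domains[function].add(domain)
    # A function running in fewer than two domains is a single point of failure.
    return sorted(name for name, used in domains.items() if len(used) < 2)


# Hypothetical placements: each tuple is (IBS function, failure domain).
placements = [
    ("payment-api", "az-1"),
    ("payment-api", "az-2"),
    ("ledger-db", "az-1"),
    ("ledger-db", "az-1"),
]
print(single_domain_functions(placements))  # ['ledger-db']
```

Here ledger-db has two instances but both are in az-1, so redundancy alone does not satisfy the pattern; the instances must span failure domains.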

Property Summary

Carefully designed controls and approved design patterns, backed by risk assessment capabilities that verify best practices are applied and that controls are continuously monitored and effective, help organizations withstand incidents with minimal disruption.

Since the impact to the business is minimal, focusing on the systemic properties that provide the resilience to withstand incidents is worthwhile.

Financial impact of issues addressed by Withstand strategy: ~$0

 

Property 3: Recover

Historically, business continuity programs have focused on recovery from incidents, often on a relatively narrow set of ‘cookie cutter’ disaster scenarios that are relatively easy to identify. Organizations will still need to plan, prepare, and practice incident response procedures (and maintain business recovery procedures, business continuity plans, and incident notification procedures) as part of their operational resilience program. On its own, however, this pillar of resilience does not address many of the risks organizations face in their complex and dynamic digital ecosystems.

Anticipation, avoidance, and the ability to withstand disruptions create ‘defence in depth’ within the enterprise architecture, which hopefully minimizes the need to ever recover. However, sometimes the worst happens, and modern business continuity management practices have defined a set of principles to ensure recovery can be achieved before intolerable harm is caused.

Impact tolerances are defined for each business function or service to set the boundaries beyond which a disruption would cause intolerable harm to the institution, its customers, or potentially the system or society within which it operates. Exercises are then conducted to ensure that, in the event of ‘severe but plausible’ scenarios, the organization can recover within that impact tolerance.

Let’s take a look at examples of measures organizations commonly take to ensure they can recover from incidents, when necessary:

Recover Measure | Description
Systems Testing and Assessing Recovery Time Capabilities | Test system recovery to ensure systems can meet the IBS impact tolerances. Repeat for all necessary dependencies, and monitor and map them to ensure the end-to-end service can recover in the case of a failure within the dependency path.
Scenario Test ‘Severe But Plausible’ Failure Scenarios Against Delivery of Business Service | Formerly conducted as a manual table-top exercise, this is where organizations ‘war game’ their response to a set of failure scenarios. Historically, this exercise would include classic ‘disaster’ scenarios (interestingly, many missed the scenario of a global pandemic and depended upon the resilience and flexibility of their digital architecture to recover). Today, digital businesses must also consider complex digital and technology dependencies, such as impacts affecting their critical third parties. Scenario tests help organizations create procedures for how they will respond to specific scenarios, thereby giving them a ‘head start’ in their response and a runbook to follow.
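
To illustrate the recovery time assessment described above, the sketch below estimates the end-to-end recovery time of a service from its dependency map and compares it with an impact tolerance. It rests on simplifying assumptions that are not from any regulation: dependencies recover in parallel, a service can only begin recovering once its slowest dependency is back, and the dependency map has no cycles. All service names, timings, and the tolerance are hypothetical.

```python
def end_to_end_recovery(service, recovery_minutes, depends_on):
    """Estimate minutes until a service is restored, including its dependencies.

    Assumes dependencies recover in parallel and that the dependency
    map is acyclic; the service starts recovering after the slowest one.
    """
    slowest = max(
        (end_to_end_recovery(dep, recovery_minutes, depends_on)
         for dep in depends_on.get(service, [])),
        default=0,
    )
    return recovery_minutes[service] + slowest


# Hypothetical tested recovery times (minutes) and dependency map.
recovery_minutes = {"payments": 30, "ledger-db": 60, "auth": 20, "hsm": 45}
depends_on = {"payments": ["ledger-db", "auth"], "auth": ["hsm"]}

impact_tolerance = 120  # hypothetical impact tolerance for the IBS, in minutes
estimate = end_to_end_recovery("payments", recovery_minutes, depends_on)
print(estimate, estimate <= impact_tolerance)  # 95 True
```

In this example the critical path runs through auth and hsm (20 + 45 minutes) rather than the slower ledger-db alone, which is why the dependencies need to be mapped rather than tested in isolation.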

Property Summary

Recovery assessment and planning provide organizations with some assurance that they can recover when the worst-case scenario occurs, including more opaque scenarios occurring within the digital architecture. Recovery, however, generally means that the business experiences some impact, hopefully only up to the threshold of ‘intolerable harm’.

Financial impact of issues addressed with Recovery: $0 < impact < $Intolerable harm

 

Property 4: Adapt

Adaptation can be triggered by changes to risk tolerances, by changes to the risk or threat landscape, or by an incident that has been experienced. Within organizations that proactively manage their risks, changes to underlying properties within the system (for example, a reduction in the recovery capabilities or stability of a third party) will trigger a reassessment and adaptation without impact. However, for many organizations adaptation only occurs following an impact event (in the form of a retrospective following recovery, or even as a mitigating measure to reduce ongoing impact).

Financial impact of issues addressed with Proactive Adaptation: $0 

Financial impact of issues addressed with Adaptation after incident: a single occurrence of ($0 < impact < $Intolerable harm), as recurrences of the incident should then be avoided.
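
The difference between the two adaptation figures above can be made concrete with a little arithmetic. The sketch below compares the cumulative impact of the same underlying issue under proactive adaptation, adaptation after the first incident, and no adaptation at all. The third case is an assumption added for contrast, and the dollar figure and occurrence count are purely illustrative.

```python
def cumulative_impact(impact_per_incident, occurrences, adaptation):
    """Total impact over repeated occurrences of the same underlying issue."""
    if adaptation == "proactive":
        return 0  # risk addressed before any incident occurs
    if adaptation == "after_incident":
        return impact_per_incident * min(occurrences, 1)  # paid at most once
    if adaptation == "none":
        # Assumption for contrast: every recurrence costs the same again.
        return impact_per_incident * occurrences
    raise ValueError(f"unknown adaptation strategy: {adaptation}")


# Illustrative numbers only: a $250,000 impact that would otherwise recur 3 times.
for strategy in ("proactive", "after_incident", "none"):
    print(strategy, cumulative_impact(250_000, 3, strategy))
```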

 

The Business Case for Resilience

Let us now consider the business case associated with each type of resilience property within a system.

As you can see within the incident timelines, anticipation and avoidance enable organizations to address potential risks before they precipitate an incident, thereby avoiding any cost.

Resilient systems that can withstand ‘severe but plausible’ incidents also tend to avoid the cost of an incident, with perhaps some minor operational costs and disruption from retrospectives and clean-up activities. Beyond this, recovery from and adaptation to an incident can result in costs and disruptions up to the threshold of ‘intolerable harm’.

So, the business case here is clear: organizations that focus upon the proactive aspects of resilience – anticipation, avoidance, the ability to withstand, and proactive adaptation to new risks – are less likely to experience the expenses and disruptions associated with operational and cyber incidents.

 

What do we mean by cost?

So far, we have discussed cost in abstract and somewhat qualitative terms. However, in forming a business case we need to consider cost and impact in a number of ways, including:

1. Direct Financial Cost:

The costs incurred by the business, including recovery costs, loss of business, customer compensation, and fines or sanctions resulting from an incident.

While direct costs are relatively easy to measure after an incident, they can be difficult to predict. The impact of public fines (which for the UK bank TSB amounted to $100M for a single incident, including $100K in personal fines against the CIO) and compensation is often further magnified by reputational impact.

2. Reputation:

During my time at Goldman, our approach to risk management (and therefore our technology resilience) was informed by what was, at the time, the first business principle: “Our assets are our people, capital and reputation. If any of these is ever diminished, the last is the most difficult to restore.”

For many organizations, the reputational impact of an incident or disruption (even where recovery can be achieved within an impact tolerance) will far outweigh the direct financial cost.

3. Velocity:

Incidents impact confidence, and also an organization’s attitude to risk. A former CISO colleague of mine used to draw an analogy between informed risk management and the powerful brakes of a sports car: “great brakes let you go faster.” That is, organizations with accurate situational awareness can innovate and change faster, with high confidence that they understand the potential consequences.

In contrast, organizations that regularly experience operational issues often implement more restrictive change procedures, or even change freezes, in an attempt to regain control. In addition to these self-imposed impacts on velocity, each regulator reserves the right to impose enforcement actions, including restrictions on business activity and technology governance.

4. Overall Risk Reduction:

There are many cultural and behavioural benefits of understanding and managing an organization’s risks. In the extreme, however, the problem can be so severe that organizations are required to reserve additional capital to account for increased operational risk. This scenario occurred recently with DBS Bank of Singapore, which was required to set aside an additional $1.2bn due to technology resilience failures.

5. Personal Impact / Disqualifications:

Operational resilience regulations strongly focus on governance and accountability of senior executives for resilience programs and operational excellence. We have seen significant personal fines from UK regulators (FCA and PRA levying a $100K personal fine on the CIO of TSB) and we should expect to see those increase alongside the potential for disqualification from future practice.

Within this final section, I have tried to introduce some quantitative specifics relating to ICT risk management in order to help in building the business case for operational and cyber resilience. In future articles I will discuss risk quantification approaches, such as the FAIR model for risk quantification.

 

Summary 

Operational resilience requires organizations to understand their critical business functions and important business services, and to take appropriate measures to ensure they can continue to be delivered during severe but plausible incidents.

In order to achieve this outcome, organizations must take measures to anticipate, avoid, withstand, recover from, and adapt to risks, threats, and incidents. However, when we look at the properties and benefits of each approach, we find that organizations focusing on anticipation, avoidance, the capability to withstand, and proactive adaptation are likely to streamline their operations and embrace resilience as a strategic differentiator.

 

Resource

The following are some of the notable regulations driving today’s resilience pressure:

Regulation | Scope | Full Enforcement Date | Maximum Penalties
Bank of England PS21/3 | UK Financial Services sector | March 2025 | Undefined; likely to be in the tens of millions of dollars.
DORA | EU Financial Services sector and critical third parties | January 2025 | 1% of gross global revenue.
APRA CPS 230 | Australian Financial Services sector | July 2025 | Undefined; likely to be in the tens of millions of dollars.
EU NIS2 | EU Critical Infrastructure providers | December 2024 | 2% of gross revenue.
MAS Financial Services and Markets Act | Singapore Financial Services sector | June 2024 | SGD 1 million ($740K) per violation.

 

About Marc Woolward:

Marc Woolward is CTO and CISO at vArmour. He has spent his career architecting and operating mission-critical architectures within the financial services industry, and now works with many of our customers across all sectors of critical national infrastructure. Through his career, Marc has worked to address resilience requirements within critical infrastructure across networking, cloud, SaaS, middleware, voice communications, and enterprise architecture. This blog series reflects on learnings from a career spent working with resilience in complex, mission-critical systems and references notable research and academic publications. In particular, Marc recommends ‘Resilience Engineering – Concepts and Precepts’ by Hollnagel, Woods and Leveson for its presentation of the wealth of studies into resilience in complex systems.
