Reflections on Resilience: Managing Drift

 

In the last article, we explored the importance and benefits of the proactive aspects of resilience, in particular anticipation, avoidance, and adaptation. To achieve these operational properties within a complex system (such as an enterprise’s digital estate), we need to be able to recognize when the system is moving from a ‘good’ state towards a state where an incident is more likely to occur. In complex, high entropy systems, observing and understanding ‘drift’ is critical, particularly where that drift might affect important and critical services or functions.

So, what is drift, and how can it be managed effectively without subjecting operations teams to an overwhelming avalanche of noise? We’ll explore both questions in this article.

 

Defining Drift

The excellent publication ‘Resilience Engineering – Concepts and Precepts’ focuses on drift in Chapter 8 (Engineering Resilience into Safety Critical Systems). In particular, the authors summarize the challenge as:

“Major accidents are usually preceded by periods where the organization drifts toward states of increasing risk until the events occur that lead to a loss (Rasmussen, 1997). Our goal is to determine how to design resilient systems that respond to the pressures and influences causing the drift to states of higher risk or, if that is not possible, to design continuous risk management systems to detect the drift and assist in formulating appropriate responses before the loss event occurs.”

To summarize, we either need to architect systems that recognize and respond to drifts towards risk thresholds, or at least recognize material drift in order to provide the service owner with a timely ‘tap on the shoulder’.

Figure 1: Drift and Risk Tolerance

The problem is that within complex digital ecosystems, there is too much activity and ‘noise’ to easily identify such drift unless you are able to define, and automate the detection of, the events that should trigger the ‘tap on the shoulder’.
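As a rough illustration of what automating that ‘tap on the shoulder’ could look like, here is a minimal Python sketch. It assumes drift has already been reduced to a single numeric risk score sampled over time, which is a simplification of our own for illustration; the names `DriftAlert` and `first_drift_alert`, and the `margin` and `window` parameters, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DriftAlert:
    index: int    # position of the observation that triggered the alert
    value: float  # risk score at that position
    reason: str


def first_drift_alert(
    scores: Sequence[float],
    tolerance: float,
    margin: float = 0.2,  # assumed: warn within 20% of the tolerance
    window: int = 3,      # assumed: consecutive rises that count as drift
) -> Optional[DriftAlert]:
    """Return the first alert, or None if the scores never drift materially.

    An alert fires either when the tolerance itself is breached, or earlier,
    when the score has risen for `window` consecutive steps and has entered
    the warning band just below the tolerance.
    """
    warn_level = tolerance * (1 - margin)
    rising = 0
    for i in range(1, len(scores)):
        # Count consecutive increases; any dip resets the run (treated as noise).
        rising = rising + 1 if scores[i] > scores[i - 1] else 0
        if scores[i] >= tolerance:
            return DriftAlert(i, scores[i], "tolerance breached")
        if rising >= window and scores[i] >= warn_level:
            return DriftAlert(i, scores[i], "sustained drift toward tolerance")
    return None
```

Requiring both a sustained rise and proximity to the tolerance is one way to stay quiet during ordinary fluctuation while still alerting before the threshold is crossed.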

If we focus on the digital enterprise and the complex software systems that underpin its Critical or Important Functions (CIFs), we can take a number of approaches, two of which we’ll explore here in a little more detail:

 

Approach 1: Define and Continuously Monitor System Against Risk Tolerances

Most organizations define a set of standards or policies, now becoming known as ‘Risk Tolerances’, designed to ensure the safe and continuous operation of their environments.

Examples of these standards could include safeguards to ensure operational continuity and integrity and also resilience against cyber threats, such as:

  1. All changes must be reviewed, tested, and approved before execution.
  2. Critical and important business functions must have access restricted utilizing least privilege access controls.
  3. All administrative access to production systems must be logged and controlled through a hardened bastion controller.

Defining these rules is a necessary yet insufficient step. Within complex modern digital architectures, continuously verifying that risk tolerances are still being observed is a critical capability in preventing ‘drift’ toward disaster. This requires observability of deviation and drift.
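To make this concrete, the sketch below expresses two of the example standards above as machine-checkable rules and evaluates them against a stream of observed events. It is an illustration only: the `Event` fields, the `Tolerance` type, and the assumption that approval and bastion usage are already recorded on each event are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Event:
    kind: str                  # assumed event kinds: "change", "admin_session"
    service: str
    approved: bool = False     # change was reviewed, tested, and approved
    via_bastion: bool = False  # admin session went through the bastion


@dataclass(frozen=True)
class Tolerance:
    name: str
    applies_to: str            # the event kind this rule governs
    is_satisfied: Callable[[Event], bool]


# Hypothetical encodings of example standards 1 and 3 from the list above.
TOLERANCES = [
    Tolerance("changes approved before execution", "change",
              lambda e: e.approved),
    Tolerance("admin access via bastion", "admin_session",
              lambda e: e.via_bastion),
]


def violations(
    events: Iterable[Event],
    tolerances: Iterable[Tolerance] = TOLERANCES,
) -> List[Tuple[str, str]]:
    """Return (service, tolerance name) for every event that breaks a rule."""
    rules = list(tolerances)
    found = []
    for event in events:
        for rule in rules:
            if rule.applies_to == event.kind and not rule.is_satisfied(event):
                found.append((event.service, rule.name))
    return found
```

Running a check like this on every event, rather than during a periodic audit, is what turns a written policy into continuous observation.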

 

Approach 2: Establish a Baseline of ‘Known Good’ and Continuously Monitor for Material Deviations

Sometimes incidents or accidents occur that could not have been fully mitigated by adherence to existing risk tolerances. New threats emerge continuously alongside the rapid adoption of new technologies and business models. In addition, toxic combinations of conditions may arise that are difficult to predict. This is where human oversight and cognition are often important in considering changes to the threat model and threat landscape.

If organizations are able to establish a baseline of ‘approved’ behavior (or known good) for their critical or important functions and the systems that underpin them, it should become possible to identify where material changes occur. And if you can identify where material changes occur you can ask some important questions, for example:

  1. Were established change control procedures followed, including design reviews and coding analyses?
  2. Does the change increase overall attack surface?
  3. Does the change increase my concentration risk on a given provider or location?

At the crux of this challenge lies the requirement to identify ‘material’ deviations, as opposed to low-level day-to-day entropy such as the scaling of a service or a routine software upgrade (which in many cases presents as ‘noise’ and is therefore ignored).

Examples of material deviations are:

  1. Changes to functional behavior.
  2. New upstream processing dependencies.
  3. A material vulnerability affecting a dependency of an Important or Critical Service.

We have found that establishing materiality requires an ability to model the functions present within a business service, in order to establish where material changes occur at the function level. And obviously, you need to be able to do this as change occurs in order to get that timely ‘tap on the shoulder’.
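A minimal sketch of the baseline comparison might look like the following. It assumes each service can be summarized as a small snapshot of its upstream dependencies, exposed ports, replica count, and version; the `Snapshot` type and its fields are hypothetical, and a real function-level model would be considerably richer.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Snapshot:
    upstream: FrozenSet[str]       # services this one depends on
    exposed_ports: FrozenSet[int]  # crude stand-in for attack surface
    replicas: int                  # scaling level
    version: str                   # deployed software version


def material_deviations(baseline: Snapshot, observed: Snapshot) -> List[str]:
    """Compare an observed snapshot with the 'known good' baseline.

    Only structural changes are reported. Replica count and version are
    deliberately ignored, treating scaling and routine upgrades as noise.
    """
    findings = []
    for dep in sorted(observed.upstream - baseline.upstream):
        findings.append(f"new upstream dependency: {dep}")
    for port in sorted(observed.exposed_ports - baseline.exposed_ports):
        findings.append(f"newly exposed port: {port}")
    return findings
```

Each finding is a candidate for the questions above: whether change control was followed, and whether the attack surface or concentration risk has grown.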

 

Bringing it all together…

Changing business requirements, technical implementations, and external factors all act upon our critical digital systems to create drift. This drift is extremely difficult to monitor and measure in complex digital environments using traditional risk management methods (periodic mapping, tabletop exercises, and manual risk assessment procedures).

This is one of the most important and difficult challenges in delivering digital operational resilience: proactively observing drift and managing the risk it causes in order to remain resilient in high entropy environments.

Throughout my career in resilience, this has been a ‘holy grail’: to be able to report upon, affect, and address risks as they develop. I feel proud that this is one of the most significant things we have been able to deliver at vArmour.

 

About Marc Woolward:

Marc Woolward is CTO and CISO at vArmour. Throughout his career, Marc has worked to address resilience requirements within critical infrastructure across networking, cloud, SaaS, middleware, voice communications, and enterprise architecture. This blog series draws from Marc’s extensive experience and insights on resilience in complex, mission-critical systems, incorporating notable research and academic publications. In particular, Marc recommends ‘Resilience Engineering – Concepts and Precepts’ by Hollnagel, Woods and Leveson for its presentation of the wealth of studies into resilience in complex systems.
