Resilient Assurance: A Failure-Assumed Approach to Confidence and Control

Modern flight deck; layered controls, redundancy, degraded operation, continuous telemetry, human & technical resilience, operational assurance, and failure survivability

“Mature assurance is not the pursuit of perfect prevention. It is the disciplined management of inevitable failures”

Limits of Classical Assurance

Classical Assurance tends to rest on a chain of assumptions: that controls are implemented, that they operate correctly, and therefore that risk is reduced.

In practice, things are rarely so simple. Controls decay. Environments change and evolve. People bypass controls for speed or convenience. Attackers adapt in real time. Dependencies fail. And monitoring has, or develops, blind spots.

Many Assurance failures are organisational, not technical: staffing issues, skill erosion, alert fatigue, procedural drift, and unexpected incentives. Controls do not operate independently of the organisations that sustain them.

“‘Control present’ is not equivalent to ‘risk controlled’, although often this is an implicit assumption”

Reframing Assurance as Confidence, not as Certainty

A less binary approach is needed. Instead of thinking in terms of secure vs insecure, compliant vs non-compliant, or controlled vs uncontrolled, we should think in terms of confidence levels and operational trustworthiness, taking account of incomplete evidence and the limits of what can be observed about system and component behaviour.

Compliance vs Resilient Assurance

Compliance is often static and evidential; Resilient Assurance is dynamic and behavioural. Compliance may demonstrate that prescribed controls exist; Assurance must demonstrate that systems remain trustworthy when controls fail.

Control Independence and Correlated Failure

Independent controls reduce the risk of correlated failure, where a single failing element takes several controls down with it. Controls should also overlap, so that the failure of any one of them does not leave too great an exposure. The goal for the control set as a whole is degradation tolerance.

“Assurance is fundamentally about justified confidence under conditions of uncertainty”

The Failure-Assumption Principle

“In resilient systems, controls should not be assumed reliable indefinitely. They should be assumed eventually fallible”

Once we adopt this thinking, several consequences follow.

No single control should carry critical assurance weight: degradation tolerance is only achievable if the loss of any one control is survivable. This is a significant change in approach.

Important risks require independent mitigating mechanisms; if a single failure can prevent multiple controls relating to a specific risk from being effective, then we have a problem of common-mode failure. Control independence must be considered. Multiple controls are not necessarily independent if they rely upon shared infrastructure, shared telemetry, shared administration, or shared trust anchors.

Assurance, therefore, depends upon surviving individual control failures. The level of Assurance asserted, and indeed the level of residual risk, should be expressed with those failures taken into account, rather than by assuming that all controls are working as expected.
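One way to make that concrete is to ask, for each risk, how many controls would remain if any single control failed. The sketch below is illustrative only: the risk and control names are hypothetical, and a real assessment would also weigh how effective each remaining control is, not just how many are left.

```python
# Hypothetical mapping of risks to the controls that mitigate them.
RISK_CONTROLS = {
    "credential_theft": {"mfa", "conditional_access", "login_anomaly_detection"},
    "data_exfiltration": {"dlp_gateway"},
}


def coverage_after_single_failure(risk_controls):
    """For each risk, return the fewest controls left if any one control fails."""
    all_controls = set().union(*risk_controls.values())
    worst_case = {}
    for risk, controls in risk_controls.items():
        # Try failing each control in turn and keep the worst remaining count.
        worst_case[risk] = min(len(controls - {failed}) for failed in all_controls)
    return worst_case


worst_case = coverage_after_single_failure(RISK_CONTROLS)
for risk, remaining in sorted(worst_case.items()):
    status = "NO remaining control" if remaining == 0 else f"{remaining} remaining"
    print(f"{risk}: {status} after a single control failure")
```

A risk that drops to zero remaining controls is one where a single control is carrying critical assurance weight.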

Statistical Framing (no equations!)

A control does not eliminate risk; it changes the probability distribution of undesirable outcomes. Two independent controls often reduce simultaneous failure risk more effectively than a single stronger control because any individual control may fail. In some circumstances, reliable detection of failure plus confidence in recovery can suffice.

Critical risks require multiple independent means of prevention, detection, and recovery.

This can be modelled probabilistically, although the underlying principle is operational rather than mathematical [1].
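For readers who do want a feel for the numbers, here is a minimal sketch of that modelling. The failure probabilities are invented purely for illustration, and the model assumes that controls fail independently unless a shared dependency is explicitly included.

```python
def p_all_fail_independent(failure_probs):
    """Probability that every control fails at once, assuming independence."""
    result = 1.0
    for p in failure_probs:
        result *= p
    return result


def p_all_fail_with_shared_dependency(failure_probs, p_shared):
    """As above, but a shared dependency failing defeats all controls together."""
    return p_shared + (1.0 - p_shared) * p_all_fail_independent(failure_probs)


# Illustrative figures only: one strong control vs two weaker ones.
single_strong = p_all_fail_independent([0.01])
two_weaker = p_all_fail_independent([0.05, 0.05])
two_weaker_shared = p_all_fail_with_shared_dependency([0.05, 0.05], p_shared=0.02)

print(f"single strong control:       {single_strong:.5f}")
print(f"two independent controls:    {two_weaker:.5f}")
print(f"two controls, shared element: {two_weaker_shared:.5f}")
```

With these made-up figures, two weaker but independent controls beat the single strong one, while the same two controls behind a shared dependency do worse than either. The advantage comes from the independence, not from the number of controls.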

Implications for Assurance Practice

We must look deeper than the classical approach does, and at slightly different aspects.

Independence of controls needs to be explicitly considered.

Detectability of failure needs to be understood and considered. I’ve written before about Protective Monitoring negative use cases – where the absence of expected telemetry or log entries needs to generate alerts. This is similar in nature.
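A negative use case of this kind can be sketched as a check for sources that have gone quiet. The source names, expected intervals, and the grace multiplier below are all assumptions made for illustration, not a description of any particular monitoring product.

```python
from datetime import datetime, timedelta


def silent_sources(last_seen, expected_interval, now, grace=2.0):
    """Return sources whose latest event is older than grace x expected interval."""
    overdue = []
    for source, seen_at in last_seen.items():
        limit = expected_interval[source] * grace
        if now - seen_at > limit:
            overdue.append(source)
    return sorted(overdue)


# Hypothetical telemetry sources and the time each was last heard from.
now = datetime(2024, 1, 1, 12, 0)
last_seen = {
    "edr_agent": now - timedelta(minutes=3),
    "firewall_logs": now - timedelta(minutes=45),
}
expected_interval = {
    "edr_agent": timedelta(minutes=5),
    "firewall_logs": timedelta(minutes=10),
}

overdue = silent_sources(last_seen, expected_interval, now)
print("Alert on silent sources:", overdue)
```

The alert fires on absence rather than presence: silence from a control is treated as a possible control failure until shown otherwise.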

Recovery capability should be considered as part of the Assurance assessment.

Degradation behaviour should be understood: how does the risk position change when controls are degraded, and how much degradation can be tolerated?

Control validation should be a continual process. The existence of a control is evidence, but by itself it is not proof of effectiveness. A control that has never been exercised under realistic conditions provides only very limited assurance value.

Continuous Validation & Operational Assurance

Old-school Assurance tended to be point-in-time. Resilient Assurance should involve continuous telemetry and adversarial testing, drift detection, and validation under live conditions.

Assurance confidence decays over time unless continuously revalidated. That means chaos engineering, purple teaming, and developing operational resilience. And operational resilience depends as much upon organisational behaviour as it does upon technical implementation.
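The idea of decaying confidence can be illustrated with a simple model. The exponential form and the 90-day half-life below are assumptions chosen for the sketch; the article does not prescribe any particular decay curve, and a real scheme would calibrate this per control.

```python
from datetime import datetime, timedelta


def decayed_confidence(initial, last_validated, now, half_life_days=90.0):
    """Halve confidence for every half_life_days elapsed since last validation."""
    age_days = (now - last_validated).total_seconds() / 86400.0
    return initial * 0.5 ** (age_days / half_life_days)


# Hypothetical control, validated with 0.9 confidence and then left untested.
validated_on = datetime(2024, 1, 1)
after_90_days = decayed_confidence(0.9, validated_on, validated_on + timedelta(days=90))
after_180_days = decayed_confidence(0.9, validated_on, validated_on + timedelta(days=180))

print(f"after 90 days:  {after_90_days:.3f}")
print(f"after 180 days: {after_180_days:.3f}")
```

Each successful revalidation would reset the clock, which is the practical argument for continuous testing over point-in-time review.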

Example: MFA as a Single Point of Assurance Failure

An organisation may implement:

VPN access controls,
privileged access management,
cloud administration protections,
zero-trust segmentation,

but all dependent upon a single identity provider and MFA platform.

Classical Assurance may treat these as multiple independent controls.

Resilient Assurance recognises a common-mode failure risk: compromise or outage of the identity platform may simultaneously degrade multiple protective mechanisms.

The apparent diversity of controls may conceal a concentrated dependency risk.
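A dependency inventory makes this kind of concentration visible. In the sketch below the control and dependency names are hypothetical, loosely modelled on the example above; the point is simply to invert the mapping and list anything that several controls rely upon.

```python
from collections import defaultdict

# Hypothetical inventory: each control and the components it depends on.
CONTROL_DEPENDENCIES = {
    "vpn_access": {"identity_provider", "mfa_platform", "vpn_gateway"},
    "privileged_access_management": {"identity_provider", "mfa_platform", "pam_vault"},
    "cloud_admin_protections": {"identity_provider", "mfa_platform"},
    "zero_trust_segmentation": {"identity_provider", "mfa_platform", "policy_engine"},
}


def common_mode_dependencies(control_deps, min_controls=2):
    """Return dependencies relied upon by at least min_controls controls."""
    dependants = defaultdict(set)
    for control, deps in control_deps.items():
        for dep in deps:
            dependants[dep].add(control)
    return {
        dep: sorted(controls)
        for dep, controls in dependants.items()
        if len(controls) >= min_controls
    }


shared = common_mode_dependencies(CONTROL_DEPENDENCIES)
for dep in sorted(shared):
    print(f"{dep}: shared by {len(shared[dep])} controls")
```

Here four nominally separate controls all resolve to the same identity provider and MFA platform, so they should not be counted as four independent mitigations.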

Conclusion

“Resilient Assurance is not proving absence of failure. Resilient Assurance is demonstrating resilience despite inevitable failure”

Footnotes

[1] See – no equations and I didn’t even mention Bayesian Confidence. Although fair warning: there may be a follow-up article on “Assurance as Bayesian Updating”