Digital Operational Resilience – a blessing in disguise?


Downtime is expensive. The average cost of unplanned application downtime at Tier One financial institutions exceeds $2.5bn every year, according to IDC. The Operational Resilience of your business now depends more than ever on the availability of your IT systems. Critical systems and the applications which support them need to be available at all times, especially in a crisis. The Digital Operational Resilience Act presents an opportunity to review your systems and processes and to implement changes to them. You can both reduce your financial exposure and demonstrate to regulators the robustness of your resilience processes.

In this blog, we discuss how you can quantify the financial impact of downtime and how an application-centric approach to availability can solve the resilience problem.

Should you care about the cost of downtime?

According to a Gartner report titled “Why Business Leaders Don’t Care About the Cost of Downtime”, published in April 2019, “through 2021, 65% of I&O leaders will underinvest in their availability and recovery needs because they use estimated cost-of-downtime metrics.”

We believe that this means these leaders aren’t right-sizing their availability because they don’t know what investment will deliver the right availability and recovery for their critical business applications.

Focusing on the wrong metrics for availability and recovery can improve overall availability while still leaving the application exposed to an outage at a critical time, with the resulting loss of transactions and therefore revenue. High ‘average’ availability is not enough to guarantee availability at financially critical moments.
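To make the point about averages concrete, here is a minimal sketch in Python. Every figure in it (the one-hour outage and the two revenue-per-minute rates) is a hypothetical value we have picked purely for illustration, not data from IDC or Gartner. It shows the same outage producing the same annual availability figure but a very different financial impact depending on when it happens.

```python
# Illustrative only: all figures below are assumptions, not measured data.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability_pct(outage_minutes: float) -> float:
    """Average availability over one year, as a percentage."""
    return 100.0 * (1 - outage_minutes / MINUTES_PER_YEAR)


def outage_cost(outage_minutes: float, revenue_per_minute: float) -> float:
    """Revenue lost if no transactions complete for the whole outage."""
    return outage_minutes * revenue_per_minute


outage = 60              # a single one-hour outage in the year (assumed)
quiet_rate = 500.0       # assumed revenue per minute overnight
peak_rate = 50_000.0     # assumed revenue per minute in a critical window

# The availability figure is identical in both scenarios...
print(f"availability:   {availability_pct(outage):.3f}%")
# ...but the cost differs by two orders of magnitude.
print(f"cost overnight: ${outage_cost(outage, quiet_rate):,.0f}")
print(f"cost at peak:   ${outage_cost(outage, peak_rate):,.0f}")
```

With these assumed numbers the application reports roughly 99.989% availability either way, yet the peak-time outage costs one hundred times more than the overnight one. An average uptime metric cannot distinguish the two cases.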

Adjusting the approach to availability

In Cloudsoft’s experience the key to getting the investment right and improving application availability and recovery lies in adjusting the approach:

  • Be application-centric

Metrics and controls must be aligned to the software application, not its individual components, technology layers and the often disparate teams aligned to them. This might be directed by the Shared Site Reliability Engineering team, the importance of which is discussed in our eBook.

By unifying application resilience and recovery at the application level, it’s possible to “look down” at all the dependent, interconnected systems that make up the application and understand how the metrics and behaviour of each component affect the whole system. It is even possible to implement controls at the application layer that make lower-level systems more resilient.

  • Define business tolerances

The focus for improved availability has shifted in recent years from improving probabilistic uptime to accepting that failure is inevitable, and investing instead in resilience to, and recovery from, those inevitable events.

In their report “Why Business Leaders Don’t Care About the Cost of Downtime”, Gartner observed that there is a period of outage for an application after which there is an inflection point where the impact dramatically increases. This is measured by the Maximum Tolerable Period of Disruption (MTPOD).

For each application, understand where the outage impact inflection point lies, then build resilience and recovery so that the application is restored within its MTPOD threshold.

  • Automation, automation, automation

The pinnacle of application resilience involves intelligent automation tools that transform the runbook from a Word document to a model in source control, extending techniques such as infrastructure-as-code. These tools enable you to codify your resilience and recovery policies at the application level and limit or remove the human “bump in the road”.

The best-performing tools seek to model logical application components so that recovery processes are easily and consistently visualised, while integrating the breadth of technologies an organisation uses. This means availability and recovery are optimised, and strategies are not only reused but also improved, automatically tested and rolled out. As a result, all stakeholders have a clear view of what is running and where, not only in peacetime, but also in the (much-less-likely) event of an incident.
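As a rough illustration of what moving the runbook from a Word document into source control could look like, here is a minimal Python sketch. It is not Cloudsoft’s tooling or any real product’s API: the class names `RecoveryStep` and `ApplicationPolicy`, the `payments-gateway` application, the step durations and the 30-minute MTPOD are all hypothetical, and the model assumes recovery steps run one after another.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RecoveryStep:
    """One step of a recovery runbook (hypothetical model)."""
    name: str
    expected_minutes: float


@dataclass
class ApplicationPolicy:
    """Resilience policy defined per application, not per component."""
    application: str
    mtpod_minutes: float  # Maximum Tolerable Period of Disruption
    steps: list[RecoveryStep] = field(default_factory=list)

    def estimated_recovery_minutes(self) -> float:
        # Assumption: steps are executed sequentially, so durations add up.
        return sum(step.expected_minutes for step in self.steps)

    def within_mtpod(self) -> bool:
        return self.estimated_recovery_minutes() <= self.mtpod_minutes


# Example policy with assumed values, suitable for review in a pull request.
payments = ApplicationPolicy(
    application="payments-gateway",
    mtpod_minutes=30,
    steps=[
        RecoveryStep("detect failure and page on-call", 5),
        RecoveryStep("fail over database to standby", 10),
        RecoveryStep("redeploy application tier", 10),
        RecoveryStep("run smoke tests", 3),
    ],
)

print(payments.estimated_recovery_minutes(), payments.within_mtpod())
```

Because the policy is plain code, a check such as `within_mtpod()` could run automatically on every change, so a new recovery step that pushes the estimate past the tolerance is caught at review time rather than during an incident.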

Applying this approach for Digital Operational Resilience

Ensure digital operational resilience by guaranteeing rock-solid availability for all your critical digital infrastructure and processes. In our latest eBook, ‘Resilience Through Application Management’, we explore approaches to ensuring application availability. Drawing on years of experience in the field of application availability, Alex Heneveld, Cloudsoft co-founder and CTO, and Alasdair Hodge, Principal Engineer & Solutions Architect at Cloudsoft, have produced the Application Resilience Maturity Model, which will enable you to identify concrete next steps towards eliminating application downtime within your organisation.

Blog Authored by: Cloudsoft

