The Single Cloud, Multi-Region Disaster Recovery Fallacy

Aug 7, 2024

Inspiration

This article was inspired by the recent catastrophic Azure failure, which impacted our deployments and rendered us helpless even though we had done everything expected in terms of disaster planning: we had a warm DR in place in the form of paired Azure regions, and we carried out regular DR drills. Disaster recovery architectures are supposed to be an insurance policy that greatly mitigates risk during catastrophic situations, and meticulous planning and high costs go into implementing them. However, when disaster struck, all that effort was in vain, since we could not even access the Azure portal, let alone kick off the DR process. The simple truth is that, despite the assurances provided by cloud providers, there are many shared components and single points of failure across regions that can render their services totally inaccessible. Even in principle, relying on a single cloud provider as DR insurance is placing all the eggs in the same basket. Multiple regions do not necessarily mean multiple baskets; it is the same technology basket. As cloud architects, we have to find solutions to this problem, and I am convinced that multi-cloud deployments are the way to go if you want the assurances a DR architecture is meant to provide.

Introduction

Disaster recovery, in the context of cloud deployments, refers to the practice of incorporating redundancy into the deployment in order to keep services available in the event of catastrophic failures. It is common practice for organizations to opt for multi-region deployments within a single cloud provider in order to meet disaster recovery requirements. While this approach appears robust, it has significant challenges and potential pitfalls. This article delves into the limitations and risks of depending solely on a single cloud provider’s multi-region deployment for disaster recovery, and argues the case for a multi-cloud strategy to achieve true resilience and reliability.

Multi-region deployments distribute applications and data across multiple geographical locations, called regions, within the same cloud provider. The primary goals are enhancing availability and providing a failover mechanism, so that services remain accessible even if one region experiences a total outage; reducing latency by serving users from the nearest region; and meeting regulatory compliance by storing data in specific geographic locations to satisfy data residency requirements.

A good example of a multi-vendor, multi-technology fault-tolerant solution is the triple-redundant systems built for critical areas of aircraft, such as the hydraulic controls. These systems have to be built by different vendors and different teams, with nothing shared between them. In other words, “de-standardizing” is required to achieve improved reliability, and standardization can lead to less fault tolerance.

The Fallacy of Single Cloud, Multi-Region DR

Typically, DR architectures are implemented as multi-region deployments on a single cloud provider because the provider offers native DR services, which reduces the complexity of the infrastructure and the operational overhead. Despite these apparent benefits, relying solely on such a deployment has several critical flaws that can defeat the whole purpose of having a DR architecture in place. We will examine a few of these flaws in the rest of this article.

Multi-regional Outages and Cascading Failures

While cloud providers strive to isolate failures to individual regions, history has shown that multi-regional outages can and do happen owing to software bugs, misconfigurations, or large-scale events affecting the underlying infrastructure. The recent Microsoft Azure outage on July 30, 2024, is a stark example of the limitations of single cloud, multi-region DR deployment architectures. In this case, a distributed denial-of-service (DDoS) attack overwhelmed Microsoft’s Azure services, resulting in a multi-region outage that lasted nearly 10 hours. The attack triggered Azure’s protection mechanisms, and a flaw in these defences amplified the impact rather than mitigating it. The incident disrupted various Microsoft services, including Microsoft 365 products and Azure, highlighting the vulnerability of even robust cloud infrastructures. During the outage, users had difficulty accessing essential services, including the Azure portal itself, which made it impossible even to open support tickets. Another historical example is the significant AWS outage of February 2017, in which problems with the S3 service in the US-EAST-1 region disrupted a large number of dependent services and websites. Such incidents highlight the vulnerability of relying on a single cloud provider, as failures can propagate across regions and dependent services, leading to widespread service disruptions.

Warm DR Resource Challenges

In a warm DR setup, a secondary region is pre-provisioned with resources but remains inactive until a failover is triggered. This approach can lead to several challenges:

Resource Unavailability: During an outage in a region, the sudden surge in demand for resources in the secondary or paired region can lead to resource exhaustion. If most cloud deployments use a warm DR, paired-region architecture, then when the primary region fails, all of them place significant demand on the secondary region at the same time. This could result in capacity issues, and even a cascading failure, in the paired region.
Cost Implications: Maintaining a warm DR setup incurs ongoing costs, which can be substantial, especially if the DR environment requires scaling up rapidly during a failover. Unfortunately, all the money spent on keeping the warm DR setup running may be wasted if the DR setup turns out to be unusable during the rare event of a disaster.
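
The capacity concern above can be made concrete with some simple arithmetic. The following is a minimal sketch, not a model of any real provider: the Tenant type, the failover_shortfall function, and all vCPU figures are hypothetical, and it assumes every tenant tries to scale its warm standby up to full primary size at the same moment.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    primary_vcpus: int  # vCPUs running in the failed primary region
    warm_vcpus: int     # vCPUs already pre-provisioned in the paired region


def failover_shortfall(tenants: list[Tenant], spare_vcpus: int) -> int:
    """vCPUs requested during failover that the paired region cannot supply."""
    requested = sum(max(0, t.primary_vcpus - t.warm_vcpus) for t in tenants)
    return max(0, requested - spare_vcpus)


# Illustrative numbers only (hypothetical tenants and capacity).
tenants = [Tenant("shop", 1000, 250), Tenant("billing", 400, 200)]
print(failover_shortfall(tenants, spare_vcpus=800))  # 950 requested, 800 spare -> 150
```

Any non-zero result means some tenants would not get the capacity their DR plan assumes, which is exactly the situation a drill run in isolation will not reveal.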

Single Points of Failure

Even in a multi-region deployment, there can be single points of failure that compromise the entire system:

Shared Services: Services such as identity and access management (IAM) or global DNS can become single points of failure. If these services are disrupted, they can affect the entire deployment regardless of regional distribution.
Network Infrastructure: The network infrastructure that connects different regions within a cloud provider, including backbone networks, transit links, and network management services, can also be a single point of failure.
Global DNS Services: DNS services are essential for routing traffic to the appropriate endpoints and regions. Many cloud providers offer global DNS services to manage domain names and direct traffic efficiently. A failure in the global DNS service can result in users being unable to access services, as their requests cannot be correctly routed. This could lead to widespread service outages, regardless of the regional distribution of resources.
Control Plane Dependencies: The control plane, which manages the cloud infrastructure, can become a single point of failure. If the control plane is compromised, it can impact multiple regions simultaneously.
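
One way to surface such single points of failure is to list what each regional deployment depends on and look for the overlap. This is only a sketch under assumed inputs: the shared_dependencies function and the dependency names are hypothetical, and a real inventory would have to be assembled from your own architecture documentation.

```python
def shared_dependencies(deployments: dict[str, set[str]]) -> set[str]:
    """Return the dependencies that every regional deployment relies on."""
    dependency_sets = list(deployments.values())
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)


# Hypothetical inventory for a primary region and its paired DR region.
deployments = {
    "primary-region": {"global-dns", "identity", "control-plane", "compute-primary"},
    "paired-region": {"global-dns", "identity", "control-plane", "compute-paired"},
}
print(sorted(shared_dependencies(deployments)))
# ['control-plane', 'global-dns', 'identity']
```

Everything in the result is a component whose failure would take out the primary and the DR site together, no matter how far apart the regions are.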

The Case for Multi-Cloud DR

To achieve true disaster resilience, organizations must consider a multi-cloud strategy, leveraging multiple cloud providers to mitigate the risks of single cloud, multi-region deployments discussed above.

Multi-Cloud DR deployments offer the following advantages compared to single cloud DR deployments:

True Redundancy and improved resilience: By distributing workloads across multiple cloud providers, organizations can ensure that a failure in one provider does not impact the entire system. Single points of failure and shared-service outages within one provider no longer affect overall availability, which should improve availability in general.
Cost Optimization and price negotiation leverage: Multi-cloud strategies allow for competitive pricing and the ability to use the best services from each provider, optimizing costs and performance.
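
The redundancy argument can be illustrated with a back-of-the-envelope availability calculation. This is a simplified sketch with made-up numbers: it assumes the two sites fail independently, that failover itself always works, and that a single shared component (such as the provider's identity service or control plane) sits in front of both sites in the single cloud case. The pair_availability function and its parameters are hypothetical names.

```python
def pair_availability(primary: float, secondary: float, shared: float = 1.0) -> float:
    """Availability of a primary/DR pair that both depend on one shared component.

    Assumes independent failures and a failover that always succeeds.
    """
    either_site_up = 1.0 - (1.0 - primary) * (1.0 - secondary)
    return shared * either_site_up


# Illustrative figures only: each site 99.9% available, shared component 99.95%.
single_cloud = pair_availability(0.999, 0.999, shared=0.9995)
multi_cloud = pair_availability(0.999, 0.999)  # no component shared between providers
print(f"single cloud, two regions: {single_cloud:.6f}")
print(f"two clouds:                {multi_cloud:.6f}")
```

With these figures the single cloud pair can never be more available than its shared component, roughly 0.9995, while the two-cloud pair approaches 0.999999. In practice a multi-cloud setup also has shared elements of its own, such as DNS and whatever triggers the failover, so the real gain would be smaller than this idealized number.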

Implementing Multi-Cloud DR

To successfully implement a multi-cloud DR strategy, organizations should consider the following aspects:

Utilize Cloud-Agnostic Services: Instead of relying solely on cloud-native services, use cloud-agnostic services that are compatible across multiple cloud providers, so that the same solution can be deployed on multiple clouds with minimal changes. For instance, a cloud-agnostic container orchestration tool like Kubernetes can make it easier to migrate and manage workloads across different cloud environments, and databases like PostgreSQL or MySQL can be deployed on any major cloud provider, reducing dependency on provider-specific services.
Standardize Infrastructure and Application Deployment: Using Infrastructure as Code (IaC) tools like Terraform, which support multi-cloud deployments, instead of cloud-specific IaC technologies, allows you to standardize the deployment process across different cloud providers.
Leverage Multi-Cloud Management Platforms: Utilizing platforms that provide a unified management interface, or a single pane of glass, for multiple cloud environments, would allow you to simplify operations and monitoring.
Data Synchronization and Replication: Implementing robust data synchronization and replication mechanisms would ensure that data remains consistent and accessible across different cloud providers.
DR Drills: Regularly testing failover procedures and automating the failover process would minimize downtime during an actual disaster.
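
As an illustration of automating the failover decision, here is a minimal sketch of choosing the active site from an ordered list of providers. The choose_active function, the site names, and the stubbed health results are all hypothetical. The sketch also assumes the decision logic runs somewhere that does not depend on the primary provider, since a controller hosted in the failed cloud would be unreachable when it is needed most.

```python
from typing import Callable, Optional, Sequence


def choose_active(sites: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy site in priority order, or None if all are down."""
    for site in sites:
        if is_healthy(site):
            return site
    return None


# Stubbed probe results for a drill; a real probe would call each site's health endpoint.
status = {"primary-cloud": False, "dr-cloud": True}
print(choose_active(["primary-cloud", "dr-cloud"], lambda site: status[site]))  # dr-cloud
```

Passing the probe in as a function keeps the decision logic testable during DR drills without touching production endpoints.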

While the advantages of multi-cloud DR are clear, implementing such a strategy comes with its own set of challenges:

Complexity in Management: Managing multiple cloud environments can be complex. Each cloud provider has its own set of tools, APIs, and services, requiring teams to have expertise in multiple platforms. Utilizing a cloud-agnostic toolkit as discussed above would minimize this complexity.
Increased Costs: While multi-cloud can optimize costs in the long run, the initial setup and ongoing management can be expensive. Organizations need to consider the costs associated with data transfer, replication, and maintaining redundant environments.
Data Consistency and Synchronization: Ensuring data consistency across multiple clouds can be challenging. Implementing robust synchronization mechanisms is critical to avoid data discrepancies. Specialized tooling may be necessary for maintaining data consistency, and eventual consistency may be acceptable as long as the replication lag stays within the required recovery point objective (RPO).
Security and Compliance: Managing security and compliance across multiple cloud environments requires rigorous policies and controls. Organizations must ensure that their security measures are consistent and compliant with regulatory requirements across all cloud providers.
Interoperability Issues: Despite the use of cloud-agnostic services, there may still be interoperability issues. Some features and functionalities might work differently or may not be available across all cloud providers, necessitating custom solutions.
Monitoring and Visibility: Achieving unified monitoring and visibility across multiple cloud environments can be challenging. Organizations need comprehensive monitoring tools that provide a holistic view of their infrastructure and applications across all clouds.
Disaster Recovery Testing: Regularly testing DR procedures in a multi-cloud environment can be complex and resource-intensive. Ensuring that failover mechanisms work seamlessly across different cloud providers requires thorough and frequent testing.
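
The relationship between replication lag and the RPO mentioned above can be checked mechanically. The following is a small sketch under assumed inputs: the rpo_breached function is hypothetical, and it presumes you can obtain the timestamp of the last change successfully replicated to the other cloud from your replication tooling.

```python
from datetime import datetime, timedelta, timezone


def rpo_breached(last_replicated_at: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if a failover right now would lose more data than the RPO allows."""
    return now - last_replicated_at > rpo


# Illustrative timestamps: the DR copy is 12 minutes behind the primary.
now = datetime(2024, 8, 7, 12, 0, tzinfo=timezone.utc)
last_sync = now - timedelta(minutes=12)
print(rpo_breached(last_sync, now, rpo=timedelta(minutes=15)))  # False: lag is within the RPO
print(rpo_breached(last_sync, now, rpo=timedelta(minutes=5)))   # True: lag exceeds the RPO
```

A check like this could feed the unified monitoring described above, so that a widening replication gap is noticed before a disaster rather than during one.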

Conclusion

Single cloud, multi-region deployments can provide a false sense of security in terms of disaster recovery, as witnessed in incidents where the failure of shared infrastructure caused total outages or multi-region failures. Even when only one or a few regions become unavailable, the healthy regions may become resource starved or overwhelmed when the workloads and traffic from the affected regions fail over. In order to have true DR in place, a multi-cloud DR strategy is essential.



Written by Afkham Azeez
