When the cloud goes down

The 2026 Amazon Middle East outage – caused by physical infrastructure damage from drone strikes – was unusual in its origin but entirely familiar in its effect. Businesses that had concentrated their workloads in a single region found their applications offline and their data unreachable. Services that companies assumed were simply "in the cloud" turned out to be very much in specific buildings, connected to specific power grids, within specific geopolitical boundaries.

What the incident made vivid was a principle that holds across every cloud outage regardless of cause: cloud providers guarantee the availability of their infrastructure, not the resilience of your application. When a region fails, the contractual remedy is service credits – a fraction of your monthly bill. The commercial impact of being offline for a day is yours to absorb. Responsibility for building systems that survive provider failures sits with the customer, not the provider.

Three outages, three different failure modes

Major outages at production scale have become more frequent. In a 12-month window spanning late 2025 into 2026, three significant failures demonstrated something that should be read as a system-level lesson rather than a run of bad luck: the causes were entirely different each time.

AWS us-east-1, October 2025. A technical update to the DynamoDB API introduced an error in the service's DNS management – specifically, a race condition in which a stale routing plan overwrote a newer one before being deleted by cleanup automation. Applications that depended on DynamoDB couldn't resolve its IP addresses and lost connectivity. Because DynamoDB is an internal dependency for a large number of AWS services, the failure cascaded rapidly: 113 AWS services were affected in total, and external platforms ranging from Slack and Zoom to Delta Air Lines and Coinbase went offline simultaneously. The outage lasted approximately three hours, with processing backlogs continuing for several hours beyond the restoration of DNS. The root cause was a subtle timing issue in a routine infrastructure update – not a dramatic failure, but a cascade that reached effectively global scale from a single region.

Cloudflare, November 18, 2025. Cloudflare's own postmortem described this as their worst incident since 2019. The trigger was a permissions change applied to a ClickHouse analytics database. The change caused a query to return duplicate rows from two databases simultaneously, which doubled the size of the configuration file used by Cloudflare's Bot Management system – pushing it beyond the 200-feature processing limit enforced by the proxy software. The proxy panicked and restarted. The resulting failure affected CDN delivery, Turnstile, Workers KV, the dashboard, Access and Email Security. The incident lasted approximately six hours. The cause was not a cyberattack or a dramatic infrastructure event – it was a database permissions change that propagated in an unexpected way through a dependency chain that hadn't been stress-tested against that specific scenario.

Cloudflare, December 5, 2025. Less than three weeks later, Cloudflare had a second significant incident. A security vulnerability required rapid remediation, and engineers deployed a killswitch via global configuration – bypassing the gradual rollout process normally used for changes of that scope. The configuration change exposed a nil pointer exception in the Lua-based ruleset module used by the older generation of proxy software. When a rule of type "execute" was skipped rather than applied, the module crashed. Approximately 28% of HTTP traffic through Cloudflare was affected for around 25 minutes – a faster recovery than November's outage, but a clear demonstration that remediation actions taken under time pressure can themselves become failure events.
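The November incident in particular suggests a defensive pattern: validate generated configuration against the consumer's hard limits before it propagates. A minimal sketch of that guard – the limit value mirrors the constraint described above, but the function and field names are illustrative, not Cloudflare's actual format:

```python
# Sketch: refuse to propagate a generated config that breaches the
# downstream consumer's hard limit. The 200-feature limit mirrors the
# proxy constraint described above; names are illustrative.

FEATURE_LIMIT = 200

def validate_feature_config(features: list[str]) -> list[str]:
    """Drop duplicate rows, then reject configs over the hard limit."""
    deduped = list(dict.fromkeys(features))  # dedupe, preserving order
    if len(deduped) > FEATURE_LIMIT:
        raise ValueError(
            f"config has {len(deduped)} features, limit is {FEATURE_LIMIT}; "
            "refusing to propagate"
        )
    return deduped

# A duplicated query result doubles the row count but not the distinct
# features - deduplication alone would have kept this config valid:
doubled = [f"feature_{i}" for i in range(150)] * 2
assert len(validate_feature_config(doubled)) == 150
```

A check like this sits between config generation and fleet-wide distribution, turning a silent cascade into a loud, contained failure at the point of origin.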

Earlier incidents established the same pattern. The November 2020 AWS Kinesis cascade took down Cognito, Lambda and CloudWatch simultaneously when a capacity scaling operation pushed a thread count past an operating-system limit. The June 2022 Cloudflare BGP routing loop took 19 data centres offline in minutes via a planned network enhancement. The July 2024 CrowdStrike update crashed 8.5 million Windows devices when a configuration file logic error bypassed normal testing gates. The 2025 incidents confirm the pattern hasn't changed: failure modes are diverse, triggers are often routine operations, and the blast radius of a dependency failure consistently exceeds the failure itself.

Single-region and single-provider dependency

Most businesses using cloud infrastructure do so through a single provider and, within that provider, a single region. It's a natural outcome of how cloud adoption happens: pick a provider, pick the nearest or most convenient region, build there. Cost, simplicity and familiarity all push in the same direction.

The result is that single-region deployments inherit every failure mode of that region – power, networking, physical events, software cascades. There's no fallback because nothing was built to fall back to.

Single-provider dependency adds a second layer of risk that's easy to overlook. If your primary application runs on AWS and your backup also runs on AWS – even in a different region – a provider-wide software or network event affects both simultaneously. The CrowdStrike incident demonstrated that dependency on shared software layers can make multi-region deployments with a single provider fail in unison.

Geographic concentration matters too. Multiple data centres operated by different providers may share power substations, fibre routes or network peering points. A failure at that shared physical layer can affect multiple providers simultaneously, making architectures that appear independent on paper considerably less so in practice.

Multi-cloud and multi-region architecture

Genuine resilience requires making deliberate choices about where workloads run and how failures propagate. The main architectural options sit on a spectrum, with different cost and complexity trade-offs at each point.

Multi-region, single provider. Running active workloads in two or more regions of the same cloud provider addresses regional failures – hardware faults, localised network events, physical incidents. It doesn't address provider-wide software failures, as the CrowdStrike incident showed for shared software layers. This is the most accessible starting point for most organisations: it uses a single cloud management plane, familiar tooling and doesn't require maintaining expertise across multiple platforms. For businesses whose primary risk is regional failure rather than provider-wide failure, it's often the right trade-off.

Active-active vs active-passive. Active-active deployments run live traffic across multiple regions simultaneously, with load balancers distributing requests and data replication keeping regions in sync. Failover is automatic; users typically don't notice a regional event. Active-passive deployments run a full or partial copy of the production environment in a secondary location, kept warm but not serving live traffic, with manual or automated promotion required during a failure. Active-active provides better RTO but significantly higher cost and operational complexity – particularly around data consistency, where simultaneous writes across regions require careful conflict resolution. The right choice depends on what downtime your business can tolerate and what you're willing to spend.
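The conflict-resolution point can be made concrete. One common (and deliberately lossy) strategy for concurrent active-active writes is last-writer-wins on a timestamp, with a deterministic tiebreak so all regions converge. A minimal sketch, with illustrative names throughout:

```python
# Sketch: last-writer-wins conflict resolution for active-active writes.
# Both regions apply the same merge rule and converge on one value.
from dataclasses import dataclass

@dataclass
class Write:
    key: str
    value: str
    ts: float      # wall-clock or hybrid logical timestamp
    region: str    # tiebreaker so replicas converge deterministically

def merge(a: Write, b: Write) -> Write:
    """Pick the later write; break timestamp ties by region name."""
    if a.ts != b.ts:
        return a if a.ts > b.ts else b
    return a if a.region > b.region else b

# Concurrent writes to the same key in two regions:
eu = Write("cart:42", "3 items", ts=100.0, region="eu-west-1")
us = Write("cart:42", "2 items", ts=100.5, region="us-east-1")
assert merge(eu, us).value == "2 items"  # the later write wins everywhere
assert merge(us, eu) == merge(eu, us)    # arrival order doesn't matter
```

Note what last-writer-wins gives up: the earlier write is silently discarded. Whether that's acceptable is exactly the kind of per-workload decision the paragraph above describes.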

Multi-cloud. Running workloads across two or more cloud providers – AWS and Azure, for instance – provides the strongest protection against provider-specific failures, including software incidents that affect a single provider's infrastructure globally. The trade-off is real: different management planes, different service abstractions, different security models and a substantially larger operational burden. Infrastructure-as-code tooling (Terraform, Pulumi) and container orchestration (Kubernetes) reduce the friction of multi-cloud deployments, but they don't eliminate it. For most organisations, multi-cloud makes sense for critical workloads with aggressive RTO requirements, rather than as a blanket architectural approach.

DNS-based failover and global load balancing. Services like AWS Route 53, Cloudflare Load Balancing and Azure Traffic Manager can route users to healthy environments automatically based on health checks. Combined with data replication across regions or providers, they enable failover without requiring manual DNS changes – reducing effective recovery time from hours to minutes. The reliability of the failover mechanism itself matters: if your DNS is managed through a single provider and that provider fails, health-check-based routing fails with it. Using a DNS provider that is independent of your primary cloud provider is a worthwhile precaution.
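The routing logic these services apply on your behalf is simple to state: probe each environment, send traffic to the first healthy one in preference order. A minimal sketch of that behaviour – the region names and probe function are illustrative stand-ins, not any provider's API:

```python
# Sketch: health-check-driven failover, the decision a DNS-based load
# balancer makes on your behalf. Region names and the probe are stand-ins.

REGIONS = ["eu-west-1", "eu-central-1"]  # ordered by preference

def pick_region(probe) -> str:
    """Route to the first healthy region; fail loudly if none respond."""
    for region in REGIONS:
        if probe(region):          # a real check hits an HTTP endpoint
            return region          # with a short timeout and retry policy
    raise RuntimeError("no healthy region available")

# Normal operation: the primary is healthy and takes traffic.
assert pick_region(lambda r: True) == "eu-west-1"
# Primary down: traffic fails over without a manual DNS change.
assert pick_region(lambda r: r != "eu-west-1") == "eu-central-1"
```

The caveat in the paragraph above applies directly to this sketch: if the component running `pick_region` is itself inside the failed provider, the failover never executes.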

Data consistency. The hardest problem in multi-region architecture is not compute – it's data. Replicating read-only or append-only workloads is straightforward. Ensuring that writes made to a primary database in one region are visible to an application reading from a replica in another region, without inconsistency during a failover event, requires careful design. For many systems, the practical answer is asynchronous replication with an accepted RPO – you may lose a small window of writes during a regional failure. Understanding and agreeing that RPO as a business decision, before the architecture is built, is what makes the trade-off explicit rather than accidental.
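Why asynchronous replication implies a non-zero RPO can be shown in a few lines: writes are acknowledged at the primary before the replica has applied them, so a regional failure loses whatever sits in the replication lag window. A simplified model (class and record names are illustrative):

```python
# Sketch: acknowledged-but-unreplicated writes are the RPO window.

class AsyncReplicatedStore:
    def __init__(self):
        self.primary = []     # writes acknowledged to clients
        self.replica = []     # writes the secondary region has applied
        self.in_flight = []   # replication backlog (the lag window)

    def write(self, record):
        self.primary.append(record)    # ack to the client immediately
        self.in_flight.append(record)  # ship to the replica later

    def replicate_one(self):
        if self.in_flight:
            self.replica.append(self.in_flight.pop(0))

store = AsyncReplicatedStore()
for i in range(5):
    store.write(f"order-{i}")
store.replicate_one()
store.replicate_one()

# The primary region fails here; the replica is promoted as-is.
lost = len(store.primary) - len(store.replica)
assert lost == 3  # three acknowledged writes fell inside the RPO window
```

Synchronous replication closes that window by refusing to acknowledge until the replica has the write, at the cost of adding cross-region latency to every transaction – which is why the business-level RPO agreement matters before the architecture is chosen.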

What business continuity planning actually requires

For cloud-dependent businesses, genuine continuity planning starts with two numbers: Recovery Time Objective and Recovery Point Objective. RTO is how long the business can be offline. RPO is how much data it can afford to lose. Both are business decisions, not technical ones, and they need to be agreed with leadership before an incident – not estimated under pressure during one.

From those numbers, the technical architecture follows. A four-hour RTO might be addressed by a warm standby that can be manually promoted. A 15-minute RTO requires active-active deployment with automatic failover. The cost difference between these two is significant, and it's a decision that should be made deliberately.

Backup strategies need to be specific. Data backed up to the same region as production doesn't help when that region is down. Backups should be in a different region or with a different provider, encrypted and tested. A backup that hasn't been successfully restored in a controlled exercise is not a backup – it's an untested assumption about what will work under pressure.
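The minimum a restore exercise should prove is that the restored copy matches the source byte for byte. A sketch of that verification step – the paths and the backup-and-restore step itself are stand-ins for whatever tooling you actually use:

```python
# Sketch: verify a restored backup against the source by checksum.
import hashlib
import pathlib
import tempfile

def checksum(path: pathlib.Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify_restore(source: pathlib.Path, restored: pathlib.Path) -> bool:
    """True only if the restored file is byte-identical to the source."""
    return checksum(source) == checksum(restored)

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "prod.db"
    src.write_bytes(b"customer records")

    restored = pathlib.Path(tmp) / "restored.db"
    restored.write_bytes(src.read_bytes())  # stands in for backup + restore
    assert verify_restore(src, restored)

    restored.write_bytes(b"customer recorts")  # a corrupted restore
    assert not verify_restore(src, restored)   # is caught, not assumed good
```

For databases, a stronger version of the same idea is restoring into a scratch environment and running application-level queries against it; the principle is identical – the test passes only when the restored data is actually usable.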

Criticality triage matters. Not every system needs the same level of resilience. Prioritise investment based on what the business genuinely can't function without – and accept that internal tooling, lower-priority applications and non-customer-facing systems can tolerate longer recovery windows.

Building a resilience programme that holds up

Map your cloud dependencies before you need to. Identify every cloud service your business uses, which regions and providers they run on, and what happens to operations if each one is unavailable. Most businesses find this exercise surfaces hidden dependencies – authentication services on one provider, CDN on another, monitoring on a third – that weren't considered when resilience was designed.
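A dependency map doesn't need specialist tooling to be useful; even a plain graph of "service X depends on Y" answers the key question – what goes down if this component fails? A minimal sketch, with entirely illustrative service names:

```python
# Sketch: a dependency inventory as a graph, queried for blast radius.
# Service and provider names below are illustrative examples.

DEPENDS_ON = {
    "checkout":       ["auth", "payments-api"],
    "auth":           ["aws-eu-west-1"],
    "payments-api":   ["aws-eu-west-1", "cloudflare-cdn"],
    "marketing-site": ["cloudflare-cdn"],
}

def blast_radius(failed: str) -> set[str]:
    """Every service transitively dependent on the failed component."""
    down = {failed}
    changed = True
    while changed:
        changed = False
        for svc, deps in DEPENDS_ON.items():
            if svc not in down and down & set(deps):
                down.add(svc)
                changed = True
    return down - {failed}

# A single region failing takes out more than the services hosted there:
assert blast_radius("aws-eu-west-1") == {"auth", "payments-api", "checkout"}
# And a CDN provider failing crosses what looks like a provider boundary:
assert blast_radius("cloudflare-cdn") == {
    "payments-api", "checkout", "marketing-site",
}
```

The exercise of filling in a table like `DEPENDS_ON` honestly – including SaaS vendors, DNS and monitoring – is usually where the hidden dependencies mentioned above surface.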

Review SLAs with a clear eye. Cloud provider SLAs offer financial credits when availability drops below a threshold. Those credits are typically a small fraction of your monthly bill, not compensation for lost revenue or reputational damage. The SLA is not your continuity plan.

Test your recovery procedures. When did you last simulate a regional failure and verify that failover actually worked? Testing under controlled conditions reveals gaps that paper planning doesn't. The organisations that handled the AWS Middle East incident with minimal disruption had run exercises. The ones that were caught out hadn't.

Consider connectivity resilience alongside compute resilience. An organisation with multi-region cloud infrastructure and a single internet connection still has a straightforward failure point. Secondary connectivity – a separate line from a different provider, or cellular failover – addresses a class of failure that cloud architecture can't.

Understand your software vendors' resilience posture. SaaS tools have their own cloud dependencies. Asking where a vendor hosts their platform, whether they have DR arrangements and what their historical availability looks like is a reasonable part of procurement due diligence.

The governance question

Cloud resilience is a business risk, not just a technical design problem. IT teams can build multi-region architectures and run DR exercises, but they can't set acceptable downtime thresholds or allocate budget to active-active deployments without sign-off from business leadership.

In many organisations, cloud infrastructure risk occupies an ambiguous space – too technical for the board to engage with directly, too commercial for IT to decide alone. The result is that resilience decisions get made by default, based on cost minimisation and convenience, rather than by explicit choice aligned with the business's actual risk appetite.

Treating cloud reliability as a managed risk means bringing it into the same governance framework as other operational risks: assessing likelihood and impact, deciding what level of protection is proportionate, assigning clear ownership and reviewing it on a defined schedule. The organisations that came through the recent wave of major outages with minimal disruption weren't lucky – they had made explicit decisions about resilience and invested accordingly.

"We use the cloud" is not a business continuity plan. The cloud is infrastructure, and infrastructure fails. The plan is what happens when it does.

Route B helps businesses assess and improve their cloud and IT infrastructure resilience – from dependency audits to DR planning and multi-region architecture.

Get in Touch