Updated Oct 28, 2025

Where is your weakest link?


What the AWS outage on October 20th 2025 reminds us about architectural dependencies and real-world resilience

For several hours that day, millions of users were staring at loading screens or frozen apps. Communication channels went quiet. Games, financial platforms, healthcare apps – and even beds – were all down.

The culprit this time: an outage in Amazon Web Services' us-east-1 region. AWS is the dominant global provider of cloud compute, and us-east-1 is its oldest and busiest region. Countless businesses depend day-to-day on services hosted in that region.

Whole-region failures are nothing new, of course. AWS aims to isolate its infrastructure as much as possible, dividing each region into multiple “Availability Zones” (AZs) – each being one or a small number of individual datacenters – so that failures are confined to a single AZ. Nonetheless, there are inevitably elements, such as those in the control plane, whose failure can affect the entire region.

Each time this happens, however, we are reminded how dependent all of our online services have become on underlying cloud services. Any real-world service is itself dependent on many other constituent services, and the trend over time is for all of those to have broader and deeper dependencies on cloud service providers – most often AWS.

Chart: the impact of the October 20 2025 AWS US-EAST-1 outage on some realtime platforms and apps

This creates a new problem for anyone attempting to build and operate a service that is resilient to regional outages. Although the principles are well understood, such a system always comes with additional complexity and cost, and both its design and its operation are a challenge. The new problem for builders of such services is that, as well as architecting their own system, they also need to ensure that all of their dependencies are designed to survive regional outages. When any underlying layer of the stack fails, everything built on top of it feels the impact.

This is one of the hardest questions for modern architectures: how do you avoid inheriting your infrastructure provider's single points of failure? Better still, how can you become immune? In this post we set out a bit more about how Ably itself approaches this question.

Building for 100% uptime

Ably was designed for this challenge from the outset. We recognise that any downtime for Ably is downtime for all of our customers. The platform is therefore architected to be fault-tolerant at the component level, to be available in multiple regions with no client affinity or dependence on any single region, and to support operational disabling of regions in the event of disruption. Ably’s client libraries can route to different regions in response to individual errors, so regional fallback is entirely automated.
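
To make that fallback behaviour concrete, here is a minimal sketch of client-side regional fallback in TypeScript. It is not Ably’s SDK internals; the endpoint hostnames, ordering, and timeout are illustrative assumptions only.

```typescript
// A minimal sketch of client-side regional fallback, assuming a primary
// endpoint plus an ordered list of fallback endpoints. Hostnames and the
// retry policy are illustrative, not Ably's actual SDK behaviour.
const ENDPOINTS = [
  "wss://realtime.example.com",   // primary (globally routed)
  "wss://us-east-2.example.com",  // regional fallbacks
  "wss://eu-west-1.example.com",
];

async function connectWithFallback(): Promise<WebSocket> {
  let lastError: unknown;
  for (const url of ENDPOINTS) {
    try {
      return await openSocket(url, 5_000); // per-endpoint timeout
    } catch (err) {
      lastError = err;                     // try the next region
    }
  }
  throw lastError;
}

function openSocket(url: string, timeoutMs: number): Promise<WebSocket> {
  return new Promise((resolve, reject) => {
    const ws = new WebSocket(url);
    const timer = setTimeout(() => { ws.close(); reject(new Error("timeout")); }, timeoutMs);
    ws.onopen = () => { clearTimeout(timer); resolve(ws); };
    ws.onerror = () => { clearTimeout(timer); reject(new Error(`failed: ${url}`)); };
  });
}
```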

Ably builds on services from other providers, and these also have the potential to undermine Ably’s availability. Just like any other service, we therefore need to understand how we address risks to Ably’s operation from any failure of those services. Instead of relying only on what a provider’s SLA suggests, we assess the actual likelihood of failure based on the best available understanding of how the service in question is actually architected.

Redundancy: Truly independent data centers

If all your infrastructure lives in one region, that region becomes your single point of failure. True redundancy means multiple data centers in completely separate geographic locations, not just different buildings in the same city.

Ably operates across 11 core routing data centers in separate geographic regions – when US-EAST-1 fails, traffic flows through US-WEST, EU-WEST, and AP-SOUTHEAST independently. Combined with 700+ edge acceleration points globally, which also ensure proximity to your end users, this creates geographic redundancy that continues operating as a federated whole despite degradation in any individual region. Persisted data is kept in multiple regions – in at least three continents unless region-locked – with no single region holding authoritative state that others depend on.

Map illustrating locations of AWS regions

Always available: Active-active, not cold standby

Cold standby means you're offline while systems "fail over." By the time you switch to backup infrastructure, customers have already left. Active-active means every region is already serving traffic – there's no switch, just automatic rerouting.

Every Ably region is live and processing messages constantly. When one region degrades, the system simply stops routing new connections there. SDKs detect failures within 30 seconds and automatically reconnect to healthy data centers. When needed, entire customer bases can be redirected to different regions within 2 minutes. Users see transparent failover, not error messages.
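
As a rough illustration of that detection window, the sketch below supervises a connection and reroutes after a sustained outage. The polling interval, the 30-second window, and the reconnect helper (the connectWithFallback() sketch earlier) are illustrative assumptions, not Ably’s implementation.

```typescript
// A sketch of client-side failure detection: if a connection stays down
// longer than the detection window, reconnect via the fallback path rather
// than waiting for the original region to recover. The 30-second window
// mirrors the figure quoted above; everything else is illustrative.
const DETECTION_WINDOW_MS = 30_000;

function superviseConnection(ws: WebSocket, reconnect: () => Promise<WebSocket>) {
  let downSince: number | null = null;

  const timer = setInterval(async () => {
    if (ws.readyState === WebSocket.OPEN) {
      downSince = null;           // healthy: reset the clock
      return;
    }
    downSince ??= Date.now();     // first observation of the outage
    if (Date.now() - downSince >= DETECTION_WINDOW_MS) {
      clearInterval(timer);
      ws = await reconnect();     // route to a healthy region
      superviseConnection(ws, reconnect);
    }
  }, 1_000);
}
```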

During the October 2025 AWS outage, Ably demonstrated this in real time. At around 1200 UTC, new connections stopped routing through the impacted US-EAST-1 region and were transparently handled by US-EAST-2 instead. Existing sessions in US-EAST-1 continued uninterrupted, with normal latency throughout the incident. To end users, the event never happened.

Diversity: Multiple underlying service providers

Ably makes use of multiple providers for underlying services wherever possible. This is most evident in the access network: where necessary, client connections to Ably can fall back to a distinct top-level domain, a distinct DNS provider, a distinct CDN, and even a distinct client protocol. This means that none of Ably’s dependencies with a global control plane is a single point of failure.
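
The sketch below expresses that idea of provider diversity as data: primary and fallback access paths that share no TLD, DNS provider, CDN, or protocol. The domains and provider labels are hypothetical placeholders, not Ably’s actual configuration.

```typescript
// Hypothetical access-path configuration illustrating provider diversity.
// The key property: the fallback path shares no control plane with the
// primary (different TLD, DNS provider, CDN, and client protocol).
interface AccessPath {
  endpoint: string;                          // hostname the client connects to
  dnsProvider: string;                       // who serves the DNS zone
  cdn: string;                               // which edge network fronts it
  protocol: "websocket" | "http-long-poll";  // client transport
}

const accessPaths: AccessPath[] = [
  { endpoint: "realtime.example.com", dnsProvider: "provider-a", cdn: "cdn-a", protocol: "websocket" },
  { endpoint: "fallback.example.net", dnsProvider: "provider-b", cdn: "cdn-b", protocol: "http-long-poll" },
];

// Clients try each path in order; a regional or control-plane incident at
// any single provider cannot take out every path at once.
```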

This is an example of where we don’t simply rely on the SLA for the underlying service: where there is a single global control plane, there is a material risk that the service – or at least its control plane – will be unavailable in a regional incident.

Monitoring: Detect before users notice

If you only know about problems when customers complain, you've already lost them. Continuous monitoring detects degradation early enough to reroute traffic before anyone experiences errors.

Health checks run constantly across all Ably regions, detecting degradation and triggering automatic traffic shifts before users see impact. Combined with 50% capacity margin, this means sudden surges, from panic refreshes to viral traffic spikes, get absorbed without cascading failures.
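
A minimal sketch of this kind of health-driven routing follows, assuming a fixed region list, a naive consecutive-failure threshold, and a hypothetical per-region health endpoint; none of these reflect Ably’s actual monitoring system.

```typescript
// Sketch: periodically probe each region and stop routing new connections
// to any region that fails several checks in a row. Region names, the
// threshold, and the health endpoint are illustrative assumptions.
type Region = "us-east-1" | "us-east-2" | "eu-west-1" | "ap-southeast-1";

const REGIONS: Region[] = ["us-east-1", "us-east-2", "eu-west-1", "ap-southeast-1"];
const FAILURE_THRESHOLD = 3;

const failures = new Map<Region, number>();
const routable = new Set<Region>(REGIONS); // only these receive new connections

async function checkHealth(region: Region): Promise<boolean> {
  try {
    const res = await fetch(`https://${region}.health.example.com/ping`, {
      signal: AbortSignal.timeout(2_000),
    });
    return res.ok;
  } catch {
    return false;
  }
}

setInterval(async () => {
  for (const region of REGIONS) {
    const ok = await checkHealth(region);
    const count = ok ? 0 : (failures.get(region) ?? 0) + 1;
    failures.set(region, count);
    if (count >= FAILURE_THRESHOLD) routable.delete(region); // shift traffic away
    else if (ok) routable.add(region);                       // restore once healthy
  }
}, 10_000);
```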

Transparency in resilience reporting

We believe reliability and transparency go hand in hand. When outages happen elsewhere, our goal isn’t to hide impact behind “green pixels” – it’s to design systems where customer-facing impact is truly negligible.

During the October 20th AWS US-EAST-1 incident, Ably’s routing layer automatically redirected traffic before users experienced errors or visible latency. In fact, latency impact during the outage averaged just 3ms across US cities - well below normal variance and imperceptible to users.

We still monitor, log, and internally track all reroutes and latency shifts, but because traffic hand-off occurred within seconds and user sessions were uninterrupted, there was no material degradation to report publicly.

Client-side intelligence: Smart SDKs

Even perfect infrastructure has network hiccups. Smart clients that can detect and route around problems mean users stay connected even when infrastructure has momentary issues.

Ably's SDKs understand infrastructure topology and detect failures automatically, reconnecting to healthy regions without application code changes. This removes the dependency on perfect infrastructure: clients themselves become resilient to infrastructure imperfections.
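
For illustration, this is roughly what “no application code changes” looks like from the application side, assuming the ably-js realtime SDK (event and option names may differ between SDK versions): the application can observe connection state if it wants to, but reconnection and regional fallback happen inside the SDK.

```typescript
import * as Ably from "ably";

async function main() {
  // Placeholder key; in practice this comes from your Ably dashboard.
  const client = new Ably.Realtime({ key: "<API_KEY>" });

  // Optional: observe connection state changes. The app does not need to
  // act on them – the SDK handles retries and fallback itself.
  client.connection.on("statechange", (change) => {
    console.log(`connection: ${change.previous} -> ${change.current}`);
  });

  const channel = client.channels.get("example-channel");
  await channel.subscribe((message) => {
    console.log("received:", message.data);
  });
}

main();
```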

During both the December 2021 and October 2025 AWS outages, this architecture kept platforms like Ably online while others waited for status updates. Genius Sports made this architectural choice non-negotiable.


Reliability isn’t promised – it’s designed

SLAs make reliability sound contractual. In reality, they’re insurance for expected failure: a way to remedy downtime rather than prevent it. The real question isn’t what percentage of uptime a provider’s SLA anticipates, but whether the system itself is designed to stay up when everything around it breaks.

What matters is architecture, not assurance. If your platform’s resilience depends on someone else’s SLA, then that can be your weakest link. SLA remedies – although they at least mean the provider has some skin in the game – are unlikely to be compensation commensurate with your losses from an incident, whether those losses are monetary or in user trust.

At Ably, 100% uptime isn’t a marketing goal – it’s an architectural one. The system is built never to depend on a single region, component, or provider. Each part of the stack assumes failure will happen, and is designed to make it invisible to users when it does.

Reliability isn’t defined by the promises you make after an outage. It’s proven by the ones users never notice.

See how globally distributed architecture eliminates regional dependencies:
→ Four pillars of dependability

Read more about how Ably handled the October 20th AWS outage:
→ Blog post

Check live system health across all Ably regions:
→ status.ably.com
