Edge network

The edge network serves as the entry point to Ably's realtime infrastructure and is core to the platform's ability to deliver reliable, low-latency messaging services globally.

Why the edge network matters

The edge network must ensure that clients can reliably connect to the closest available datacenter, providing optimal performance whilst maintaining resilience against various types of failures. Given that Ably is mission-critical for many applications, the reliability of the edge network is essential; any connectivity issues can potentially impact thousands or millions of end users, making the design and operation of the edge network a key element of Ably's infrastructure.

Edge network architecture

Ably's edge network is designed to provide reliable, low-latency connectivity to the Ably platform from anywhere in the world. The architecture leverages multiple AWS services and custom infrastructure components to ensure resilience, performance, and security.

Global deployment across multiple regions

Ably operates in multiple regions around the world, with servers distributed across more than 15 physical datacenters within the AWS network. This global distribution ensures that clients can connect to a nearby datacenter, minimizing the latency between the client and the entry point to Ably's network.

Each datacenter operates independently, with its own dedicated infrastructure for handling client connections, message processing, and storage. This isolation ensures that issues affecting one datacenter do not propagate to others, maintaining service availability even during regional outages.

The global distribution of datacenters also enables Ably to comply with data residency requirements by keeping data within specific geographic regions when required. This is important for customers operating in regulated industries or jurisdictions with strict data sovereignty laws.

AWS CloudFront and network load balancers

Client connections to Ably are handled through a combination of AWS CloudFront for global edge distribution and AWS EC2 Network Load Balancers (NLBs) for traffic routing within each region.

AWS CloudFront serves as the primary entry point to Ably's network, with over 700 edge locations globally. When a client attempts to connect to Ably, the request is first routed to the nearest CloudFront edge location. This reduces the public internet transit time, as clients connect to a nearby edge node rather than traversing the entire distance to an Ably datacenter.

Behind CloudFront, each Ably region employs AWS Network Load Balancers to distribute traffic to the application servers that handle client connections. NLBs operate at the transport layer and can handle millions of requests per second while maintaining ultra-low latencies. The NLBs distribute traffic to a fleet of frontend servers responsible for establishing and maintaining client connections, authenticating clients, and routing messages to the appropriate backend services. This architecture allows Ably to efficiently scale the number of concurrent connections by adding more frontend servers as needed.

DNS organization and latency-based routing

Ably uses DNS-based latency routing to direct clients to the nearest available datacenter. The primary endpoints for client connections and HTTP requests is main.realtime.ably.net.

When a client performs a DNS lookup for this endpoint, the DNS service resolves to the closest datacenter, among those that are currently enabled, to the client's location. This latency-based routing ensures that clients connect to the datacenter with the lowest network latency, maximising the responsiveness of the service.

Ably's DNS configuration uses a TTL of 60 seconds, allowing for relatively quick rerouting of traffic if a datacenter becomes unhealthy. The health of each datacenter is continuously monitored, and if issues are detected, Ably can modify the DNS routing to direct traffic away from the affected datacenter within minutes.

In addition to latency-based routing, Ably's DNS infrastructure is designed for resilience; an alternate DNS provider is used for fallback endpoints (see below) to eliminate single points of failure. This ensures that DNS resolution continues to function even if the provider for the primary endpoint experiences issues.

Fallback endpoints and secondary domains

While DNS-based routing is effective for normal operations, it has limitations in certain failure scenarios. It can take several minutes to determine that a region is unhealthy and disable it, and DNS changes can take time to propagate due to client-side caching and other factors. Furthermore, DNS itself can be a point of failure if the entire domain becomes unavailable.

To address these limitations, Ably implements a fallback mechanism in all client libraries. If a client cannot connect to the primary endpoint, it will automatically attempt to connect using alternative endpoints. These fallback endpoints include direct connections to specific datacenters, bypassing the default latency-based DNS routing.

Ably operates a completely segregated secondary domain, ably-realtime.com, which is designed to cater to any DNS failures on the primary ably.net domain. This secondary domain uses a different DNS provider form the primary domain, ensuring that domain-level issues do not affect both domains simultaneously.

When using fallbacks, clients may connect to a datacenter that is not the closest to them, potentially increasing latency by up to 150ms. However, Ably prioritizes service availability over optimal latency in failure scenarios, ensuring that clients can maintain connectivity even during significant infrastructure disruptions.

Protocol-level resilience

Ably's client libraries implement intelligent retry logic to handle temporary connectivity issues. The retry strategy adapts based on the type of failure, using shorter intervals for transient issues and longer intervals for persistent problems.

When a connection is re-established after a disconnection, Ably's client libraries attempt to recover the previous connection state, including message continuity. This minimizes the impact of brief connectivity interruptions on the application.

These client-side mechanisms work in conjunction with the server-side infrastructure to provide reliable connectivity. By providing automated reconnection, and automated use of fallback endpoints at the client level, Ably can maintain service availability even during significant infrastructure disruptions that might otherwise cause extended outages.

Infrastructure redundancy

Beyond DNS and client-side fallbacks, Ably's infrastructure includes multiple layers of redundancy to ensure continuous service availability. Each datacenter contains redundant servers, network paths, and storage systems to eliminate single points of failure within the datacenter.

The multi-region architecture provides an additional layer of redundancy, as the failure of an entire datacenter does not impact the availability of the service as a whole. Clients can continue to connect via other datacenters, albeit potentially with increased latency.

The use of CloudFront as the front-end distribution layer adds another level of redundancy, as CloudFront itself is designed to be highly available, with redundant capacity across multiple edge locations. This means that even if specific edge locations experience issues, clients can still connect via other edge points.

Protection against denial of service attacks

As a critical infrastructure provider for many businesses, Ably is a potential target for denial of service (DoS) attacks. The edge network includes multiple layers of protection to detect, mitigate, and absorb such attacks without impacting legitimate traffic.

Layer 3/4 DoS protection with AWS Shield Advanced

Ably utilizes AWS Shield Advanced to protect against volumetric DDoS attacks at the network and transport layers (Layers 3 and 4). These attacks typically involve flooding the target with a high volume of packets or connection attempts, aiming to exhaust network bandwidth or connection handling capacity.

AWS Shield Advanced provides automatic attack detection by continuously monitoring traffic patterns to detect anomalies that indicate an attack is underway. When an attack is detected, Shield automatically applies mitigations to filter malicious traffic while allowing legitimate traffic to pass through.

During large-scale attacks, AWS can leverage its global network infrastructure to absorb and diffuse attack traffic, preventing it from reaching Ably's resources. AWS Shield includes access to the AWS Shield Response Team (SRT), providing expert assistance during attacks.

These protections ensure that even large-scale volumetric attacks, which might otherwise overwhelm traditional network defenses, can be effectively mitigated before reaching Ably's infrastructure.

Layer 7 protection with CloudFront WAF

While Shield Advanced is effective against volumetric attacks, more sophisticated attacks target the application layer (Layer 7) by exploiting specific application behaviors or vulnerabilities. Ably uses the CloudFront Web Application Firewall (WAF) in part to protect against these application-layer attacks.

CloudFront WAF enables Ably to define rules that identify and block malicious requests based on their characteristics. These rules can target common attack patterns, suspicious user agents, or anomalous request rates. WAF can limit the rate of requests from specific IP addresses or address ranges, preventing a single source from overwhelming the service.

If an attack originates primarily from a specific geographic region, the WAF can temporarily block traffic from that region while allowing legitimate traffic from other regions to continue. WAF includes features to identify and manage bot traffic, distinguishing between legitimate bots (such as search engine crawlers) and malicious bots attempting to scrape data or perform attacks.

WAF works in conjunction with other AWS security services, allowing for coordinated defense across multiple layers of the infrastructure. By combining network-level protections from Shield Advanced with application-level protections from CloudFront WAF, Ably creates a comprehensive defense against a wide range of denial of service attacks.

Scalable infrastructure for absorbing traffic spikes

In addition to filtering malicious traffic, Ably's infrastructure is designed to scale rapidly in response to traffic spikes, whether caused by legitimate usage patterns or attack traffic that bypasses filtering. Ably's frontend services can automatically scale to handle increased connection rates, accommodating both legitimate traffic spikes and attack traffic.

Client connections are managed in a way that prevents a single client (or group of clients) from consuming excessive resources, limiting the impact of attempts to exhaust system resources. Ably implements rate limiting at multiple levels, including per account, per application, per key, per token, and per IP address, preventing a single entity from generating excessive traffic.

The connection handling infrastructure includes mechanisms to detect and mitigate connection-based attacks, such as slow loris attacks or connection flooding. In extreme cases, Ably can implement graceful load shedding, temporarily rejecting new connections or non-critical requests to preserve core functionality for existing connections.

This scalable infrastructure ensures that even if attack traffic reaches Ably's application servers, the impact on legitimate users is minimized. By rapidly scaling to absorb traffic spikes and implementing resource controls at multiple levels, Ably maintains service availability even during significant attack events.

Global connectivity considerations

Operating a truly global edge network presents unique challenges and considerations that Ably has addressed in its architecture.

Geographic coverage and edge presence

Ably's edge network is designed to provide low-latency connectivity from virtually anywhere in the world. The combination of AWS CloudFront's extensive network and Ably's strategically located datacenters ensures that most clients can connect to an entry point with minimal network latency.

The geographic distribution of these access points is continuously reviewed and optimized based on traffic patterns and customer requirements. As Ably's user base expands into new regions, additional datacenters can be added to maintain optimal performance.

In regions such as China with unique connectivity challenges, Ably has implemented specific strategies to ensure reliable service. These may include partnerships with local providers, alternative routing arrangements, and region-specific optimizations.

Protocol support and transport optimization

The edge network supports a range of transport protocols to ensure connectivity across diverse client environments. While WebSockets are the preferred transport for realtime connections, the system also supports comet (long-polling HTTP) for clients or network settings where WebSockets are not available or are blocked.

Transport selection is handled automatically by Ably's client libraries, which will use the most efficient available transport based on the client's environment and network conditions. This ensures that clients can connect to Ably even in restrictive network environments, such as corporate networks with restrictive firewalls or proxy servers.

Connection parameters, such as heartbeat intervals and protocol overhead, can also be configured to non-default values as required by the client type to balance reliability with resource efficiency.

Handling network transitions and interruptions

Mobile clients frequently experience network transitions as they move between different connectivity types (WiFi to cellular, for example) or through areas with varying signal strength. The edge network is designed to handle these transitions gracefully, working in conjunction with Ably's connection recovery mechanisms to maintain service continuity.

When a client transitions between networks it will lose its existing connections. The edge network's design allows clients to quickly reestablish connections and resume their session state, minimizing the impact of such transitions on the application experience.

Operational monitoring and management

Maintaining a reliable global edge network requires sophisticated monitoring and management capabilities, which Ably has implemented as part of its operational infrastructure.

Health monitoring and alerting

Ably continuously monitors the health of all components in the edge network, from edge locations and datacenters to individual servers and network paths. This monitoring includes both active probes that test connectivity and performance from various global locations, and passive monitoring that analyzes actual client connection patterns and performance metrics.

Automated alerting systems detect anomalies in the edge network's behavior, triggering notifications to Ably's operations team when issues are detected. The alerting system is designed to identify potential problems before they impact service availability, allowing for proactive intervention.

The health monitoring system also feeds into the automated traffic routing system, directing clients away from unhealthy regions or components without requiring manual intervention. This ensures that clients are always routed to the healthiest available entry point.

Capacity planning and scaling

All Ably infrastructure scales on demand so as to be able to handle the ambient traffic level. The infrastructure is typically provisioned with significant headroom above current demand, ensuring that sudden increases in traffic can be accommodated without impacting service quality. Additionally, the auto-scaling capabilities of the cloud infrastructure allow for dynamic adjustments to capacity based on realtime demand.

For planned high-traffic events with anticipated usage spikes, additional capacity can be provisioned in advance to ensure that all traffic can be handled seamlessly. The operations team works with customers to understand their traffic expectations and ensure that all relevant infrastructure components are provisioned for significant events.