8 min readUpdated Oct 11, 2023

Message durability and quality of service across a large-scale distributed system

Message durability and quality of service across a large-scale distributed system
Jo StichburyJo Stichbury

Ably is a distributed pub/sub edge messaging platform that acts as the broker in realtime data streaming pipelines. Publishers send messages to Ably, and we deliver those messages to subscribers. We guarantee:

  • regional and global fault tolerance to ensure message availability and survivability
  • superior messaging quality-of-service through four pillars of dependability.

In this article, we describe how Ably can make such guarantees in the face of network outages. We cover the redundancy built into our server clusters and the design we have used to ensure message ordering and exactly-once delivery. We also explain how publishers and subscribers can recover gracefully following network disconnection.

Fault tolerance for datacenter outages

Ably is hosted on AWS EC2 infrastructure and uses its associated services. Our main realtime production cluster is deployed to seven regions across the US, Europe, and Asia. We select two servers to host each channel for fault tolerance so all message operations continue seamlessly with no loss of data in the event of an outage.

We also store each message to a channel in at least one other region. Should service in an entire region and its datacenters be disrupted, such as in the case of the AWS US-EAST-1 outage in December 2021, there is no risk of message loss. Even if an instance or datacenter fails, Ably can statistically provide 99.99999% message availability and survivability.

When a publisher sends a message to the Ably platform, the message is stored and, only then, does the publisher receives an acknowledgment to indicate publication. This prevents the false-positive situation where a publisher believes a message to have been published, but it is lost before it is recorded.

Suppose a problem arises in a datacenter, and it is impossible for Ably to store a message in those multiple locations. The message will be rejected via an exception or failure callback to its publisher. The publisher knows that message publish failed, and can make a subsequent attempt, which should be successful because Ably uses various strategies to work around faulty datacenters.

When a client connects to Ably, they are automatically connected to the closest datacenter through our latency-based routing using DNS. When a client sends a request to our DNS servers, they determine the client’s location and respond with one or more IP addresses from the closest datacenter (closest meaning the one with the lowest latency).

If a datacenter is unhealthy, Ably automatically stops routing all traffic to that datacenter at a DNS level and ensures that any DNS requests are directed to the next closest healthy datacenter. All DNS has a TTL of 60 seconds, so a datacenter becoming unavailable will have traffic routed to one of our other datacenters within a minute.

Ably’s client libraries provide an additional level of DNS redundancy. If our DNS latency-based routing or the entire ably.io domain becomes unavailable, the client libraries connect to a datacenter from a backup domain.

We offer a 99.999% uptime SLA on the Ably realtime service because we know we have built the system to handle server hardware and software failures without any loss of data and service reliability. Disruption to AWS service is considered normal for the day-to-day operation and will have no impact on the Ably service.

| Related article Engineering dependability and fault tolerance in a distributed system |

Message integrity

The discussion above focuses on reliability, so that a publisher can be sure that nothing they send through Ably is lost. Equally important is the integrity of the published messages because duplicated or out-of-order messages can have significant consequences. For example, chat messaging relies on a well-defined sequence of messages.

Ably ensures that messages are delivered to subscribers in the order they were published because every message sent has a unique incrementing message serial number. The serial number is based on a timestamp, and it’s used to disambiguate messages published in the same millisecond.

Ably offers an exactly-once guarantee: when the publisher sends a message, we perform an existence check for already-accepted messages with the same identifier and discard any duplicates.

De-duplication processing ensures that previously processed messages are detected and discarded.
De-duplication processing ensures that previously processed messages are detected and discarded.

Having ensured that duplicate messages from the publisher are discarded, it is slightly more complex to ensure that subscribers receive messages precisely once. Suppose a subscriber client disconnects and later reconnects. In that case, the stream of messages from the publisher must resume precisely where it left off without repetition of any message or loss of others. The subscriber needs to indicate the last message received to resume receiving messages from the stream at the correct point.

Numerical ordering simplifies the implementation of exactly-once semantics.

From an engineering perspective, numerical ordering simplifies the implementation of exactly-once semantics. When subscribers reconnect, they only need to send the serial number of the last message they’ve received to Ably. Based on the serial number, which specifies a position in an ordered sequence of messages, we can resume the stream, ensuring that the subscriber doesn’t receive a message twice or none at all. Furthermore, when the stream resumes, the subscriber receives all messages in the correct order.

| Related article Achieving exactly-once delivery with Ably |

Connection state recovery

What happens when a network failure causes a client to disconnect?

It is inevitable for a user to lose connection every so often, such as when they are using a phone and move into an area with no coverage, or if their computer goes offline because of a network issue. When a connection drops, the apps and services a user depends upon should gracefully handle the outage. And when the connection resumes, the user should be able to pick up where they left off, as the apps they are using recover smoothly.

Ably has designed robust connection management into its client libraries. Once a client has connected to Ably, the server regularly sends a WebSocket ping frame every 15 seconds as a keep-alive. This ‘heartbeat’ allows both ends of the connection to detect if the connection state changes.

An Ably client library cannot rely on the WebSocket API to propagate information from the transport layer if the socket closes. Instead, the client library explicitly monitors the connection state by watching the ping frames because if those stop arriving, the socket has closed. Because the Ably client library has complete information about the WebSocket connection, any app that uses it can respond accordingly to changes in the connection state.

The client library responds to the ping frames, so the server also detects a change to the network state if the heartbeat of responses ceases. Ably’s client libraries will attempt to reconnect after a disconnection every 15 seconds and whenever the OS notifies the client library that the internet connection has resumed. If a client library remains disconnected for under 2 minutes, when the connection is re-established, it can retrieve any missed messages during disconnection.

Once the client has been disconnected for more than 2 minutes, the client library moves into the suspended state to indicate that all connection state is lost, and any channels are automatically suspended. Once the connection is re-established, the client library will automatically reattach the suspended channels and emit an attached event that indicates that the disconnection period lasted longer than two minutes. Apps can reflect that they have been disconnected to a user, and so manage their expectations.

Ably offers message history persistence to apps that need to avoid message loss. If this is enabled, messages are stored for 24 – 72 hours on disk and can be retrieved using the History API. To ensure persistence succeeds messages are stored in three regions.

This historical data is also used by Ably’s Rewind feature so that a client can retrieve messages sent prior to its connection point. For example, when a user first joins a room, they can load the chat stream and see messages exchanged by members of a room prior to the point that they entered.

Find out more about Ably

Predictability and reliability are crucial to the user experience of any online app or service. The four pillars of Ably engineering enable us to guarantee that our customers can build and deliver a predictable experience to their users.

For more information, look at the Ably website for case studies and articles about the platform’s benefits and solutions, or get in touch with us with any questions.

New call-to-action

Join the Ably newsletter today

1000s of industry pioneers trust Ably for monthly insights on the realtime data economy.
Enter your email