
CRDTs are simpler and more common than you think

Simon Woolf

CRDTs are sometimes talked about as complex data structures that you use via special CRDT libraries. They can be that, but they don't have to be. Some of the natural solutions that any software engineer might come up with to solve a problem in a distributed system are CRDTs, even if the implementer doesn't know or care that they are. It can be useful to identify and label them as such.

A single boolean, such as a flag for whether some event has happened, is a CRDT if it can’t ever be set back to false after being tripped.
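
To make that concrete, here's a minimal sketch in TypeScript (purely illustrative): the merge operation is a logical OR, which is commutative, associative, and idempotent, so replicas converge no matter what order they see events in.

```typescript
// A one-way flag: a trivial state-based CRDT. Once tripped, it can
// never be untripped, so merging replicas in any order converges.
class OneWayFlag {
  private tripped = false;

  trip(): void {
    this.tripped = true;
  }

  // Merge is logical OR: commutative, associative, and idempotent.
  merge(other: OneWayFlag): void {
    this.tripped = this.tripped || other.tripped;
  }

  get value(): boolean {
    return this.tripped;
  }
}
```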

I'm going to give a couple of examples of how some very simple CRDTs are used in practice, in a large distributed system that forms the backend of Ably, a distributed pub/sub platform-as-a-service.

The two examples I'll use are the netmap (our term for the set of nodes, spread over many AWS instances and across multiple regions, that form part of a single cluster, such as our main production cluster), and Presence (the set of connections attached to a given channel who have entered into the 'presence set' of that channel, and which is visible to everyone else attached to the channel).

CRDT usage in the netmap

At Ably, part of every cluster is a service discovery layer, which we call 'gossip'. In every data center in which the cluster operates, we run several gossip nodes. Each other ('leaf') node connects to a gossip node in the same data center, and through that connection it both discovers the netmap and contributes itself to it. That is, the gossip node sends the leaf node the set of other nodes in the cluster, and the leaf node in turn sends information about itself to its gossip node, which propagates that information to all other gossip nodes over a (Scuttlebutt-inspired) gossip protocol. Each of those in turn passes the update on to all leaf nodes connected to it.

That all means that there are many coequal gossip nodes, which will each see the same set of events happen in different orders (as each receives events from all the other gossips it's talking to from time to time). We want them all to eventually come to the same conclusion about what nodes are in the netmap.

The way we do this is that each gossip node has a set of leaf nodes that it owns, and it gossips that set around to the other gossip nodes. A leaf node is considered in the cluster if and only if it is in the owned-nodes set of at least one gossip. So if a leaf node disconnects from one gossip and reconnects to a different one, it doesn't matter what order other gossips see those two events, eventually everyone will see the node removed from one set and re-added to another set, so still in the netmap.

This is a very simple set CRDT. The underlying data structure is a map, whose keys are the ids of the gossips in the cluster, and whose values are sets of leaf nodes. The 'set' abstraction is just the union of all the underlying sets. It isn't one of the standard state-based set CRDTs, because those need to cope with environments where any two events might arrive in either order; here we are taking advantage of the fact that events from a particular source are delivered in order, so the only operations that need to commute are events from different sources. Since by construction those apply to different keys in the map, we're done.
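
As a minimal sketch of that structure (the names and shapes here are illustrative, not our actual implementation):

```typescript
// Illustrative sketch of the netmap structure: a map from gossip id
// to the set of leaf nodes that gossip owns.
type GossipId = string;
type LeafNodeId = string;

class Netmap {
  private owned = new Map<GossipId, Set<LeafNodeId>>();

  // Updates from a given gossip arrive in order, so we can simply
  // overwrite that gossip's entry. Updates from different gossips
  // touch different keys, so they commute.
  applyUpdate(from: GossipId, ownedNodes: Set<LeafNodeId>): void {
    this.owned.set(from, ownedNodes);
  }

  // A leaf node is in the cluster iff at least one gossip owns it:
  // the 'set' abstraction is the union of all the underlying sets.
  members(): Set<LeafNodeId> {
    const all = new Set<LeafNodeId>();
    for (const nodes of this.owned.values()) {
      for (const node of nodes) all.add(node);
    }
    return all;
  }
}
```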

The point is that there's nothing complicated going on here. We have coordination-free semantics with nothing more than a bit of care in the choice of underlying data structure, a map with the right index instead of a bare set. No special CRDT library, no abstract algebra.

You might ask, what's even the point in conceptualising this as a CRDT then? Isn't that just adding unnecessary complexity? One answer is that having that concept gives you a way to think about what properties you have to maintain as the constraints and requirements change. Perhaps we might want to trial a gossip transport which does not preserve message ordering, or perhaps implement an admin command which can forcibly remove a node no matter how many nodes are contributing it, and so on. Having 'the set must stay a CRDT' as a conceptual handle makes it more likely that we can evolve the system while keeping the guarantees that were present in the original design.

(Actually what we do in practice isn't quite as simple as described above, because we want to err on the side of not removing an instance from the cluster unless and until we're sure it's actually gone. Imagine you are a gossip in the cluster. In the case mentioned earlier, where a leaf node A migrates from gossip X to gossip Y, you might hear from X about the removal a little while before you hear from Y about the addition. But you don't want to remove A from the netmap in the time in between, which would be unnecessarily disruptive. So for removals, gossip nodes wait until every gossip they know of has talked to X — that is, has the removal of A from X's set in their causal history (as determined by a 'maxVersionSeen' vector clock) — before they actually consider A removed from the cluster. The result is still eventually consistent, because the gossip protocol guarantees that for each node, you'll eventually hear a new enough update from that node (or else conclude it's dead). It just has a stronger criterion for removing instances than for adding them. Though to be clear, the use of this as a mechanism for detecting and removing unresponsive or dead nodes from the cluster is only needed for unexpected removals, such as node crashes or network partitions. Planned removals, for example during autoscaling, involve instances broadcasting their intention to stop well in advance.)
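
A rough sketch of that removal criterion (again, names are illustrative rather than our actual implementation):

```typescript
// Hypothetical sketch of the deferred-removal criterion. Assume each
// gossip tracks, for every peer, the highest version of each other
// gossip's state that that peer reports having seen.
type GossipId = string;
type Version = number;

interface GossipView {
  knownGossips: Set<GossipId>;
  // maxVersionSeen.get(peer).get(source) = latest version of
  // `source`'s state that `peer` has acknowledged seeing.
  maxVersionSeen: Map<GossipId, Map<GossipId, Version>>;
}

// Leaf node A was removed from gossip X's owned set at version
// `removalVersion` of X's state. Only treat A as gone once every
// gossip we know of has the removal in its causal history.
function removalIsSafe(
  view: GossipView,
  x: GossipId,
  removalVersion: Version
): boolean {
  for (const peer of view.knownGossips) {
    const seen = view.maxVersionSeen.get(peer)?.get(x) ?? 0;
    if (seen < removalVersion) {
      return false; // this peer might still be contributing A
    }
  }
  return true;
}
```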

What would the alternatives to this design be? One possibility would be a consensus mechanism such as Raft or Paxos. For example, in an architecture based around Raft, instead of making events commute, you could have a single gossip node (at any one time) in each region which determines a single, authoritative ordering of netmap addition and removal events in that region. That event log would then be replicated to everyone else. The problem is that if you want to scale gossip nodes horizontally, you want to distribute which gossip node netmap clients connect to, not have them all connect only to the leader. That means that the leader still needs to do a calculation that looks very similar to the one that all the nodes were doing previously, in order to aggregate all the netmap events other gossip nodes are sending it into an ordering of netmap events. You do have the benefit that only one node has to do that calculation rather than all of them, and that you get a single canonical historical ordering that all nodes agree on. But that just isn't very important in this use case (what's important is that all the nodes agree on the cluster membership now), and in return for those benefits, Raft adds a lot of extra complexity and new failure modes.[1]

CRDT usage in presence

Presence is a mechanism where connections attached to a given channel can enter into the 'presence set' of that channel. That connection (along with its chosen 'client identifier' and any data it wants) is then visible, along with any other members of the presence set, to everyone who is attached to that channel.

In general, people subscribed to the channel from different regions will see presence events from different people in different orders. But we want everyone to eventually agree on who is in the presence set.

The presence CRDT is interesting because it uses two different CRDT models at different times.

Most of the time, it uses a tombstone-less model that relies on elementwise ordered updates. That is, it requires updates from each particular other presence member to be delivered in order, which Ably guarantees. When we have this guarantee, we can use a trivial CRDT: a normal map, just like the netmap example. If updates to a given key are delivered in order, then you only need to guarantee commutativity for updates to different keys, which is just how maps work normally.
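
A minimal sketch of this mode, assuming per-member in-order delivery (field names are illustrative, not the actual wire protocol):

```typescript
// Illustrative sketch of the in-order presence model. With per-member
// ordering guaranteed by the transport, a plain map is enough.
interface PresenceUpdate {
  memberKey: string; // e.g. connection id + client id
  action: 'enter' | 'update' | 'leave';
  data?: unknown;
}

const presence = new Map<string, PresenceUpdate>();

function applyInOrder(update: PresenceUpdate): void {
  if (update.action === 'leave') {
    // No tombstone needed: nothing older from this member can
    // arrive after the leave.
    presence.delete(update.memberKey);
  } else {
    presence.set(update.memberKey, update);
  }
}
```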

However, during a sync (when a newly-attached client wants to synchronize the presence set with Ably), it switches to a different CRDT model, one that does not assume elementwise ordered updates. This allows us to avoid having to carefully synchronize getting the presence set with updates coming live.

Because we can no longer assume ordering, we need to be careful about a few things. We need to compare each incoming presence update against any matching member that we know about, to see which is actually newer (according to the original source of the update). And we need tombstones; that is, we keep the leave events we receive. Otherwise, consider a leave event that arrives live in the middle of a sync, while the sync data still includes the member. If we processed the 'leave' by removing the member from the presence set entirely, then when we later processed the (older) 'present' event from the sync, there would be nothing to compare it against for newness, and we would wrongly conclude the member is still there, even though the leave was newer.

But we only need the tombstones for a limited period of time: during the sync. Once the sync is complete, we can switch back to our other model; we can go through the set deleting all the tombstones, and stop storing them altogether (unless and until the next time the presence set needs to be synced, eg after the SDK disconnects and reconnects).
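
A rough sketch of the sync-time model, with a per-source serial standing in for the real newness comparison (the actual protocol's newness rules may differ):

```typescript
// Hypothetical sketch of the sync-time model: unordered delivery,
// so we compare newness explicitly and keep tombstones.
interface SyncUpdate {
  memberKey: string;
  action: 'present' | 'enter' | 'update' | 'leave';
  serial: number; // ordering assigned by the original source
}

// During a sync, the map holds tombstones ('leave' entries) too.
const members = new Map<string, SyncUpdate>();

function applyDuringSync(update: SyncUpdate): void {
  const existing = members.get(update.memberKey);
  // Only apply strictly newer updates; this is what makes
  // out-of-order delivery safe.
  if (existing && existing.serial >= update.serial) return;
  members.set(update.memberKey, update); // a 'leave' stays as a tombstone
}

function endSync(): void {
  // Once the sync completes we can drop the tombstones and go back
  // to the ordered, tombstone-less model.
  for (const [key, update] of members) {
    if (update.action === 'leave') members.delete(key);
  }
}
```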

This means that we neatly sidestep one of the tricky bits of tombstone CRDTs, that of how long to keep the tombstones (and how to avoid the size of your data structure growing unboundedly if there's a lot of churn), because we only use tombstones during periods when we actually need them.

Both models are CRDTs, just with different consistency requirements for the transport. It's quite convenient to be able to mix and match the two.

There are a couple of tricky edge cases here. For example, when we detect that a connection in the presence set has disconnected without explicitly leaving the set or gracefully closing the connection (commonly because its internet connection is disrupted), we generate a 'synthesized leave' event at our end 15 seconds after detecting this. This lets everyone subscribed to the channel know that we think that connection has gone. But technically this is a breach of the ordering requirement, as now there is more than one source of updates to entries with the same key. In particular, it is possible that the connection will reconnect to us (to a different frontend node) concurrently with the node it was previously connected to generating the synthesized leave. We currently solve this by having an arbitrary tiebreaker (a high-resolution timestamp) for newness between events from different sources, and by having the client listen for synthesized leaves about itself, reacting by re-entering if it sees that it has been removed from the presence set when it thinks it should still be in it.

(Note the two halves of that fix: the first half alone (the arbitrary tiebreak) would be sufficient to preserve the CRDT properties, ensuring that every attached client's presence set ends up in the same state. But that state might not be the correct state from the point of view of the application semantics.)

This works, but is messy. In the latest version of our wire protocol, we're changing to a slightly different presence model, which makes this simpler and more elegant.

In the new model, together with each member's presence data, we will now store the set of frontend nodes that claim to contribute that presence member. A synthesized leave message will remove the frontend node that generated it from that set, and will only be sent onwards to everyone subscribed to the channel if it decrements the size of that set to zero. After reconnecting to a different frontend node, SDKs will re-enter presence unconditionally, adding the new frontend to that set. To put it another way, we're adding the requirement that presence adds and removes delivered from the same ultimate source but by different paths (that is, different frontends) commute with each other, which means we can simplify the SDK logic and remove the special case. The point here isn't that we fixed it by making it a proper CRDT: it was already a CRDT; the change just let us simplify the logic for merging updates.
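
A rough sketch of that idea (again, purely illustrative):

```typescript
// Hypothetical sketch of the new model: each member carries the set
// of frontend nodes currently contributing it.
interface Member {
  data: unknown;
  contributingFrontends: Set<string>;
}

const members = new Map<string, Member>();

function enter(memberKey: string, frontend: string, data: unknown): void {
  const member = members.get(memberKey) ?? {
    data,
    contributingFrontends: new Set<string>(),
  };
  member.data = data;
  member.contributingFrontends.add(frontend);
  members.set(memberKey, member);
}

// A synthesized leave removes only its own frontend's contribution.
// Returns true if the leave should be broadcast to subscribers,
// i.e. if no frontend is left contributing the member.
function synthesizedLeave(memberKey: string, frontend: string): boolean {
  const member = members.get(memberKey);
  if (!member) return false;
  member.contributingFrontends.delete(frontend);
  if (member.contributingFrontends.size === 0) {
    members.delete(memberKey);
    return true;
  }
  return false;
}
```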

Takeaway

Both the CRDTs I've described here are very simple ones. There are certainly complex CRDTs with nontrivial merge algorithms that you should be using a library for. But the point I want to make in this post is that CRDTs don't have to be complex, and mostly aren't. They are the sort of solution any software engineer might come up with without knowing or caring that they have used a CRDT, and certainly without needing to seek out a CRDT library. I think it can still be useful to know that that is what they are, because it lets us give a label to the pattern, identify the common features, make sure we are adhering to the prerequisites, and extend the structure into more complex CRDTs (from the academic literature or otherwise) where necessary.

It can sometimes be helpful, instead of focusing on CRDTs as a set of data structures and associated merge algorithms, to simply adopt a mindset where you're designing your distributed systems to have coordination-free semantics. Speaking very loosely, the idea is that we're programming in such a way that seeing any new event from the outside can only 'move you forward'. The boolean example from the introduction gives a nice intuition: once it's tripped it can't be untripped. (And if you want to be able to untrip it, you need something like a second boolean with similar semantics, which if tripped cancels out the first: still, everything is only moving forward.) Nothing once done can ever be undone.
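
For completeness, here's what that second boolean might look like, extending the earlier one-way flag sketch:

```typescript
// The 'second boolean' idea: a flag that can be cancelled by another
// one-way flag. Both components only move forward, similar in spirit
// to the element state of a two-phase set.
class CancellableFlag {
  private tripped = false;
  private cancelled = false;

  trip(): void { this.tripped = true; }
  cancel(): void { this.cancelled = true; }

  // Merge each component independently by logical OR.
  merge(other: CancellableFlag): void {
    this.tripped = this.tripped || other.tripped;
    this.cancelled = this.cancelled || other.cancelled;
  }

  get value(): boolean {
    return this.tripped && !this.cancelled;
  }
}
```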


  1. There are other benefits to Raft. In particular, with the current architecture, during a network partition it's possible for the cluster to split into multiple independent clusters, each of which continues to serve requests. With Raft, you could guarantee that only one side (the side that had a majority of gossip nodes and so could elect a leader) continued to do so. However, there are problems with this. The biggest one is that we operate clusters in many regions around the globe. A single Raft group across the world would have very long leader election times, as the duration of the election is a significant multiple of the inter-node round-trip time. If we want a minority cluster in a partition to not do any work — not just 'not serve requests' in the sense of not changing its cluster membership (which is what the Raft log events would be here), but actually refuse to service clients — then that means an unacceptable level of global downtime during elections. As such, Raft groups would almost certainly be implemented only within a given region. But the vast majority of network partitions are between regions; within-region traffic is generally pretty reliable. So in practice the use of Raft wouldn't help as much as you might think. ↩︎
