
Engineering message appends for AI Transport: three vignettes

Simon Woolf

At Ably we recently shipped AI Transport, our drop-in transport layer for streaming LLM output over Ably channels, with all the resumability, multi-device continuity, and handover guarantees that implies.

One of the tricky things about token streaming from a pub/sub point of view is that while a model progressively emits fragments of text, those fragments logically all belong to a single message. You want to deliver them to the user live as they're produced, but for many other purposes you want to be able to treat the whole aggregated thing as a unit; e.g. at the end of it all you want your channel history to show a single response, not be cluttered with thousands of fragments. The primitive that we use to resolve this tension is a new type of pub/sub message called an append: a way of publishing an update to a message to extend its payload one fragment at a time. Each token the model produces becomes an append, and any subscriber gets a coherent view of the response as it grows.
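To make the shape of that concrete, here's a rough sketch of the publishing side. The `AppendPublisher` interface and its `publish` signature are illustrative stand-ins, not the actual AI Transport API:

```typescript
// Illustrative sketch only: AppendPublisher and its publish signature are
// stand-ins, not the actual AI Transport API.
interface AppendPublisher {
  publish(messageId: string, fragment: string): Promise<void>;
}

// The model emits fragments; each one becomes an append to the same logical
// message, so live subscribers see the response grow token by token while
// channel history ends up with a single aggregated message.
async function streamResponse(
  publisher: AppendPublisher,
  messageId: string,
  tokens: AsyncIterable<string>,
): Promise<void> {
  for await (const token of tokens) {
    await publisher.publish(messageId, token);
  }
}
```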

Doesn't sound particularly complicated. But reality has a surprising amount of detail.

The space of possible API designs, semantics, and implementation decisions you're working with when implementing something like this is enormous. Here are three vignettes from the build, each covering a place where we ended up somewhere interestingly non-obvious.

1. Dual representations: why is every append stored twice?

Within the Ably system, when we send an append around and persist it, it always includes two alternate versions of itself: the actual append fragment (just the new bit), and the full accumulated payload so far (everything the producer has emitted on that message up to and including this fragment).
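Concretely, you can picture the internal form of an append as carrying two payloads side by side. The field names here are assumptions for illustration, not the actual wire format:

```typescript
// Illustrative only: an append carries both representations (field names assumed).
interface InternalAppend {
  messageId: string;   // the logical message this append extends
  fragment: string;    // just the new bit the publisher sent
  accumulated: string; // everything emitted on this message so far,
                       // up to and including this fragment
}
```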

On its face this seems silly: the whole point of an append is that it's small. The publisher just sent a token; now we have to carry the whole 2,000-token transcript alongside it? Isn't it a nasty n-squared bug?

Consider a new subscriber who attaches to a channel partway through a long response. The response so far is a couple of thousand characters long. What's the first thing they should receive? The obvious answer is that they receive the appends as published, and to get the earlier part they use continuous history to fetch everything up to the attach point. This is a fine answer. But it's not a great answer. An append on its own, without the earlier part of the message, is not really useful. So that answer would force everyone attaching partway through a response to make a history request, and wait for the response to it, in order to reconstruct the full message so far. That's both slow and a bit annoying: the current message being streamed isn't exactly history; it's happening now.

A nicer API would just give you the fully-reconstructed message so far as the first message you receive, and then only the incremental appends for all subsequent changes (avoiding n-squared bandwidth usage).
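Sketched in code, the choice the edge is making might look roughly like this; it's a simplified illustration of the idea, not the real edge implementation, and the types and names are assumed:

```typescript
// Simplified sketch of the choice the subscriber-facing edge makes
// (types and names are assumptions, not the real implementation).
interface EdgeAppend {
  messageId: string;
  fragment: string;    // just the new token(s)
  accumulated: string; // the whole payload so far
}

// For the first append of a message seen on a given attachment, deliver the
// fully-accumulated payload; for every later append, deliver only the fragment.
function payloadForSubscriber(append: EdgeAppend, seen: Set<string>): string {
  if (!seen.has(append.messageId)) {
    seen.add(append.messageId);
    return append.accumulated;
  }
  return append.fragment;
}
```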

Now consider: what do you want to get if you do a REST history query for the last 10 messages? You probably want the last 10 actual messages (with their latest, fully-aggregated content), not the last 10 fragments of a single message. So history should store only fully-aggregated messages, not fragments.

So ideally we need both representations of an append (the fragment and the full version) available at the subscriber-facing edge, where we can choose between them depending on the consumer, the context, and whether this is the first append for a given message on a new attachment.

We could compute the "other" representation on demand at the edge where it's needed, but this gets ugly quickly, and has pathological worst-case behaviour.

So instead, we do the computation exactly once, at the point of publish, and propagate both representations together as a single object with two alternate payloads. This is a little more expensive for us internally, but not prohibitively so: message transit bandwidth within our network is cheap, and using a larger, more complicated representation internally to simplify things for publishing and subscribing clients is usually a good tradeoff.
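As a rough sketch of that publish-time step (an in-memory stand-in for illustration; the real aggregation state isn't a local Map):

```typescript
// Rough sketch of doing the aggregation once, at publish time
// (an in-memory stand-in; the real aggregation state isn't a local Map).
const accumulatedByMessage = new Map<string, string>();

function buildInternalAppend(messageId: string, fragment: string) {
  const accumulated = (accumulatedByMessage.get(messageId) ?? '') + fragment;
  accumulatedByMessage.set(messageId, accumulated);
  // Both representations now travel together; any edge can pick whichever
  // one a given consumer needs without recomputing anything.
  return { messageId, fragment, accumulated };
}
```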

And part of the reason it's not that bad is the next design decision:

2. Conflation as a semantic property of the message, not of a 'conflation step'

In the traditional pub/sub model, conflation only happens if the user explicitly creates a conflation rule with a specified conflation key, applied within a configurable conflation interval. Within that conflation step, if two published messages share a conflation key, the older one is dropped in favour of the newer.

We could have kept this for appends. And if we had, that would have been fine. But not great. There would have been some rough edges.

Some models can routinely emit 150 or more tokens per second. Our default per-connection inbound rate limit is 50 messages per second. If every append is a normal immutable message right up until it hits a conflation rule, then someone using such a model would immediately get rate-limited. We'd probably advise them to do some clientside pre-publish batching. Which is a few lines of code, nothing tricky. Still, annoying.
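For illustration, that clientside pre-publish batching might look something like the following sketch, with a hypothetical `publish` callback; the flush interval is chosen to stay under the 50 messages per second limit mentioned above:

```typescript
// Hedged sketch of clientside pre-publish batching (hypothetical publish
// callback). A 25ms flush interval gives at most ~40 publishes per second,
// comfortably under a 50 messages/second connection limit.
function makeBatchingPublisher(
  publish: (fragment: string) => Promise<void>,
  flushIntervalMs = 25,
): (token: string) => void {
  let buffer = '';
  setInterval(() => {
    if (buffer === '') return;
    const batch = buffer;
    buffer = '';
    void publish(batch); // several model tokens coalesced into one append
  }, flushIntervalMs);
  return (token) => {
    buffer += token;
  };
}
```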

And traditional conflation semantics are the wrong semantics for appends. You never want to keep only the latest fragment of a token stream; what you actually want is to concatenate them. We could have added an option to a conflation rule to concatenate appends instead of dropping all but the latest one, and encouraged users to configure such a step. Which would have been... fine.

But the thing to notice here is: concatenation is not just one of several possible ways of combining appends together. The system stores a fully-aggregated version of the message - that means the system already privileges concatenation as the unique way of combining appends together. That is: appends by definition admit a natural, lossless conflation semantic.

Which raises the question: why limit it to a 'conflation step'? This actually isn't at all like our normal message conflation: no payload data is dropped. So what if, instead of making the user explicitly set up a 'conflation rule which does appending', we define the semantics of append messages to allow the server to eagerly concatenate them together at every opportunity?

In which case all the questions about where in the pipeline conflation should apply, and how the user should opt in, disappear. There is no need for a policy surface, the message type itself carries the license, and we can just 'do the right thing'. Any kind of batching rule you configure gets to conflate appends for free, without the user needing to configure anything special. And if someone publishes a high rate of appends that would trigger their connection rate limiter, we can automatically apply some brief batching of appends at the connection level, reducing a 150/s token stream into something that fits the connection rate limit (and we do exactly that). The problem is solved for the user before they hit it.
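The lossless conflation that this licenses is just concatenation of adjacent appends to the same message, something like the following sketch (shapes and names are assumptions for illustration):

```typescript
// Sketch of the lossless conflation appends admit: any run of appends to the
// same message collapses into one append whose fragment is the concatenation
// (shapes and names are assumptions for illustration).
interface Append {
  messageId: string;
  fragment: string;
}

function conflateAppends(appends: Append[]): Append[] {
  const out: Append[] = [];
  for (const a of appends) {
    const last = out[out.length - 1];
    if (last !== undefined && last.messageId === a.messageId) {
      // No payload data is dropped: subscribers end up with the same final
      // message, just delivered in fewer pieces.
      last.fragment += a.fragment;
    } else {
      out.push({ ...a });
    }
  }
  return out;
}
```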

Getting the semantics of an operation right — what guarantees it makes and what guarantees it doesn't — is one of the trickiest but most important parts of designing a great API.

3. Generic primitives beat bespoke features: how we already had most of token streaming before we started

Ably's Chat product has supported editing and deleting messages since last year. A chat user types "helo wolrd", realises, edits to "hello world"; some time later decides the thought was bad and deletes it. Standard chat behaviour.

The easy way to build that would have been to put edit and delete logic inside the Chat product layer. Chat sits on top of pub/sub, and could have implemented editing itself, on top of the existing pub/sub system. The feature would have shipped faster, and would probably have worked well enough for the chat use case.

We didn't do that. We built it from the ground up as a first-class primitive of our pub/sub layer, of which chat was just the first user.

This was not a small amount of work. Updates broke an assumption the old history implementation had relied on — that messages are delivered in the order they were first created and that their content never changes after the fact. We ended up building a whole new history storage layer that allowed history requests to efficiently return messages at their original creation position, but with their latest content.
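The property that layer provides can be sketched with a toy in-memory model (purely illustrative; the real storage is nothing like a Map): messages are ordered by their creation serial, while later operations overwrite the stored content in place.

```typescript
// Toy in-memory sketch of the property the new history layer provides
// (purely illustrative; the real storage is nothing like a Map).
interface StoredMessage {
  serial: string;  // fixed at creation; determines position in history
  content: string; // latest content, overwritten by later operations
}

class HistoryStore {
  private bySerial = new Map<string, StoredMessage>();

  create(serial: string, content: string): void {
    this.bySerial.set(serial, { serial, content });
  }

  applyUpdate(serial: string, newContent: string): void {
    const msg = this.bySerial.get(serial);
    if (msg) msg.content = newContent; // position unchanged, content replaced
  }

  // History queries return messages in creation order, with their latest content.
  latest(limit: number): StoredMessage[] {
    return [...this.bySerial.values()]
      .sort((a, b) => a.serial.localeCompare(b.serial))
      .slice(-limit);
  }
}
```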

Was this over-engineering? Why do all the work to make it generic when chat was the only user who had asked for this?

There is often pressure, when you're building a feature motivated by a specific, important use case, to build only for that use case. The customers you're talking to are the ones who need it for that, and the indicators by which you're measuring success are tied to those customers. Genericising takes longer and serves people who haven't asked for it yet.

But I've found that when we do do that work, we later find ourselves glad that we've done so, and when we don't, we regret it. The generic primitive is a lever with unknown future application.

A year or two after doing updates and deletes for chat, we started building AI Transport. And we realised that the semantics we wanted for token streaming were really quite similar to editing messages: a publisher has already put a message onto a channel and wants to modify its payload, and the channel history should reflect only the modified version. The difference is only that you're concatenating instead of replacing. Most of the substrate — the API pattern for publishers referencing existing messages, the capabilities, the history layer that places messages at their creation time, and so on — was already there. If we'd built edits inside Chat, we'd have had to build it again for AI Transport. Instead, we could take advantage of the work we'd already done, just extending it by adding an append operation alongside update and delete.
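One way to picture how append slots in beside the existing operations: each one references an existing message by serial and describes how its payload changes. This is a simplified model for illustration, not the actual protocol types.

```typescript
// Simplified model of how append sits beside update and delete: each operation
// references an existing message by serial and describes how its payload
// changes (not the actual protocol types).
type MessageOperation =
  | { action: 'update'; serial: string; data: string }  // replace the payload
  | { action: 'delete'; serial: string }                // remove it
  | { action: 'append'; serial: string; data: string }; // concatenate onto it

function applyOperation(current: string | null, op: MessageOperation): string | null {
  switch (op.action) {
    case 'update': return op.data;
    case 'delete': return null;
    case 'append': return (current ?? '') + op.data;
  }
}
```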

You don't always know what the future feature is going to be. When we were building Chat's edit-and-delete, we had no idea we'd shortly be using that primitive to stream tokens.

But adding powerful, composable, generic primitives has a habit of paying off in ways you can't predict.
