Token streaming

Token streaming progressively delivers the tokens generated by an LLM to clients in realtime, as the response is being generated.

The Ably channel delivers each token to subscribed clients in realtime and automatically compacts the tokens into full LLM responses. Clients therefore do not have to re-stream the entire conversation token by token when they reconnect, refresh, or load history.

[Diagram: how AI Transport uses message appends for token streaming]

How it works

Token streaming lets clients receive and consume tokens as the LLM generates them. It also lets clients that are not subscribed in realtime consume the full response as a single coherent message, for example when loading history, refreshing the client, or returning to a conversation.

A key feature of AI Transport's transport layer is that it understands the relationship between a response and its individual tokens. This allows the service to support clients that resume an interrupted connection, or that refresh, during a streamed response.

AI Transport supports token streaming by enabling agents to build a response incrementally, appending each token to the content of a single message. A subscriber consuming in realtime receives each appended token immediately. Clients that are not connected in realtime do not need to consume each individual token to rebuild the response; they can consume the entire response, up to the last appended token, as a single message.
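The append model described above can be illustrated with a small in-memory sketch. This is not the Ably or AI Transport API: the AppendableMessage class and its subscribe, append, and snapshot methods are hypothetical names used only to show how one growing message can serve both realtime subscribers and clients that read it later.

```javascript
// Illustrative in-memory model only; none of these names belong to the Ably or AI Transport APIs.
class AppendableMessage {
  constructor() {
    this.content = '';
    this.subscribers = [];
  }

  // A realtime subscriber is called with each token as it is appended.
  subscribe(onToken) {
    this.subscribers.push(onToken);
  }

  // The agent appends one token; the content of the single message grows in place.
  append(token) {
    this.content += token;
    for (const onToken of this.subscribers) onToken(token);
  }

  // A client that is not subscribed in realtime reads the accumulated response in one go.
  snapshot() {
    return this.content;
  }
}

const message = new AppendableMessage();
const received = [];
message.subscribe((token) => received.push(token));

for (const token of ['Hel', 'lo', ', ', 'world']) message.append(token);

console.log(received.length);    // 4 (the realtime subscriber saw each token separately)
console.log(message.snapshot()); // Hello, world
```

A late-joining client in this model never replays the four tokens; it reads the snapshot once and then, if it subscribes, receives only the tokens appended afterwards.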

Using the AI Transport SDK on the server with Vercel's AI SDK, a single call streams the entire response:

JavaScript

const { reason } = await turn.streamResponse(result.toUIMessageStream())

That single line reads the LLM stream, encodes tokens through the codec, publishes messages to the Ably channel, handles abort signals, and returns when the stream completes or is cancelled.
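To make those responsibilities concrete, here is a rough sketch of a read-publish loop with abort handling, using only the standard ReadableStream and AbortController APIs. It is not the SDK's implementation: pumpStream, the publish callback, and the 'finished' and 'aborted' reason values are assumptions chosen for illustration, and the codec step is omitted.

```javascript
// Hypothetical stand-in for a streamResponse-style helper; not the AI Transport implementation.
async function pumpStream(stream, publish, abortSignal) {
  const reader = stream.getReader();
  try {
    while (true) {
      // The signal is checked between reads, so at most one in-flight chunk is published after an abort.
      if (abortSignal?.aborted) {
        await reader.cancel();
        return { reason: 'aborted' }; // assumed reason value
      }
      const { done, value } = await reader.read();
      if (done) return { reason: 'finished' }; // assumed reason value
      await publish(value);
    }
  } finally {
    reader.releaseLock();
  }
}

async function demo() {
  const stream = new ReadableStream({
    start(controller) {
      controller.enqueue('Hel');
      controller.enqueue('lo');
      controller.close();
    },
  });
  const sent = [];
  const result = await pumpStream(stream, async (chunk) => { sent.push(chunk); });
  console.log(result.reason, sent.join('')); // finished Hello
}

demo();
```

Returning a reason instead of throwing on cancellation lets the caller pass the outcome straight to the next step, which matches the way the single-line call above destructures reason from its result.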

On the client, the view updates as tokens arrive:

JavaScript

const { nodes } = useView({ transport })
// nodes contains messages with streaming text that updates in real time

Stream lifecycle

Each streamed response is in one of three states. It starts as streaming and ends as either finished or aborted:

streaming - tokens are being appended. The message grows as tokens arrive.
finished - the stream completed normally. The message is final.
aborted - the stream was cancelled or errored. The partial message is preserved.

The stream status is tracked in a message header (x-ably-status), which clients can check to tell whether a message is still streaming or has reached a terminal state.
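A client-side check of that header might look like the sketch below. It assumes the header values are the state names listed above ('streaming', 'finished', 'aborted') and that the caller has already extracted the message headers into a plain object; describeStream and canTransition are hypothetical helpers, not SDK functions.

```javascript
// Assumption: header values match the state names in the lifecycle list.
const TERMINAL_STATUSES = new Set(['finished', 'aborted']);

// Summarise the stream state from a plain headers object.
function describeStream(headers) {
  const status = headers['x-ably-status'];
  return {
    status,
    isStreaming: status === 'streaming',
    isTerminal: TERMINAL_STATUSES.has(status),
  };
}

// Only streaming -> finished and streaming -> aborted are valid transitions;
// a terminal message never returns to streaming.
function canTransition(from, to) {
  return from === 'streaming' && TERMINAL_STATUSES.has(to);
}

console.log(describeStream({ 'x-ably-status': 'streaming' }).isStreaming); // true
console.log(describeStream({ 'x-ably-status': 'aborted' }).isTerminal);    // true
console.log(canTransition('finished', 'streaming'));                       // false
```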

Implement token streaming

Server

The server creates a turn, invokes the LLM, and streams the response:

JavaScript

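import { createServerTransport } from '@ably/ai-transport/vercel'
import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'

// channel, turnId, clientId, messages, and conversationHistory are assumed
// to be defined elsewhere, for example by your request handler
const transport = createServerTransport({ channel })
const turn = transport.newTurn({ turnId, clientId })

// Start the turn and add the incoming messages to it
await turn.start()
await turn.addMessages(messages, { clientId })

// Invoke the LLM, passing the turn's abort signal so that cancellation stops generation
const result = streamText({
  model: anthropic('claude-sonnet-4-20250514'),
  messages: conversationHistory,
  abortSignal: turn.abortSignal,
})

// Stream the response, then end the turn with the reason the stream stopped
const { reason } = await turn.streamResponse(result.toUIMessageStream())
await turn.end(reason)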

streamResponse accepts any ReadableStream. For the Vercel AI SDK, result.toUIMessageStream() provides the right format. For other frameworks, produce a ReadableStream of your codec's event type.

Client

With Vercel's useChat:

JavaScript

const transport = useClientTransport({ channelName: chatId })
const chatTransport = useChatTransport(transport)
const { messages } = useChat({ transport: chatTransport })

With generic hooks:

JavaScript

const { nodes } = useView({ transport })
// Each node.message contains the streamed content, updating in real time

Under the hood

The codec converts domain events to Ably operations:

Start - creates a new Ably message on the channel.
Append - appends content to the existing message (Ably message append operation).
Close - updates the message with a terminal status (finished/aborted).

If an append fails, for example due to a transient network issue, the encoder falls back to a full message update operation to recover. This ensures the accumulated response is never lost.
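The Start, Append, and Close mapping and the fallback path can be sketched against a fake channel. Everything here is hypothetical: FakeChannel, Encoder, and their method names are illustrative stand-ins rather than the codec or the Ably SDK, and the failure is injected deliberately to exercise the fallback.

```javascript
// Fake channel for illustration; these method names are hypothetical, not the Ably SDK's.
class FakeChannel {
  constructor({ failAppendAt = -1 } = {}) {
    this.content = null;
    this.status = null;
    this.log = [];
    this.appendCount = 0;
    this.failAppendAt = failAppendAt;
  }
  create() {
    this.content = '';
    this.status = 'streaming';
    this.log.push('create');
  }
  append(delta) {
    // Simulate a transient failure on one chosen append.
    if (this.appendCount++ === this.failAppendAt) throw new Error('transient failure');
    this.content += delta;
    this.log.push('append');
  }
  update(fullContent, status = this.status) {
    this.content = fullContent;
    this.status = status;
    this.log.push('update');
  }
}

class Encoder {
  constructor(channel) {
    this.channel = channel;
    this.accumulated = ''; // full response so far, kept so a fallback can resend it
  }
  start() {
    this.channel.create();
  }
  append(delta) {
    this.accumulated += delta;
    try {
      this.channel.append(delta);
    } catch {
      // Fallback: replace the message with everything accumulated so far.
      this.channel.update(this.accumulated);
    }
  }
  close(status) {
    this.channel.update(this.accumulated, status);
  }
}

const channel = new FakeChannel({ failAppendAt: 1 }); // the second append fails
const encoder = new Encoder(channel);
encoder.start();
for (const token of ['The ', 'quick ', 'fox']) encoder.append(token);
encoder.close('finished');

console.log(channel.log.join(' > '));         // create > append > update > append > update
console.log(channel.content, channel.status); // The quick fox finished
```

Because the encoder keeps the accumulated text locally, the fallback update carries the complete response so far, so the token from the failed append is not dropped.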

Append rollup

LLM token streaming introduces high-rate traffic patterns, with some models outputting upwards of 150 distinct token events per second. AI Transport automatically manages this by rolling up multiple appends into a single published message, preventing a single response stream from reaching the message rate limit for a connection.

Your agent streams tokens to the channel at the model's output rate.
Ably publishes the first token immediately, then automatically rolls up subsequent tokens on receipt.
Clients receive the same content, delivered in fewer discrete messages.

By default, Ably delivers a single response stream at 25 messages per second or the model output rate, whichever is lower. Ably charges for the number of published messages, not for the number of streamed tokens.
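The effect of rollup on message count can be approximated with a short simulation. The batching rule below (first token published on its own, later tokens grouped into fixed windows measured from the first token) is a simplifying assumption rather than Ably's actual algorithm, and rollUp is a hypothetical function.

```javascript
// Simplified model of append rollup; the windowing rule is an assumption for illustration.
// tokens: array of { at: arrival time in ms, text: token text }, sorted by arrival time.
function rollUp(tokens, windowMs) {
  if (tokens.length === 0) return [];
  // A zero window disables rollup: every token is its own message.
  if (windowMs <= 0) return tokens.map((token) => token.text);

  const [first, ...rest] = tokens;
  const buckets = new Map(); // window index -> concatenated text
  for (const { at, text } of rest) {
    const bucket = Math.floor((at - first.at) / windowMs);
    buckets.set(bucket, (buckets.get(bucket) ?? '') + text);
  }
  // The first token is published immediately; each window becomes one message.
  return [first.text, ...buckets.values()];
}

const tokens = [
  { at: 0, text: 'A' },
  { at: 7, text: 'B' },
  { at: 14, text: 'C' },
  { at: 41, text: 'D' },
  { at: 48, text: 'E' },
  { at: 95, text: 'F' },
];

const published = rollUp(tokens, 40);
console.log(published.join(' | ')); // A | BC | DE | F
console.log(published.join(''));    // ABCDEF
```

Six tokens become four published messages, while the concatenated content that clients see is unchanged.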

Configure rollup behaviour

Ably concatenates all appends for a single response that are received during the rollup window into one published message. Set the rollup window for a connection using the appendRollupWindow transport parameter:

`appendRollupWindow`	Maximum message rate for a single response
0ms	Model output rate
20ms	50 messages/s
40ms (default)	25 messages/s
100ms	10 messages/s
500ms (max)	2 messages/s

JavaScript

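import * as Ably from 'ably';

const ably = new Ably.Realtime({
  key: 'demokey:*****',
  // Rollup window in milliseconds: 100 caps a single response at 10 messages/s
  transportParams: { appendRollupWindow: 100 },
});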
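Every row of the table is consistent with a maximum rate of 1000 divided by the window in milliseconds. The helpers below encode that relationship as a sketch; applying it to windows not listed in the table is an assumption, and maxMessagesPerSecond and effectiveRate are hypothetical names rather than SDK functions.

```javascript
// Maximum published message rate for one response, derived from the table rows.
// Assumption: the 1000 / window relationship also holds between the listed values.
function maxMessagesPerSecond(windowMs) {
  if (windowMs < 0 || windowMs > 500) {
    throw new RangeError('appendRollupWindow must be between 0 and 500 ms');
  }
  // A 0 ms window applies no rollup, so only the model's output rate limits delivery.
  return windowMs === 0 ? Infinity : 1000 / windowMs;
}

// Delivered rate is the lower of the model's output rate and the rollup cap.
function effectiveRate(modelTokensPerSecond, windowMs) {
  return Math.min(modelTokensPerSecond, maxMessagesPerSecond(windowMs));
}

console.log(maxMessagesPerSecond(100)); // 10
console.log(effectiveRate(150, 40));    // 25
console.log(effectiveRate(8, 40));      // 8
```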

Related features

Cancellation - stop a stream mid-response.
Reconnection and recovery - resume streams after disconnection.
History and replay - load past streamed responses from channel history.
Chain of thought - stream reasoning alongside text.
Server transport API - reference for streamResponse and other server methods.
Transport - how the transport layer encodes and delivers tokens.
Get started - build your first AI Transport application.
