8 min read · Updated Mar 13, 2026

Resume tokens and last-event IDs for LLM streaming: How they work & what they cost to build

Amber Dawson

When an AI response reaches token 150 and the connection drops, most implementations have one answer: start over. The user re-prompts, you pay for the same tokens twice, and the experience breaks.

Resume tokens and last-event IDs are the mechanism that prevents this. They make streams addressable – every message gets an identifier, clients track their position, and reconnections pick up from exactly where they left off. The concept is straightforward. The production scope is not: storage design, deduplication, gap detection, distributed routing, and multi-device continuity all follow from the same first decision.

Figure: LLM token stream cut by a connection break, leaving an incomplete AI response on screen

How resume tokens actually work

Resumable streaming has four moving parts.

Message identifiers. Every token or message gets a sequential ID when published – monotonically increasing, so each new message has a higher ID than the previous one.

Client state. The client tracks the ID of the last message it successfully received. In a browser, that's typically held in memory or local storage. On mobile, it needs to survive app backgrounding.

Reconnection protocol. When the connection drops, the client presents the last ID it saw. The server responds with everything that arrived after that ID, then transitions to live streaming.

Catchup delivery. The client receives missed messages in order before live tokens resume. The seam should be invisible.

The stream itself becomes the source of truth. The client doesn't reconstruct what it missed – the stream delivers it.
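The client side of those four parts can be sketched in a few lines. This is illustrative only – the class and method names are assumptions, not any particular SDK's API:

```typescript
// Minimal client-side resume state: record each message, track the last
// ID received, and present that ID on reconnection. Names are hypothetical.
type StreamMessage = { id: number; content: string };

class ResumableClient {
  private lastId: number | null = null;
  private parts: string[] = [];

  // Called for every message delivered on the stream.
  handle(msg: StreamMessage): void {
    this.parts.push(msg.content);
    this.lastId = msg.id; // advance the client's position
  }

  // The position presented to the server on reconnect;
  // null means "start from the beginning".
  resumePosition(): number | null {
    return this.lastId;
  }

  text(): string {
    return this.parts.join("");
  }
}
```

On reconnect, the client sends whatever `resumePosition()` returns, and the server replays everything published after it.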

What SSE's Last-Event-ID header gives you

Server-Sent Events implements this natively. When an SSE connection drops, the browser automatically includes a Last-Event-ID header on reconnection. The server sees which event the client last received and resumes from there.

event: token
id: 150
data: {"content": "production"}

event: token
id: 151
data: {"content": " systems"}

// On reconnection after receiving event 150,
// browser automatically sends:
GET /stream HTTP/1.1
Last-Event-ID: 150

// Server resumes from 151

The browser handles reconnect logic. Application code doesn't change between initial connection and reconnection. For the happy path – stable connection, single device, short responses – SSE with Last-Event-ID works well.
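The server side of that handshake reduces to one lookup: given the events buffered for the stream and the client's Last-Event-ID, return everything newer. A minimal sketch – the buffer shape and function name are assumptions:

```typescript
// Given the events buffered server-side and the Last-Event-ID a
// reconnecting client presented, return what needs resending.
type SseEvent = { id: number; data: string };

function eventsAfter(buffer: SseEvent[], lastEventId: number | null): SseEvent[] {
  // No header means a fresh connection: send the full stream.
  if (lastEventId === null) return buffer;
  // Resume: everything published after the client's position.
  return buffer.filter((e) => e.id > lastEventId);
}
```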

The problems start at the boundary of what SSE can do.

SSE is unidirectional and HTTP-only. It has no native history beyond what you implement server-side. It doesn't handle bidirectional messaging, so live steering – users redirecting the AI mid-response – requires a separate channel. On distributed infrastructure, a reconnecting client may reach a different server instance that has no record of the original session. SSE handles the reconnect handshake. Everything else – distributed state, per-instance routing, multi-device history – is still your problem. For use cases that need bidirectional messaging, WebSockets vs SSE covers the tradeoffs in detail.

Building resume into WebSockets

WebSockets don't include resume semantics. When a WebSocket closes, the connection is gone. Reconnecting creates a new socket with no knowledge of the previous one.

Building resume on WebSockets means building all of it yourself:

Session IDs generated at stream start, stored server-side, presented by the client on reconnection.

Message IDs assigned sequentially.

Server logic to look up a session, find the position, replay history, then transition to live.

Buffer management to decide how long to keep messages for sessions that haven't reconnected yet.

Cleanup logic to expire stale sessions without cutting off legitimate reconnects.

Each piece is straightforward in isolation. The edge cases are where the weeks go.
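Put together, those pieces take roughly this shape. This is an in-memory sketch with hypothetical names – a production version would back the map with shared storage, and the expiry policy would be more careful:

```typescript
// In-memory session store for WebSocket resume: per-session buffers,
// replay from a client-supplied position, and idle-session expiry.
type Msg = { id: number; content: string };

class SessionStore {
  private sessions = new Map<string, { buffer: Msg[]; lastSeen: number }>();

  create(sessionId: string): void {
    this.sessions.set(sessionId, { buffer: [], lastSeen: Date.now() });
  }

  publish(sessionId: string, msg: Msg): void {
    this.sessions.get(sessionId)?.buffer.push(msg);
  }

  // Replay everything after the client's last ID; unknown or expired
  // sessions get nothing and must restart.
  resume(sessionId: string, lastId: number): Msg[] {
    const s = this.sessions.get(sessionId);
    if (!s) return [];
    s.lastSeen = Date.now();
    return s.buffer.filter((m) => m.id > lastId);
  }

  // Expire sessions idle longer than ttlMs.
  expire(ttlMs: number, now = Date.now()): void {
    for (const [id, s] of this.sessions) {
      if (now - s.lastSeen > ttlMs) this.sessions.delete(id);
    }
  }
}
```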

The storage problem teams underestimate

Token-level storage is where most implementations hit an unexpected wall.

A 500-word response generates roughly 625 tokens. If you store each token as a separate record, loading one response means retrieving 625 records. A conversation with 20 exchanges is 12,500 records. Multiply across thousands of concurrent users and history retrieval becomes the performance bottleneck.

This matters because history retrieval is on the critical path for multi-device continuity. When a user switches from laptop to phone, the speed of catchup determines whether the experience feels continuous or broken.

The more practical model is to treat each AI response as a single logical message and append tokens to it rather than publishing them individually. Clients joining mid-stream receive the full message so far, then get new tokens as they arrive. One record per response instead of hundreds.
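The difference is visible in a sketch: one mutable record per response, with a snapshot for clients joining mid-stream. Names are illustrative:

```typescript
// One logical message per AI response: tokens are appended to a single
// record rather than published as individual records.
class ResponseRecord {
  private text = "";

  append(token: string): void {
    this.text += token;
  }

  // What a client joining mid-stream receives: the full response so far,
  // in one read, instead of hundreds of per-token records.
  snapshot(): string {
    return this.text;
  }
}
```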

Duplicates and gaps: the two failure modes that break trust

Duplicates happen when the connection drops after the client receives a message but before the acknowledgement reaches the server. On reconnect, the server doesn't know whether to replay that message. Without deduplication logic, the client renders the same token twice.

Figure: duplicate tokens and gap tokens in a message stream

The fix is using message IDs as deduplication keys on the client – straightforward in principle, but it needs to survive page reloads and work across tabs.

Gaps happen when sequential IDs arrive out of order or not at all. If a client receives message 153 after 150, messages 151 and 152 are missing. Without gap detection, the client silently renders an incomplete response. With it, you need logic to request missing messages, decide what to do if they can't be retrieved, and handle the state when the client gives up waiting.
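Both guards can live in one client-side receiver. A simplified sketch with illustrative names – real gap handling also needs the re-request and give-up logic described above, which is elided here:

```typescript
// Client-side receiver guarding against both failure modes: duplicate
// IDs are dropped, and jumps in the sequence are recorded as gaps that
// the client still needs to fetch.
type Token = { id: number; content: string };

class GuardedReceiver {
  private seen = new Set<number>();
  private expectedNext: number | null = null;
  public gaps: number[] = []; // IDs the client still needs to request
  private parts: string[] = [];

  receive(t: Token): void {
    if (this.seen.has(t.id)) return; // duplicate: already rendered
    this.seen.add(t.id);

    // A late arrival may fill a previously recorded gap.
    this.gaps = this.gaps.filter((id) => id !== t.id);

    // Any skipped IDs between the expected position and this one are gaps.
    if (this.expectedNext !== null) {
      for (let id = this.expectedNext; id < t.id; id++) {
        if (!this.seen.has(id)) this.gaps.push(id);
      }
      this.expectedNext = Math.max(this.expectedNext, t.id + 1);
    } else {
      this.expectedNext = t.id + 1; // first message sets the baseline
    }

    this.parts.push(t.content);
  }
}
```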

Both failure modes are rare enough to be invisible in testing. Both surface under real network conditions: mobile handoffs, flaky WiFi, corporate proxy timeouts. The first time you see them is usually a support ticket.

What distributed deployment adds

A single-server implementation can tie session state to process memory and mostly work. As soon as you run multiple instances – which you will, for reliability and scale – a routing problem appears.

A client that connected to instance A reconnects to instance B. Instance B has no record of the session. Your options: route all reconnections back to the originating instance (a pinning strategy that creates hotspots and defeats the purpose of multiple instances), or store session state in shared infrastructure that all instances can read.

Shared session storage means Redis or equivalent: network round-trips on reconnect, cache invalidation logic, and failure handling when the cache is unavailable. This is solvable. It's also not in the first implementation.
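The shape of that shared state is simple even though operating it isn't. A sketch against a generic key-value interface – an in-memory map here, Redis or equivalent in production; the key scheme is an assumption:

```typescript
// Instance-agnostic session cursors: any instance can answer a reconnect
// if the cursor lives in shared storage rather than process memory.
interface KV {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// Stand-in for Redis so the shape is runnable.
class MapKV implements KV {
  private m = new Map<string, string>();
  async get(k: string) { return this.m.get(k) ?? null; }
  async set(k: string, v: string) { this.m.set(k, v); }
}

async function saveCursor(kv: KV, sessionId: string, lastId: number): Promise<void> {
  await kv.set(`session:${sessionId}:cursor`, String(lastId));
}

async function loadCursor(kv: KV, sessionId: string): Promise<number | null> {
  const v = await kv.get(`session:${sessionId}:cursor`);
  return v === null ? null : Number(v);
}
```

A real deployment also needs TTLs on those keys and a fallback when the store is unreachable – the parts that make this "solvable but not in the first implementation".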

The multi-device gap

Multi-device continuity is where connection-oriented design hits a wall.

When state lives in the connection – or in server memory tied to that connection – device switching loses context. The phone doesn't know what the laptop received. Without a shared source of truth for message history that any device can query, each reconnect from a new device is a new session.

True multi-device continuity requires decoupling state from connections entirely. The conversation lives in a channel or persistent store. Devices subscribe and catch up rather than resuming a connection.
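In that model there is no connection to resume – there is a channel to attach to. A minimal sketch with illustrative names:

```typescript
// Channel model: conversation state lives in the channel, not in any
// connection. A device attaching late replays history, then goes live.
type ChannelMsg = { id: number; content: string };

class Channel {
  private history: ChannelMsg[] = [];
  private subscribers: Array<(m: ChannelMsg) => void> = [];

  publish(m: ChannelMsg): void {
    this.history.push(m);
    this.subscribers.forEach((fn) => fn(m));
  }

  // Catch up from history, then receive live messages -- the same path
  // whether this is a reconnect, a new tab, or a different device.
  attach(onMessage: (m: ChannelMsg) => void): void {
    this.history.forEach(onMessage);
    this.subscribers.push(onMessage);
  }
}
```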

This is a different architectural model than resuming an HTTP stream. For most teams, that realisation arrives after the first implementation is already in production.

Figure: Ably shared session channel routing AI messages across devices without interrupting the response

When resumable streaming matters most

Not every streaming application needs this. For short-lived, single-session interactions on stable connections, standard HTTP streaming is fine.

Resume becomes critical under specific conditions:

Mobile clients handle network handoffs between WiFi and cellular constantly. Each one is a potential disconnection.

Long responses – anything over 30 seconds – have a high probability of encountering a transient failure.

Multi-device usage means the conversation needs to live in a channel, not a connection.

Multi-agent systems, where several agents publish updates to a shared channel. A reconnecting client needs to catch up on everything all agents published, not just the primary response thread.

The alternative is forcing users to restart on every interruption. That breaks trust fast, and the cost compounds on longer or more complex tasks where restarting is most painful.

What you're actually signing up for to build this

Teams that have shipped resumable streaming in production describe a consistent arc: the first implementation takes a week, the edge cases take a month, and cross-device reliability is still not fully solved six months later.

The full scope of a production-grade build: session management, message storage with efficient retrieval by ID range, client-side deduplication, gap detection, distributed routing, cache invalidation, buffer expiry, and monitoring to surface issues you can't reproduce locally.

Good transport infrastructure handles duplicates and gaps automatically. Application logic shouldn't need to check for either – that's the infrastructure's job.

Build vs infrastructure

Building resumable streaming yourself is a reasonable choice if you have a stable team, time to maintain it, and no multi-device or distributed requirements.

It's a harder choice than the SSE documentation makes it look. One team described spending several weeks on custom session management and still not fully solving cross-device reliability. The problems weren't obvious in the design phase – they appeared under mobile network conditions, under load, and when users did things the system wasn't built to handle.

The alternative is transport infrastructure that implements resume as part of the platform. You keep control of your LLM, prompts, and application logic. Session continuity, offset management, ordered delivery, and multi-device state become infrastructure concerns rather than application concerns.

Both paths are defensible. The costs of building are real and most of them are invisible until the first deploy.


Streaming responses between AI agents and clients? Ably AI Transport includes resumable token streaming, automatic replay, and channel-based delivery with guaranteed ordering. Docs go deeper.
