WebSocket is arguably the most widely-used protocol to power realtime user experiences such as chat and to transmit instant updates such as live sports data. WebSocket connections enable bidirectional, full-duplex communication between client and server, to send and receive data simultaneously.
Sequence for a websocket connection/disconnection
If you’re using WebSockets and your production system needs to handle thousands or more concurrent connections, you’ll need a fully scalable architecture. This blog post describes the challenges of scaling WebSockets dependably.
Some of the issues associated with scaling include load balancing, fallback strategy, and connection management; these can be exacerbated by use cases that have unpredictable numbers of WebSocket connections.
This article will familiarize you with the issues involved so you can be better prepared to tackle them. We’ll start by comparing how you can scale your WebSocket server layer, either by scaling it up (vertical scaling) or out (horizontal scaling).
Vertical scaling, or scale-up, adds more power (e.g., CPU cores and memory) to an existing server.
At first glance, vertical scaling seems attractive, but imagine you’ve developed an increasingly popular live user experience such as a live sports data app. It’s a success story, but you’re dealing with an ever-growing number of WebSocket connections as more people sign up and use the app. If you have just one WebSocket server, there’s a finite amount of resources you can add to it, which limits the number of concurrent connections your system can handle.
Some server technologies, such as NodeJS, cannot take advantage of extra CPU cores. Running multiple server processes can help, but you’ll need an additional mechanism, such as another intervening reverse proxy, to balance traffic between the server processes.
With vertical scaling, you have a single point of failure. What happens if your WebSocket server fails or you need to do some maintenance and take it offline? It’s not a suitable approach for a product system, which is why the alternative, horizontal scaling, is recommended instead.
Horizontal scaling, or scale-out, involves adding more servers to share the processing workload. If you compare horizontal and vertical scaling, horizontal scaling is a more available model in the long run. Even if a server crashes or needs to be upgraded, you are in a much better position to protect your overall availability because the workload is distributed to the other nodes in the network.
TIP: Horizontal scaling is a superior alternative to vertical scaling because there is no single point of failure or absolute limit to which you can scale.
Horizontal scaling can be challenging but is an effective way to scale your WebSocket application.
To ensure horizontal scalability, you’ll need to tackle some engineering complexity.
A major element is that you need to ensure the servers share the compute burden evenly: you need a load balancer layer that can handle traffic at scale.
Load balancers detect the health of backend resources. If a server goes down, the load balancer redirects its traffic to the remaining operational servers. The load balancer automatically starts distributing traffic whenever you add a new WebSocket server.
Load balancing distributes incoming network traffic across a group of backend servers.
TIP: It’s much more complicated to balance load efficiently and evenly across machines with different configurations, so aim for homogeneity.
Load balancing is an essential part of horizontal scalability, since an effective load balancing strategy endows your architecture with the following characteristics:
Fault tolerance, high availability, and reliability.
Ensures no single server is overworked, which can degrade performance.
Minimizes server response time and maximizes throughput.
Load balancing also enables you to add or remove servers as demand dictates.
A load balancer will follow an algorithm to determine how to distribute requests across your server farm. There are many different ways to load balance traffic, and the algorithm you select will depend on your needs.
Round-robin: Each server gets an equal share of the traffic. For a simplified example, let’s assume we have two servers, A and B. The first connection goes to server A; the second goes to server B; the third goes to server A; the fourth goes to B, and so on.
Weighted round-robin: Each server gets a different share of the traffic based on capacity.
Least-connected: Each server gets a share of the traffic based on how many connections it currently has.
Least-loaded: Each server gets a share of the traffic based on how much load it currently has.
Least response time: Traffic is routed to the server that takes the least time to respond to a health monitoring request (the response speed indicates how loaded a server is). Some load balancers might also factor in each server’s number of active connections.
Hashing methods: Routing is decided by a hash of various data from the incoming connection, such as port number, domain name, and IP address.
Random two choices: The load balancer randomly picks two servers and routes a new connection to the machine with the fewest active connections.
Custom load: The load balancer queries the load on individual servers using the Simple Network Management Protocol (SNMP) and assigns a new connection to the server with the best load metrics. You can define various metrics to look at, such as CPU usage, memory, and response time.
Your choice of algorithm will depend on your most common usage scenario. Consider, for example, a situation where you have the same number of messages sent to all connected clients, such as live score updates. You can use the round-robin approach if your servers have roughly identical computing capabilities and storage capacity.
By contrast, if your typical use case involves some connections being more resource-intensive than others, a round-robin strategy might not distribute the load evenly, and you would find it better to use the least bandwidth algorithm.
TIP: Have a good understanding of your use case, in terms of typical usage patterns and bandwidth, before you choose a load-balancing algorithm.
A sticky session is a load balancing strategy where each user is “sticky” to a specific server. For example, if a user connects to server A, they will always connect to server A, even if another server has less load.
Sticky sessions can be helpful in some situations but can also be fragile and hinder your approach to scale dynamically. For example, if your WebSocket server becomes overwhelmed and needs to shed connections to balance traffic, or if it fails, a sticky client will keep trying to reconnect to it. It’s hard to rebalance a load when sessions are sticky, and it’s more optimal to use non-sticky sessions accompanied by a mechanism that allows your WebSocket servers to share connection state to ensure stream continuity without needing a connection to the same server.
While this article covers WebSockets in particular, there is rarely a one-size-fits-all protocol in large-scale systems. Different protocols serve different purposes better than others. Under some circumstances, you won’t be able to use WebSockets. For example:
Some proxies don’t support the WebSocket protocol or terminate persistent connections.
Some corporate firewalls, VPNs, and networks block specific ports, such as 443 (the standard web access port that supports secure WebSocket connections).
WebSockets are still not entirely supported across all browsers.
Your system needs a fallback strategy, and many WebSocket solutions offer such support. For example, Socket.IO will opaquely try to establish a WebSocket connection if possible and will otherwise fall back to HTTP long polling.
In the context of scale, it’s essential to consider the impact that fallbacks may have on the availability of your system. Suppose you have many simultaneous users connected to your system, and an incident causes a significant proportion of the WebSocket connections to fall back to long polling. Your server will experience greater demand as that protocol is more resource-intensive (increased RAM usage).
To ensure your system’s availability and uptime, your server layer needs to be elastic and have enough capacity to deal with the increased load.
TIP: Falling back to another protocol changes your scaling parameters because stateful WebSockets fundamentally differ from stateless HTTP. An ideal load balancing strategy for WebSockets might not always apply equally well; you may consider different server farms to handle WebSocket vs. non-WebSocket traffic.
In addition to horizontal scaling, you should also consider the elasticity of your WebSocket server layer so that it can cope with unpredictable numbers of end-user connections. Design your system in such a way that it’s able to handle an unknown and volatile number of simultaneous users.
There’s a moderate overhead in establishing a new WebSocket connection — the process involves a non-trivial request/response pair between the client and the server, known as the opening handshake. Imagine tens of thousands or millions of client devices trying to open WebSocket connections simultaneously. Such a scenario leads to a massive burst in traffic, and your system needs to be able to cope.
You should architect your system based on a pattern designed to scale sufficiently to handle unpredictability. One of the most popular and dependable choices is the publish and subscribe (pub/sub) pattern.
Pub/sub provides a framework for message exchange between publishers (typically your server) and subscribers (often, end-user devices).
Publishers and subscribers are unaware of each other, as they are decoupled by a message broker, which usually groups messages into channels (or topics). Publishers send messages to channels, while subscribers receive messages by subscribing to relevant channels.
The pub/sub pattern
Decoupling can reduce engineering complexity. There can be limitless subscribers as only the message broker needs to handle scaling connection numbers. As long as the message broker can scale predictably and reliably, your system can deal with the unpredictable number of concurrent users connecting over WebSockets.
Numerous projects are built with WebSockets and pub/sub; many open-source libraries and commercial solutions combine these elements. Examples include Socket.IO with the Redis pub/sub adapter, SocketCluster.io, or Django Channels.
At a particular scale, you may have to deal with traffic congestion since if the situation is left unchecked, it can lead to cascading failures and even a total collapse of your system.
You need a load shedding strategy to detect congestion and fail gracefully when a server approaches overload by rejecting some or all of the incoming traffic.
Here are a few things to have in mind when shedding connections:
You should run tests to discover the maximum load that your system is generally able to handle. Anything beyond this threshold should be a candidate for shedding.
You must consider a backoff mechanism to prevent rejected clients from attempting to reconnect immediately; this would just put your system under more pressure.
You might also consider dropping existing connections to reduce the load on your system; for example, the idle ones (which, even though idle, are still consuming resources due to heartbeats).
TIP: You need to have a load shedding strategy; failing gracefully is always better than a total collapse of your system.
Connections inevitably drop, for example, as users lose connectivity or if one of your servers crashes or sheds connections. When scenarios like these occur, WebSocket connections need to be restored.
You could implement a script to reconnect clients automatically. However, suppose reconnection attempts occur immediately after the connection closes. If the server does not have enough capacity, it can put your system under more pressure and lead to cascading failures.
An improvement would be to exponentially increase the delay after each reconnection attempt, increasing the waiting time between retries to a maximum backoff time. Compared to a simple reconnection script, this is better because it gives you some time to add more capacity to your system so that it can deal with all the WebSocket reconnections.
You can improve exponential backoff by making it random, so not all clients reconnect simultaneously.
TIP: Use a random exponential backoff mechanism when handling reconnections to protect your server layer from being overwhelmed, prevent cascading failures, and allow time to add more capacity.
Data integrity (guaranteed ordering and exactly-once delivery) is crucial for some use cases. Once a WebSocket connection is re-established, the data stream must resume precisely where it left off. Think, for example, of features like live chat, where missing messages due to a disconnection or receiving them out of order leads to a poor user experience and causes confusion and frustration.
If resuming a stream exactly where it left off after brief disconnections is essential to your use case, you’ll need to consider how to cache messages and whether to transfer data to persistent storage. You’ll also need to manage stream resumes when a WebSocket client reconnects and think about how to synchronize the connection state across your servers.
TIP: Some connections will inevitably break at some point. Determine your strategy to ensure that after WebSocket connections are restored, you can resume the stream with guaranteed ordering and (preferably exactly once) delivery.
The WebSocket protocol natively supports control frames known as Ping and Pong. These control frames are an application-level heartbeat mechanism to detect whether a WebSocket connection is alive. At scale, you should closely monitor heartbeats’ effect on your system. Thousands or millions of concurrent connections with a high heartbeat rate will add a significant load to your WebSocket servers. If you examine the ratio of Ping/Pong frames to actual messages sent over WebSockets, you might send more heartbeats than messages. If your use case allows, reduce the frequency of heartbeats to make it easier to scale.
TIP: Keep track of idle connections and close them. Even if no messages (text or binary frames) are being sent, you are still sending ping/pong frames periodically, so even idle connections consume resources.
Backpressure is one of the critical issues you will have to deal with when streaming data to client devices at scale over the internet. For example, let’s assume you are streaming 20 messages per second, but a client can only handle 15 messages per second. What do you do with the remaining five messages per second that the client cannot consume?
You need a way to monitor the buffers building up on the sockets used to stream data to clients and ensure a buffer never grows beyond what the downstream connection can sustain. Beyond client-side issues, if you don’t actively manage buffers, you’re risking exhausting the resources of your server layer — this can happen very fast when you have thousands of concurrent WebSocket connections.
A typical backpressure corrective action is to drop packets indiscriminately. To reduce bandwidth and latency, in addition to dropping packets, you should consider something like message delta compression, which generally uses a diff algorithm to send only the changes from the previous message to the consumer rather than the entire message.
We have discussed the challenge of scaling WebSockets for production quality apps and online services, which can involve significant engineering complexity and draw heavily upon resources and time. You could spend several months and a heap of money.
Ably’s serverless WebSockets platform abstracts the worry of building a realtime solution so that you can deliver a live experience without delays in going to market, runaway costs, or unhappy users. Using Ably allows your engineering teams to focus on core product innovation without building out realtime infrastructure. Our existing customers have already developed massively scalable realtime apps with us, with rapid go-to-market and average savings of $500k on build cost.
Ably extends a basic serverless WebSockets solution to augment your product. Furthermore, while Ably’s native protocol is WebSocket, there is rarely a one-size-fits-all protocol: different protocols serve different purposes better than others. Ably offers multiple protocols such as WebSocket, MQTT, SSE, and raw HTTP.
To support live experiences, Ably adds features such as device presence, stream history, channel rewind, and handling for abrupt disconnections, making building rich realtime applications easier. There is a range of webhook integrations for triggering business logic in realtime. We offer a gateway to serverless functions from your preferred cloud service providers. You can deploy best-in-class tooling across your entire stack and build event-driven apps using the ecosystems you’re already invested in.
Why not get in touch to find out how we can work with you?
WebSockets vs. HTTP: Comparing pros and cons
An overview of the HTTP and WebSocket protocols, including their pros and cons, and the best use cases for each protocol.
What Server-Sent Events is - and how and when to implement it
This article explores what server-sent events (SSE) is, how it works, when to use it, and key considerations and challenges to be aware of.
Scaling Socket.IO - practical considerations
A review of Socket.IO’s advantages, limitations & performance. Learn about the challenges of using Socket.IO to deliver realtime apps at scale.