20 min read · Published Aug 24, 2022

Scaling Socket.IO - practical considerations

Written by: Alex Diaconu

Socket.IO is a well-known realtime library that enables bidirectional, low-latency communication between web clients and web servers. Built on top of WebSockets, Socket.IO provides additional capabilities such as automatic reconnections or falling back to HTTP long polling. Socket.IO is mainly used for developing live and collaborative features like chat, document collaboration, planning apps, multiplayer games, and many more. 

There are plenty of demos and small-scale projects built with Socket.IO around. But is Socket.IO a good choice for engineering production-ready systems that can scale to handle millions of concurrently connected users? That’s what we’ll assess in this blog post, by looking at Socket.IO’s advantages and limitations, its performance, how to scale it, and who uses it at scale.

Socket.IO advantages

Let’s start by reviewing Socket.IO’s features that are essential (or at least desirable) for building scalable realtime apps. 

Multiplexing and broadcast support

Socket.IO supports multiplexing through namespaces. A namespace is essentially a communication channel that allows you to combine (and split) your app logic over a single shared connection. Making use of namespaces enables you to minimize the number of TCP connections used, and save socket ports on the server. 
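To illustrate the idea of multiplexing, here is a minimal, dependency-free sketch in which packets arriving over one shared connection are tagged with a namespace and dispatched to the matching handler. It is a conceptual model only, not Socket.IO’s actual wire protocol or implementation; the `Multiplexer` class, the packet shape, and the `/chat` and `/admin` namespace names are all hypothetical.

```javascript
// Conceptual sketch only: Socket.IO's real packet format differs.
class Multiplexer {
  constructor() {
    this.handlers = new Map(); // namespace -> handler function
  }

  // Register the handler that owns a namespace.
  of(namespace, handler) {
    this.handlers.set(namespace, handler);
    return this;
  }

  // Called for every packet arriving on the single shared connection.
  receive(packet) {
    const handler = this.handlers.get(packet.nsp);
    if (!handler) return false; // unknown namespace: drop the packet
    handler(packet.event, packet.data);
    return true;
  }
}

const chatLog = [];
const adminLog = [];
const mux = new Multiplexer()
  .of("/chat", (event, data) => chatLog.push([event, data]))
  .of("/admin", (event, data) => adminLog.push([event, data]));

// Two logical channels share one physical connection.
mux.receive({ nsp: "/chat", event: "message", data: "hi" });
mux.receive({ nsp: "/admin", event: "kick", data: "user-42" });
```

In Socket.IO itself, a namespace is obtained on the server with `io.of("/chat")`, and the library takes care of the tagging and dispatching.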

Socket.IO also allows the server to flexibly broadcast events to clients. You can broadcast to all connected clients:

io.emit("hello", "world");

You can also broadcast to all connected clients, with the exception of the sender:

io.on("connection", (socket) => {
 socket.broadcast.emit("hello", "world");
});

Finally, Socket.IO offers a room feature, which you can use to broadcast events to a subset of clients (who have joined the room):

io.on("connection", (socket) => {
 socket.to("some room").emit("some event");
});

Disconnection detection and automatic reconnections

In a large-scale system that facilitates communication over the public internet, disconnections will inevitably occur. To deal with such scenarios, Socket.IO provides a configurable Ping/Pong heartbeat mechanism, allowing you to detect if a connection is alive or not. Additionally, if and when a client gets disconnected, it automatically reconnects. You can configure an exponential back-off delay for reconnections, so your servers don’t get overwhelmed.
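As a rough illustration of what an exponential back-off with jitter computes, consider the sketch below. It is not Socket.IO’s internal code; the `backoffDelay` function and its parameter names are hypothetical, and the default values simply mirror the Socket.IO client’s `reconnectionDelay`, `reconnectionDelayMax`, and `randomizationFactor` options.

```javascript
// Sketch of exponential back-off with jitter; not Socket.IO's internal code.
// Defaults mirror the Socket.IO client options reconnectionDelay (1000 ms),
// reconnectionDelayMax (5000 ms) and randomizationFactor (0.5).
function backoffDelay(attempt, options = {}) {
  const {
    base = 1000,
    max = 5000,
    jitter = 0.5,
    random = Math.random, // injectable so the function can be tested
  } = options;

  // Double the delay on every failed attempt: base, 2*base, 4*base, ...
  const exponential = base * 2 ** attempt;
  // Spread clients out so they do not all reconnect at the same instant.
  const spread = 1 + jitter * (2 * random() - 1); // in [1 - jitter, 1 + jitter)
  // Never wait longer than the configured maximum.
  return Math.min(Math.round(exponential * spread), max);
}
```

The jitter matters at scale: without it, thousands of clients dropped at the same moment would all retry in lockstep.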

Note: When broadcasting, if a client is disconnected, they won’t receive the event. Storing the event in a database (so it can be resent when the client reconnects) is something you will have to manage separately.
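One way to approach this, sketched under simplifying assumptions, is to give every broadcast event an increasing offset and let a reconnecting client ask for everything after the last offset it saw. The `EventLog` class below is hypothetical, and its in-memory array merely stands in for a real database.

```javascript
// In-memory stand-in for a database table of broadcast events (assumption).
class EventLog {
  constructor() {
    this.events = [];
    this.nextOffset = 1;
  }

  // Persist an event and return the offset assigned to it.
  append(name, payload) {
    const offset = this.nextOffset++;
    this.events.push({ offset, name, payload });
    return offset;
  }

  // Everything a client missed after the last offset it acknowledged.
  since(lastSeenOffset) {
    return this.events.filter((e) => e.offset > lastSeenOffset);
  }
}

const log = new EventLog();
log.append("hello", "world"); // offset 1, seen by the client
log.append("price", 101);     // offset 2, sent while the client was offline
log.append("price", 102);     // offset 3, sent while the client was offline

// On reconnect, the client reports the last offset it received (1 here).
const missed = log.since(1);
```

A production version would also need a retention policy, since the log cannot grow forever.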

Adapters for horizontal scaling

When you scale horizontally, you need to use a mechanism to keep your Socket.IO servers in sync. This is required so that events are properly routed to all clients, even if they are connected to different servers. Socket.IO officially provides several adapters (integrations) with various backend solutions you can use to keep your servers in sync: Redis, MongoDB, Postgres. There are also a few community-maintained adapters (AMQP/RabbitMQ, NATS). 

Fallback support

Some environments (such as corporate networks with proxy servers) will block WebSocket connections. To deal with these situations, Socket.IO supports HTTP long-polling as a fallback. 

Socket.IO limitations

We’ll now dive into Socket.IO’s known limitations; these are aspects to bear in mind if you’re planning to build and deliver realtime features at scale.

Memory leak issues

Socket.IO is known to have some issues related to memory leaks. There are a couple of ways to mitigate these issues, but they come at a cost. One (part) of the solution is to disable the perMessageDeflate (message compression & decompression) option. Note that the ability to compress WebSocket messages is desirable, since it can help significantly reduce network traffic.

Another way to avoid or at least reduce memory leak issues might be to disable the HTTP long-polling fallback, and only use WebSockets (although this doesn’t seem to solve the problem for all users). This isn’t a great solution, especially if you have clients connecting from networks where WebSockets aren’t supported; your service would be unavailable to them.
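For reference, the two mitigations described above could be expressed as the following options. This is a configuration sketch rather than a recommendation; the variable names are hypothetical, and the objects are shown standalone (without importing the `socket.io` package) so the fragment stays self-contained.

```javascript
// Options as they could be passed to `new Server(httpServer, serverOptions)`
// from the "socket.io" package (not imported here, to keep the sketch standalone).
const serverOptions = {
  // Turn off WebSocket per-message compression to reduce memory pressure.
  perMessageDeflate: false,
  // Accept WebSocket only; clients that cannot use it will fail to connect.
  transports: ["websocket"],
};

// Matching client-side options for `io(url, clientOptions)`:
// skip the initial HTTP long-polling phase entirely.
const clientOptions = {
  transports: ["websocket"],
};
```

Both settings trade resilience or bandwidth for a smaller memory footprint, so it is worth measuring their effect before adopting them.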

Message ordering is not guaranteed

If the order in which messages are delivered is important to your use case, then Socket.IO might not be a great fit. You could make Socket.IO’s WebSocket transport more reliable, by adding a sequential ID (serial number) to your message payload, to ensure that the receiver processes messages in order.  

However, you have no strong ordering guarantees if Socket.IO falls back to HTTP long-polling, as it’s possible for multiple HTTP requests from the same client to be in flight simultaneously. Due to various factors, such as network congestion, there’s no guarantee that the requests issued by the client will be processed in the right order by the server.
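The sequential-ID approach mentioned above could be sketched as a small reordering buffer on the receiving side. This is an illustrative, dependency-free example; the `InOrderReceiver` class and the `{ seq, payload }` message shape are assumptions, and the sketch presumes the sender numbers messages from 1 without gaps.

```javascript
// Receiver-side reordering buffer keyed on a sender-assigned sequence number.
class InOrderReceiver {
  constructor(onMessage) {
    this.onMessage = onMessage; // called with payloads in sequence order
    this.expected = 1;          // next sequence number to deliver
    this.pending = new Map();   // seq -> payload, for early arrivals
  }

  receive({ seq, payload }) {
    // Ignore anything already delivered or already buffered.
    if (seq < this.expected || this.pending.has(seq)) return;
    this.pending.set(seq, payload);
    // Flush every consecutive message that is now available.
    while (this.pending.has(this.expected)) {
      this.onMessage(this.pending.get(this.expected));
      this.pending.delete(this.expected);
      this.expected += 1;
    }
  }
}

const delivered = [];
const receiver = new InOrderReceiver((payload) => delivered.push(payload));

// Messages arrive out of order: 2, 1, 3.
receiver.receive({ seq: 2, payload: "b" });
receiver.receive({ seq: 1, payload: "a" });
receiver.receive({ seq: 3, payload: "c" });
```

Note that a message that never arrives would stall this buffer indefinitely, so a real implementation also needs a timeout or a retransmission request.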

Limited platform support

Socket.IO is not compatible with other WebSocket implementations. Per the Socket.IO documentation:

"Although Socket.IO indeed uses WebSocket for transport when possible, it adds additional metadata to each packet. That is why a WebSocket client will not be able to successfully connect to a Socket.IO server, and a Socket.IO client will not be able to connect to a plain WebSocket server either."

For a long time, Socket.IO officially only provided a Node.js server implementation, and a JavaScript client. More recently, the Socket.IO creators have also developed several other clients, in Java, C++, and Swift. 

Beyond the officially-supported implementations, there are multiple other Socket.IO client and server implementations, maintained by the community, covering languages and platforms such as Python, Golang, or .NET. 

However, most of the community-maintained Socket.IO packages come with limitations: some are not actively maintained, or have a limited feature set. For example, the Golang server implementation only supports Socket.IO version 1.4 (although, nowadays, Socket.IO has reached version 4.x), and has a limited feature set: rooms, namespaces, and broadcast. Another example is the Socket.IO C++ client, which only supports WebSockets (no fallback to HTTP long-polling). 

If you’re pondering whether to select Socket.IO for your project, this is something to bear in mind: depending on what languages you want to use on the client and server side, you might end up working with a limited feature set, or even have to build your own Socket.IO library, which is not only time-consuming but requires an ongoing commitment to maintenance.

Limited native security capabilities

Socket.IO has some limitations around security features:

  • End-to-end encryption. If this is important to your use case, you will have to build or integrate your own end-to-end encryption capability, as this is something Socket.IO doesn’t support out of the box. 

  • Authentication. While Socket.IO allows clients to send credentials, it does not provide a mechanism to generate and renew tokens - you will need to use a separate service that is responsible for authentication.
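To make the authentication point concrete, here is a minimal sketch of the kind of token logic you would have to supply yourself, using only Node.js’s built-in `crypto` module. The token format (`userId.expiresAt.signature`), the `issueToken` and `verifyToken` helpers, and the `SECRET` name in the comment are all hypothetical; the sketch also assumes user IDs contain no dots. In practice, you would more likely rely on an established standard and a dedicated auth service.

```javascript
const crypto = require("node:crypto");

// Sign "<userId>.<expiresAt>" with a shared secret (hypothetical token format).
function sign(body, secret) {
  return crypto.createHmac("sha256", secret).update(body).digest("hex");
}

function issueToken(userId, secret, ttlMs, now = Date.now()) {
  const body = `${userId}.${now + ttlMs}`;
  return `${body}.${sign(body, secret)}`;
}

// Returns the user id for a valid, unexpired token; otherwise null.
function verifyToken(token, secret, now = Date.now()) {
  const parts = String(token).split(".");
  if (parts.length !== 3) return null;
  const [userId, expiresAt, signature] = parts;
  const expected = Buffer.from(sign(`${userId}.${expiresAt}`, secret));
  const actual = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (actual.length !== expected.length) return null;
  if (!crypto.timingSafeEqual(actual, expected)) return null;
  return Number(expiresAt) > now ? userId : null;
}

// Possible hook-up in a Socket.IO middleware (requires the "socket.io" package):
// io.use((socket, next) => {
//   const userId = verifyToken(socket.handshake.auth.token, SECRET);
//   next(userId ? undefined : new Error("unauthorized"));
// });
```

Renewal is the part that tends to be overlooked: once a token expires, the client needs a way to obtain a fresh one and reconnect with it.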

Single point of failure

When you scale beyond one single Socket.IO server, you need to use a tool such as Redis to keep your servers in sync, and ensure that updates are delivered to all relevant clients. However, like all services, you have to expect maintenance windows and some unexpected downtime. When your Redis server (or whatever similar solution you plan to use) is down, your system will also be severely affected. 

Single-region design

Socket.IO is designed to work in a single region, rather than a multi-region architecture. To clarify, I’m not saying it’s impossible to build a globally-distributed Socket.IO architecture, but it would be extremely complex to engineer and manage at scale, and there is no evidence that anyone has achieved this. 

Socket.IO’s single-region design can lead to issues such as:

  • Increased latency. If you’re building a game or a financial services platform where latency matters, you have a problem if your users are not near your servers. If, for example, you have two users playing a realtime game in Australia, yet your Socket.IO servers are located in Europe, every message published will need to go halfway around the world and back.

  • System downtime. What happens if the datacenter where you have your Socket.IO servers goes through an outage? Your system would experience downtime, and become unavailable to users. 

Sticky load balancing

If you scale beyond one Socket.IO server and plan to support HTTP long-polling as a fallback, you will have to use sticky sessions. The problem with sticky sessions is that they hinder your ability to scale dynamically. For example, let’s assume some of your Socket.IO servers are overwhelmed and need to shed connections. Even if your system can scale dynamically, the dropped Socket.IO clients would keep trying to reconnect to the same (overwhelmed) servers, rather than connecting to new ones. 

It’s much more efficient to dynamically scale your Socket.IO server layer when you aren’t using sticky load balancing, but that means you can’t rely on any fallback; if the WebSocket transport isn’t supported, then your app would be unavailable to users. It’s a trade-off you will have to consider.  
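The reconnection problem described above follows directly from how sticky routing works. The sketch below is a simplified, hypothetical model (real load balancers typically use a cookie or a hash of the client IP); the `hash` and `pickServer` functions, the server names, and the IP address are made up for illustration.

```javascript
// Deterministic hash of a client identifier (e.g. its IP address).
function hash(text) {
  let h = 0;
  for (const ch of text) {
    h = (h * 31 + ch.codePointAt(0)) >>> 0; // keep it an unsigned 32-bit int
  }
  return h;
}

// Sticky routing: the same client always maps to the same server.
function pickServer(clientId, servers) {
  return servers[hash(clientId) % servers.length];
}

const servers = ["socket-1", "socket-2", "socket-3"];
const first = pickServer("203.0.113.7", servers);
const retry = pickServer("203.0.113.7", servers); // reconnect attempt
```

Because `first` and `retry` are always the same server, a client shed by an overloaded node lands right back on it for as long as the routing input stays the same.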

Socket.IO performance review

This section covers aspects related to Socket.IO’s performance; specifically, how it performs compared to some WebSocket alternatives, and how you can benchmark it yourself. 

Socket.IO vs WebSocket vs SockJS

A study published in 2020 analyzes the performance differences between Socket.IO, SockJS, and plain WebSockets. The author measured the following:

  • How does the time to establish a connection change with different levels of clients?

  • How does the time to receive a message change with different levels of clients?

  • How does the memory usage on the server change with different levels of clients?

The test bench consisted of a workstation (to spawn clients), and a server to manage the incoming requests, and send data back to the clients. This was the hardware used:

| Machine | CPU | Memory | OS |
| --- | --- | --- | --- |
| Workstation | Intel Core i5-4690K | 12 GB of DDR3 1333 MHz | Ubuntu 19.04 |
| Server | Intel Core i3-8300 | 8 GB of DDR4 2400 MHz | Ubuntu 18.04 LTS |

The key findings are presented below.

Time to establish a connection

Time to establish a connection. Source: https://go.ably.com/bcd

When it comes to how long it takes to establish a new connection, plain WebSockets were roughly three times faster than both Socket.IO and SockJS, up to 1,000 clients. With 10,000 concurrent clients, it took 411 ms to open a new plain WebSocket connection. In comparison, it took 519 ms with Socket.IO. 

The difference arises because opening a new Socket.IO connection involves additional steps compared to opening a vanilla WebSocket one.

Note that SockJS could not handle more than 7,000 concurrent clients (470 ms to open a new connection at this concurrency level).  

Time to receive a message after the connection has been established

Time to receive a message after connection establishment. Source: https://go.ably.com/bcd

With 1,000 concurrent clients, it took the following amount of time to receive a message from the server once a new connection was established:

  • 73 ms for plain WebSockets

  • 47 ms for SockJS

  • 525 ms for Socket.IO

SockJS was not able to handle more than 7,000 concurrent connections. However, up to about 5,000 concurrent clients, it performed significantly better than plain WebSockets and Socket.IO, as shown in the graph above.

Memory usage on the server

Memory usage on the server. Source: https://go.ably.com/bcd

The following amount of memory was roughly needed to handle 1,000 simultaneous clients:

  • 80 MB when plain WebSockets were used

  • 94 MB when SockJS was used

  • 200 MB when Socket.IO was used

SockJS reached its limit at 7,000 clients (241 MB required). 201 MB were needed to sustain 10,000 concurrent clients connected over vanilla WebSockets. As for Socket.IO? Almost 2 GB.

There are significant differences in the amount of memory required to handle the same number of clients. The test shows that plain WebSockets and SockJS have a much lower memory requirement compared to Socket.IO. This could be interpreted as proof that Socket.IO doesn’t scale very well with high concurrency. However, some differences are to be expected to begin with; after all, Socket.IO is a much more complex (and demanding) solution than raw WebSockets and the simplistic SockJS library. 
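To put the figures quoted above on a comparable footing, the snippet below divides total memory by the number of clients. It is a naive average that ignores each server’s fixed baseline footprint, and it approximates “almost 2 GB” as 2,000 MB, so treat the results as indicative only.

```javascript
// Figures quoted above; "almost 2 GB" is approximated as 2,000 MB (assumption).
const measurements = [
  { library: "WebSocket", clients: 1000, memoryMB: 80 },
  { library: "SockJS", clients: 1000, memoryMB: 94 },
  { library: "Socket.IO", clients: 1000, memoryMB: 200 },
  { library: "WebSocket", clients: 10000, memoryMB: 201 },
  { library: "Socket.IO", clients: 10000, memoryMB: 2000 },
];

// Naive average: total memory divided by client count (1 MB = 1,000 KB here).
const perClient = measurements.map((m) => ({
  ...m,
  kbPerClient: Math.round((m.memoryMB * 1000) / m.clients),
}));
```

On these numbers, the Socket.IO average stays at roughly 200 KB per client at both concurrency levels, while the plain WebSocket average drops from about 80 KB to about 20 KB per client as the fixed overhead is spread over more connections.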

How to test Socket.IO

Many factors can influence Socket.IO’s performance: the hardware used, changing network conditions, number of concurrent connections, frequency of heartbeats, message frequency, and many more. It’s essential to test Socket.IO yourself, to understand if it meets the performance requirements of your specific use case.  

You can use a testing framework to load test Socket.IO. One of the most popular choices is the Artillery toolkit; see the Artillery documentation on Socket.IO for more details. 

Alternatively, the Socket.IO documentation provides a basic script you can use as a starting point.

How to scale Socket.IO

Let’s now take a closer look at how to scale Socket.IO, and the challenges you’ll face along the way. 

How many connections can a Socket.IO server handle?

A question that often comes up is: what’s the maximum number of connections a single Socket.IO server can handle? As we saw in the performance section of this article, a Socket.IO server can sustain 10,000 concurrent connections. However, this number is only indicative; your Socket.IO server may be able to deal with more connections or fewer, depending on factors such as the hardware used, or how “chatty” the connections are. 

In any case, “How many connections can a Socket.IO server handle?” is not the best question to ask, and that's because scaling up has some serious practical limitations. 

Regardless of how good your hardware is, you can only scale it vertically up to a finite capacity. What happens if, at some point, the number of concurrent connections proves too much to handle for the server? With vertical scaling, you have a single point of failure. 

In contrast, horizontal scaling is a more dependable model in the long run. Even if a server crashes or needs to be upgraded, you are in a much better position to protect your system’s overall availability since the workload of the machine that failed is distributed to the other Socket.IO servers.

How to scale Socket.IO to multiple servers

Scaling Socket.IO horizontally means you will need to add a few new components to your system architecture:

  • A load balancing layer. Fortunately, popular load-balancing solutions such as HAProxy, Traefik, and NGINX all support Socket.IO.

  • A mechanism to pass events between your servers. Socket.IO servers don’t communicate with each other, so you need a way to route events to all clients, even if they are connected to different servers. This is made possible by using adapters, of which the Redis adapter seems to be the most popular choice.

Below is a basic example of a potential architecture:

Scaling Socket.IO horizontally with Redis

The load balancer handles incoming Socket.IO connections and distributes the load across multiple nodes. By making use of the Redis adapter, which relies on the Redis Pub/Sub mechanism, Socket.IO servers can send messages to a Redis channel. All other Socket.IO nodes are subscribed to the respective channel to receive published messages and forward them to relevant clients. 

For example, let’s say we have chat users Bob and Alice. They are connected to different Socket.IO servers (Bob to server 1, and Alice to server 2). Without Redis as a sync mechanism, if Bob were to send Alice a message, that message would never reach Alice, because server 2 would be unaware that Bob has sent a message to server 1.
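The Bob and Alice scenario can be modeled in a few lines. In the sketch below, an in-memory `Bus` class stands in for a Redis Pub/Sub channel, and the `ServerNode` class is a hypothetical simplification of a Socket.IO server with an adapter; neither reflects the actual Redis adapter code.

```javascript
// In-memory stand-in for a Redis Pub/Sub channel (assumption, for illustration).
class Bus {
  constructor() {
    this.subscribers = [];
  }
  subscribe(fn) {
    this.subscribers.push(fn);
  }
  publish(message) {
    this.subscribers.forEach((fn) => fn(message));
  }
}

// Hypothetical server node: knows only its own locally connected clients.
class ServerNode {
  constructor(name, bus) {
    this.name = name;
    this.bus = bus;
    this.clients = new Map(); // user -> inbox of received messages
    // Every node listens to the shared channel and delivers locally.
    bus.subscribe((msg) => this.deliverLocally(msg));
  }
  connect(user) {
    this.clients.set(user, []);
  }
  deliverLocally({ to, from, text }) {
    const inbox = this.clients.get(to);
    if (inbox) inbox.push({ from, text }); // ignore users on other nodes
  }
  // Instead of delivering directly, publish so that all nodes see the message.
  send(from, to, text) {
    this.bus.publish({ from, to, text });
  }
}

const bus = new Bus();
const server1 = new ServerNode("server 1", bus);
const server2 = new ServerNode("server 2", bus);
server1.connect("Bob");
server2.connect("Alice");

server1.send("Bob", "Alice", "hello"); // Bob is on server 1, Alice on server 2
const aliceInbox = server2.clients.get("Alice");
```

The sketch also hints at the cost of this design: every node receives every published message, whether or not it has a client that needs it.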

What makes it difficult to scale Socket.IO?

There are many hard engineering challenges to overcome and aspects to consider if you plan to build a scalable, production-ready system with Socket.IO. Here are some key things you need to be aware of and address:

  • Complex architecture and infrastructure. Things are much simpler when you have only one Socket.IO server to deal with. However, when you scale beyond a single server, you have an entire cluster to manage and optimize, plus a load balancing layer, plus Redis (or an equivalent sync mechanism). You need to ensure all these components are working and interacting in a dependable way. 

  • High availability and elasticity. For example, if you are streaming live sports updates, and it's the World Cup final or the Super Bowl, how do you ensure your system is capable of handling tens of thousands (or even millions!) of users simultaneously connecting and consuming messages at a high frequency? You need to be able to quickly (automatically) add more servers into the mix so your system is highly available, and ensure it has enough capacity to deal with potential usage spikes and fluctuating demand.

  • Fault tolerance. What can you do to ensure fault tolerance and redundancy? We’ve already mentioned in this article that Socket.IO is single-region by design, so that’s a limitation from the start.

  • Managing connections and messages. There are many things to consider and decide: how are you going to terminate (shed) and restore connections? How will you manage backpressure? What’s the best load balancing strategy for your specific use case? How frequently should you send heartbeats? How are you going to monitor connections and traffic?

Successfully dealing with all of the above and maintaining a scalable, production-ready system built with Socket.IO is far from trivial. See the next section for details of a real-life Socket.IO implementation, and get a taste of how hard it is to scale it dependably. 

Use case: scaling a chat system with Socket.IO

Crisp provides customer service software, including a chatbox service that can be embedded into websites. Users connect to Crisp servers over Socket.IO’s WebSocket transport to interact with this chatbox service.

For about six years, Crisp used a simplistic topology, which proved to be stable. Here it is:

Crisp’s Socket.IO + RabbitMQ old topology. Source: https://go.ably.com/akk

Microservices emit payloads that are published to AMQP (RabbitMQ) queues. Socket.IO servers consume from these queues and forward the events to chatbox users.

Since the Crisp team doesn’t know which upstream server a chatbox is connected to, events that must be delivered to clients can’t be routed selectively; they are broadcast to all Socket.IO workers (16 servers in total). Furthermore, as each queue is bound to one Socket.IO server, RabbitMQ has to clone each published message for every queue.

As you would expect, this setup is highly resource-intensive for Socket.IO servers (which receive all events happening on the platform), but especially for RabbitMQ. 

As their user base grew, Crisp realized they couldn’t scale this topology further. So they underwent a bumpy optimization journey. They initially tried to make minimal changes, by using routing keys: event publishers would attach a routing key to each payload, and RabbitMQ would only forward the event to the queue(s) bound with the same routing key. 

They deployed this new setup to production, and things seemed to be working well (aside from some fluctuating RabbitMQ memory usage, which wasn’t regarded as an issue). However, Crisp had to restart a Socket.IO node for maintenance. Once they did that, their entire infrastructure was affected:

"Crisp messages were not going through any more, user availabilities were not being processed any more, etc. We connected to the RabbitMQ Web Management dashboard, to discover with horror that it was unavailable, throwing HTTP 500 errors. All AMQP queues were stuck, and did not accept any publish order anymore."

The Crisp team discovered that all RabbitMQ  “CPUs were going through the roof and memory was swapping like crazy”. Even after increasing the capacity of their RabbitMQ cluster 4X, all RabbitMQ nodes still froze upon stopping a single Socket.IO worker (even though this time they had plenty of CPU and RAM capacity left). It became apparent that tearing down a queue with a lot of routing keys (40,000+) was something RabbitMQ could not handle.

Crisp had to roll back to the initial non-routed topology while exploring other ways to optimize their architecture. 

In the end, the Crisp team solved the issue by using prefix routing keys, thus limiting the number of possible routing keys to a predictable number (256 routing prefixes, to be more exact). In addition, Crisp are now using affinity routing; this lets the client include the routing key in the URL, which effectively groups Crisp chatboxes with the same routing key to the same Socket.IO worker. 

In essence, RabbitMQ messages are now only forwarded to the Socket.IO server responsible for the routing prefix set in the message routing key. Clients are routed to the correct server thanks to the affinity key passed in the URL. 
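The combination of prefix routing keys and affinity routing might look something like the sketch below. The article does not say how Crisp derives its prefixes, so the hashing step, the function names, the routing key layout, the URL, and the use of 16 workers (the server count mentioned earlier) are all assumptions made for illustration.

```javascript
const crypto = require("node:crypto");

// Map any session id onto one of 256 prefixes ("00" to "ff").
// Hashing is an assumption; the article does not say how Crisp derives prefixes.
function routingPrefix(sessionId) {
  const firstByte = crypto.createHash("sha256").update(sessionId).digest()[0];
  return firstByte.toString(16).padStart(2, "0");
}

// Assign each prefix to a worker; with 16 workers, each owns 16 prefixes.
function workerForPrefix(prefix, workerCount) {
  return parseInt(prefix, 16) % workerCount;
}

const prefix = routingPrefix("session-1234"); // hypothetical session id
const worker = workerForPrefix(prefix, 16);

// Publisher side: the routing key starts with the prefix.
const routingKey = `${prefix}.message`; // hypothetical key layout
// Client side: the same prefix is passed in the URL as the affinity key.
const url = `wss://example.invalid/socket?affinity=${prefix}`; // hypothetical URL
```

The key property is that the number of distinct bindings is bounded by the prefix space (256) rather than by the number of connected sessions.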

Crisp’s Socket.IO + RabbitMQ new topology. Source: https://go.ably.com/akk

This new topology enables Crisp to significantly reduce the CPU load on both RabbitMQ and Socket.IO servers. It wasn’t an easy task, but it was necessary:

"Looking back to our early days at Crisp, it would have been overkill to add this kind of complexity to our infrastructure topology earlier than today. However, with the growth of our user base came scaling issues with our initial (simple) broadcast routing system, that made changes necessary to support our growth in the future."

Who uses Socket.IO at scale?

There are plenty of small-scale and demo projects built with Socket.IO covering use cases such as chat, multi-user collaboration (e.g., whiteboards), multiplayer games, fitness & health apps, planning apps, and edTech/e-learning.

However, there is limited evidence that organizations are building large-scale, production-ready systems with Socket.IO. We have the Crisp chatbox example, which we discussed in the previous section.

Trello used Socket.IO in the early days. To be more exact, Trello had to use a modified version of the Socket.IO client and server libraries, because they came across some issues when using the official implementation:

"The Socket.IO server currently has some problems with scaling up to more than 10K simultaneous client connections when using multiple processes and the Redis store, and the client has some issues that can cause it to open multiple connections to the same server, or not know that its connection has been severed."

Even using the modified version of Socket.IO had some hiccups:

"We hit a problem right after launch. Our WebSocket server implementation started behaving very strangely under the sudden and heavy real-world usage of launching at TechCrunch disrupt, and we were glad to be able to revert to plain polling..."

After a while, Trello replaced Socket.IO with another custom implementation that allowed them to use WebSockets at scale:

"At peak, we’ll have nearly 400,000 of them open. The bulk of the work is done by the super-fast ws node module. [...] We have a (small) message protocol built on JSON for talking with our clients - they tell us which messages they want to hear about, and we send ‘em on down."

In addition to Crisp and Trello, Disney+ Hotstar considered using Socket.IO to build a social feed for their mobile app. Socket.IO and NATS were assessed as options, but disqualified in favor of an MQTT broker. 

Beyond the examples above, I couldn’t find proof of any other organization currently building scalable, production-ready systems with Socket.IO. Perhaps they’re trying to avoid the complexity of having to scale Socket.IO themselves (or maybe they’re just not writing about their experiences). 

Wrapping it up

I hope this article helps you gain a good understanding of Socket.IO’s strengths and weaknesses. Without a doubt, Socket.IO is one of the most popular open-source solutions for developing live and collaborative apps (like chat, whiteboards, and multiplayer games), as demonstrated by the multitude of small-scale and demo projects built with it.

However, developing a demo app or a small-scale system with Socket.IO is one thing; using Socket.IO at scale is an entirely different matter. There isn’t much evidence that many organizations are delivering scalable, production-ready apps with Socket.IO. The only notable exception is the Crisp chatbox use case; and while this example demonstrates that you could use Socket.IO at scale, it also shows just how hard it is to do it right, and how much complexity is involved.

Ultimately, it is up to you to assess if Socket.IO is the best choice for your specific use case, or if there are any better alternatives. If you decide to include Socket.IO in your tech stack and plan to use it at scale, you just have to be aware that it won't be a walk in the park. One of the biggest challenges you’d have to face is to dependably scale and manage a complex system architecture and messy infrastructure. Sometimes it’s easier and more cost-effective to offload this complexity to a managed realtime solution that offers similar features to Socket.IO.


About Ably

Ably is a serverless WebSocket platform operating at the edge. We make it easy for developers to build live and collaborative experiences for millions of users in a few lines of code, without the hassle of managing and scaling infrastructure.

Our platform is underpinned by a globally-distributed, autoscaling network consisting of 16 datacenters and 307 edge acceleration points of presence. We reach more than 300 million devices across 80 countries each month. 

Ably provides capabilities and guarantees that match and exceed Socket.IO’s features.

If you’re looking for alternatives to Socket.IO, we invite you to sign up for a free Ably account and see what our platform can do for you.
