Adventures in BEAM optimization with our MQTT adapter

This is a journey of obscure settings, red herrings, and poll vs. epoll. A story of how I had spent a couple days performance profiling and optimizing some Elixir code, only to discover that the true solution to the performance issues was an obscure setting in the Erlang VM that controlled which system call the IO polling subsystem used, and that my code-level optimizations were irrelevant. This particular setting has been made redundant by a rewrite of IO polling in a subsequent Erlang/OTP release, but still, it was a valuable lesson.

Back in 2018, we released our MQTT protocol adapter. Like our Pusher and PubNub adapters, it was implemented as an Elixir service, which sits in front of our normal service frontend and live-translates to and from Ably's own protocol.

In the run-up to the release, we decided to do some load testing to figure out the CPU cost per instance of holding onto substantial numbers of mostly-idle MQTT connections (driven by a customer with a large number of IoT devices).

Initial results were not great.

With 15K idle MQTT connections, on one of our usual instances (an AWS m5.large), CPU usage was reaching 60%, mostly from the adapter (with the remainder from the normal Ably frontend process). So we started trying to figure out why that was.

There's a wide variety of ways you can get performance data out of the BEAM (the Erlang VM). To start with there's the observer, which you can use to view all running processes (Erlang processes, not system processes), ordered by reductions (≈function calls, a proxy for CPU usage) or by memory usage. There are also various tracing facilities, which can trace all calls from particular processes, all calls to a given function from any process, and so on. (The built-in tools can be a little low-level, but there are various more user-friendly wrappers available).

I couldn't see anything obvious from just looking at the observer; no processes appeared to be using an excess of resources. What I really wanted was to trace the whole system, with all function calls, and then generate flame graphs out of that data, to get a sense of what was actually using up all that CPU. Doing that using traditional BEAM tracing of every function call would be very expensive; tracing adds too much overhead to do it for every function call ever. Luckily, Scott Fritchie has a great project called eflame2 https://github.com/slfritchie/eflame (building off Vlad Ki's eflame https://github.com/proger/eflame), which makes some experimental changes to the BEAM to allow it to support time-sampled trace data: generating a stack trace every (say) 100ms, and using that to construct a flame graph.

The results were a little puzzling. It showed the vast majority of time per the flamegraphs spent in SLEEP, even though observed CPU usage was still very high.

Experienced BEAMers may now be thinking "A-ha! I bet this is an sbwt issue!" The BEAM process scheduler by default, when it runs out of Erlang processes to schedule, spins for several milliseconds to stop the OS scheduler from scheduling a different OS process, to reduce latency if new work does need to be done. This is known as scheduler busy wait. This can be reasonable behavior if the BEAM is the only important process running on the box that it's running on — which is often the case. But in our case it isn't, it has to play nicely with our frontend processes, and others as well. This behavior often catches out people who try to benchmark Erlang/Elixir against other languages, making the BEAM appear to consume a lot of CPU, even when average latency is low. (I'm pretty sure this mistake is what led to the conclusion that "Elixir nearly saturated all 36 cores while delivering a surprisingly consistent performance" in this blog post by Stressgrid, for example). It can easily be disabled by passing +sbwt none ('t' = threshold) to the Erlang runtime.

This was not the cause of my problem; I had already made that change.

So I shrugged, put aside for now the SLEEPs as a measurement artifact, and filtered them out of the trace output, generating some flamegraphs to see what the biggest consumers of CPU were out of all the non-idle samples.

One source was :crypto.strong_random_bytes when responding to WebSocket pings from our Ably frontends. The adapter connects to our frontends the same way our client libraries do: over WebSockets, one for each connection.

(Using WebSockets to connect two services running on the same box may seem an odd choice: the reasoning was that the translated connection would just appear and behave as a normal external WebSocket, the Ably frontend wouldn't have to do anything special or different for a connection via the MQTT adapter compared to a normal external WebSocket connection.)

The frontend sends a WebSocket ping to each client every 15 seconds, primarily as a way of detecting (and allowing the client to detect) whether the connection is genuinely still alive. When sending a message (even just a pong), a WebSocket feature called masking requires the client to generate a random "mask" — random bytes which are XOR'd with the payload. The point of this is to defeat cache poisoning attacks on proxies. This is of course unnecessary for a WebSocket between two services running on the same instance. I disabled masking (using elixir-socket this was as simple as passing mask: nil as an option to Socket.Web.send).

Another source of CPU usage was responding to MQTT pings from clients. The current code was responding to them in the same way it responded to all messages: decoding them into a Hulaaki Message , passing that to a main MQTT.FromClient.handle() function that pattern-matches on the message and sends a ping response:

def handle(%Hulaaki.Message.PingReq{}, state) do
  Handler.send_to_client(Message.ping_response(), state)
  {:noreply, state}
end

Not much, but do it a couple times a minute for tens of thousands of connections and it all adds up. Optimizing this was just a matter of adding a special-case to the ranch handler to recognise a ping and respond with a pong before the normal MQTT message-parsing logic is invoked. With Elixir's binary pattern matching this is very simple:

@ping <<192,0>>
@pong <<208,0>>

def handle_info({:tcp, socket, @ping}, state) do
  @transport.send(socket, @pong)
  {:noreply, %{state | last_activity: now()}}
end

def handle_info({:tcp, socket, data}, state) do
  // handle all other packets

(For most messages we have to handle messages spread over multiple TCP packets; for a two-byte message this is not an issue.)

I kept going, looking at the other (smaller but still prominent) CPU users from the flamegraphs. URI.encode_query was spending a fair amount of time in URI.encode_www_form; I replaced it with a custom query string encoder that didn't bother with that, since I knew I was already passing URI-safe strings. Timex.Time.now() / elapsed() (used to keep track of the last activity time on a connection) was showing quite a bit; I replaced it with :erlang.system_time(:seconds). And so on.

And after all these in retrospect somewhat desperate micro-optimizations, I was... well, capturing flamegraphs that were now even more SLEEP dominated — having filed down everything else to the bare minimum —but system CPU was still much, much higher than it should have been.

Hmm.

Then, at some point I stumbled across a mailing list thread from 2011 about the +K flag (an argument to erlang runtime, like +sbwt). The reference manual states that it "Enables or disables the kernel poll functionality if the emulator supports it. Default is false (disabled)."

It turns out that with this disabled, the BEAM uses the poll system call to check for incoming IO events on Linux; with it enabled, it uses epoll. Julia Evans has a great blog post on the difference between those; the short version is that for listening to events on a large numbers of mostly-idle sockets, epoll is dramatically more efficient.

Enabling +K true did the trick. With it on, BEAM CPU usage in the idle-MQTT load test dropped by an order of magnitude.

So all the within-BEAM profiling and tracing, all my careful optimization of MQTT ping responses and so on were completely irrelevant, all a red herring. The BEAM was spending almost all of its CPU cycles on IO polling, only a tiny fraction actually executing bytecode, and I'd been spending all my time optimizing that small fraction.

One might question why that wasn't the default. There apparently exist usecases where poll may have an edge over epoll (extremely active sockets), but the general advice seems to be that the vast majority of people should use epoll. One person in that mailing list thread suggests that it's not the default because some OSes such as Windows don't have kernel poll support, but this doesn't make all that much sense — there's no reason to cripple platforms that do have epoll/kqueue/etc. for consistency with one that doesn't. (Especially since I doubt there are many people running the BEAM on Windows servers).

Happily, it seems that the Erlang/OTP team had been thinking along similar lines, and coincidentally, shortly after I discovered this, they announced that in OTP 21, the IO polling implementation was being completely rewritten. IO polling would be moved from scheduler threads to dedicated IO threads, and kernel-space polling (using epoll on Linux) would become the default (and in fact the +K switch would be removed entirely; epoll would be used always unless disabled at compile-time). Erlang/OTP 21 was duly released in June 2018, and this particular issue was solved at the source for everyone.

This was all two years ago. I left my micro-optimizations in: while marginal they did make some difference. I haven't had the chance to work on the adapter much recently; the problem with building services in Elixir/Erlang is that they rarely go wrong, so I lack an excuse to work on them. The MQTT adapter continues to quietly churn away, allowing any MQTT 3.1.1 client library to connect to Ably and interoperate with the rest of the system.

Ably is an enterprise-ready pub/sub messaging platform with integrated services to easily build complete realtime functionality delivered directly to end-users. Developers from startups to industrial giants build on Ably to simplify engineering, minimize DevOps overhead, and increase development velocity. Try our APIs for free or get in touch.

Adventures in BEAM optimization with our MQTT adapter

Continue reading

How to connect to Ably directly (and why you probably shouldn't) – Part 2

VPC peering vs Transit Gateway and beyond: Key choices in AWS network design

CRDTs solve distributed data consistency challenges

New posts from the Ably team, monthly.

Continue reading

How to connect to Ably directly (and why you probably shouldn't) – Part 2

VPC peering vs Transit Gateway and beyond: Key choices in AWS network design

CRDTs solve distributed data consistency challenges