8 min readUpdated Mar 23, 2023

Ably Debugging Tales - A Mystery in Elixir

Ably Debugging Tales - A Mystery in Elixir
Simon WoolfSimon Woolf

How to debug in Elixir

I enjoy debugging, and I enjoy reading tales of debugging. They're like miniature detective stories — the mystery, the investigation, and (hopefully!) the resolution. So we (the Ably engineering team) decided to start sharing some of our own debugging adventures. Here's the first. This particular story deals with a misbehaving Elixir application.

The opening of the rabbit hole was small, as it often is. Something that just doesn't quite work as expected.

$ curl https://staging-translator-realtime.ably.io/nonexistantpath -v

< HTTP/1.1 500 Internal Server Error
* Server Cowboy is not blacklisted
< server: Cowboy

Getting an unknown path was returning a 500 (internal server error), not a 404 (not found) error. But only in production (or in this case, staging). Not on my local machine, not in ci.

This shouldn't happen as I have a 'catch-all' handler in the webserver I'm using that serves a nice 404. Which definitely works — it works locally, it has tests, and at least up until now, it's worked in production.

First things first: check the logfile... which is empty. Which is strange, since anything which causes a 500 should print an error log or a crash trace. In Elixir/Erlang, if something happens that you didn't anticipate or can't handle locally, you 'let it crash': the process (an Erlang-VM-controlled process rather than an OS process, like a green thread but with an isolated heap) handling that particular task dies, and if needed, is restarted in a known good state by its supervisor. When this happens you get a crash report in the logs.

So this is odd. I do the same thing a few more times, and suddenly:

=INFO REPORT====
    application: logger
    exited: shutdown
    type: permanent

=ERROR REPORT====
Error in process <0.438.0> with exit value:
{[{reason,undef},
  {mfa,{'Elixir.AppName.Web.CatchAll',handle,2}},
  {stacktrace,
      [{'Elixir.Inspect.List',inspect,
           [[],
            #{'__struct__' => 'Elixir.Inspect.Opts',
              width => 80,
              ...}],
           []},
       {'Elixir.Kernel',inspect,2,[{file,"lib/kernel.ex"},{line,1566}]},
       {'Elixir.AppName.Web.CatchAll',handle,2,
           [{file,"lib/web/catch_all.ex"},{line,11}]},
           ...

Aha. application: logger, exited: shutdown: the logger has crashed. Which at least explains why I wasn't seeing anything in the logs: the logger application must be responsible for flushing stdout; without that the logs were building up in the stdout buffer.

So now we can find out the reason for the crash. The module it crashed in my 404 handler. But the reason doesn't seem to make much sense. It's saying that inspect is undefined, at least when called with one argument, an empty list. (More precisely Inspect.List.inspect/2; the second argument is just the default options).

This is part of the Elixir standard library, and is used to pretty-print data structures for debugging. How could it be undefined?

Here's the line it crashed on:

Logger.warn "Unknown path: #{:cowboy_req.path(request) |> elem(0)},
  params: #{:cowboy_req.qs_vals(request) |> elem(0) |> inspect}"

In this particular case there were no qs_vals (querystring parameters). So the logic it crashed on reduces to inspect([]) — as expected, it's just inspecting an empty list.

But inspect definitely does work. On that same instance that was having the issue, I log some data when the application starts using inspect, and that works fine.

Admittedly none of that data involves lists.

So I add something to app startup to just inspect and log an empty list to see what happens. And the bug disappears. Everything works fine.

I take that away again, and it starts happening again. (Sometimes, at least; it doesn't happen on every deploy)

So at this stage it seems like this is only ever happening if the very first call to inspect for some datatype is done from a webserver request handler process, instead of just on app startup. Which meant I could work around the problem by doing the following

Logger.debug "Some data structures:  #{inspect {%{}, [], MapSet.new, 0.1, 1..2, ~r//, :a, <<1>>}}"

on app startup — inspecting and logging a simple example of every datatype I can think of — so that the first call to inspect with an array happens immediately, instead of in a web server request handler.

This is crazy. Let's dig a bit deeper.

In the Erlang ecosystem you can directly query whether a function with a given signature (module, function name, and arity trio) is available using :erlang.function_exported\3, e.g. :erlang.function_exported(Elixir.Inspect.List, :inspect, 2). That returned false at the times when the bug happened.

And yet the Elixir.Inspect.List.beam file, the physical file containing the compiled code for inspect, is present and correct in the slug — I downloaded the slug and checked — and as far as I could tell it seemed to have been compiled correctly.

In Erlang when running in interactive mode (which we were doing), modules are loaded on-demand as needed from the .beam files. You can use Code.ensure_loaded? to make sure a given module is loaded. (Note that this is not true if you're running a 'release', e.g. built using Distillery or the new Mix release feature of Elixir 1.9, which run in 'embedded mode', where all modules are preloaded up front).

When we ran Code.ensure_loaded?(Elixir.Inspect.List) in the request handler before inspecting the list, sometimes it would return true and sometimes false. If it successfully loaded the code, subsequent function_exported(Elixir.Inspect.List, :inspect, 2) calls would return true, and inspect calls would work correctly. If not, function_exported(Elixir.Inspect.List, :inspect, 2) calls would continue to return false, and all subsequent inspect calls would fail (or if done by the logger, crash it).

And to be clear, this isn't just between builds, it's between deploys of a single build — the same slug, the same compiled artifacts.

Eventually, with assistance from José Valim and the rest of the always-wonderful Elixir community, we figured out that the code being first run from a webserver request handler was a red herring. The reason that made a difference was just because if it ran on app startup, it was running earlier.

Which led to the answer!

(spoiler: it was our fault)

First, a bit of background on how we deploy.

Our CI server compiles a slug, with everything needed to run the app, which we then copy to where where it's needed and mount it with OverlayFS. That's a kind of filesysem that means that when the application tries to write something to the slug, it doesn't change the actual slug, but instead writes the changes to a separate 'upper' layer, that's then overlayed on top of the lower, slug layer to produce the virtual filesystem that the application sees.

This upper directory is in /tmp, in a directory unique to each docker container instance. This means multiple independent docker containers can share the same underlying slug at the same time, can be redeployed and start from a clean slate with no residual state, and so on.

Normally this all just works.

But. We had a maintenance script that ran shortly after a deploy completes, that among other things was cleaning up up any files in /tmp with a last-modified time of older than 2 day ago:

if File.exists?('/tmp')
  logger.info 'Cleaning up temp files older than 2 days'
  warn 'Could not delete old temp files' unless system(
    'sudo find /tmp -type f -mtime +2 -exec rm -f {} \;'
  )
end

Since the things in /tmp are these 'upper directories', this cleanup script didn't change the actual slug. So whenever I downloaded and checked that, it was all correct. But it did mark files as 'deleted' in overlayfs, so they were hidden from the docker container.

The result was that any BEAM files that had not yet been loaded by the Erlang VM at the time the cleanup script ran were being marked as deleted, and so became inaccessible.

So the problem only showed up with inspect because it happened that every other file and protocol was being loaded and exercised earlier on, on app startup or by successful requests — but, as it happens, not Inspect.List.

The fix: move the OverlayFS upper directories out of /tmp.

  def slug_volume(slug_path)
    # use an overlayfs mount so that the container cannot change files in the
    # underlying slug directory
-   tmp = Dir.mktmpdir("slug-")
-   run "sudo mount -t overlayfs \
-     -o lowerdir=#{slug_path},upperdir=#{tmp} overlayfs #{tmp}"
+   FileUtils.mkdir_p "/slugs/mnt"
+   mnt = Dir.mktmpdir("slug-", "/slugs/mnt")
+   run "sudo mount -t overlayfs \
+     -o lowerdir=#{slug_path},upperdir=#{mnt} overlayfs #{mnt}"
    {
-     host:  tmp,
+     host:  mnt,
      guest: "/app"
    }
  end

Ably Realtime provides a globally distributed, high-performance data stream network. It allows users to stream data anywhere, between any device, with mission-critical reliability.

This post is based on a talk, as featured on HackerNoon's #CrazyTechStories

Join the Ably newsletter today

1000s of industry pioneers trust Ably for monthly insights on the realtime data economy.
Enter your email