Redis scripts do not expire keys atomically

    By: Andrew Dunstall

    This short post by a member of Ably's engineering team describes how we resolved a problem that is typical of the challenges we face each week. We thrive on solving hard distributed system problems that are mostly platform agnostic and theoretical in nature, and this is the first post in a long-term series of articles about things we've learned recently.

    How we use Redis at Ably

    Ably is a platform for pub/sub messaging. Publishes are made on named channels, and clients subscribed to a given channel have all messages on that channel delivered to them. We use Redis, a distributed in-memory database for key-based storage, to store various entities such as authentication tokens and ephemeral channel state. It’s a good fit for temporary storage of messages while we process them.

    We have billions of active Redis keys at any given time, which are sharded across numerous Redis instances. The sharding strategy places related keys in the same shard so that we can perform operations that update related keys atomically. We use Redis Lua scripts extensively to query and update keys, and rely on the atomicity of script execution to preserve the integrity of values of related keys. That is, no other command executes while a script is running, so other clients observe either none of a script's effects or all of them.
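
    As a minimal sketch of this pattern (not Ably's actual script), the following updates two related keys in one script, so no other client can observe one change without the other. The helper name `publish`, its parameters, and the roles of the two keys are hypothetical. The helper takes the command invoker as an argument only so that it can be exercised outside Redis with a stub.

```lua
-- Hypothetical helper, not from the original post.
-- call:         function used to issue Redis commands (redis.call inside Redis)
-- messages_key: list holding a channel's messages (assumed key role)
-- count_key:    counter that must always match the list length (assumed key role)
local function publish(call, messages_key, count_key, message)
  -- Both commands run inside one script, so no other command can interleave.
  call('RPUSH', messages_key, message)
  return call('INCR', count_key)
end

-- Inside Redis the global `redis` exists, so the script runs against the server.
-- Assumed calling convention: KEYS = {messages_key, count_key}, ARGV = {message}.
if redis then
  return publish(redis.call, KEYS[1], KEYS[2], ARGV[1])
end
```

    Both keys have to live on the same shard for this to work, which is what the sharding strategy described above provides.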

    We also use expiring keys extensively; the nature of the Ably service is that much of the state of a channel is ephemeral and only retained for a limited period of time (typically 2 minutes). We set keys to have a TTL so they auto-expire.

    The issue

    The integrity of a set of related keys requires that either all keys exist, or none exist. We had assumed that the atomic nature of script execution would also apply to expire operations invoked by a script. It does not: naively setting a TTL on multiple keys in the same script does not guarantee that they expire together.

    The expire operations in a script do execute atomically, with no opportunity for intervening operations to occur. However, the timestamps from which each expiry is calculated are not necessarily the same.

    Calling TIME twice in the same script returns two different values:

    -- time.lua       
    
    local a = redis.call('time')       
    local b = redis.call('time')       
    return {a, b}       
    
    $ ./redis-cli --eval /app/time.lua      
    
    1) 1) "1638280442"     
       2) "996960"     
    2) 1) "1638280442"     
       2) "996966"      
    

    Checking the actual expiry times of two keys that are given the same 1-second TTL:

    -- expire_check.lua     
    
    redis.call('set', 'foo', '1')     
    redis.call('expire', 'foo', 1)     
    
    -- slow calls...
    
    redis.call('set', 'bar', '2')     
    redis.call('expire', 'bar', 1)     
    
    local fooExpiry = redis.call('PEXPIRETIME', 'foo')     
    local barExpiry = redis.call('PEXPIRETIME', 'bar')     
    return {fooExpiry, barExpiry}     
    
    $ ./redis-cli --eval /app/expire_check.lua     
    
    1) (integer) 1638280843717     
    2) (integer) 1638280843730     
    

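    In this run the two keys expire 13 milliseconds apart, because each EXPIRE is calculated from the time at which that command ran. Separately, expiry itself might not be pin-point accurate: it can be between 0 and 1 milliseconds out.

    The implication is that there could be times at which some keys have expired but other related keys have not, and this could lead to an inconsistent state.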
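
    The inconsistency window can be illustrated with a small model in plain Lua. It makes no Redis calls: it only reuses the two expiry times reported by the script above and checks what a reader would see in between. The function name `is_live`, the 3 ms read offset, and the simplified liveness rule are assumptions made for illustration.

```lua
-- Model only: no Redis calls. Expiry times (ms) are the ones reported above.
local foo_expiry_ms = 1638280843717
local bar_expiry_ms = 1638280843730

-- Simplified rule (assumption): a key is visible strictly before its expiry time.
local function is_live(expiry_ms, now_ms)
  return now_ms < expiry_ms
end

-- Length of the window in which foo is gone but bar is still present.
local window_ms = bar_expiry_ms - foo_expiry_ms

-- A hypothetical reader arriving 3 ms after foo expires sees only half the pair.
local read_at_ms = foo_expiry_ms + 3
local foo_live = is_live(foo_expiry_ms, read_at_ms)
local bar_live = is_live(bar_expiry_ms, read_at_ms)

print(window_ms, foo_live, bar_live) -- prints 13, false, true (tab-separated)
```

    Any read that lands inside those 13 milliseconds finds `bar` without `foo`, which is exactly the state the related keys are supposed to rule out.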

    Our solution

    The solution is to use EXPIREAT to set an absolute expiry time for all related keys, rather than relying on a relative TTL that is evaluated separately for each key.

    The Redis documentation does not make clear whether keys with the same EXPIREAT setting are guaranteed to expire at the same instant. To be cautious, we also reordered key expiry to ensure that, regardless, we avoid inconsistency.

    -- expire_new.lua     
    
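    -- Unix time in seconds, read once and shared by every key below

    local now = redis.call('time')[1]
    -- Absolute deadline: the same value is passed to every EXPIREAT
    local expiry = now + 1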
    redis.call('set', 'foo', '1')     
    redis.call('expireat', 'foo', expiry)     
    
    -- slow calls...     
    
    redis.call('set', 'bar', '2')     
    redis.call('expireat', 'bar', expiry)     
    local fooExpiry = redis.call('PEXPIRETIME', 'foo')     
    local barExpiry = redis.call('PEXPIRETIME', 'bar')     
    return {now, fooExpiry, barExpiry}     
    
    $ ./redis-cli --eval /app/expire_new.lua
    
    1) "1638281265"
    2) (integer) 1638281266000
    3) (integer) 1638281266000     
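
    The same idea can be generalised to any number of related keys. The following is a sketch under stated assumptions rather than the script we run in production: the helper name `set_group`, its parameters, and the KEYS/ARGV calling convention are all hypothetical. As before, the command invoker is passed in so that the function can be tested outside Redis with a stub.

```lua
-- Hypothetical helper, not from the original post.
-- call:   function used to issue Redis commands (redis.call inside Redis)
-- keys:   list of related key names that must live and die together
-- values: list of values, same length and order as keys
-- ttl_s:  lifetime in whole seconds
local function set_group(call, keys, values, ttl_s)
  -- Read the server clock once; TIME returns {seconds, microseconds} as strings.
  local now_s = tonumber(call('TIME')[1])
  local expiry_s = now_s + ttl_s
  for i, key in ipairs(keys) do
    call('SET', key, values[i])
    -- Absolute deadline: identical for every key, however long the loop takes.
    call('EXPIREAT', key, expiry_s)
  end
  return expiry_s
end

-- Inside Redis the global `redis` exists, so the script runs against the server.
-- Assumed calling convention: KEYS = key names, ARGV = {ttl_s, value1, value2, ...}.
if redis then
  local values = {}
  for i = 2, #ARGV do values[#values + 1] = ARGV[i] end
  return set_group(redis.call, KEYS, values, tonumber(ARGV[1]))
end
```

    With the assumed calling convention, an invocation might look like `redis-cli --eval set_group.lua foo bar , 1 1 2` (the script path is hypothetical), which sets `foo` to 1 and `bar` to 2 with a shared one-second deadline. In a sharded deployment all of the keys must live on the same shard.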
    

    This is typical of one of the many engineering problems we troubleshoot and solve each week here at Ably.

    Fancy working with us in the realtime sphere? Our engineers have a broad range of technology skills across infrastructure, security, distributed systems, and beyond.

    You can find us on Twitter or LinkedIn, and apply to join us in one of our open roles.
