Routing around single point of failure DNS issues

Today swathes of popular websites have been mostly down due to a major attack on Dyn’s DNS servers. As a developer, if GitHub, Twitter and Reddit are down, then for all intents and purposes, you may as well go home — the Internet is broken :)

October 2019 update: Cloudflare suffered some DNS issues in June/July 2019. We've published a post on five practical strategies to remove single points of DNS failure.

The infrastructure team at Ably has spent a lot of time thinking about availability, scaling, and how to reduce single points of failure. So today, watching tech giants fall over with no strategy for working around DNS failures while Dyn was under attack, I was relieved that we had thought hard about these types of problems and implemented numerous fallback strategies for exactly this type of outage. Unfortunately, not every realtime service was quite so lucky today.

Ably is a data stream network. Our customers rely on us to be up, 100% of the time. Whilst it’s possible their own sites may be susceptible to DNS failures, our customers expect us to have these problems covered, and we do.

Given the problems so many of us in the dev community have experienced today, I wanted to share what Ably has done to work around these types of seemingly show-stopper issues.

No reliance on a single domain registry

We operate two primary endpoints for all of our client libraries, rest.ably.io and realtime.ably.io. Quite clearly, the common denominator here and single point of failure is the ably.io part of our domain. The IO domain is run by a relatively small operation out of the UK (Internet Computer Bureau), and from our perspective is a single point of failure. Human error too is a single point of failure, and it’s plausible a domain may not be renewed when it should have been.

As such, we operate a secondary domain on a different top level domain, *.ably-realtime.com. All of our client libraries are designed to use both domains, with *.ably.io being the primary, and ably-realtime.com being our secondary fallback domain should ably.io become unavailable.
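The two-domain idea above can be sketched roughly as follows. This is a minimal illustration, not the actual client library code: the function names, the injectable `resolves` callable, and the host name `a.ably-realtime.com` are all assumptions made for the example. Only the two domains themselves come from this post.

```python
import socket
from typing import Callable, Optional, Sequence

PRIMARY_HOST = "rest.ably.io"            # primary endpoint named in this post
SECONDARY_HOST = "a.ably-realtime.com"   # hypothetical host on the secondary domain


def system_resolves(hostname: str) -> bool:
    """Return True if the system resolver can currently resolve the hostname."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False


def first_resolvable(
    hosts: Sequence[str],
    resolves: Callable[[str], bool] = system_resolves,
) -> Optional[str]:
    """Return the first host that resolves, trying them in the order given."""
    for host in hosts:
        if resolves(host):
            return host
    return None  # neither domain is reachable via DNS
```

Calling `first_resolvable([PRIMARY_HOST, SECONDARY_HOST])` would prefer the primary domain and only fall through to the secondary one when the primary cannot be resolved. Because the two domains sit under different top level domains, a registry problem affecting one should not affect the other.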

No reliance on a single DNS provider

As we saw today, a single DNS provider going down is as good as your entire root domain going down. We considered a number of strategies to load balance our DNS between multiple providers. However, we use latency-based routing to ensure customers are always routed to the closest of our 15+ data centers, so we decided that load balancing between providers would most likely introduce odd routing behaviour as users bounce between them.

So instead we decided to simply have a wholly different DNS service for each of our domains. We use Amazon Route53 for our primary domain, and a different DNS provider that does not rely on Amazon Web Services for our ably-realtime.com secondary domain.

No reliance on a single datacenter

So whilst we now have root level DNS failover capabilities, and DNS service provider failover capabilities, it’s still quite possible that our users may be subject to network partitioning or routing issues, or one of our datacenters may be offline due to hardware or network failures.

Our latency-based routing and health-checked endpoints ensure that users are always routed to the closest healthy datacenter, but in our experience that has two limitations:

If a datacenter has just become unavailable, it will realistically be at least 5 minutes before we detect the failure, update the DNS, and that change propagates to the client.
If the customer is affected by a network partitioning or routing issue of some sort, yet the datacenter itself is still healthy, then they will simply keep trying to connect to the datacenter they are unable to reach.

As a result, we have implemented a fallback capability in all of our client libraries that ensures that any request that fails (due to a timeout or a server error, i.e. a 5xx status code) is retried against an alternative datacenter. Because we cannot rely on our primary domain at that point, our client libraries simply pick at random one of our ably-realtime.com fallback domains, each of which points to one of our datacenters globally.

In theory, customers trying to connect to an Ably datacenter that is unavailable will simply connect to an alternate, typically within 10 seconds.
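The fallback behaviour described above might look something like the following sketch. It is an illustration under stated assumptions rather than the real client library implementation: `send` is a hypothetical transport callable, the function and exception names are invented for this example, and any fallback host names passed in are placeholders.

```python
import random
from typing import Callable, Optional, Sequence, Tuple


class AllHostsFailed(Exception):
    """Raised when the primary host and every fallback host failed."""


def is_retryable(status: int) -> bool:
    """Only server errors (5xx) are worth retrying against another host."""
    return 500 <= status <= 599


def request_with_fallback(
    primary: str,
    fallbacks: Sequence[str],
    send: Callable[[str], int],
    rng: Optional[random.Random] = None,
) -> Tuple[str, int]:
    """Try the primary host first, then the fallback hosts in random order.

    `send(host)` (hypothetical) returns an HTTP status code, or raises
    TimeoutError / OSError when the host cannot be reached in time.
    Returns the (host, status) pair of the first non-retryable response.
    """
    rng = rng or random.Random()
    shuffled = list(fallbacks)
    rng.shuffle(shuffled)  # random order spreads retries across datacenters
    for host in [primary, *shuffled]:
        try:
            status = send(host)
        except (TimeoutError, OSError):
            continue  # unreachable or timed out: move on to the next host
        if not is_retryable(status):
            return host, status
    raise AllHostsFailed(f"{primary} and {len(shuffled)} fallback hosts all failed")
```

Note that a 4xx response is returned immediately rather than retried, since a client error would fail the same way against any datacenter. The time to recover is then bounded by the per-request timeout that `send` applies multiplied by the number of hosts tried, which is why a short timeout matters for this approach.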

Have a Distributed Denial of Service (DDoS) strategy

No matter how big the defensive shield you have, I believe you will always be susceptible to denial of service attacks to some degree. However, it’s important that you have a strategy and an escalation process in place in advance to deal with an attack. Operating with reduced service, if necessary, is also a valid strategy if it means core operations can continue.

We rely on Cloudflare to help provide us with DDoS protection for all of our fallback data centres. Our strategy is that, should we come under attack, we reduce service in our primary data centres and route all traffic through Cloudflare, who provide Layer 3 through to Layer 7 protection. They have 10Tbps of network capacity, so whilst they certainly cannot stop all untoward traffic, they provide a formidable barrier to attacks. The fact that even our fallback DDoS protection service provider is susceptible to issues reiterates the importance of having strategies in place.

Conclusion

Although the theory of the black swan will lead a lot of us to believe that you cannot plan for every eventuality, fortunately most of the time we’re not dealing with those sorts of hugely unpredictable events. Your technology stack can be reviewed for single points of failure and congestion at each layer, and if uptime matters to you, I recommend that you take the time to do so.

We’ve been fortunate enough at Ably to have decided early on that these types of issues matter and, as a result, to have allocated resource and time to solving these problems before they happen. If you can make the time to at least put together a strategy for removing single points of failure, then when the next big Internet outage occurs, I hope that, like us, you’ll be relieved that your customers are unaffected.
