7 min read | Updated May 7, 2024

Overcoming scale challenges with AWS & CloudFront - 5 key takeaways

Matt Hammond

The Ably service handles massive amounts of data throughput and concurrent connections for many customers while maintaining a highly reliable and available service, backed by a five-nines (99.999%) uptime guarantee.

Ably has no scale ceiling, and that’s challenging work (it’s one of the reasons I joined Ably). The challenges we face in delivering our service are compelling, and sometimes they are novel internet-scale problems, such as breaching the limits of AWS services!

Fortunately, we have a strong working relationship with AWS. A recent incident led me to reflect on how important this is - and how engaging with our providers helps to resolve the novel issues we encounter. When you offer a service that pushes the boundaries of your upstream providers, it’s imperative that you have a well-established, productive working relationship with them.

The reality is, these relationships occur all over the technology industry, up and down the technology stacks we use day to day. And without them, we’d see far less progress because everyone would be building everything from scratch!

So, first let me tell you about the incident we had and why we had to work with AWS to resolve it and understand its root cause. Then I’ll outline the 5 key takeaways that describe why engaging with providers for issue resolution is critical to your success…

The Incident

Ably uses AWS CloudFront primarily to leverage its global edge network, which provides low-latency, high-throughput network connectivity to the AWS backbone at over 600 points of presence (POPs) globally.

Clients connect to the Ably service using the relevant API hostnames (e.g. realtime.ably.io) which resolve to one of many CloudFront distributions. The distribution in turn uses latency-based DNS resolution to route to the nearest available and healthy NLB (which will usually be the nearest datacenter geographically).
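To make that routing decision concrete, here is a minimal sketch of the "lowest latency among healthy targets" rule. It is not how CloudFront or Route 53 is implemented; the `Endpoint` type, the endpoint names and the latency figures are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    name: str          # hypothetical label for a regional NLB
    latency_ms: float  # latency measured from the client's vantage point
    healthy: bool      # result of the most recent health check


def pick_endpoint(endpoints: list[Endpoint]) -> Optional[Endpoint]:
    """Return the healthy endpoint with the lowest latency, or None if none is healthy."""
    healthy = [e for e in endpoints if e.healthy]
    return min(healthy, key=lambda e: e.latency_ms, default=None)


# Illustrative data: the closest endpoint is unhealthy, so it is skipped.
endpoints = [
    Endpoint("nlb-eu-west", 12.0, healthy=False),
    Endpoint("nlb-eu-central", 21.0, healthy=True),
    Endpoint("nlb-us-east", 85.0, healthy=True),
]

chosen = pick_endpoint(endpoints)
print(chosen.name if chosen else "no healthy endpoint")
```

In this example the geographically nearest target is unhealthy, so the next-lowest-latency healthy target is chosen instead, which is why the nearest datacenter is "usually" rather than "always" the one that serves a client.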

Unexplained errors…

During a recent incident, an Ably customer reported an increased level of errors from the Ably service. These error responses were in fact HTTP 500 errors being returned from the AWS CloudFront service rather than the Ably service itself. The issue was triggered by maintenance work that was moving Ably traffic from a legacy CloudFront distribution to a new one.

We reverted the changes and routed traffic back to the legacy CloudFront distribution to avoid further disruption, but wanted to get to the bottom of why we’d experienced this issue…

Engaging with AWS

Comparing the configuration of the two CloudFront distributions was the logical first step to understanding where the issue may have arisen, but there were no obvious differences. We engaged AWS support and notified our account manager that this was a high priority support issue for us.

AWS got back in touch with us and explained that in fact there were configuration differences between the CloudFront distributions, but that these differences were not visible to us.
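The comparison step itself can be sketched as a recursive diff over two configuration dictionaries. This is an assumption about how one might do it, not a description of the tooling we used; the keys and values below are illustrative placeholders rather than a real distribution export.

```python
def diff_configs(a, b, path=""):
    """Yield (path, value_in_a, value_in_b) for every leaf that differs between two nested dicts."""
    for key in sorted(set(a) | set(b)):
        here = f"{path}.{key}" if path else key
        left, right = a.get(key), b.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Both sides are nested sections: compare them field by field.
            yield from diff_configs(left, right, here)
        elif left != right:
            yield here, left, right


# Hypothetical, heavily simplified configurations for two distributions.
legacy = {
    "Comment": "legacy",
    "HttpVersion": "http2",
    "Origins": {"DomainName": "origin.example.com"},
}
new = {
    "Comment": "new",
    "HttpVersion": "http2",
    "Origins": {"DomainName": "origin.example.com"},
}

for path, old_value, new_value in diff_configs(legacy, new):
    print(f"{path}: {old_value!r} -> {new_value!r}")
```

A diff like this can only cover settings that are exposed to the customer. A clean result shows that the visible configuration matches, not that the two distributions behave identically.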

Some time earlier, we’d increased the default quota of 250,000 requests per second per distribution for many of our CloudFront distributions (Ably services operate at a scale where this limit is frequently exceeded). In error, this quota increase had not been applied to the new CloudFront distribution, and as a result that distribution returned the increased level of error responses described above.
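One way to guard against this class of mistake is a pre-cutover check that compares expected peak traffic with the quota known to apply to each distribution. The sketch below assumes you keep your own record of approved quota increases (for example from support cases), since these values were not visible in the distribution configuration. The distribution names, the recorded increase and the peak figure are hypothetical; only the 250,000 default comes from the incident above.

```python
DEFAULT_RPS_QUOTA = 250_000  # default per-distribution quota cited above


def distributions_at_risk(expected_peak_rps, applied_quotas, distribution_ids):
    """Return the distributions whose quota is below the expected peak request rate.

    Distributions with no recorded increase are assumed to be on the default quota.
    """
    at_risk = []
    for dist_id in distribution_ids:
        quota = applied_quotas.get(dist_id, DEFAULT_RPS_QUOTA)
        if quota < expected_peak_rps:
            at_risk.append(dist_id)
    return at_risk


# Hypothetical record of quota increases: only the legacy distribution has one.
applied_quotas = {"legacy-distribution": 1_000_000}
distribution_ids = ["legacy-distribution", "new-distribution"]

# Hypothetical expected peak of 400,000 requests per second.
print(distributions_at_risk(400_000, applied_quotas, distribution_ids))
```

Run before shifting traffic, a check like this would flag the new distribution as still being on the default quota while the expected load is above it.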

Gaining a deeper understanding

Ably is proud to offer no scale ceiling, and the reality of that offering is that we need to work with both our customers and our service providers to understand requirements and push the boundaries of what’s possible.

So, not content with simply increasing the limit and moving on, we sought to understand the limits of AWS CloudFront in terms of requests per second, since this was our current bottleneck and we wanted to future-proof our service. Ideally we’d encounter no limitations, but of course no service is genuinely limitless, despite many being marketed as such. So, as far as possible, we would aim not to be bound by any limits that did exist, by improving our architecture and/or configuration.

Developing a long term solution…

So, working closely with our account manager and solutions architect, and with feedback from the CloudFront team, we were able to make an additional configuration change to ensure that this issue did not recur.

Ultimately through our relationship with AWS we were able to understand the root cause and make changes that would allow us to meet our goals and prevent any future instances of the issue we’d experienced.

The 5 key takeaways…

We’ve built a number of key relationships with product teams in AWS and other providers - and ultimately it’s those relationships that enable us to offer the quality of service that we do to Ably customers. Read more about Ably’s Four Pillars of Dependability here.

Resolving our challenges with the help of AWS has reminded us of the importance of engagement with providers. Here are some takeaways you could apply in your provider relationships:

1. Establish communication channels with provider support teams

  • In the case of the incident described above, having access to the team responsible for the AWS product affected meant we could quickly create a feedback loop to get the problem resolved.
  • The earlier you establish communication channels with your provider support teams the better.
  • They are there to help and the ability of your provider's support to quickly resolve your issues will make or break the success of your product or service.

2. Leverage support plans and escalation procedures for timely issue resolution

  • If you experience an issue or outage with your service that can only be resolved with support from an upstream provider it’s imperative that you have escalation procedures in place to manage expectations.
  • Ultimately being able to quickly and honestly communicate with your customers is of utmost importance during incidents and having all the information available to you will give confidence to your customers.
  • Ably has enhanced support plans in place with AWS and regular syncs with our AWS customer success team. This played a significant role in our ability to resolve the incident we faced.

3. Collaborate with provider's technical experts to troubleshoot and resolve complex issues

  • Your provider will appreciate an understanding of your product use case to aid their analysis of any issues you may be facing. In particular they should be interested in where you are pushing the boundaries of their product's use. In those margins lie the opportunities that will show the provider’s product owners how to improve the product and push it to succeed.
  • Get in touch with your service provider on complex issues and seek their input; they should be happy to help you resolve them.
  • Ably staff regularly attend sessions dedicated to specific AWS products, which keep us informed of roadmap developments and give us the opportunity to ask targeted questions related to our use of AWS products.

4. Engage in feedback loops to provide input on service improvements and feature requests

  • At Ably we’ve been able to submit or add interest to existing service improvement requests. This ensures that our needs are met as products develop.
  • Having sight of product roadmaps from your service providers can be invaluable in understanding how future changes to products and services can enhance (or degrade!) your service.
  • Engage with your service provider on service improvements and feature requests and get that early visibility!

5. Explore opportunities for joint problem-solving and co-innovation through partnership programs

  • Tech partnership programs are collaborations between technology companies, where they work together to achieve common goals such as product development, innovation, or market expansion.
  • These programs often involve sharing resources, expertise, and technologies to create mutually beneficial outcomes. They can range from formal alliances between major corporations to informal agreements between startups or smaller companies.
  • Ably is proud to be an AWS partner! Read more about Ably’s partnership with AWS here.

Ultimately, strengthening relationships with your providers will set you and your customers up for success. In technology, we continually build on each other's progress and, to some extent, we all rely on the advancements of others. A well-established, productive working relationship with all your providers is critical to your success.
