34 min read • Last updated Feb 5, 2025

Hidden scaling issues of distributed systems - System design in the real world

We recently started an interview series aimed at sharing experiences and challenges faced by Distributed Systems Engineers. In the first episode, we explore the hidden scaling issues of distributed systems.

Who did we interview?

Paul Nordstrom — Entrepreneur & former Systems Architect at Google, Amazon & OfferUp — and Paddy Byers — Co-Founder & CTO at Ably Realtime.

*Paul Nordstrom is a seasoned entrepreneur and systems architect with vast experience working with distributed systems at the likes of Google, Amazon and, most recently, OfferUp.*

*Paddy Byers is a Co-founder and CTO of Ably Realtime with over two decades of experience in software and services and deep technical expertise in building scalable distributed systems.*

Interview transcript

Intro
Unexpected problems
Limits of scale
Microservices
CAP theorem
Fault tolerance
Team structure
Tips

Intro

[Matthew O’Riordan] — Welcome to our first episode of Distributed Deep Dive. In this series of videos, we’re going to be talking to interesting people who have worked on distributed systems. In our first series, we’re talking about hidden scaling issues and distributed systems in the real world. And today we have Paul and Paddy, and they will be talking about the real problems they face scaling internet scale systems, and we’re hoping that what you might learn from this is how you can avoid some of the mistakes that Paddy and Paul have made while building these systems. Paul would you like to introduce yourself?

[Paul Nordstrom] — Sure, my name’s Paul Nordstrom, and I’ve been writing software since high school. I programmed my way through college and then took a brief detour into finance, where I ended up building financial systems. When the internet came along, Amazon offered me a job in 1999, so I was lucky enough to arrive at a time when they needed somebody to do a clean-sheet architecture for their website. That was my first internet-scale system, and some of the service-oriented architectural things that we invented there have, in fact, permeated the industry. So that was a very successful system. They’re still running parts of it; some of it’s been replaced. After about eight years at Amazon, I took a break. I wanted a change of venue, so I went to this other little internet search company called Google. At Google, I was in charge of the design and building of a system that’s been published, so if you’re curious about some of the examples you can read the paper called MillWheel. I was at Google for about 10 years.

[Matthew O’Riordan] — Excellent. Paddy, would you like to introduce yourself?

[Paddy Byers] — Sure, so my background is in Mathematics. In my doctorate, I worked on mathematical problems underpinning formal verification. In my career, I’ve worked on a wide range of different systems: safety-critical, security-critical, embedded, realtime, and a lot of consumer electronics. At Ably Realtime I’ve built the core realtime messaging product, and although we have a growing engineering team now, I still spend most of my time coding and building the next set of features for the product.

Unexpected problems

[Matthew O’Riordan] — Paul, so if it’s okay we’ll start with you. The first question I think we’d like to talk about is I think we have a good understanding of what the textbook problems are with scaling distributed systems but I think in the real world the problems we encounter are the problems we didn’t expect. Can you tell us a bit about your experience with this?

[Paul Nordstrom] — The first thing you do when you’re designing a system is try to figure out which dimensions it’s going to scale in. That’s the part you’re talking about, the part we usually understand, although doing a good job of it involves shedding some of your ego: you should involve other people in the design discussion, because there are things you’re going to miss. You obviously want to minimize the things that you didn’t think about in advance, so the first step in dealing with unintended scaling issues is to not have some of them. Once you get past that, though, you have to accept the fact that you’re going to run into problems that you didn’t anticipate. Even if you’ve done a great job of minimizing them, the real world just throws more at you; the world of software is more complicated than a human mind can encompass. So you do everything you can. You build mathematical models for your system so that you have a coherent framework within which your system is going to operate, and you can reason about it using those models. That’s a great first step, and I think that in the industry it’s one of the things that gets short shrift. I think very few people really do a good job of it. It’s one of the things that most impressed me when I started looking at Ably Realtime: it had a mathematical model, and you could understand what problems it was intended to solve and what scale dimensions were possible. MillWheel is the same. It has a mathematical model for distributing computations on time series, treating a time series as a foundational abstraction that you can reason about. To get to the meat of your question: once you’ve done everything you can and you find that you have unintended scaling problems, what do they look like? Maybe having heard about a couple of them here, you can anticipate that happening to your systems.
One unintended problem that you think of in advance and convert into a known problem will make an incredible difference in the quality of the system you build and the time within which you get it done, because solving these problems after the design phase is on the order of 10 to 100 times as difficult. We’ve all experienced that: fixing a bug takes at least 10 times longer than avoiding it in advance by making a good design. In MillWheel, one of the things we didn’t anticipate involved the partition space. Every piece of data being manipulated has a key. Let’s say the data that you’re manipulating is Google queries; in fact, MillWheel is actually used to process the stream of Google queries received on the internet. We anticipated that some queries happen more often than others. But we didn’t do a very good job of analyzing the data, and we didn’t realize there was one query that so far exceeds every other query that no single machine could handle the processing of it. It happens to be the query “google”. People type “google” into the Google search box with great regularity, more than any other single query. One of the reasons we didn’t anticipate this is that we didn’t know MillWheel was going to be applied to this problem, so we couldn’t talk to the team that did it. But we could have done a better job of talking to more of the community about how they might use such a system, because it was fairly well understood at the time that Google needed a system for processing continuous data. They had MapReduce, and MapReduce is a big, enormously scalable system that solves plenty of problems, but not the continuous data problem.
So all of these are little lessons, but they add up. Spend more time talking to your users about how they would use your system, and show your design to more people. Shed the ego and shed the need for secrecy if you can, so that you get a wider spectrum of people who can tell you, “I’m gonna use it like this.” Then, when you run into the inevitable problem you didn’t anticipate, having done that work beforehand, your system will have a cleaner design and you’ll have this mathematical model, so you will at least have a cleaner base from which to solve it.
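The hot-key problem Paul describes can be illustrated with a small sketch. It assumes a stream of query strings partitioned across workers by hashing the key; the function names, the hashing scheme, and the made-up query mix are all hypothetical and are not MillWheel's actual design or API.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def load_per_partition(keys, num_partitions: int) -> Counter:
    """Count how many records land on each partition."""
    load = Counter()
    for key in keys:
        load[partition_for(key, num_partitions)] += 1
    return load


def hottest_key_share(keys) -> tuple[str, float]:
    """Return the most frequent key and its share of all records."""
    counts = Counter(keys)
    key, count = counts.most_common(1)[0]
    return key, count / len(keys)


# Made-up query mix: one dominant key plus a long tail of distinct keys.
queries = ["google"] * 9_000 + [f"query-{i}" for i in range(1_000)]

load = load_per_partition(queries, num_partitions=10)
key, share = hottest_key_share(queries)
hot_partition = partition_for(key, 10)
print(key, share, load[hot_partition])
```

Because every record for one key lands on the same partition, adding partitions does not reduce the load on the partition that owns the dominant key. Measuring the hottest key's share on sample data before choosing a partitioning scheme is one way to turn this into a known problem.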

[Matthew O’Riordan] — Would you like to add anything to that Paddy?

[Paddy Byers] — Yeah, just to echo what Paul said: when you set out to build a system, you know, roughly speaking, what the first-order problem is that you’re trying to solve. The scaling problems you run into are often the second-order problems. So it’s not just how many messages can I support in a channel, but the second-order problems: the rate of change of the number of channels, or the rate of change of the number of messages. Sometimes you’re caught out because the engineering catches you out, and things turn out to be harder than you thought. And sometimes the user catches you out: they use the system in ways that you didn’t expect when you designed it, so you find a whole bunch of new use cases that you then have to go and deal with, or they give you scaling issues that you didn’t solve originally.

[Paul Nordstrom] — Let me add to that. Sometimes you’re going to attempt to solve a problem that can’t reasonably be solved, and sometimes you just have to accept that your system can’t scale in every dimension. You might just have to say to the customer that inherent in our system is a limit on the number of channels we can support, or the rate at which you can add new channels, or whatever it is in your system. In fact, tying this back to the MillWheel problem, one of the things we said is: if your key space can’t be partitioned so that a single key can be handled by a single machine, we’re not going to solve that for you; you’re just going to have to find a different solution for that problem. I think it’s important to realize that trying to make your system solve every problem eventually ends up with a system that is too complicated to use, or one that doesn’t work very well for any of the problems you’re trying to solve.

Limits of scale

[Matthew O’Riordan] — Understanding the limitations of the system and the limits of your scaling parameters is obviously important from the start. How do you think you go about understanding those limits? Do you understand them once you’ve built the system, do you think you can preempt some of that, or do you just deal with problems as they arise in areas that you may not understand fully?

[Paul Nordstrom] — I think the first point is that you have to understand at least one scaling issue from the get-go, which is the scaling issue that your customers have and that you’re here to solve. If you haven’t talked to people about what their needs are, then you haven’t clearly identified the way in which your system is going to scale in the dimensions they need; you haven’t done your homework, and you’re doomed to fail without great luck. You choose a set of scaling dimensions that you’re going to attack with your design, you study those dimensions, and you make sure your architecture addresses them. I think people get this part, though. I think they understand that you need to solve the customer’s issues, and I think they understand how to design for that. It’s the second-order ones again. We talked earlier about how to at least try to head off some of the second-order scaling issues, but really the answer to your question is this: if you’ve done your homework, and you have addressed the issues that your customers need you to solve, and hopefully you’re providing something nobody else does or something you’re much better at than other systems out there, then you’re gonna have a successful system. Your system needs to have a competitive advantage too, so it needs to solve customer problems better than the other systems available to your customers. Whether or not you end up with second-order scaling issues that you can or cannot solve, you’ve met this primary need. Beyond that, it’s really about clear thinking and having a clear process for identifying those issues and solving them.

[Matthew O’Riordan] — Paddy do you think, I mean when designing building and then running at scale, do you think there have been second order type problems that even now looking back, you think would have been hard to predict what those problems were until those problems arise, I mean are there any sort of specific examples you can think of that kind of you know, show how difficult it maybe is to predict these types of problems sometimes.

[Paddy Byers] — Yeah, the kind of example that comes to mind is the sort of thing I mentioned, where the customer catches you out by doing something that you didn’t anticipate. We built something imagining that channels were long-lived, and we optimized to the greatest extent possible the cost of processing a single message. But that catches you out if what the customer’s really doing is creating and destroying channels at a very high rate. Another example is fan-out. We imagined that one of the scaling dimensions would be, I want to be able to fan out messages to a very, very large number of subscribers. But then we have customers who don’t have that problem. Instead, they have channels with a single subscriber, and what you need to do is minimize the cost of establishing the first subscriber, not of establishing a million subscribers. The other thing I would add, in understanding the scaling limits, is that you have to look out for what’s an order-N feature of your system and what is, or can be, a better-than-order-N feature of your system. Everything, in the end, will become a limit to scaling; the question is whether I have to deal with it now or whether I know I can deal with it later. If it’s an order-N thing, I know I’m gonna have to deal with it at some point, but do I know a way that I could improve it when the time came, or is it something I have to address now?
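A toy cost model makes the channel-churn example concrete. It is only a sketch: the cost figures are invented, the units are arbitrary, and `cost_per_message` is a hypothetical helper rather than anything from Ably's system.

```python
def cost_per_message(setup_cost: float, per_message_cost: float,
                     messages_per_channel: int) -> float:
    """Amortized cost of one message, including a share of channel setup."""
    if messages_per_channel <= 0:
        raise ValueError("messages_per_channel must be positive")
    return per_message_cost + setup_cost / messages_per_channel


# Hypothetical costs in arbitrary units: setup is 1000x a single message.
SETUP_COST = 1000.0
PER_MESSAGE_COST = 1.0

# A long-lived channel spreads its setup cost over many messages.
long_lived = cost_per_message(SETUP_COST, PER_MESSAGE_COST, 1_000_000)

# A channel that carries two messages and is destroyed pays almost
# nothing but setup cost.
short_lived = cost_per_message(SETUP_COST, PER_MESSAGE_COST, 2)

print(long_lived, short_lived)
```

Under these assumed numbers, optimizing the per-message cost barely helps the short-lived case, because setup cost dominates. That is the sense in which the rate of channel creation is a second-order scaling dimension.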

[Paul Nordstrom] — I think that’s one of my fundamental precepts for building a system. Even if you have a pretty clear idea of the design, you can’t build every dimension of your system to its ultimate capability on the first pass. You have to pick the ones that you understand well enough to build right, and you should build those right. But for the ones that you know you don’t understand well enough to build right, you should build a hack that has a shell around it, something you can use as, let’s say, an interface or an abstraction layer, rather than actually trying to build the real thing when you don’t know enough to do it. One of those cases, tying in with what you just said, is when what you don’t really know is how people are gonna use it. It’s not that you couldn’t solve one of the problems, or that you don’t understand it; it’s that you don’t know which one is really your problem yet. Then you can make a much better decision later on.
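One way to read "a hack that has a shell around it" is a deliberately simple implementation hidden behind a stable interface. The sketch below assumes a hypothetical `SubscriberRegistry` interface; none of these names come from the interview or from any real product.

```python
from typing import Protocol


class SubscriberRegistry(Protocol):
    """Hypothetical interface: the 'shell' that callers depend on."""

    def add(self, channel: str, subscriber: str) -> None: ...

    def subscribers(self, channel: str) -> set[str]: ...


class InMemorySubscriberRegistry:
    """Deliberately simple first version: a dict in a single process."""

    def __init__(self) -> None:
        self._subscribers: dict[str, set[str]] = {}

    def add(self, channel: str, subscriber: str) -> None:
        self._subscribers.setdefault(channel, set()).add(subscriber)

    def subscribers(self, channel: str) -> set[str]:
        # Return a copy so callers cannot mutate internal state.
        return set(self._subscribers.get(channel, set()))


def subscribe_and_count(registry: SubscriberRegistry,
                        channel: str, subscriber: str) -> int:
    """Caller code written against the interface, not the implementation."""
    registry.add(channel, subscriber)
    return len(registry.subscribers(channel))


registry = InMemorySubscriberRegistry()
print(subscribe_and_count(registry, "news", "alice"))
```

Once real usage shows whether the pressure is on fan-out size or on subscription churn, the in-memory class can be replaced by something built for that case without touching callers such as `subscribe_and_count`.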

Microservices

[Matthew O’Riordan] — How does what we’ve been talking about relate to microservices?

[Paul Nordstrom] — Well, you’re pushing one of my buttons here, because personally I don’t think there’s such a thing as microservices. I consider the whole thing to be a giant buzzword that people have used so they get internet audiences and so on. I think services span a range from very, very small, and you’re free to call those microservices if you’d like, but there is no cutoff where they stop being microservices; they just get bigger and bigger until they’re gigantic services that have large-scale functionality. Your choice among these depends on your problem, and you should choose the right one for the job. You should make a balanced choice between functionality and complexity, power, size, and ease of use. If it were up to me, I’d wipe the term off the internet discussion list and get rid of it. But the thing about the microservices movement that is valuable, in my opinion, is that the tendency should be to push yourself down the scale: when possible and appropriate, the smaller service is probably the better one. It’s easier to reason about, and it’s easier to swap something else in if you need different characteristics than the choice you made at the beginning gives you. So I don’t believe in microservices; in fact, I probably will never say the word again after this interview. But I do believe in learning to balance the use of out-of-the-box components and to choose wisely, leaning towards the smaller of the two choices, all other things being equal. I think that’s important. That said, microservices are often at odds with the need of a system to scale in some dimension, because one of the ways you get scaling out of a particular functionality is to couple two parts of it; the tightly coupled one usually outperforms the loosely coupled one.
With an architecture composed purely of microservices, if it’s attacking a truly difficult scaling problem, I don’t think that in the long run you’re going to be able, in general, to solve it just by composing microservices, no matter how well you do it. I think you’re going to be forced in certain cases to couple the search semantics with the storage semantics, for instance, and you end up with something like relational databases instead of something like Spanner. It depends on the situation. I’m not calling out Spanner in a negative way; Spanner was a brilliant piece of work, and it solves a huge swath of the problem space in storage, in a dimension that had never been attacked before.

[Matthew O’Riordan] — So Paddy, I know we’ve had this discussion at Ably about the dreaded word microservices, but my understanding was that the problem with splitting up a lot of the services that we run was that we’d be creating not just potential performance bottlenecks, which are probably less of a concern, but operational bottlenecks. Trying to manage versioning of APIs across lots of different services just creates a lot of unnecessary complexity in the system. What are your thoughts on what we’re doing within the RPC layers and how components talk to each other, while still keeping them as single larger components?

[Paddy Byers] — A lot of what we have done is that the systems grew up relatively monolithic, and we face the question of when it is a good idea to split something into a separate service. As you say, there are operational complexities and performance complexities that come about when you do that. However, every single thing you do eventually becomes a problem; no matter how insignificant or negligible it seems performance-wise initially, eventually it will become a problem. So the real question for us is at what point something becomes significant in a way that means you want to scale it independently. Do you want a particular service to be independently elastic, or is there a particular advantage in being able to deploy it and give it a different deployment lifecycle from other services? That’s the point at which you would then consider splitting it out into a microservice, and that’s broadly the approach you’d take.

[Paul Nordstrom] — What I wanted to add to that is that you should know and be conscious of the core functionality. We talked earlier about what problem your system is solving for its customers, and in particular in what scaling dimension it’s solving that problem. When something’s not central, it’s much easier to split it out and make it something independent, and if it is central, it’s much less likely to become a service by itself, because, like I said, the loosely coupled hardly ever outperforms the tightly coupled. You can’t afford to have everything tightly coupled, and of course not everything is central. So it’s not necessarily that hard a thing to do, but I think it’s a conscious decision. You say: all right, this is central to my system, I’m willing to be coupled here.

CAP theorem

[Matthew O’Riordan] — So I think it’s interesting that the conversation has largely been talking about balances and trade-offs in systems and understanding where your limiting factors are. It reminds me a bit of CAP theorem and the idea that you can only honour two of the three principles. What are your thoughts on that in regards to distributed systems and how that applies to what we’ve been talking about?

[Paul Nordstrom] — The CAP theorem tells you that you will have to make trade-offs, right? To my knowledge, there are no examples of a system that violates that, that manages to make no trade-offs between the CAP properties. My favorite system design, and it’s come up before, is Spanner. Spanner makes great promises in all three dimensions, but it doesn’t violate CAP; it just hits the sweet spot. The reason I’m bringing it up is that I think it’s a great example of something people should consider when they’re designing and building a system: recognize and acknowledge that you’re going to have to make trade-offs, and then find the sweet spot for whatever it is you’re designing, one that solves a wide swath of users’ needs and does a great job of capturing the users’ demands in each of the dimensions they care about. You should be aware that you’re going to have to choose at most two of the CAP properties to truly solve. But I think what matters is a system that does a really good job in each of them, not perfection. We were talking about outages earlier today, and outages happen regardless of what your system was designed for. You may have perfect guarantees around eventual integrity, but that doesn’t mean that your users are going to see perfection, because they’re not; their own connection to their internet provider might be down. What they really care about is their business. I think that doing a good job of satisfying scaling needs in every dimension that you do identify, as we talked about at the beginning of this conversation, is more important to them than meeting some theoretical CAP properties test.

[Matthew O’Riordan] — Okay, I think that’s, I mean, you know, in the context of a messaging system, Paddy, I can see how those lines would get blurred and deciding what that sweet spot is because you know, the latency of a message to be delivered is something that will be variable and I expect certain use cases require latency guarantees. So I mean, what are your thoughts Paddy, on how I suppose, that thinking is applied at Ably Realtime.

[Paddy Byers] — How does it relate? First of all, naively, the CAP theorem would have you believe that these are binary properties, meaning the system is either available or not available. In reality, that’s never true. You get outages in some part of the world, but in a global system you can’t deny service just on the basis of a single regional fault. So availability and the other properties are not binary attributes. The other thing is that CAP specifically applies to a particular understanding of integrity, where you want to build an ACID database or you want linearizability as a property of your system. In our case, though, we get to choose the semantics. In the specific case of messaging, where there is a conflict between consistency and availability, we get to choose exactly what semantics we decide are useful for our customers, what things we guarantee to uphold in the presence of certain kinds of failure, and what things we don’t guarantee to uphold. So we have a lot more flexibility in designing our system than you would have if you were designing a relational database and trying to make the CAP trade-offs. That’s what we’ve done, through a combination of what our customers tell us their requirements are and our own understanding of what’s achievable in practice.

Fault tolerance

[Matthew O’Riordan] — What are your thoughts on fault tolerance, Paddy?

[Paddy Byers] — Fault tolerance. I think you have to set out with the idea that everything will fail at some point, so you have to do things as a matter of routine to recover from certain failures. Then the decisions you face are: what kinds of failures am I gonna handle routinely with no degradation of service, and for what kinds of failures am I prepared to allow some degradation of service? In doing that, you have to look at how you would survive these failures: what level of redundancy am I gonna maintain, and what is the cost of maintaining it, not just the resource cost but also the performance cost and the operational cost of managing that redundancy? Also, if there’s more than one failure, do I really think those failures are independent? If my failures are not going to be independent, so that when one thing fails there’s a greater likelihood of the thing I’m failing over to also failing, then no amount of redundancy is gonna give 100% fault tolerance. So it’s a trade-off between understanding customer requirements, understanding the business and operational cost of achieving them, and understanding the real-world engineering practicality of actually making it possible.
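The point about independent versus correlated failures can be shown with a little arithmetic. This is a sketch under stated assumptions: the probabilities are invented, and the correlated case uses a simple chain model (each further failure has a fixed, higher probability once a peer has failed) chosen only for illustration.

```python
def p_all_fail_independent(p_fail: float, replicas: int) -> float:
    """Probability every replica is down, assuming independent failures."""
    return p_fail ** replicas


def p_all_fail_correlated(p_fail: float, p_fail_given_peer_failed: float,
                          replicas: int) -> float:
    """Probability every replica is down when each further failure becomes
    more likely once a peer has failed (simple chain model, an assumption)."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return p_fail * p_fail_given_peer_failed ** (replicas - 1)


P_FAIL = 0.01            # hypothetical chance that a replica is down
P_FAIL_GIVEN_PEER = 0.5  # hypothetical chance the failover target is also down

independent = p_all_fail_independent(P_FAIL, 3)
correlated = p_all_fail_correlated(P_FAIL, P_FAIL_GIVEN_PEER, 3)

print(independent, correlated)
```

With these made-up figures, three independent replicas are all down about one time in a million, while three correlated replicas are all down about one time in four hundred. Adding a fourth correlated replica only halves that, which is why redundancy alone does not deliver the fault tolerance the independent calculation suggests.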

[Paul Nordstrom] — I agree with Paddy’s comments, and another dimension to be really conscious of is how to balance maximizing the mean time between failures against minimizing the mean time to recover. In the systems I’ve worked on, and ones I know about, this often isn’t done thoughtfully and consciously up front, but sometimes minimizing the time to recover works out much better than maximizing the time between failures. An engineer’s natural tendency is to try to maximize the time between failures, to make the system as reliable as they can. That’s not always the right answer. Sometimes just making it recover really quickly from failures gets you where you need to go in terms of value to your customer and simplicity of the engineering, and it solves problems that you might not have considered. By its nature, trying to maximize the mean time between failures means figuring out what’s gonna fail and preventing it from failing. Minimizing the time to recover also covers problems you never considered. We were talking about second-order and unanticipated scaling problems; this is applying that same concept to failure scenarios, where you try to limit the impact of unanticipated failures too. It’s something I didn’t learn until I went to Google, where they’re super sophisticated about this. One of the things they taught me is to stop at the beginning and ask: first, what are your areas of failure, which goes to what you just said; and second, am I going to attack this area of failure by making the system recover very quickly from it, or by avoiding the problem through extensive tests, great design, and so on? I think there’s a really important lesson here.
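The trade-off between time between failures and time to recover can be made concrete with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The sketch below uses invented figures; the function and variable names are illustrative only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: one failure per 1000 hours, one hour to recover.
baseline = availability(mtbf_hours=1000, mttr_hours=1)

# Option A: invest in prevention and double the time between failures.
fewer_failures = availability(mtbf_hours=2000, mttr_hours=1)

# Option B: invest in recovery and halve the time to recover.
faster_recovery = availability(mtbf_hours=1000, mttr_hours=0.5)

print(baseline, fewer_failures, faster_recovery)
```

In this model, doubling the time between failures and halving the time to recover give the same availability, so the choice comes down to which is cheaper to achieve. Faster recovery has the additional property Paul points out: it helps with failures nobody predicted, whereas prevention only helps with the ones that were anticipated.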

[Matthew O’Riordan] — Do you think that applies also to dealing with unexpected load? There’s an element of hitting a high-water mark, and my understanding is that the high-water mark sits almost in both camps: you’re saying I’m willing to reject some work so that I can save the system and recover, but equally I’m degrading the service at the same time. How do you think that applies to what you’ve been talking about? Is there anything specific you think our listeners would want to know about?

[Paddy Byers] — Well, I think, reflecting on what Paul said, in terms of coping with load: you’ve designed things to be elastic, so they can scale when load scales, but then you have the second-order problem, which is how quickly you can do that. You can do two things. One is to operate with a margin of capacity so that you can handle a certain spike in load instantly, without having to scale. The other is to build things to be able to scale very quickly: not just inherently elastic, but able to react to spikes. You can potentially reduce the complexity of the system if your way of reacting to spikes is to be able to scale very quickly.
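The high-water mark mentioned in the question can be sketched as a small admission controller that rejects new work once in-flight work reaches a limit. This is a minimal, single-threaded illustration; the class name and the limit are hypothetical, and a real system would also need thread safety and a policy for choosing the limit.

```python
class AdmissionController:
    """Rejects new work once in-flight work reaches a high-water mark."""

    def __init__(self, high_water_mark: int) -> None:
        if high_water_mark < 1:
            raise ValueError("high_water_mark must be at least 1")
        self.high_water_mark = high_water_mark
        self.in_flight = 0
        self.rejected = 0

    def try_admit(self) -> bool:
        """Admit one unit of work, or reject it if the system is full."""
        if self.in_flight >= self.high_water_mark:
            self.rejected += 1
            return False
        self.in_flight += 1
        return True

    def complete(self) -> None:
        """Record that one admitted unit of work has finished."""
        if self.in_flight > 0:
            self.in_flight -= 1


controller = AdmissionController(high_water_mark=3)

# A spike of five requests arrives before any of them finishes.
results = [controller.try_admit() for _ in range(5)]

# Once one request completes, capacity is available again.
controller.complete()
admitted_after_completion = controller.try_admit()

print(results, controller.rejected, admitted_after_completion)
```

The margin between normal load and the high-water mark is the capacity headroom Paddy describes. The faster the system can add capacity, the smaller that margin needs to be and the less work gets rejected during a spike.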

Team structure

[Matthew O’Riordan] — One area I think we haven’t yet touched on is the people element of how you build and maintain a distributed system and arguably this is the most important element. Paul, I think you’ve obviously got a huge amount of experience and insight into this area. Can you share your experience, wisdom on how to structure teams?

[Paul Nordstrom] — Well, I’m not sure structuring them is the right question. I think that to build a scalable system you need the whole team focused on that. There’s no substitute for experience building these systems. You have to engage the team, and you have to make sure that there’s at least somebody on the team who has this experience, or you’re not going to end up with a system that’s scalable. Beyond that, there are all sorts of, I don’t want to call them tricks, but they feel like tricks, and they’re effective at getting to a design. I’m just afraid here that I’m gonna end up saying, what’s the expression? Mom and apple pie, you know? I’m afraid that this part of the discussion is gonna end up with a mom-and-apple-pie kind of answer, because I don’t think there are any silver bullets here. I think it’s important to involve the team in the design so they’re all bought into it. I’ve made that mistake a couple of times before, where I had the design in my head and thought that because it was so clear to me, that was enough. Bad answer. But there are other things you can do. If there’s a design question that is difficult and not clear, one of the things I’ve found really effective is to pick an advocate for each of, let’s say, the two most likely solutions to the problem, have one person argue one side and another person argue the other side, and then make them swap and argue the opposite of what they believe. This has worked out great for me in terms of getting to a consensus, because it makes people look at the problem from the other person’s perspective. People don’t want to argue the side of the question that they don’t agree with or believe in, but it’s really effective.
The other thing about getting a team to build a system that scales, both in the error space, which we were talking about a minute ago, and in the production throughput space, which is where we started the discussion, is to have a conscious list of things you’re going to address and to get those on the table early on. If you gather a lot of the things we discussed today into a little checklist, you can then make sure that you didn’t forget any of them. There are so many aspects of system design, right? A lot of them are judgment calls rather than answers, but if you don’t consider those questions you’re not gonna make good judgment calls. So putting together a really coherent and inclusive list of the aspects of system design, which is really a kind of summary of what we’ve said today, and then using the team to explore the checklist, assigning people to aspects of it to consider and to come up with a recommendation and a plan, is what leads to good judgment calls. I think this ties together the whole conversation we’ve had today: there aren’t silver bullets, but there are points that need to be addressed consciously and thoughtfully, and if you do so you end up with both the teamwork side of this whole problem addressed and good answers to the technical questions.

[Matthew O’Riordan] — That’s interesting, I think it is also quite important to consider one other thing, which is the complexity of the system. And I’m interested to know how often you’ve had to reduce the complexity of a system, knowing it will potentially affect its ability to scale, because the complexity is too great for the team to maintain. And so as an example, Rob Pike famously talked about how Go was designed not necessarily to be a language with the most features but a language that allowed the team to work coherently together and understand each other’s code. Has that ever happened in the designing and building of distributed systems?

[Paul Nordstrom] — It’s a great question, partly because that’s what I would consider to be my largest failing as an engineer, the place where I get into the most trouble, is not simplifying when I should have. So as a lesson to my future system designs, and as a lesson to other people maybe who are a little like me in this way, I think it’s a brilliant thing to point out that that’s a choice that you should lean towards, you know well especially if you’re inclined like me to design things that are maybe a little too complicated. Intentionally restricting both the capabilities but at the same time the complexity of your system, I think is a brilliant way to end up with a system that works well, and you’re just like going a little bit too far and then coming back to your comfort zone, you know you just end up with a system that works better and is inherently more understandable, and those are both important points.

[Paddy Byers] — Definitely, less is more a lot of the time. I think that you have to remember that fundamentally, building software is a human activity and it’s you know, it’s performed by humans and maintainability is just as important a property of the software you write as reliability, and security, and performance, and all those other things. And ultimately, you do have to make sure that you build things you know, within the cognitive limits of the team that you have.

[Paul Nordstrom] — In fact as a leader of one of these teams building something you know, something that I wish I had done before would be to stop and say okay, here’s the team, my team has to build this piece of software, right. In a practical sense I want something to come out of this effort that’s beautiful and works really well. I have to take into account just the general capabilities, the expertise level, the experiences that they’ve had in the past, and even how well they work together, and use that to influence what you just said about choosing what level of complexity I’m going to design the system to.

Tips

[Matthew O’Riordan] — I think it’d be really nice to extend a set of recommendations to our listeners, to just say, after watching this video, what actionable things can you do to take this learning and apply it to distributed systems that you may be trying to build. Can you share your tips Paul?

[Paul Nordstrom] — I mentioned earlier about having a checklist. What I don’t believe in is that there is one checklist that is right for everybody, that I should put down my checklist and put it on the website and then you guys can use it. I think that each of the engineers watching this video is facing a set of problems with a different team, in a different environment, and has his or her own personal experience level and needs. I’m hoping that you know, maybe on review or maybe you’ve been doing it as you went along, you’ve heard some things that resonated with you and so my final point is only that you should consciously make that effective by turning it into a list that you can use to remind you not to omit things when you’re doing your next system design. Because the design of a system in my mind, it’s the most complex thing undertaken by mankind, is the design of a large software system. You know how many man-years of effort go into it, and it’s not just man-years of work, it’s man-years of thought okay. It’s unbelievable. Alright, so there’s no way anybody I know can keep this whole thing in their head without extra aids. I advise that you create external aids customized to yourself, and to your problem, and your environment. And then use those, apply them to the systems you’re designing in the future.

[Paddy Byers] — What I’ll add to that is, go back to what we said at the very beginning, which is a computer science textbook will list out the problems that you have to face when you’re building a distributed system, you have to cope with unreliable networks, you have to cope with latency, you have to cope with consistency issues, but really what we’ve been talking about here is what you learn when you go beyond that, when you design things to operate at scale and when you have experience operating them. So what I would encourage people to do is get away from the textbook and just go build stuff and that’s the best way to find out how to do this for real.
