Introduction

As you scale out your applications, any failure that can happen will eventually happen. Hardware failures, software crashes, memory leaks — you name it. The more components you have, the more failures you will experience.

Suppose you have a buggy service that leaks 1 MB of memory on average every hundred requests. If the service handles a thousand requests per day, chances are you will restart the service to deploy a new build before the leak reaches any significant size. But if your service is handling 10 million requests per day, then by the end of the day it will have leaked 100 GB of memory! Eventually, the servers won’t have enough memory available and they will start to thrash due to the constant swapping of pages in and out of disk.
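As a sanity check on those numbers, here is a quick back-of-the-envelope calculation using the leak rate and request volumes from the example (the function name is just for illustration):

```python
# Estimate the daily memory leaked by the hypothetical service described
# above: 1 MB leaked, on average, every hundred requests.
LEAK_MB_PER_REQUEST = 1 / 100  # 1 MB per 100 requests

def daily_leak_gb(requests_per_day: int) -> float:
    """Memory leaked per day, in GB (using 1 GB = 1000 MB)."""
    return requests_per_day * LEAK_MB_PER_REQUEST / 1000

print(daily_leak_gb(1_000))       # low-traffic service: 0.01 GB/day
print(daily_leak_gb(10_000_000))  # high-traffic service: 100.0 GB/day
```

At a thousand requests per day the leak is negligible; at 10 million it adds up to the 100 GB per day mentioned above.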

This nasty behavior is caused by cruel math: given an operation with a fixed probability of failing, the expected number of failures grows with the total number of operations performed. In other words, the more you scale out your system to handle more load, the more operations and moving parts there are, and the more failures your system will experience.

Remember when we talked about availability and “nines” in chapter 1? Well, to guarantee just two nines, your system can be unavailable for no more than roughly 15 minutes a day. That’s very little time to take any manual action. If you strive for three nines, your downtime budget shrinks to about 43 minutes per month. Although you can’t escape cruel math, you can mitigate it by implementing self-healing mechanisms that reduce the impact of failures.
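To see where those downtime budgets come from, here is a minimal sketch; the “nines” figures are the ones from the text, and a month is approximated as 30 days:

```python
# Downtime budget implied by a given number of "nines" of availability.
def downtime_minutes(nines: int, period_minutes: float) -> float:
    """Maximum downtime allowed over a period for `nines` of availability."""
    unavailability = 10 ** -nines  # e.g. two nines -> 1 - 0.99 = 0.01
    return period_minutes * unavailability

MINUTES_PER_DAY = 24 * 60
MINUTES_PER_MONTH = 30 * MINUTES_PER_DAY  # approximating a month as 30 days

print(round(downtime_minutes(2, MINUTES_PER_DAY), 1))    # two nines: 14.4 min/day
print(round(downtime_minutes(3, MINUTES_PER_MONTH), 1))  # three nines: 43.2 min/month
```

The two-nines budget works out to 14.4 minutes per day (roughly the 15 minutes quoted above), and the three-nines budget to about 43 minutes per month.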

Chapter 15 describes the causes of the most common failures in distributed systems: single points of failure, unreliable networks, slow processes, and unexpected load.

Chapter 16 dives into resiliency patterns, such as timeouts, retries, and circuit breakers, that help shield a service from failures in its downstream dependencies.

Chapter 17 discusses resiliency patterns, such as load shedding, load leveling, and rate-limiting, that help protect a service against upstream pressure.