Introduction

So far, we have learned how to get two processes to communicate reliably and securely with each other. We didn’t go to all this trouble just for the sake of it, though. The end goal has always been to use multiple processes and services to build a distributed application that gives its clients the illusion of interacting with a single node.

Although achieving a perfect illusion is not always possible or desirable, it’s clear that some degree of coordination is needed to build a distributed application. In this part, we will explore the core distributed algorithms at the heart of large-scale services.

Chapter 6 introduces formal models that encode our assumptions about the behavior of nodes, communication links, and timing; think of them as abstractions that allow us to reason about distributed systems by ignoring the complexity of the actual technologies used to implement them.

Chapter 7 describes how to detect that a remote process is unreachable. Since the network is unreliable and processes can crash at any time, without failure detection a process trying to communicate with another could hang forever.
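
To make the idea concrete, here is a minimal sketch of one possible approach, assuming that processes periodically send each other heartbeat messages. The class name, method names, and timeout value are hypothetical and chosen only for illustration. Note that the detector can only suspect that a process is unreachable; it can't tell a crashed process from a slow one.

```python
class HeartbeatFailureDetector:
    """Suspects a process when no heartbeat arrived within timeout_s (hypothetical sketch)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # process id -> time of the last heartbeat

    def heartbeat(self, process_id, now):
        # Record that we heard from the process at time `now` (in seconds).
        self.last_seen[process_id] = now

    def is_suspected(self, process_id, now):
        # A process we never heard from is suspected by default.
        last = self.last_seen.get(process_id)
        if last is None:
            return True
        return now - last > self.timeout_s


detector = HeartbeatFailureDetector(timeout_s=2.0)
detector.heartbeat("node-b", now=10.0)
suspected_early = detector.is_suspected("node-b", now=11.0)  # within the timeout
suspected_late = detector.is_suspected("node-b", now=13.0)   # timeout exceeded
```

Passing the current time in explicitly keeps the sketch deterministic; a real implementation would likely read a monotonic clock instead.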

Chapter 8 dives into the concept of time and order. In this chapter, we will first learn why agreeing on the time an event happened in a distributed system is much harder than it looks, and then explore a solution based on clocks that don’t measure the passing of time.
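
As a preview of what such a clock can look like, the sketch below shows a Lamport clock, one well-known kind of logical clock. It is a counter that orders events rather than measuring time. We are assuming here that this is the kind of clock meant; the class and method names are illustrative only.

```python
class LamportClock:
    """A logical clock: a counter that captures ordering, not wall-clock time."""

    def __init__(self):
        self.counter = 0

    def local_event(self):
        # Every local event advances the counter by one.
        self.counter += 1
        return self.counter

    def send(self):
        # Sending is an event; the returned timestamp travels with the message.
        self.counter += 1
        return self.counter

    def receive(self, message_timestamp):
        # Jump past the sender's timestamp so the receive is ordered after the send.
        self.counter = max(self.counter, message_timestamp) + 1
        return self.counter


a, b = LamportClock(), LamportClock()
a.local_event()                  # a's counter is now 1
sent_at = a.send()               # a's counter is now 2
received_at = b.receive(sent_at) # b's counter becomes max(0, 2) + 1 = 3
```

Because the receiver always moves its counter past the timestamp carried by the message, a send is guaranteed to have a smaller timestamp than the corresponding receive, even though neither process consulted a physical clock.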

Chapter 9 describes how a group of processes can elect a leader that can perform operations the others can’t, like accessing a shared resource or coordinating other processes’ actions.

Chapter 10 introduces one of the fundamental challenges in distributed systems, namely keeping replicated data in sync across multiple nodes. This chapter explores why there is a tradeoff between consistency and availability and describes how the Raft replication algorithm works.

Chapter 11 dives into how to implement transactions that span data partitioned among multiple nodes or services. Transactions shield you from a whole range of possible failure scenarios so that you can focus on the actual application logic rather than on everything that can go wrong.