6 System models

To reason about distributed systems, we need to define precisely what can and can’t happen. A system model encodes assumptions about the behavior of nodes, communication links, and timing; these assumptions let us reason about distributed systems without getting lost in the complexity of the actual technologies used to implement them.

Let’s start by introducing some models for communication links:

The fair-loss link model assumes that messages may be lost and duplicated. If the sender keeps retransmitting a message, eventually it will be delivered to the destination.
The reliable link model assumes that a message is delivered exactly once, without loss or duplication. A reliable link can be implemented on top of a fair-loss one by retransmitting messages until they get through and de-duplicating them at the receiving side.
The authenticated reliable link model makes the same assumptions as the reliable link, but additionally assumes that the receiver can authenticate the message’s sender.

Even though these models are just abstractions of real communication links, they are useful for verifying the correctness of algorithms. As we have seen in the previous chapters, it’s possible to build a reliable and authenticated communication link on top of a fair-loss one. For example, TCP does precisely that (and more), while TLS implements authentication (and more).
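To make the layering concrete, here is a minimal sketch of a reliable link built on top of a fair-loss one. It is not how TCP works internally; `FairLossLink`, `Receiver`, and `reliable_send` are hypothetical names, and the loss and duplication probabilities are arbitrary values chosen for illustration. The sender retransmits until it sees an acknowledgment, and the receiver de-duplicates by sequence number.

```python
import random


class FairLossLink:
    """Hypothetical fair-loss link: each send may be dropped or duplicated."""

    def __init__(self, loss=0.5, dup=0.3, seed=0):
        self.rng = random.Random(seed)
        self.loss = loss  # probability that a packet is lost (assumed value)
        self.dup = dup    # probability that a packet is duplicated (assumed value)

    def send(self, packet):
        """Return the copies of the packet that reach the other side (0, 1, or 2)."""
        if self.rng.random() < self.loss:
            return []
        if self.rng.random() < self.dup:
            return [packet, packet]
        return [packet]


class Receiver:
    """Delivers each sequence number at most once and acknowledges every copy."""

    def __init__(self):
        self.seen = set()
        self.delivered = []

    def on_packet(self, packet):
        seq, payload = packet
        if seq not in self.seen:  # de-duplicate retransmissions and duplicates
            self.seen.add(seq)
            self.delivered.append(payload)
        return seq  # the acknowledgment carries the sequence number


def reliable_send(data_link, ack_link, receiver, seq, payload, max_attempts=1000):
    """Retransmit until an acknowledgment gets back; return the number of attempts."""
    for attempt in range(1, max_attempts + 1):
        acked = False
        for packet in data_link.send((seq, payload)):
            ack = receiver.on_packet(packet)
            if ack_link.send(ack):  # acknowledgments travel over a fair-loss link too
                acked = True
        if acked:
            return attempt
    raise RuntimeError("no acknowledgment received; giving up")


data_link, ack_link = FairLossLink(seed=1), FairLossLink(seed=2)
receiver = Receiver()
for seq, payload in enumerate(["a", "b", "c"]):
    reliable_send(data_link, ack_link, receiver, seq, payload)
print(receiver.delivered)
```

Retransmission alone would hand the receiver duplicates whenever an acknowledgment is lost, and de-duplication alone would never recover a lost message, so the sketch needs both to approximate exactly-once delivery.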

We can also model the different types of node failures we expect to happen:

The arbitrary-fault model assumes that a node can deviate from its algorithm in arbitrary ways, leading to crashes or unexpected behavior due to bugs or malicious activity. The arbitrary-fault model is also referred to as the “Byzantine” model for historical reasons. Interestingly, it can be theoretically proven that a system with Byzantine nodes can still operate correctly as long as fewer than $\frac{1}{3}$ of its nodes are faulty.
The crash-recovery model assumes that a node doesn’t deviate from its algorithm, but can crash and restart at any time, losing its in-memory state.
The crash-stop model assumes that a node doesn’t deviate from its algorithm, but if it crashes it never comes back online.
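The one-third bound for Byzantine nodes is easier to grasp with numbers. The sketch below assumes the classic formulation of the result for Byzantine agreement, in which $n$ nodes can tolerate $f$ faulty ones only if $n \geq 3f + 1$; the function names are hypothetical.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1, i.e., strictly fewer than n/3 faulty nodes."""
    if n < 1:
        raise ValueError("a system needs at least one node")
    return (n - 1) // 3


def min_nodes_for(f: int) -> int:
    """Smallest cluster size that tolerates f Byzantine nodes under n >= 3f + 1."""
    if f < 0:
        raise ValueError("the number of faulty nodes can't be negative")
    return 3 * f + 1


print(max_byzantine_faults(3))  # 0: three nodes can't tolerate even one Byzantine node
print(max_byzantine_faults(4))  # 1
print(min_nodes_for(2))         # 7
```

Note that going from 4 to 6 nodes doesn't buy any extra tolerance under this bound; only the seventh node does.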

While it’s possible to take an unreliable communication link and convert it into a more reliable one using a protocol (e.g., keep retransmitting lost messages), the equivalent isn’t possible for nodes. Because of that, algorithms for different node models look very different from each other.

Byzantine node models are typically used to model safety-critical systems like airplane engine systems, nuclear power plants, financial systems, and other systems where a single entity doesn’t fully control all the nodes [1]. These use cases are outside of the book’s scope, and the algorithms presented will generally assume a crash-recovery model.
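Since the crash-recovery model is the one we will rely on, it helps to see what “losing its in-memory state” means in practice. The following is only a sketch under assumptions of our own: `CrashRecoveryNode` is a hypothetical class, the JSON file stands in for whatever durable storage a real system would use, and a crash is simulated by discarding the in-memory dictionary and reloading from disk.

```python
import json
import os
import tempfile


class CrashRecoveryNode:
    """Hypothetical node: volatile state is lost on a crash, durable state survives."""

    def __init__(self, path):
        self.path = path
        self.volatile = {}
        self.durable = {}
        self.recover()

    def recover(self):
        """Start (or restart) with empty memory and whatever was persisted on disk."""
        self.volatile = {}
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.durable = json.load(f)
        else:
            self.durable = {}

    def put_volatile(self, key, value):
        self.volatile[key] = value  # kept in memory only

    def put_durable(self, key, value):
        self.durable[key] = value
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.durable, f)
            f.flush()
            os.fsync(f.fileno())    # make sure the bytes reach the disk
        os.replace(tmp, self.path)  # atomically swap in the new file

    def crash_and_restart(self):
        """Simulate a crash followed by a restart."""
        self.recover()


def demo():
    with tempfile.TemporaryDirectory() as directory:
        node = CrashRecoveryNode(os.path.join(directory, "state.json"))
        node.put_volatile("cache", 1)
        node.put_durable("term", 3)
        node.crash_and_restart()
        return node.volatile, node.durable


print(demo())  # ({}, {'term': 3})
```

Anything an algorithm must remember across restarts has to be written to durable storage before the node acts on it; everything else has to be rebuilt or re-learned after a crash.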

Finally, we can also model the timing assumptions:

The synchronous model assumes that sending a message or executing an operation never takes longer than a known, bounded amount of time. This is rarely realistic in practice, where we know sending messages over the network can potentially take a very long time, and nodes can be stopped by, e.g., garbage collection cycles or page faults.
The asynchronous model assumes that sending a message or executing an operation on a node can take an unbounded amount of time. Unfortunately, many problems can’t be solved under this assumption; if sending messages can take arbitrarily long, algorithms can get stuck and not make any progress at all.
The partially synchronous model assumes that the system behaves synchronously most of the time, but occasionally it can regress to an asynchronous mode. This model is typically representative enough of practical systems.
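One way to see the difference between the timing models is to look at what a timeout can tell us. In the sketch below, all delays and the bound are made-up values, and `false_suspicions` is a hypothetical helper: every message comes from a node that is alive, so any message that exceeds the timeout represents a wrong suspicion that the node has crashed.

```python
def false_suspicions(delays, timeout):
    """Indices of messages from a live node that arrive later than the timeout."""
    return [i for i, delay in enumerate(delays) if delay > timeout]


BOUND = 1.0  # assumed upper bound on message delay, in seconds

# Synchronous model: every delay respects the bound, so a timeout set to the
# bound never wrongly suspects a live node.
synchronous = [0.2, 0.9, 0.5, 1.0]

# Partially synchronous model: delays usually respect the bound, but the third
# message is caught in an asynchronous episode (e.g., a long pause or congestion).
partially_synchronous = [0.2, 0.9, 7.5, 0.4]

print(false_suspicions(synchronous, BOUND))            # []
print(false_suspicions(partially_synchronous, BOUND))  # [2]
```

In the asynchronous model no finite timeout is safe, since a delay can always exceed it. Under partial synchrony, an algorithm can use timeouts to make progress as long as an occasional wrong suspicion doesn't compromise its correctness.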

In the rest of the book, we will generally assume a system model with fair-loss links, nodes with crash-recovery behavior, and partial synchrony. For the interested reader, “Introduction to Reliable and Secure Distributed Programming” is an excellent theoretical book that explores distributed algorithms for a variety of other system models not considered in this text.

But remember, models are just an abstraction of reality, and sometimes abstractions leak. As you read along, question the models’ assumptions and try to imagine how algorithms that rely on them could break.

[1] For example, digital cryptocurrencies such as Bitcoin implement algorithms that assume Byzantine nodes.