Understanding Distributed Systems
Copyright
About the author
Acknowledgements
Preface
0.1 Who should read this book
1 Introduction
1.1 Communication
1.2 Coordination
1.3 Scalability
1.4 Resiliency
1.5 Operations
1.6 Anatomy of a distributed system
(PART) Communication
Introduction
2 Reliable links
2.1 Reliability
2.2 Connection lifecycle
2.3 Flow control
2.4 Congestion control
2.5 Custom protocols
3 Secure links
3.1 Encryption
3.2 Authentication
3.3 Integrity
3.4 Handshake
4 Discovery
5 APIs
5.1 HTTP
5.2 Resources
5.3 Request methods
5.4 Response status codes
5.5 OpenAPI
5.6 Evolution
(PART) Coordination
Introduction
6 System models
7 Failure detection
8 Time
8.1 Physical clocks
8.2 Logical clocks
8.3 Vector clocks
9 Leader election
9.1 Raft leader election
9.2 Practical considerations
10 Replication
10.1 State machine replication
10.2 Consensus
10.3 Consistency models
10.3.1 Strong consistency
10.3.2 Sequential consistency
10.3.3 Eventual consistency
10.3.4 CAP theorem
10.4 Practical considerations
11 Transactions
11.1 ACID
11.2 Isolation
11.2.1 Concurrency control
11.3 Atomicity
11.3.1 Two-phase commit
11.4 Asynchronous transactions
11.4.1 Log-based transactions
11.4.2 Sagas
11.4.3 Isolation
(PART) Scalability
Introduction
12 Functional decomposition
12.1 Microservices
12.1.1 Benefits
12.1.2 Costs
12.1.3 Practical considerations
12.2 API gateway
12.2.1 Routing
12.2.2 Composition
12.2.3 Translation
12.2.4 Cross-cutting concerns
12.2.5 Caveats
12.3 CQRS
12.4 Messaging
12.4.1 Guarantees
12.4.2 Exactly-once processing
12.4.3 Failures
12.4.4 Backlogs
12.4.5 Fault isolation
12.4.6 Reference plus blob
13 Partitioning
13.1 Sharding strategies
13.1.1 Range partitioning
13.1.2 Hash partitioning
13.2 Rebalancing
13.2.1 Static partitioning
13.2.2 Dynamic partitioning
13.2.3 Practical considerations
14 Duplication
14.1 Network load balancing
14.1.1 DNS load balancing
14.1.2 Transport layer load balancing
14.1.3 Application layer load balancing
14.1.4 Geo load balancing
14.2 Replication
14.2.1 Single leader replication
14.2.2 Multi-leader replication
14.2.3 Leaderless replication
14.3 Caching
14.3.1 Policies
14.3.2 In-process cache
14.3.3 Out-of-process cache
(PART) Resiliency
Introduction
15 Common failure causes
15.1 Single point of failure
15.2 Unreliable network
15.3 Slow processes
15.4 Unexpected load
15.5 Cascading failures
15.6 Risk management
16 Downstream resiliency
16.1 Timeout
16.2 Retry
16.2.1 Exponential backoff
16.2.2 Retry amplification
16.3 Circuit breaker
16.3.1 State machine
17 Upstream resiliency
17.1 Load shedding
17.2 Load leveling
17.3 Rate-limiting
17.3.1 Single-process implementation
17.3.2 Distributed implementation
17.4 Bulkhead
17.5 Health endpoint
17.5.1 Health checks
17.6 Watchdog
(PART) Testing and operations
Introduction
18 Testing
18.1 Scope
18.2 Size
18.3 Practical considerations
19 Continuous delivery and deployment
19.1 Review and build
19.2 Pre-production
19.3 Production
19.4 Rollbacks
20 Monitoring
20.1 Metrics
20.2 Service-level indicators
20.3 Service-level objectives
20.4 Alerts
20.5 Dashboards
20.5.1 Best practices
20.6 On-call
21 Observability
21.1 Logs
21.2 Traces
21.3 Putting it all together
22 Final words