Understanding Distributed Systems

  1. Understanding Distributed Systems
  2. Copyright
  3. About the author
  4. Acknowledgements
  5. Preface
    1. 0.1 Who should read this book
  6. 1 Introduction
    1. 1.1 Communication
    2. 1.2 Coordination
    3. 1.3 Scalability
    4. 1.4 Resiliency
    5. 1.5 Operations
    6. 1.6 Anatomy of a distributed system
  7. Part I: Communication
  8. Introduction
  9. 2 Reliable links
    1. 2.1 Reliability
    2. 2.2 Connection lifecycle
    3. 2.3 Flow control
    4. 2.4 Congestion control
    5. 2.5 Custom protocols
  10. 3 Secure links
    1. 3.1 Encryption
    2. 3.2 Authentication
    3. 3.3 Integrity
    4. 3.4 Handshake
  11. 4 Discovery
  12. 5 APIs
    1. 5.1 HTTP
    2. 5.2 Resources
    3. 5.3 Request methods
    4. 5.4 Response status codes
    5. 5.5 OpenAPI
    6. 5.6 Evolution
  13. Part II: Coordination
  14. Introduction
  15. 6 System models
  16. 7 Failure detection
  17. 8 Time
    1. 8.1 Physical clocks
    2. 8.2 Logical clocks
    3. 8.3 Vector clocks
  18. 9 Leader election
    1. 9.1 Raft leader election
    2. 9.2 Practical considerations
  19. 10 Replication
    1. 10.1 State machine replication
    2. 10.2 Consensus
    3. 10.3 Consistency models
      1. 10.3.1 Strong consistency
      2. 10.3.2 Sequential consistency
      3. 10.3.3 Eventual consistency
      4. 10.3.4 CAP theorem
    4. 10.4 Practical considerations
  20. 11 Transactions
    1. 11.1 ACID
    2. 11.2 Isolation
      1. 11.2.1 Concurrency control
    3. 11.3 Atomicity
      1. 11.3.1 Two-phase commit
    4. 11.4 Asynchronous transactions
      1. 11.4.1 Log-based transactions
      2. 11.4.2 Sagas
      3. 11.4.3 Isolation
  21. Part III: Scalability
  22. Introduction
  23. 12 Functional decomposition
    1. 12.1 Microservices
      1. 12.1.1 Benefits
      2. 12.1.2 Costs
      3. 12.1.3 Practical considerations
    2. 12.2 API gateway
      1. 12.2.1 Routing
      2. 12.2.2 Composition
      3. 12.2.3 Translation
      4. 12.2.4 Cross-cutting concerns
      5. 12.2.5 Caveats
    3. 12.3 CQRS
    4. 12.4 Messaging
      1. 12.4.1 Guarantees
      2. 12.4.2 Exactly-once processing
      3. 12.4.3 Failures
      4. 12.4.4 Backlogs
      5. 12.4.5 Fault isolation
      6. 12.4.6 Reference plus blob
  24. 13 Partitioning
    1. 13.1 Sharding strategies
      1. 13.1.1 Range partitioning
      2. 13.1.2 Hash partitioning
    2. 13.2 Rebalancing
      1. 13.2.1 Static partitioning
      2. 13.2.2 Dynamic partitioning
      3. 13.2.3 Practical considerations
  25. 14 Duplication
    1. 14.1 Network load balancing
      1. 14.1.1 DNS load balancing
      2. 14.1.2 Transport layer load balancing
      3. 14.1.3 Application layer load balancing
      4. 14.1.4 Geo load balancing
    2. 14.2 Replication
      1. 14.2.1 Single leader replication
      2. 14.2.2 Multi-leader replication
      3. 14.2.3 Leaderless replication
    3. 14.3 Caching
      1. 14.3.1 Policies
      2. 14.3.2 In-process cache
      3. 14.3.3 Out-of-process cache
  26. Part IV: Resiliency
  27. Introduction
  28. 15 Common failure causes
    1. 15.1 Single point of failure
    2. 15.2 Unreliable network
    3. 15.3 Slow processes
    4. 15.4 Unexpected load
    5. 15.5 Cascading failures
    6. 15.6 Risk management
  29. 16 Downstream resiliency
    1. 16.1 Timeout
    2. 16.2 Retry
      1. 16.2.1 Exponential backoff
      2. 16.2.2 Retry amplification
    3. 16.3 Circuit breaker
      1. 16.3.1 State machine
  30. 17 Upstream resiliency
    1. 17.1 Load shedding
    2. 17.2 Load leveling
    3. 17.3 Rate-limiting
      1. 17.3.1 Single-process implementation
      2. 17.3.2 Distributed implementation
    4. 17.4 Bulkhead
    5. 17.5 Health endpoint
      1. 17.5.1 Health checks
    6. 17.6 Watchdog
  31. Part V: Testing and operations
  32. Introduction
  33. 18 Testing
    1. 18.1 Scope
    2. 18.2 Size
    3. 18.3 Practical considerations
  34. 19 Continuous delivery and deployment
    1. 19.1 Review and build
    2. 19.2 Pre-production
    3. 19.3 Production
    4. 19.4 Rollbacks
  35. 20 Monitoring
    1. 20.1 Metrics
    2. 20.2 Service-level indicators
    3. 20.3 Service-level objectives
    4. 20.4 Alerts
    5. 20.5 Dashboards
      1. 20.5.1 Best practices
    6. 20.6 On-call
  36. 21 Observability
    1. 21.1 Logs
    2. 21.2 Traces
    3. 21.3 Putting it all together
  37. 22 Final words