Introduction

When all resiliency mechanisms fail, human operators are the last line of defense. Historically, developers, testers, and operators were part of different teams. The developers handed over their software to a team of QA engineers responsible for testing it. When the software passed that stage, it moved to an operations team responsible for deploying it to production, monitoring it, and responding to alerts.

This model is being phased out in the industry, as it has become commonplace for the development team to also be responsible for testing and operating the software it writes. This forces developers to embrace an end-to-end view of their applications, acknowledging that faults are inevitable and need to be accounted for.

Chapter 18 describes the different types of tests — unit, integration, and end-to-end tests — you can leverage to increase the confidence that your distributed applications work as expected.

Chapter 19 dives into continuous delivery and deployment pipelines used to release changes safely and efficiently to production.

Chapter 20 discusses how to use metrics and service-level indicators to monitor the health of distributed systems. It then describes how to define objectives that trigger alerts when breached. Finally, the chapter lists best practices for dashboard design.

Chapter 21 introduces the concept of observability and how it relates to monitoring. Then it describes how traces and logs can help developers debug their systems.