The Outage Postmortem

I’m not going to talk about the details of incident management, but the “postmortem” process is a critical element of good engineering. In fact, instead of calling the process a postmortem, many have started calling it a “learning review” to indicate that its purpose is not determining the cause of death but learning from the incident. A lot has been written on this topic, so I’ll highlight only a few elements that I believe are critical, especially for small teams:

  • Resist the urge to point fingers and blame. It is incredibly tempting, after a stressful outage, to point fingers and ask people why they failed to foresee the consequences of their behavior. Why did they run that command on that box? Why didn’t they test that? Why did they ignore that alert? Unfortunately, this blaming only results in people being afraid to make mistakes.
  • Look at the circumstances around the incident and understand the context of the events. You want to identify and understand the factors that contributed to the incident. This might include looking for tests that would have detected the problem, or tools that would have made managing the incident go more smoothly. A good list of these circumstantial contributors helps you detect patterns and areas for improvement, and it forms the “learning” part of the learning review.
  • Be realistic about which takeaways are important and which are worth dropping. Be careful not to give the impression that people need to solve every problem they identify in the course of the exercise. Many learning reviews end in a laundry list of things that could be improved: everything from cleaning up alerts to adding role restrictions to following up with a third-party vendor to understand its API. It’s unlikely you will get to all of these, and in fact, if you try to do all of them, you will likely end up doing none of them. Pick the one or two that are truly high-risk and highly likely to cause future problems, and acknowledge the ones that you are going to let go for now.