Frequency of Incidents

How stable is the software being produced by the team? Is the quality improving, getting worse, or staying the same? Determining the level of software quality you need for the product you’re building, and adjusting that level over time, is a technical challenge for you, the manager, to help address. If you’re building a brand-new product for a small but growing business, it may be more important to focus on features over stability. On the other hand, if you own mission-critical systems, stability and incident minimization may be your top priority. The goal here is to balance risk in such a way that neither incident frequency nor incident prevention turns into a job that takes developers away from writing code for days at a time.
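If you want something more concrete than a gut feeling for whether things are "improving, getting worse, or staying the same," one rough approach is to count incidents per week and compare the recent weeks with the earlier ones. The sketch below is only an illustration: the function names, the four-week window, the sample dates, and the `tolerance` threshold are all assumptions made for the example, not a standard measure, and you would feed it dates from whatever incident tracker your team actually uses.

```python
from datetime import date


def weekly_incident_counts(incident_dates, start, weeks):
    """Count incidents in each of `weeks` consecutive 7-day windows from `start`."""
    counts = [0] * weeks
    for d in incident_dates:
        index = (d - start).days // 7
        # Ignore incidents that fall outside the observed period.
        if 0 <= index < weeks:
            counts[index] += 1
    return counts


def trend(counts, tolerance=0.5):
    """Compare the average of the later half of the period with the earlier half.

    `tolerance` is an arbitrary, hypothetical threshold (in incidents per week)
    below which a change is treated as noise.
    """
    half = len(counts) // 2
    if half == 0:
        return "not enough data"
    earlier = sum(counts[:half]) / half
    later = sum(counts[-half:]) / half
    if later > earlier + tolerance:
        return "getting worse"
    if later < earlier - tolerance:
        return "improving"
    return "staying the same"


# Made-up sample data: seven incidents over four weeks.
incidents = [
    date(2024, 1, 2), date(2024, 1, 3),
    date(2024, 1, 9), date(2024, 1, 10), date(2024, 1, 11),
    date(2024, 1, 16),
    date(2024, 1, 24),
]
counts = weekly_incident_counts(incidents, start=date(2024, 1, 1), weeks=4)
print(counts, trend(counts))  # prints: [2, 3, 1, 1] improving
```

A raw count like this says nothing about severity or about how much developer time each incident consumed, so treat it as a conversation starter rather than a verdict on the team.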

You may work for a company that has developers support the code or systems they write. This practice has some downsides; most significantly, expecting members of a team to frequently be on-call nights and weekends is a huge contributor to burnout. Despite that risk, it has the upside of putting the people best equipped to fix a problem in the position of responding to it. As a manager, you may be tempted now to take yourself out of this rotation. I sympathize, but if your team is set up to do its own incident management, you should be moving yourself into the role of escalation support. You won’t necessarily manage incidents as frequently, but you’ll be expected to be available more often in case the person supporting the systems needs you.

Analysis around incident management should include the question, “Is our current setup enabling my team to do what they do best every day?” Incident management, when it becomes merely reacting to incidents rather than working to reduce them, can turn into a task that diminishes your team’s ability to do what they do best. Engineers go on-call, they get burned out and exhausted from handling the deluge of problems and getting nothing done but fixing the consequences of incidents, and then they hand off the job to the next poor sap on the rotation. If that describes your team’s approach to incident management and on-call, your team is not able to do what they do best every day, and every time they go on-call they probably hate their jobs a little bit more. In this case, as a leader you probably want to focus on giving the team time to design systems that are more stable, or to write code that fixes the recurring incidents as they arise.

An overemphasis on incident prevention can also reduce your team’s ability to do what they do best every day. Overfocusing on building systems that are defect-free, or pushing for error prevention by slowing down the development process, is often almost as bad as moving too fast and releasing unstable code. When risk reduction turns into weeks of manual QA, excessive and slow code reviews, infrequent releases, or a drawn-out planning process, the increased analysis can leave developers idle and restless, without necessarily reducing the risk of incidents.