Dec 5, 2022

Every Stoppage is an Incident

"What's not measured is not improved" – Peter Drucker

Any stoppage in a release train is a hurdle in delivering the product to the customer. It's also a learning opportunity for the team to understand the problem and make the system more resilient in the future.

By cataloging incident data in a consistent, uniform, and readable manner, you can gradually move towards a release train system where errors are the exception rather than the norm.

Start by capturing incident data by hand if needed, though it's more convenient to capture and catalog every hiccup in the system automatically.

Once you have some data, a good way to begin improving your release engineering system is to work on reducing the most frequent incidents first.
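As a minimal sketch of that first step, the snippet below ranks incidents by how often they occurred. The incident labels and the `rank_incidents` helper are hypothetical, made up for illustration; any hand-kept log with one entry per stoppage would work the same way.

```python
from collections import Counter

# Hypothetical hand-captured log: one entry per stoppage, labelled by the
# incident it turned out to be. The labels are invented for illustration.
stoppages = [
    "artifact-upload-timeout",
    "environment-claim-failed",
    "artifact-upload-timeout",
    "integration-test-flake",
    "artifact-upload-timeout",
    "environment-claim-failed",
]


def rank_incidents(log):
    """Return (incident, count) pairs, most frequent first."""
    return Counter(log).most_common()


ranking = rank_incidents(stoppages)
for incident, count in ranking:
    print(f"{count:>3}  {incident}")
```

The top row of the ranking is the incident to tackle first.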

Define Incident

It's important to define what an incident is for your context. We consciously avoided calling it an error, bug, issue, or failure. Those terms all have other meanings or contextual connotations in software engineering that would muddy the conversation.

We defined an incident as any unique stoppage occurring in a release train. Cataloging incidents by uniqueness helped us take a combined approach to solving identical problems. Otherwise, a typical release train with a large number of changes would produce an overwhelming count of stoppages.
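One way to catalog stoppages by uniqueness is to reduce each one to a fingerprint and group on it. The sketch below is an assumption about how that could be done, not a description of our tooling: the job names, the messages, and the rule of replacing digit runs with a placeholder are all hypothetical, and a real system would need normalization rules tuned to its own logs.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(job: str, message: str) -> str:
    """Reduce a stoppage to a stable key so repeats collapse into one incident."""
    # Assumed rule: digit runs (build numbers, ports, IP octets) are noise,
    # so replace each run with a placeholder before hashing.
    normalized = re.sub(r"\d+", "N", message.lower()).strip()
    return hashlib.sha256(f"{job}|{normalized}".encode()).hexdigest()[:12]


def group_stoppages(stoppages):
    """Map each fingerprint to the list of stoppages that share it."""
    incidents = defaultdict(list)
    for job, message in stoppages:
        incidents[fingerprint(job, message)].append((job, message))
    return dict(incidents)


# Hypothetical stoppages from one release train.
stoppages = [
    ("deploy", "Timeout after 300s waiting for host 10.0.0.12"),
    ("deploy", "Timeout after 300s waiting for host 10.0.0.47"),
    ("smoke-test", "Connection refused on port 8443"),
]
incidents = group_stoppages(stoppages)
print(f"{len(stoppages)} stoppages, {len(incidents)} unique incidents")
```

The two deploy timeouts differ only in the host address, so they collapse into a single incident that can be solved once.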

Categorize Incidents

An incident can be of many types. To gather meaningful data, categorize incidents in a way that fits you and your team's context. Since an incident is a stoppage that needs to be resolved for the release train to continue, we found it best to categorize each incident post-resolution in the following way:

Auto-restarting the job (i.e. flakes)
Restarting the job after optionally applying a known remedy (e.g. SSHing in and running a command to clean up leftover state). A known remedy resolves the incident but doesn't necessarily fix the underlying issue in the product or the pipeline.
Restarting that pipeline of the train (i.e. restarting from create-environment) and unclaiming the environment (i.e. environmental flakes)
Releasing with a known issue in the product. This requires documenting the issue if it impacts customers, along with a workaround if needed.
Not releasing the impacted product variant in the release train. A typical release train may release multiple versions and different products together, so in this case you'd eject the affected version from the release train.
Restarting the train from the beginning, for instances when the issue is drastic enough to warrant a halt and restart.

The list is ordered from cheapest to most expensive in both time and effort.
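Because the categories are ordered by cost, they map naturally onto an ordered enumeration. The following is a sketch under that assumption; the member names and the `summarize` helper are hypothetical labels for the six categories above, not identifiers from any real tool.

```python
from collections import Counter
from enum import IntEnum


class Resolution(IntEnum):
    """Resolution categories; a higher value means a costlier resolution."""
    AUTO_RESTART_JOB = 1
    RESTART_JOB_WITH_REMEDY = 2
    RESTART_PIPELINE = 3
    RELEASE_WITH_KNOWN_ISSUE = 4
    EJECT_VARIANT = 5
    RESTART_TRAIN = 6


def summarize(resolutions):
    """Count incidents per category, listed from cheapest to most expensive."""
    counts = Counter(resolutions)
    return [(r.name, counts[r]) for r in sorted(counts)]


# Hypothetical resolutions recorded for one release train.
resolved = [
    Resolution.RESTART_PIPELINE,
    Resolution.AUTO_RESTART_JOB,
    Resolution.AUTO_RESTART_JOB,
    Resolution.EJECT_VARIANT,
]
summary = summarize(resolved)
for name, count in summary:
    print(f"{count:>3}  {name}")
```

Using an ordered type keeps comparisons such as "was this resolution costlier than a pipeline restart?" down to a plain `>` check.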

Having a way to record incidents is a key tactic in your journey to creating a robust, resilient release engineering system. Use this data in your postmortems by employing the improvement workflow proposed here.
