Nov 24, 2022

Debug to Revert; Not to Fix

At any large company, a Release Engineering team receives a large number of changes to incorporate into the product, test, and ship to customers. A release train therefore typically carries many changes at once.

As often happens, one of those changes fails and stops the release train. An engineer's typical response is to debug, figure out which change caused the error, and fix the underlying issue. Doing so is utterly wrong.

It's akin to stopping a train full of passengers because one of them got sick, then waiting for the paramedics to arrive and fully heal them before continuing the journey. The commonsense response is to let the passenger off at the next platform and have the paramedics take it from there while the train leaves for its intended destination. The passenger can hop on the next train.

Stopping the release process for one change is a waste of time. The right approach is to debug the combination of changes just enough to identify the culprit and then revert it. That is the shortest route to getting the train moving again. Any more time than that delays every other change on its way to customers.
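To make "debug just enough to revert" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular release tool: the names `find_culprit` and `revert_until_green` are hypothetical, `passes` stands in for whatever build-and-test step your train runs, and the changes are assumed to be independent enough that one can be dropped without breaking the others.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for "build and test this set of changes".
PassCheck = Callable[[Sequence[str]], bool]


def find_culprit(changes: Sequence[str], passes: PassCheck) -> int:
    """Bisect for the index of the first change that makes the train fail.

    Assumes the empty train passes, the full train fails, and that once a
    prefix of the train fails, every longer prefix fails too.
    """
    lo, hi = 0, len(changes)  # prefix of length lo passes, length hi fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(changes[:mid]):
            lo = mid
        else:
            hi = mid
    return lo  # changes[lo] is the change that tips the train into failure


def revert_until_green(
    changes: Sequence[str], passes: PassCheck
) -> Tuple[List[str], List[str]]:
    """Drop culprits one at a time until the remaining train passes."""
    train = list(changes)
    reverted: List[str] = []
    while not passes(train):
        if not train:
            # Assumption violated: the failure is not caused by any change.
            raise RuntimeError("train fails even with every change reverted")
        reverted.append(train.pop(find_culprit(train, passes)))
    return train, reverted
```

The release engineer's work ends with the two lists: `train` ships, and each entry in `reverted` goes back to its component team along with the reproduction steps. Nothing in the loop attempts a fix.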

While this sounds like common sense, I have seen release engineers time and time again try to fix and ship a failing change by making further changes to the product, leading to frustration and cognitive burden all around. A release engineer's sole job should be to get all passing changes to customers as soon as possible, not to figure out and fix the failing changes themselves.

At times, it may feel like debugging which change broke the build gives us enough knowledge that we may as well fix the problem too, but doing so has several disadvantages:

It shifts the responsibility for producing a working change from the component teams to the release team, and it robs the component teams of a learning opportunity. The best a release engineer can do is provide reproduction steps and debugging tips to the component teams.
It lengthens the time it takes for the release train to move on. Reverting is the fastest, easiest thing to do in most if not all circumstances.
And at times, you'll notice that once you fix one change, the train fails again further along because of another change. Continuing to debug to fix is a recipe for disaster.

The only exceptions to this guideline are when a CVE fix needs to be shipped or when a release train includes only a single change. In those cases, it is better to mob with the team in question to resolve the problem and move the process along.
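The guideline and its two exceptions fit in a few lines of Python. This is only a sketch of the rule as stated here; the `Change` type, its `is_cve_fix` flag, and the `next_action` function are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Change:
    # Hypothetical record of one change riding the release train.
    id: str
    is_cve_fix: bool = False


def next_action(train: Sequence[Change], culprit: Change) -> str:
    """Decide what to do with the change that broke the train."""
    # Exceptions: a CVE fix must ship, and a one-change train has
    # nothing else waiting behind the failure.
    if culprit.is_cve_fix or len(train) == 1:
        return "mob with the component team"
    return "revert"
```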

So, debug just enough to find the culprit, revert it, and communicate the error states to the component teams. It may be tempting to go down the rabbit hole and help the team ship their feature to customers, but resist it. Holding that line will allow you to better operationalize your release engineering service for your company and your customers.
