Dec 13, 2022

Implement Auto-Retries

We found that the majority of our incidents occurred due to test flakes. Test flakiness happens in most large software systems and complex integration test suites.

For the longest time, we believed that it was important to feel the pain and resolve each flakiness by fixing the underlying issue. The nature of flaky test failures is hard to replicate and the amount of investment it typically takes to improve the test suite is high enough that we never got to it.

On top of that, as a Release Engineering team, understanding these test flakes requires deep understanding of the components interacting. This made it even tougher especially since our attention is typically around making the process of release engineering more robust.

After much back and forth and deliberation, we implemented a retry strategy that removed the need to manually pay attention to these incidents. We also made sure to record these flakes in a datastore for us to pay attention if needed.

We saw enough productivity gains with this one change because more often than not, we were able to get through a release train over the weekend that would otherwise be waiting on someone to press restart on a pipeline.

Subscribe to The Release Engineer