---
title: "The Release Engineer"
author: "Rizwan Reza"
url: "https://release.engineer/2/release-engineer"
---

# Introduction

Release Engineering, sometimes shortened to RelEng, focuses on how software gets packaged, shipped, and distributed to end users. The discipline is most prevalent where complex software ships as a standalone product. In contrast, disciplines like [Site Reliability Engineering](https://sre.google/sre-book/introduction/) and [Continuous Delivery](https://martinfowler.com/bliki/ContinuousDelivery.html) deal with live hosted applications. If an artifact is distributed to users rather than run as a maintained service, Release Engineering is most definitely at play.

For businesses, Release Engineering solves a scaling problem. As companies grow from startups into enterprises, the number of engineers writing code for a given product increases rapidly. This creates a slew of coordination and compatibility problems between different parts of the system, which reduces productivity and innovation:

* Team A’s component worked yesterday, but now that Team B has shipped their component today, Team A’s no longer works.
* Packaging the components into a single installable product becomes problematic because components lack contracts with one another.
* No one knows exactly which version is running at different customer sites; Supply Chain Management becomes necessary.
* Security vulnerabilities and bugs creep into the product post-installation.

Release Engineering exists to solve these problems that large companies face. It begins with an emphasis on the process required to build and ship the product rather than on the product itself. While the type of software shapes the design and implementation of the process, the principles and tactics of Release Engineering apply universally across different products.

Release Engineering emphasizes establishing and maintaining a baseline level of quality in the product, building tools that let developers easily test their changes in isolation, and automating the process as much as possible. It also focuses on creating a traceable supply chain, so each artifact version can be traced back to its source components.

Before we begin discussing the mechanics and components of Release Engineering, let’s set the context of its relevance and need.

Release Engineering concerns itself with software that’s digitally or physically shipped to customers. Given the dominance of cloud technologies, we’ll focus solely on the former—a downloadable artifact.

As discussed in the previous post, Release Engineering solves scaling problems as companies grow from startups to enterprises.

But that only scratches the surface. Dig deeper, and enterprise companies turn out to be made up of multiple players, each with their own motivations and needs.

## Solves Growing Pains
As software businesses grow and hire more engineers, their release process becomes fragile and broken. Coordinating what gets released in which version becomes hard, and the release process slows down. Companies typically respond with cultural norms like code freezes, QA phases, and maintenance windows in an attempt to maintain control. It is common to bake in as much time for testing and bug fixes as it takes to develop the features themselves.

Even then, the release process is very hard to execute effectively. Long working hours are common, and sustaining the team through these periods is a challenge.

Release Engineering aims to solve this problem by centralizing the release machinery around a set of principles and norms in the company, integrating them into an automated process with a nimble team alongside it, so the business can initiate the release process at any point in time.

It enables the business to release its software repeatably, reliably, and with confidence. And as the business takes on more and more customers, a robust Release Engineering system makes sure the company continues to meet its increasing demands.

## So, what did we ship?
Software today is a snapshot of a complicated combination of libraries. When companies ship software without knowing exactly what they shipped, a variety of problems follow, and it becomes embarrassingly difficult to determine whether a customer needs a bug fix.

Legal and compliance requirements at any reasonably serious enterprise also make it necessary to keep an inventory of what is being released to customers.

Generally, this has been solved by assigning a global version identifier (e.g., 1.0.0 and so on) to each bundled artifact. But this breaks down when it isn’t clear what that version is made up of, or when determining each component’s version number requires painful traversal. And let’s not bring up spreadsheets, please.

Releasing a version of the product by merely zipping up a folder of components and assigning it a unique version is at best lazy and at worst plain dangerous.

A solid Release Engineering system takes its lessons from Supply Chain Management, making sure each included artifact is traceable down to its atomic units. The released version simply acts as a pointer to the set of component versions it comprises, making it a snap to determine what was actually shipped to customers.

Knowing this information means companies can release a fix or address a security vulnerability confidently and efficiently.
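As a hedged illustration of this version-as-pointer idea, here is a minimal sketch of a release manifest. The product name, component names, and field layout are hypothetical, not any vendor's actual scheme:

```python
import json

# Hypothetical manifest: the customer-facing version is a pointer
# to an exact, fully pinned set of component versions.
manifest = {
    "product": "example-product",
    "version": "4.2.1",
    "components": {
        "runtime": {"version": "2.8.0", "git_sha": "9f3c1ab"},
        "router": {"version": "1.14.3", "git_sha": "77d02ce"},
        "logging-agent": {"version": "0.9.12", "git_sha": "b51e9d4"},
    },
}

def components_of(release: dict) -> dict:
    """Answer 'what did we ship?' for a given release."""
    return {name: meta["version"] for name, meta in release["components"].items()}

print(json.dumps(components_of(manifest), indent=2))
```

With a manifest like this published alongside each artifact, a question such as "which releases contain logging-agent 0.9.12?" becomes a lookup rather than an investigation.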

## Addressing Security Vulnerabilities
A piece of software shipped to customers is used in a variety of contexts. The product that inspired this book, Tanzu Application Service, is used by governments, banks, e-commerce companies, and other high-demand, crucial services.

This means that security vulnerabilities across its fifty components and thousands of libraries threaten customers and their users.

An efficient automated release engineering system allows the company to:

* Minimize the time it takes to compile the release artifact to be shipped to customers.
* Confidently ship the fix without introducing any unknown functional or performance regressions.

## Did I break the system?
As discussed, companies attempting to grow by hiring more engineers do not want to ship fewer features. And yet, that is exactly what typically unfolds: past a certain point, adding more engineers lowers the speed of development in a shared codebase. Addressing this through software architecture and org design is a must, and a topic worth researching in its own right.

But one factor that leads to slow release cycles tends to be the dependence of one component’s work on another. On top of that, this interdependence means that a single customer outcome requires touchpoints across different codebases and teams. It frustrates the business and the engineers alike.

Here, Release Engineering aims to provide a service that relieves developers of the responsibility of building and compiling the software at any point in time. Precisely because of that, a freshly built binary containing a new feature lets a developer test their work easily. Apart from providing fast feedback loops, it also allows developers to keep working relatively independently of one another.

The hallmark of a good release engineering system also lies in providing testing environments and automated test suite runners so developers continue to focus on doing good work.

Release Engineering is needed when you can no longer point to a single team that owns the product as a whole. This is a scaling problem, as noted before. Release Engineering focuses on the process so the product isn’t impacted by the lack of a single owner for building, compiling, testing, and shipping it to customers.

As teams scale and own disparate chunks of a product, the Release Engineering team emerges as the gatekeeper and becomes a natural integration point to make sure a quality product goes out to customers.

It’s worth noting that a microservice architecture for web applications can skip Release Engineering by defining versioned interfaces and contracts between components. This removes the focal point of integration, as each team can deliver their solution to customers independently. This is a huge reason organizations are attracted to microservices.

This organizational and architectural design breaks down in our context of shipping a binary to customers that bundles all the pieces together. A Release Engineering team, therefore, disentangles the rest of the organization from the responsibility of delivering the product so the teams can independently concentrate on their domain areas.

And by extension, a good release engineering process allows the organization to grow each team without worrying about allocating ever more resources to scaling and maintaining release processes.

Achieving such organizational independence requires an attitude of constant, continuous improvement of the release processes. A new task might begin as a manual step, but because the focus of this team is on engineering the process itself, it eventually becomes an automated workflow. Over time, this rigor brings fluidity, trust, and confidence in the ability to consistently deliver software to customers.

## Release Engineering vs. Site Reliability Engineering
Site Reliability Engineering focuses on running services reliably. It emerged from Google and has rightfully garnered attention as a fundamental discipline for operating and designing systems that run complex software machinery reliably. It brings software engineering tactics and philosophy to a world traditionally viewed as system administration, and it has transformed DevOps by introducing software engineering into system operations, understanding risk, and creating service metrics.

Site Reliability Engineering differs heavily from Release Engineering in that the former applies to the infrastructure and operations of the software, while the latter’s goal is to ship software through a reliable and repeatable process. While some principles carry over across these sub-disciplines, they solve fundamentally different problems.

Topics such as toil and automation translate perfectly, but monitoring and establishing service level objectives with end users and customers don’t find as neat a home here.

At a high level, Site Reliability Engineering makes software-based services more resilient. Release Engineering, on the other hand, solves how a product gets into the hands of users.

_If you'd like to learn more about Site Reliability Engineering, read Google's excellent [Site Reliability Engineering book](https://sre.google/)._

At Tanzu, we teach and employ Site Reliability Engineering principles for our live services. Our Release Engineering practices are also inspired by some of the SRE principles, but it’s worth noting that the two are applied at fundamentally different points in the lifecycle of software.

As we’ll explore further, the goal is for release engineers to rely on automated services to drive the process from building to compiling to testing to release. Each of these phases comprises numerous services that are themselves encouraged to employ SRE principles. The user in this case may not be the end user, and availability metrics may not need to be as high as 99.99%, but keeping these frameworks in mind for each slice of service in release engineering is useful in its own right.

## Release Engineering vs. Quality Assurance
A QA role in a typical tech company today centers on finding and fixing bugs, typically by writing test suites aimed at uncovering defects and mistakes in the system. With each release, QA may target a specific subsystem or feature set and focus on finding defects. The outcome is fewer defects, fewer surprises, and a better level of quality for the customer.

While Release Engineering aims to maintain a base level of quality, it is not in the business of this kind of detective work. Release Engineering must not guard against intricate edge cases and obscure flows of the system: doing so requires substantial investment in test suites that embed neatly into the RelEng processes, and it increases the time it takes to move builds through the process.

QA engineers define and maintain the level of quality customers expect from the system. The choice of which surface area of the product to tackle is made in the context of what’s being shipped in the next version and how important a given set of features is.

In contrast, Release Engineering does not concern itself with what is being shipped. The process remains largely the same, in an effort to scale the number of things being shipped.

Fundamentally, mixing Quality Assurance and Release Engineering responsibilities confuses the role and mindset of the engineers. Having a clear direction to automate workflows that carry the product through to customers takes a vastly different perspective than the defect-finding mission QA engineers are typically on.

Release Engineers provide a service that ships the product. QA engineers protect the product from potential defects.

With that said, productive collaboration between QA Engineers and Release Engineers allows newly discovered, relevant, and common test cases to be promoted into the suites needed to gain a base level of confidence.

Cloud Foundry maintains a system test suite called CF Acceptance Tests (adoringly named CATs) that each team contributes to. Like any healthy system test suite, it typically avoids edge cases, but the most crucial functionality of the system is covered.

Exploratory testing and QA teams are a huge asset that any product would be better off investing in, but Release Engineering must not be burdened with this responsibility.

## Release Engineering vs. Release Management
While Release Engineering concerns itself with designing and automating the process of releasing software, Release Management deals with what makes up the release itself. Release Management assesses the importance of incoming features and bug fixes from a business-outcomes perspective.

While Project Managers scope out the features to be designed and developed in a product based on customer need and value to the business, Release Managers scope out which features ship in which releases based on the quality and readiness of their capabilities. They keep stakeholders informed of changing priorities from release to release. Release Management adapts Project Management skills to the specific release cadence and the machinery built to ship product releases reliably.

Release Managers also assess the risk and complexity of a particular release. They provide updates to customers, development teams, and stakeholders, act as the human point of contact, and assign tasks to team members in the organization.

A Release Manager is a key stakeholder when designing Release Engineering systems. A system designed to serve the needs of a Release Manager increases fluidity in the release process. Generally, a person with a systems-thinking mindset is aptly suited for the role.

At Tanzu, the Release Engineering team assigns a Release Manager who determines the set of changes shipped with every release cycle. They are supported by a couple of engineers who see the integration, testing, and delivery of the product through.

Some companies offload the release duties solely to the Release Manager, ignoring Release Engineering entirely and keeping the process wholly manual. This leads to burnout and a pile-up of manual processes and tasks over time.

Think of the release capability as its own product. The Release Manager acts as the Product Manager while the developers help automate the process. A product's release capability often starts as a manual process or a few scripts hacked together, but partnership with and investment in an independent team enables the business to mature this capability over time, yielding better cadence, quality, and control in releasing the product to customers.

## Release Engineering vs. Continuous Delivery

> Continuous Delivery: Reducing the cost, time, and risk of delivering incremental changes to users.

Continuous Delivery develops a team's capability to keep the code in an always-deployable state. It mandates that engineers merge their code daily, never break the mainline, and deliver features incrementally. This is typically done by developing pipelines that build, test, and deploy the code to production.

It draws inspiration from Lean Manufacturing, which urges investment in machinery to cut costs over time. Soon enough, the team establishes discipline and improves the delivery process iteratively until deploying the product becomes boring.

This is not possible (at least, not as fast) when there are tens of teams contributing to a product. That can be a matter of scale or of the nature of the software, but often both are true. The breed of software RelEng practices apply to is inherently different from applications able to establish Continuous Delivery: operating systems and downloadable software often fall in the former camp, while web applications and indie-developed apps fall in the latter.

Continuous Delivery is micro-focused whereas Release Engineering is macro. Continuous Delivery promotes practices engineers follow so their team's work can ship readily. This falls short when building a complex product takes hours or days and requires multiple components to form a monolithic binary.

In bigger companies, Continuous Delivery may come into play in the form of a mono-repository pattern, where each team works on a part of the software that gets deployed periodically or at will. Because Release Engineering typically results in a downloadable binary that requires compilation and other build operations, a similar Continuous Delivery approach is generally not feasible.

The scale of Continuous Delivery is arguably different, too. Typically, smaller teams with a streamlined process implement it easily; you can build a whole CD pipeline using just GitHub Actions, for example. What's harder is the discipline developers need to build up to reap the full benefits of Continuous Delivery.

While these two methodologies serve different scales and needs, both emphasize reducing the cost, time, and risk of delivering changes to customers. The idea is to let machines take on the repeatable tasks and script the toil away.

When there are tens of teams contributing to a large product separately, Release Engineering establishes the practices and paradigms so teams can focus on writing code within their area of expertise, get meaningful feedback reliably, and let releasing software be a separate worry.

# Process

Release Engineering develops an organization's capability to deliver new changes in the product to customers in a confident, repeatable manner.

The Release Engineering team develops this process so that subsequent releases can go out confidently.

## What is a Release Train?
A Release Train is a release engineering pattern that defines how a single release process is instantiated and followed through.

A Release Train takes the changes in the product through different stages including building, testing, compiling, and finally publishing.

Each team and organization will adapt this concept to their needs and the nature of the software in question. But here are a few characteristics generally common among Release Trains (a sketch follows this list):

- A Release Train is atomic. Once a train has started, you do not want to bring in any further changes. 

- Only one Release Train runs at a time. More than one concurrent Release Train results in confusion.

- A Release Train doesn't stop for more changes before it departs. Whatever's ready gets on the train. Whatever isn't, gets on the next train.

- If a change already on the Release Train stops it (due to a failure, missing functionality, or a discovered bug), the change is rejected so the rest of the changes can be delivered to customers.

- A Release Train may contain multiple versions of a single product that need to be patched. This is true for products that maintain and support multiple version lines, for example. A single Release Train does not necessarily mean changes to only a single version or even a single product variant. (Though at some point, when scale demands it, you may want to consider dividing up products and teams to handle Release Engineering for different products as necessary.)
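As a minimal sketch of these characteristics (the class and method names are illustrative, not from any real release system):

```python
from dataclasses import dataclass, field

@dataclass
class Change:
    id: str
    description: str

@dataclass
class ReleaseTrain:
    label: str
    changes: list[Change] = field(default_factory=list)
    departed: bool = False

    def board(self, change: Change) -> None:
        # Atomicity: once the train departs, no further changes get on.
        if self.departed:
            raise RuntimeError(f"{self.label} departed; {change.id} rides the next train")
        self.changes.append(change)

    def depart(self) -> None:
        self.departed = True

    def reject(self, change: Change) -> None:
        # A change that stops the train is ejected so the rest can ship.
        self.changes.remove(change)
```

The point of the sketch is the invariants, not the code: boarding closes at departure, and a failing change is removed rather than fixed in place.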

## Stages of the Release Train

Each Release Train typically goes through the following few stages:

### 1. Pre-work
During this stage, the release team takes inventory of the changes submitted by the component teams. A submission template should typically contain a place to insert relevant release notes and any documentation, along with details on whether the change breaks customer workflows upon installation. Submissions also include pull requests containing code changes and any version bumps for the components and libraries in the product.

Ideally, these pull requests arrive at the release train already tested against a recent build of the product, but depending on how far along the release engineering journey is, this may not always be possible. We will revisit this idea in a separate post.

The release team, being the gatekeeper of the product, should establish guidelines on what kinds of changes make it into the product. This is the time to reject any changes that may break the product in areas where drastic changes shouldn't be made.

### 2. Integration
This is the point where all approved requests and pull requests are merged into the mainline branch of the product. In our case, we use Git to maintain production branches in our repository. Once all changes are brought in, an internal build version is assigned, and we are ready to build and test.

Build numbers safeguard the teams from confusion if a change needs to be taken out. These numbers are for internal use and separate from the final version increments that customers track. Final product versions are bumped only when the build is completely ready to be shipped to customers.
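A hedged sketch of that separation; the numbering and labeling scheme here is illustrative only:

```python
import itertools

# Internal build numbers: cheap, monotonically increasing, never shown to customers.
_builds = itertools.count(1)

def next_internal_build(train_label: str) -> str:
    """e.g. 'RT-2024-003-build.7', safe to burn on reruns and reverts."""
    return f"{train_label}-build.{next(_builds)}"

def bump_final_version(current: str, kind: str = "patch") -> str:
    """Bump the customer-facing semver only once the build is ready to ship."""
    major, minor, patch = map(int, current.split("."))
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```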

### 3. Build, Compile and Test
This is the stage where the core of the machinery kicks in to create the next build of the product and compile any necessary components. This build serves as the candidate for the entire release train unless a defect is found, which triggers the removal of the change that introduced the defect and a re-run to create a new build.

Once a build is ready, the testing begins. In the case of Tanzu Application Service, we test our builds across an array of IaaSes and scenarios. This includes all major public cloud services, along with fresh installation and upgrade scenarios. We also run tests in a completely internet-less environment for customers who use the product offline.

It's important to note that any defect found while testing should be debugged only enough to pinpoint which change introduced it. The idea is to keep the train moving; debugging to fix while the release train is going means all other changes have to wait. We'll go into this further in a later post.

Another note on the comprehensiveness of test suites: since Release Engineering is not a QA team, the running test suites should aim to cover 80-90% of the main scenarios. A Release Engineering system cannot provide 100% guaranteed coverage of all use cases in any complex product, and aiming for it is extremely expensive and sets the wrong expectations with stakeholders.

### 4. Publish and Distribute
If all systems are green after running these tests, the build is ready to go. Publishing and distributing the build is generally straightforward, but here are a few things that typically happen during this stage:

- The build is uploaded and made accessible to the public or to customers in a marketplace or app store.
- The version is bumped. It's important to have a clear versioning policy suited to your needs.
- The documentation for the new version is published, along with release notes.
- Any license files are generated and attached to the product entry or the build. In some cases, it may be necessary to embed this in the build itself.

As you can imagine, this is not a trivial amount of work. It requires careful deliberation to make sure the product contains the intended changes and the team has enough confidence to launch it out to the world.

The Release Train pattern provides a simple, repeatable workflow that, when followed, removes confusion and lets everyone in the organization track progress through these stages. At Tanzu, we've invested in dashboards that let anyone internally know when a change will come out. This helps keep customers informed as necessary.


# Principles

## Debug to Revert, Not to Fix
At any large company, a Release Engineering team receives a large volume of changes to incorporate into the product, test, and ship to customers. A release train therefore typically contains various changes for release.

As often happens, one of those changes fails and stops the release train. A typical engineer's response is to debug, figure out which change brought about the error, and fix the underlying issue. Doing so is utterly wrong.

It's akin to stopping a train full of passengers because one of them got sick, then waiting for the paramedics to arrive and heal them fully before continuing the journey. The common-sense response is to offload the passenger at the next platform and have the paramedics take it from there while the train leaves for its intended destination. The passenger can hop back on the next train.

Stopping the release process for one change is a waste of time. The right approach is to debug the combination of changes just enough to figure out the culprit and then revert those changes. That is the least possible amount of time before the train can continue its operation. Any more than that delays every other change making its way to the customers.

As common-sense as that sounds, I have seen release engineers time and time again attempt to figure out a solution by making further changes to the product in an attempt to fix and ship the change, leading to frustration and cognitive burden all around. A release engineer's sole job should be to get all passing changes to customers as soon as possible, not to figure out and fix the changes themselves.

At times, it may feel like debugging which change broke the build cultivates enough knowledge that we may as well fix the problem too, but this has several disadvantages:

- It shifts the burden of producing a fixed product change from component teams to release teams. The best a release engineer can do is provide reproduction steps and debugging tips to the component teams; anything more robs the component teams of a learning opportunity.
- It lengthens the time it takes for the release train to go further; reverting is the fastest, easiest thing to do in most if not all circumstances.
- And at times, you'll notice that once you fix one change, the train fails at another point because of a different change. Continuing to debug-to-fix is a recipe for disaster.

The only caveat to this guideline is when a CVE fix needs to be shipped or when a release train includes only a single change. In those cases, it is better to mob with the team in question to resolve the problem and move the process along.

So, debug just enough and revert, communicating the error states to the component teams. It may be tempting to go down the rabbit hole and help the team ship their feature to customers, but resist it. Doing so will let you better operationalize your release engineering service for your company and your customers.
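As a minimal sketch of the eject step, assuming the culprit commit has already been identified (plain `git` via subprocess; notifying the component team is left out):

```python
import subprocess

def eject_change(culprit_sha: str, train_branch: str, reason: str) -> None:
    """Revert the offending change so the rest of the train keeps moving.

    Fixing the underlying issue stays with the component team; the release
    team only hands back the error state and reproduction notes.
    """
    subprocess.run(["git", "checkout", train_branch], check=True)
    subprocess.run(["git", "revert", "--no-edit", culprit_sha], check=True)
    print(f"Reverted {culprit_sha} on {train_branch}: {reason}")
```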

> "What's not measured is not improved" – Peter Drucker

Any stoppage in a release train is a hurdle in delivering the product to the customer. It's a learning opportunity for the team to understand the problem and improve the system to be more resilient in the future.

By cataloging the incident data in a consistent, uniform, and readable manner, you can gradually move toward a release train system where errors are the exception rather than the norm.

Start by capturing incident data by hand if needed, though it's more convenient to automatically capture and catalog any hiccup happening in the system.

Once you have some data, a major way to begin improving your release engineering system is to work your way through reducing the most frequent incidents.

## Define Incident
It's important to define what an incident is in your context. We were conscious not to call it an error, bug, issue, or failure; these all have other meanings or contextual connotations in software engineering that would muddy the conversation.

We defined an incident as any unique stoppage occurring in a release train. Cataloging the incidents by uniqueness helped us take a combined approach to solving the same problems; otherwise, a typical release train with a large number of changes would result in an overwhelming count of stoppages.

## Categorize Incidents
An incident can be of many types. To gather meaningful data, categorize the incidents in a way that's meaningful to you and your team's context. Since an incident is a stoppage that needs to be resolved for the release train to continue, we found it best to categorize each incident post-resolution in the following way:

1. Auto-restarting the job (i.e., flakes).
2. Restarting the job after optionally applying a known remedy (e.g., SSH in and run a command to clean up leftover state). A known remedy resolves the incident but doesn't necessarily fix the underlying issue in the product or the pipeline.
3. Restarting the train's pipeline (i.e., restarting from create-environment) and unclaiming the environment (i.e., environmental flakes).
4. Releasing with a known issue in the product. This requires documenting the issue if it impacts customers, with a workaround if needed.
5. Not releasing the impacted product variant in the release train. A typical release train may release multiple versions and different products together, so in this case you'd eject the affected version from the train.
6. Beginning the train again, for instances when the issue is drastic enough to warrant a halt and restart.

The list is sorted from cheapest to most expensive in both time and effort.
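One hedged way to keep this data uniform is to encode the categories directly; the names below are our own shorthand, not a standard taxonomy:

```python
from collections import Counter
from enum import IntEnum

class Resolution(IntEnum):
    # Ordered cheapest (1) to most expensive (6), matching the list above.
    AUTO_RESTART = 1            # flake: restart the job
    KNOWN_REMEDY = 2            # apply a documented remedy, then restart
    RESTART_PIPELINE = 3        # environmental flake: recreate the environment
    SHIP_WITH_KNOWN_ISSUE = 4   # document the issue and a workaround
    EJECT_PRODUCT_VARIANT = 5   # drop the affected version from the train
    RESTART_TRAIN = 6           # drastic enough to halt and start over

def most_frequent(incidents: list[Resolution]) -> list[tuple[Resolution, int]]:
    """Surface the most common stoppages: the best candidates to fix first."""
    return Counter(incidents).most_common()
```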

Having a way to record incidents is a key tactic on your journey to a robust, resilient release engineering system. Use this data in your postmortems by employing the improvement workflow proposed [here](https://release.engineer/2/the-release-engineer/29/improve-with-automation-by-repetition).

## Improve with Automation by Repetition
Any Release Engineering system is a complicated workflow system made up of lots of scripts, typically in a CI/CD tool. It can be confusing to determine what to improve at any point.

When beginning to automate the system, the easiest, lowest-hanging fruit can be a good option at times. At other times, whatever is most painful, cumbersome, or logically tough might be best.

Most often, Release Engineering systems move toward automation gradually. It is illogical to halt the company from releasing its product to customers just because a robust Release Engineering system with all the bells and whistles isn't ready. And in the odd chance that you do have the budget and capacity, a product's shape and characteristics need to be clear for the Release Engineering system to serve it properly. Focus on simple solutions first instead of clever ones, as the tendency toward over-engineering in this context is high.

Here's a high-level improvement workflow you may adapt:

1. When working on a release train, keep a journal or logbook and note down anything that feels painful, overly manual, or cognitively burdensome. These are the areas you want to tackle first.
2. When running a postmortem with the team, make a case for why a particular area needs attention. Focus on the why, pitching the problem first.
3. Before devising an automated script as a solution, first figure out how the current workflow can be improved. This may require testing the manual improvement in a subsequent train. This is the heart of the workflow: the ability to test-drive an improvement manually before coding it into the system.
4. Once the improvement has seen one or more repetitions, invest in automating it.

This tactic allows a process to evolve that holds up better over time, since coded workflows are more expensive to change and maintain.

Oftentimes when building software, it's hard to test out a feature without actually building it; one has to get [creative to test business outcomes](https://www.strategyzer.com/books/testing-business-ideas-david-j-bland). But because the input and output here are under the team's control, a workflow improvement can usually be tested by hand before programming it.

When Tanzu Application Service began its life, it was packaged, tested, and published to customers by hand, which meant a lot of [toil](https://sre.google/sre-book/eliminating-toil/) in the process. The whole publishing workflow was manual until only a couple of years ago. But gradually, we've automated uploading, adding the product to the distribution catalog, entering data into forms, and flipping the switch to public access.

## Don't Change a Running Train
Engineers are creative folk. It's tempting to see a problem in front of you and implement a solution right away. But when it comes to changing an aspect of the release trains, _don't change the process while a [Release Train](https://release.engineer/books/2/pages/25) is running_.

Release Engineers should be substitutable; an engineer working on the train may take a vacation or sick leave, so it's important to always keep the next person at the helm in mind. Improvising on the process while trains are in progress makes it hard for the process to seep through the personnel of the team.

More importantly, release engineering a complex product means you have to control the number of modifiable variables. Release Trains introduce changes to the product, and if a testing script is also modified while the product is flowing through the pipelines, it becomes hard to identify where a problem lies, adding cognitive burden.

So upon departure, each release train must conceptually freeze the platform it runs on, and no changes should be allowed while it's running.

Instead, note the problems in the release train system. Record which parts are inefficient in the journal or logbook, so you can pitch the problem in your Release Train postmortem. Another approach is to hack on a branch but push nothing to the production platform until it's reviewed and tested.

This also includes the meta layer of the process, so consider sticking to the tools the team already uses. If the team doesn't typically use spreadsheets, don't introduce one mid-train. And if spreadsheets are the answer to managing the workflow better, propose and demo the use case and experiment in an upcoming train. Spontaneity in a process that releases a critical product to customers is a red flag, and it makes it harder for your fellow teammates to jump in and help out.

## Auto-Retry Flaky Tests
We found that the majority of our incidents occurred due to [test flakes](https://engineering.atspotify.com/2019/11/test-flakiness-methods-for-identifying-and-dealing-with-flaky-tests/). Test flakiness happens in most large software systems and complex integration test suites.

For the longest time, we believed it was important to feel the pain and resolve each flake by fixing the underlying issue. But flaky test failures are hard to replicate, and the investment it typically takes to improve the test suite is high enough that we never got to it.

On top of that, as a Release Engineering team, understanding these test flakes requires a deep understanding of the interacting components. This made it even tougher, especially since our attention is typically on making the process of release engineering more robust.

After much back and forth and deliberation, we implemented a retry strategy that removed the need to manually pay attention to these incidents. We also made sure to record the flakes in a [datastore](https://www.honeycomb.io) so we could pay attention when needed.

This one change produced substantial productivity gains: more often than not, we could now get through a release train over the weekend that would otherwise be waiting on someone to press restart on a pipeline.
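A minimal sketch of such a retry strategy; `run_suite` and `record_flake` are stand-ins for your test runner and datastore, not real APIs:

```python
import time

def run_with_retries(run_suite, record_flake, attempts: int = 3):
    """Auto-restart a flaky test job, recording every flake for later analysis."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return run_suite()
        except Exception as err:          # a failed suite run
            last_error = err
            record_flake(attempt=attempt, error=str(err))  # e.g. ship to Honeycomb
            time.sleep(30 * attempt)      # brief backoff before retrying
    # A persistent failure is a real incident, not a flake.
    raise last_error
```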

## Strive for Simplicity

> Simple can be harder than complex: You have to work hard to get your thinking clean to make it simple. But it's worth it in the end because once you get there, you can move mountains. – Steve Jobs

Build and packaging systems often tend to be a hodge-podge of scripts mangled together to spit out a version of the software that ends up being distributed to customers. This increases the friction of introducing changes while also making it hard to understand what is going on.

So it's important to treat releasing software as its own discipline and make a conscious effort to strive for simplicity in its design. A clever approach may feel like an accomplishment, but keep asking why until you understand the outcomes, and design simple, pointed solutions (see [first principles](https://fs.blog/first-principles/)).

The same arguments that apply in software engineering apply here (see [YAGNI](https://martinfowler.com/bliki/Yagni.html)). Release Engineering is an application of software engineering principles to the domain of serving software out to customers. A poorly designed Release Engineering system contributes to delivering poor-quality software, too.

Release Engineering is a craft and must be treated as one. While it's orthogonal to the software it releases, it is influenced by it. Systems thinking allows one to take all these variables and forcing functions into account while solving problems in this domain. This is hard work: designing for simplicity often requires complex, hard thinking.

We had been using Concourse for many years before we switched our release system for Tanzu Application Service to GitHub Actions. The move required us to take the hard path, and the whole project took us over a year.

Before we simplified, here was our state of the world:

- We had multiple repositories: one set held the product metadata while the other housed CI scripts.
- Having multiple repositories led to duplicated information among them, making it hard for people to understand the purpose of each. At times, a metadata change had to be made to all of these repositories.
- Our overall system already relied on GitHub for issues and pull requests, so using another system to run pipelines meant we had to keep an eye on Concourse as well as GitHub. On top of that, we also kept track of stories in Pivotal Tracker.
- We kept building tools on top of these various abstractions to solve problems.
- Debugging was a pain. It required understanding all of the various repositories, their interactions, and their implementations to figure out where a problem could be.

We needed to rethink the system at a fundamental level, even if it meant significant work. We ended up in a place where we now have a single repository that houses the CI changes and all product metadata for the different product variants. This enables our contributors to submit changes to the product while updating our CI scripts and configuration as needed. While there are still quirks we continue to work on, as with any major change, the difference from where we were is night and day.

Simplicity requires hard thinking. It's easier to keep digging the hole you're in than to ask whether it's the right hole to dig in the first place. To make a robust system, find paths to make your Release Engineering system simple to understand, contribute to, and debug.

## Designate a Conductor
Each release train should have a designated conductor: someone who keeps track of progress through the system.

The conductor's responsibility is to keep track and take notes as progress is made: noting down any discrepancies, creating incidents, recording problems in the workflow, and leading the retrospective. On top of that, the role also involves letting stakeholders know of important details and progress.

A major part of this is keeping a journal of the progress in a centralized place. This helps others stay in the know and take charge if needed. Releasing crucial software without enough context in the team because one person is absent from work is a recipe for disaster.

So, keep a log of each change coming into the product, errors happening in the system, any problems in the workflow, common failures, annoyances, and a high-level timeline. The team will come to rely on this artifact to learn, and the Release Engineering system will improve over time because of it. Make the artifact accessible to everyone in the organization; it will reduce how often people ask you, "When is it going to get released?"

## Keep a Single Source of Truth
A typical release engineering system uses different data as inputs to invoke the necessary workflows. End-of-support dates, for example, can determine whether a pipeline for a particular product version is triggered and whether its output is published publicly to all customers.

Such information typically includes:

- Release dates for Beta, RC, or GA releases along with end-of-support dates.
- Configuration of the product for test setup.
- Environment infrastructure setup (in our case, Terraform files)
- List of changes in each release train.
- One set of steps to run the release train.
- Upgrade path information.

As the system grows in complexity, it's tempting to duplicate the information in question directly in scripts instead of implementing a way to refer to a central source. Taking the time to centralize it instead pays dividends.

Release Engineering systems become convoluted when each script consumes its own copy of the same underlying data. Any change to that piece of information then needs to be propagated everywhere, and missing a spot leads to disaster. Each type of information in your system should come from one central place only.
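A hedged sketch of the idea, assuming a shared metadata file; the file name and fields are illustrative:

```python
import json
from datetime import date

def load_metadata(path: str = "product-metadata.json") -> dict:
    """Every script reads the same file; none keeps a local copy of these values."""
    with open(path) as f:
        return json.load(f)

def should_publish(version: str, meta: dict) -> bool:
    """Let the central end-of-support date decide whether this version still ships."""
    eos = date.fromisoformat(meta["versions"][version]["end_of_support"])
    return date.today() <= eos
```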

In the old version of our system, we duplicated the dependencies needed by the system in multiple places: they were present in the product metadata but also embedded directly in the CI build and test templates. This led to misleading errors, because one simply had to know where to fix things.

Resolving this meant doing the heavy lifting of extracting the dependency information from the product itself instead of passing it in as a parameter. Definitely not convenient, but it helped us eliminate this source of error forever.

On top of that, a centrally located source of truth for all key data means customers, contributors, and other stakeholders can also reliably refer to it.

# Practices

## Maintain Playbooks
A playbook captures all of the details needed to ship a product out to customers. Instead of relying on word of mouth for context sharing, playbooks allow team members to ship products by following documentation step by step.

When you begin implementing a process to regularly release your product, you'll find that certain tasks must be performed in order. It's hard to figure out what needs doing unless it's written down, and it's risky to rely on memory and built-up knowledge in a team, as that creates single points of failure. Playbooks put the team on the same page about how to ship your product.

A playbook standardizes the process and helps execute the Release Engineering workflow. When the system isn't fully automated, playbooks keep the operating instructions for the current state of the system in a document that is easily accessible to and understood by the team. Playbooks also provide the place where any change to the system is propagated to the team.

Playbooks increase efficiency by removing the need to remember how to perform a lengthy yet mundane task repeatedly. 

Here are a few ways to better manage and maintain playbooks:

- Divide them up into logical parts of the process. Our Tanzu Application Service playbooks are divided into several key workflows that are followed in order. 
- Only include operating instructions. We typically keep any context of the change separately in Decision Records.
- Ideally, they should be editable by the whole team and improved with each ongoing release train.
- Include common failure modes so no one has to ask other team members when things stray from the happy path.

While writing and maintaining playbooks may feel painstaking in the beginning, the dividends add up quickly as they bring robustness and reliability to shipping your products. You'll also discover that increased investment in automation shrinks the manual portions of the playbooks over time.

## Catalog Changes
Making changes to a product in a single repository is relatively simple. You have a series of commits that precede a deployment; while not trivial, it is easy to figure out what changed and, upon error, easy to revert and fix the problem. If the team follows a [relatively standard format](https://www.conventionalcommits.org/en/v1.0.0/) for commits, it's obvious which parts of the code are the likely culprit.

For a product contributed to by multiple teams from different repositories, shipped to multiple customers, and deployed in various ways, it is risky not to have a standardized system to catalog all the changes slated to go into any particular release. Even the simplest cases may require scrutinizing code from multiple dependencies and repositories to find the source of a problem.

Therefore, it is imperative to have a way for each contributing team to submit their changes along with documentation and a release note. A single logical change to the product should be accompanied by its code changes, a release note, the nature of the change, the products affected, and any other notes. For Tanzu Application Service, we also make a point of marking breaking changes separately.
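A hedged sketch of what such a change record might carry; the field names are illustrative, not our actual issue template:

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRecord:
    """Everything a Release Train needs to catalog one logical product change."""
    title: str
    release_note: str          # customer-facing wording, written up front
    kind: str                  # e.g. "feature", "bug-fix", "security"
    breaking: bool             # breaking changes are flagged explicitly
    products_affected: list[str]
    pull_requests: list[str] = field(default_factory=list)  # one per product version
    notes: str = ""
```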

Such a record allows a Release Train to pick up a set of changes and nothing else. Even the release team doesn't make changes to the product without the metadata required to catalog a product change.

This introduces rigor and achieves a level of traceability that's necessary for an enterprise product in this day and age. Integrated with a version control system like Git, it provides personnel info, date and time, and the particular code commits included. On top of that, it also allows us to generate release notes for every new product version.

Another benefit is keeping a record of what was included in each release train for historical tracking and auditing purposes.

## Hold Release Train Postmortems
Agile teams typically conduct a retrospective at the end of a sprint or project to determine what went well and what didn't, then learn and work on improvements where needed. Similarly, each release train should be followed by a postmortem.

A release train retrospective meeting could include: 

- Metadata on the release train: Number of days taken, number of [incidents by incident types](https://release.engineer/books/2/pages/28), list of versions shipped.
- Timeline of the release train with major events: Integration, build, testing, and publishing.
- Any major incidents that occurred. It's important to take the time to explore what problems stopped the train from progressing smoothly.
- Any workflow gotchas discovered. Make a point to include what felt heavily manual.

This is a great opportunity to employ and learn facilitation skills, and the conductor is a good person to lead this meeting. The team should ask questions and bring an empathetic perspective to understanding the problems encountered during the release train.

Once the problems are well understood, attempt to come up with solutions. This part can also be left for after the meeting, as each major problem identified can have multiple solutions. Start with [implementing a simple solution](https://release.engineer/books/2/pages/29) that acts as the prototype before automating it fully.

If particular incidents require other teams to be included, hold incident postmortems with them. Depending on severity, these are opportunities for teams to learn from each other and clarify any contractual boundaries that should exist between them.

## Keep Decision Records
A complex release engineering system is built from decisions that have consequential effects in and out of the system. Most of these decisions feel logical, while others feel counterintuitive.

It's foolish to rely on human memory to make sure these decisions are transmitted across the team, especially since teams change over time. Learning from a senior engineer is appropriate for skill development, but it isn't a good substitute for gaining context on the decisions made in the architecture and design of the Release Engineering system.

And over time, as teams make updates to the process and workflow, it's important to preserve the knowledge for future or absent engineers. It also provides a snapshot of where the philosophy of the architecture stands at any given point.

Decision records help with this problem. A bit of bookkeeping upfront yields tangible benefits: the team records every important decision it makes, so it can remember why a decision was made and keep that context in mind when reversing it.

They also establish a lightweight way to discuss a matter in writing, similar to how RFCs have been adopted and used in many open source projects. While RFCs are proposals, decision records can be proposed, or simply recorded as the result of a decision made in a meeting.

There are various styles and templates to use. You'll want to adapt them to your team's needs, but anything from formats as extensive as [RFCs](https://www.ietf.org/standards/rfcs/) and [Python Enhancement Proposals](https://peps.python.org/pep-0001/) to a lightweight approach such as [Architectural Decision Records (ADRs)](https://adr.github.io) could work. Begin with a lightweight approach and extend it with more metadata only as needed.

## Borrow from Supply Chain Management
Supply Chain Management encompasses the flow of physical goods, activities, processes, and systems with the goal of delivering finished products to consumers. It is generally employed in large-scale organizations to provide assembled products to customers efficiently and reliably.

As with most software paradigms, supply chain management can be adapted to software, encompassing the supply chain of dependencies from the lowest abstraction to the highest, like the OS or Docker images.

In a way, a Release Engineering system can be seen as an implementation of a Supply Chain system for a software product. 

Any enterprise product needs to be secure. The bulk of security vulnerabilities are exposed after the product has shipped, so it's important to design the Release Engineering system with the traceability and verifiability of its components in mind.

When a bug or a security vulnerability is discovered, responsible teams should be able to determine which product versions are affected. 

We found that tying Git SHAs and dependency versions to our product versions in the form of a Release Train label is immensely useful in understanding when a particular dependency bump was introduced. For Tanzu Application Service, we label our trains RT-yyyy-xxx. All GitHub issues, Release Train branches, and the final released versions are tied to this label.

And since we keep a record of the changes introduced in every release train, each Git SHA points to a logical change by a team, allowing us to track down the change and plan a remedy. These links are bi-directional through Git: we can find the release train from a Git SHA and vice versa.
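As a hedged sketch, the bi-directional link can be as simple as a naming convention plus Git tags. Everything beyond the RT-yyyy-xxx label format (the tagging scheme, the helper names) is an assumption for illustration:

```python
import re
import subprocess

LABEL = re.compile(r"^RT-\d{4}-\d{3}$")  # e.g. RT-2024-017

def trains_containing(sha: str) -> list[str]:
    """Git SHA -> release trains: tags whose history contains the commit."""
    out = subprocess.run(["git", "tag", "--contains", sha],
                         capture_output=True, text=True, check=True).stdout
    return [tag for tag in out.split() if LABEL.match(tag)]

def shas_in_train(label: str, previous_label: str) -> list[str]:
    """Release train -> Git SHAs introduced since the previous train's tag."""
    assert LABEL.match(label) and LABEL.match(previous_label)
    out = subprocess.run(["git", "rev-list", f"{previous_label}..{label}"],
                         capture_output=True, text=True, check=True).stdout
    return out.split()
```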

## Test Incoming Changes
A Release Train is a set of multiple changes slated for release in the next version.

Testing and working with multiple changes in a single release train makes the problems that occur hard to debug and fix, because system tests are bad at tracing a failure to the exact line of code. Unit tests, on the other hand, are typically adept at pinpointing the location of a failure, but they are less suitable for Release Engineering systems, where system integrity and functionality are the top priority.

As a Release Engineer, when you don't have confidence in each independent change-set being integrated into the release train, the likelihood of failure increases, and it becomes extremely difficult to determine which change-set is the culprit.

You cannot completely eliminate the risk profile of a set of changes, but to dramatically lower it, test each change-set independently against a subset, or all, of the system tests.

There is still a chance that the combination of change-sets in the train fails the pipelines, but the decrease in probability is worth the cost of testing the changes independently beforehand.

For TAS, we've built a GitHub Actions-based pull request testing system that uses the same testing infrastructure as our release engineering system to create a receipt of record showing that the change-set being integrated is tested and ready to go. This gives the Release Engineer confidence that the changes work independently and are functionally sound.
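A hedged sketch of the general shape of such a receipt of record; the fields and the content-addressing step are assumptions, not our actual system:

```python
import hashlib
import json
from datetime import datetime, timezone

def write_receipt(pr_number: int, head_sha: str, suite: str, passed: bool) -> dict:
    """Record that this exact change-set was tested before integration."""
    receipt = {
        "pr": pr_number,
        "sha": head_sha,        # the tested commit, so the receipt can't drift
        "suite": suite,
        "passed": passed,
        "tested_at": datetime.now(timezone.utc).isoformat(),
    }
    body = json.dumps(receipt, sort_keys=True).encode()
    receipt["id"] = hashlib.sha256(body).hexdigest()[:12]
    return receipt
```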

## Automate Dependency Bumps
As a Release Engineer, one of the most time-consuming tasks is making sure that updated dependencies are shipped to customers. It's a crucial part of the release process, as it gets customers access to recent, up-to-date security patches. Important bug fixes and improvements also ship in dependencies that customers can leverage in the product.

Typically, the workflow for bumping dependencies is quite manual. It can vary from bumping each version by hand to repeatedly running commands that bump the dependencies through a package manager.

Regardless of the level of convenience, working at this abstraction feels, from a Release Engineering perspective, like [toil](https://sre.google/sre-book/eliminating-toil/).

So it's important to design tooling and pipelines to automate this process. As part of the build process, the system should be able to bring in updated dependencies automatically and incorporate them into the next release train.

For TAS, we have a fairly complex dependency system that spans the underlying operating system, libraries, Docker images, and the product code itself. Because this surface area is so large and can hugely impact the product if something goes wrong, we implemented a Dependabot plugin that routes each meaningful dependency bump through our pull request system, so these changes are also [tested independently](https://release.engineer/toc/release.engineer/test-incoming-changes/) and incorporated into the product only if they are functionally and operationally sound.

Before that, the TAS Release Engineering team used to bump dependencies manually in [a Concourse pipeline config](https://concourse-ci.org/pipelines.html). Not only was this error-prone, it always required the Release Engineers to stay on top of and involved in the bumping process. With the Dependabot-based system, component engineers can set their own version criteria for inclusion into the product, and the system handles the rest gracefully.
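A hedged sketch of what "version criteria" might look like; the config shape and semver rules are hypothetical, not Dependabot's actual configuration:

```python
# Hypothetical per-dependency criteria: which bumps may enter the product.
CRITERIA = {
    "openssl": {"allow": "patch"},   # security-sensitive: patches only
    "golang": {"allow": "minor"},    # runtime: minors are fine, majors need review
}

def bump_allowed(dep: str, current: str, proposed: str) -> bool:
    """Decide whether a proposed dependency bump qualifies for a pull request."""
    cur = [int(x) for x in current.split(".")]
    new = [int(x) for x in proposed.split(".")]
    allow = CRITERIA.get(dep, {}).get("allow", "none")
    if allow == "patch":
        return new[:2] == cur[:2] and new[2] >= cur[2]
    if allow == "minor":
        return new[0] == cur[0]
    return False
```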

# Case Studies

[Tanzu Application Service (TAS)](https://tanzu.vmware.com/application-service) is VMware's [Cloud Foundry](https://www.cloudfoundry.org) distribution that enables developers to deploy their applications without worrying about installing or maintaining dependencies. Developers code their applications and let the platform handle the rest.

TAS offers companies an escape from large IT infrastructure teams and software, enabling smaller platform teams to handle exponentially larger workloads. Developers, on the other hand, worry only about their application and what it needs, and use the platform to manage and connect its dependencies.

TAS is used by governments, banks, major wireless carriers, and other critical industries where uptime and redundancy are crucial.

TAS is available as a roughly 10 GB download and uses [Tanzu Operations Manager](https://docs.pivotal.io/ops-manager/3-0/index.html) (powered by [BOSH](https://bosh.io/docs/)) to deploy and set up the runtime, logging, and other components necessary to deliver a feature rich platform for enterprise use.

This case study goes over how each build comes into existence, goes through rigorous testing, and makes its way out to customers. Customers depend on the robust release engineering system as much as on the platform and its components to deliver a system they are confident rolling out to production.

A regular release train consists of all supported versions across all product variants: TAS, Small Footprint TAS, Isolation Segment (to further scale runtime capabilities), and the Windows Runtime. They all stem from the same set of components but are, in practice, different products to customers.

Each [Release Train](https://release.engineer/2/the-release-engineer/25/the-release-train) includes every supported version of these products. On average, 20 to 25 final builds are produced by a single Release Train run.

## 1. Development

A typical changeset (a feature or a bug fix) begins its journey with our product or support teams learning about it and discussing it with our customers. The related component team then gets involved and begins working on their changes. Note that each component team is self-sufficient in running the integration tests against TAS before submitting changes to the Release team; the feedback loop would otherwise be horrendously long.

## 2. Submission

We [catalog changes](https://release.engineer/2/the-release-engineer/36/catalog-changes) using GitHub issues. A GitHub issue includes all metadata relevant to the change, including whether it's customer-facing. It also includes a release note that comes into the picture later. Importantly, each GitHub issue has multiple pull requests, one for each product version.

## 3. Feedback

We provide [independent feedback](https://release.engineer/2/the-release-engineer/40/test-incoming-changes) on each pull request by running integration tests for that specific change. This produces a temporary but functionally equivalent build that allows teams to debug and test their changes without blocking the main branch. The feedback also covers design, breaking changes, and compatibility concerns between the release and development teams.

## 4. Integration

Once a change is tested, the release team merges it into the relevant product branch, alternating continuously between providing feedback and integrating. We integrate the changes into the upcoming release train.

## 5. Testing

Every other Friday, the integrated set of changes is fanned out into a testing matrix that covers TAS on IaaSes such as GCP, AWS, Azure, and vSphere. Along with that, we test upgrades from the last patch and last minor versions of the product. We also run backwards-compatibility tests to make sure no unintended breaking changes are introduced.

This phase usually includes the [removal of changes](https://release.engineer/2/the-release-engineer/27/debug-to-revert-not-to-fix) that fail the test suites, with relevant feedback to the contributing team. We record these as incidents and maintain a journal.

## 6. Publish

Once the test suite is green, the release conductor initiates the pipeline that publishes the artifacts to the Tanzu Network (our app store). Along with that, release notes are automatically compiled from GitHub issues and dependency diffs and published to the [Tanzu Docs](https://docs.pivotal.io/application-service/2-11/release-notes/runtime-rn.html) website.