The Release Engineer: Ship Software That Works Every Time
Rizwan Reza

  • Introduction
  • What is Release Engineering?

    Release Engineering, sometimes shortened to RelEng, focuses on how software gets packaged, shipped, and distributed to end users. The discipline is most prevalent where complex software ships as a standalone product. In contrast, disciplines and concepts like Site Reliability Engineering and Continuous Delivery deal with live hosted applications. If an artifact is distributed to users instead of a maintained service, Release Engineering is most definitely at play.

    For businesses, Release Engineering solves a scaling problem. As companies grow from startups into enterprises, the number of engineers who write code for a given piece of software increases rapidly. This creates a whole slew of coordination and compatibility problems between different parts of the system, which reduces productivity and innovation.

    • Team A’s component worked yesterday but now that Team
  • What problem does it solve?

    Before we begin discussing the mechanics and components of Release Engineering, let’s set the context of its relevance and need.

    Release Engineering concerns itself with software that’s digitally or physically shipped to the customers. Given the dominance of cloud technologies, we’ll solely focus on the former—a downloadable artifact.

    As discussed in the previous post, Release Engineering solves scaling problems as companies grow from startups to enterprises.

    But that only scratches the surface. Dig deeper and you find that enterprise companies are made up of multiple players, each with their own motivations and needs.

    Solves Growing Pains

    As software businesses grow and hire more engineers, their release process becomes fragile and broken. It becomes hard to coordinate what gets released in which version. This results in the release process being slow. Companies typically use cultural norms like freezing code, performing QA, and maintenance in an attempt to maintain control. It is common to bake in as mu

  • Industrializing Release Process

    Release Engineering is needed when you can no longer point to a single team that owns the product as a whole. This is a scaling problem, as noted before. Release Engineering focuses on the process so that the product isn’t hurt by the lack of ownership over building, compiling, testing, and shipping it to customers.

    As teams scale and own disparate chunks of a product, the Release Engineering team emerges as gatekeepers and becomes a natural integration point to make sure a quality product goes out to customers.

    It’s worth noting that Micro Service Architecture for web applications can skip Release Engineering by defining versioned interfaces and contracts between the components. This removes the focal point of integration as each team can directly deliver their solution to customers independently. This is a huge reason why organizations are attracted to Micro Service Architecture.

    This organizational and architectural design breaks down in our context of shipping a binary to customers

  • Site Reliability Engineering and Release Engineering

    Site Reliability Engineering focuses on running services reliably. It emerged from Google and has garnered attention rightfully as a fundamental discipline in operating and designing systems to run complex software machinery reliably. It provides tactics and embraces software engineering philosophy in a world traditionally looked upon as system administration. It’s transformed the world of DevOps by bringing software engineering into system operations, understanding risk and creating service metrics.

    Site Reliability Engineering differs from Release Engineering heavily in that the former applies to infrastructure and operations of the software while the latter’s goal is to ship software in a reliable and repeatable process. While some principles carry over across these sub-disciplines, they solve fundamentally different problems.

    While topics such as toil and automation suit perfectly, monitoring and establishing service level objectives with end users and customers don’t find their home as neatly here.

  • Release Engineering is not QA

    A QA role in a typical tech company today consists of finding and fixing bugs, typically by writing test suites aimed at finding defects and mistakes in the system. With each release, QA engineers may target a specific subsystem or feature set and focus on finding defects there. The outcome is fewer defects, fewer surprises, and a better level of quality for the customer.

    While Release Engineering aims to maintain a base level of quality, it is devoid of a detective finding defects in this way. Release Engineering must not guard against such intricate edge cases and obscure flows of the system. Doing so requires substantial investment in writing test suites that embed neatly into the RelEng processes. Moreover, it also increases the time it takes to engineer builds through the process.

    QA engineers define and maintain a level of quality customers expect from the system. The classification and choice of which surface area to tackle within the product is a choice made in context of what’s being shi

  • Release Management and Release Engineering

    While Release Engineering concerns itself with designing and automating the process of releasing software, Release Management deals with what makes up the release itself. Release Management assesses the importance of the features and bug fixes going in from a business outcomes perspective.

    While Project Managers scope out the features to be designed & developed in a product based on the customer need and value to the business, Release Managers scope out which features are shipped in releases based on the quality and readiness of its capabilities. They keep stakeholders informed of changing priorities from release to release. Release Management adapts Project Management skills along with knowledge about the specific release cadence and machinery built to ship product releases in a reliable manner.

    Release Managers also assess risk and complexity of a particular release. They provide updates to the customers, development teams, and stakeholders and act as the human contact and assign tasks to the team members in th

  • Continuous Delivery and Release Engineering

    Continuous Delivery: Reducing the cost, time, and risk of delivering incremental changes to users.

    Continuous Delivery develops a team's capability to have the code in an always deployable state. It mandates engineers to merge their code daily, never breaking the mainline, and delivering features incrementally. This is typically done by developing pipelines that build, test, and deploy the code in production.

    It draws inspiration from Lean Manufacturing, which urges investment in machinery to cut costs over time. Soon enough, the team establishes discipline and iteratively improves the delivery process to the point that deploying the product becomes boring.

    This is not possible (at least, not as fast) when there are tens of teams contributing to a product. This can be a matter of scale or the nature of the software but often both are true. The breed of software RelEng practices apply to is inherently different from applications that are able to establish the practice of Continuous Delivery. Operating systems and

  • Process
  • The Release Train

    Release Engineering develops the capability in an organization to deliver new changes in the product to customers in a confident, repeatable manner.

    The Release Engineering team develops this process in a manner where subsequent releases can go out confidently.

    What is a Release Train?

    A Release Train is a release engineering pattern that helps define how a single release process can be instantiated and followed through.

    A Release Train takes the changes in the product through different stages including building, testing, compiling, and finally publishing.
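    As a rough illustration of the stages described above, here is a minimal Python sketch. The stage names and their order come from the text; the `ReleaseTrain` class, its fields, and the handler signature are hypothetical and only show the shape of the idea, not how any particular team implements it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stage names and order follow the text; everything else here is hypothetical.
STAGES = ["build", "test", "compile", "publish"]


@dataclass
class ReleaseTrain:
    version: str
    changes: List[str]
    completed: List[str] = field(default_factory=list)

    def run(self, handlers: Dict[str, Callable[[List[str]], bool]]) -> bool:
        """Run each stage in order; stop at the first stage that reports failure."""
        for stage in STAGES:
            if not handlers[stage](self.changes):
                return False
            self.completed.append(stage)
        return True


# Usage: every stage handler succeeds, so the train reaches "publish".
train = ReleaseTrain(version="1.2.0", changes=["change-a", "change-b"])
ok = train.run({stage: (lambda changes: True) for stage in STAGES})
```

    Keeping the list of changes fixed on the train object for the whole run mirrors the idea that a train carries one agreed set of changes from start to finish.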

    Each team and organization will adapt this concept according to their needs and the nature of the software in question. But here are a few characteristics generally common among Release Trains:

    • A Release Train is atomic. Once a train has started, you do not want to bring in any further changes.

    • Only one Release Train runs at a time. More than one concurrent Release Train results in confusion.

    • A Release Train doesn't stop f

  • Principles
  • Debug to Revert; Not to Fix

    At any large company, a Release Engineering team receives a large number of changes to incorporate into the product, test, and ship to customers. A release train therefore typically carries many changes at once.

    As often happens, one of those changes fails and stops the release train. An engineer's typical response is to debug, figure out which change caused the error, and fix the underlying issue. Doing so is utterly wrong.

    It's akin to stopping a train full of passengers because one of them got sick, then waiting for the paramedics to arrive and fully treat them before continuing the journey. The common-sense response is to offload the passenger at the next platform and let the paramedics take it from there while the train leaves for its intended destination. The passenger can hop back on the next train.
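    The offloading idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a description of any real tool: `find_culprit`, `revert_and_continue`, and the `passes` callback are hypothetical names, and the sketch assumes a single change is responsible for the failure.

```python
from typing import Callable, List, Optional, Tuple

TrainCheck = Callable[[List[str]], bool]


def find_culprit(changes: List[str], passes: TrainCheck) -> Optional[str]:
    """Return the first change whose removal lets the train pass, else None."""
    for candidate in changes:
        remaining = [c for c in changes if c != candidate]
        if passes(remaining):
            return candidate
    return None


def revert_and_continue(changes: List[str], passes: TrainCheck) -> Tuple[List[str], Optional[str]]:
    """Return (changes that stay on this train, change deferred to the next train)."""
    if passes(changes):
        return list(changes), None
    culprit = find_culprit(changes, passes)
    if culprit is None:
        raise RuntimeError("no single change explains the failure")
    return [c for c in changes if c != culprit], culprit


# Usage: "change-b" breaks the train, so it is offloaded and the rest ship.
def train_passes(changes: List[str]) -> bool:
    return "change-b" not in changes


shipped, deferred = revert_and_continue(["change-a", "change-b", "change-c"], train_passes)
```

    The debugging effort here stops as soon as the culprit is identified; fixing it is left to the owning team.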

    Stopping the release process for one change is a waste of time. The right approach is to debug the combination of changes just enough to figure out the culprit and

  • Every Stoppage is an Incident

    "What's not measured is not improved" – Peter Drucker

    Any stoppage in a release train is a hurdle in delivering the product to the customer. It's a learning opportunity for the team to understand the problem and improve the system to be more resilient in the future.

    By cataloging the incident data in a consistent, uniform, and readable manner, you can gradually move towards a release train system where errors are the exception rather than the norm.

    Start by capturing incident data by hand if needed, but it's more convenient to automatically capture and catalog any hiccup in the system.

    Once you have some data, a good way to begin improving your release engineering system is to work on reducing the most frequent type of incident first.
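    A minimal Python sketch of such a catalog follows. The `Incident` fields and the category names are assumptions made for illustration; the point is only that uniform records make "most frequent incident" a one-line query.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Incident:
    train: str  # release train that was stopped
    stage: str  # stage where it stopped
    kind: str   # incident category (hypothetical names below)


def most_frequent(incidents: List[Incident], top: int = 3) -> List[Tuple[str, int]]:
    """Rank incident categories by how often they stopped a train."""
    return Counter(incident.kind for incident in incidents).most_common(top)


catalog = [
    Incident("1.2.0", "test", "test-flake"),
    Incident("1.2.0", "build", "dependency-fetch"),
    Incident("1.2.1", "test", "test-flake"),
]
ranking = most_frequent(catalog)
```

    Whatever tops the ranking is the natural first candidate for improvement work.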

    Define Incident

    It's important to define what an incident is for your context. We were conscious to not call it error, bug, issue, or failure. They all have other meanings or contextual connotations in software engineering that would muddy the conv

  • Improve with Automation by Repetition

    Any Release Engineering system is a complicated workflow system made up of lots of scripts, typically in a CI/CD tool. It can be confusing to determine what to improve at any point.

    When you begin automating the system, the lowest-hanging fruit can be a good place to start. At other times, whatever is most painful, cumbersome, or logically tough might be the better choice.

    Most often, Release Engineering systems move towards automation gradually. It is illogical to halt the company from releasing its product to customers because a robust Release Engineering system with all the bells and whistles isn't ready. And on the off chance that you do have the budget and capacity, the product's shape and characteristics need to be clear before the Release Engineering system can serve it properly. Focus on simple solutions before clever ones, as the tendency to over-engineer in this context is high.

    Here's a high-level improvement workflow you may adapt:

    1. When working on a release train, keep a journal and lo
  • No Process Changes During a Release Train

    Engineers are creative folk. It's tempting to see a problem in front of you and implement a solution right away. But when it comes to changing an aspect of the release process, don't change it while a Release Train is running.

    Release Engineers should be substitutable—an engineer working on the train may take vacation or a sick leave. So it's important to always keep the next person taking the helm in mind. Improvising on the processes as the trains are in progress makes it hard for the process to seep through the personnel in the team.

    More importantly, Release Engineering a complex product means that you have to control the number of modifiable variables. Release Trains introduce changes to the product, and if a testing script is also modified as the product flows through the pipelines, it becomes hard to identify where the problem lies, which adds cognitive burden.
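    One way to keep process scripts out of the set of moving variables is to fingerprint them when the train departs and compare before each stage. The Python sketch below is a hypothetical illustration; `fingerprint` and `assert_unchanged` are made-up names, and the scripts are represented as an in-memory mapping of name to contents rather than real files.

```python
import hashlib
from typing import Dict


def fingerprint(scripts: Dict[str, str]) -> str:
    """Hash script names and contents in a stable (sorted) order."""
    digest = hashlib.sha256()
    for name in sorted(scripts):
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(scripts[name].encode())
        digest.update(b"\0")
    return digest.hexdigest()


def assert_unchanged(departure_fingerprint: str, scripts: Dict[str, str]) -> None:
    """Raise if the process scripts differ from what the train departed with."""
    if fingerprint(scripts) != departure_fingerprint:
        raise RuntimeError("process changed while the release train is running")


# Usage: record the fingerprint at departure, then check it before each stage.
scripts = {"test.sh": "run-tests"}
frozen = fingerprint(scripts)
assert_unchanged(frozen, scripts)
```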

    So each release train upon departure must conceptually freeze the pl

  • Implement Auto-Retries

    We found that the majority of our incidents occurred due to test flakes. Test flakiness happens in most large software systems and complex integration test suites.

    For the longest time, we believed that it was important to feel the pain and resolve each flake by fixing the underlying issue. But flaky test failures are by nature hard to replicate, and the investment it typically takes to improve the test suite is high enough that we never got to it.

    On top of that, as a Release Engineering team, understanding these test flakes requires deep understanding of the components interacting. This made it even tougher especially since our attention is typically around making the process of release engineering more robust.
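    For readers unfamiliar with the auto-retry idea this chapter is named after, here is a minimal Python sketch. It is not the authors' implementation: `run_with_retries`, its parameters, and the choice to return the attempt number are assumptions made for illustration.

```python
import time
from typing import Callable


def run_with_retries(step: Callable[[], bool], attempts: int = 3, delay_seconds: float = 0.0) -> int:
    """Run a step until it passes and return the attempt number that passed.

    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        if step():
            return attempt
        if attempt < attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"step failed after {attempts} attempts")


# Usage: a simulated flaky step that fails once and then passes.
outcomes = iter([False, True])
passed_on = run_with_retries(lambda: next(outcomes))
```

    Returning the attempt number is a design choice of this sketch: it lets the caller log which steps needed a retry, so flakiness stays visible even when it no longer stops the train.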

    After much back and forth and deliberation, we implemented a retry strategy that removed the need to manually pay attention to these incidents. We also made sure to

  • Simplicity is the Key

    Simple can be harder than complex: You have to work hard to get your thinking clean to make it simple. But it's worth it in the end because once you get there, you can move mountains. – Steve Jobs

    Oftentimes build and packaging systems are a hodge-podge of scripts cobbled together to spit out a version of the software that ends up being distributed to customers. This increases friction when introducing changes while also making it hard to understand what is going on.

    So it's important to treat releasing software as its own discipline and make a conscious effort to strive for simplicity in its design. A clever approach may feel like an accomplishment, but keep asking why until you understand the outcomes, and then design simple, pointed solutions (see first principles).

    The same arguments that apply in software engineering apply here (see YAGNI). Release Engineering is an application of software engineering principles

  • Leave a Paper Trail

    Each release train should have a designated conductor, one who keeps track of the progress in the system.

    The conductor's responsibility is to keep track and make notes as progress is made. She should be noting down any discrepancies, creating incidents, noting problems in the workflow, and leading the retrospective. On top of that, this role should also involve letting stakeholders know of important details and progress.

    One major part of this is keeping a journal of the progress in a centralized place. This helps others stay in the know about what's going on and take charge if needed. Releasing crucial software without enough context in the team because one person is absent from work is a recipe for disaster.

    So, keep a log of each change coming into the product, errors happening in the system, any problems occurring in the workflow, common failures, annoyances, and a high-level timeline. The team will begin to rely on this artifact to learn and this will make the Release Engineering sy

  • Single Source of Truth

    A typical release engineering system uses different data as inputs to invoke necessary workflows. End-of-support dates can, as an example, determine whether a pipeline for a particular product version is triggered and whether it is published publicly to all customers.

    This type of information typically includes:

    • Release dates for Beta, RC, or GA releases along with end-of-support dates.
    • Configuration of the product for test setup.
    • Environment infrastructure setup (in our case, Terraform files)
    • List of changes in each release train.
    • One set of steps to run the release train.
    • Upgrade path information.

    As the system grows in complexity, it's often more convenient to duplicate this information directly in scripts than to implement a way to refer to a central source. Taking the time to build that central reference pays dividends.
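    To make the end-of-support example concrete, here is a small Python sketch in which every script would ask one central record instead of carrying its own copy of the dates. The `RELEASES` structure, the version numbers, and the dates are all invented for illustration.

```python
from datetime import date
from typing import Dict, List

# Hypothetical central record; versions and dates are made up for illustration.
RELEASES: Dict[str, Dict[str, date]] = {
    "2.10": {"ga": date(2021, 1, 15), "end_of_support": date(2022, 1, 15)},
    "2.11": {"ga": date(2021, 6, 1), "end_of_support": date(2022, 6, 1)},
}


def is_supported(version: str, today: date) -> bool:
    """A version is supported up to and including its end-of-support date."""
    return today <= RELEASES[version]["end_of_support"]


def pipelines_to_trigger(today: date) -> List[str]:
    """List the versions whose pipelines should still run on the given day."""
    return sorted(version for version in RELEASES if is_supported(version, today))


still_running = pipelines_to_trigger(date(2022, 3, 1))
```

    In practice the record might live in a versioned file or a small service; the sketch keeps it in memory only to stay self-contained.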

    Release Engineering systems become convoluted when each script consumes different copies of the same underlying data to perform its function. Makin

  • Practices
  • Write Playbooks

    A playbook captures all of the details that are needed to ship a product out to customers. Instead of relying on word of mouth for context sharing, playbooks allow team members to ship products by following documentation step-by-step.

    When you begin implementing a process to regularly release your product, you'll find that certain tasks must be performed in order. It's hard to figure out what needs doing unless it's written down. It's risky to rely on memory and accumulated knowledge in a team, as that creates single points of failure. Playbooks put your team on the same page about how to ship your product.

    A playbook standardizes the process and helps execute the Release Engineering workflow. When the system isn't fully automated, playbooks allow the team to keep operating instructions of the current Release Engineering state of the system in a document. It should be easily accessible and understood by the team. Playbooks provide a space for any ch

  • Catalog Changes

    Making changes to a product in a single repository is relatively simple. You have a series of commits that precede a deployment. While not trivial, it is easy to figure out what changed and, upon error, easy to revert or make changes to fix the problem. If the team follows a relatively standard format for its commits, it becomes obvious which part of the code is likely the culprit.

    For a product that multiple teams contribute to from different repositories, and that is shipped to multiple customers and deployed in various ways, it is risky not to have a standardized system to catalog all of the changes slated to go into any particular release. Even the simplest cases may require scrutinizing code from multiple dependencies and repositories to find the source of the problem.

    Therefore, it is imperative to have a way for each contributing team to provide their changes along with documentation and a release note for the change. A s

  • Do Postmortems

    Agile teams typically conduct a retrospective at the end of a sprint or project to determine what went well and what didn't, so they can learn and work on improvements where needed. Similarly, each release train should be followed by a postmortem.

    A release train retrospective meeting could include:

    • Metadata on the release train: Number of days taken, number of incidents by incident types, list of versions shipped.
    • Timeline of the release train with major events: Integration, build, testing, and publishing.
    • Any major incidents that occurred. It's important to take the time to explore what problems stopped the train from progressing smoothly.
    • Any workflow gotchas discovered. Make a point to include what felt heavily manual.
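    The metadata item in the list above can be assembled mechanically before the meeting. The Python sketch below is a hypothetical helper; `train_summary` and its inputs are assumptions, and the example values are invented.

```python
from collections import Counter
from datetime import date
from typing import Dict, List


def train_summary(started: date, published: date, incident_kinds: List[str], versions: List[str]) -> Dict[str, object]:
    """Collect the headline numbers for a release train retrospective."""
    return {
        "days_taken": (published - started).days,
        "incidents_by_type": dict(Counter(incident_kinds)),
        "versions_shipped": list(versions),
    }


summary = train_summary(
    started=date(2021, 3, 1),
    published=date(2021, 3, 5),
    incident_kinds=["test-flake", "test-flake", "infrastructure"],
    versions=["2.11.3"],
)
```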

    This is a great opportunity to employ and learn facilitation skills and the conductor is a good person to lead this meeting. The team should ask questions and have an empathetic perspective towards understanding the problems encountered

  • Maintain Decision Records

    A complex release engineering system is built from decisions that have consequential effects in and out of the system. Most of these decisions feel logical while others feel counterintuitive.

    It's foolish to rely on human memory as a way to make sure these decisions are transmitted across the team, notwithstanding that teams change over time as well. Learning from a senior engineer is appropriate for skill development, but isn't a good substitute for gaining context around decisions taken in the architecture and design of the Release Engineering system.

    And over time, as teams make updates to the process and workflow, it's important to preserve the knowledge for future or absent engineers. It also provides a snapshot of where the philosophy of the architecture stands at any given point.

    Decision records help with this problem. A bit of bookkeeping upfront yields tangible benefits. The team records any important decision made in the team. This allows the team to remember why a decision was made and k

  • Incorporate Supply Chain Management

    Supply Chain Management encompasses the flow of physical goods, activities, processes and systems with the goal of delivering finished products to consumers. It is generally employed as a practice in large scale organizations to support and provide assembled products to customers in an efficient and reliable manner.

    As with most software paradigms, supply chain management in software can be adapted to encompass the supply chain of software dependencies from its lowest abstraction to highest, like OS or Docker images.

    In a way, a Release Engineering system can be seen as an implementation of a Supply Chain system for a software product.

    Any enterprise product needs to be secure. The bulk of security vulnerabilities are exposed after the product is shipped, so it's important to design the Release Engineering system with traceability and verifiability of its components in mind.
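    Traceability here can be as simple as recording which component versions went into each release. The following Python sketch assumes a hypothetical bill of materials; the release numbers, component names, and versions are invented.

```python
from typing import Dict, List

# Hypothetical bill of materials: release version -> {component name: component version}.
BILL_OF_MATERIALS: Dict[str, Dict[str, str]] = {
    "2.10.4": {"libfoo": "1.4.2", "base-image": "1.20"},
    "2.11.0": {"libfoo": "1.4.3", "base-image": "1.25"},
}


def affected_releases(component: str, bad_version: str) -> List[str]:
    """Return every release that shipped the given component at the given version."""
    return sorted(
        release
        for release, components in BILL_OF_MATERIALS.items()
        if components.get(component) == bad_version
    )


to_patch = affected_releases("libfoo", "1.4.2")
```

    With a record like this, answering "which shipped releases contain the vulnerable version?" becomes a lookup rather than an investigation.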

    When a bug or a security vulnerability is discovered, responsible teams should be ab

  • Test Incoming Changes

    A Release Train is a set of multiple changes slated for release in the next version.

    Testing and working with multiple changes in a single release train makes it hard to debug and fix the problems that occur, because system tests are poor at tracing a failure to the exact line of code. Unit tests, on the other hand, are typically adept at pinpointing the location of a failure, but are less suitable for Release Engineering systems where system integrity and functionality are the top priority.

    As a Release Engineer, when you don't have confidence in each independent change-set being integrated into the release train, the likelihood of failure increases. And it becomes extremely difficult to debug and determine which change-set is the culprit as a result.

    You cannot completely eliminate the risk of a set of changes, but to lower it dramatically, test each change-set independently against a subset or all of the system tests.
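    A minimal Python sketch of that screening step follows. `screen_changes` and the `system_tests_pass` callback are hypothetical names; the sketch assumes each change-set can be applied on its own to a known-good baseline.

```python
from typing import Callable, List, Tuple


def screen_changes(
    baseline: List[str],
    change_sets: List[str],
    system_tests_pass: Callable[[List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Test each change-set alone on top of the baseline; return (accepted, rejected)."""
    accepted: List[str] = []
    rejected: List[str] = []
    for change in change_sets:
        if system_tests_pass(baseline + [change]):
            accepted.append(change)
        else:
            rejected.append(change)
    return accepted, rejected


# Usage: a stand-in test function that fails whenever "change-bad" is present.
accepted, rejected = screen_changes(
    baseline=["last-release"],
    change_sets=["change-a", "change-bad", "change-b"],
    system_tests_pass=lambda changes: "change-bad" not in changes,
)
```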

    There is still a chance the combination of the two

  • Build Dependencies Automatically

    As a Release Engineer, one of the most time-consuming tasks is making sure that updated dependencies are shipped to customers. It's a crucial part of the release process, as it gives customers access to recent security patches. Dependencies also carry important bug fixes and improvements that customers can leverage in the product.

    Typically, a workflow for bumping dependencies can be quite manual. This can vary from bumping each version manually to running commands repeatedly that bump the dependencies using a package manager.

    Regardless of the level of convenience, working at this abstraction from the Release Engineering perspective feels like toil.

    So it's important to design tooling and pipelines to automate this process. As part of the build process, the system should technically be able to bring in updated dependencies automatically and incorporate those in the next release train.
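    The first step of such automation is deciding which dependencies need a bump. The Python sketch below is an illustration only: `plan_bumps` and the dependency names are hypothetical, and it assumes purely numeric dotted versions such as 1.2.10, which real version schemes often are not.

```python
from typing import Dict, Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn "1.2.10" into (1, 2, 10) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def plan_bumps(pinned: Dict[str, str], latest: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {dependency: (current, newest)} for every dependency with a newer version."""
    bumps: Dict[str, Tuple[str, str]] = {}
    for name, current in pinned.items():
        newest = latest.get(name, current)
        if parse(newest) > parse(current):
            bumps[name] = (current, newest)
    return bumps


bumps = plan_bumps(
    pinned={"libfoo": "1.2.9", "libbar": "2.0.0"},
    latest={"libfoo": "1.2.10", "libbar": "2.0.0"},
)
```

    Comparing parsed tuples rather than raw strings matters here: as text, "1.2.9" sorts after "1.2.10", which would hide the available update.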

    For TAS, we have a fairly

  • Case Studies
  • Tanzu Application Service

    Tanzu Application Service (TAS) is VMware's Cloud Foundry distribution that enables developers to deploy their applications without worrying about installing dependencies or maintaining them. The developers code their applications and let the platform handle the rest.

    TAS provides companies an escape from large IT infrastructure teams and software, and enables smaller platform teams to handle exponentially larger workloads. Developers, on the other hand, worry about their application and what it needs, and use the platform to manage and connect its dependencies.

    TAS is used by governments, banks, major wireless carriers, and other critical industries where uptime and redundancy are crucial.

    TAS is available as a roughly 10 GB download and uses Tanzu Operations Manager (powered by BOSH) to deploy and set up the runtime, logging, and other comp
