Continuous Deployment

AUDIENCE

Programmers, Operations

Our latest code is in production.

If you use continuous integration, your team has removed most of the risk of releasing. Done correctly, continuous integration means your team is ready to release at any time. You’ve tested your code and exercised your deployment scripts.

One source of risk remains. If you don’t deploy your software to real production servers, it’s possible that your software won’t actually work in production. Differences in environment, traffic, and usage can all result in failures, even in the most carefully tested software.

Continuous deployment resolves this risk. It follows the same principle as continuous integration: by deploying small pieces frequently, you reduce the risk that a big change will cause problems, and you make it easy to find and fix problems when they do occur.

Although continuous deployment is a valuable practice for fluent Delivering teams, it’s optional. If your team is still developing their fluency, focus on the other practices first. Full adoption of continuous integration, including automated deployments to a test environment (which some people call “continuous delivery”), will give you nearly as much benefit.

How to Use Continuous Deployment

Continuous deployment isn’t hard, but it has a lot of preconditions:

Create a zero-friction, zero-downtime deploy script that automatically deploys your code.
Use continuous integration to keep your code ready to release.
Improve quality to the point that your software can be deployed without manual testing.
Use feature flags or keystones to decouple deployments from releases.
Establish monitoring to alert your team of deployment failures.

Once these preconditions are met, enabling continuous deployment is just a matter of running deploy in your continuous integration script.
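As a rough illustration, a continuous integration script with deployment enabled might be sketched as follows. The script names (`./build.sh`, `./run_tests.sh`, `./deploy.sh`) and the `integrate` function are hypothetical placeholders, not part of any particular tool. The point is only that deploy becomes one more step that runs when everything before it succeeds.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical commands; substitute whatever your team's scripts are called.
STEPS = [
    ("build", ["./build.sh"]),
    ("test", ["./run_tests.sh"]),
    ("deploy", ["./deploy.sh"]),  # the only line added for continuous deployment
]


def run_command(command: Sequence[str]) -> bool:
    """Run one command and report whether it exited successfully."""
    return subprocess.run(command).returncode == 0


def integrate(steps=STEPS, run: Callable[[Sequence[str]], bool] = run_command) -> str:
    """Run each step in order, stopping at the first failure.

    Returns the name of the failed step, or "ok" if every step passed.
    """
    for name, command in steps:
        if not run(command):
            return name
    return "ok"
```

Because the steps run in order and stop at the first failure, a broken build or failing test never reaches the deploy step.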

The details of your deploy script will depend on your organization. Your team should include people with operations skills who understand what’s required. If not, ask your operations department for help. If you’re on your own, Continuous Delivery [Humble2010] and The DevOps Handbook [Kim2016] are both useful resources.

Your deploy script must be 100 percent automated. You’ll be deploying every time you integrate, which will be multiple times per day, and could even be multiple times per hour. Manual steps introduce delays and errors.

Detecting Deployment Failures

Your monitoring system should alert you if a deployment fails. At a minimum, this involves monitoring for an increase in errors or a decrease in performance, but you can also look at business metrics such as user sign-up rates. Be sure to program your deploy script to detect errors, too, such as network failures during deploy. When the deploy is complete, have your deploy script tag the deployed commit with “success” or “failure.”
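One way the error detection and tagging might look is sketched below. It assumes the deploy step is a function that raises an exception on any error, and it invents a tag naming scheme (`deploy-success-…` and `deploy-failure-…`) purely for illustration. Your team’s conventions will differ.

```python
import subprocess
from typing import Callable


def tag_commit(commit: str, tag: str) -> None:
    """Attach a git tag to the deployed commit (runs in the current repository)."""
    subprocess.run(["git", "tag", tag, commit], check=True)


def deploy_and_tag(
    commit: str,
    deploy: Callable[[str], None],
    tag: Callable[[str, str], None] = tag_commit,
) -> bool:
    """Deploy a commit, then record the outcome as a tag on that commit."""
    short = commit[:12]
    try:
        deploy(commit)  # assumed to raise on any error, e.g. a network failure
    except Exception as error:
        print(f"deploy of {short} failed: {error}")
        tag(commit, f"deploy-failure-{short}")  # hypothetical naming scheme
        return False
    tag(commit, f"deploy-success-{short}")
    return True
```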

To reduce the impact of failure, you can deploy to a subset of servers, called canary servers, and automatically compare metrics from the old deploy to the new deploy. If they’re substantially different, raise an alert and stop the deploy. For systems with a lot of production servers, you can also have multiple waves of canary servers. For example, you could start by deploying to 10 percent of servers, then 50 percent, and finally all of them.
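The wave-by-wave comparison can be sketched roughly as follows. Everything here is an assumption made for illustration: `deploy_to` and `error_rate` stand in for your real deployment and monitoring tools, error rate is the only metric compared, and the `tolerance` threshold is arbitrary.

```python
import math
from typing import Callable, Sequence


def canary_deploy(
    servers: Sequence[str],
    deploy_to: Callable[[str], None],
    error_rate: Callable[[Sequence[str]], float],
    waves: Sequence[float] = (0.10, 0.50, 1.00),
    tolerance: float = 0.01,
) -> bool:
    """Deploy in waves, halting if the new servers look worse than the old ones.

    Returns True if every server was deployed, False if the deploy was stopped.
    """
    deployed = 0
    for fraction in waves:
        target = min(len(servers), math.ceil(len(servers) * fraction))
        for server in servers[deployed:target]:
            deploy_to(server)
        deployed = max(deployed, target)
        new, old = servers[:deployed], servers[deployed:]
        # Compare only while some servers still run the old deploy.
        if old and error_rate(new) - error_rate(old) > tolerance:
            print(f"alert: canary error rate too high after {deployed} servers")
            return False
    return True
```

A real implementation would wait long enough between waves for the metrics to be meaningful, which this sketch leaves out.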

Resolving Deployment Failures

One of the advantages of continuous deployment is that it reduces the risk of deployment. Because each deploy represents only a few hours of work, deploys tend to be low impact. If something does go wrong, the change can be reverted without affecting the rest of the system.

When a deployment does go wrong, immediately “stop the line” and focus the entire team on fixing the issue. Typically, this will involve rolling back the deploy.

Roll back the deploy

Start by restoring the system to its previous working state. This typically involves a rollback, which restores the previous deploy’s code and configuration. To do so, you can keep each deploy in a version control system, or you can just keep a copy of the most recent deploy.

One of the simplest ways to enable rollback is to use blue/green deployment. To do so, create two copies of your production environment, arbitrarily labeled “blue” and “green,” and configure your system to route traffic to one of the two environments. Each deploy toggles back and forth between the two environments, allowing you to roll back by routing traffic to the previous environment.

For example, if “blue” is active, deploy to “green.” When the deploy is complete, stop routing traffic to “blue” and route it to “green” instead. If the deploy fails, rolling back is a simple matter of routing traffic back to “blue.”
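The toggling logic is small enough to sketch in a few lines. The `BlueGreenRouter` class below is hypothetical, as are the `install` and `healthy` callbacks. In a real system, “routing traffic” would mean reconfiguring a load balancer or similar, and the health check before switching is an extra precaution this sketch adds.

```python
class BlueGreenRouter:
    """Tracks which of two environments receives production traffic."""

    def __init__(self, active: str = "blue") -> None:
        self.active = active

    @property
    def idle(self) -> str:
        """The environment that is not currently receiving traffic."""
        return "green" if self.active == "blue" else "blue"

    def deploy(self, install, healthy) -> bool:
        """Install on the idle environment and switch traffic if it is healthy."""
        target = self.idle
        install(target)
        if not healthy(target):
            return False  # traffic never left the active environment
        self.active = target
        return True

    def roll_back(self) -> None:
        """Route traffic back to the environment that was active before."""
        self.active = self.idle
```

Because the previous environment is left untouched after the switch, `roll_back` only has to change the routing. Nothing needs to be reinstalled.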

Occasionally, the rollback will fail. This may indicate a data corruption issue or a configuration problem. Either way, it’s all hands on deck until the problem is solved. Site Reliability Engineering [Beyer2016] has practical guidance about how to respond to such incidents in chapters 12–14.

Fix the deploy

Rolling back the bad deploy will usually solve the immediate production problem, but your team isn’t done yet. You need to fix the underlying issue. The first step is to get your integration branch back into a known-good state. You’re not trying to fix the issue yet; you’re just trying to get your code and production environment back into sync.

Start by reverting the changes in the code repository, so your integration branch matches what’s actually in production. If you use merge commits in git, you can just run git revert on the integration commit. (For a merge commit, git revert also needs the -m option to say which parent to treat as the mainline.) Then use your normal continuous integration process to integrate and deploy the reverted code.
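For teams that script this step, the revert might be wrapped as shown below. The function names are hypothetical. The sketch assumes the integration commit is a merge commit whose first parent is the integration branch, which is why it passes `-m 1`. Check that assumption against your own branching conventions before relying on it.

```python
import subprocess
from typing import List


def revert_command(commit: str, is_merge: bool = True) -> List[str]:
    """Build the git command that undoes one integration commit."""
    command = ["git", "revert", "--no-edit"]
    if is_merge:
        # A merge commit has more than one parent; "-m 1" treats the first
        # parent (assumed here to be the integration branch) as the mainline.
        command += ["-m", "1"]
    return command + [commit]


def revert_integration(commit: str, is_merge: bool = True) -> None:
    """Run the revert in the current repository, raising if git reports an error."""
    subprocess.run(revert_command(commit, is_merge), check=True)
```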

Deploying the reverted code should proceed without incident because you’re deploying the same code that’s already running. It’s important to do so anyway, because it ensures your next deploy starts from a known-good state. And if this second deploy has problems too, it narrows the issue down to a deployment problem, not a problem in your code.

Once you’re back in a known-good state, you can fix the underlying mistake. Create tasks for debugging the problem—usually, the people who deployed it will fix it—and everybody can go back to working normally. After it’s been resolved, schedule an incident analysis session to learn how to prevent this sort of deployment failure from happening in the future.

Alternative: Fix forward

Some teams, rather than rolling back, fix forward. They make a quick fix—possibly by running git revert—and deploy again. The advantage of this approach is that you fix problems using your normal deployment script. Rollback scripts can go out of date, causing them to fail just when you need them the most.

On the other hand, deploy scripts tend to be slow, even if you have an option to disable testing (which isn’t necessarily a good idea). A well-executed rollback script can complete in a few seconds. Fixing forward can take a few minutes. During an outage, those seconds count. For this reason, I tend to prefer rolling back, despite the disadvantages.

Incremental Releases

For large or risky changes, run the code in production before you reveal it to users. This is similar to a feature flag, except that you’ll actually exercise the new code. (Feature flags typically prevent the hidden code from running at all.) For additional safety, you can release the feature gradually, enabling a subset of users at a time.
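A common way to enable a subset of users at a time is to hash each user into a stable bucket and compare it to a rollout percentage. The sketch below is one possible approach, not a prescribed one. The function name and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def in_rollout(user_id: str, feature: str, percent: int) -> bool:
    """Return True if this user falls inside the first `percent` of buckets."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return bucket < percent
```

Because a user’s bucket never changes, raising the percentage only adds users. Anyone who already had the feature keeps it.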

The DevOps Handbook [Kim2016] calls this a dark launch. Chapter 12 has an example of Facebook using this approach to release Facebook Chat. The chat code was loaded onto clients and programmed to send invisible test messages to the backend service, allowing Facebook to load-test the code before rolling it out to customers.
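On the server side, one variation of the same idea is to run the new code alongside the existing code and discard its result. The sketch below is a hypothetical illustration of that variation, not a description of how Facebook did it. `dark_launch`, `current`, and `candidate` are invented names.

```python
from typing import Callable, List


def dark_launch(
    current: Callable[[str], str],
    candidate: Callable[[str], str],
    problems: List[str],
) -> Callable[[str], str]:
    """Wrap `current` so `candidate` also runs on every request, invisibly."""

    def handle(request: str) -> str:
        visible = current(request)
        try:
            hidden = candidate(request)  # exercised, but never shown to users
            if hidden != visible:
                problems.append(f"mismatch for {request!r}")
        except Exception as error:
            problems.append(f"error for {request!r}: {error}")
        return visible  # users only ever see the existing behavior

    return handle
```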

Data Migration

Database changes can’t be rolled back—at least, not without risking data loss—so data migration requires special care. It’s similar to performing an incremental release: first you deploy, then you migrate. There are three steps:

Deploy code that understands both the new and old schema. Deploy the data migration code at the same time.
After the deploy is successful, run the data migration code. It can be started manually, or automatically as part of your deploy script.
When the migration is complete, manually remove the code that understands the old schema, then deploy again.

Separating data migration from deployment allows each deploy to fail, and be rolled back, without losing any data. The migration occurs only after the new code has proven to be stable in production. It’s slightly more complicated than migrating data during deployment, but it’s safer, and it allows you to deploy with zero downtime.
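To make “code that understands both the new and old schema” concrete, here is a minimal sketch. It assumes a hypothetical change that merges `first_name` and `last_name` columns into a single `full_name` column, with rows represented as dictionaries.

```python
from typing import Dict


def full_name(row: Dict[str, str]) -> str:
    """Read a customer's name from either the old or the new schema."""
    if "full_name" in row:  # new schema: one column
        return row["full_name"]
    return f"{row['first_name']} {row['last_name']}"  # old schema: two columns


def migrate_row(row: Dict[str, str]) -> Dict[str, str]:
    """Convert one old-schema row; rows already migrated pass through unchanged."""
    if "full_name" in row:
        return row
    rest = {k: v for k, v in row.items() if k not in ("first_name", "last_name")}
    return {**rest, "full_name": full_name(row)}
```

In this sketch, the first step deploys both functions, the second step runs `migrate_row` over the data, and the third step deletes `migrate_row` and the old-schema branch of `full_name`.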

Migrations involving large amounts of data require special care, because the production system needs to remain available while the data is migrating. For these sorts of migrations, write your migration code to work incrementally (possibly with a rate limiter, for performance reasons) and have it use both schemas simultaneously. For example, if you’re moving data from one table to another, your code might look at both tables when reading and updating data, but only insert data into the new table.
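The table-to-table example might be sketched as follows, using SQLite for the sake of a self-contained illustration. The table names (`users_old`, `users_new`), the batch size, and the fixed pause used as a rate limiter are all assumptions. A production version would also need to consider concurrent writers, which this sketch ignores.

```python
import sqlite3
import time


def find_user(db: sqlite3.Connection, user_id: int):
    """Read from the new table first, falling back to the old one."""
    for table in ("users_new", "users_old"):
        row = db.execute(
            f"SELECT id, name FROM {table} WHERE id = ?", (user_id,)
        ).fetchone()
        if row:
            return row
    return None


def rename_user(db: sqlite3.Connection, user_id: int, name: str) -> None:
    """Update whichever table currently holds the row."""
    for table in ("users_new", "users_old"):
        db.execute(f"UPDATE {table} SET name = ? WHERE id = ?", (name, user_id))


def add_user(db: sqlite3.Connection, user_id: int, name: str) -> None:
    """New rows go only into the new table."""
    db.execute("INSERT INTO users_new (id, name) VALUES (?, ?)", (user_id, name))


def migrate_batch(db: sqlite3.Connection, batch_size: int = 100) -> int:
    """Move up to `batch_size` rows to the new table; return how many moved."""
    rows = db.execute(
        "SELECT id, name FROM users_old LIMIT ?", (batch_size,)
    ).fetchall()
    with db:  # commit the batch as a unit, or roll it back if anything fails
        db.executemany("INSERT INTO users_new (id, name) VALUES (?, ?)", rows)
        db.executemany("DELETE FROM users_old WHERE id = ?", [(r[0],) for r in rows])
    return len(rows)


def migrate_all(db: sqlite3.Connection, batch_size: int = 100, pause: float = 0.1) -> None:
    """Migrate in small batches, pausing between them as a crude rate limiter."""
    while migrate_batch(db, batch_size):
        time.sleep(pause)
```

Small batches keep each transaction short, so the production system stays responsive while `migrate_all` works through the old table.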

After the migration is complete, be sure to keep your code clean by removing the outdated code. If the migration needs more than a few minutes, add a reminder to your team’s task plan. For very long migrations, you can add a reminder to your team calendar or schedule a “finish data migration” story into your team’s visual plan.

This three-step migration process applies to any change to external state. In addition to databases, it also includes configuration settings, infrastructure changes, and third-party service changes. Be very careful when external state is involved, because errors are difficult to undo. Smaller, more frequent changes are typically better than big, infrequent changes.

Prerequisites

To use continuous deployment, your team needs a rigorous approach to continuous integration. You need to integrate multiple times per day and create a known-good, deploy-ready build each time. “Deploy-ready,” in this case, means unfinished features are hidden from users and your code doesn’t need manual testing. Finally, your deploy process needs to be completely automated, and you need a way of automatically detecting deployment failures.

Continuous deployment makes sense only when deployments are invisible to users. Practically speaking, that typically means backend systems and web-based frontends. Desktop and mobile frontends, embedded systems, and so forth usually aren’t a good fit for continuous deployment.

Indicators

When your team deploys continuously:

Deploying to production becomes a stress-free nonevent.
When deployment problems occur, they’re easily resolved.
Deploys are unlikely to cause production issues, and when they do, they’re usually quick to fix.

Alternatives and Experiments

The typical alternative to continuous deployment is release-oriented deployment: deploying only when you have something ready to release. Continuous deployment is actually safer and more reliable, once the preconditions are in place, even though it sounds scarier at first.

You don’t have to switch from release-oriented deployment directly to continuous deployment. You can take it slowly, starting out by writing a fully automated deploy script, then automatically deploying to a staging environment as part of continuous integration, and finally moving to continuous deployment.

In terms of experimentation, the core ideas of continuous deployment are to minimize work in progress and speed up the feedback loop (see “MINIMIZE WORK IN PROGRESS” and “FAST FEEDBACK”). Anything that you can do to speed up that feedback loop and decrease the time required to deploy is moving in the right direction. For extra points, look for ways to speed up the feedback loop for release ideas, too.