Saturday, November 30, 2019

Rails database migrations on a cluster with AWS CodeDeploy

A typical Ruby / Rails solution will comprise of a number of web servers behind a load balancer. Each of the web servers will read/write from a central database. In the course of new features being added to the application, the database schema goes through changes, what we refer to as "migrations".

When new code is deployed, the migrations that are needed for new code needs to be run first. If AWS Code Deploy is used for deployment of new code, we can set up the AfterInstall hooks to run the migrations before re-starting the web server.

So the usual flow in a deployment goes something like this:

  1. Stop the web server
  2. Migrate the database
  3. Start the web server

However, our application is hosted on a number of web servers. We don't want to bring down all servers at once. A typical blue/green deployment will have us deploy to just one third of the server fleet at once.

So if we have 27 web servers, we will be running the above steps on 9 of them at the same time. The main problem with this is that when the Rails migrate runs on multiple servers at once, it is likely to fail on a number of them. This is because Rails takes an advisory lock on the database and throws an exception on concurrent migrations. You can read more about the advisory locking here as well as a way to work around the problem.

But the solution is not without its drawbacks. If you prevent the migrations running on all but one machine, it is possible that the new code will be deployed sooner on those machines before the migration has finished. This is specially true for long running migrations. Then there is potential for new code to be running against an old database schema. New features that depend on the new schema will likely fail.

A better solution would be:
  1. Run the migration on a single instance - this could be one of the web servers, or a dynamically provisioned EC2 instance that can access the database.
  2. For all web servers
              2.1 Stop the server
              2.2 Deploy new code to it
              2.3 Start the server
The advantage of this solution is
  1. We side-step the concurrent migration issue. We run the migrations on a single instance and then do the rest of the deployment without incurring any database issues.
  2. We bring up the new code only after the database is migrated, so the new features work reliably from the start.
So the new database schema changes need to be backward compatible. But this is a general constraint we have anyway since on a blue/green deployment some part of the code is old and will be hitting the new database.

While this solution is pretty straightforward, it requires some effort to implement this in the AWS CodeDeploy environment.

What I ended up doing was to use a new Deployment Group (called staging) to bring up a single EC2 instance, change the start up code to only run the migration on that deployment group. Then I hooked this deployment group right after the deployment to a test instance, but before the code is deployed to the production servers.

In the startup code, we can check the current deployment group via ENV['DEPLOYMENT_GROUP_NAME']. In our scripts, we set the RAILS_ENV equal to the Deployment Group. This allows code to take different paths based on where it runs (in a local dev environment, a staging server or like in this case on a migrator server).

This is what our migrate script now looks like:

It is important to set the inequality, as we want the migrations to run on our test servers - we just don't want them running on the production web servers.

We add this to our database.yml, notice the environment is staging, to match the deployment group. Notice the database is the production instance.

In our case, we read credentials from AWS Secrets Manager. You don't have to.

This is how our staging step in CodePipleline looks like:

On the last CodeDeploy step, hit the edit button and set the Application Name and the Deployment Group correctly.

Now before the code is deployed to the production servers, the database migration has been completed on the staging instance. If the migration fails, CodeDeploy won't advance to the deploy stage. When the production servers start with new code, they will all use the new database schema.

After the migration has finished and before code is deployed, the old code will start using the new database schema. As long as the new schema is backward compatible, this will not cause a problem.

You may have to run the release pipline a few times till AWS co-operates with the new changes. But it should eventually start working.

No comments: