Our Outage May 24th, 2018

I talk a lot about making sure our app is always available for you – such as when I mention our Application Uptime. It’s been a while since we’ve had an outage (that’s a good thing!). Back in January 2017 I wrote about another short outage we experienced.

Unfortunately we had another brief outage a week and a half ago.

As with the previous outage, I want to share the details. We’re all about being transparent with our customers about the mistakes we make and what we’re doing to make sure they aren’t repeated.

We know we’re going to make mistakes. What we want to do is make sure we don’t repeat them.

Reader’s Digest version of our outage

The short explanation: Yours Truly deployed a code change that didn’t agree with the servers that were already running. While the deploy was in progress, those servers didn’t know how to handle the change, so they returned errors. Once the deploy finished, they were replaced with servers that understood the new code, and the problem went away.

A longer story

The long explanation:

I’ve talked before about our deployment process: we use a system that automatically tests code changes to make sure they don’t introduce problems, and then automatically deploys those changes to our production environment. Unfortunately, this one snuck past that.

On the morning of May 24th, I removed some old legacy code left over from when devices were related directly to customers, rather than to service locations (and then to customers through those service locations). This cleanup involved removing a database column from our Device table – specifically, the customer_id column.
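To make the shape of that change concrete, here’s a tiny, purely illustrative sketch – these aren’t our actual models, just stand-ins for the relationships described above:

```python
from dataclasses import dataclass

# Old shape: a device pointed straight at a customer.
@dataclass
class LegacyDevice:
    id: int
    customer_id: int  # the legacy column this cleanup removed

# New shape: a device points at a service location, and the
# service location points at the customer.
@dataclass
class ServiceLocation:
    id: int
    customer_id: int

@dataclass
class Device:
    id: int
    service_location_id: int
```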

At 10:26 AM on May 24th, I pushed this code to our automated testing and deployment system. That process completed about 10 minutes later, and the automated tests all passed.

At 10:35 AM, because that code passed our automated testing, it started automatically deploying to production. This involves spinning up new servers and, in this case – because of the removal of the customer_id column – making a change to the database: telling the production database to remove the customer_id column from the device table.

Changes to the database during a deployment aren’t uncommon – the difference in this case was the removal. The application code already running in production wasn’t using this column for relationships between tables or anything else, but it still expected the column to be there: the list of columns in the device table was ‘cached’, so any call to the device table requested those specific columns by name rather than selecting all columns.

The running servers expected the device table to have a customer_id column. At about 10:40, once the new servers had spun up and removed the customer_id column from the database, that column no longer existed. Because of their cached column list for the device table, the running servers kept sending queries that named customer_id – and those queries now returned errors, because the column they asked for was gone.

The new servers removed the column and started up, but the old servers continued to serve our users. This meant our users saw error messages any time a page included a query against the device table – and that’s a lot of our pages!
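To illustrate the failure mode, here’s a small self-contained sketch using SQLite as a stand-in for our production database – the table and the column caching are simplified versions of what’s described above, not our actual code:

```python
import sqlite3

# A toy "devices" table that still has the legacy column.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE devices (id INTEGER PRIMARY KEY, customer_id INTEGER, service_location_id INTEGER)")
db.execute("INSERT INTO devices VALUES (1, 42, 7)")

class AppServer:
    """Simplified app server: caches the table's column names at startup
    and always queries those exact columns by name."""
    def __init__(self, conn):
        self.conn = conn
        self.cached_columns = [row[1] for row in conn.execute("PRAGMA table_info(devices)")]

    def load_devices(self):
        cols = ", ".join(self.cached_columns)
        return self.conn.execute(f"SELECT {cols} FROM devices").fetchall()

old_server = AppServer(db)         # booted before the deploy; its cache includes customer_id
print(old_server.load_devices())   # works: [(1, 42, 7)]

# The deploy's schema change: drop the legacy column (needs SQLite >= 3.35).
db.execute("ALTER TABLE devices DROP COLUMN customer_id")

new_server = AppServer(db)         # booted after the deploy; its cache no longer has customer_id
print(new_server.load_devices())   # works: [(1, 7)]

try:
    old_server.load_devices()      # still asks for customer_id by name...
except sqlite3.OperationalError as e:
    print("old server error:", e)  # ...and gets "no such column: customer_id"
```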

Once the new servers were fully started, we automatically switched over to using them. About a minute after that (at 10:44 AM), all the errors coming from the older servers cleared up, because everyone was on the new servers – and the new servers knew the customer_id column was gone.

All in all, the outage lasted about 4 minutes. Unfortunately, it was during one of our busiest times of day, and it affected a lot of our customers.

What are we doing to prevent this going forward?

As we continue to improve C3, we’re undoubtedly going to need to remove unused columns from tables. Not doing that would create too much technical debt.

So what are we going to do to prevent this type of outage? First, any column removal will be peer-reviewed before it’s deployed to production. Additionally, column removals will be done in an off-hours maintenance window (like our maintenance window last night) or in a different manner than we used this time, so that they don’t impact the servers that are currently serving our users.
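For the curious, one common pattern for that ‘different manner’ is a two-phase removal: first deploy code that stops referencing (and caching) the column, and only drop the column in a later deploy once no running server still asks for it. Here’s a minimal sketch in the same toy setup as above – an illustration of the idea, not necessarily the exact implementation we’ll use:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE devices (id INTEGER PRIMARY KEY, customer_id INTEGER, service_location_id INTEGER)")
db.execute("INSERT INTO devices VALUES (1, 42, 7)")

# Phase 1: the application is told to ignore the column, even though the
# database still has it. (In a real app this would live in the model layer.)
IGNORED_COLUMNS = {"customer_id"}

class AppServer:
    """Same column-caching server as before, but it filters out ignored
    columns when building its cache."""
    def __init__(self, conn):
        self.conn = conn
        self.cached_columns = [
            row[1] for row in conn.execute("PRAGMA table_info(devices)")
            if row[1] not in IGNORED_COLUMNS
        ]

    def load_devices(self):
        return self.conn.execute(
            f"SELECT {', '.join(self.cached_columns)} FROM devices"
        ).fetchall()

server = AppServer(db)
print(server.load_devices())   # [(1, 7)] – customer_id is never requested

# Phase 2 (a later deploy, once no running server references the column):
# actually drop it. Servers from phase 1 keep working untouched.
db.execute("ALTER TABLE devices DROP COLUMN customer_id")  # SQLite >= 3.35
print(server.load_devices())   # still [(1, 7)]
```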

We (and I personally) apologize for the outage. We’ll continue to work on improving so that we don’t get in the way of you doing your work. Thanks for your understanding.
