I’ve talked a lot in the past about how serious we take uptime. Yesterday was the first time we’ve had an outage that was reported by a customer… and a lot more than one customer noticed it, unfortunately.
We’ve done a complete root cause analysis and today I’ll walk through exactly what happened to cause the outage, how we addressed the problem, and what steps we’re putting in place to ensure it doesn’t reoccur or something like it doesn’t happen in the future.
From about 7:36AM PST to 7:59AM PST our app was pretty much inaccessible.
warning – it’s getting technical – skip to the bottom if you’d like to keep it lighter.
Our automatic monitoring first reported our app as inaccessible at 7:36am
The first customer to ask if there was a problem happened about a minute later at 7:37.
Our Customer Support folks were asking the devs if anything was up based on the customer issue right as the second failed monitor popped at 7:41. That’s when our automatic alarms go off – one monitoring failure isn’t unheard of – we check from all over the world. Individual internet issues can cause issues there, but a second monitor failure from a different location is unusual.
Over to our monitoring tools to figure out what’s going on:
Woah, we’re getting CRUSHED! Those bars on the side are how long our transactions are running (generally how much each page load takes) – normally they average ~200, maybe 250 milliseconds – we’re looking at 60-100 THOUSAND millseconds or a minute or 2.
You can also see the times for common transactions – things that are normally measured in portions of seconds were taking hundreds of seconds. Our error rate also popped and was high. We didn’t get a screenshot of it at the time, but the web server CPU & memory was pretty normal. Generally that means we weren’t facing heavy incoming traffic.
Next step: database info
Same issue here, a single transaction is taking 1000x+ more time than it normally does.
That’s not a super common data lookup – a quick code-check finds that lookup is only performed in a job we process with our background worker system.
So we checked the background worker system.
All 20 of the background workers are running the same job with different parameters. That’s probably our problem.
at 7:59am we pause our background worker system.
The application quickly returns to operational.
What went wrong?
We sit down to analyze what causes that job to introduce so much pressure on our system that we went down. The job updates a test with the appropriate route point for routing & scheduling.
The way the job was written is this: it needs to lock a table in the database to ensure that a test is added to the proper route point. When that table is locked, nobody else can access that table until that job completes and releases that lock. (Wikipedia has a great breakdown of database locking.)
Enough of those jobs were running at the same time (along with other transactions on the system) that the database was returning data very slowly. All of the different locks were waiting on each other to complete.
The high volume of this job running started a little earlier in the day – about 7:15am – the database and the application had no problem keeping up with those jobs along with the rest of the application traffic. However – at ~7:20 the backend processing queue was getting too long. When that happens we automatically scale and add more processing to get jobs done quicker. When those additional servers automatically came online they started processing the job that caused our problems – now twice as many of those jobs were running at the same time.
This greatly increased the wait time for most database transactions. Think of each request/job/link click as a person standing in a line at the grocery store. The line got long so we added more cashiers. Unfortunately all of the customers at those cashiers were our slow running job – so we opened 20 more registers but the first customer in line at each of those registers took forever and the line continued to be held up.
We quickly evaluated the slow job. About a month ago, there was an aggressive lock that was applied to the job meant to address route points that weren’t properly grouping together when the address is the same.(You’d see 3 tests with 3 different route points all with the same address, boo, bad bug!) The lock clearly fixed that issue but caused a much worse condition to occur.
We had that fix in about 8:15 and re-enabled our background worker system immediately after – our background processing was fully caught up at about 8:25am.
Outage time was 23 minutes & overall time where our system was impacted was just shy of an hour.
Lessons Learned & Next Steps:
Our immediate goal is to get the functionality of the problematic job back and working – we’re making substantial changes to how that code runs to ensure it no longer requires aggressive locking that can cause problems. We’ve reviewed the rest of our code to look for any similar design patterns that may cause issues in other locations and have found none (phew!)
In the bigger picture we’re reviewing ways to pause or adjust the scaling of our backend processing system based on any alarm actions we receive from our monitoring. So if a even vaguely similar issue were to reoccur we’d prioritize customer traffic over potentially impacting backend job processing.
We’ll also be introducing a status page that is hosted on external infrastructure to communicate any application impacts if they do occur in the future.
We’re very committed to ensuring C3 is available & operational all the time. – We’re still reaching for that 5 nines of uptime, despite this bump in the road.