Analytics & data drive our business. We don’t have enough time in the day to solve every one of our problems or optimize every line of our application. We have to choose what to focus our time on to ensure we’re benefiting our customers in the most effective way every day.
Listening to your gut is important, but just as important is having the data to know when your gut is wrong or to confirm it was right.
I’ve talked about this before, and Sasha recently talked about ensuring the same thing in your accounting software. The better you know your business, the better you can improve it.
Today I’m going to talk about how we prioritize making sure our application is as fast as it can be.
When we were first getting started I’d click through our website and if I thought a page took too long to load we’d go look at it and see if there was anything we could do to make it faster. That was my gut check – and for the most part it worked.
But now that we have a lot of customers using our application, it’s not as easy to click through and see what’s fast & slow. Every customer is a little bit different: a page that is nearly empty and insanely fast for one user might have a ton of data on it for another and be slow in a way we never expected.
Analytics software to the rescue! We use a tool called New Relic. It tracks how well our application is responding at a very deep level. We can see how our application performs on each page all the way through our code to the database where the information is stored.
New Relic provides us with a complete view of our application. We get dashboards like the one below:
That is from our staging environment, so it’s fairly quiet, but we can still see really useful information in this data. Here, our health check is the most commonly used operation in our application. Second to that are our customer pages: the main page for a customer and the page that lists all customers. This view of the data helps us see which pieces of our application are used most often. We always want to spend time improving the areas of the application our users rely on most, rather than optimizing a single element that is only used 0.01% of the time.
Let’s look at a concrete improvement example. We use a report New Relic provides called ‘slowest average response time.’ This gives us a ranked list of the slowest pages in our application. There is always going to be a slowest page, but what we look for are pages that are performing far slower than the average. Yesterday we had a great example.
We saw that our billing page was running *really* slowly: it was taking almost a second to generate the page. We get concerned when anything takes longer than a quarter of a second, so this was about 4x slower than we want.
Our billing page is fairly simple, so we were surprised to see it running so slowly.
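As a rough sketch of that rule of thumb (this is not our actual tooling or New Relic's API), here is how you might flag pages whose average response time is over a quarter-second budget. The function name `slow_pages` is hypothetical. The only real figure is the billing page's 959 ms; the other page names and timings are made up for illustration.

```python
BUDGET_MS = 250  # the quarter-second threshold mentioned above


def slow_pages(avg_ms_by_page, budget_ms=BUDGET_MS):
    """Return (page, avg_ms, ratio_to_budget) for pages over budget, slowest first."""
    flagged = [
        (page, ms, ms / budget_ms)
        for page, ms in avg_ms_by_page.items()
        if ms > budget_ms
    ]
    return sorted(flagged, key=lambda row: row[1], reverse=True)


# Illustrative data: only the billing figure comes from this post.
sample = {"billing": 959, "customer": 120, "customer_list": 180, "health_check": 5}

for page, ms, ratio in slow_pages(sample):
    print(f"{page}: {ms} ms ({ratio:.1f}x the budget)")
```

With these numbers only the billing page is flagged, at a little under four times the budget.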
-Doesn’t seem like there is much to go slow here-
New Relic gives us the details:
When we look at the breakdown we can see the different elements of the page that add up to the 959 milliseconds it took to create that page.
I’ve highlighted the important part: more than half of the time this page was taking to generate was spent on an external request to Amazon S3. We store our test report files in S3. But we aren’t showing a test report on this page. What gives? Why did this page spend half of its processing time checking for test report files?
The answer is actually the ‘report’ button. We only show a ‘report’ button when there is a report associated with a test that has been performed. On this specific page we were using the wrong check for whether that file was present: instead of asking the database if the test report was present, we were asking the application to verify the file was physically available in S3, so it dutifully went and checked that the file was present for each test. It just so happened that two customers brought up the billing page in that 5-minute window. One of them had 4 tests and the other had 60! The customer with 60 is the one whose page took a very long time to load (over 2 seconds!). They were waiting around while C3 went and checked that files they weren’t trying to access were available.
Thankfully the fix was easy – we changed the file check to a database check and the same request that took 959ms now takes 185ms – a 5x improvement!
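To make the difference concrete, here is a minimal sketch of the two kinds of check. It is not our real code: `PerformedTest`, `has_report`, `FakeStorage`, and the `reports/<id>.pdf` key layout are all hypothetical stand-ins, and the fake storage class just counts lookups instead of talking to S3.

```python
from dataclasses import dataclass


@dataclass
class PerformedTest:
    id: int
    has_report: bool  # hypothetical database column, set when a report is uploaded


class FakeStorage:
    """Stand-in for a remote file store; counts how many lookups are made."""

    def __init__(self, keys):
        self.keys = set(keys)
        self.calls = 0

    def exists(self, key):
        self.calls += 1  # each call stands for one network round trip
        return key in self.keys


def report_buttons_slow(tests, storage):
    # One remote existence check per test: cost grows with the number of tests.
    return [t.id for t in tests if storage.exists(f"reports/{t.id}.pdf")]


def report_buttons_fast(tests):
    # Uses data already loaded from the database: no remote calls at all.
    return [t.id for t in tests if t.has_report]


# A customer with 60 tests, half of which have a report (made-up data).
tests = [PerformedTest(id=i, has_report=(i % 2 == 0)) for i in range(60)]
storage = FakeStorage(f"reports/{t.id}.pdf" for t in tests if t.has_report)

# Both versions decide to show the same buttons...
assert report_buttons_slow(tests, storage) == report_buttons_fast(tests)
# ...but the slow one needed one remote lookup per test to get there.
print(storage.calls)
```

Both functions return the same list of tests that should get a button. The difference is that the first one pays for a remote lookup on every test, which is why the customer with 60 tests waited so much longer than the one with 4.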
We wouldn’t have known about this problem if we didn’t have thorough analytics. Analytics lets us prioritize what we need to work on by showing us the slowest (and fastest!) areas of our application. Then we get to use that information to make a better product for our customers. Listening to our gut is important, but actual data goes a lot further, and it is essential to benefiting our customers and their experience with C3.