Application Uptime – Reaching for Five Nines

We’re doing backflow test management differently. No surprise there – we embrace that and see it as a positive. But sometimes that means setting the expectation that things will work a bit differently from what people are used to.

Since our solution lives in the cloud, you don’t need a computer at the office that serves as your ‘backflow machine’ in order to work. That’s a good thing! But it’s also different. When you can’t reach out and hit the reset button yourself, you need to be confident that you’ll never need to, and that our application will be available and working whenever you go to use it.

In this day and age the bar for any cloud-based application is to be available all the time, without fail. In IT infrastructure terms this is referred to as “uptime”. Let’s look at what that term means and how we’re striving to offer the highest uptime possible.

What is uptime?

Uptime is measured in nines. The more the better. Wikipedia has a great article on high availability that covers the finer points. In general though:

Being up 100% of the time is the perfect scenario, but it’s never possible. Phone systems are pretty much the gold standard: they are generally described as having five nines of reliability, meaning they are up 99.999% of the time. Over a year, you can expect them to be unavailable for a little over five minutes in total. That’s pretty reliable!

How the nines break down:

Availability %              Downtime per year    Downtime per day
90% (“one nine”)            36.5 days            2.4 hours
95%                         18.25 days           1.2 hours
99% (“two nines”)           3.65 days            14.4 minutes
99.9% (“three nines”)       8.76 hours           1.44 minutes
99.95%                      4.38 hours           43.2 seconds
99.99% (“four nines”)       52.56 minutes        8.64 seconds
99.995%                     26.28 minutes        4.32 seconds
99.999% (“five nines”)      5.26 minutes         864 milliseconds
99.9999% (“six nines”)      31.5 seconds         86.4 milliseconds
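
The figures in a table like this are simple arithmetic: take the fraction of time you are allowed to be unavailable and multiply it by the length of the period. Here is a minimal sketch of that calculation, assuming a 365-day year; the function and constant names are my own and purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: a non-leap year


def downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Seconds of downtime allowed in a period at the given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * period_seconds


def downtime_per_year(availability_percent: float) -> float:
    """Allowed downtime per year, in seconds."""
    return downtime_seconds(availability_percent, DAYS_PER_YEAR * SECONDS_PER_DAY)


def downtime_per_day(availability_percent: float) -> float:
    """Allowed downtime per day, in seconds."""
    return downtime_seconds(availability_percent, SECONDS_PER_DAY)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        yearly_minutes = downtime_per_year(pct) / 60
        daily_seconds = downtime_per_day(pct)
        print(f"{pct}%: {yearly_minutes:.2f} min/year, {daily_seconds:.3f} s/day")
```

Each extra nine divides the allowed downtime by ten, which is why the last few nines are so hard to earn.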

As you can see, the more nines you have, the better your availability. You won’t have customers for very long at 90% or 95% uptime. Five, six, and seven nines get crazy expensive. You need redundancy upon redundancy to be that reliable, and the more redundancies you have, the greater the chance of human error in supporting them.

You’ll see a lot of companies tout impressive uptime numbers with a stack of caveats. The usual one is that the numbers exclude planned maintenance periods, of which they have two each week, lasting 2-3 hours each. In my eyes that’s cheating. Maintenance is important, but not counting it as unavailable time for your application is pretty misleading.
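
To put a number on how much those maintenance windows hide, here is a rough sketch. It assumes the application is otherwise perfectly available and that the only downtime is two windows a week of 2 to 3 hours each; the function name is made up for illustration.

```python
HOURS_PER_WEEK = 7 * 24


def availability_with_maintenance(windows_per_week: int, hours_per_window: float) -> float:
    """Availability (%) if the only downtime is scheduled maintenance.

    Assumption: zero unplanned outages, so this is a best case.
    """
    downtime_hours = windows_per_week * hours_per_window
    return 100 * (1 - downtime_hours / HOURS_PER_WEEK)


if __name__ == "__main__":
    best = availability_with_maintenance(2, 2)   # two 2-hour windows
    worst = availability_with_maintenance(2, 3)  # two 3-hour windows
    print(f"Between {worst:.2f}% and {best:.2f}% available")
```

Under those assumptions the real availability lands somewhere between roughly 96.4% and 97.6%, which is not even two nines, whatever the headline number says.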

It isn’t just about uptime

It’s also important to consider that the raw amount of uptime in a day isn’t all that matters. When an outage occurs is usually far more important than how long it lasts. The majority of our customers use our application during business hours. The weekends get a little bit of activity, but our servers are awfully quiet over most of the evening and night hours.

A 4 hour outage for us during a holiday night would probably not even be noticed. But 5 minutes on a Tuesday mid-morning? We’d hear about it.
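
One way to make that idea concrete is to count only the minutes of an outage that fall inside business hours. The sketch below is purely illustrative and not how we actually measure anything: it assumes business hours of 8:00 to 17:00, Monday to Friday, ignores holidays and time zones, and uses a weekend night as a stand-in for the holiday night.

```python
from datetime import datetime, timedelta

# Assumptions for illustration only: business hours are 08:00-17:00, Mon-Fri.
BUSINESS_START_HOUR = 8
BUSINESS_END_HOUR = 17


def business_minutes(start: datetime, end: datetime) -> int:
    """Count the minutes of an outage that fall inside business hours."""
    count = 0
    current = start
    while current < end:
        is_weekday = current.weekday() < 5  # Monday is 0, Sunday is 6
        in_hours = BUSINESS_START_HOUR <= current.hour < BUSINESS_END_HOUR
        if is_weekday and in_hours:
            count += 1
        current += timedelta(minutes=1)
    return count


if __name__ == "__main__":
    # A 4 hour outage on a Saturday night.
    night = business_minutes(datetime(2024, 1, 6, 22, 0), datetime(2024, 1, 7, 2, 0))
    # A 5 minute outage on a Tuesday mid-morning.
    morning = business_minutes(datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 10, 5))
    print(f"Night outage: {night} business minutes")
    print(f"Morning outage: {morning} business minutes")
```

By this measure the long overnight outage costs customers nothing, while the short mid-morning one costs its full five minutes.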

What’s our target?

I’d like five nines; most people in technology set that bar for themselves. I know we don’t need availability that high, but it’s important to me that we raise the bar as much as we can.

Since we started ~10 months ago, we’ve been down for 17 minutes total. Thirteen of those were me personally turning off the wrong stack when I was very tired late one night (nobody noticed), and the other 4 minutes were caused by our hosting provider having some physical issues between datacenters.

That puts us solidly over the four and a half nines bar (99.995%).

More importantly, we learned from both of those outages and made improvements to prevent a repeat as much as possible. I’m hopeful that next year I can say we’ve hit my goal of five nines. Above all, we want to ensure that our customers are never impacted by our application’s uptime.
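
For anyone who wants to check the math, here is a quick sketch. It assumes “~10 months” means 10/12 of a 365-day year, which is an approximation on my part; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def availability_percent(downtime_minutes: float, period_minutes: float) -> float:
    """Availability (%) given total downtime over a period."""
    return 100 * (1 - downtime_minutes / period_minutes)


def downtime_budget_minutes(target_percent: float, period_minutes: float) -> float:
    """Minutes of downtime a target availability allows over a period."""
    return (1 - target_percent / 100) * period_minutes


# Assumption: "~10 months" is treated as 10/12 of a year.
period = MINUTES_PER_YEAR * 10 / 12
observed = availability_percent(17, period)

if __name__ == "__main__":
    print(f"Observed availability: {observed:.4f}%")
    print(f"Budget at 99.995%: {downtime_budget_minutes(99.995, period):.1f} minutes")
    print(f"Budget at 99.999%: {downtime_budget_minutes(99.999, period):.2f} minutes")
```

With those assumptions, 17 minutes of downtime works out to about 99.996% availability. The 99.995% bar allows about 21.9 minutes over the same period, while five nines allows only about 4.4, so there is still work to do.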

How often did your old system stop working? How much does an hour of downtime from your backflow tracking application cost you? How much time does your staff spend supporting your existing application?
