2013-08-17

Uptimes and apocalypses

Riley: Buffy. When I saw you stop the world from, you know, ending, I just assumed that was a big week for you. It turns out I suddenly find myself needing to know the plural of apocalypse.
"A New Man", Buffy The Vampire Slayer, S4 E12
Amused by the apocalyptic tone of the Daily Mail's coverage of the 5-minute Google outage on Friday - just before midnight BST, which explains why no-one in the UK except hardcore nerds noticed - I thought I'd do a brief explanation of the concept of "uptime" for an Internet service.

Marketeers <spit> describe expected system uptime in "nines" - the fraction of time that the system is expected to be available. A "two nines" system is available 99% of the time. This sounds pretty good, until you realise that every day the system can be down for about 14 minutes. If Google, Facebook or the BBC News website were down for quarter of an hour every day, there would be trouble. So this is a pretty low bar.

For "Three nines" (99.9%) you start to move into downtime measured in minutes per week - there are just over 10,000 minutes in a week, so if you allow 1 in 1000 of those to be down, you're looking at 10 minutes per week. This is pretty tight - the rule of thumb says that even if you have someone at the end of a pager 24/7 and great system monitoring that alerts you whenever something goes wrong, it will still take your guy 10-15 minutes to react to the alert, log in, look to see what's wrong - and that's before he works out how to fix it. So your failures need to occur less frequently than weekly.

When you get to "Four nines" (99.99%) you're looking at either a seriously expensive system or a seriously simple system. During a whole year, you're allowed about fifty minutes of downtime, which by the maths above indicates no more than two incidents in that year - and, realistically, probably only one. At this level you start to be more reliable than most Internet Service Providers, so it starts to get hard to measure your uptime: your traffic fluctuates all the time because of Internet outages affecting your users. If your traffic drops, is it due to something you've done or is it due to something external (e.g. a natural disaster like Hurricane Sandy)? Network connectivity and utility power supply are probably not this reliable, so you have to have serious redundancy and geographic distribution of your systems. I've personally run a distributed business system that nudged four nines of availability with an under-resourced support team, and it was a cast iron bastard - any time anything glitched, you had someone from Bangalore calling you at home around 1am. Not fun.

"Five Nines" (99.999%) is the Holy Grail of marketeers, but in practice it seems to be unachievable for a complex system. You have only 5 minutes per year of downtime allowed, which normally equates to one incident every 3-4 years at max. Either your system is extremely simple, or it's massively expensive to run. Normally the cost of that extra 45 minutes of uptime a year is prohibitive - easily double that of four nines in many cases, sometimes much more - and most reasonable people settle for four nines or, in practice, less than that.
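For anyone who wants to check the arithmetic, here is a minimal Python sketch of the downtime budgets quoted above. The function name `downtime_budget` and the constants are my own labels for illustration, and a 365-day year is assumed.

```python
# Downtime allowed by an availability target expressed in "nines".
# Names here are illustrative; a 365-day year is assumed.

MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY    # 10,080: "just over 10,000"
MINUTES_PER_YEAR = 365 * MINUTES_PER_DAY  # 525,600


def downtime_budget(nines: int, period_minutes: float) -> float:
    """Return the minutes of downtime allowed in a period of the given length."""
    unavailability = 10 ** -nines  # 2 nines -> 0.01, 3 nines -> 0.001, ...
    return period_minutes * unavailability


if __name__ == "__main__":
    for nines in range(2, 6):
        per_day = downtime_budget(nines, MINUTES_PER_DAY)
        per_week = downtime_budget(nines, MINUTES_PER_WEEK)
        per_year = downtime_budget(nines, MINUTES_PER_YEAR)
        print(f"{nines} nines: {per_day:.3f} min/day, "
              f"{per_week:.2f} min/week, {per_year:.1f} min/year")
```

Two nines comes out at about 14.4 minutes a day, three nines at about 10 minutes a week, four nines at about 53 minutes a year and five nines at a little over 5 minutes a year, which matches the rounded figures in the text.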

Given that, let's examine the DM's assertion that "Experts said the outage had cost the company about £330,000 and that the event was unheard of." Google had about $50bn revenue last year so divide that by 366 (leap year) to get about $140M/day average, $5.7M/hour. A 5 minute outage is 1/12th of that, $474K or £303K at today's rates, so the number sounds about right. But "unheard of"? May 7 2005 was another outage, this time for around 15 minutes. Google, Twitter, Yahoo, Facebook, Bing, iTunes etc. go down for some areas of the planet fairly frequently - see DownRightNow which is currently showing me service disruptions for Yahoo Mail and Twitter. Gmail was down for a whole bunch of people for 18 minutes back in December. It's part of normal life.
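The back-of-envelope cost estimate can be reproduced with a few lines of Python. This is only a sketch: it assumes revenue arrives evenly around the clock and that none of the lost five minutes is recovered later, neither of which is really true. The exchange rate of 0.64 is my assumption, picked to be consistent with the $474K / £303K pair above, and the function name `outage_cost` is made up for the example.

```python
# Rough cost of an outage, assuming revenue is spread evenly over the year.
# The exchange rate is an assumption, not an official figure.

ANNUAL_REVENUE_USD = 50e9  # "about $50bn revenue last year"
DAYS_IN_YEAR = 366         # leap year, as in the text
OUTAGE_MINUTES = 5
USD_TO_GBP = 0.64          # assumed rate, consistent with $474K ~ £303K


def outage_cost(annual_revenue: float, outage_minutes: float,
                days_in_year: int = 365) -> float:
    """Return the revenue attributable to an outage of the given length."""
    revenue_per_minute = annual_revenue / (days_in_year * 24 * 60)
    return revenue_per_minute * outage_minutes


if __name__ == "__main__":
    cost_usd = outage_cost(ANNUAL_REVENUE_USD, OUTAGE_MINUTES, DAYS_IN_YEAR)
    cost_gbp = cost_usd * USD_TO_GBP
    print(f"Estimated cost: ${cost_usd:,.0f} (about £{cost_gbp:,.0f})")
```

The result lands at roughly $474K, in the same ballpark as the "experts'" £330,000 once converted.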

Global networks go down all the time. Google going down for a few minutes is not the end of the world. It's happened before and will almost certainly happen again. The Daily Mail needs to find some better quality experts - but then, I guess their quotes aren't as quotable. I'm not surprised Google drops off the planet for 5 minutes - I'm surprised it doesn't happen more often, and I'm astonished they get it back online in 5 minutes. I also feel sorry for people setting up their Internet connection at home in that outage window, when they tried connecting to www.google.com to verify their connection and it failed. "I can't reach Google - my Internet must be bust, it certainly can't be Google that's unavailable..."

Update: (2013-08-19)
And now Amazon goes down worldwide for 30 minutes. I rest my case.

2 comments:

  1. A particularly daft large utility company in the UK, whose name I will not provide because it will limit my nym and not improve the dit, specified 6 nines performance for their online shop. As their security auditor, I was surprised and more than a bit confused. I can only assume that was because it was what they usually specified for their core business. Or, of course, that they were idiots.

    Oh, and it was 6 nines downtime. Not 6 nines unplanned downtime.

    And it was a "critical business risk" if this wasn't achieved. I wondered what business damage would happen if somebody couldn't pay their electricity bill at 3am? Or buy a new fridge.

    Yes, they were idiots.

  2. SE: wonderful! Well, if five nines are good, surely six nines are better. Presumably they were running one 99.9% uptime system in each of 1000 towns across the UK...

