AWS is down: Why the sky is falling

Amazon Web Services, "the cloud" to many people, has had a significant issue in one of their datacenters since about 1AM Pacific time on April 21st.  Several huge websites (reddit, quora, foursquare) are down or significantly impacted.  I've seen a lot of misinformation suggesting that this is purely due to the laziness of the affected sites' engineers, but that isn't the case.  Here's why:

AWS has two concepts that relate to availability - Regions and Availability Zones.  They have five Regions - two in the US (one east coast, one west coast), one in Europe (Ireland), and two in Asia (Tokyo, Singapore).  Each Region has within it multiple "Availability Zones" (AZs), which are supposed to be isolated so that they share no single point of failure short of a natural disaster or something of that magnitude.  AWS says that by "launching instances in separate Availability Zones, you can protect your applications from failure of a single location".  It's not clear whether 'location' means separate datacenters or separate floors/areas of a single datacenter, but it doesn't really matter - the point is that AZs should fail independently unless a catastrophic failure occurs. [Update below: it seems likely that they are in fact separate datacenters]

AZs also offer "inexpensive, low latency network connectivity to other Availability Zones in the same Region".  Inter-Region transfer, on the other hand, goes over the public internet, and is comparatively expensive, slow and unreliable.

These are the "rules of the game".  So if you're playing the AWS game and setting up a master/slave MySQL database (to take a highly pertinent example), you put the master and the slave in the same Region, but make sure they're in different Availability Zones.  You don't normally put them in separate Regions; otherwise replication has to cross the expensive, slow and unreliable links between Regions, and you'll likely have more trouble keeping your databases in sync.  You are still at risk if, say, a hurricane hits the eastern seaboard and destroys the datacenter, but short of that you should be OK - as long as AWS does what they promised.
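As a rough illustration of the placement rule just described, here is a minimal Python sketch.  The Placement type, the check_replica_pair function and the zone names are hypothetical and exist only for this example; nothing here calls a real AWS API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Placement:
    """Where one database node runs (illustrative, not an AWS type)."""
    role: str    # "master" or "slave"
    region: str  # e.g. "us-east-1" (example name only)
    zone: str    # e.g. "us-east-1a" (example name only)


def check_replica_pair(master: Placement, slave: Placement) -> List[str]:
    """Return the ways a master/slave pair breaks the placement rule."""
    problems = []
    if master.region != slave.region:
        # Cross-Region traffic goes over the public internet.
        problems.append("different Regions: replication is expensive, slow and unreliable")
    elif master.zone == slave.zone:
        # Same Region is fine, but the two nodes must not share an AZ.
        problems.append("same Availability Zone: one AZ failure takes out both nodes")
    return problems


if __name__ == "__main__":
    master = Placement("master", "us-east-1", "us-east-1a")
    slave = Placement("slave", "us-east-1", "us-east-1b")
    # Same Region, different AZs: the list of problems is empty.
    print(check_replica_pair(master, slave))
```

A pair that passes this check is exactly the setup that should have survived a single-AZ failure.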

So (finally) we come to the problem.  This morning, multiple Availability Zones failed in the us-east region.  AWS broke their promises on the failure scenarios for Availability Zones.  It means that AWS has a single point of failure common to multiple AZs (assuming it wasn't a winning-the-lottery-while-being-hit-by-a-meteor-odds coincidence).  The sites that are down were correctly designed to the 'contract'; the problem is that AWS didn't follow their own specifications.  Whether that happened through incompetence, dishonesty or something far more forgivable, we simply don't know at this point.  But the engineers at quora, foursquare and reddit are very competent, and it's wrong to point the blame in that direction.
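To put a rough number on that coincidence: if AZs really failed independently, the chance of several being down at once would be the product of the individual chances.  The sketch below assumes a made-up figure of one bad hour per thousand for each AZ; that figure and the function name are assumptions for illustration, not anything AWS has published.

```python
def joint_failure_probability(p_single: float, zones_down: int) -> float:
    """Chance that `zones_down` specific AZs are all down in the same hour,
    assuming each fails independently with per-hour probability `p_single`."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    if zones_down < 1:
        raise ValueError("zones_down must be at least 1")
    return p_single ** zones_down


if __name__ == "__main__":
    assumed_p = 0.001  # hypothetical per-AZ, per-hour failure probability
    for n in (1, 2, 3):
        print(n, "AZ(s) down together:", joint_failure_probability(assumed_p, n))
```

Under that assumed figure, two zones failing together by pure chance would be roughly a one-in-a-million event in any given hour, which is why a simultaneous multi-AZ outage points to a shared dependency rather than bad luck.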

Of course it's possible to protect against a catastrophic failure (one that takes out multiple AZs), but for most businesses the additional expense and engineering effort isn't worth it (and may even be counterproductive, because it introduces additional complexity).  I'm sure all the sites that are down have backups they could go to.  The problem is that bringing them online is likely complicated and risky - in practice you have to move everything to the new region, because otherwise the latency between your machines is too high.  AWS has made this particularly complicated: the different regions have different features available and different AMI ids, and I think reserved instances can't be moved between datacenters - in reality failover between regions is not realistic.  It's probably as much work as failover to a completely different cloud, which is probably a better disaster recovery policy anyway.  For all we know quora started the process the minute AWS had an issue and are still working on it - it could easily be a whole-day process.  Perhaps they would have started that process had AWS communicated at the start that it would be such a big outage, but AWS communication is - frankly - abysmal other than their PR.

So - in short - the blame here lies squarely with AWS, who 'guaranteed' a contract they then broke.  Mistakes happen, but the mistake here was an AWS mistake.

Nor was this a failure of "the cloud".  It does show the importance of choosing your cloud provider carefully.  I think many people will be reassessing their choice of AWS.

 

A few other tidbits:

  • The reason so many websites are in us-east is that that's where the new features get rolled out first.  It's also the cheapest.  It's also probably best located in terms of many websites' traffic (good performance for North America, reasonable performance for Europe)
  • The actual failure was due to EBS (persistent disks), which have been a disaster in terms of reliability since their introduction.  But that's a whole different blog post!
  • An EBS volume can only be in one AZ, and can only be used from that AZ.  RDS seems to use a private API that lets EBS volumes be in multiple AZs, but those aren't available to anyone competing with RDS (hmmm... a Seattle company using private APIs to gain competitive advantage - haven't we been here before?) 

__________________________________________________________________________

Update: 'js2' on hackernews pointed out that the EC2 FAQ offers a stronger guarantee:

"Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone."  http://aws.amazon.com/ec2/faqs/#How_isolated_are_Availability_Zones_from_one_another

So it sounds like AZs are, in fact, separate datacenters and not just separate floors / rooms.  That makes multi-AZ failure even less acceptable.

Also, I used the word "contract" above, but I meant it in the technical sense, not the legal sense.  The legal contract is the SLA, which I consider relatively worthless.  Engineers designed to the AWS technical 'guarantees', and a multi-AZ failure shouldn't happen if AWS is upholding their end of the 'bargain'.

The freebie I'm hoping for at Google I/O 2011: Nothing

Google I/O 2011 apparently sold out in 59 minutes this year, and for that hour the registration site was almost entirely down.  Obviously it's a popular conference, and it's poor form for Google to have their servers crashing, turning the registration into a game of chance.  (It's also not clear that the conference is genuinely sold out, because the system is that bad!)  But the real issue in my mind is that Google have turned what's supposed to be a developer conference into a giveaway, by handing out free phones every year that are worth more than the price of registration.  This year people are probably expecting the Nexus S or the Motorola Xoom - or maybe even both.  Last year there were a bunch of scalpers on eBay selling their tickets, usually having pocketed the phones and probably selling those separately, making a tidy profit.

So what should Google do?  They can keep running their conference as an (unlicensed) lottery, in which case they might as well just have the technical talks elsewhere.  Or they can try to have a real developers' conference at Google I/O.  It looks like this year is a write-off on that front, unless they do something to dissuade the people who are just after the freebies.

My idea: Announce that there will be nothing given away at Google I/O this year, offer free refunds to anyone who wants one, and open a waitlist.  The scalpers will take their refunds, and the real developers will be able to get in.

 

What is Toyota doing?

As everyone across America must surely know, Toyota's cars apparently have a problem with 'unintended acceleration', and a small number of people have tragically died in related accidents.  Irrespective of the motives you attribute to the congressional hearings, irrespective of the statistics that seem to show that Toyota cars were no more prone to the problem than other makes before the press started reporting it, and irrespective of Toyota's questionable behaviour in preemptively blaming the problem on floor mats, I want to know what on earth Toyota is doing in their handling of the crisis.  Why haven't they simply provided the most basic advice: should this happen, it's straightforward to stop the car safely.  Suppose, for the sake of argument, that you're in a car which suddenly starts to accelerate: despite what the press would have you believe, this is not the time to call your loved ones, resigned to your inevitable demise.  You can simply apply the brakes (continuously, as pumping them may cause them to overheat).  Should that fail, you can put the car into Neutral and glide to a halt.  Should that fail, you can turn the car off while leaving the key in the ignition (you lose power-assistance, but presumably adrenaline will more than make up for the added effort required).

So why isn't Toyota communicating this simple message?  They're not talking about it in their TV ads, they're not talking about it in the press, and their CEO and Chairman both gave prepared statements that talked about commitments to quality but not about how to survive the problem.  There's probably some PR rule I'm not aware of here, but if so, it's time to break the rule.  This information is on the Toyota website, but it's buried on a difficult-to-find FAQ page, beneath the fold.  I challenge you to find that information if you're not looking for it.

If you're trying to point the finger at the media here - don't.  Of course the media have played up the problem, but they've also been the only ones investigating and talking about what to do.  That article has probably been more helpful in saving lives than the Toyota recall, not least because the problem can occur whether the badge on the car happens to say Toyota or Ford.  The media did air the congressional hearings, but neither the politicians nor the Toyota executives seemed to want to use the opportunity to make sure that no more lives were lost.

I'm thinking Toyota should run an ad like this:  "The accelerator on any car can become stuck, so whether you drive a Toyota or any other make, we at Toyota would like to tell you that it is straightforward to stop your car safely.  Apply the brakes firmly and continuously; that will safely stop the vehicle.  You can also put the car into Neutral.  You can also turn the key in the ignition through one click.  And please check at toyota.com to see if you should bring your car in for free servicing that can avoid this rare issue."

Maybe saying that would be a PR disaster.  But even if it is, a recall takes time, and if one more person dies during the recall because of this issue, when they could have survived had Toyota simply mentioned to apply the brakes, the lawyers will do their best to wipe out the company.  And, I'd argue, deservedly so.

The whole thing is nonsense anyway, with 40,000 people dying on US roads every year.  When you're multiplying by a number that big, you don't need a big percentage improvement.  I'd guess that running an ad during the Super Bowl reminding drivers to check their tire pressure would probably be a lot cheaper, save a lot more lives, save a lot of gasoline, and probably build a lot of brand loyalty to the company that did it.  The fear factor needs to be addressed, but it seems to me that a simple bit of communication could have defused the whole issue.  Blame it on floor mats, blame it on the pedal, blame it on the weight of a thousand angels dancing on the engine computer, but tell people that should the car accelerate, they should apply the brakes until the car stops.
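The arithmetic behind "you don't need a big percentage improvement" is simple enough to sketch.  The 40,000 figure is the one quoted above; the improvement percentages are hypothetical, picked only to show the scale, and the function name is made up for this example.

```python
ANNUAL_US_ROAD_DEATHS = 40_000  # figure quoted in the text above


def lives_saved_per_year(improvement_percent: float) -> float:
    """Deaths avoided per year if road deaths fell by `improvement_percent`."""
    return ANNUAL_US_ROAD_DEATHS * improvement_percent / 100


if __name__ == "__main__":
    # Hypothetical improvements, not measured effects of any campaign.
    for percent in (0.5, 1, 2):
        print(percent, "% fewer deaths:", lives_saved_per_year(percent), "lives a year")
```

Even a hypothetical 1% improvement would correspond to 400 lives a year.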

I'd love for someone to tell me what I'm missing.  Because the alternative - gross incompetence or some sickening 'brand damage' assessment - is far worse.

Patents are living up to their ideals

There's been a lot of uproar over software patents recently, seemingly sparked by Apple going after HTC and by news that Intellectual Ventures operates a network of shell companies. Even the USPTO admits that it could do a better job of implementing the patent system (and is seeking public input - deadline Monday, so speak now or forever hold your peace). Implementations in the real world always fall short of the ideal, and politicians (of all kinds) are very good at drawing attention to these failings. What I want to ask is: ignoring the procedural errors, is the patent system a worthy ideal, and are we even partially achieving that ideal?

 

US patents derive from the clause in the Constitution "to promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries." That is, the quid pro quo of patents is that governments grant exclusive rights in return for complete public disclosure of the invention. The intention is that this free disclosure promotes scientific progress, allowing others to 'stand on the shoulders of giants'. Ending software patents wouldn't yield a Utopian free exchange of ideas, but rather trade secrets and non-disclosure: 'no software papers'. Trade secrets do enjoy legal protection, and this is how the Coca-Cola formula has remained a secret for a century.

 

There are those who argue that this wouldn't apply to software, not least because it's relatively easy to reverse engineer software and discover any trade secrets within. Irrespective of the dubious logic involved in saying that legal protections shouldn't apply to things that are easy to steal, the rise of the Internet & cloud means that this simply isn't true any more. When using google.com you get an opaque black box; attempting to discern the internal machinations by looking at inputs and outputs would probably take longer than 20 years.  Yet the workings of Google's black box are relatively well known: PageRank, MapReduce, Google File System. I'm not suggesting that we know the exact details - the exact hardware or algorithm heuristics - but we know the 'clever bits'. The whole Internet industry post-Google has moved away from thinking in terms of a single computer to thinking in terms of how to harness clusters of computers together.  This is a key part of the changing economics of internet computing, and of what enabled Web 2.0.

 

And the reason we know all this is that Google has published papers. But they published those papers after they filed patents.  The PageRank patent was filed in Jan 1998, the paper published 4 months later. The MapReduce patent was filed in June 2004, the paper published 6 months later.  Google can keep secrets - for example we still have trouble figuring out exactly where their datacenters are. If Google can hide a frickin building, they're more than capable of keeping their core technologies locked away under trade-secret laws.

 

Now, we can argue 'post hoc ergo propter hoc' all we want, but the basic facts remain: the aim of patents is to allow free disclosure such that scientific knowledge can be advanced, Google has freely disclosed their secret sauce, and they did so only after filing patents.  This is exactly what the designers of the patent system envisaged. The positive impact of Google's ideas in my opinion is much greater than the negative effects of a few bad patents. So, yes, of course the implementation of the patent system is not ideal, but the ideals of the patent system do seem to be working, in the case of arguably the most important Computer Science papers of the last decade.

 

One more thing: from the PageRank patent, Stanford got Google stock worth $1 billion today. So patent licensing isn't just for trolls, but funds some of our greatest institutions.