AWS is down: Why the sky is falling
Amazon Web Services, "the cloud" to many people, has had a significant issue in one of their datacenters since about 1 AM Pacific time on April 21st. Several huge websites (reddit, quora, foursquare) are down or significantly impacted. I've seen a lot of misinformation suggesting that this is purely due to the laziness of the affected sites' engineers, but that isn't the case. Here's why:
AWS has two concepts that relate to availability - Regions and Availability Zones. There are five Regions - two in the US (one east coast, one west coast), one in Europe (Ireland), and two in Asia (Tokyo, Singapore). Each Region contains multiple "Availability Zones" (AZs), which are supposed to be isolated from one another, so that nothing short of a natural disaster or something of that magnitude is a single point of failure for more than one of them. AWS says that by "launching instances in separate Availability Zones, you can protect your applications from failure of a single location". It's not clear whether 'location' means separate datacenters or separate floors/areas of a single datacenter, but it doesn't really matter - the point is that AZs should fail independently unless a catastrophic failure occurs. [Update below: it seems likely that they are in fact separate datacenters]
AZs also offer "inexpensive, low latency network connectivity to other Availability Zones in the same Region". Inter-Region transfer, on the other hand, goes over the public internet, and is comparatively expensive, slow and unreliable.
These are the "rules of the game". So if you're playing the AWS game and setting up a master/slave MySQL database (to take a highly pertinent example), you put the master and the slave in the same Region, but make sure they're in different Availability Zones. You don't normally put them in separate Regions, because then you have to cross the expensive, slow and unreliable links between Regions, and you'll likely have more problems keeping your databases in sync. You are still at risk if, for example, a hurricane hits the eastern seaboard and destroys the datacenter, but short of that you should be OK - as long as AWS does what they promised.
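The placement rule above can be sketched as a small check. This is only an illustration, not AWS tooling: the `Placement` type and the `check_replica_pair` and `survives` helpers are hypothetical names, and the zone names (`us-east-1a` and so on) are just examples of the usual naming pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Where one database node runs (hypothetical type, for illustration)."""
    role: str    # "master" or "slave"
    region: str  # e.g. "us-east-1"
    zone: str    # Availability Zone, e.g. "us-east-1a"


def check_replica_pair(master, slave):
    """Return the problems with a master/slave placement; empty means it follows the rules."""
    problems = []
    if master.region != slave.region:
        problems.append("replication crosses Regions: expensive, slow, unreliable link")
    elif master.zone == slave.zone:
        problems.append("both nodes share one Availability Zone")
    return problems


def survives(placements, failed_zones):
    """True if at least one node is in a zone that has not failed."""
    return any(p.zone not in failed_zones for p in placements)


master = Placement("master", "us-east-1", "us-east-1a")
slave = Placement("slave", "us-east-1", "us-east-1b")

print(check_replica_pair(master, slave))                        # no problems found
print(survives([master, slave], {"us-east-1a"}))                # one AZ lost: still up
print(survives([master, slave], {"us-east-1a", "us-east-1b"}))  # both AZs lost: down
```

The last call models the case the isolation promise is meant to rule out, short of a disaster that takes out the whole Region.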
So (finally) we come to the problem. This morning, multiple Availability Zones failed in the us-east Region. AWS broke their promises on the failure scenarios for Availability Zones. That means AWS has a single point of failure common to multiple AZs (assuming it wasn't a winning-the-lottery-while-being-hit-by-a-meteor-odds coincidence). The sites that are down were correctly designed to the 'contract'; the problem is that AWS didn't follow their own specifications. Whether that happened through incompetence or dishonesty or something a lot more forgivable, we simply don't know at this point. But the engineers at quora, foursquare and reddit are very competent, and it's wrong to point the blame in that direction.
Of course it's possible to protect against a catastrophic failure (one that takes out multiple AZs), but for most businesses the additional expense and engineering effort isn't worth it (and may even be counterproductive, because it introduces additional complexity). I'm sure all the sites that are down have backups they could go to. The problem is that bringing them online is likely complicated and risky - in practice you have to move everything to the new Region, because otherwise the latency between your machines is too high. AWS has made this particularly complicated: the different Regions have different features available and different AMI ids, and I think reserved instances can't be moved between datacenters. In reality, failover between Regions is not realistic. It's probably as much work as failover to a completely different cloud, which is probably a better disaster recovery policy anyway. For all we know quora started the process the minute AWS had an issue and are still working on it - it could easily be a whole-day process. Or perhaps they would have started it had AWS communicated at the start that this would be such a big outage, but AWS communication is - frankly - abysmal other than their PR.
So - in short - the blame here lies squarely with AWS, who 'guaranteed' a contract they then broke. Mistakes happen, but the mistake here was an AWS mistake.
Nor was this a failure of "the cloud". It does show the importance of choosing your cloud provider carefully. I think many people will be reassessing their choice of AWS.
A few other tidbits:
- The reason so many websites are in us-east is that that's where the new features get rolled out first. It's also the cheapest. It's also probably the best located in terms of many websites' traffic (good performance for North America, reasonable performance for Europe).
- The actual failure was due to EBS (persistent disks), which have been a disaster in terms of reliability since their introduction. But that's a whole different blog post!
- An EBS volume can only be in one AZ, and can only be used from that AZ. RDS seems to use a private API that lets EBS volumes be in multiple AZs, but those aren't available to anyone competing with RDS (hmmm... a Seattle company using private APIs to gain competitive advantage - haven't we been here before?)
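To make the EBS constraint concrete, here is a minimal sketch, again with hypothetical names (`Volume`, `attachable_volumes`) and made-up volume identifiers rather than any real AWS API. It assumes only what is stated above: an instance can use a volume only if both are in the same AZ, so a standby in another AZ needs its own copy of the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Volume:
    volume_id: str  # made-up identifier, for illustration only
    zone: str       # the single AZ the volume lives in


def attachable_volumes(instance_zone, volumes):
    """Volumes an instance in instance_zone could attach: same AZ only."""
    return [v for v in volumes if v.zone == instance_zone]


volumes = [
    Volume("vol-master-data", "us-east-1a"),
    Volume("vol-slave-data", "us-east-1b"),
]

# A standby in us-east-1b cannot reach the master's volume in us-east-1a,
# so it only sees its own replicated copy.
print([v.volume_id for v in attachable_volumes("us-east-1b", volumes)])
```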
__________________________________________________________________________
Update: 'js2' on hackernews pointed out that the EC2 FAQ offers a stronger guarantee:
"Each availability zone runs on its own physically distinct, independent infrastructure, and is engineered to be highly reliable. Common points of failures like generators and cooling equipment are not shared across Availability Zones. Additionally, they are physically separate, such that even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone." http://aws.amazon.com/ec2/faqs/#How_isolated_are_Availability_Zones_from_one_another
So it sounds like AZs are, in fact, separate datacenters and not just separate floors / rooms. That makes multi-AZ failure even less acceptable.
Also, I used the word "contract" above, but I meant it in the technical sense, not the legal sense. The legal contract is the SLA, which I consider relatively worthless. Engineers designed to the AWS technical 'guarantees', and a multi-AZ failure shouldn't happen if AWS is upholding their end of the 'bargain'.