AWS is Down: Why the Sky is Falling by Justin Santa Barbara
"So (finally) we come to the problem. This morning, multiple availability zones failed in the us-east region. AWS broke their promises on the failure scenarios for Availability Zones. It means that AWS have a common single point of failure (assuming it wasn't a winning-the-lottery-while-being-hit-by-a-meteor-odds coincidence). The sites that are down were correctly designing to the 'contract'; the problem is that AWS didn't follow their own specifications. Whether that happened through incompetence or dishonesty or something a lot more forgivable entirely, we simply don't know at this point. But the engineers at quora, foursquare and reddit are very competent, and it's wrong to point the blame in that direction."
When Success is 99% Failover: How Availability Can Persist in the AWS Cloud When Network Events Also Persist in an EC2 Region by Asher Bond
"But don’t just go throwing shuriken at network-event coordinators unless your star has more than just these two points. I think a nice third point to sharpen and cut to is the reliability of monitoring systems. It’s good to be monitoring your autoscaling processes if you’re in a situation where you scale on demand… and you also want to monitor who is demanding the computing resources. Ideally, you’re getting alarmed before your end users are. Reflexive firewalls are a good way to go, but just having good reflexes is part of wearing the agile cat’s hat in general. If you have a fast way to report trouble to the authorities charged with ownership of a compromised node attacking your system, you’re part of the solution and get a gold star."
On Data Center Scale, OpenFlow, and SDN by Brad Hedlund
"In networks today, if you build a cluster based on a pervasive Layer 2 design (any VLAN in any rack), every top of rack (ToR) switch builds forwarding information by a well-known process called source MAC learning. This process forms the nature of the forwarding information stored by each ToR switch which results in a complete table of all station MAC addresses (virtual and physical) in the entire cluster. With a limit of 16-32K entires in the MAC table of the typical ToR, this presents a real scalability problem. But what really created this problem? The process used to the build the information (source mac learning)? The amount of information exposed to the process (pervasive L2 design)? Or the limited capacity to hold information?"