
Storms in the Cloud: Lessons from the Amazon Cloud Outage

The Cloud, While Great, Doesn't Absolve Companies from Taking Fundamental Precautions to Safeguard Their Systems

It started in the early hours of April 21st with a tiny router configuration error during a routine upgrade to an Elastic Block Store (EBS) instance (EBS is Amazon's storage service, employed by applications that use Amazon’s EC2 cloud service). Engineers accidentally switched traffic destined for a high-capacity production network onto a secondary, backup network. This second network was designed for redundancy and data replication, not to manage the large volume of production data suddenly flooding into it. As a result, many nodes became stuck in a replication loop as they searched for the storage space they had been cut off from. Amazon described the cascading failure as a "re-mirroring storm."

Major red-hot startups were hit by Amazon’s cloud storm: Dropbox, Foursquare, Quora, Reddit, HootSuite. All of them went down and stayed down, some for days. Amazon has pioneered the cheap, scalable and convenient hosting of Web-based services, thereby unleashing a flood of new companies whose entire business model is based around the Amazon cloud. The storm, however, showed the world that the cloud, while great, does not absolve companies from taking fundamental precautions to safeguard their systems online.

Cloud computing is in the ascendancy. Apple is set to announce its iCloud initiative this week, joining in a fierce competition for users with Microsoft and Google as well as conventional Web hosting providers such as Rackspace. Analysts estimate that enterprises are preparing to spend tens of billions of dollars annually on software-as-a-service and infrastructure-as-a-service over the next few years. The cost efficiencies and scalability benefits that established companies and upstarts alike can realize by outsourcing their application infrastructure, storage and processing cycles to cloud-based infrastructure services are undeniable.

With all the hype that the cloud has been subject to over the last few years, those who believed that the cloud was the silver bullet were due a wake-up call. While the cloud allows for a rapid and easy start-up experience, storms in the cloud can shut down your business completely unless you take sensible measures to protect yourself.

The result of all this was that some EC2 customers, largely those whose applications and data were stored in the Amazon Availability Zone that was affected, saw a day or more of downtime. Some lost data, and many more lost face with their customers. Some companies, however, managed to struggle through the incident with only minor availability problems. What did they do that other high-profile ventures did not?

The Cloud is Not a Disaster Recovery Plan

Moving to the cloud carries with it the promise of massive scalability, availability and redundancy, but it's no substitute for an effective disaster recovery plan. Cloud services are architecturally complex and built on relatively new design concepts. It is inevitable that they will sometimes break. While Amazon's post-incident report was fairly comprehensive and rightfully earned the company praise, some customers reported frustration with the lack of communication while their services were offline. During the downtime, they had no idea how long it was likely to last and felt the options for recovery were limited. Of course, it is likely that Amazon’s own engineers did not know how long the outage would last either, but that is little consolation for a company whose primary business is online.

In this regard, the move to the cloud actually increases the need for a well-tested disaster recovery plan, one that takes into account the lack of visibility you're likely to have during an incident. The companies that managed to ride out the storm of the Amazon outage were those that had designed their infrastructure with failure in mind, knew where the potential weak spots were, and knew what they needed to do if any particular component of their services failed.
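As a rough illustration of knowing in advance what to do when a component fails, the sketch below walks an ordered list of endpoints and returns the first one that passes a health check. The function name `choose_endpoint`, the endpoint labels and the `is_healthy` callback are all hypothetical; a real health check would depend on your own services and is not described in this article.

```python
from typing import Callable, Optional, Sequence


def choose_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint, in priority order, whose health check passes.

    `endpoints` and `is_healthy` are placeholders supplied by the caller;
    nothing here talks to a real cloud provider.
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated like an unhealthy endpoint,
            # so one broken probe cannot stop the search for a fallback.
            continue
    # No endpoint passed: the recovery plan needs a manual step at this point.
    return None
```

The point of the design is that the fallback order is decided before an incident, not during it, and that the "nothing is healthy" case is an explicit outcome the plan has to cover.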

Backup. Backup. Backup.

The Amazon outage hit its storage services primarily, and some customers did lose data permanently as a result. But a sound, frequent off-site backup policy would have proved invaluable in maintaining uptime during the incident. A full 24 hours offline is an almost inconceivably long time for any company that does its primary business on the Web, yet some cloud users suffered that and longer. In several cases, Amazon customers had no option but to sit out most of the downtime in frustrated silence, waiting for Amazon's engineers to resolve the situation, because they had no recent backup data they could fail over to, and no plan for doing so. Amazon offers customers protection against this type of problem through Availability Zones, but in this case the outage affected multiple Availability Zones within one geographic region. Backing up regularly to separate Availability Zones in diverse geographic locations could have saved the day for some companies.
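The backup advice above boils down to two checks: is there a copy outside the region that failed, and is it recent enough to fail over to? The sketch below expresses those checks in plain Python. The `Backup` record, the `usable_backups` function and the region labels are illustrative assumptions, not part of any Amazon API, and the acceptable backup age is a policy choice you would set yourself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class Backup:
    """Hypothetical record of one backup: where it lives and when it was taken."""
    region: str
    taken_at: datetime


def usable_backups(
    backups: Sequence[Backup],
    primary_region: str,
    now: datetime,
    max_age: timedelta,
) -> List[Backup]:
    """Return backups stored outside the primary region and no older than max_age.

    Results are sorted newest first, so the first item is the best failover candidate.
    """
    candidates = [
        b for b in backups
        if b.region != primary_region and now - b.taken_at <= max_age
    ]
    return sorted(candidates, key=lambda b: b.taken_at, reverse=True)
```

An empty result is the situation described above: nothing recent to fail over to, and nothing to do but wait. Running a check like this on a schedule, rather than during an outage, is what turns a backup policy into something you can rely on.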

Does the Cloud Match Your Policies?

Financial accounting, consumer data privacy, medical confidentiality, and payment processing data security — whatever your regulatory compliance requirements, you may find that migrating your applications and data to the cloud will impact your risk profile. You should carefully review your privacy policies against the cloud provider's infrastructure, and ask some specific questions:

• Where, physically and geographically, will my data be stored, and what is the legal jurisdiction of that location?

• Who has access to my data?

• Is my data stored and processed on shared infrastructure?

• When the cloud goes down, is my data at risk of loss or leakage?

• Is my data stored in an encrypted manner? If so, how are the private keys managed?

The utility paradigm offered by the cloud should not create a mindset of dangerous indifference to the underlying infrastructure in use. A disaster recovery plan is critical whether you process data in-house, or in the cloud, and backup is a vital component of any such plan.

Your disaster recovery, backup and policy review plans need to be audited and tested. The next major cloud outage will almost certainly not look like the last one. While enjoying the many benefits of cloud computing, you would do well to go back to some old-school IT policies and procedures that you probably threw out as “useless.”

With great advances in cloud computing come great responsibilities of redundancy and diversity. Those who fail to heed this essential truth are dancing with a disaster of their own making.

Ram Mohan is the Executive Vice President and Chief Technology Officer at Afilias, a global provider of Internet infrastructure services including domain name registry and DNS solutions. Ram also serves as the Security & Stability Advisory Committee's liaison to ICANN’s Board of Directors and has helped direct and write numerous policies affecting domain name registration and DNS security.