
AMAZON FAIL: DC Public Library’s Lessons Learned from the Amazon Cloud Outage

Friday, June 24, 2011

BACKGROUND

• DClibrary.org was the first major DC Government website to use cloud-based hosting, beginning circa June 2009

• Initial architecture was designed to leverage the low cost of Amazon Web Services (AWS): large instances for database operations and lower-cost small and mid-size instances for WWW services

• The DClibrary.org Content Management System is Drupal 6

• Bonus: an experimental Drupal 7 Amazon machine instance is available on our website; currently undergoing user testing

WHAT WENT WRONG

• Background: AWS decouples the physical hard disk space (called Elastic Block Store, or EBS) from the CPUs (called “compute instances”); see the sketch below

• Late April 2011: an AWS engineer mistakenly routed “backplane” traffic (the internal server traffic that connects EBS to the CPUs) through a system that could not handle the load

• This triggered an alarm; since everything in AWS is redundant, the systems thought the backup EBS drives had all failed simultaneously, causing an overload as the system tried to compensate

• In a nutshell, it’s almost as if the CPUs no longer had hard drives
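For context, here is a minimal sketch of what that EBS/compute decoupling looks like in practice, written against the modern boto3 SDK (not the 2011-era API) with placeholder IDs and sizes: a volume is a separate API object that gets created on its own and then attached to a running instance.

```python
import boto3

# EBS volumes and compute instances are separate objects in the AWS API:
# a volume is created on its own and only then attached to an instance.
ec2 = boto3.client("ec2", region_name="us-east-1")

# Create a 100 GiB EBS volume in a specific Availability Zone (placeholder size/AZ).
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Attach it to an already-running instance (placeholder instance ID).
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)
```

If the connection between the compute instances and their EBS volumes is disrupted, as happened in the April 2011 outage, the instances effectively lose their hard drives.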



2009 ARCHITECTURE

• The June 2009 architecture focused on load balancing and database replication across Amazon Availability Zones (sketched below)

• The SVN machine was also in the cloud

• Too reliant on one service provider (Amazon)
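A rough sketch of the multi-Availability-Zone portion of that design, shown with today’s boto3 SDK purely for illustration (the 2009 setup used the older EC2 tooling); the AMI ID, instance type, and load balancer name are placeholders, not our actual configuration.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
elb = boto3.client("elb", region_name="us-east-1")

# Launch one mid-size WWW front end in each of two Availability Zones
# (placeholder AMI ID and instance type).
instance_ids = []
for zone in ("us-east-1a", "us-east-1b"):
    resp = ec2.run_instances(
        ImageId="ami-00000000",
        InstanceType="m1.medium",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
    instance_ids.append(resp["Instances"][0]["InstanceId"])

# Register both front ends with a pre-existing (hypothetical) classic load balancer
# so traffic keeps flowing if one zone's instance goes down.
elb.register_instances_with_load_balancer(
    LoadBalancerName="dclibrary-lb",
    Instances=[{"InstanceId": iid} for iid in instance_ids],
)
```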


PRE-OUTAGE ARCHITECTURE

• In 2010, AWS launched a new service called RDS (Relational Database Service). This is a managed MySQL database service that was more powerful and simpler to administer than running the databases ourselves on large instances (see the sketch below)

• We migrated to RDS in 2010

• The rest of the architecture, with the mid-size front-end instances and load balancers, stayed the same
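For reference, provisioning a managed MySQL instance on RDS looks roughly like this in the current boto3 SDK; the identifier, instance class, storage size, and credentials below are placeholders, not our actual settings.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Create a managed MySQL database instance; AWS handles patching,
# backups, and (optionally) a standby replica in a second AZ.
rds.create_db_instance(
    DBInstanceIdentifier="dclibrary-db",   # placeholder name
    Engine="mysql",
    DBInstanceClass="db.m1.large",         # placeholder class
    AllocatedStorage=100,                  # GiB, placeholder
    MasterUsername="drupal",               # placeholder credentials
    MasterUserPassword="change-me",
    MultiAZ=True,                          # standby in another AZ
)
```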


KEY LESSONS LEARNED

• Failover across Amazon’s multiple Availability Zones is not reliable

• Multiple Availability Zones do not imply separate physical or logical facilities!

• Amazon’s poor communication during the outage compounded this problem

• Due to Amazon’s poor initial incident response communications, we decided on the spot to create new machine instances (AMIs) in a different geographic region (US-West vs. US-East) and copy over the “offsite” one-day-old SVN and DB backups (see the sketch after this list)

• Downtime minimized to 1.5 hours; many websites (Reddit, Quora, Foursquare) were down for days

• Future worst case: Amazon goes completely offline. That means we need a very recent full backup of both WWW and DB instances in a physically and logically separate facility, plus the ability to load balance/change DNS quickly

• Our solution was to scale up Rackspace instances and make daily copies to those servers
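A condensed sketch of that emergency failover, shown with boto3 and Route 53 purely for illustration (the AMI ID, hosted zone ID, and hostname are placeholders, and our actual DNS provider may differ): launch from a saved image in a second region, then repoint DNS at the new front end with a short TTL.

```python
import boto3

# 1. Launch a replacement front end in a different region (US-West)
#    from a previously saved machine image (placeholder AMI ID).
ec2_west = boto3.client("ec2", region_name="us-west-1")
resp = ec2_west.run_instances(
    ImageId="ami-11111111",
    InstanceType="m1.medium",
    MinCount=1,
    MaxCount=1,
)
instance_id = resp["Instances"][0]["InstanceId"]

# Wait until it is running, then look up its public IP address.
ec2_west.get_waiter("instance_running").wait(InstanceIds=[instance_id])
desc = ec2_west.describe_instances(InstanceIds=[instance_id])
public_ip = desc["Reservations"][0]["Instances"][0]["PublicIpAddress"]

# 2. Repoint DNS at the new instance with a short TTL so the change
#    propagates quickly (hosted zone ID and hostname are placeholders).
route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId="ZEXAMPLE12345",
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "www.dclibrary.org.",
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": public_ip}],
            },
        }]
    },
)

# Restoring the one-day-old SVN and DB backups onto the new instance
# would follow the same copy pattern as the nightly backup sketch below.
```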


2011 ARCHITECTURE


WHAT WE RECOMMEND

• get physically and logically separate backup servers

• do nightly full-copy backups to the above servers (a sketch follows this list)

• have a clear, written process in place for the following things:

• communicating with superiors about what’s happening

• what steps need to be taken to failover

• when the “worst-case” failover plan is implemented (can be time-based or circumstance-based or both)

• either implement automatic load balancing or (not as good) have complete control over your DNS

• use a very good alerts monitoring service; some of the best ones are cheap/free. We use binarycanary.com.
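As one possible shape for the nightly full-copy backup, here is a sketch that dumps the MySQL database and mirrors the web root to a physically separate server over rsync/SSH; the hostnames, paths, and database name are placeholders, and credentials are assumed to live in a config file rather than in the script.

```python
import datetime
import subprocess

# Placeholder destination: a backup server at a different provider
# (e.g. a Rackspace instance), reachable over SSH.
BACKUP_HOST = "backup.example.org"
BACKUP_DIR = "/backups/dclibrary"

stamp = datetime.date.today().isoformat()
dump_file = f"/tmp/dclibrary-{stamp}.sql.gz"

# 1. Full database dump (credentials read from ~/.my.cnf, not hard-coded).
subprocess.run(
    f"mysqldump --single-transaction dclibrary | gzip > {dump_file}",
    shell=True,
    check=True,
)

# 2. Copy the dump and the web root to the off-site server.
subprocess.run(
    ["rsync", "-az", dump_file, f"{BACKUP_HOST}:{BACKUP_DIR}/db/"],
    check=True,
)
subprocess.run(
    ["rsync", "-az", "--delete", "/var/www/dclibrary/",
     f"{BACKUP_HOST}:{BACKUP_DIR}/www/"],
    check=True,
)
```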
