SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013
Transcript of SmugMug's Zero-Downtime Migration to AWS (ARC312) | AWS re:Invent 2013
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Andrew Shieh, SmugMug Operations
shandrew @ smugmug.com
November 15, 2013
SmugMug’s Zero Downtime Migration to AWS
ARC312
Friday, November 15, 13
SmugMug—Who are we?
The early days of SmugMug
• Gradual bootstrapped growth
• Multiple self-managed datacenter cages
• Too many servers of varying types
• Too many disks
• Tons of valuable skilled employee hours spent in cages
Data Center Fantasy
Data Center Reality
SmugMug <3 AWS
• Early adopter of Amazon S3
• Over the years, moved rendering, upload, archiving, payments, permissions, email, and more compute to AWS
• Before mid-2012, AWS had no ultra-high-performance I/O option
SmugMug Architecture ~2006
AWS: S3
SV: Web, DB, Image*
SmugMug Architecture ~2011
AWS: S3, Image (upload, processing, render, video, …)
SV: Web, DB
SmugMug Architecture - Transition
AWS: S3, Image*, Web
SV: Web, DB
DC: Replication DB, Direct Connect
SmugMug Architecture Today
AWS: S3, Image*, Web, DB
Ø
How did we get there?
Our database I/O evolution: Always cutting edge
• Started with MySQL on spinning disk RAID, max RAM
• Moved to ZFS SSD + SSD cache + spinning disks
• Moved to custom 24-SSD arrays
hi1.4xlarge FTW
• Our custom, obscure hardware => difficult to resolve problems, difficult to upgrade
• hi1 overall DB IO performance comparable to 8 x SSD RAID10
• < 3%/yr hi1 instance failure rate!
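To put the "< 3%/yr" figure in operational terms, here is a minimal arithmetic sketch. The fleet size of 20 is a hypothetical number, not something stated in the talk, and the calculation assumes instance failures are independent, which is a simplification.

```python
def expected_failures(fleet_size: int, annual_failure_rate: float) -> float:
    """Expected number of instance failures per year across the fleet."""
    return fleet_size * annual_failure_rate


def prob_at_least_one_failure(fleet_size: int, annual_failure_rate: float) -> float:
    """Probability that at least one instance fails within a year.

    Assumes failures are independent, which is a simplifying assumption.
    """
    return 1.0 - (1.0 - annual_failure_rate) ** fleet_size


# Hypothetical fleet size; 0.03 is the upper bound quoted on the slide.
FLEET_SIZE = 20
RATE_UPPER_BOUND = 0.03

expected = expected_failures(FLEET_SIZE, RATE_UPPER_BOUND)
at_least_one = prob_at_least_one_failure(FLEET_SIZE, RATE_UPPER_BOUND)
print(f"expected failures per year: at most {expected:.2f}")
print(f"chance of at least one failure: at most {at_least_one:.1%}")
```

Even at a low per-instance rate, a database fleet on ephemeral storage should expect to replace an instance now and then, which is one reason replication matters in the rest of the talk.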
Amazon VPC - also a big win
• Easy mapping of internal / external network security model to AWS
Zero downtime move?
Zero Downtime Move
• Flexibility of the AWS cloud makes a zero downtime move inexpensive. Pay for only what you use. Provision fast.
• Plan
• Test
• Plan and test again
Major changes post-move
• Database storage goes from SSD to hi1.4xlarge ephemeral
• Hardware load balancers become Elastic Load Balancing (ELB) load balancers
• haproxy layer 7 load/traffic directing goes from static to dynamic config
• Web servers autoscale for each cluster
• Membase to ElastiCache (later to Amazon EC2)
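One way to picture the move from static to dynamic haproxy config is to regenerate a backend section whenever the set of live web servers changes, for example after an autoscaling event. The sketch below is an illustration only: the talk does not describe SmugMug's tooling, and the cluster name, server names, and addresses are hypothetical. The emitted `backend`, `balance roundrobin`, and `server ... check` lines are standard haproxy configuration syntax.

```python
from typing import Iterable, Tuple

# (server name, host, port); all values used here are hypothetical.
Server = Tuple[str, str, int]


def render_backend(name: str, servers: Iterable[Server]) -> str:
    """Render an haproxy backend section for the given live servers.

    Servers are sorted so the output is stable between runs, which makes
    it easy to detect whether the config actually changed.
    """
    lines = [f"backend {name}", "    balance roundrobin"]
    for server_name, host, port in sorted(servers):
        lines.append(f"    server {server_name} {host}:{port} check")
    return "\n".join(lines) + "\n"


# In a real setup this list would come from the autoscaling group or a
# service registry; here it is hard-coded for illustration.
live_web_servers = [
    ("web-b", "10.0.1.12", 80),
    ("web-a", "10.0.1.11", 80),
]

config_text = render_backend("web_cluster_1", live_web_servers)
print(config_text)
```

A deployment script could write this text to the haproxy config file and reload haproxy only when the rendered text differs from the previous version.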
Zero Downtime Move Requirements
• Read-only site mode
• Traffic control: shadow load
• Cross country MySQL replication + sufficient bandwidth
• Bot testing
• Read-only live site testing w/ QA
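The first two requirements can be sketched in a few lines. This is a hedged illustration, not SmugMug's implementation: `allow_request` shows a read-only gate that lets only safe HTTP methods through, and `should_shadow` shows deterministic sampling for deciding which requests to copy to the new environment as shadow load. Both function names and the percentage-based sampling scheme are assumptions made for this example.

```python
import zlib

# HTTP methods that do not modify state, so they stay allowed in read-only mode.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def allow_request(method: str, read_only: bool) -> bool:
    """Return True if the request may proceed given the site mode."""
    return (not read_only) or method.upper() in SAFE_METHODS


def should_shadow(request_id: str, percent: int) -> bool:
    """Decide whether to copy this request to the new stack as shadow load.

    Hashing the request ID into one of 100 buckets gives a stable decision
    for the same ID, and lets the shadowed share be ramped from 0 to 100.
    """
    bucket = zlib.crc32(request_id.encode("utf-8")) % 100
    return bucket < percent


# Example: the site is read-only, and 25% of traffic is shadowed.
for method, request_id in [("GET", "req-1"), ("POST", "req-2")]:
    print(method, allow_request(method, read_only=True), should_shadow(request_id, 25))
```

Shadowed requests would be sent to the new stack with their responses discarded, so the new environment sees realistic load without affecting users.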
More on moving
• Full scale read-write testing is difficult
• Be aware of AWS limits
• Talk to support for big growth
• Roll back plan - manage risky change
Flipping the switch to AWS
• “The biggest, scariest engineering change we've made in the company's history” - Don, SmugMug Chief Geek
• Go read-only (1 min)
• Pre-Scale up big
• MHA to reassign MySQL masters and their replication (30 min)
• Point DNS+CDN to Elastic Load Balancing (5-30 min)
Flipping the switch to AWS
• Test! (60 min)
• When read-only is all good, go to read-write (5 min)
• Test! Inevitable bugs at this step (hours)
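The cutover steps and their time estimates from the two slides above can be written down as a small runbook table to estimate how long the site stays read-only. This is a sketch: the `Step` type and `timed_window` helper are hypothetical names, and steps without a numeric estimate on the slides (pre-scaling, and the final testing listed as "hours") are left untimed rather than guessed.

```python
from typing import List, NamedTuple, Optional, Tuple


class Step(NamedTuple):
    name: str
    min_minutes: Optional[int]  # None when the slides give no number
    max_minutes: Optional[int]


# Order and durations are taken from the slides.
CUTOVER_STEPS: List[Step] = [
    Step("Go read-only", 1, 1),
    Step("Pre-scale up big", None, None),
    Step("MHA reassigns MySQL masters and replication", 30, 30),
    Step("Point DNS + CDN to Elastic Load Balancing", 5, 30),
    Step("Test in read-only mode", 60, 60),
    Step("Go read-write", 5, 5),
    Step("Test in read-write mode", None, None),  # "hours" on the slide
]


def timed_window(steps: List[Step]) -> Tuple[int, int]:
    """Sum the minimum and maximum minutes over steps that have estimates."""
    timed = [
        s for s in steps
        if s.min_minutes is not None and s.max_minutes is not None
    ]
    return (sum(s.min_minutes for s in timed), sum(s.max_minutes for s in timed))


window = timed_window(CUTOVER_STEPS)
print(f"timed steps take {window[0]} to {window[1]} minutes")
```

Summing the timed steps gives a read-only window of roughly 101 to 126 minutes, plus however long pre-scaling takes.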
MHA?
• Master High Availability manager for MySQL (Facebook, DeNA)
• Helps to reliably reassign MySQL masters and replication, maintaining consistency
MHA?
• Manual failover in MySQL 5.5 and earlier is painful, time-consuming
• Be careful with automation for rare events: it can bite
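To show why reassigning masters by hand is error-prone, the sketch below illustrates one part of the problem: finding the replica that has read the furthest from the old master, so that no replicated writes are lost when it is promoted. This is a simplified illustration and not MHA's code. The two position fields mirror the `Master_Log_File` and `Read_Master_Log_Pos` columns of MySQL's `SHOW SLAVE STATUS`; the `ReplicaStatus` type, host names, and values are hypothetical.

```python
from typing import List, NamedTuple


class ReplicaStatus(NamedTuple):
    host: str
    master_log_file: str      # e.g. "mysql-bin.000042"
    read_master_log_pos: int  # byte offset read within that file


def most_advanced_replica(replicas: List[ReplicaStatus]) -> ReplicaStatus:
    """Return the replica that has read the furthest in the master's binlog.

    Compares (file, position) pairs. Comparing file names as strings is
    only valid when they share a prefix and a zero-padded numeric suffix,
    which is assumed here.
    """
    if not replicas:
        raise ValueError("no replicas to choose from")
    return max(replicas, key=lambda r: (r.master_log_file, r.read_master_log_pos))


# Hypothetical replication state at the moment of cutover.
replicas = [
    ReplicaStatus("db-a", "mysql-bin.000041", 900),
    ReplicaStatus("db-b", "mysql-bin.000042", 120),
    ReplicaStatus("db-c", "mysql-bin.000042", 80),
]

candidate = most_advanced_replica(replicas)
print(candidate.host)
```

A real failover also has to bring the other replicas up to the same point and repoint them at the new master, which is the part a tool like MHA is meant to make reliable.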
Problems?
• Completely redundant network links can fail
• Bugs related to IP address change
• ElastiCache performance
• New Relic! Use it or a similar APM product
Results
• Data Center - performance fluctuated through day
• AWS w/scaling - flat performance throughout the day - significant scalability limits removed
• Networking was a key improvement
• Success!
Lessons Learned
• We love AWS even more than before
• Automate everything
• Understand Amazon EBS, and understand underlying details of AWS services
• Unpredictable Ops schedules vs. large projects
Lessons Learned
Job #1: Making business happen
We made more changes, because we could
• As long as we’re moving our infrastructure, why not rebuild most of it too?
• Linux, MySQL, package versions upgraded
• New monitoring tools
• NFS dependencies eliminated, moved to Amazon S3 or DynamoDB
• Code pushes managed by nice distributed tools utilizing Amazon S3 + internal torrent
One last thing...
• Go Multi-availability-zone!
• Load balancers send traffic to multiple haproxy per AZ with AZ-specific web clusters, DB replicas
• Backed up w/ cross AZ
• Keep SPOFs in one AZ
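The AZ-specific routing on this slide can be sketched as a lookup that prefers web servers in the same availability zone and falls back to other zones only when the local cluster is empty. The fallback behavior, the function name, and the AZ and server names are assumptions for illustration; the talk only states that each AZ has its own haproxy layer, web cluster, and DB replicas.

```python
from typing import Dict, List


def pick_backends(local_az: str, pools: Dict[str, List[str]]) -> List[str]:
    """Return the web servers an haproxy in `local_az` should send traffic to.

    Prefers servers in the same AZ. If none are available there, falls
    back to every server in the other AZs (an assumed policy).
    """
    local = pools.get(local_az, [])
    if local:
        return list(local)
    return [
        server
        for az in sorted(pools)
        if az != local_az
        for server in pools[az]
    ]


# Hypothetical healthy web servers per availability zone.
web_pools = {
    "us-east-1a": ["web-a1", "web-a2"],
    "us-east-1b": ["web-b1"],
    "us-east-1c": [],
}

print(pick_backends("us-east-1a", web_pools))
print(pick_backends("us-east-1c", web_pools))
```

Keeping traffic inside one AZ in the normal case avoids cross-AZ latency, while the fallback keeps the site serving if a whole web cluster is lost.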
Questions?
Andrew Shieh, Sunnyvale, [email protected]
@shandrew
http://www.smugmug.com/ http://pics.shieh.info/
Thank you!