Five Years of EC2 Distilled
Grig Gheorghiu
Silicon Valley Cloud Computing Meetup, Feb. 19th, 2013
@griggheo · agiletesting.blogspot.com
whoami
• Dir of Technology at Reliam (managed hosting)
• Sr Sys Architect at OpenX
• VP Technical Ops at Evite
• VP Technical Ops at Nasty Gal
EC2 creds
• Started with personal m1.small instance in 2008
• Still around!
• Uptime: 5:13:52 up 438 days, 23:33, 1 user, load average: 0.03, 0.09, 0.08
EC2 at OpenX
• end of 2008
• 100s then 1000s of instances
• one of largest AWS customers at the time
• NAMING is very important
• Terminated a DB server by mistake
• In an ideal world, naming wouldn't matter
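The "terminated a DB server by mistake" lesson can be made mechanical: validate instance names against a strict convention and refuse destructive actions on anything that looks like a production database. The convention and role/environment names below are invented for illustration, not from the talk:

```python
import re

# Hypothetical convention: <env>-<role>-<NN>, e.g. "prod-db-01".
# The envs and roles here are illustrative, not OpenX's actual scheme.
NAME_RE = re.compile(r"^(?P<env>prod|stage|dev)-(?P<role>db|web|cache)-(?P<idx>\d{2})$")

def parse_name(name):
    """Return (env, role, idx), or raise ValueError for a non-conforming name."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError("instance name %r does not match <env>-<role>-<NN>" % name)
    return m.group("env"), m.group("role"), m.group("idx")

def safe_to_terminate(name):
    """Refuse to terminate production databases without a human double-check."""
    env, role, _ = parse_name(name)
    return not (env == "prod" and role == "db")
```

A tool built this way fails loudly on a typo instead of quietly destroying the wrong box.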
EC2 at OpenX (cont.)
• Failures are very frequent at scale
• Forced to architect for failure and horizontal scaling
• Hard to scale all layers at the same time (scaling the app server layer can overwhelm the DB layer; play whack-a-mole)
• Elasticity: easier to scale out than scale back
EC2 at OpenX (cont.)
• Automation and configuration management become critical
• Used little-known tool - ‘slack’
• Rolled own EC2 management tool in Python, wrapped around EC2 Java API
• Testing deployments is critical (one mistake can get propagated everywhere)
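Since one bad deployment gets propagated everywhere, a canary step is a cheap guard: push to one or two nodes, health-check them, and abort before touching the rest of the fleet. A minimal sketch, with `deploy` and `health_check` as caller-supplied callables (hypothetical, not the talk's actual tool):

```python
def rolling_deploy(nodes, deploy, health_check, canary_count=1):
    """Deploy to `canary_count` nodes first; abort before touching the
    rest of the fleet if any canary fails its health check."""
    canaries, rest = nodes[:canary_count], nodes[canary_count:]
    done = []
    for node in canaries:
        deploy(node)
        done.append(node)
        if not health_check(node):
            return {"deployed": done, "aborted": True}
    for node in rest:
        deploy(node)
        done.append(node)
    return {"deployed": done, "aborted": False}
```

With configuration management in the loop, the same idea applies to config changes, not just code pushes.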
EC2 at OpenX (cont.)
• Hard to scale at the DB layer (MySQL)
• mysql-proxy for r/w split
• slaves behind HAProxy for reads
• HAProxy for LB, then ELB
• ELB melted initially, had to be gradually warmed up
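The read/write split that mysql-proxy performed boils down to a routing decision: writes go to the master, reads fan out across the slave pool (the pool kept behind HAProxy). A toy illustration of that decision, with made-up hostnames:

```python
import itertools

class ReadWriteRouter:
    """Toy sketch of a MySQL read/write split: writes hit the master,
    reads round-robin across the slaves. Hostnames are invented."""

    READ_VERBS = ("SELECT", "SHOW", "DESCRIBE", "EXPLAIN")

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def route(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.READ_VERBS:
            return next(self._slaves)
        return self.master
```

Real proxies also have to worry about transactions, replication lag, and statements that read *and* write; this only shows the happy path.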
EC2 at Evite
• Sharded MySQL at DB layer; application very write-intensive
• Didn’t do proper capacity planning/dark launching; had to move quickly from data center to EC2 to scale horizontally
• Engaged Percona at the same time
EC2 at Evite (cont.)
• Started with EBS volumes (separate for data, transaction logs, temp files)
• EBS horror stories
• CPU I/O wait up to 100%, instances went AWOL
• I/O very inconsistent, unpredictable
• Striped EBS volumes in RAID0 helps with performance but not with reliability
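The "performance but not reliability" point is simple arithmetic: a RAID0 array survives only if *every* stripe member survives, so adding volumes multiplies exposure. A quick check, using an assumed, purely illustrative per-volume failure probability:

```python
def raid0_failure_probability(per_volume_p, n_volumes):
    """Probability a RAID0 array loses data over some period, assuming
    independent volume failures with probability `per_volume_p` each
    (the figure is illustrative; real EBS failure rates vary)."""
    survive_all = (1.0 - per_volume_p) ** n_volumes
    return 1.0 - survive_all
```

With a 1% per-volume failure chance, a 4-volume stripe is roughly four times as likely to lose data as a single volume, which is why striping helped Evite's throughput but not its pager.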
EC2 at Evite (cont.)
• EBS apocalypse in April 2011
• Hit us even with masters and slaves in diff. availability zones (but all in single region - mistake!)
• IMPORTANT: rebuilding redundancy into your system is HARD
• For DB servers, reloading data on new server is a lengthy process
EC2 at Evite (cont.)
• General operation: very frequent failures (once a week); nightmare for pager duty
• Got very good at disaster recovery!
• Failover of master to slave
• Rebuilding of slave from master (xtrabackup)
• Local disks striped in RAID0 better than EBS
EC2 at Evite (cont.)
• Ended up moving DB servers back to data center
• Bare metal (Dell C2100, 144 GB RAM, RAID10); 2 MySQL instances per server
• Lots of tuning help from Percona
• BUT: EC2 was great for capacity planning! (Zynga does the same)
EC2 at Evite (cont.)
• Relational databases are not ready for the cloud (reliability, I/O performance)
• Still keep MySQL slaves in EC2 for DR
• Ryan Mack (Facebook): “We chose well-understood technologies so we could better predict capacity needs and rely on our existing monitoring and operational tool kits."
EC2 at Evite (cont.)
• Didn’t use provisioned IOPS for EBS
• Didn’t use VPC
• Great experience with Elastic Map Reduce, S3, Route 53 DNS
• Not so great experience with DynamoDB
• ELB OK but still need HAProxy behind it
EC2 at NastyGal
• VPC - really good idea!
• Extension of data center infrastructure
• Currently using it for dev/staging + some internal backend production
• Challenging to set up VPN tunnels to various firewall vendors (Cisco, Fortinet) - not much debugging on VPC side
Interacting with AWS
• AWS API (mostly Java-based, but also Ruby and Python)
• Multi-cloud libraries: jclouds (Java), libcloud (Python), deltacloud (Ruby)
• Chef knife
• Vagrant EC2 provider
• Roll your own
Proper infrastructure care and feeding
• Monitoring - alerting, logging, graphing
• It’s not in production if it’s not monitored and graphed
• Monitoring is for ops what testing is for dev
• Great way to learn a new infrastructure
• Dev and ops on pager
Proper infrastructure care and feeding
• Going from #monitoringsucks to #monitoringlove and @monitorama
• Modern monitoring/graphing/logging tools
• Sensu, Graphite, Boundary, Server Density, New Relic, Papertrail, Pingdom, Dead Man’s Snitch
Proper infrastructure care and feeding
• Dashboards!
• Mission Control page with graphs based on Graphite and Google Visualization API
• Correlate spikes and dips in graphs with errors (external and internal monitoring)
• Akamai HTTP 500 alerts correlated with Web server 500 errors and DB server I/O wait increase
Proper infrastructure care and feeding
• HTTP 500 errors as a percentage of all HTTP requests across all app servers in the last 60 minutes
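The dashboard metric described above is a straightforward aggregation; a sketch of the computation, with the data shape (one error/total counter pair per app server for the window) assumed for illustration:

```python
def error_rate_percent(samples):
    """Fleet-wide HTTP 500s as a percentage of all requests.
    `samples` is an iterable of (http_500_count, total_request_count)
    tuples, e.g. one per app server over the last 60 minutes
    (shape assumed for illustration)."""
    errors = sum(e for e, _ in samples)
    total = sum(t for _, t in samples)
    return 100.0 * errors / total if total else 0.0
```

Graphing this single number, rather than per-server counts, is what makes the spikes easy to correlate with external alerts like Akamai's.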
Proper infrastructure care and feeding
• Expect failures and recover quickly
• Capacity planning
• Dark launching
• Measure baselines
• Correlate external symptoms (HTTP 500) with metrics (CPU I/O Wait) then keep metrics under certain thresholds by adding resources
Proper infrastructure care and feeding
• Automate, automate, automate! - Chef, Puppet, CFEngine, Jenkins, Capistrano, Fabric
• Chef - can be single source of truth for infrastructure
• Running chef-client continuously on nodes requires discipline
• Logging into remote nodes to make manual changes is an anti-pattern (hard to resist!)
Proper infrastructure care and feeding
• Chef best practices
• Use knife - no snowflakes!
• Deploy new nodes, don’t do massive updates in place
• BUT! beware of OS monoculture
• e.g. a kernel bug triggered after 200+ days of uptime
• e.g. the leap-second "leapocalypse"
Is the cloud worth the hype?
• It’s a game changer, but it’s not magical; try before you buy! (benchmarks could surprise you)
• Cloud expert? Carry pager or STFU
• Forces you to think about failure recovery, horizontal scalability, automation
• Something to be said for abstracting away the physical network - the most obscure bugs are network-related (ARP caching, routing tables)
So...when should I use the cloud?
• Great for dev/staging/testing
• Great for layers of infrastructure that contain many identical nodes and that are forgiving of node failures (web farms, Hadoop nodes, distributed databases)
• Not great for ‘snowflake’-type systems
• Not great for RDBMS (esp. write-intensive)
If you still want to use the cloud
• Watch that monthly bill!
• Use multiple cloud vendors
• Design your infrastructure to scale horizontally and to be portable across cloud vendors
• Shared nothing
• No SAN, NAS
If you still want to use the cloud
• Don’t get locked into vendor-proprietary services
• EC2, S3, Route 53, EMR are OK
• Data stores are not OK (DynamoDB)
• OpsWorks - debatable (based on Chef, but still locks you in)
• Wrap services in your own RESTful endpoints
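"Wrap services in your own endpoints" amounts to putting a vendor-neutral facade between application code and the cloud API, so only one adapter knows whether S3, another provider, or local disk sits behind it. A minimal sketch; the adapter protocol (`get`/`put`) is invented for illustration:

```python
class BlobStore:
    """Vendor-neutral facade: application code depends on this interface,
    never on a specific cloud SDK. The backend must provide
    get(key) and put(key, data) (protocol invented for illustration)."""

    def __init__(self, backend):
        self._backend = backend

    def put(self, key, data):
        self._backend.put(key, data)

    def get(self, key):
        return self._backend.get(key)


class InMemoryBackend:
    """Stand-in adapter for tests; a real one would wrap an S3 client."""

    def __init__(self):
        self._data = {}

    def put(self, key, data):
        self._data[key] = data

    def get(self, key):
        return self._data[key]
```

Swapping vendors then means writing one new adapter, not rewriting every caller - the portability the slide argues for.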
Does EC2 have rivals?
• No (or at least not yet)
• Anybody use GCE?
• Other public clouds are either toys or smaller, with fewer features (no names named)
• Perception matters - not a contender unless featured on High Scalability blog
• APIs matter less (can use multi-cloud libs)
Does EC2 have rivals?
• OpenStack, CloudStack, Eucalyptus all seem promising
• Good approach: private infrastructure (bare metal, private cloud) for performance/reliability + extension into public cloud for elasticity/agility (EC2 VPC, RackConnect)
• How about PaaS?
• Personally: too hard to relinquish control