Five Years of EC2 Distilled
Grig Gheorghiu
Silicon Valley Cloud Computing Meetup, Feb. 19th, 2013
@griggheo · agiletesting.blogspot.com
whoami
• Dir of Technology at Reliam (managed hosting)
• Sr Sys Architect at OpenX
• VP Technical Ops at Evite
• VP Technical Ops at Nasty Gal
EC2 creds
• Started with personal m1.small instance in 2008
• Still around!
• Uptime: 5:13:52 up 438 days, 23:33, 1 user, load average: 0.03, 0.09, 0.08
EC2 at OpenX
• end of 2008
• 100s then 1000s of instances
• one of largest AWS customers at the time
• NAMING is very important
• Terminated a DB server by mistake
• In an ideal world, naming wouldn't matter
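The "terminated a DB server by mistake" lesson can be made mechanical: validate instance names against a strict convention and refuse destructive actions on anything that looks like a production database. The convention and role/environment names below are invented for illustration, not from the talk:

```python
import re

# Hypothetical convention: <env>-<role>-<NN>, e.g. "prod-db-01".
# The envs and roles here are illustrative, not OpenX's actual scheme.
NAME_RE = re.compile(r"^(?P<env>prod|stage|dev)-(?P<role>db|web|cache)-(?P<idx>\d{2})$")

def parse_name(name):
    """Return (env, role, idx), or raise ValueError for a non-conforming name."""
    m = NAME_RE.match(name)
    if not m:
        raise ValueError("instance name %r does not match <env>-<role>-<NN>" % name)
    return m.group("env"), m.group("role"), m.group("idx")

def safe_to_terminate(name):
    """Refuse to terminate production databases without a human double-check."""
    env, role, _ = parse_name(name)
    return not (env == "prod" and role == "db")
```

A tool built this way fails loudly on a typo instead of quietly destroying the wrong box.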
EC2 at OpenX (cont.)
• Failures are very frequent at scale
• Forced to architect for failure and horizontal scaling
• Hard to scale all layers at the same time (scaling the app server layer can overwhelm the DB layer; play whack-a-mole)
• Elasticity: easier to scale out than scale back
EC2 at OpenX (cont.)
• Automation and configuration management become critical
• Used little-known tool - ‘slack’
• Rolled own EC2 management tool in Python, wrapped around EC2 Java API
• Testing deployments is critical (one mistake can get propagated everywhere)
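Since one bad deployment gets propagated everywhere, a canary step is a cheap guard: push to one or two nodes, health-check them, and abort before touching the rest of the fleet. A minimal sketch, with `deploy` and `health_check` as caller-supplied callables (hypothetical, not the talk's actual tool):

```python
def rolling_deploy(nodes, deploy, health_check, canary_count=1):
    """Deploy to `canary_count` nodes first; abort before touching the
    rest of the fleet if any canary fails its health check."""
    canaries, rest = nodes[:canary_count], nodes[canary_count:]
    done = []
    for node in canaries:
        deploy(node)
        done.append(node)
        if not health_check(node):
            return {"deployed": done, "aborted": True}
    for node in rest:
        deploy(node)
        done.append(node)
    return {"deployed": done, "aborted": False}
```

With configuration management in the loop, the same idea applies to config changes, not just code pushes.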
EC2 at OpenX (cont.)
• Hard to scale at the DB layer (MySQL)
• mysql-proxy for r/w split
• slaves behind HAProxy for reads
• HAProxy for LB, then ELB
• ELB melted initially, had to be gradually warmed up
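The read/write split that mysql-proxy performed boils down to a routing decision: writes go to the master, reads fan out across the slave pool (the pool kept behind HAProxy). A toy illustration of that decision, with made-up hostnames:

```python
import itertools

class ReadWriteRouter:
    """Toy sketch of a MySQL read/write split: writes hit the master,
    reads round-robin across the slaves. Hostnames are invented."""

    READ_VERBS = ("SELECT", "SHOW", "DESCRIBE", "EXPLAIN")

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def route(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in self.READ_VERBS:
            return next(self._slaves)
        return self.master
```

Real proxies also have to worry about transactions, replication lag, and statements that read *and* write; this only shows the happy path.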
EC2 at Evite
• Sharded MySQL at DB layer; application very write-intensive
• Didn’t do proper capacity planning/dark launching; had to move quickly from data center to EC2 to scale horizontally
• Engaged Percona at the same time
EC2 at Evite (cont.)
• Started with EBS volumes (separate for data, transaction logs, temp files)
• EBS horror stories
• CPU I/O wait up to 100%, instances went AWOL
• I/O very inconsistent, unpredictable
• Striped EBS volumes in RAID0 helps with performance but not with reliability
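The "performance but not reliability" point is simple arithmetic: a RAID0 array survives only if *every* stripe member survives, so adding volumes multiplies exposure. A quick check, using an assumed, purely illustrative per-volume failure probability:

```python
def raid0_failure_probability(per_volume_p, n_volumes):
    """Probability a RAID0 array loses data over some period, assuming
    independent volume failures with probability `per_volume_p` each
    (the figure is illustrative; real EBS failure rates vary)."""
    survive_all = (1.0 - per_volume_p) ** n_volumes
    return 1.0 - survive_all
```

With a 1% per-volume failure chance, a 4-volume stripe is roughly four times as likely to lose data as a single volume, which is why striping helped Evite's throughput but not its pager.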
EC2 at Evite (cont.)
• EBS apocalypse in April 2011
• Hit us even with masters and slaves in diff. availability zones (but all in single region - mistake!)
• IMPORTANT: rebuilding redundancy into your system is HARD
• For DB servers, reloading data on new server is a lengthy process
EC2 at Evite (cont.)
• General operation: very frequent failures (once a week); nightmare for pager duty
• Got very good at disaster recovery!
• Failover of master to slave
• Rebuilding of slave from master (xtrabackup)
• Local disks striped in RAID0 better than EBS
EC2 at Evite (cont.)
• Ended up moving DB servers back to data center
• Bare metal (Dell C2100, 144 GB RAM, RAID10); 2 MySQL instances per server
• Lots of tuning help from Percona
• BUT: EC2 was great for capacity planning! (Zynga does the same)
EC2 at Evite (cont.)
• Relational databases are not ready for the cloud (reliability, I/O performance)
• Still keep MySQL slaves in EC2 for DR
• Ryan Mack (Facebook): “We chose well-understood technologies so we could better predict capacity needs and rely on our existing monitoring and operational tool kits."
EC2 at Evite (cont.)
• Didn’t use provisioned IOPS for EBS
• Didn’t use VPC
• Great experience with Elastic Map Reduce, S3, Route 53 DNS
• Not so great experience with DynamoDB
• ELB OK but still need HAProxy behind it
EC2 at NastyGal
• VPC - really good idea!
• Extension of data center infrastructure
• Currently using it for dev/staging + some internal backend production
• Challenging to set up VPN tunnels to various firewall vendors (Cisco, Fortinet) - not much debugging on VPC side
Interacting with AWS
• AWS API (mostly Java-based, but also Ruby and Python)
• Multi-cloud libraries: jclouds (Java), libcloud (Python), deltacloud (Ruby)
• Chef knife
• Vagrant EC2 provider
• Roll your own
Proper infrastructure care and feeding
• Monitoring - alerting, logging, graphing
• It’s not in production if it’s not monitored and graphed
• Monitoring is for ops what testing is for dev
• Great way to learn a new infrastructure
• Dev and ops on pager
Proper infrastructure care and feeding
• Going from #monitoringsucks to #monitoringlove and @monitorama
• Modern monitoring/graphing/logging tools
• Sensu, Graphite, Boundary, Server Density, New Relic, Papertrail, Pingdom, Dead Man’s Snitch
Proper infrastructure care and feeding
• Dashboards!
• Mission Control page with graphs based on Graphite and Google Visualization API
• Correlate spikes and dips in graphs with errors (external and internal monitoring)
• Akamai HTTP 500 alerts correlated with Web server 500 errors and DB server I/O wait increase
Proper infrastructure care and feeding
• HTTP 500 errors as a percentage of all HTTP requests across all app servers in the last 60 minutes
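The dashboard metric described above is a straightforward aggregation; a sketch of the computation, with the data shape (one error/total counter pair per app server for the window) assumed for illustration:

```python
def error_rate_percent(samples):
    """Fleet-wide HTTP 500s as a percentage of all requests.
    `samples` is an iterable of (http_500_count, total_request_count)
    tuples, e.g. one per app server over the last 60 minutes
    (shape assumed for illustration)."""
    errors = sum(e for e, _ in samples)
    total = sum(t for _, t in samples)
    return 100.0 * errors / total if total else 0.0
```

Graphing this single number, rather than per-server counts, is what makes the spikes easy to correlate with external alerts like Akamai's.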
Proper infrastructure care and feeding
• Expect failures and recover quickly
• Capacity planning
• Dark launching
• Measure baselines
• Correlate external symptoms (HTTP 500) with metrics (CPU I/O Wait) then keep metrics under certain thresholds by adding resources
Proper infrastructure care and feeding
• Automate, automate, automate! - Chef, Puppet, CFEngine, Jenkins, Capistrano, Fabric
• Chef - can be single source of truth for infrastructure
• Running chef-client continuously on nodes requires discipline
• Logging into remote nodes to make manual changes is an anti-pattern (hard to resist!)
Proper infrastructure care and feeding
• Chef best practices
• Use knife - no snowflakes!
• Deploy new nodes, don’t do massive updates in place
• BUT! beware of OS monoculture
• e.g. a kernel bug triggered after 200+ days of uptime
• e.g. the leap-second "leapocalypse"
Is the cloud worth the hype?
• It’s a game changer, but it’s not magical; try before you buy! (benchmarks could surprise you)
• Cloud expert? Carry pager or STFU
• Forces you to think about failure recovery, horizontal scalability, automation
• Something to be said for abstracting away the physical network - the most obscure bugs are network-related (ARP caching, routing tables)
So...when should I use the cloud?
• Great for dev/staging/testing
• Great for layers of infrastructure that contain many identical nodes and that are forgiving of node failures (web farms, Hadoop nodes, distributed databases)
• Not great for ‘snowflake’-type systems
• Not great for RDBMS (esp. write-intensive)
If you still want to use the cloud
• Watch that monthly bill!
• Use multiple cloud vendors
• Design your infrastructure to scale horizontally and to be portable across cloud vendors
• Shared nothing
• No SAN, NAS
If you still want to use the cloud
• Don’t get locked into vendor-proprietary services
• EC2, S3, Route 53, EMR are OK
• Data stores are not OK (DynamoDB)
• OpsWorks - debatable (based on Chef, but still locks you in)
• Wrap services in your own RESTful endpoints
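"Wrap services in your own endpoints" amounts to putting a vendor-neutral facade between application code and the cloud API, so only one adapter knows whether S3, another provider, or local disk sits behind it. A minimal sketch; the adapter protocol (`get`/`put`) is invented for illustration:

```python
class BlobStore:
    """Vendor-neutral facade: application code depends on this interface,
    never on a specific cloud SDK. The backend must provide
    get(key) and put(key, data) (protocol invented for illustration)."""

    def __init__(self, backend):
        self._backend = backend

    def put(self, key, data):
        self._backend.put(key, data)

    def get(self, key):
        return self._backend.get(key)


class InMemoryBackend:
    """Stand-in adapter for tests; a real one would wrap an S3 client."""

    def __init__(self):
        self._data = {}

    def put(self, key, data):
        self._data[key] = data

    def get(self, key):
        return self._data[key]
```

Swapping vendors then means writing one new adapter, not rewriting every caller - the portability the slide argues for.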
Does EC2 have rivals?
• No (or at least not yet)
• Anybody use GCE?
• Other public clouds are either toys or smaller, with fewer features (no names named)
• Perception matters - not a contender unless featured on High Scalability blog
• APIs matter less (can use multi-cloud libs)
Does EC2 have rivals?
• OpenStack, CloudStack, Eucalyptus all seem promising
• Good approach: private infrastructure (bare metal, private cloud) for performance/reliability + extension into public cloud for elasticity/agility (EC2 VPC, RackConnect)
• How about PaaS?
• Personally: too hard to relinquish control