Epidemic Failures

Slides originally written in April 2013 for a private conference and internal use at Netflix. Publishing now since Heartbleed is another example of an epidemic failure mode.

Transcript of Epidemic Failures

  • 1. Cloud Native and Epidemic Failures April 2014 Adrian Cockcroft @adrianco @BatteryVentures http://www.linkedin.com/in/adriancockcroft

  • 2. Cloud Native? / Epidemic Failures / Automated Diversity

  • 3. Cloud Native
      • Construct a highly agile and highly available service from ephemeral and often broken components

  • 4. Inspiration

  • 5. Occam's Razor
      • Numquam ponenda est pluralitas sine necessitate
      • Plurality must never be posited without necessity

  • 6. Monoculture
      • Replicate the best as patterns
      • Reduce interaction complexity
      • Epidemic single point of failure

  • 7. Pattern Failures
      • Infrastructure Pattern Failures
      • Software Stack Pattern Failures
      • Application Pattern Failures

  • 8. Infrastructure Pattern Failures
      • Device failures: bad batch of disks, PSUs, etc.
      • CPU failures: cache corruption, math errors
      • Datacenter failures: power, network, disaster
      • Routing failures: DNS, Internet/ISP path

  • 9. Software Stack Pattern Failures
      • Time bombs: Counter wrap, memory leak
      • Date bombs: Leap year, leap second, epoch
      • Expiration: Certs timing out
      • Trust revocation: Certificate Authority fails
      • Security exploit: e.g. Heartbleed
      • Language bugs: compile time
      • Runtime bugs: JVM, Linux, Hypervisor
      • Network bugs: routers, firewalls, protocols

  • 10. Application Pattern Failures
      • Time bombs: Counter wrap, memory leak
      • Date bombs: Leap year, leap second, epoch
      • Content bombs: Data dependent failure
      • Configuration: wrong/bad syntax
      • Versioning: incompatible mixes
      • Cascading failures: error handling bugs etc.
      • Cascading overload: excessive logging etc.

  • 11. What to do?
      • Automated diversity management
      • Diversified automation
      • Efficient vs. Antifragile

  • 12. Specific Ideas
      • Automate running a mixture
          • Diversity as default for any service stack
          • No developer overhead, stay agile, low cost
          • Support oldest and newest versions together
          • Automate running 50/50 mix CentOS/Ubuntu
          • Mix versions of JDK, Tomcat, etc.
      • Vendor diversity
          • Multiple DNS vendors, cloud regions, costs more
          • Multiple cloud vendors? Much higher cost.

  • 13. Generate Permutations

        > epi
        epi  java   linux   codeversion
        1    java6  centos  v34
        2    java7  centos  v34
        3    java6  ubuntu  v34
        4    java7  ubuntu  v34
        5    java6  centos  v35
        6    java7  centos  v35
        7    java6  ubuntu  v35
        8    java7  ubuntu  v35

  • 14. Deployment
      • Builds
          • Manual to test, automate if it works
          • Modify build to generate permutation AMIs
          • Modify Asgard to auto-deploy permutations
      • Data collection
          • Tag each instance with its permutation
          • Gather metrics by permutation per instance
          • Do R-based Design of Experiments analysis

  • 15. Analysis
      • As a function of permutations
          • Error rate
          • Response time
          • CPU Utilization
      • Interactions
          • E.g. interaction between linux and java
      • Contrasts identify components with issues
      • Small changes with high statistical significance

  • 16. GCS Total API Outage for ~1hr

  • 17. Takeaway
      • Watch out for monocultures
      • A|B Testing: it's not just for personalization
      • http://perfcap.blogspot.com
      • http://slideshare.net/adrianco - Netflix
      • http://slideshare.net/adriancockcroft - Battery
      • http://www.linkedin.com/in/adriancockcroft
      • @adrianco @BatteryVentures
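
The permutation table on slide 13 and the "contrasts" idea on slide 15 can be illustrated with a minimal sketch. The slides mention R for the analysis; Python is used here only for illustration. The factor levels are copied from the slide, while the function names `generate_permutations` and `main_effect` and the `metric_by_epi` mapping are hypothetical, and the contrast shown is a plain difference of means for one two-level factor rather than a full Design of Experiments analysis.

```python
from itertools import product
from statistics import mean

# Factor levels copied from the slide 13 table.
FACTORS = {
    "java": ["java6", "java7"],
    "linux": ["centos", "ubuntu"],
    "codeversion": ["v34", "v35"],
}


def generate_permutations(factors):
    """Return one dict per combination, with the first factor varying fastest.

    Hypothetical helper: reproduces the row order of the slide's table.
    """
    names = list(factors)
    # product() varies its LAST argument fastest, so pass the factors reversed
    # and flip each combination back into the original column order.
    reversed_levels = [factors[name] for name in reversed(names)]
    rows = []
    for epi, combo in enumerate(product(*reversed_levels), start=1):
        row = {"epi": epi}
        row.update(zip(names, reversed(combo)))
        rows.append(row)
    return rows


def main_effect(rows, metric_by_epi, factor, low, high):
    """Simple contrast: mean metric at level `high` minus mean at level `low`.

    `metric_by_epi` is an assumed mapping from permutation id to an observed
    metric such as error rate or response time.
    """
    high_vals = [metric_by_epi[r["epi"]] for r in rows if r[factor] == high]
    low_vals = [metric_by_epi[r["epi"]] for r in rows if r[factor] == low]
    return mean(high_vals) - mean(low_vals)


if __name__ == "__main__":
    # Print the permutation table in the same column order as the slide.
    for row in generate_permutations(FACTORS):
        print(row["epi"], row["java"], row["linux"], row["codeversion"])
```

With metrics gathered per permutation, a call such as `main_effect(rows, metric_by_epi, "java", "java6", "java7")` gives a rough indication of whether one component level is associated with worse behaviour. Judging statistical significance, as the slide suggests, would need replicated measurements and a proper analysis.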