Download - The Mushroom Cloud Effect - What happens when containers fail?

@mayralois

Mushroom Cloud Effect

Alois MayrTechnology Lead – Dynatrace

@mayralois

Monitoring for Highly DynamicContainer Environments

@mayraloisSource: http://www.schoonoart.de/

…there’s been the mushroom cloud effect

oh yeah, everything screwed up

@mayralois

The Mushroom Cloud Effector

What Happens When Containers Fail?

@mayralois

Biggest LatAm e-commerce Company

• ~ U$ 2.5 billion revenue• 4 sites: Americanas, Shoptime,

Submarino, Soubarato

• ~ 150 hosts across 4 regions• 5k-15k containers• 1k-3k services

@mayralois

TL;DR

@mayralois

About Cloud-Scale Systems

@mayralois

Important Aspects…

• Lots of (micro-)services

• Lots of communication between services

• Service dependencies

• Versioning and API compatibilities

• Zero downtime

@mayralois

Platform-related Aspects

• Most often container-based

• Clustered for scalability

• Ephemeral containers

• Resilient architecture

• Cross AZ fail-overs

• SDN for communication

@mayralois

Deployments are no Longer Static

7:00 a.m.Low load, service runningwith minimum redundancy

12:00 p.m.Scaled up service during peak loadwith failover of problematic node

7:00 p.m.Scaled back down to lower load, move to different geolocation

@mayralois

Anatomy of dynamic environments

https://www.dynatrace.com/en/ruxit/

@mayralois

All About (Service) Dependencies

@mayralois

Failing containers…

…may or may not have an (immediate) impact on service performance

@mayralois

Cascading Failures Lead to aMushroom Cloud Effect

@mayralois

@mayralois

The Hungry Container Breakdown

• Shared /logs partition on host• No log rotation, no archiving for app logs• No proper log management used for Docker environment• Shared /logs partition ran out of space

What was the problem?

@mayralois


• Container health checks failed• Orchestration killed container and rescheduled new one• Still no free space on /logs• Termination and rescheduling• /var/lib/docker ran out of space• Cluster nodes were no longer able to run any containers

How the problem has evolved over time?

@mayralois


• Services at the top of the graph• Increased failure rates• Lots of depending Tomcat and DB services affected

How the problem affected services?

@mayralois

@mayralois


Log management tools for app logs--log-driver=none|syslog

Remove container--rm=true

/var/lib/docker deserves its own partition

How the problem could have been avoided?

@mayralois


Buggy Containers May Kill Your Nodes

@mayralois

Try to Break Your Clusters Early(And be Prepared for Black Friday)

@mayralois

Break Your Clusters Early

Massive load testing!

Survive three days of pain

Include everythingServices, Containers,

Orchestration, EC2 instances

@mayralois

Testing everything

13.3k containers (+nodes)

3,451 services

@mayralois

@mayralois

Automation Needed to Pinpoint the Root Cause of Cascading Failures!

@mayralois

Want to learn more?

Stop by our booth!G15

@mayralois

Thank you!