The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started:...
Transcript of The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started:...
The Practice of Chaos Engineering
Ana MedinaChaos Engineer at Gremlin
@ana_m_medina
@ana_m_medina#reactive18
@ana_m_medina
Gremlin
UberSFEFCU Google Quicken LoansStanford University Miami Dade College
@ana_m_medina#reactive18
How many of you have heard of Chaos Engineering?
@ana_m_medina#reactive18
How many of you have run a Chaos Engineering
experiment?
@ana_m_medina#reactive18
What is Chaos Engineering?
@ana_m_medina#reactive18
Thoughtful, planned experiments designed to reveal the weakness in our systems.
Chaos Engineering
@ana_m_medina#reactive18
Inject something harmful to build an immunity.
-@KoltonAndrusGremlin Founder and CEO
Chaos Engineering
@ana_m_medina#reactive18
Why?● Microservices ● Systems are scaling
fast● Downtime is really
expensive● Our dependencies
will fail● Pager fatigue and
burnout really hurts
@ana_m_medina#reactive18
Use Cases:● Outage
reproduction● On-call training● Strengthen new
products ● Battle test new
infrastructure and services
@ana_m_medina#reactive18
What do you need before doing Chaos Engineering?
● Monitoring/Observability ● On-Call and Incident Management ● Cost of Downtime Per Hour
@ana_m_medina#reactive18
Chaos Engineering is not● Unexpected or unmonitored experiments● Creating outages
@ana_m_medina#reactive18
“Chaos Engineering Without Observability ... Is Just Chaos”
-@mipsytipsyCharity MajorsCEO of honeycomb
@ana_m_medina#reactive18
Minimize theBlast radius
THE BEGINNING
Chaos Monkey
Level 0
MATURITY REQUIRED
Low
APPROACH TAKEN
Random
VALUE PROVIDED
Prepare for host failures in the cloud
@ana_m_medina#reactive18
THE FIRST STEP
Infrastructure Failures
Level 1
MATURITY REQUIRED
Basic Operations
APPROACH TAKEN
Disciplined
VALUE PROVIDED
Prepare for host-level failures
@ana_m_medina#reactive18
INTERMEDIATE
Network Failures
Level 1.5
MATURITY REQUIRED
Networking expertise
APPROACH TAKEN
Gameday
VALUE PROVIDED
Prepare for high impact events
@ana_m_medina#reactive18
THE NEXT STEP
Application Failures
Level 2
MATURITY REQUIRED
Advanced
APPROACH TAKEN
Precision Experiments
VALUE PROVIDED
Safely validate the user experience
@ana_m_medina#reactive18
Latency added to 50% of android traffic
Exceptions - 50% of android traffic failed
@ana_m_medina#reactive18
You can and should inject chaos at every layer of your stack
● Application● API ● Caching● Database● Hardware● Cloud Infrastructure / Bare metal
@ana_m_medina#reactive18
Top places to inject chaos
@ana_m_medina#reactive18
@ana_m_medina#reactive18
https://www.gremlin.com/community/tutorials/what-i-learned-running-the-chaos-lab-kafka-breaks/
@ana_m_medina#reactive18
Getting Started:
● Identify top 5 critical systems ● Choose system● Whiteboard the system● Determine what experiment you want to run:
(resource, state, network)● Determine Blast Radius
@ana_m_medina#reactive18
Companies doing Chaos Engineering
@ana_m_medina#reactive18
Chaos Days
@ana_m_medina#reactive18
Chaos Days: Dedicated day for your entire company to focus on building resilience instead of new products.
https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/
@ana_m_medina#reactive18
“What could go wrong?”
“Do we know what will happen if this breaks?”
@ana_m_medina#reactive18
Chaos Day Crew:
VP Engineering / CTO / COOExecutive AssistantEngineering Director / ManagerSenior EngineerNew Grad / Intern Engineer
@ana_m_medina#reactive18
What experiments can you run?
• Reproduce outage conditions• Unpredictable circumstances • Large traffic spikes • Race conditions • Datacenter failure • Time travel - system clocks to be out of sync • Network errors • CPU overloads
@ana_m_medina#reactive18
@ana_m_medina#reactive18
@ana_m_medina#reactive18
@ana_m_medina#reactive18
@ana_m_medina#reactive18
@ana_m_medina#reactive18
gremlin.com/chaos-monkey/
@ana_m_medina#reactive18
Join the Chaos, Join Slack:
bit.ly/chaos-eng-slack1,900+ members across the world
Learn more: