The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started:...

38
The Practice of Chaos Engineering Ana Medina Chaos Engineer at Gremlin @ana_m_medina

Transcript of The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started:...

Page 1: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

The Practice of Chaos Engineering

Ana MedinaChaos Engineer at Gremlin

@ana_m_medina

Page 2: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

@ana_m_medina

Gremlin

UberSFEFCU Google Quicken LoansStanford University Miami Dade College

Page 3: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

How many of you have heard of Chaos Engineering?

Page 4: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

How many of you have run a Chaos Engineering

experiment?

Page 5: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

What is Chaos Engineering?

Page 6: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Thoughtful, planned experiments designed to reveal the weakness in our systems.

Chaos Engineering

Page 7: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Inject something harmful to build an immunity.

-@KoltonAndrusGremlin Founder and CEO

Chaos Engineering

Page 8: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Why?● Microservices ● Systems are scaling

fast● Downtime is really

expensive● Our dependencies

will fail● Pager fatigue and

burnout really hurts

Page 9: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Use Cases:● Outage

reproduction● On-call training● Strengthen new

products ● Battle test new

infrastructure and services

Page 10: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

What do you need before doing Chaos Engineering?

● Monitoring/Observability ● On-Call and Incident Management ● Cost of Downtime Per Hour

Page 11: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Chaos Engineering is not● Unexpected or unmonitored experiments● Creating outages

Page 12: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

“Chaos Engineering Without Observability ... Is Just Chaos”

-@mipsytipsyCharity MajorsCEO of honeycomb

Page 13: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Minimize theBlast radius

Page 14: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

THE BEGINNING

Chaos Monkey

Level 0

MATURITY REQUIRED

Low

APPROACH TAKEN

Random

VALUE PROVIDED

Prepare for host failures in the cloud

@ana_m_medina#reactive18

Page 15: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

THE FIRST STEP

Infrastructure Failures

Level 1

MATURITY REQUIRED

Basic Operations

APPROACH TAKEN

Disciplined

VALUE PROVIDED

Prepare for host-level failures

@ana_m_medina#reactive18

Page 16: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

INTERMEDIATE

Network Failures

Level 1.5

MATURITY REQUIRED

Networking expertise

APPROACH TAKEN

Gameday

VALUE PROVIDED

Prepare for high impact events

@ana_m_medina#reactive18

Page 17: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

THE NEXT STEP

Application Failures

Level 2

MATURITY REQUIRED

Advanced

APPROACH TAKEN

Precision Experiments

VALUE PROVIDED

Safely validate the user experience

@ana_m_medina#reactive18

Page 18: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

Latency added to 50% of android traffic

Page 19: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

Exceptions - 50% of android traffic failed

Page 20: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

You can and should inject chaos at every layer of your stack

● Application● API ● Caching● Database● Hardware● Cloud Infrastructure / Bare metal

Page 21: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Top places to inject chaos

Page 22: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Page 23: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

https://www.gremlin.com/community/tutorials/what-i-learned-running-the-chaos-lab-kafka-breaks/

Page 24: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Getting Started:

● Identify top 5 critical systems ● Choose system● Whiteboard the system● Determine what experiment you want to run:

(resource, state, network)● Determine Blast Radius

Page 25: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Companies doing Chaos Engineering

Page 26: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Chaos Days

Page 27: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Chaos Days: Dedicated day for your entire company to focus on building resilience instead of new products.

https://www.gremlin.com/community/tutorials/planning-your-own-chaos-day/

Page 28: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

“What could go wrong?”

“Do we know what will happen if this breaks?”

Page 29: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Chaos Day Crew:

VP Engineering / CTO / COOExecutive AssistantEngineering Director / ManagerSenior EngineerNew Grad / Intern Engineer

Page 30: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

What experiments can you run?

• Reproduce outage conditions• Unpredictable circumstances • Large traffic spikes • Race conditions • Datacenter failure • Time travel - system clocks to be out of sync • Network errors • CPU overloads

Page 31: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Page 32: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Page 33: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Page 34: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Page 35: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Page 36: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

gremlin.com/chaos-monkey/

Page 37: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

Join the Chaos, Join Slack:

bit.ly/chaos-eng-slack1,900+ members across the world

Learn more:

Page 38: The Practice of @ana m medina Chaos Engineering · #reactive18 @ana_m_medina Getting Started: Identify top 5 critical systems Choose system Whiteboard the system Determine what experiment

@ana_m_medina#reactive18

THANKS!

@ana_m_medina

[email protected]