Chaos Engineering - Limiting Damage During Chaos Experiments
Limiting Damage During Chaos Experiments
Nils Meder | Computer Scientist @ Adobe
© 2016 Adobe Systems Incorporated. All Rights Reserved. Adobe Confidential.
Agenda
• Doing Chaos In Your Production System
• Building A Context Around Your Experiment
• Protect Your Infrastructure
• Example: Kill Random Instances
• Protect Your Application
• Resilience Patterns
• Wrap-Up & Discussion
Doing Chaos In Your Production System
• Testing in Production is The Ultimate Goal
• But, It is Not The First Step
• There Are Always Differences Between Staging and Production
  • Scale, Networking, Datasets, …
• Start In Staging Environment
• Make Sure An Experiment Doesn’t Bring Down The Whole Service
• “Know Your Enemy” - Have A Clear View of Your Environment
• Iterate Over Your Experiments
• Be Brave - Having Just Some Basic Tests Running in Production is Better Than None
Building A Context Around Your Experiments
• Chaos Testing Is Not Just Pulling The Plug
• Focus On Business Critical Scenarios/Components First
• Have A Clear Goal, e.g. What Happens When The Network Fails?
• Focus - Run One Experiment At a Time
• Monitor Your Experiments
• Define Fallbacks And Defaults
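
One way to read "Monitor Your Experiments" together with "Define Fallbacks And Defaults" is to poll a health metric while the fault is active and always restore the default state afterwards. The sketch below is a minimal illustration under that assumption; `run_experiment` and all of its parameters (`inject`, `rollback`, `read_error_rate`, `max_error_rate`, `checks`, `interval_s`) are hypothetical names, not part of the talk.

```python
import time


def run_experiment(inject, rollback, read_error_rate, max_error_rate,
                   checks, interval_s=0.0):
    """Inject one fault, poll a metric, and stop early if it degrades.

    All callables are supplied by the caller (hypothetical hooks).
    Returns "aborted" if the metric crossed the limit, else "completed".
    """
    inject()
    try:
        for _ in range(checks):
            if read_error_rate() > max_error_rate:
                return "aborted"
            time.sleep(interval_s)
        return "completed"
    finally:
        # The rollback runs on every path, including early abort and exceptions.
        rollback()
```

Putting the rollback in `finally` means the environment is restored even if the monitoring code itself fails.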
Protect Your Infrastructure
• Target Infrastructure Components
• Think About Recovery
• Take Snapshots
• Limit The Damage To Single Instances
• Limit The Damage To Groups of Instances
  • Of The Same Kind
  • Within The Same Workflow
• Limit Percentage Of Impact
• Limit What Chaos Tests Are Allowed To Do
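
The limits above (one group of instances, a capped percentage of impact) can be sketched as a target-selection step that runs before any fault is injected. This is an illustrative sketch only: the `Instance` type, its `group` label, and `pick_targets` are assumptions made for the example, not tooling described in the talk.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str
    group: str  # hypothetical label, e.g. an instance kind or workflow name


def pick_targets(instances, group, max_fraction, rng=None):
    """Choose victims from one group only, capped at max_fraction of that group."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    rng = rng or random.Random()
    candidates = [i for i in instances if i.group == group]
    # Round down so the cap is never exceeded; small groups may yield no targets.
    limit = math.floor(len(candidates) * max_fraction)
    return rng.sample(candidates, limit)
```

Rounding down is a deliberately conservative choice: with four instances and a 20% cap, nothing is selected rather than taking out a quarter of the group.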
Example: Kill Random Instances
• Terminate Random EC2 Instances
• Focus:
  • What Happens If A Number Of My Servers Die?
  • Does Autoscaling Work?
  • Is The Web API Still Serving Requests?
• The Test is Only Allowed To Terminate Instances
• Simulate The Experiment Beforehand
• Take An Environment Snapshot
• Run The Test
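
Two of the steps above, "The Test is Only Allowed To Terminate Instances" and simulating the experiment first, can be combined in a small runner that checks an allowlist and defaults to a dry run. This is a sketch under assumptions: `ChaosRunner`, `ALLOWED_ACTIONS`, and the `terminate_fn` callback are hypothetical names, and the callback stands in for whatever cloud API call actually terminates an instance.

```python
from dataclasses import dataclass, field

# The only action this experiment may perform (assumed allowlist).
ALLOWED_ACTIONS = frozenset({"terminate"})


@dataclass
class ChaosRunner:
    dry_run: bool = True  # simulate by default; opt in to real damage
    log: list = field(default_factory=list)

    def execute(self, action, instance_id, terminate_fn):
        """Run one action; return True only if the instance was really terminated."""
        if action not in ALLOWED_ACTIONS:
            raise PermissionError(f"action {action!r} is not allowed in this experiment")
        if self.dry_run:
            self.log.append(f"DRY-RUN would {action} {instance_id}")
            return False
        terminate_fn(instance_id)  # hypothetical hook around the real API call
        self.log.append(f"{action} {instance_id}")
        return True
```

Reviewing the dry-run log before switching `dry_run` to `False` gives a cheap preview of exactly which instances the real run would touch.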
[Diagram: Chaos Test, Client, App1, App2, App3, Appx]
Protect Your Application
• Plan For Chaos in Your Application
• Fail Fast, But Keep The Streams Flowing
• Build Your Application Isolated
• Apply Loose Coupling
• Introduce Latency Control
• Real-Time Data and Diagnostics
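
"Fail Fast" and "Introduce Latency Control" can be illustrated by bounding how long a caller waits for a dependency and returning a default when that budget is exceeded. The sketch below is a minimal example, not the speaker's implementation; `call_with_timeout`, the pool size of 4, and the parameter names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# A small shared pool; its size also caps how many slow calls run at once.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s, fallback):
    """Return fn() if it finishes within timeout_s, otherwise the fallback value."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: a call that has already started keeps running.
        future.cancel()
        return fallback
    except Exception:
        # The dependency failed outright; keep the caller's stream flowing.
        return fallback
```

The caller gets an answer within roughly `timeout_s` either way, which is the point of latency control: a slow dependency degrades one response instead of stalling the whole request path.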
Resilience Patterns
• Bulkheads
  • Building Failure Units
  • Protect App Against Cross-Failures
• Event-Driven & Stateless
  • Embrace Loose Coupling
• Circuit Breaker
• Timeouts
• Fallbacks
• Healthchecks
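
As an illustration of how the Circuit Breaker and Fallbacks patterns fit together, the sketch below stops calling a failing dependency after a few consecutive errors, serves a fallback while the circuit is open, and allows a single trial call once a cool-down has passed. It is a simplified, single-threaded sketch; the class, its thresholds, and the injectable `clock` are assumptions made for the example.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback     # open: fail fast without calling fn
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the dependency receives no traffic at all, which gives it room to recover and keeps callers from queuing behind it.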
Resilience Patterns
• “Release It!” - Michael Nygard
  • More On Resilience Patterns, Anti-Patterns and Case Studies
  • ISBN-13: 978-0978739218
Wrap-Up & Discussion
• Expect The Unexpected
• Failures Are The Normal Case & Not Predictable
• Do Not Try To Avoid Failures. Embrace Them.
• Chaos Engineering Helps To Discover Weak Points
• Apply Resilience Patterns
References
• Resilience Patterns: http://de.slideshare.net/ufried/patterns-of-resilience
• Bulkheads: http://skife.org/architecture/fault-tolerance/2009/12/31/bulkheads.html
• Making APIs More Resilient: http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
• “Release It!” - Michael Nygard