Embracing Failure - Fault Injection and Service Resilience at Netflix
![Page 1: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/1.jpg)
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Embracing Failure: Fault Injection and Service Resilience at Netflix
Josh Evans – Director of Operations Engineering
Naresh Gopalani – Software Engineer and Architect
![Page 2: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/2.jpg)
• ~50 million members, ~50 countries
• > 1 billion hours per month
• > 1000 device types
• 3 AWS Regions, hundreds of services
• Hundreds of thousands of requests/second
• CDN serves petabytes of data at terabits/second
Netflix Ecosystem
[Diagram labels: Service Partners, Static Content, Akamai, Netflix CDN, AWS/Netflix Control Plane, Internet]
![Page 3: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/3.jpg)
Availability means that members can
● sign up
● activate a device
● browse
● watch
![Page 4: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/4.jpg)
What keeps us up at night
![Page 5: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/5.jpg)
Failures can happen any time
• Disks fail
• Power outages
• Natural disasters
• Software bugs
• Human error
![Page 6: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/6.jpg)
We design for failure
• Exception handling
• Fault tolerance and isolation
• Fall-backs and degraded experiences
• Auto-scaling clusters
• Redundancy
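A minimal sketch of the fall-back idea from this slide, assuming a hypothetical personalized call that fails and a hypothetical static list served in its place. The function names and the returned row are illustrative and do not come from the talk.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Return primary() if it succeeds, otherwise the degraded fallback()."""
    try:
        return primary()
    except Exception:
        # Isolate the failure: the caller still gets a usable, degraded result.
        return fallback()


def personalized_rows() -> List[str]:
    # Hypothetical dependency that is currently failing.
    raise TimeoutError("recommendation service timed out")


def popular_rows() -> List[str]:
    # Hypothetical non-personalized fallback.
    return ["Popular on Netflix"]


rows = with_fallback(personalized_rows, popular_rows)
print(rows)
```

The member still sees something to browse, which is the degraded experience the slide refers to. The fallback itself should be cheap and local so that it cannot fail for the same reason as the primary call.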
![Page 7: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/7.jpg)
Testing for failure is hard
• Web-scale traffic
• Massive, changing data sets
• Complex interactions and request patterns
• Asynchronous, concurrent requests
• Complete and partial failure modes
Constant innovation and change
![Page 8: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/8.jpg)
What if we regularly inject failures into our systems under controlled circumstances?
![Page 9: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/9.jpg)
![Page 10: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/10.jpg)
Blast Radius
• Unit of isolation
• Scope of an outage
• Scope of a chaos exercise
Zone
Region
Instance
Global
![Page 11: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/11.jpg)
An Instance Fails
Edge Cluster
Cluster A
Cluster B
Cluster D
Cluster C
![Page 12: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/12.jpg)
Chaos Monkey
• Monkey loose in your DC
• Run during business hours
• What we learned
– Auto-replacement works
– State is problematic
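The selection logic can be sketched roughly as follows. This is not the Simian Army implementation: the 9:00 to 15:00 weekday window, the function names, and the instance IDs are all assumptions made for illustration, and the actual termination call is left out.

```python
import random
from datetime import datetime
from typing import List, Optional


def in_business_hours(now: datetime, start_hour: int = 9, end_hour: int = 15) -> bool:
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def pick_victim(instances: List[str], now: datetime, rng: random.Random) -> Optional[str]:
    """Pick one instance to terminate, or None outside business hours."""
    if not instances or not in_business_hours(now):
        return None
    return rng.choice(instances)


# Hypothetical cluster and a Wednesday afternoon timestamp.
cluster = ["i-aaa", "i-bbb", "i-ccc"]
victim = pick_victim(cluster, datetime(2014, 11, 12, 14, 0), random.Random(0))
print(victim)
```

Restricting the run to business hours means engineers are at their desks when an instance disappears, so any gap in auto-replacement is noticed and fixed immediately.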
![Page 13: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/13.jpg)
A State of Xen - Chaos Monkey & Cassandra
Out of our 2700+ Cassandra nodes
• 218 rebooted
• 22 did not reboot successfully
• Automation replaced failed nodes
• 0 downtime due to reboot
![Page 14: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/14.jpg)
An Availability Zone Fails
EU-West
US-East
US-West
AZ1
AZ2
![Page 15: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/15.jpg)
Chaos Gorilla
Simulate an availability zone outage
• 3-zone configuration
• Eliminate one zone
• Ensure that others can handle the load and nothing breaks
Chaos Gorilla
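The capacity question behind a zone outage is simple arithmetic: with three zones sharing traffic evenly, losing one pushes each survivor to 1.5 times its normal load. The sketch below assumes evenly balanced traffic and uses made-up request rates; neither assumption comes from the talk.

```python
def load_per_zone_after_outage(total_rps: float, zones: int = 3, failed: int = 1) -> float:
    """Requests/second each surviving zone must absorb, assuming even balancing."""
    remaining = zones - failed
    if remaining <= 0:
        raise ValueError("at least one zone must survive")
    return total_rps / remaining


def survives_zone_outage(total_rps: float, capacity_per_zone_rps: float, zones: int = 3) -> bool:
    """True if each surviving zone has enough capacity for its share of traffic."""
    return load_per_zone_after_outage(total_rps, zones) <= capacity_per_zone_rps


# Hypothetical numbers: 90,000 rps is 30,000 per zone normally, 45,000 after an outage.
print(load_per_zone_after_outage(90_000))
print(survives_zone_outage(90_000, capacity_per_zone_rps=40_000))
print(survives_zone_outage(90_000, capacity_per_zone_rps=50_000))
```

In practice the headroom can come from auto-scaling rather than idle capacity, which is why clusters that are neither auto-scaled nor pinned show up as a challenge on the next slide.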
![Page 16: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/16.jpg)
Challenges
• Rapidly shifting traffic
– LBs must expire connections quickly
– Lingering connections to caches must be addressed
• Service configuration
– Not all clusters auto-scaled or pinned
– Services not configured for cross-zone calls
– Mismatched timeouts – fallbacks prevented fail-over
![Page 17: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/17.jpg)
A Region Fails
EU-West
US-East
US-West
![Page 18: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/18.jpg)
Chaos Kong
[Diagram labels: Customer Device; Geo-located; and, for each of two regions: Regional Load Balancers; Zuul – Traffic Shaping/Routing; AZ1, AZ2, AZ3; Data (shown three times)]
![Page 19: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/19.jpg)
Challenges
● Rapidly shifting traffic
○ Auto-scaling configuration
○ Static configuration/pinning
○ Instance start time
○ Cache fill time
![Page 20: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/20.jpg)
Challenges
● Service Configuration
○ Timeout configurations
○ Fallbacks fail or don’t provide the desired experience
● No minimal (critical) stack
○ Any service may be critical!
![Page 21: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/21.jpg)
A Service Fails
Zone
Region
Global
Service
![Page 22: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/22.jpg)
Services Slow Down and Fail
Simulate latent/failed service calls
• Inject arbitrary latency and errors at the service level
• Observe for effects
Latency Monkey
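One way to picture service-level injection is a wrapper around a service call that adds a delay and, with some probability, raises an error instead of returning. This is only a sketch: the wrapper name, the `InjectedFault` exception, and the parameters are hypothetical and are not the Latency Monkey API.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real response when an error is injected."""


def call_with_faults(
    call: Callable[[], T],
    latency_s: float,
    error_rate: float,
    rng: random.Random,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Delay the call by latency_s, then fail it with probability error_rate."""
    if latency_s > 0:
        sleep(latency_s)
    if rng.random() < error_rate:
        raise InjectedFault("injected service error")
    return call()


# Record delays instead of really sleeping so the example runs instantly.
delays: List[float] = []
result = call_with_faults(
    lambda: "ok", latency_s=2.0, error_rate=0.0,
    rng=random.Random(1), sleep=delays.append,
)
print(result, delays)
```

Observing the effects then means watching the callers: whether their timeouts fire, whether fallbacks engage, and whether the delay spreads upstream.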
![Page 23: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/23.jpg)
Latency Monkey
[Diagram labels: Device, Internet, ELB, Zuul, Edge, Service A, Service B, Service C]
![Page 24: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/24.jpg)
Challenges
• Startup resiliency is an issue
• Service owners don’t know all dependencies
• Fallbacks can fail too
• Second order effects not easily tested
• Dependencies are in constant flux
• Latency Monkey tests function and scale
– Not a staged approach
– Lots of opt-outs
![Page 25: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/25.jpg)
More Precise and Continuous
![Page 26: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/26.jpg)
Service Failure Testing: FIT
![Page 27: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/27.jpg)
Distributed Systems Fail
● Complex interactions at scale
● Variability across services
● Byzantine failures
● Combinatorial complexity
![Page 28: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/28.jpg)
Any service can cause cascading failures
ELB
![Page 29: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/29.jpg)
Fault Injection Testing (FIT)
Device Service B
Service C
Internet Edge
Device or Account Override
Zuul
Service A
Request-level simulations
ELB
![Page 30: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/30.jpg)
Failure Injection Points
IPC Cassandra Client Memcached Client Service Container Fault Tolerance
![Page 31: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/31.jpg)
FIT Details
● Common Simulation Syntax
● Single Simulation Interface
● Transported via HTTP request header
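Carrying the simulation in a request header means every service on the call path can read the same failure context and pass it downstream. The sketch below uses the header name `fit.failure` that appears in the deck's request-header example, but the plain-JSON encoding and the helper names are assumptions; the real FIT serializer format is not reproduced here.

```python
import json
from typing import Dict, Optional

FIT_HEADER = "fit.failure"  # header name as it appears in the deck's example


def encode_scenario(scenario: dict) -> str:
    """Serialize a scenario for transport (assumed plain JSON, not FIT's format)."""
    return json.dumps(scenario, separators=(",", ":"))


def decode_scenario(headers: Dict[str, str]) -> Optional[dict]:
    """Read the scenario from request headers, or None if no simulation is active."""
    raw = headers.get(FIT_HEADER)
    return json.loads(raw) if raw else None


def propagate(incoming: Dict[str, str], outgoing: Dict[str, str]) -> Dict[str, str]:
    """Copy the failure context onto a downstream request, if present."""
    result = dict(outgoing)
    if FIT_HEADER in incoming:
        result[FIT_HEADER] = incoming[FIT_HEADER]
    return result


scenario = {"name": "failSocial", "whitelist": False, "injectionPoints": ["SocialService"]}
incoming = {FIT_HEADER: encode_scenario(scenario)}
downstream = propagate(incoming, {"Accept": "application/json"})
print(decode_scenario(downstream))
```

Because the context rides on the request, a simulation affects only requests that carry the header, which is what makes device-level or account-level overrides possible. The dictionary lookup here is case-sensitive for brevity; real HTTP header names are not.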
![Page 32: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/32.jpg)
Integrating Failure
[Diagram labels: Service A, Service B, request, response; Service, Filter, Ribbon, ServerRcv (each shown twice); ClientSend]
[sendRequestHeader] >> fit.failure: 1|fit.Serializer|
2|[[{"name":"failSocial",
"whitelist":false,
"injectionPoints":
["SocialService"]},{}
]],
{"Id":
"252c403b-7e34-4c0b-a28a-3606fcc38768"}]]
![Page 33: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/33.jpg)
Failure Scenarios
● Set of injection points to fail
● Defined based on
○ Past outages
○ Specific dependency interactions
○ Whitelist of a set of critical services
○ Dynamic tracing of dependencies
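A scenario can be modeled as a named set of injection points plus a whitelist flag. The sketch below assumes that a whitelist scenario fails every injection point except the listed critical ones; that reading, the class name, and the example service names other than `SocialService` are assumptions, not details given in the talk.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class FailureScenario:
    name: str
    injection_points: FrozenSet[str]
    whitelist: bool = False

    def should_fail(self, point: str) -> bool:
        """Decide whether a call through this injection point should be failed."""
        listed = point in self.injection_points
        # Assumed semantics: whitelist=True fails everything EXCEPT the listed points.
        return not listed if self.whitelist else listed


fail_social = FailureScenario("failSocial", frozenset({"SocialService"}))
critical_only = FailureScenario("criticalOnly", frozenset({"Playback"}), whitelist=True)

print(fail_social.should_fail("SocialService"))
print(critical_only.should_fail("SocialService"))
```

The whitelist form answers the question of whether members can still use the service when only the critical services are healthy.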
![Page 34: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/34.jpg)
FIT Insights: Salp
● Distributed tracing inspired by the Dapper paper
● Provides insight into dependencies
● Helps define & visualize scenarios
![Page 35: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/35.jpg)
Dialing Up Failure
Functional Validation
● Isolated synthetic transactions
○ Set of devices
Validation at Scale
● Dial up customer traffic - % based
● Simulation of full service failure
Chaos!
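Percentage-based dial-up is commonly done by hashing a stable identifier into a fixed number of buckets and failing only the buckets below the current dial setting. The sketch below assumes 100 buckets keyed on a hypothetical account ID; the talk does not say how FIT selects traffic.

```python
import hashlib


def bucket(account_id: str, buckets: int = 100) -> int:
    """Map an account to a stable bucket in [0, buckets)."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_experiment(account_id: str, percent: int) -> bool:
    """True if this account falls inside the first `percent` buckets."""
    return bucket(account_id) < percent


# Hypothetical account IDs; count how many are affected at each dial setting.
accounts = [f"account-{i}" for i in range(1000)]
for percent in (0, 1, 10, 100):
    affected = sum(in_experiment(a, percent) for a in accounts)
    print(percent, affected)
```

Because the bucket is derived from the account ID, an account affected at 1% stays affected as the dial moves to 10% and beyond, so raising the percentage only ever adds traffic to the simulation.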
![Page 36: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/36.jpg)
Continuous Validation
Critical
Services
Non-critical
Services
Synthetic
Transactions
![Page 37: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/37.jpg)
Don’t Fear The Monkeys
![Page 38: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/38.jpg)
Take-aways
• Don’t wait for random failures
– Cause failure to validate resiliency
– Remove uncertainty by forcing failures regularly
– Better to fail at 2pm than 2am
• Test design assumptions by stressing them
Embrace Failure
![Page 39: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/39.jpg)
The Simian Army is part of the Netflix open source cloud platform
http://netflix.github.com
![Page 40: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/40.jpg)
Netflix talks at re:Invent

| Talk | Time | Title |
|---|---|---|
| BDT-403 | Wednesday, 2:15pm | Next Generation Big Data Platform at Netflix |
| PFC-306 | Wednesday, 3:30pm | Performance Tuning EC2 |
| DEV-309 | Wednesday, 3:30pm | From Asgard to Zuul, How Netflix’s proven Open Source Tools can accelerate and scale your services |
| ARC-317 | Wednesday, 4:30pm | Maintaining a Resilient Front-Door at Massive Scale |
| PFC-304 | Wednesday, 4:30pm | Effective Inter-process Communications in the Cloud: The Pros and Cons of Micro Services Architectures |
| ENT-209 | Wednesday, 4:30pm | Cloud Migration, Dev-Ops and Distributed Systems |
| APP-310 | Friday, 9:00am | Scheduling using Apache Mesos in the Cloud |
![Page 41: Embracing Failure - Fault Injection and Service Resilience at Netflix](https://reader035.fdocuments.net/reader035/viewer/2022062710/55a252121a28abdd758b469c/html5/thumbnails/41.jpg)
Please give us your feedback on this presentation
Join the conversation on Twitter with #reinvent
Josh Evans
@josh_evans_nflx
Naresh Gopalani