Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure
Ramesh Govindan, Ina Minei, Mahesh Kallahalla, Bikash Koley and Amin Vahdat… and a cast of hundreds at Google
Network availability is the biggest challenge facing large content and cloud providers today
Why?
The push towards higher 9s of availability
❖ At four 9s availability, the outage budget is 4 minutes per month
❖ At five 9s availability, the outage budget is 24 seconds per month
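The arithmetic behind these budgets, as a quick back-of-the-envelope check (this script is not from the talk; the slide's figures correspond to a 4-week month):

```python
# Outage budget = (1 - availability) x seconds in a month.
SECONDS_PER_MONTH = 28 * 24 * 3600  # 4-week month, matching the slide's figures

for nines in (4, 5):
    unavailability = 10 ** -nines               # e.g. 1e-4 for 99.99% uptime
    budget_s = SECONDS_PER_MONTH * unavailability
    print(f"{nines} nines: {budget_s / 60:.1f} min ({budget_s:.0f} s) per month")
# 4 nines: 4.0 min (242 s) per month
# 5 nines: 0.4 min (24 s) per month
```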
How do providers achieve these levels? By learning from failures
What design principles can achieve high availability?
What has Google Learned from Failures?
Why is high network availability a challenge?
What are the characteristics of network availability failures?
Paper’s Focus
Why is high network availability a challenge?
Velocity of Evolution
Scale
Management Complexity
Evolution
[Figure: data-center network capacity over time, across successive fabric generations: 4 Post, Firehose 1.0, Firehose 1.1, Watchtower, Saturn, Jupiter]
Network hardware evolves continuously
Evolution
[Figure: timeline, 2006-2014, of network software and systems: Google Global Cache, Watchtower, B4, Freedome, BwE, QUIC, Andromeda, Jupiter, gRPC]
So does network software
Evolution
New hardware and software can
❖ Introduce bugs
❖ Disrupt existing software
Result: Failures!
[Figure: Google's three networks: B2, B4, and the data centers, connected to other ISPs]
Scale and Complexity
Scale and Complexity
Design Differences
B4 and Data Centers
❖ Use merchant silicon chips
❖ Centralized control planes
B2
❖ Vendor gear
❖ Decentralized control plane
Scale and Complexity
Design Differences
These differences increase management complexity and pose availability challenges
The Management Plane
Management Plane Software
Manages network evolution
Management Plane Operations
❖ Connect a new data center to B2 and B4
❖ Upgrade B4 or data center control plane software
❖ Drain or undrain links, switches, routers, services (drain: temporarily remove from service)
Many operations require multiple steps and can take hours or days
The Management Plane
Low-level abstractions for management operations
❖ Command-line interfaces to high-capacity routers
A small mistake by an operator can impact a large part of the network
Why is high network availability a challenge?
What are the characteristics of network availability failures?
▸ Duration, severity, prevalence
▸ Root-cause categorization
Key Takeaway
Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily
We analyzed over 100 post-mortem reports written over a 2-year period
What is a Post-mortem?
A carefully curated description of a previously unseen failure that had significant availability impact
❖ Blame-free process
❖ Helps learn from failures
What a Post-Mortem Contains
❖ Description of the failure, with a detailed timeline
❖ Root cause(s), confirmed by reproducing the failure
❖ Discussion of fixes and follow-up action items
Failure Examples and Impact
Examples
❖ Entire control plane fails
❖ Upgrade causes backbone traffic shift
❖ Multiple top-of-rack switches fail
Impact
❖ Data center goes offline
❖ WAN capacity falls below demand
❖ Several services fail concurrently
Key Quantitative Results
❖ 70% of failures occur when a management plane operation is in progress ➔ Evolution impacts availability
❖ Failures are everywhere: all three networks and all three planes see comparable failure rates ➔ No silver bullet
❖ 80% of failure durations are between 10 and 100 minutes ➔ Need fast recovery
Root causes
Lessons learned from root causes motivate availability design principles
Why is high network availability a challenge?
What are the characteristics of network availability failures?
What design principles can achieve high availability?
▸ Re-think the Management Plane
▸ Avoid and Mitigate Large Failures
▸ Evolve or Die
Re-think the Management Plane
Availability Principle
Operator types the wrong CLI command or runs the wrong script ➔ Backbone router fails
Minimize Operator Intervention
Availability Principle
To upgrade part of a large device…
❖ Line card, block of a Clos fabric
… proceed while the rest of the device carries traffic
❖ Enables higher availability (see the sketch below)
Necessary for upgrade-in-place
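A minimal sketch of the upgrade-in-place loop, assuming hypothetical drain/upgrade/verify helpers (not Google's actual tooling):

```python
def upgrade_in_place(device, blocks, drain, upgrade, undrain, healthy):
    """Upgrade one block (line card, Clos block) at a time while the
    rest of the device keeps carrying traffic. All callbacks here are
    hypothetical stand-ins for real management-plane operations."""
    for block in blocks:
        drain(device, block)                    # shift traffic off this block
        upgrade(device, block)                  # e.g. install new firmware
        if not healthy(device, block):          # verify before restoring traffic
            raise RuntimeError(f"post-upgrade check failed on {block}")
        undrain(device, block)                  # restore traffic, move on
```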
Availability Principle
Assess risk continuously
❖ Ensure residual capacity > demand
❖ Early risk assessments were manual ➔ Risky! High packet loss
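A sketch of the continuous capacity-vs-demand check, assuming a hypothetical live feed of link capacities and demand (the names and headroom factor are illustrative):

```python
def safe_to_drain(link_capacity_gbps, links_to_drain, demand_gbps, headroom=1.1):
    """Proceed only if the capacity remaining after the drain still
    covers current demand with some headroom. `link_capacity_gbps`
    maps link -> capacity; both inputs would come from live telemetry."""
    residual = sum(cap for link, cap in link_capacity_gbps.items()
                   if link not in links_to_drain)
    return residual >= headroom * demand_gbps

# Example: draining two of four 100G links leaves 200G for 150G of demand.
links = {"dc1-b4-1": 100, "dc1-b4-2": 100, "dc1-b4-3": 100, "dc1-b4-4": 100}
print(safe_to_drain(links, {"dc1-b4-1", "dc1-b4-2"}, demand_gbps=150))  # True
```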
Re-think the Management Plane
[Diagram: operator intent ("I want to upgrade this router") flows into Management Plane Software, which generates Management Operations, Device Configurations, and Tests to Verify the Operation]
Re-think the Management Plane
[Diagram: the Management Operations, Device Configurations, and Tests feed a Management Plane Run-time that applies configuration, performs the management operation, and verifies it, while continuously assessing risk]
Minimize Operator Intervention
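One way to picture that run-time loop (an illustrative sketch; plan/assess_risk/apply_config/verify are hypothetical callbacks, not Google's APIs):

```python
def execute_intent(intent, plan, assess_risk, apply_config, verify):
    """Drive a high-level intent ("upgrade this router") through
    generated steps: risk-check, apply, and verify each one, so no
    human ever types raw CLI commands against production devices."""
    for step in plan(intent):              # compile intent into ordered steps
        if not assess_risk(step):          # re-assess risk before every step
            raise RuntimeError(f"risk check blocked step: {step}")
        apply_config(step)                 # push the generated device config
        if not verify(step):               # run the tests for this operation
            raise RuntimeError(f"verification failed after step: {step}")
```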
Automated Risk Assessment
Avoid and Mitigate Large Failures
Availability Principle
B4 and data centers have a dedicated control-plane network
❖ Failure of this network can bring down the entire control plane
Two defenses: fail open, and contain the failure radius
Fail Open (Centralized Control Plane)
❖ Preserve the forwarding state of all switches
❖ Fail-open the entire data center ➔ Exceedingly tricky!
[Diagram: a data center keeps carrying traffic while its control plane is down]
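A minimal sketch of the fail-open reaction on a single switch (the class and method names are illustrative, not a real switch API):

```python
class Switch:
    """Toy model of a data-center switch that can fail open."""
    def __init__(self, fib):
        self.fib = fib                  # forwarding table from the controller
        self.failed_open = False

    def on_control_plane_loss(self):
        # Fail open: keep forwarding on the last-known state instead of
        # flushing routes, so data traffic survives a control-plane outage.
        self.failed_open = True         # stop expecting controller updates
        return self.fib                 # preserved forwarding state
```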
Availability Principle
A bug can cause state inconsistency between control plane components ➔ Capacity reduction in WAN or data center
Design fallback strategies
Design Fallback Strategies
A large section of the WAN (B4) fails, so demand exceeds capacity
Design Fallback Strategies
Fall back to B2!
❖ Can shift large traffic volumes from many data centers off B4 onto B2
Design Fallback Strategies
When centralized traffic engineering fails…
❖ … fall back to IP routing
Big Red Buttons (sketched below)
❖ For every new software upgrade, design controls so an operator can initiate fallback to a "safe" version
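A sketch of the Big Red Button idea (all names are illustrative):

```python
class BigRedButton:
    """Every upgrade ships with an operator-triggerable fallback to the
    last known-safe behavior, e.g. centralized TE -> plain IP routing."""
    def __init__(self, safe_version, new_version):
        self.safe_version = safe_version
        self.active = new_version       # run the new version by default

    def press(self):
        # One deliberate operator action reverts to the safe version.
        self.active = self.safe_version
        return self.active

# Example: fall back from centralized TE to IP routing.
button = BigRedButton(safe_version="ip-routing", new_version="central-te")
button.press()                          # active is now "ip-routing"
```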
Evolve or Die!
We cannot treat a change to the network as an exceptional event
Evolve or Die
Make change the common case
Make it easy and safe to evolve the network daily
❖ Forces management automation
❖ Permits small, verifiable changes
Conclusion
Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily
Evolve or Die: High-Availability Design Principles Drawn from Google's Network Infrastructure
Presentation template from SlidesCarnival
Older Slides
Popular Root-Cause Categories
Cabling error, interface card failure, cable cut…
Popular Root-Cause Categories
Operator types the wrong CLI command or runs the wrong script
Popular Root-Cause Categories
Incorrect demand or capacity estimation for upgrade-in-place
Upgrade-in-Place
Assessing Risk Correctly
❖ Residual capacity? Varies by interconnect
❖ Demand? Can change dynamically
Popular Root-Cause Categories
Hardware or link-layer failures in the control plane network
Popular Root-Cause Categories
Two control plane components have inconsistent views of control plane state, caused by a bug
Popular Root-Cause Categories
Running out of memory, CPU, or OS resources (threads)…
Lessons from Failures
❖ The role of evolution in failures ▸ Re-think the Management Plane
❖ The prevalence of large, severe failures ▸ Prevent and mitigate large failures
❖ Long failure durations ▸ Recover fast
High-level Management Plane Abstractions
"Intent": I want to upgrade this router
Why is this difficult? Modern high-capacity routers:
❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files that run into 100s of thousands of lines
High-level Management Plane Abstractions
[Diagram: operator intent ("I want to upgrade this router") flows into Management Plane Software, which generates Management Operations, Device Configurations, and Tests to Verify the Operation]
Management Plane Automation
[Diagram: Management Operations, Device Configurations, and Tests feed a management plane run-time that applies configuration, performs the management operation, and verifies it, while continuously assessing risk and minimizing operator intervention]
Large Control Plane Failures
Centralized Control Plane
Contain the Blast Radius (Centralized Control Plane)
Centralized Control Plane
Smaller failure impact, but increased complexity
Fail Open (Centralized Control Plane)
❖ Preserve the forwarding state of all switches
❖ Fail-open the entire fabric
Defensive Control-Plane Design
[Diagram: B4 control stack: Gateway, Topology Modeler, TE Server, BwE]
Trust but Verify
[Diagram: B4 control stack; one component inspects a large incoming update: "One piece of this large update seems wrong! Let me check the correctness of the update…"]
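A sketch of that trust-but-verify check (the check list and names are illustrative):

```python
def accept_update(update, sanity_checks):
    """Trust but verify: before a control-plane component acts on a
    large update from a peer, it runs cheap sanity checks and rejects
    the update (triggering fallback) rather than propagating a bug."""
    failed = [name for name, check in sanity_checks if not check(update)]
    if failed:
        raise ValueError(f"update rejected; failed checks: {failed}")
    return update

# Example check: a topology update should never remove most of the links.
checks = [("not mass link deletion",
           lambda u: len(u.get("removed_links", [])) < 100)]
accept_update({"removed_links": ["link-1"]}, checks)   # passes
```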
Fallback to B2
[Diagram: when the B4 control stack cannot proceed safely, traffic falls back to B2]
Mitigating Large Failures
Design Fallback Strategies
▸ B4 ➔ B2
▸ Tunneling ➔ IP routing
▸ Big Red Buttons
Continuously Monitor Invariants (a checker is sketched below)
❖ Must have one functional backup SDN controller
❖ Anycast route must have an AS path length of 3
❖ Data center must peer with two B2 routers
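A sketch of a monitor over the three invariants above (the network-snapshot model is hypothetical):

```python
# Each invariant pairs a description with a predicate over a network snapshot.
INVARIANTS = [
    ("one functional backup SDN controller",
     lambda net: net["backup_controllers_up"] >= 1),
    ("anycast routes have AS path length 3",
     lambda net: all(len(p) == 3 for p in net["anycast_as_paths"])),
    ("every data center peers with two B2 routers",
     lambda net: all(n >= 2 for n in net["b2_peers_per_dc"].values())),
]

def check_invariants(net_snapshot):
    """Return the invariants violated by the current network snapshot;
    in production this would run continuously and page on violations."""
    return [name for name, holds in INVARIANTS if not holds(net_snapshot)]
```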
This Alone Isn't Enough…
We cannot treat a change to the network as an exceptional event
Evolve or Die
Make change the common case
Make it easy and safe to evolve the network daily
❖ Forces management automation
❖ Permits small, verifiable changes
Impact of Availability Failures
A Case Study: Google
Why is high network availability a challenge?
What are the characteristics of network availability failures?
What design principles can achieve high availability?
The velocity of evolution is fueled by traffic growth…
… and by an increase in product and service offerings
Networks have very different designs
❖ Different hardware
❖ Different control planes
❖ Different forwarding paradigms
These differences increase management and evolution complexity
Data centers (SIGCOMM 2015)
❖ Fabrics with merchant silicon chips
❖ Centralized control plane
❖ Out-of-band control plane network
B4 (SIGCOMM 2013; BwE: SIGCOMM 2015)
[Diagram: B4 control stack: Gateway, Topology Modeler, TE Server, BwE]
❖ B4 routers built using merchant silicon chips
❖ Centralized control plane within each B4 site
❖ Centralized traffic engineering
❖ Bandwidth enforcement for traffic metering
B2
[Diagram: B2 connects Google to other ISPs]
❖ B2 routers based on vendor gear
❖ Decentralized routing and MPLS TE
❖ Class of service (high/low) using MPLS priorities
The Management Plane
Low-level, per-device abstractions for management operations
Where do failures happen?
No network or plane dominates
How long do failures last?
❖ Durations much longer than outage budgets
❖ Shorter failures on B2
What role does evolution play?
70% of failures happen when a management operation is in progress
Where do failures happen?
[Figure: failure counts broken down by network (B2, B4, data centers, control plane network) and by plane; no single network or plane dominates]
Failures are everywhere
[Figure: example failure categories occur across all three networks, and across all three planes: data, control, and management]
Root-Cause Categorization
What are the root causes for these failures?
Re-think the Management Plane
Low-level network management cannot ensure high availability
Re-think the Management Plane
"Intent": I want to upgrade this router
Lots of complexity hidden below this statement. Modern high-capacity routers:
❖ Carry Tb/s of traffic
❖ Have hundreds of interfaces
❖ Interface with associated optical equipment
❖ Run a variety of control plane protocols (MPLS, IS-IS, BGP), all of which have network-wide impact
❖ Have high-capacity fabrics with complicated dynamics
❖ Have configuration files that run into 100s of thousands of lines
Contain the Failure Radius (Centralized Control Plane)
Centralized Control Plane
❖ Each partition is managed by a different control plane instance
❖ Even if one partition fails, the others can carry traffic
❖ Adds design complexity
Key Takeaway
Content provider networks evolve rapidly
The way we manage evolution can impact availability
We must make it easy and safe to evolve the network daily
By learning from failures
What has Google Learned from Failures?
Why is high network availability a challenge? ▸ Factors that impact availability
What are the characteristics of network failures? ▸ Severity, duration, prevalence ▸ Root-cause categorization
What design principles can achieve high availability? ▸ Lessons learned from root causes
[Diagram: a global network of data centers]
In a global network, failures are common and configuration can change
These can impact network availability
How long does it take…
❖ … to root-cause a failure? 10s of minutes to hours
❖ … to upgrade part of the network? Hours to days
Outage budgets…
❖ … for four 9s (99.99% uptime) availability? 4 minutes per month
❖ … for five 9s (99.999% uptime) availability? 24 seconds per month
To move towards higher availability targets, it is important to learn from failures