Transcript of "The Impact of Router Outages on the AS-Level Internet" · 2017-10-27 · www.caida.org
The Impact of Router Outages on the AS-Level Internet
Matthew Luckie* (University of Waikato) · Robert Beverly (Naval Postgraduate School)
*work started while at CAIDA, UC San Diego
SIGCOMM 2017, August 24th 2017
Internet Resilience
[Diagram: two example topologies, #A and #B, built from customer edge (CE) and provider edge (PE) routers. CE: Customer Edge; PE: Provider Edge]
Where are the Single Points of Failure?
Example #A: if the CE router fails, the network is disconnected, so the CE router is a Single Point of Failure (SPoF).
Example #B: if the CE router fails, the network has an alternate path available, so the CE router is NOT a Single Point of Failure (SPoF).
Example #B: if the PE router fails, the customer network is disconnected, so the PE router is a Single Point of Failure (SPoF).
Challenges in topology analysis
• Prior approaches analyzed static AS-level and router-level topology graphs
  - e.g., Nature 2000
• Important AS-level and router-level topology might be invisible to measurement, such as backup paths
  - e.g., INFOCOM 2002
• A router that appears to be central to a network's connectivity might not be
  - e.g., AMS 2009
What we did
Large-scale (Internet-wide), longitudinal (2.5 years) measurement study to characterize the prevalence of Single Points of Failure (SPoF):
1. Efficiently inferred IPv6 router outage time windows
2. Associated routers with IPv6 BGP prefixes
3. Correlated router outages with the BGP control plane
4. Correlated router outages with the data plane
5. Validated SPoF inferences with network operators
What we did
Identified IPv6 router interfaces from traceroute:
83K to 2.4M interfaces from CAIDA's Archipelago traceroute measurements.
What we did
Probed router interfaces to infer outage windows.
We used a single vantage point located at CAIDA, UC San Diego for the duration of this study.
What we did
[Animation: probing an interface on a router that assigns IPID values from a central counter. While the counter advances normally, successive probes return T1: 9290, T2: 9291, T3: 9292, T4: 9293, T5: 9294. The router then reboots, resetting the counter, and subsequent probes return T6: 1, T7: 2, T8: 3.]
What we did
Probed router interfaces to infer outage windows using IPID.
Observed sequence: T1: 9290, T2: 9291, T3: 9292, T4: 9293, T5: 9294, T6: 1, T7: 2, T8: 3.
Infer a reboot when the time series of values returned from a router is discontinuous, indicating the router was restarted; the gap between T5 and T6 is the inferred outage window.
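The discontinuity test can be sketched in a few lines of Python. This is an illustrative model, not the released scamper implementation: it assumes a strictly incrementing 32-bit counter and treats any backwards step that is not plausibly a 32-bit wraparound as a reboot; the `wrap_margin` threshold is an assumption.

```python
def find_reboots(samples, wrap_margin=2**20):
    """Given a time-ordered list of (time, ipid) samples from one router,
    return the outage windows (t_prev, t_cur) across which the IPID
    counter went backwards, suggesting the router restarted.

    A counter value near the top of the 32-bit space is allowed to wrap
    to a small value without being counted as a reboot."""
    WRAP = 2**32
    outages = []
    for (t_prev, id_prev), (t_cur, id_cur) in zip(samples, samples[1:]):
        if id_cur < id_prev and id_prev < WRAP - wrap_margin:
            # counter went backwards without being near the 32-bit
            # ceiling: treat as a restart, not a wraparound
            outages.append((t_prev, t_cur))
    return outages
```

On the sequence from the slide, the only discontinuity falls between T5 (9294) and T6 (1), which becomes the inferred outage window.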
Why IPv6 fragment IDs?
• IPv4 fragment IDs:
  - 16 bits, bursty velocity: every packet requires a unique ID
  - At 100 Mbps and 1500-byte packets, the Nyquist rate dictates a 4-second probing interval
• IPv6 fragment IDs:
  - 32 bits, low velocity: IPv6 routers rarely send fragments
  - We average a 15-minute probing interval
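The 4-second figure is straightforward arithmetic, sketched here in Python: at 100 Mbps of 1500-byte packets a router consumes roughly 8,333 IDs per second, exhausting a 16-bit space about every 7.9 seconds, so sampling the counter at least twice per wrap (the Nyquist criterion) means probing roughly every 4 seconds.

```python
def max_probe_interval(link_bps, pkt_bytes, id_bits):
    """Longest probing interval that still samples an IPID counter at
    least twice per wrap, assuming every packet consumes one ID."""
    pps = link_bps / (pkt_bytes * 8)   # packets (IDs) per second
    wrap_seconds = 2**id_bits / pps    # time to exhaust the ID space
    return wrap_seconds / 2            # Nyquist: sample twice per wrap

# IPv4: 16-bit IDs at 100 Mbps, 1500-byte packets -> ~3.9 seconds
ipv4 = max_probe_interval(100e6, 1500, 16)
```

The same arithmetic shows why IPv6 is tractable: the 32-bit space is 65,536 times larger, and because IPv6 routers rarely send fragments the actual ID velocity is far below line rate, so a ~15-minute interval suffices.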
What we did
Correlated routers with prefixes using traceroute paths.
[Diagram: Ark vantage points tracerouting toward prefixes 2001:db8:1::/48 and 2001:db8:2::/48]
50-60 Ark VPs traceroute every routed IPv6 prefix every day.
What we did
Computed the distance of each router from the AS announcing the network.
[Diagram: traceroute path from an Ark VP toward 2001:db8:2::/48, with hop distances 2, 1 (PE), 0 (CE)]
CE: Customer Edge; PE: Provider Edge
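The distance convention (the first hop announced by the destination AS is 0, the CE; the hop before it is 1, the PE; hops past the border are negative) can be sketched as follows. This is a simplified illustration that assumes each hop address maps cleanly to one origin AS, not the paper's full heuristic; `router_distances` is a hypothetical helper.

```python
def router_distances(path_asns, dest_asn):
    """path_asns: origin ASN for each traceroute hop, ordered from the
    vantage point toward the destination prefix. Returns each hop's
    distance from the destination AS: the first hop whose address is
    announced by dest_asn gets 0 (the CE), the hop before it 1 (the
    PE), and hops past the border are negative (inside the destination
    AS). Returns None if the destination AS never appears."""
    try:
        border = path_asns.index(dest_asn)
    except ValueError:
        return None
    return [border - i for i in range(len(path_asns))]

# Two provider hops (AS 64500) then two hops in the customer AS 64496:
# distances 2, 1 (PE), 0 (CE), -1
dists = router_distances([64500, 64500, 64496, 64496], 64496)
```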
What we did
Correlated router outage windows with the BGP control plane.
IPID sequence: T1: 9290, T2: 9291, T3: 9292, T4: 9293, T5: 9294, [outage window], T6: 1, T7: 2, T8: 3.
RouteViews activity for 2001:db8:2::/48 during the outage window (W = withdrawal, A = announcement):
T5.2: Peer-1 W, Peer-2 W; T5.3: Peer-3 W, Peer-4 W; T5.8: Peer-1 A, Peer-2 A, Peer-3 A, Peer-4 A.
Classified impact on BGP according to observed activity overlapping with the inferred outage.
What we did
• Complete Withdrawal: all peers simultaneously withdrew the route for at least 70 seconds
  - Single Point of Failure (SPoF)
• Partial Withdrawal: at least one peer withdrew the route for at least 70 seconds, but not all did
• Churn: BGP activity for the prefix, without a sustained withdrawal
• No Impact: no observed BGP activity for the prefix
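A minimal sketch of this classification, assuming per-peer withdrawal state has already been extracted from the RouteViews updates. It ignores the simultaneity requirement of the real Complete Withdrawal definition, and is illustrative rather than the paper's code.

```python
def classify_impact(peers, withdrawn_70s, had_activity):
    """Classify BGP impact on one prefix during an inferred outage.
    peers: peers advertising the prefix beforehand.
    withdrawn_70s: peers that withdrew the route for >= 70 seconds
    (simplification: simultaneity of the withdrawals is not checked).
    had_activity: whether any BGP updates for the prefix overlapped
    the outage window."""
    if withdrawn_70s and set(withdrawn_70s) == set(peers):
        return "complete-withdrawal"   # the SPoF signal
    if withdrawn_70s:
        return "partial-withdrawal"
    if had_activity:
        return "churn"
    return "no-impact"
```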
What we did: Data Collection Summary
• Probed IPv6 routers at ~15 minute intervals from 18 Jan 2015 to 30 May 2017 (approx. 2.5 years)
• 149,560 routers allowed reboots to be detected
• We inferred that 59,175 (40%) rebooted at least once, 750K reboots in total
[Figure: CDF of the number of outages per router, 1 to 100]
What we found
• We inferred 2,385 (4%) of the 59K routers that rebooted to be a SPoF for at least one IPv6 prefix in BGP
• Of SPoF routers, we inferred 59% to be customer edge routers, 8% provider edge, and 29% within the destination AS
• No covering prefix for 70% of withdrawn prefixes
  - During a one-week sample, the presence of a covering prefix during withdrawal did not imply data-plane reachability
• IPv6 router reboots correlated with IPv4 BGP control-plane activity
Limitations
• Applicability to IPv4 depends on the router being dual-stack
• Requires the IPID to be assigned from a counter
  - Cisco, Huawei, Vyatta, MikroTik, HP assign from a counter
  - 27.1% of interfaces responsive for 14 days assigned IPIDs from a counter
• A router outage might end before all peers withdraw the route
  - Path exploration + Minimum Route Advertisement Interval (MRAI) + Route Flap Damping (RFD)
• Complex events: multiple router outages but only one detected
  - We observed some complex events and filtered them out
Validation

Network              Reboots: ok / wrong / ?    SPoF: ok / wrong / ?
US University                  7 /  0  /  8           7 /  0  /  8
US R&E backbone #1             2 /  0  /  3           3 /  2  /  0
US R&E backbone #2             3 /  0  /  1           0 /  0  /  4
NZ R&E backbone               11 /  0  / 22           4 /  2  / 27
Total                         23 /  0  / 34          14 /  4  / 39
(ok = validated inference, wrong = incorrect inference, ? = not validated)

It was challenging to get validation data: operators often could only tell us about the last reboot.
No falsely inferred reboots: we correctly observed the last known reboot of each router.
We did not detect some SPoFs.
Data Collection Summary
[Figure: number of interfaces probed over time, Jan '15 to Jan '17, for all interfaces vs. those with incrementing IPID]

Phase  PPS  List             Unresponsive re-probe
(a)    100  Static, 83K      12-24 hours
(b)    225  Static, ~1.1M    12-24 hours
(c)    200  Dynamic, ~2.4M   7-14 days
Correlating BGP/router outages
[Figure: control, six hours prior to the inferred outages, Feb 2015. Fraction of reboot/prefix pairs showing churn, partial withdrawal, and complete withdrawal vs. distance of the router from the destination AS (IP hops, 12 down to -5; negative values are inside the destination AS)]
Correlating BGP/router outages
[Figure: during the inferred outages, Feb 2015. Same axes as the control: fraction of reboot/prefix pairs showing churn, partial withdrawal, and complete withdrawal vs. distance of the router from the destination AS]
BGP Prefix Withdrawals: SPoF
[Figure: CDF of complete-withdrawal durations, from under 1 minute to the maximum]
44% lasted less than 5 minutes, suggestive of router maintenance or a router crash.
SPoF prefixes mostly single homed
[Figure: fraction of the population by router hop distance (-3 to 3; CE = 0, PE = 1), for prefixes announced through a single upstream vs. multiple upstreams]
Especially SPoFs outside the destination AS, as expected.
Impact on IPv4 prefixes in BGP
[Figure: cumulative fraction of router outages vs. withdrawn peers/advertising peers, before the outage (control) and during the outage]
We examined IPv4 prefixes for a 5% sample of reboots. 19% of correlated IPv4 prefixes were withdrawn by at least 90% of peers during the router outage window.
Summary
• Step towards root-cause analysis of inter-domain routing outages and events
  - Explore applicability of the method to measurement of other critical Internet infrastructure: DNS, Web, Email
• In our 2.5-year sample of 59K routers that rebooted
  - 4% (2.3K) were SPoF
  - SPoF were mostly confined to the edge: 59% customer edge
• We released our code as part of scamper
  https://www.caida.org/tools/measurement/scamper/
Backup Slides
Impact on IPv4 Services
We examined IPv4 prefixes for a 5% sample of reboots where the prefix was withdrawn by at least 90% of peers during the router outage window.
Services in those prefixes (censys.io, April 2017):
Active Hosts  39,107
HTTP          25,592  (Web)
HTTPS         16,321  (Web)
SSH           11,277
DNS            7,922
SMTP           7,383  (Email)
IMAP           5,127  (Email)
Partial Withdrawals
[Figure: CDF of the fraction of peers withdrawing the route]
50% of pairs had 1-2 peers withdraw the prefix; 10% of pairs had nearly all peers withdraw the prefix.
Degrees of ASes monitored
[Figure: cumulative fraction of rebooting ASes vs. AS degree (1 to 1000), for the monitored population and for ASes with a SPoF]
ASes that were inferred to have a SPoF were disproportionately low-degree ASes.
Activity for IPv4 prefixes in BGP
[Figure: cumulative fraction of router outages vs. peers sending updates/total peers, before and during the outage]
At least 70% of peers reported BGP activity on IPv4 prefixes for 50% of the inferred router outages.
Reboot Window Durations
[Figure: CDF of reboot window durations, from under 1 minute to the maximum]
Half the maximum reboot lengths were less than 30 minutes (~two probing rounds).
Router + BGP outage correlation
Router IPID sequence: ... 10, 11, 12, [outage window], 1, 2, 3 ...
BGP sequence (W = withdrawal, A = announcement) relative to the outage window:
• Withdraw-Contained: W and A both within the outage window
• Outage-Contained: W before the window, A after it
• Withdraw-Before: W before the window, A within it
• Announce-After: W within the window, A after it
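One way to operationalize the four overlap patterns (Withdraw-Contained, Outage-Contained, Withdraw-Before, Announce-After) is to classify where the withdrawal (W) and re-announcement (A) fall relative to the outage window. The boundary placements below are an assumption read off the diagram, not the paper's exact rules.

```python
def correlate(outage_start, outage_end, withdraw_t, announce_t):
    """Relate one peer's BGP withdraw (W) / re-announce (A) pair to an
    inferred outage window; names follow the talk's four patterns."""
    w_in = outage_start <= withdraw_t <= outage_end
    a_in = outage_start <= announce_t <= outage_end
    if w_in and a_in:
        return "withdraw-contained"   # W and A both inside the window
    if withdraw_t < outage_start and announce_t > outage_end:
        return "outage-contained"     # outage wholly inside W..A
    if not w_in and a_in:
        return "withdraw-before"      # W precedes the window, A inside
    if w_in and not a_in:
        return "announce-after"       # W inside, A after the window
    return "uncorrelated"
```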
Data processing pipeline
[Diagram: CAIDA IPv6 topology supplies router targets to the uptime prober; <ip, time, ipid> samples are stored in Cassandra and turned into inferred reboots; RouteViews <peer, time, prefix> data plus AS border distance feed the BGP correlation, which outputs single points of failure]
Inferring router position
[Diagram: traceroute paths through routers R1-R5, crossing from AS X into AS Y.
(a) Interface addresses routed by Y appear in the traceroute: hops receive distances 2, 1, 0, -1, -2 across the border.
(b) No interface addresses routed by Y appear in the traceroute: hops receive distances 2, 1, 0, and whether the border router is the customer edge (CE) or provider edge (PE) router is ambiguous.]
Data Collection Summary

               (a) 18 Jan '15 to 18 Oct '16   (b) 18 Oct '16 to 24 Feb '17   (c) 24 Feb '17 to 30 May '17
Probing rate   100 pps                        225 pps                        200 pps
Interfaces     83K, seen Dec '14              1.1M, seen Jun to Oct '16      Dynamic, 2.4M in May '17
Responsive     every round (~15 mins)         every round (~15 mins)         every round (~15 mins)
Unresponsive   12-24 hours                    12-24 hours                    7-14 days
Why IPv6 fragment IDs?
IPv4 ID values are 16 bits with bursty velocity, as every packet requires a unique value. At 100 Mbps and 1500-byte packets, the Nyquist rate dictates a 4-second probing interval.
[Diagram: IPv4 header, highlighting the 16-bit identification field]
IPv6 ID values are 32 bits with low velocity, as systems rarely send fragmented packets.
[Diagram: IPv6 header and Fragment extension header, highlighting the 32-bit identification field]
Soliciting IPv6 Fragment IDs
Exchange between the prober and the router:
• echo request, 1300 bytes
• packet too big, MTU 1280
• echo reply, 1300 bytes
• echo request, 1300 bytes
• echo reply, 1280 bytes (fragmented; Fragment ID: 12345)
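The fragment sizes in this exchange follow from IPv6 (RFC 8200) fragmentation arithmetic; the sketch below is a hypothetical helper, not packet-handling code. A 1300-byte echo reply carries 1260 bytes after its 40-byte IPv6 header; with a 1280-byte path MTU, each fragment can carry at most 1232 data bytes (MTU minus the 40-byte IPv6 header and 8-byte Fragment header, rounded down to a multiple of 8), so the reply is resent as two fragments sharing one 32-bit Fragment ID.

```python
def ipv6_fragments(payload_len, mtu=1280):
    """Split an IPv6 packet's post-header payload into fragment data
    sizes, as a sender would after learning a Packet Too Big MTU.
    Each fragment = 40-byte IPv6 header + 8-byte Fragment header +
    data; non-final fragments carry a multiple of 8 data bytes."""
    per_frag = (mtu - 40 - 8) // 8 * 8   # usable data per fragment
    sizes, left = [], payload_len
    while left > 0:
        take = min(per_frag, left)
        sizes.append(take)
        left -= take
    return sizes

# 1300-byte echo reply -> 1260 payload bytes -> fragments of 1232 + 28
frags = ipv6_fragments(1260)
```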