Network Inference from TraceRoute Measurements: Internet ...
Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to...
Transcript of Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to...
![Page 1: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/1.jpg)
Detecting network outages using different
sources of data
TMA Experts Summit, Paris, France
Cristel Pelsser
University of Strasbourg / ICube
June, 2019
1 / 44
![Page 2: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/2.jpg)
Some perspective on:
• From unsollicited traffic
Detecting Outages using Internet Background Radiation.
Andreas Guillot (U. Strasbourg), Romain Fontugne (IIJ),
Philipp Winter (CAIDA), Pascal Merindol (U. Strasbourg),
Alistair King (CAIDA), Alberto Dainotti (CAIDA), Cristel
Pelsser (U. Strasbourg). TMA 2019.
• From highly distributed permanent TCP connections
Disco: Fast, Good, and Cheap Outage Detection. Anant Shah
(Colorado State U.), Romain Fontugne (IIJ), Emile Aben
(RIPE NCC), Cristel Pelsser (University of Strasbourg), Randy
Bush (IIJ, Arrcus). TMA 2017.
• From large-scale traceroute measurements
Pinpointing Anomalies in Large-Scale Traceroute
Measurements. Romain Fontugne (IIJ), Emile Aben (RIPE
NCC), Cristel Pelsser (University of Strasbourg), Randy Bush
(IIJ, Arrcus). IMC 2017.
2 / 44
![Page 3: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/3.jpg)
Understanding Internet health? (Motivation)
• To speedup failure identification and thus recovery
• To identify weak areas and thus guide network design
3 / 44
![Page 4: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/4.jpg)
Understanding Internet health? (Problem 1)
Manual observations and operations
• Traceroute / Ping / Operators’ group mailing lists
• Time consuming
• Slow process
• Small visibility
→ Our goal: Automaticaly pinpoint network disruptions
(i.e. congestion and network disconnections)
4 / 44
![Page 5: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/5.jpg)
Understanding Internet health? (Problem 2)
A single viewpoint is not enough
→ Our goal: mine results from deployed platforms
→ Cooperative and distributed approach
→ Using existing data, no added burden to the network
5 / 44
![Page 6: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/6.jpg)
Outage detection from unsollicited
traffic
![Page 7: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/7.jpg)
Dataset: Internet Background Radiation
Internet
P1
P1 is advertised to the Internet
7 / 44
![Page 8: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/8.jpg)
Dataset: Internet Background Radiation
Internet
P1
P1 is advertised to the Internet
Scans, responses to spoofed traffic
7 / 44
![Page 9: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/9.jpg)
Dataset: Internet Background Radiation
Spoofed traffic
Internet
P1
P1 is advertised to the Internet
Scans, responses to spoofed traffic
Sends traffic with source in P1
7 / 44
![Page 10: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/10.jpg)
Dataset: Internet Background Radiation
Spoofed traffic
Internet
P1
P1 is advertised to the Internet
Scans, responses to spoofed traffic
Sends traffic with source in P1
Responds to spoofed traffic
7 / 44
![Page 11: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/11.jpg)
Dataset: IP count time-series (per country or AS)
Use cases: Attacks, Censorship, Local outages detection
2011-01-14 2011-01-17 2011-01-20 2011-01-23 2011-01-26 2011-01-29 2011-02-01 2011-02-04 2011-02-07Time
0
200
400
600
800
Num
ber o
f uni
que
sour
ce IP Original time series
Figure 1: Egyptian revolution
⇒ More than 60 000 time series in the CAIDA telescope data.
We use drops in the time series are indicators of an outage.
8 / 44
![Page 12: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/12.jpg)
Current methodology used by IODA
Detecting outages using fixed thresholds
9 / 44
![Page 13: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/13.jpg)
Our goal
Detecting outages using dynamic thresholds
10 / 44
![Page 14: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/14.jpg)
Outage detection process
2011-01-14
2011-01-17
2011-01-20
2011-01-23
2011-01-26
2011-01-29
2011-02-01
2011-02-04
2011-02-07
Time
0
200
400
600
800
Num
ber o
f uni
que
sour
ce IP
Training Validation TestOriginal time series
11 / 44
![Page 15: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/15.jpg)
Outage detection process
2011-01-14
2011-01-17
2011-01-20
2011-01-23
2011-01-26
2011-01-29
2011-02-01
2011-02-04
2011-02-07
Time
0
200
400
600
800
Nu
mb
er
of
uniq
ue s
ou
rce IP Training Calibration Test
Original time series
Predicted time series
Prediction and confidence interval
11 / 44
![Page 16: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/16.jpg)
Outage detection process
2011-01-14
2011-01-17
2011-01-20
2011-01-23
2011-01-26
2011-01-29
2011-02-01
2011-02-04
2011-02-07
Time
0
200
400
600
800
Num
ber o
f uni
que
sour
ce IP Training Validation Test
Original time seriesPredicted time series
• When the real data is outside the prediction interval, we raise
an alarm.
• We want a prediction model that is robust to the seasonality
and noise in the data → We use the SARIMA model1.
1More details on the methodology on wednesday.11 / 44
![Page 17: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/17.jpg)
Validation: ground truth
Characteristics
• 130 known outages
• Multiple spatial scales
• Countries
• Regions
• Autonomous Systems
• Multiple durations (from an hour to a week)
• Multiple causes (intentional or non intentional)
12 / 44
![Page 18: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/18.jpg)
Evaluating our solution
Objectives
• Identifying the
minimal number of IP
addresses
• Identifying a good
threshold
Threshold
• TPR of 90% and
FPR of 2%
0.0 0.2 0.4 0.6 0.8 1.0False Positive Rate
0.0
0.2
0.4
0.6
0.8
1.0
True
Pos
itive
Rat
e
All time series< 20 IPs> 20 IPs2 sigma - 95%3 sigma - 99.5%5 sigma - 99.99%
Figure 2: ROC curve
13 / 44
![Page 19: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/19.jpg)
Comparing our proposal (Chocolatine) to CAIDA’s tools
• More events detected than the simplistic thresholding
technique (DN)
• Higher overlap with other detection techniques
• Not a complete overlap→ difference in dataset coverage
→ different sensitivities to outages
DN17
BGP644
AP633
36
985 3
15
Chocolatine251
BGP445
AP440
235
489 196
511
14 / 44
![Page 20: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/20.jpg)
Outage detection from highly dis-
tributed permanent TCP connec-
tions
![Page 21: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/21.jpg)
Proposed Approach
Disco:
• Monitor long-running TCP
connections and synchronous
disconnections from related
network/area
• We apply Disco on RIPE Atlas
data, where probes are widely
distributed at the edge and behind
NATs/CGNs providing visibility
Trinocular may not have
→ Outage = synchronous disconnections from the same
topological/geographical area
16 / 44
![Page 22: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/22.jpg)
Assumptions / Design Choices
Rely on TCP disconnects
• Hence the granularity of detection is dependent on TCP
timeouts
Bursts of disconnections are indicators of interesting outage
• While there might be non bursty outages that are interesting,
Disco is designed to detect large synchronous disconnections
17 / 44
![Page 23: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/23.jpg)
Proposed System: Disco & Atlas
RIPE Atlas platform
• 10k probes worldwide
• Persistent connections with
RIPE controllers
• Continuous traceroute
measurements
(see outages from inside)
→ Dataset: Stream of probe connection/disconnections
(from 2011 to 2016)
18 / 44
![Page 24: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/24.jpg)
Disco Overview
1. Split disconnection
stream in sub-streams
(AS, country,
geo-proximate
50km radius)
2. Burst modeling and
outage detection
3. Aggregation and
outage reporting
19 / 44
![Page 25: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/25.jpg)
Why Burst Modeling?
Goal: How to find synchronous disconnections?
• Time series conceal
temporal characteristics
• Burst model estimates
disconnections arrival
rate at any time
Implementation: Kleinberg burst model2
2J. Kleinberg. “Bursty and hierarchical structure in streams”, Data Mining
and Knowledge Discovery, 2003.20 / 44
![Page 26: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/26.jpg)
Burst modeling: Example
• Monkey causes blackout in
Kenya at 8:30 UTC June,
7th 2016
• Same day RIPE rebooted
controllers
21 / 44
![Page 27: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/27.jpg)
Results
Outage detection:
• Atlas probes disconnections from 2011 to 2016
• Disco found 443 significant outages
Outage characterization and validation:
• Traceroute results from probes (buffered if no connectivity)
• Outage detection results from Trinocular
22 / 44
![Page 28: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/28.jpg)
Validation (Traceroute)
Comparison to traceroutes:
• Probes in detected outages can reach traceroutes destination?
→ Velocity ratio: proportion of completed traceroutes in
given time
0.0 0.5 1.0 1.5 2.0
R (Average Velocity Ratio)
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
Pro
babili
ty M
ass F
unction
Normal
Outage
→ Velocity ratio ≤ 0.5 for 95% of detected outages23 / 44
![Page 29: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/29.jpg)
Validation (Trinocular)
Comparison to Trinocular (2015):
• Disco found 53 outages in 2015
• Corresponding to 851 /24s (only 43% is responsive to ICMP)
Results for /24s reported by Disco and pinged by Trinocular:
• 33/53 are also found by Trinocular
• 9/53 are missed by Trinocular (avg time of outages < 1hr)
• Other outages are partially detected by Trinocular
23 outages found by Trinocular are missed by Disco
• Disconnections are not very bursty in these cases
→ Disco’s precision: 95%, recall: 67% 24 / 44
![Page 30: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/30.jpg)
Outage detection from large-scale
traceroute measurements
![Page 31: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/31.jpg)
Dataset: RIPE Atlas traceroutes
Two repetitive large-scale measurements
• Builtin: traceroute every 30 minutes to all DNS root servers
(≈ 500 server instances)
• Anchoring : traceroute every 15 minutes to 189 collaborative
servers
Analyzed dataset
• May to December 2015
• 2.8 billion IPv4 traceroutes
• 1.2 billion IPv6 traceroutes
26 / 44
![Page 32: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/32.jpg)
Monitor delays with traceroute?
Traceroute to “www.target.com”
Round Trip Time (RTT) between B and C?
Report abnormal RTT between B and C?
27 / 44
![Page 33: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/33.jpg)
Monitor delays with traceroute?
Challenges:
• Noisy data
• Traffic
asymmetry
0 2 4 6 8 10 12Number of hops
0
50
100
150
200
250
300
RT
T (
ms)
Traceroutes from CZ to BD
28 / 44
![Page 34: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/34.jpg)
Monitor delays with traceroute?
Challenges:
• Noisy data
• Traffic
asymmetry
0 2 4 6 8 10 12Number of hops
0
50
100
150
200
250
300
RT
T (
ms)
Traceroutes from CZ to BD
28 / 44
![Page 35: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/35.jpg)
What is the RTT between B and C?
RTTC - RTTB = RTTCB?
29 / 44
![Page 36: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/36.jpg)
What is the RTT between B and C?
RTTC - RTTB = RTTCB?
• No!
• Traffic is asymmetric
• RTTB and RTTC take different return paths!
• Differential RTT: ∆CB = RTTC − RTTB = dBC + ep
30 / 44
![Page 37: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/37.jpg)
What is the RTT between B and C?
RTTC - RTTB = RTTCB?
• No!
• Traffic is asymmetric
• RTTB and RTTC take different return paths!
• Differential RTT: ∆CB = RTTC − RTTB = dBC + ep30 / 44
![Page 38: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/38.jpg)
Problem with differential RTT
Monitoring ∆CB over time:
Time
0
10
20
30
∆R
TT
→ Delay change on BC? CD? DA? BA???31 / 44
![Page 39: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/39.jpg)
Proposed Approach: Use probes with different return paths
Differential RTT: ∆CB = x0
32 / 44
![Page 40: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/40.jpg)
Proposed Approach: Use probes with different return paths
Differential RTT: ∆CB = {x0, x1}
32 / 44
![Page 41: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/41.jpg)
Proposed Approach: Use probes with different return paths
Differential RTT: ∆CB = {x0, x1, x2, x3, x4}
32 / 44
![Page 42: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/42.jpg)
Proposed Approach: Use probes with different return paths
Differential RTT: ∆CB = {x0, x1, x2, x3, x4}
Median ∆CB : • Stable if a few return paths delay change
• Fluctuate if delay on BC changes32 / 44
![Page 43: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/43.jpg)
Median Diff. RTT: Tier1 link, 2 weeks of data, 95 probes
−400
−300
−200
−100
0
100
200
300
400
Diffe
rential R
TT
(m
s)
130.117.0.250 (Cogent, Zurich) - 154.54.38.50 (Cogent, Munich)
Raw values
Jun 02 2015
Jun 04 2015
Jun 06 2015
Jun 08 2015
Jun 10 2015
Jun 12 2015
Jun 14 2015
4.8
5.0
5.2
5.4
5.6
Diffe
rential R
TT
(m
s)
Median Diff. RTT
Normal Reference
• Stable despite noisy RTTs
(not true for average)
• Normally distributed
33 / 44
![Page 44: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/44.jpg)
Detecting congestion
Nov 26 2015
Nov 27 2015
Nov 28 2015
Nov 29 2015
Nov 30 2015
Dec 01 2015−10
−5
0
5
10
15
20
25
30
Diffe
ren
tia
l R
TT
(m
s)
72.52.92.14 (HE, Frankfurt) - 80.81.192.154 (DE-CIX (RIPE))
Median Diff. RTT
Normal Reference
Detected Anomalies
Significant RTT changes:
Confidence interval not overlapping with the normal reference
34 / 44
![Page 45: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/45.jpg)
Results
Analyzed dataset
• Atlas builtin/anchoring measurements
• From May to Dec. 2015
• 2.8 billion IPv4 traceroutes
• 1.2 billion IPv6 traceroutes
• Observed 262k IPv4 and 42k IPv6 links (core links)
We found a lot of congested links!
Let’s look at one example
35 / 44
![Page 46: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/46.jpg)
Study case: Telekom Malaysia BGP leak
36 / 44
![Page 47: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/47.jpg)
Study case: Telekom Malaysia BGP leak
37 / 44
![Page 48: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/48.jpg)
Study case: Telekom Malaysia BGP leak
37 / 44
![Page 49: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/49.jpg)
Study case: Telekom Malaysia BGP leak
37 / 44
![Page 50: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/50.jpg)
Study case: Telekom Malaysia BGP leak
Not only with Google... but about 170k prefixes!
37 / 44
![Page 51: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/51.jpg)
Congestion in Level3
Rerouted traffic has congested Level3 (120 reported links)
• Example: 229ms increase between two routers in London!
Jun 08 2015
Jun 09 2015
Jun 10 2015
Jun 11 2015
Jun 12 2015
Jun 13 2015
−50
0
50
100
150
200
250
300
350
Diffe
ren
tia
l R
TT
(m
s)
67.16.133.130 - 67.17.106.150
Median Diff. RTT
Normal Reference
Detected Anomalies
38 / 44
![Page 52: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/52.jpg)
Congestion in Level3
Reported links in London:
Delay increase
Delay & packet loss
→ Traffic staying within UK/Europe may also be altered
39 / 44
![Page 53: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/53.jpg)
But why did we look at that?
Per-AS alarm for delay
40 / 44
![Page 54: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/54.jpg)
Conclusions and perspectives (1)
We proposed 3 different techniques to detect outages for 3
different sources of data
• Each source of data has its own coverage
• Core links (congestion and failures)
• Prefix, country, region, AS disconnections
41 / 44
![Page 55: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/55.jpg)
Conclusions and perspectives (1)
We proposed 3 different techniques to detect outages for 3
different sources of data
• Each source of data has its own coverage
• Core links (congestion and failures)
• Prefix, country, region, AS disconnections
• Each source of data has its own noise, properties
• Identifying the suitable model is a challenge
41 / 44
![Page 56: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/56.jpg)
Conclusions and perspectives (2)
There is no substancial, state of the art ground truth to validate
the results. We resort to
• the comparison of different techniques with different coverages
• evaluations on the basis of partial ground truth
• characterizations of the detected outages based on the
detection algorithm used
42 / 44
![Page 57: Detecting network outages using di erent sources of data · Builtin: traceroute every 30 minutes to all DNS root servers (ˇ500 server instances) Anchoring: traceroute every 15 minutes](https://reader034.fdocuments.net/reader034/viewer/2022050221/5f665d42e4569545646fa93d/html5/thumbnails/57.jpg)
Turn this
43 / 44