1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian...

23
1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University of Michigan) Jennifer Rexford (Princeton University) Jia Wang (AT&T Labs Research)
  • date post

    19-Dec-2015
  • Category

    Documents

  • view

    214
  • download

    0

Transcript of 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian...

Page 1: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

1

Finding a Needle in a Haystack: Pinpointing Significant BGP Routing

Changes in an IP Network

Jian Wu (University of Michigan) Z. Morley Mao (University of Michigan) Jennifer Rexford (Princeton University)

Jia Wang (AT&T Labs Research)

Page 2: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

2

Motivation

CBR

CBRCBR

CBRAS1

AS2 AS3

destination

AB

C

D

Failure

Disruption

Congestion

Mitigation

AS4

source

A backbone network is vulnerable to routing changes that occur in other domains.

Page 3: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

3

Goal

Identify important routing anomalies Lost reachability Persistent flapping Large traffic shifts

Contributions:

•Build a tool to identify a small number of important routing disruptions from a large volume of raw BGP updates in real time.

•Use the tool to characterize routing disruptions in an operational network

Page 4: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

4

Interdomain Routing:Border Gateway Protocol

Prefix-based: one route per prefix Path-vector: list of ASes in the path Incremental: every update indicates a change Policy-based: local ranking of routes

CBRCBR CBRCBRCBRCBRCBRCBR

“I can reach 12.34.158.0/24”

“I can reach 12.34.158.0/24

via AS 1”

AS 1AS 2

12.34.158.5

data traffic data trafficAS 3iBGPeBGP eBGP

12.34.158.0/24

Page 5: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

5

Capturing Routing Changes

CBRCBR

CPEBGP Monit

or

CBRCBR

CBRCBR

CBRCBR

CBRCBR

CBRCBR

iBGP

iBG

P

iBGP

eBGP

eBG

P

eBGPUpdatesUpdates

Best routesBest ro

utes

A large operational network(8/16/2004 – 10/10-2004)

Page 6: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

6

Challenges

Large volume of BGP updates Millions daily, very bursty Too much for an operator to manage

Different from root-cause analysis Identify changes and their effects Focus on actionable events rather than

diagnosis Diagnose causes in/near the AS

Page 7: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

7

System Architecture

Event Classification

Event Classification

“Typed”Events

EEBR

EEBR

EEBR

BGP Updates

(106)

BGP Update Grouping

BGP Update Grouping

Events

Persistent Flapping Prefixes

(101)

(105)

EventCorrelation

EventCorrelation

Clusters

Frequent Flapping Prefixes

(103)

(101)

Traffic ImpactPrediction

Traffic ImpactPrediction

EEBREEBR EEBR

LargeDisruptions

Netflow Data

(101)

From millions of updates to a few dozen reports

Page 8: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

8

Grouping BGP Update into Events

Challenge: A single routing change leads to multiple update messages affects routing decisions at multiple routers

Approach:

•Group together all updates for a prefix with inter-arrival < 70 seconds•Flag prefixes with changes lasting > 10 minutes.

BGP Update Grouping

BGP Update Grouping

EEBR

EEBR

EEBR

BGP Updates

Events

Persistent Flapping Prefixes

Page 9: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

9

Grouping Thresholds Based on our understanding of BGP

and data analysis Event timeout: 70 seconds

2 * MRAI timer + 10 seconds 98% inter-arrival time < 70 seconds

Convergence timeout: 10 minutes BGP usually converges within a few

minutes 99.9% events < 10 minutes

Page 10: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

10

Persistent Flapping Prefixes

Types of persistent flapping Conservative damping parameters (78.6%) Protocol oscillations due to MED (18.3%) Unstable interfaces or BGP sessions (3.0%)

A surprising finding: 15.2% of updates were caused by persistent-flapping prefixes even though flap damping is enabled.

Page 11: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

11

Example: Unstable eBGP Session

ISP Peer

CustomerEC

EB

EA ED

p

Flap damping parameters is session-based Damping not implemented for iBGP sessions

Page 12: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

12

Event Classification

Challenge: Major concerns in network management Changes in reachability Heavy load of routing messages on the routers Change of flow of the traffic through the network

Event Classification

Event ClassificationEvents

“Typed” Events,e.g., Loss/Gain of Reachability

Solution: classify events by severity of their impact

Page 13: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

13

Event Category – “No Disruption”

ISP

EA

p

EB

EC

EE

AS2

ED

AS1

No Traffic Shift

“No Disruption”:

no border routers have any traffic shift. (50.3%)

Page 14: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

14

Event Category – “Internal Disruption”

ISP

EA

p

EB

EC

EE

AS2

ED

AS1

Internal Traffic Shift

“Internal Disruption”:

all traffic shifts are internal. (15.6%)

Page 15: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

15

Event Category – “Single External Disruption”

ISP

EA

p

EB

EC

EE

AS2

ED

AS1

external Traffic Shift

“Single External Disruption”:

only one of the traffic shifts is external (20.7%)

Page 16: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

16

Statistics on Event Classification

Events Updates

No Disruption 50.3% 48.6%

Internal Disruption 15.6% 3.4%

Single External Disruption 20.7% 7.9%

Multiple External Disruption 7.4% 18.2%

Loss/Gain of Reachability 6.0% 21.9%

First 3 categories have significant day-to-day variations

Updates per event depends on the type of events and the number of affected routers

Page 17: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

17

Event Correlation

Challenge: A single routing change affects multiple destination prefixes

EventCorrelation

EventCorrelation“Typed”

EventsClusters

Solution: group the same-type, close-occurring events

Page 18: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

18

EBGP Session Reset Caused most of “single external disruption”

events Check if the number of prefixes using that

session as the best route changes dramatically

Validation with Syslog router report (95%)time

Number of prefixes

session failure

session recovery

Page 19: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

19

Hot-Potato Changes Hot-Potato Changes

Caused “internal disruption” events Validation with OSPF measurement (95%)

[Teixeira et al – SIGMETRICS’ 04]

ISP

P

EA EB

EC

10119

“Hot-potato routing” = route to closest egress point

Page 20: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

20

Traffic Impact Prediction

Challenge: Routing changes have different impacts on the network which depends on the popularity of the destinations

Traffic ImpactPrediction

Traffic ImpactPrediction

EEBR

Clusters LargeDisruptions

Netflow Data

EEBR EEBR

Solution: weigh each cluster by traffic volume

Page 21: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

21

Traffic Impact Prediction Traffic weight

Per-prefix measurement from netflow 10% prefixes accounts for 90% of traffic

Traffic weight of a cluster the sum of “traffic weight” of the

prefixes A small number of large clusters have

large traffic weight Mostly session resets and hot-potato

changes

Page 22: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

22

Performance Evaluation Memory

Static memory: “current routes”, 600 MB Dynamic memory: “clusters”, 300 MB

Speed 99% of intervals of 1 second of updates

can be process within 1 second Occasional execution lag Every interval of 70 seconds of updates

can be processed within 70 seconds

Measurements were based on 900MHz CPU

Page 23: 1 Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network Jian Wu (University of Michigan) Z. Morley Mao (University.

23

Conclusion BGP troubleshooting system

Fast, online fashion Operators’ concerns (reachability, flapping, traffic) Significant information reduction

millions of update a few dozens of large disruptions

Uncovered important network behavior Hot-Potato changes Session resets Persistent-flapping prefixes