Routing Convergence and the Impact of Scale Dan Massey Colorado State University.

Post on 27-Mar-2015


Routing Convergence and the Impact of Scale

Dan Massey, Colorado State University

26 October 05, massey@cs.colostate.edu

Internet Routing and BGP

The Internet is divided into Autonomous Systems (ASes).

At this scale, maintaining the entire topology at every router is not feasible.

BGP is the inter-AS routing protocol.
Router stores the AS path to a destination.
Path allows router to apply policies.

How quickly does BGP converge after a change?
Can BGP continue to scale with more growth?
Do we need BGP changes or a new protocol?


BGP Path Exploration

[Figure: example topology with nodes A (dest.), B, C, D, E, F, G, H, I and Z. Path labels shown in the figure: (B A), (C B A), (E D B A), (H G F E A). Withdrawals, drawn as ( ), propagate through the topology.]

Z's candidate paths:

() (C B A) (E D B A) (I H G F A)
() () (E D B A) (I H G F A)
() () () (I H G F A)

Obsolete paths: (C B A), (E D B A)
If Z knew [B A] failed, it could have avoided the obsolete paths
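The exploration on this slide can be replayed in a few lines. This is a minimal sketch, not a BGP implementation: it assumes Z keeps one candidate path per neighbor, always prefers the shortest remaining path, and receives the withdrawals in the order B, C, E. The function name `explore` and the dictionary layout are hypothetical.

```python
# Illustrative sketch only: Z stores one candidate path per neighbor and
# (as an assumption) always prefers the shortest remaining path.

def explore(candidates, withdrawal_order):
    """Return the path Z selects after each withdrawal arrives."""
    table = dict(candidates)          # neighbor -> path learned from it
    selected = []
    for neighbor in withdrawal_order:
        table.pop(neighbor, None)     # that neighbor withdrew its path
        best = min(table.values(), key=len) if table else None
        selected.append(best)
    return selected


# Candidate paths as listed on the slide, keyed by the announcing neighbor.
candidates = {
    "B": ("B", "A"),
    "C": ("C", "B", "A"),
    "E": ("E", "D", "B", "A"),
    "I": ("I", "H", "G", "F", "A"),
}

# Assumed arrival order of the withdrawals after link [B A] fails.
steps = explore(candidates, ["B", "C", "E"])
for path in steps:
    print(path)
```

Under these assumptions Z selects (C B A) and then (E D B A), both obsolete, before it settles on (I H G F A).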


Path Exploration and Policy

Internet routing does not always select the shortest path

Policies limit the number of potential paths, especially at high-level tiers.

Example: Due to routing policy, AS-X (lower tier) sees more alternate paths than AS-Z (tier-1):
Via multiple providers
Via peers

[Figure: example topology with ASes Z, X, Y, W, P1 and P2.]


Impact of Topology Growth

Denser connectivity => more alternate paths

Impact depends on policies and tier
Lower-tier nodes see more slow convergence

(Slide labels: MRAI off, MRAI on)

Beacon prefix 198.32.7.0/24

                    Jan 2, 2004           Dec 2, 2004
RV peer (AS#)       #updates  #paths      #updates  #paths
1239 (tier1)        44        4           37        4
1221                62        8           87        11
2914 (tier1)        106       6           279       7
3557                102       19          198       39


Convergence Improvements

MRAI Timer (Deployed Now)
Require minimum time between updates
Typically 30 seconds

Assertion Checking (Proposed in INFOCOM 02)
Signal policy or topological failure in some cases
Discard routes that include failed subpath

Ghost Flushing (Proposed in INFOCOM 03)
When the MRAI timer delays an update, send a withdrawal

Attach Failure Notification (INFOCOM05, CompNet05)
Explicitly list the cause of the failure


MRAI Rate-Limiting Timer

Minimum Route Advertisement Interval (MRAI) timer:

Within M=30 seconds, at most one announcement from A to B

A's path changes: P1 P2 P3 P4 P5
Msgs from A to B: P1 (time=0), P4 (time=30), P5 (time=60)

Impact:
a. suppress transient changes
b. delay convergence
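The timeline on this slide can be reproduced with a small scheduler. This is a minimal sketch under stated assumptions: the times at which A's path changes (0, 5, 12, 20 and 35 seconds) are hypothetical, since the slide only gives the send times, and the function name `mrai_schedule` is invented for illustration. Real routers keep the timer per peer and add jitter, which is not modeled here.

```python
# Illustrative sketch only: one sender (A), one receiver (B), no jitter.

def mrai_schedule(changes, mrai=30):
    """changes: list of (time, path) sorted by time.
    Returns the announcements actually sent as (send_time, path)."""
    sent = []
    next_allowed = 0      # earliest time the next announcement may go out
    pending = None        # newest path held back by the timer
    for t, path in changes:
        # The timer expired before this change: flush what was held back.
        if pending is not None and next_allowed <= t:
            sent.append((next_allowed, pending))
            next_allowed += mrai
            pending = None
        if t >= next_allowed:
            sent.append((t, path))        # timer idle: send immediately
            next_allowed = t + mrai
        else:
            pending = path                # timer running: overwrite older pending path
    if pending is not None:
        sent.append((next_allowed, pending))
    return sent


# Hypothetical change times; only the send times appear on the slide.
changes = [(0, "P1"), (5, "P2"), (12, "P3"), (20, "P4"), (35, "P5")]
schedule = mrai_schedule(changes)
print(schedule)
```

With these assumed change times, P2 and P3 are never announced (transient changes are suppressed), while P4 and P5 each wait for the timer (convergence is delayed).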


MRAI and Ghost Flushing

MRAI prevents removal of stale information
Suppose P1 to P5 are increasingly worse
Neighbor believes P1 still available until time 30

A's path changes: P1 P2 P3 P4 P5
Msgs from A to B: P1 (time=0), P4 (time=30), P5 (time=60), plus withdrawals (w) sent while the MRAI timer is holding back a longer path

Ghost Flushing: if change to longer path and MRAI applies, send a withdraw (w)
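The ghost flushing rule can be sketched as a rate-limited sender that emits a withdrawal whenever the timer holds back a longer path. This is an illustrative sketch, not the published algorithm's full specification: the change times, the path lengths, and the name `ghost_flushing_schedule` are assumptions, and withdrawals are assumed to be exempt from the MRAI timer.

```python
# Illustrative sketch only: withdrawals are assumed not to be rate-limited.

WITHDRAW = "w"

def ghost_flushing_schedule(changes, mrai=30):
    """changes: list of (time, path_name, path_length) sorted by time.
    Returns the messages sent from A to B as (send_time, message)."""
    msgs = []
    next_allowed = 0
    pending = None            # (name, length) held back by the timer
    advertised_len = None     # length B currently believes; None if withdrawn
    for t, name, length in changes:
        if pending is not None and next_allowed <= t:
            msgs.append((next_allowed, pending[0]))
            advertised_len = pending[1]
            next_allowed += mrai
            pending = None
        if t >= next_allowed:
            msgs.append((t, name))
            advertised_len = length
            next_allowed = t + mrai
        else:
            pending = (name, length)
            # Longer path delayed by MRAI: flush B's stale ("ghost") route.
            if advertised_len is not None and length > advertised_len:
                msgs.append((t, WITHDRAW))
                advertised_len = None
    if pending is not None:
        msgs.append((next_allowed, pending[0]))
    return msgs


# Hypothetical times and lengths: P1 to P5 get progressively longer.
changes = [(0, "P1", 2), (5, "P2", 3), (12, "P3", 4), (20, "P4", 5), (35, "P5", 6)]
messages = ghost_flushing_schedule(changes)
print(messages)
```

In this run B stops believing in P1 at time 5 rather than at time 30, and stops believing in P4 at time 35 rather than at time 60.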


Root Cause Notification

The node that detects the failure attaches the root cause to its message
Other nodes copy the root cause to outgoing messages

[Figure: example topology with nodes A, B, C, D, E, F, G, H, I and Z. Path labels shown in the figure: (B A), (C B A), (E D B A), (H G F E A). Each withdrawal is labeled "( ), [B A] failure".]

Z's candidate paths:

() (C B A) (E D B A) (I H G F A)

The first msg is enough for Z to remove all the obsolete paths
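The effect of the root cause on Z's table can be sketched as a single filtering step. This is a minimal sketch under assumptions: the root cause is modeled as an ordered pair of adjacent ASes, and the helper names `uses_link` and `apply_root_cause` are hypothetical, not taken from any of the cited proposals.

```python
# Illustrative sketch only: a root cause is modeled as an ordered AS pair.

def uses_link(path, link):
    """True if the two ASes of `link` appear consecutively in `path`."""
    return any(path[i:i + 2] == link for i in range(len(path) - 1))

def apply_root_cause(candidates, failed_link):
    """Drop every candidate path that traverses the failed link."""
    return {
        neighbor: path
        for neighbor, path in candidates.items()
        if not uses_link(path, failed_link)
    }


# Candidate paths as listed on the slide, keyed by the announcing neighbor.
candidates = {
    "B": ("B", "A"),
    "C": ("C", "B", "A"),
    "E": ("E", "D", "B", "A"),
    "I": ("I", "H", "G", "F", "A"),
}

# The first message carrying "[B A] failure" triggers one purge.
remaining = apply_root_cause(candidates, ("B", "A"))
print(remaining)
```

Only the path learned from I survives, so Z never selects (C B A) or (E D B A).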


Fail-down Simulation Results

Fail-down: destination becomes unreachable

[Figure: simulation results comparing BGP, Assertion, Ghost Flushing and Root Cause Notification.]


Fail-over Simulation Results

Fail-over: nodes switch to worse paths

[Figure: simulation results comparing BGP, Assertion, Ghost Flushing and Root Cause Notification.]

Implication: more redundancy means faster Tlong convergence


Conclusions? (Not Yet!)

Root Cause Approach is Clear Winner
But several non-trivial deployment problems
Not immediately clear we could standardize it.

Ghost-Flushing Does Well in Fail-down
Easily incrementally deployed
But may not work well in Fail-over

MRAI Timer Only
Leaves us with current convergence problems
And the network is getting larger…
And other complications in large systems…


Damping Analysis

[Figure: curves labeled simulation, calculation and no damping.]

Convergence Updates Trigger Damping Policies!

(could fix if we damped the RCN rather than just updates)


But What About Packets?

Improving packet delivery is the ultimate goal

[Figure: packet delivery results comparing BGP, Assertion, Ghost Flushing and Root Cause Notification.]


Conclusions

Root Cause Approach Adds Many Benefits
Convergence, damping, packet delivery, diagnosis, …

New Routing Designs Should Include RCN
Should be a required part of new routing protocols

Can RCN Be Added to BGP?
Not clear given existing complications
To be continued in IRTF Routing Research Group
Encourage interested researchers to join