S9500 L2 MPLS VPN (VPLS) Technology White Paper
Hangzhou H3C Technologies Co., Ltd. 1/27
L2 MPLS VPN (VPLS) Technology White Paper
Keywords: MPLS, VPLS
Abstract: MPLS technologies make it very easy to provide VPN services based on IP
technologies and MPLS VPNs are highly scalable and easy-to-manage. There are two
MPLS-based VPN services: L3 MPLS VPN and L2 MPLS VPN. L2 MPLS VPN further
includes VPLS and VLL. VLL applies to point-to-point networking scenarios, while VPLS
supports point-to-multipoint and multipoint-to-multipoint networking. From users’ point of
view, the whole MPLS network is a Layer 2 switched network, through which Layer 2
connections can be established between sites. This document describes VPLS.
Acronyms:
Acronym Full spelling
MPLS Multiprotocol Label Switching
VPLS Virtual Private LAN Service
Table of Contents
1 Overview
2 Basic Networking Architecture
3 Features
3.1 Terminology
3.2 Protocol Processing Mechanism
3.2.1 Basic Transmission Components of VPLS
3.2.2 MAC Address Learning and Flooding
3.2.3 VPLS Loop Avoidance
3.2.4 Peer PE Discovery and PW Signaling Protocol
3.2.5 H-VPLS Implementation Mode
3.3 Packet Frame Structure
3.3.1 Packet Encapsulation on the AC
3.3.2 Packet Encapsulation on the PW
3.3.3 VPLS Packet and Encapsulation
3.4 Processing of User Data in the Entire Network
3.4.1 Processing of Common L2 and L3 User Data in the Entire Network
3.4.2 Processing of User Protocol Packet Data in the Entire Network
4 Networking Overview
4.1 Key Points of VPLS Networking
4.2 Typical VPLS Networking Example
5 Features of H3C S9500
5.1 Features of H3C S9500 VPLS
5.1.1 Complete H-VPLS Solution
5.1.2 Feature Configuration and Processing of VPLS Instances
5.1.3 H-VPLS AC Backup in the ME Mode
5.1.4 Load Balance and Service Backup
5.1.5 Binding of Multiple VLANs with a Single VPLS Instance
5.2 Processing Flow of H3C S9500 VPLS
5.3 VPLS-Relevant Features of H3C S9500
6 References
1 Overview
As economic globalization advances, enterprises are becoming more widely distributed and their employees more mobile. This drives telecom operators to provide link connections so that the branches of an enterprise can be joined into a corporate intranet and employees can easily access that intranet from outside the enterprise.
At first, the operators provided enterprises with links over leased lines. The major disadvantages of this approach are that it does not suit the multi-branch, fast-growth profile of modern enterprises and that it is costly and difficult to manage. Later, with the rise of ATM and frame relay, the operators turned to providing customers with point-to-point L2 connections over virtual circuits, on top of which the customers built their own L3 networks to carry IP, IPX and other traffic. These technologies all provided point-to-point L2 connections and required complex configuration; in particular, adding a site demanded a large amount of reconfiguration.
Today, IP networks span the globe, and how to use the existing IP networks to provide low-cost private networks for enterprises has gradually become the operators' focus. A technology has therefore emerged that provides VPN services over the IP network at any desired rate with simple configuration: MPLS VPN. There are two MPLS-based VPN services: L3 MPLS VPN and L2 MPLS VPN. L2 MPLS VPN further includes VPLS and VLL. VLL applies only to point-to-point networking, whereas VPLS can implement multipoint-to-multipoint VPN networking; it offers a more complete solution for operators who provide point-to-point L2VPN service, and it avoids the intervention in the customer's internal routing that L3VPN entails. In this way, an operator needs to manage and operate only a single network on which multiple services (e.g., best-effort IP service, L3 VPN, L2 VPN, traffic engineering and differentiated services) are provided. This greatly reduces the operator's construction, maintenance and operation costs.
The VPLS service enables geographically dispersed users to connect with one another over the MAN/WAN, so that the sites appear to be connected within a single LAN. A series of IETF drafts[1] describe a VPLS solution that uses MPLS pseudowires (PWs) as Ethernet links to provide a transparent LAN service (TLS) across the MPLS network.
2 Basic Networking Architecture
The VPLS-related draft[2] proposes two VPLS network architectures: the VPLS network with fully meshed logical PW (pseudowire) connections and the hierarchical VPLS architecture, as shown in Figure 1 and Figure 2.
Figure 1 Common VPLS network architecture
As shown in Figure 1, the PEs at the various sites of the VPLS network are logically fully meshed. The VPLS network can provide a point-to-multipoint connection service like L3VPN, and the PEs can learn MAC addresses and switch packets among multiple points. The MPLS network provides tunnels for transparently transmitting VPN packets; the P equipment in the network is not involved in MAC address learning and switching and only forwards MPLS packets. Moreover, the forwarding tables of the VPNs on the PEs are independent of one another, so MAC addresses may overlap between VPNs.
Figure 2 Hierarchical VPLS network architecture
As shown in Figure 2, in the hierarchical VPLS architecture the fully meshed logical connections are implemented among the core equipment (NPEs), while each user-facing PE (UPE) connects only to the nearest NPE via a PW to exchange packets with the peer sites. The network topology thus becomes hierarchical and the access range is expanded. In the core network, the NPE offers high performance and rich functions to carry the aggregated VPN service flows, while the UPE has lower performance and functional requirements and is used for VPN service access. Meanwhile, link backup can be implemented between the edge access equipment and the NPE, which enhances network robustness. The access network between the UPE and the NPE can be an MPLS edge network (connected via VPLS or VLL) or a simple Ethernet (connected via QinQ). In addition, the access mode of each UPE in the hierarchical VPLS architecture is not fixed; the access type from the UPE to the NPE can be freely selected for each VPN site according to the actual access network conditions.
3 Features
3.1 Terminology
• MPLS L2VPN: transparently transmits L2 data across the MPLS network. From the user's point of view, the MPLS network is an L2 switched network over which L2 connections can be set up among different sites. There are two types of MPLS L2VPN: VLL and VPLS.
• VPLS (Virtual Private LAN Service): a point-to-multipoint L2VPN service provided over the public network. It enables geographically dispersed users to connect with one another over the MAN/WAN, so that the sites appear to be connected within a single LAN.
• VLL (Virtual Leased Line): a point-to-point L2VPN service provided over the public network. It makes two sites appear to be connected by a leased line; it cannot provide switching among multiple points on the service provider side.
• CE (Customer Edge): the customer edge equipment directly connected to the service provider.
• PE (Provider Edge Router): the edge router of the backbone network, connected to the CE for VPN service access. It maps and forwards packets from the private network into public network tunnels and from public network tunnels back into the private network. PEs can be further divided into UPEs and NPEs.
• UPE (User-facing Provider Edge): the PE equipment close to the user side; it serves as the aggregation equipment through which users access the VPN.
• NPE (Network Provider Edge): the core PE, located at the edge of the core domain of the VPLS network, providing transparent VPLS transport across the core network.
• VSI (Virtual Switch Instance): maps the actual VPLS access links to the various PWs.
• PW (Pseudowire): a bidirectional virtual connection between two VSIs, composed of a pair of unidirectional MPLS VCs.
• AC (Attachment Circuit): the connection between the CE and the PE; it may be a physical interface or a virtual interface. In general, all user packets on the AC, including the users' L2/L3 protocol packets, should be transparently transmitted to the peer site.
• QinQ (802.1Q in 802.1Q): a mechanism that uses the 802.1Q-based tunneling capability of Ethernet switches to provide multipoint L2VPN services. It encapsulates the user's private network VLAN tag within a public network VLAN tag, so the packet carries both tags while crossing the provider's backbone, giving the user a simpler L2VPN tunnel.
• Forwarder: the forwarding function of a PE. The PE receives data frames over the AC, and the forwarder selects the PW over which to forward them; the forwarder is in effect the VPLS forwarding table.
• Tunnel: carries PWs; one tunnel, generally an MPLS tunnel, can carry multiple PWs. A tunnel is a direct channel between a local PE and its peer PE for transparently transmitting data between the two.
• Encapsulation: packets transmitted over a PW use the standard PW encapsulation format. There are two VPLS encapsulation modes on the PW: tagged mode and raw mode.
• PW signaling: the PW signaling protocol is the basis of VPLS implementation, used to establish and maintain PWs. It can also be used to automatically discover the peer PEs of a VSI. At present there are two PW signaling protocols: LDP and BGP.
• Quality of service: mapping the priority information in the user's L2 packet header to a QoS priority for transmission over the public network; generally this requires an MPLS network that supports traffic engineering.
3.2 Protocol Processing Mechanism
The VPLS-related draft describes the basic transmission components of the VPLS network; all VPLS services are accomplished by this set of components, and the VPLS solution in the draft likewise centers on how these basic transmission components are formed and used. In addition, the draft provides a hierarchical VPLS solution in which the PW connections are not fully meshed.
3.2.1 Basic Transmission Components of VPLS
The whole VPLS network behaves like one huge switch. It establishes PWs between the sites of each VPN via MPLS tunnels and transparently transmits the users' L2 packets over these PWs. While forwarding a packet, a PE learns the source MAC address and creates a MAC forwarding table entry, thereby completing the mapping between MAC addresses and user Attachment Circuits (ACs)/PWs. The P equipment only needs to forward MPLS data according to the MPLS label, without examining the L2 user packets encapsulated inside the MPLS packets.
The transmission components of the VPLS network and their functions are described
as follows:
• Attachment Circuit (AC): a connection line or virtual link between the CE and the PE. In general, all user packets on the AC, including the users' L2/L3 protocol packets, should be transparently transmitted to the peer site.
• Pseudowire (PW): a bidirectional virtual connection established between two VSIs of the same VPN. It is composed of a pair of unidirectional MPLS VCs, carried over LSPs and established via the PW signaling protocol. To the VPLS system, the PW is a direct channel from a local AC to the peer AC for transparently transmitting the users' L2 data.
• Forwarder: the PE receives data frames over the AC, and the forwarder selects the PW over which to forward them; the forwarder is in effect the VPLS forwarding table.
• Tunnel: carries PWs; one tunnel, generally an MPLS tunnel, can carry multiple PWs. A tunnel is a direct channel between a local PE and its peer PE for transparently transmitting data between the two.
• Encapsulation: packets transmitted over a PW use the standard PW encapsulation format. There are two VPLS encapsulation modes on the PW: tagged mode and raw mode.
• Pseudowire signaling: the PW signaling protocol is the basis of VPLS implementation, used to establish and maintain PWs. It can also be used to automatically discover the peer PEs of a VSI. At present there are two PW signaling protocols: LDP and BGP.
• Quality of service: mapping the priority information in the user's L2 packet header to a QoS priority for transmission over the public network; generally this requires an MPLS network that supports traffic engineering.
The positions of the basic transmission components of VPLS in the network are shown in Figure 3:
Figure 3 Basic transmission components of VPLS
Take the packet flow of VPN1 from CE1 to CE3 as an example to describe the basic data flow. CE1 sends an L2 packet, which enters PE1 via the AC. When PE1 receives the packet, the forwarder selects a PW for forwarding it, and the system then pushes two MPLS labels according to the PW's forwarding entry (the private network label identifies the PW, while the public network label is used to cross the tunnel and reach PE2). After the packet traverses the public network tunnel and reaches PE2, the system pops the private network label (the public network label has already been popped via PHP on the P equipment). The forwarder of PE2 then selects the AC over which to forward the L2 packet from CE1 to CE3.
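The label operations along this path can be sketched as follows; the label values, function names and list representation are purely illustrative, not taken from any real implementation:

```python
# Sketch of the VPLS data path from CE1 to CE3 (hypothetical label values).
# The ingress forwarder pushes the PW (private) label, then the tunnel
# (public) label; PHP pops the tunnel label at the penultimate P router, and
# the egress PE pops the PW label to recover the original L2 frame.

def pe1_ingress(l2_frame, pw_label=1025, tunnel_label=3001):
    """Ingress PE: push PW label (inner), then tunnel label (outer)."""
    return [tunnel_label, pw_label, l2_frame]

def p_php(packet):
    """Penultimate P router: pop the outer tunnel label (PHP)."""
    return packet[1:]

def pe2_egress(packet):
    """Egress PE: pop the PW label; the frame is forwarded on the chosen AC."""
    pw_label, l2_frame = packet[0], packet[1]
    return pw_label, l2_frame

frame = "L2 frame from CE1"
labeled = pe1_ingress(frame)      # [3001, 1025, frame] entering the tunnel
after_php = p_php(labeled)        # [1025, frame] arriving at PE2
pw, out = pe2_egress(after_php)   # PW label identifies the VSI; frame goes to CE3
```

The inner label survives all the way to the egress PE precisely because only the outer label is swapped or popped in the core.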
3.2.2 MAC Address Learning and Flooding
The control plane of VPLS does not need to advertise or distribute reachability information; instead, it relies on the address learning of the standard bridging function in the data plane to provide reachability.
(1) Source MAC address learning
MAC address learning involves two parts:
• Remote MAC address learning associated with a PW
Because a PW is composed of a pair of unidirectional VC LSPs (the PW is regarded as up only when the VC LSPs in both directions are up), when the VC LSP in the ingress direction has learnt a previously unknown MAC address, that address should be mapped to the VC LSP in the egress direction of the same PW.
• Local MAC address learning on the port directly connected to the user
For an L2 packet sent from the CE, the source MAC address in the packet is learnt on the corresponding port of the VSI.
The address learning and flooding process of the PE is illustrated in Figure 4 .
Figure 4 Address learning and flooding process of the PE
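The two learning parts above, together with the aging timer described in (3) below, can be sketched as a toy VSI FIB; the class name, API and aging default are assumptions for illustration, not an H3C interface:

```python
# Minimal sketch of source MAC learning in a VSI: MACs arriving from a local
# AC map to that AC, MACs arriving over the ingress VC LSP of a PW map to the
# egress side of the same PW, and every hit refreshes the entry's aging timer.

import time

class VsiFib:
    def __init__(self, aging_seconds=300):
        self.entries = {}            # MAC -> (port, last_seen timestamp)
        self.aging = aging_seconds

    def learn(self, src_mac, in_port, now=None):
        """Learn a new source MAC or refresh an existing entry."""
        now = time.time() if now is None else now
        self.entries[src_mac] = (in_port, now)

    def lookup(self, dst_mac, now=None):
        """Return the egress port, or None (unknown/aged out -> flood)."""
        now = time.time() if now is None else now
        entry = self.entries.get(dst_mac)
        if entry is None or now - entry[1] > self.aging:
            return None
        return entry[0]

fib = VsiFib()
fib.learn("00:aa", "AC1", now=0)         # local learning on the AC
fib.learn("00:bb", "PW-to-PE2", now=0)   # remote learning against the PW
print(fib.lookup("00:bb", now=10))       # known: forward on PW-to-PE2
print(fib.lookup("00:aa", now=1000))     # None: entry has aged out
```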
(2) MAC address reclamation
Dynamically learnt MAC addresses must support refresh and re-learning mechanisms. The VPLS-related draft[2] defines an address reclamation message that uses the optional MAC TLV to delete or re-learn a specified list of MAC addresses.
When the topology changes, the address reclamation message can be used to remove MAC addresses quickly. The message falls into two types: with a MAC address list and without one. When a notification message carrying a MAC address list to be re-learnt arrives over a backup link that is becoming active, the PE updates the MAC address entries in the FIB of the VPLS instance and sends the message on to the directly connected PEs over the other relevant LDP sessions. If the notification message contains a null MAC address TLV list, the PE should remove all MAC addresses in the specified VPLS instance except those learnt from the PE that sent the message.
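The handling of the two message types can be sketched as follows; the function signature and FIB representation are illustrative, not the actual LDP TLV format:

```python
# Sketch of processing an address-withdraw message with an optional MAC TLV.
# An explicit list removes exactly those entries; an empty list means "flush
# every MAC in this VPLS instance except those learnt from the sender".

def apply_mac_withdraw(fib, mac_list, from_pw):
    """fib maps MAC -> port; mac_list is the (possibly empty) MAC TLV list."""
    if mac_list:
        for mac in mac_list:             # explicit list: remove those entries
            fib.pop(mac, None)
    else:                                # empty list: flush all but the sender's
        for mac in [m for m, port in fib.items() if port != from_pw]:
            del fib[mac]
    return fib

fib = {"A": "PW1", "B": "PW2", "C": "AC1"}
apply_mac_withdraw(fib, ["A"], "PW1")    # removes only A
apply_mac_withdraw(fib, [], "PW2")       # flushes all entries not learnt on PW2
print(fib)                               # {'B': 'PW2'}
```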
(3) MAC address aging
The remote MAC addresses learnt by a PE require an aging mechanism to remove entries associated with VC labels that are no longer in use. Each time a packet is received, the aging timer for its source address is reset.
3.2.3 VPLS Loop Avoidance
In an ordinary L2 network, STP must be enabled to avoid loops, but the private network's STP clearly should not extend into the ISP's network. In VPLS, fully meshed connections and split horizon forwarding are used instead, so that STP need not run on the ISP network. Each PE must create, for each VPLS forwarding instance, a tree to all the other PE routers in that instance, and each PE router must apply the split horizon policy to avoid loops: a PE router must not forward packets between PWs of the same VPLS instance (because all PEs of a VPLS instance are directly interconnected). In this sense, split horizon forwarding means that data packets received from PWs on the public network side are never forwarded to other PWs, but only to the private network side.
From the user's point of view, running STP on the L2VPN private network side is allowed; all STP BPDU packets are simply transparently transmitted across the ISP's network.
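The split-horizon rule can be expressed as a simple port-selection predicate; the port names are assumed labels for illustration:

```python
# Sketch of the split-horizon rule: a frame received from a public-network PW
# may be flooded to local ACs only, never to other PWs, because all PEs of a
# VPLS instance are already directly connected by the full mesh.

def egress_ports(in_port, acs, pws):
    """Return the ports a flooded frame may be sent to."""
    if in_port in pws:
        return sorted(acs)                            # PW -> ACs only
    return sorted((set(acs) | set(pws)) - {in_port})  # AC -> everything else

acs, pws = ["AC1", "AC2"], ["PW1", "PW2", "PW3"]
print(egress_ports("PW2", acs, pws))  # ['AC1', 'AC2'] - never back onto a PW
print(egress_ports("AC1", acs, pws))  # ['AC2', 'PW1', 'PW2', 'PW3']
```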
3.2.4 Peer PE Discovery and PW Signaling Protocol
For the PEs in the same VSI, the peer addresses can be manually specified or discovered automatically by an auto-discovery mechanism. At present, the peer PEs of a VSI can be discovered automatically via BGP or LDP, and these two protocols can also serve as the PW signaling protocol for establishing PWs. Establishing a PW means allocating a demultiplexer label (the VC label) and advertising it to the peer PE. In addition to label distribution, the PW signaling protocol also advertises the parameters relevant to the VPLS system, for example the PW ID, control word and interface parameters. Through the PW signaling protocol, fully meshed PWs can be established between the PEs to serve the VPLS.
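The information exchanged when a PW is established can be sketched as a simple record; the field names and the up-check are hypothetical illustrations, not an actual LDP message layout:

```python
# Sketch of the parameters a PW signaling protocol advertises per PW. Each PE
# allocates its own VC label and sends it to the peer; the PW comes up only
# after mappings in both directions have been exchanged and parameters agree.

from dataclasses import dataclass

@dataclass
class PwLabelMapping:
    pw_id: int          # identifies the PW (must match on both PEs)
    vc_label: int       # VC label allocated by the advertising PE
    control_word: bool  # whether a control word is used on the PW
    mtu: int            # interface parameter: MTU agreed for the PW

local = PwLabelMapping(pw_id=100, vc_label=1025, control_word=False, mtu=1500)
remote = PwLabelMapping(pw_id=100, vc_label=2048, control_word=False, mtu=1500)

pw_up = (local.pw_id == remote.pw_id
         and local.mtu == remote.mtu
         and local.control_word == remote.control_word)
print(pw_up)  # True: both unidirectional VCs are signaled and consistent
```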
3.2.5 H-VPLS Implementation Mode
Because the VPLS solution described above requires fully meshed tunnel LSPs between all the PE routers providing the VPLS service, n*(n-1)/2 PWs must be established among n PEs for each VPLS service, and all of these PWs are generated by the signaling protocol. The real drawback is that this solution cannot scale to large deployments, because the PE routers terminating the VCs must replicate data packets: each PE must send a copy to every peer for an unknown-destination, broadcast or multicast packet. Through hierarchical connections, the signaling load and data packet replication can be reduced (although the total number of replicated broadcast packets remains unchanged, the replication is shared among multiple devices in H-VPLS), so that VPLS can be deployed on a large scale.
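The scaling argument can be checked with a quick computation; the H-VPLS function assumes, for illustration, that each UPE homes to exactly one NPE:

```python
# PW count per VPLS instance: a full mesh of n PEs needs n*(n-1)/2 PWs,
# while H-VPLS keeps the full mesh only among the (typically few) NPEs and
# adds one spoke PW per UPE.

def full_mesh_pws(n):
    """Number of PWs in a full mesh of n PEs."""
    return n * (n - 1) // 2

def h_vpls_pws(npes, upes):
    """Full mesh among NPEs plus one spoke PW per single-homed UPE."""
    return full_mesh_pws(npes) + upes

print(full_mesh_pws(10))  # 45 PWs for 10 fully meshed PEs
print(h_vpls_pws(3, 7))   # 10 PWs for 3 NPEs and 7 UPEs covering 10 edges
```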
Typically, the ISP places small edge devices at the customer premises and aggregates them into a PE at the central office. It is therefore quite natural to extend the VPLS tunneling technology to the MTU (Multi-Tenant Unit): the MTU equipment can then be regarded as a PE and used to provide the basic VPLS virtual connection service at each edge. The feasible technologies include using a PW or a QinQ logical interface between the MTU and the PE. In the two-layer hierarchical VPLS, one layer consists of the VPLS core PWs (hub) and the other of the extended access PWs (spoke).
(1) Two access means of H-VPLS
The two access means of H-VPLS are illustrated in the following figures:
Figure 5 LSP access mode of H-VPLS
As shown in Figure 5, the UPE acts as the aggregation equipment (MTU). It establishes only one PW, the U-PW, with NPE1, and establishes no PW with the other peers. Data are forwarded as follows: the UPE sends a packet from a CE to NPE1, adding the demultiplexer label (MPLS label) of the U-PW to the packet. Upon receipt, NPE1 determines the packet's VSI from the demultiplexer label and then, according to the destination MAC address of the user data packet, adds the demultiplexer label of the appropriate N-PW before forwarding the packet. When NPE1 receives a packet from an N-PW, it adds the demultiplexer label of the U-PW and sends the packet to the UPE, which then forwards it to the CE.
If CE1 and CE2, both local CEs, exchange data, the UPE forwards the packets between them directly through its bridging function, without passing them up to NPE1. However, for a first packet or a broadcast packet whose destination MAC address is unknown, the UPE still forwards the packet over the U-PW to NPE1 while also broadcasting it through the bridge to CE2, so that NPE1 can replicate the packet and forward it to each peer CE.
Figure 6 QinQ access mode of H-VPLS
As shown in Figure 6, the MTU is a standard bridge device. QinQ is enabled on the CE access ports, and the attached VLAN tag serves as the demultiplexing field. A packet is transparently transmitted to PE1 via the QinQ tunnel between the MTU and PE1. PE1 determines the VSI from the VLAN tag attached by the MTU and then, according to the destination MAC address of the user data packet, adds the demultiplexer label (MPLS label) of the appropriate PW before forwarding the packet. When PE1 receives a packet from a PW, it determines the packet's VSI from the demultiplexer label (MPLS label) and then, according to the destination MAC address, adds the VLAN tag so the QinQ tunnel can carry the packet to the MTU, which then forwards it to the CE.
If CE1 and CE2, both local CEs, exchange data, the MTU forwards the packets between them directly through its bridging function, without passing them up to PE1. However, for a first packet or a broadcast packet whose destination MAC address is unknown, the MTU still forwards the packet over the QinQ tunnel to PE1 while also broadcasting it through the bridge to CE2, so that PE1 can replicate the packet and forward it to each peer CE.
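The QinQ access path can be sketched as follows; the tag values, dictionary layout and function names are illustrative assumptions:

```python
# Sketch of QinQ access in H-VPLS: the MTU pushes an outer (public) VLAN tag
# in front of the user frame, and the PE uses that outer tag to select the
# VSI, exactly as a PW demultiplexer label would on the MPLS side.

def mtu_push_outer_tag(frame, outer_vlan):
    """QinQ: add the service VLAN tag ahead of any user VLAN tag."""
    return {"outer_vlan": outer_vlan, "inner_frame": frame}

def pe_select_vsi(qinq_frame, vlan_to_vsi):
    """PE side: the outer tag is the multiplexing field that picks the VSI."""
    return vlan_to_vsi[qinq_frame["outer_vlan"]]

user_frame = {"user_vlan": 10, "payload": "data"}
tagged = mtu_push_outer_tag(user_frame, outer_vlan=100)
vsi = pe_select_vsi(tagged, vlan_to_vsi={100: "VSI-A", 200: "VSI-B"})
print(vsi)  # VSI-A
```

Note that the user's own VLAN tag (10 here) rides inside untouched, which is what makes the tunnel transparent to the customer network.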
(2) Backup of the H-VPLS AC
Since there is only a single link between the MTU/UPE and the PE/NPE, this solution has an obvious weakness: once the AC fails, all the VPNs connected to the aggregation equipment are cut off. Therefore, for both access models of H-VPLS, a backup link should be designed: in normal operation the equipment uses only one link (the master) for access; once the VPLS system detects that the master link has failed, it activates the backup link to continue providing VPN services.
For H-VPLS using LSP access, because an LDP session runs between the UPE and the NPE, the state of the LDP session can be used to determine whether the master PW has failed. For H-VPLS using QinQ access, STP should run between the MTU and the PEs it connects to, so as to ensure that the other link is activated once the master link fails.
As shown in Figure 7, the UPE detects that the U-PW to NPE1 has failed, so it automatically activates the backup PW to carry the data. Suppose a packet whose source MAC address is "A" in CE1 initially reaches CE3 via the master PW. Through the MAC address learning of VPLS, this MAC address is learnt by the corresponding virtual interfaces on NPE1 and NPE3. Since NPE3 is unaware that a link switchover has occurred at the far end, it still retains the MAC address entry, which is obviously incorrect. For this reason, the relevant MAC addresses should be reclaimed when the UPE performs the active/standby PW switchover. MAC address reclamation can be implemented with the LDP address reclamation message. If many MAC addresses need to be reclaimed, an address reclamation message with a null MAC address list can be sent directly to clear all the MAC addresses in the VPN (except the address entries of the link over which the reclamation message was sent).
(The figure shows the UPE homed to NPE1 via the master U-PW and to NPE2 via the backup U-PW; NPE1, NPE2 and NPE3 are fully meshed by PW1, PW2 and PW3, with CE1 and CE2 attached to the UPE and CE3 to NPE3. The entries "MAC A, U-PW" and "MAC A, PW1" learnt before the switchover are reclaimed afterwards.)
Figure 7 MAC address update after the active/standby PW switchover
The MAC address reclamation message is sent and processed as follows: the UPE sends the MAC address reclamation message to NPE2. After processing the message, NPE2 relearns the address "MAC A" against the backup PW and then forwards the message to the other peers (NPE1 and NPE3). Those peers process the received message and likewise relearn "MAC A" against the corresponding PWs.
(3) Multi-domain VPLS service
Hierarchical VPLS can also be used to build a larger-scale VPLS service, avoiding the need for full-mesh connections among all the VPLS equipment within a single VPLS domain or across multiple domains. Each fully meshed VPLS network is connected via a single LSP tunnel, and each VPLS network uses one PW to connect two domains. When more than two domains are connected, a full mesh of inter-domain PWs must be established among the edge PEs. This creates a three-layer model: the direct connections between the MTUs and the PEs; the fully meshed connections among the PEs within a domain; and the fully meshed connections among the inter-domain edge PEs.
3.3 Packet Frame Structure
3.3.1 Packet Encapsulation on the AC
The packet encapsulation mode on the AC is determined by the user access mode. There are two user access modes, VLAN access and Ethernet access, defined as follows:
• VLAN access: the header of the Ethernet frame sent upstream from the CE or downstream from the PE carries a VLAN tag, a service delimiter assigned by the ISP to differentiate users. We call this tag the "P-Tag".
• Ethernet access: the header of the Ethernet frame sent upstream from the CE or downstream from the PE carries no service delimiter. If the frame header contains a VLAN tag, it is merely the internal VLAN tag of the user packet and is meaningless to the PE equipment. We call this internal user VLAN tag the "U-Tag".
The access mode of a user's VSI can be specified through configuration.
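The distinction between the two access modes can be summarized in a small sketch; the mode strings are assumed labels for illustration, not configuration keywords:

```python
# Sketch of how the PE interprets the outermost VLAN tag of a frame on the
# AC: under VLAN access it is a service delimiter (P-Tag) meaningful to the
# PE, while under Ethernet access any tag present is the user's own U-Tag.

def classify_outer_tag(access_mode, frame_has_tag):
    """Return what the outermost tag means to the PE, or None if untagged."""
    if not frame_has_tag:
        return None
    return "P-Tag" if access_mode == "vlan" else "U-Tag"

print(classify_outer_tag("vlan", True))      # P-Tag: ISP service delimiter
print(classify_outer_tag("ethernet", True))  # U-Tag: user's internal tag
print(classify_outer_tag("ethernet", False)) # None: no tag at all
```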
3.3.2 Packet Encapsulation on the PW
There are also two packet encapsulation modes on the PW: Raw mode and Tagged
mode.
In the Raw mode, the P-Tag is not transmitted on the PW: For the uplink packet on the
CE side, if a packet with a service delimiter is received, the service delimiter will be
removed first before the packet is sent upward, attached with two layers of MPLS
labels and then forwarded; or if a packet without service delimiter is received, the
packet will be directly sent upward and then attached with two layers of MPLS labels
before being forwarded. For the downlink packet on the PE side, the packet will be
added or not added (depending on the specific configurations) with a service delimiter
before being forwarded to the CE but it is not allowed to rewrite or remove any existing
tag.
In Tagged mode, the frame transmitted on the PW must carry the P-Tag. For an
uplink packet from the CE: if the packet carries a service delimiter, the
delimiter is kept, the packet is sent upward, attached with two layers of MPLS
labels, and forwarded; if the packet carries no service delimiter, a null tag
is added first, and the packet is then sent upward, attached with two layers of
MPLS labels, and forwarded. For a downlink packet on the PE side, the service
delimiter is rewritten, removed, or retained (depending on the configuration)
before the packet is forwarded to the CE.
The protocol [2] stipulates that Tagged mode applies by default.
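The uplink tag handling for the two PW modes can be sketched as follows. This is a minimal illustrative model; the frame representation and function name are assumptions, not S9500 software:

```python
def pw_upstream_tags(frame_tags, pw_mode):
    """Return the VLAN tag stack carried on the PW (before the two MPLS
    labels are pushed) for an uplink frame received from the CE.

    frame_tags: tags on the received frame, outermost first; a service
    delimiter (P-Tag), if present, is the outermost entry.
    """
    tags = list(frame_tags)
    has_p_tag = bool(tags) and tags[0].get("role") == "P-Tag"
    if pw_mode == "raw":
        # Raw mode: the P-Tag never crosses the PW; strip it if present.
        if has_p_tag:
            tags.pop(0)
    elif pw_mode == "tagged":
        # Tagged mode: a P-Tag must be carried; push a null tag if absent.
        if not has_p_tag:
            tags.insert(0, {"role": "P-Tag", "vid": 0})
    else:
        raise ValueError("unknown PW mode: %r" % pw_mode)
    return tags
```

Note that the downlink direction is asymmetric between the two modes: Raw mode may only add a delimiter, while Tagged mode may also rewrite or remove it.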
3.3.3 VPLS Packet and Encapsulation
As shown in Figure 8 through Figure 11, the green arrows show the
encapsulation of user packets that do not carry a private network VLAN tag as
they pass among devices playing different VPLS roles, while the purple arrows
show the encapsulation of user packets that carry a private network VLAN tag.
In addition, the encapsulation format between the PEs (on the PWs) shown in
the figures does not consider the PHP operation on the outer-layer tunnel
label. If PHP is taken into account, the packet encapsulation on the PWs may
carry a single MPLS label (the inner-layer label).
Figure 8 Link packet encapsulation in the Raw mode via Ethernet access
Figure 9 Link packet encapsulation in the Tagged mode via Ethernet access
Figure 10 Link packet encapsulation in the Raw mode via VLAN access
Figure 11 Link packet encapsulation in the Tagged mode via VLAN access
3.4 Processing of User Data in the Entire Network
3.4.1 Processing of Common L2 and L3 User Data in the Entire Network
According to the characteristics of VPLS services, the common L2 and L3 user data
will be transparently transmitted to the peer end, including the MAC header of the user
packet and the private VLAN tag of the user.
For the unicast packet with a known MAC address from the PE, the system will
transparently transmit the packet to the corresponding CE.
For an unknown unicast, multicast or broadcast packet of the user, the system will
broadcast it in the entire VPLS domain, that is, all the CEs will receive the packet.
For an L3 packet of the user, the VPLS system will forward it based on the L2 header
of the packet without caring about the content of the L3 packet.
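The forwarding behavior above (known unicast to a single port, everything else flooded in the VPLS domain) can be sketched as follows. This is an illustrative model, not the S9500 implementation; for brevity it also omits the PW-to-PW split horizon that a real VPLS PE applies:

```python
class VsiForwarder:
    """Minimal sketch of per-VSI MAC learning and flooding.

    Ports stand for both ACs (CE-facing) and PWs (PE-facing).
    """

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}          # destination MAC -> outgoing port

    def forward(self, src_mac, dst_mac, in_port):
        self.mac_table[src_mac] = in_port           # learn the source MAC
        out = self.mac_table.get(dst_mac)
        if out is not None and out != in_port:
            return {out}                            # known unicast: one port
        # unknown unicast, multicast, broadcast: flood to every other port
        return self.ports - {in_port}
```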
3.4.2 Processing of User Protocol Packet Data in the Entire Network
Since the intermediate P equipment forwards packets only based on the outer-layer
MPLS label without caring about whether the packet is a common packet or a protocol
packet, all the L2 and L3 protocol packets of the user will be transparently transmitted
by the VPLS system. The protocol packet of the private network will not interact with
the protocol of the VPLS system. They are independent from each other. The private
network protocol data will not affect the public network protocol.
For the user protocol packet whose destination MAC address is a unicast MAC
address, the system will transparently transmit the packet to the corresponding CE.
For the user protocol packet whose destination MAC address is a multicast or
broadcast address, the system will broadcast it in the entire VPLS domain and
all the CEs will receive the protocol packet.
4 Networking Overview
4.1 Key Points of VPLS Networking
(1) Logical full-mesh connections among the PEs
l Fully-meshed PWs must be set up among all the PEs for the VPLS basic
networking.
l Fully-meshed PWs must be set up among the NPEs for the H-VPLS
networking.
(2) Correct configuration of user access modes and access ports
l The access modes of all the VPLS instances at the access port must be
consistent.
l In the VLAN access mode, the uplink packets of the user must carry P-TAG, the
access port should be configured as “Trunk”, and the corresponding VLAN of
the connected VPN shall be allowed to pass.
l In the Ethernet access mode, the uplink packets of the user cannot carry
P-TAG but can carry the user private network tag, the access port should be
configured as “Access”, and the QinQ function of the port should be enabled.
(3) Correct UPE configuration in the H-VPLS networking
l The roles of UPE and NPE must be made clear. Incorrect configurations will
cause loops in the VPLS domain.
l The UPE is allowed to access one NPE only. When there are active and
standby links, it can access two NPEs.
l The NPE can access multiple UPEs.
l When the UPE accesses the NPE via QinQ, the access mode of the
corresponding instances on the NPE should be VLAN access. When there is
link backup, STP needs to be enabled between the UPE and the two NPEs to
back up the links.
l When the UPE accesses the NPE via LSP, the UPE can access the NPE in VLL
or VPLS mode, and it should be specified at the NPE that the access
equipment is a UPE. If there are active and standby PWs, the
active/standby relation of the NPEs must be specified.
l The role definitions of UPE and NPE are only within a certain VPLS instance.
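The full-mesh requirement in (1) grows quadratically with the number of PEs, which is the main motivation for the H-VPLS hierarchy; a quick check:

```python
def full_mesh_pw_count(n):
    """Bidirectional PWs required for a full mesh among n PEs (or NPEs)."""
    return n * (n - 1) // 2

# A flat VPLS with 10 PEs already needs 45 PWs; H-VPLS keeps only the NPE
# core fully meshed, so each added UPE contributes just one uplink PW
# (or two, when active and standby links are used).
```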
4.2 Typical VPLS Networking Example
See Figure 12:
Figure 12 Network diagram for typical VPLS networking
5 Features of H3C S9500
S9500 series switches support the basic VPLS networking and H-VPLS networking.
For the H-VPLS networking, they support multiple access modes: QinQ, VPLS and
VLL. The VPLS service feature board is adopted in the S9500 for centralized
processing of VPLS services. The S9500 supports comprehensive feature
management of VPLS instances and provides a good VPLS solution.
5.1 Features of H3C S9500 VPLS
5.1.1 Complete H-VPLS Solution
The S9500 fully supports the H-VPLS solution proposed in the draft. Its VPLS service
access network can be a common Ethernet or an MPLS edge network. In addition, the
access networks where multiple sites of a VPLS instance are located are independent
from each other. The local Ethernet access network can interwork with the peer MPLS
edge access network.
5.1.2 Feature Configuration and Processing of VPLS Instances
To facilitate VSI maintenance and management, a series of features can be supported
for each VPN, such as VSI traffic limit, broadcast traffic limit, MAC address quantity
limit and QoS class.
VSI traffic limit refers to the maximum traffic that the VPN can access on the PE. Once
the user traffic exceeds this limit, the user packets will be discarded.
To limit the L2 broadcast packets of the VPN, the user is allowed to specify the
broadcast traffic limit. VSI broadcast traffic limit refers to the percentage of the
broadcast traffic in the VPN to the maximum VPN traffic on the PE. Once the
broadcast traffic exceeds the broadcast suppression percentage of the VSI traffic limit,
the user’s broadcast packets will be discarded.
Since the forwarding table entry resources of the system are limited, it is necessary to
limit the MAC address quantity of VSI. Once the number of hosts in each VSI exceeds
the MAC address quantity limit of the VSI, the VPLS system will no longer learn MAC
forwarding table entries. The operator can configure a proper MAC address limit value
for the VPN users, so as to ensure that they can run internal private network services
normally. In some exceptional cases (such as the user equipment is infected by
viruses or there is MAC address attack), the system resources can be prevented from
being used up by the VSI.
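The MAC address quantity limit described above amounts to a simple admission check at learning time. A minimal sketch, under the assumption that already-known addresses may still be refreshed:

```python
def may_learn(mac_table, mac, limit):
    """Decide whether a VSI may learn a MAC address under its configured limit.

    Refreshing an already-known address is always allowed; only learning of
    new addresses stops once the table reaches the limit, which is what caps
    resource usage during a MAC-flooding attack.
    """
    return mac in mac_table or len(mac_table) < limit
```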
When forwarding user packets, the VPLS system can classify the packets
according to their priorities to ensure that important data is forwarded
preferentially to the peer over the public network. The system will map the
priority information in the L2 packet
header of the user into the QoS priority for transmission over the public network
(mapping it into a tunnel transmission priority). Generally the MPLS network that
supports traffic engineering should be applied. There is a table of mapping from IEEE
802.1Q COS to tunnel EXP in the relevant protocol.
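Both the 802.1p CoS field and the MPLS EXP field are 3 bits wide, so the mapping is a small eight-entry table. An identity mapping is shown below purely as an illustration; the actual table is deployment-specific:

```python
# Illustrative CoS -> tunnel EXP table (identity mapping shown; a real
# deployment may remap, e.g. to reserve high EXP values for control traffic).
COS_TO_EXP = {cos: cos for cos in range(8)}

def tunnel_exp(cos):
    """Map a frame's 802.1p CoS value to the tunnel label's EXP bits."""
    return COS_TO_EXP[cos & 0x7]
```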
5.1.3 H-VPLS AC Backup in the ME Mode
The S9500 supports the backup of the AC from the MPLS edge network to the
H-VPLS. Since the UPE serves as a convergence device, all the VPLS services that
access via the UPE will be affected once the link between the UPE and the NPE is
faulty. To enhance the stability, the AC backup function can be enabled.
5.1.4 Load Balance and Service Backup
The S9500 supports load balance between multiple VPLS service boards, improving
the VPLS forwarding performance of the system. The S9500 also supports service
backup. When a service board fails, the S9500 automatically switches the traffic of the
board to a board working normally. This improves the reliability of VPLS services.
Currently, VPLS divides labels evenly into eight label ranges, numbered 0 to 7.
By establishing two tiers of mapping, VPLS redirects the services of VSIs to
VPLS boards for processing:
l Mapping between VSIs and label ranges
When you configure a VSI, the S9500 automatically selects for the VSI the label
range with the greatest number of available labels. You can also use the
command that the S9500 provides to specify a label range for the VSI.
l Mapping between label ranges and VPLS boards
Mapping between label ranges and VPLS boards is implemented by configuring
redirection on public network interfaces. The configuration command has two
important parameters: one specifies the redirection rule, namely the VPLS label
range; the other specifies the VPLS board. The command allows assigning label
ranges to VPLS boards. A single label range cannot be assigned to more than one
VPLS board.
Using the above mechanisms, you can assign VSIs to label ranges evenly, and
assign label ranges to VPLS boards evenly, implementing load balance between
multiple VPLS service boards.
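The first tier of the mapping above (assigning a new VSI to a label range) can be sketched as follows; tie-breaking by lowest range id is an assumption, as the white paper does not specify it:

```python
def pick_label_range(free_labels):
    """Select, for a new VSI, the label range with the most available labels.

    free_labels: dict mapping label range id (0-7) -> free label count.
    On a tie, the lowest-numbered range wins (illustrative assumption).
    """
    return max(sorted(free_labels), key=lambda r: free_labels[r])
```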
When a board fails or is pulled out, the S9500 immediately selects the VPLS board
which is servicing the least number of VSIs to take over all the VSI services on the
original board. After the original board comes back into service or a working
VPLS board is substituted, the S9500 waits for a period of time to see whether
the board works normally. If so, it switches the VSI services back to the board.
5.1.5 Binding of Multiple VLANs with a Single VPLS Instance
The S9500 supports binding a single VPLS instance with up to 64 local VLANs,
allowing the local VLANs to communicate with each other and to communicate with
the remote VLANs bound to the same VPLS instance at Layer 2. This expands the
user access scope greatly. Note that the number of VLANs that can be bound with a
VPLS instance may vary. Refer to the relevant specification description.
Figure 13 Bind multiple VLANs with a single VPLS
In the example shown in Figure 13, VLAN 10 and VLAN 11 are bound with the same
VPLS instance. The following describes the data forwarding process:
When an ARP request of VLAN 10 arrives at the VPLS service board, the board
looks up its ARP table based on the destination IP address. As no match is
found, the board broadcasts the ARP request to all remote user-side networks of
the same VPLS instance and to all the other local VLANs bound with the same
VPLS instance (VLAN 11 in this example). When the local PE receives the ARP
response from VLAN 11, it forwards the response to VLAN 10.
5.2 Processing Flow of H3C S9500 VPLS
User-side packets enter the S9500 via its common interface boards. The system
sends all the data in the private VLAN to which the VPLS service is bound
upward to the VPLS service board for centralized processing. After learning the MAC address
and searching for the forwarding table entry, the system will add two layers of MPLS
labels to the original user packet and then forward it to the next-hop device of the peer
PE.
In S9500, since the VPLS packets at the PW side cannot be processed by common
interface boards, the user needs to configure redirection rules on the corresponding
port of the public network, so as to direct the packets to the VPLS service board for
processing. At present, S9500 supports the establishment of up to 128K PWs, that is,
the system can allocate 128K private network labels for PW establishment. During the
VPLS service configuration, the user is required to configure a rule to redirect the
MPLS packets within the private network label range corresponding to the PW on the
public network port, so that the VPLS services from the PW side will be directed to the
service board for centralized processing. After learning the MAC address and
searching for the forwarding table entry, the system forwards the packets to the CE.
When the S9500 runs VPLS, it requires a VPLS service board for centralized
processing. Figure 14 illustrates the processing model.
Figure 14 Diagram for VPLS processing on the S9500
5.3 VPLS-Relevant Features of H3C S9500
l Supports both Martini (LDP) and Kompella VPLS. Refer to the specific
specification description.
l Supports up to four VPLS service boards for load balance and service backup.
l The VPLS boards use NP boards for centralized processing and upgrade of
VPLS boards can be implemented by upgrading the software.
l The VPLS service boards provide no interfaces; traffic input and output
depend on LPUs.
l The VPLS service boards take responsibility for MPLS label encapsulation and
MPLS forwarding, and therefore the LPUs can be standard ones or enhanced
ones and do not necessarily support MPLS.
l Supports MAC address aging and reclamation.
l Supports MPLS network loop avoidance based on full mesh and split horizon.
l Allows private networks to run STP for loop avoidance and supports
transparent transport of STP protocol messages between private networks.
l Supports H-VPLS and provides QinQ and LSP access between UPE and NPE.
l Supports VSI traffic bandwidth limiting.
l Supports VSI broadcast traffic limiting.
l Supports VSI MAC address limiting and helps in preventing MAC address
attacks.
l Supports QoS and mapping from CoS priorities to EXP priorities.
For other relevant information, refer to the related documents.
6 References
[1] Draft: draft-ietf-pwe3-control-protocol-11.txt
[2] RFC 4761: Virtual Private LAN Service (VPLS) Using BGP for Auto-Discovery
and Signaling
RFC 4762: Virtual Private LAN Service (VPLS) Using Label Distribution Protocol
(LDP) Signaling
Draft: draft-ietf-l2vpn-vpls-ldp-03.txt
Draft: draft-ietf-l2vpn-vpls-bgp-02.txt
Copyright ©2007 Hangzhou H3C Technologies Co., Ltd. All rights reserved.
No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of Hangzhou
H3C Technologies Co., Ltd.
The information in this document is subject to change without notice.
S9500 NAT Technology White Paper
Copyright © 2007 Hangzhou H3C Technologies Co., Ltd. Page 1/20
H3C S9500 NAT Technology White Paper
Keywords: NAT
Abstract: Network Address Translation (NAT) provides a way of translating the source IP address
in an IP packet header to another IP address. In practice, NAT is primarily used to allow
users using private IP addresses to access the Internet. With NAT, a few public IP
addresses are used by a larger number of private network hosts to solve the problem of
IP addresses depletion.
Acronyms:
Acronym Full spelling
NAT Network Address Translation
Table of Contents
1 Overview................................................................................................................................... 3
2 Introduction to NAT.................................................................................................................... 3
2.1 Related Terms................................................................................................................. 3
2.2 Operation of NAT ............................................................................................................ 4
2.2.1 Single Instance ..................................................................................................... 4
2.2.2 Multi-Instance ....................................................................................................... 9
3 Application Scenarios .............................................................................................................. 12
3.1 Common POP Network ................................................................................................. 12
3.2 Multi-ISP Network Using Policy-Based Routing ............................................................. 13
3.3 Multi-Instance VPN-Public NAT ..................................................................................... 14
3.4 Multi-Instance VPN-VPN NAT ....................................................................................... 15
4 H3C S9500 Characteristics ..................................................................................................... 16
4.1 Overview....................................................................................................................... 16
4.1.1 Use of Network Processor................................................................................... 16
4.1.2 Large Capacity, High Performance...................................................................... 17
4.1.3 Support for Access to Internal Servers ................................................................ 17
4.1.4 Support for Static Address Translation................................................................. 17
4.1.5 Rich ALG Features.............................................................................................. 17
4.1.6 Blacklist Function................................................................................................ 17
4.1.7 Logging function ................................................................................................. 18
4.1.8 Support for VPN Users........................................................................................ 18
4.1.9 Limit to the Numbers of Users and Connections Within a VPN ............................ 18
4.1.10 NAT for Inter-VPN Communication .................................................................... 19
4.2 NAT Operation Process of the H3C S9500 .................................................................... 19
4.2.1 NAT Single Instance............................................................................................ 19
4.2.2 NAT Multi-Instance.............................................................................................. 19
1 Overview
As the Internet is faced with IPv4 address depletion, NAT and IPv6 were
introduced to solve this problem. NAT is based on the following fact: in a
private network (such as an enterprise network), only a small number of hosts
access the Internet at any given time, and about 80% of the traffic stays
within the network. Therefore, the hosts are assigned private IP addresses
(the IANA reserves the address blocks 10.0.0.0/8, 172.16.0.0/12, and
192.168.0.0/16). A private address does not need to be globally unique; it can
be reused in different private networks and is translated into a public
address when the host using it accesses the Internet.
The MPLS L3 VPN technology is widely used, especially in large enterprise networks,
for it inherits the advantage of IP routing and integrates the fast forwarding and
flexible networking characteristics of MPLS. MPLS L3 VPN features network structure
simplification, easy maintenance, stable performance, and secure network access.
Integrating NAT with MPLS L3 VPN can make a private network invisible to the
outside to enhance network security, and help save operating costs by providing
reusable IP addresses. NAT multiple-instance enables perfect integration of NAT and
MPLS L3 VPN by allowing for access to the Internet and between VPNs through NAT,
and address reuse in different VPNs.
2 Introduction to NAT
2.1 Related Terms
l NAT: Provides a way of translating private IP addresses into public IP
addresses, allowing hosts in a private network (or a public network) to access
the public network (or the private network).
l NAPT: Network Address and Port Translation (NAPT) identifies each internal
host by TCP/UDP port number or by the identifier field of ICMP packets.
Unless otherwise stated, the port numbers of IP packets refer to the
TCP/UDP port numbers or the identifier of ICMP packets. NAPT can better
utilize IP address resources by allowing more internal hosts to access the
Internet simultaneously.
l VPN: Virtual Private Network (VPN) enables construction of private networks
over a shared public network by using multiple technologies, such as MPLS,
tunneling and encryption. Unless otherwise stated, the term VPN refers to Layer
3 VPN (BGP/MPLS VPN) in this document.
l ALG: Application Layer Gateway (ALG) provides address translation for some
special application layer protocol packets (such as ICMP destination
unreachable packets, FTP packets, and ILS packets). These application layer
protocols need to negotiate port numbers between client and server, and thus
the corresponding NAT entries are created based on the negotiation results; the
private IP addresses or port numbers are contained in the payload of such
protocol packets.
l EASY IP: Uses the IP address of an interface on the router as the public IP
address for translation through NAPT, to save IP address resources.
l FTP: The File Transfer Protocol (FTP) is used to transfer a file from a file
system to another.
l DNS: Domain Name System (DNS) is a distributed database used by TCP/IP
applications to translate domain names into IP addresses and provide email-
related routing information.
l ILS: Internet Location Service (ILS) is a dynamic directory service function
provided by Microsoft. Users can store and search dynamic information (such
as IP address) through ILS.
l FIB: Forwarding Information Base (FIB) stores the core data for Layer 3 packet
(IP packet) forwarding.
l ARP: The Address Resolution Protocol (ARP) is used to resolve an IP address
into a MAC address.
l NP: A Network Processor (NP) is a programmable, high-performance network
processor for handling packets.
2.2 Operation of NAT
2.2.1 Single Instance
1. NAT
NAT only translates IP addresses, as shown in Figure 1.
Figure 1 NAT operation
NAT operates as follows:
(1) The NAT device receives a packet from the private host to the public host.
(2) The NAT device selects an unused public address from its address pool and
establishes corresponding NAT entries (both inbound and outbound).
(3) The NAT device uses the outbound NAT entry to translate the source private IP
address to the public address and sends the packet to the public host.
(4) After receiving a response packet from the public host, the NAT device uses the
inbound NAT entry to translate the destination public IP address to the private
address and sends the packet to the private host.
Note that NAT cannot solve IP address depletion effectively, and is not commonly
adopted in practice.
2. NAPT
NAPT translates both IP addresses and port numbers (or the identifier field of ICMP
messages) and can better utilize IP address resources, allowing more internal hosts
to access the Internet simultaneously. NAPT does not support non-TCP/UDP/ICMP
packets.
Figure 2 NAPT process
As shown in the above figure, NAPT operates as follows:
(1) The NAT device receives a packet from the private host to the public host.
(2) If the connection is new, the NAT device selects an unused IP address and
a port number from its address pool, and then creates corresponding NAT
entries (both outbound and inbound).
(3) The NAT device uses the outbound entry to translate the source private IP
address and port number to the public ones and sends the packet to the public
host.
(4) After receiving a response packet from the public host, the NAT device
uses the inbound NAPT entry to translate the destination IP address and port
number to the private ones and sends the packet to the private host.
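Steps (1) through (4) can be sketched as follows. Port allocation is simplified to a counter here; a real device manages a port pool and ages out idle entries:

```python
class Napt:
    """Illustrative sketch of NAPT with a single public address."""

    def __init__(self, public_ip, first_port=1024):
        self.public_ip = public_ip
        self.next_port = first_port
        self.outbound = {}   # (priv_ip, priv_port) -> (pub_ip, pub_port)
        self.inbound = {}    # (pub_ip, pub_port)  -> (priv_ip, priv_port)

    def translate_out(self, priv_ip, priv_port):
        key = (priv_ip, priv_port)
        if key not in self.outbound:                 # new connection
            mapping = (self.public_ip, self.next_port)
            self.next_port += 1
            self.outbound[key] = mapping             # create both entries
            self.inbound[mapping] = key
        return self.outbound[key]

    def translate_in(self, pub_ip, pub_port):
        # Returns None when no entry exists (unsolicited inbound packet).
        return self.inbound.get((pub_ip, pub_port))
```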
3. NAPT internal server
Normally, public hosts have no permission to access most private hosts, but they may
need to access some internal servers. The problem is that NAPT entries cannot be
dynamically generated when public hosts initiate connections to internal servers. To
solve this problem, you can configure NAT internal servers on the NAPT device, that
is, to configure mappings between public IP addresses/port numbers and private IP
addresses/port numbers.
Figure 3 Operation of NAT internal server
As shown in the above figure, NAPT with an internal server configured operates as
follows:
(1) The NAT device receives a packet from the public host to the internal
server.
(2) The NAT device uses the outbound NAPT entry to translate the destination
public IP address and port number to the private ones and sends the packet to
the internal server.
(3) After receiving a response from the internal server, the NAT device uses
the inbound NAPT entry to translate the source private IP address and port
number to the public ones, and sends the packet to the public host.
4. NAPT ALG
Some application layer protocols need to negotiate port numbers between client and
server, so that the server can initiate connections to the client using the negotiated
port numbers (such as the establishment of an FTP data channel). If the NAT device
knows nothing about the negotiation process, it cannot perform translation between
private IP address/port number and public IP address/port number, and thus the
server and client cannot access each other. NAT ALG can solve this problem. The
following takes FTP as an example to describe ALG operation.
There are two FTP modes, Common FTP and Passive FTP. In Common FTP mode,
the client specifies a port for the server to establish a connection. If the client resides
in a private network, the NAT device needs to use ALG to generate a NAT/NAPT
entry through which the server can access the client. In Passive FTP mode, the
server specifies a port for the client to establish a connection. If the server resides in
a private network, the NAT device also needs to use ALG for the client to access the
server. When a private client wants to access a public server in Passive FTP mode,
or when a public client wants to access a private server in Common FTP mode, the
connection is initiated from the private network and thus ALG need not be used.
l Common FTP
If a private FTP client wants to access a public FTP server, two TCP connections
need to be established. One is the control connection (TCP port number 21 on the
server) which is used to forward control information, such as commands and
parameters; the other one is the data connection (TCP port number 20 on the server)
which is used to transmit files.
Figure 4 Common FTP mode
The client notifies its port number and IP address through the PORT command to the
FTP server over a control connection, and then the server initiates a TCP data
connection at port 20 to the specified IP address and port.
To allow the public server to access the private client, the corresponding NAT entries
need to be created on the NAT device. To do so, the NAT device monitors the control
flow between the client and the server. It uses the private IP address and port number
in the received PORT command to create NAT/NAPT entries, and replaces them with
the corresponding public ones in the PORT command.
l Passive FTP
In Passive FTP mode, both the control and data connections are initiated by the client.
The client sends a PASV request through the control channel to tell the server that it
will use the Passive FTP mode. Then, the client uses a port above 1023 to transmit
data to the server at a port dynamically assigned, which may not be 20.
Figure 5 Passive FTP mode
In the PASV request, the client notifies the server to use a specified data port (not the
default data port). Then, the server sends a response containing the port number and
its IP address, and waits for the client to initiate a connection.
The PASV response sent from the server is:
227 Entering Passive Mode (A1,A2,A3,A4,a1,a2)
In this message, 227 is the PASV response code; A1,A2,A3,A4 represents the
server IP address; (a1*256 + a2) is the port number of the server, which has
the same format as that of the PORT command.
If the server resides in a private network, NAPT entries need to be created on
the NAT device for the public client to access the private server. To do so,
the NAT device uses the private IP address and port number in the PASV
response received from the server to create the corresponding NAT/NAPT
entries, and replaces the private ones in the PASV response with the public
ones.
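The address-and-port extraction an ALG performs on the 227 response can be sketched as follows, assuming the RFC 959 framing with the six digits in parentheses:

```python
def parse_pasv(response):
    """Extract the server address and port from a 227 PASV response,
    e.g. "227 Entering Passive Mode (192,0,2,10,7,139)".
    """
    digits = response[response.index("(") + 1 : response.index(")")]
    a1, a2, a3, a4, p1, p2 = (int(x) for x in digits.split(","))
    # Port is encoded as two bytes: high * 256 + low.
    return "%d.%d.%d.%d" % (a1, a2, a3, a4), p1 * 256 + p2
```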
2.2.2 Multi-Instance
1. NAT
NAT multi-instance extends NAT single-instance to support VPN address translation,
ensuring the same private IP addresses used in different VPNs are translated into
different public IP addresses. See Figure 6 for the translation process of NAT multi-
instance.
Figure 6 NAT process
NAT multi-instance operates as follows:
(1) As shown in the above figure, the NAT device receives a packet from a
private host to a public host.
(2) If it is the first time that the private host accesses the public network,
the NAT device selects an unused public IP address from its address pool and
establishes corresponding NAT entries (both inbound and outbound), containing
the VPN name, source private IP address, and assigned public IP address.
(3) The NAT device uses the outbound NAT entry to translate the source private
IP address into the public one and sends the packet to the public host.
(4) After receiving a response from the public host, the NAT device uses the
inbound NAT entry to translate the destination public IP address into the
private one and forwards the packet to the private host in the corresponding
VPN.
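The key point of the steps above is that entries are keyed by the (VPN, private address) pair, so the same private address in two VPNs maps to two public addresses. A minimal sketch (names and pool handling are illustrative):

```python
class MultiInstanceNat:
    """Sketch of NAT multi-instance: entries keyed by (VPN, private IP)."""

    def __init__(self, pool):
        self.pool = list(pool)   # free public addresses
        self.outbound = {}       # (vpn, priv_ip) -> pub_ip
        self.inbound = {}        # pub_ip -> (vpn, priv_ip)

    def translate_out(self, vpn, priv_ip):
        key = (vpn, priv_ip)
        if key not in self.outbound:
            pub = self.pool.pop(0)           # first unused public address
            self.outbound[key] = pub
            self.inbound[pub] = key
        return self.outbound[key]

    def translate_in(self, pub_ip):
        # The entry carries the VPN name, so the response can be forwarded
        # back into the correct VPN.
        return self.inbound.get(pub_ip)
```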
2. NAPT
NAPT multi-instance extends NAPT single-instance to support VPN address
translation.
Figure 7 NAPT
NAPT multi-instance operates as follows:
(1) As shown in the above figure, the NAT device receives a packet from a private host to a public host.
(2) If it is a new connection from the private network, the NAT device selects an unused IP address and a port number from its address pool, and then creates corresponding NAT entries (both outbound and inbound), containing the VPN name, private IP address/port number, and assigned public IP address/port number.
(3) The NAT device uses the outbound NAPT entry to translate the private IP address and port number into public ones and sends the packet to the public host.
(4) After receiving a response packet from the public host, the NAT device uses the inbound NAPT entry to translate the destination public IP address and port number into private ones and forwards the packet to the private host in the corresponding VPN.
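Compared with NAT multi-instance, NAPT also records the port, so one public address can serve many private hosts. A minimal sketch under the same assumptions (dict tables, an illustrative port allocator starting at 1024):

```python
# Sketch of NAPT multi-instance: entries are keyed on
# (VPN, private IP, private port); one public IP is shared and
# connections are disambiguated by the assigned public port.
class NaptMultiInstance:
    def __init__(self, public_ip, first_port=1024):
        self.public_ip = public_ip
        self.next_port = first_port
        self.outbound = {}   # (vpn, priv_ip, priv_port) -> (pub_ip, pub_port)
        self.inbound = {}    # (pub_ip, pub_port) -> (vpn, priv_ip, priv_port)

    def translate_out(self, vpn, priv_ip, priv_port):
        key = (vpn, priv_ip, priv_port)
        if key not in self.outbound:      # new connection: allocate a port
            mapping = (self.public_ip, self.next_port)
            self.next_port += 1
            self.outbound[key] = mapping
            self.inbound[mapping] = key
        return self.outbound[key]

    def translate_in(self, pub_ip, pub_port):
        return self.inbound[(pub_ip, pub_port)]

napt = NaptMultiInstance("198.51.100.1")
m1 = napt.translate_out("vpn-a", "10.1.1.1", 5000)
m2 = napt.translate_out("vpn-b", "10.1.1.1", 5000)  # same tuple, other VPN
```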
3. NAPT multi-instance internal server
The internal servers in NAPT multi-instance provide VPN support at the private
network side. The operation process is similar to that of the single instance.
Figure 8 Operation of NAT internal server
NAPT multi-instance with an internal server configured operates as follows:
(1) As shown in the above figure, the NAT device receives a packet from the public host to the internal server.
(2) The NAT device uses the outbound NAPT entry to translate the destination public IP address and port number to the private ones and sends the packet to the internal server in the corresponding VPN.
(3) After receiving a response from the internal server, the NAT device uses the inbound NAPT entry to translate the source private IP address and port number to the public ones, and sends the packet to the public host.
4. NAPT multi-instance ALG
NAT multi-instance ALG extends NAT single-instance ALG to support VPN.
3 Application Scenarios
3.1 Common POP Network
The fast expansion of the Internet has resulted in a shortage of IPv4 addresses.
Therefore, NAT is used on high-end routers and core switches in large-sized
enterprise and metropolitan-area networks to facilitate network maintenance and
management. Figure 9 shows a common point of presence (POP) network.
Figure 9 Common POP network
3.2 Multi-ISP Network Using Policy-Based Routing
A private network may connect to multiple ISPs, as shown in Figure 10.
With policy-based routing (PBR) configured on the NAT device, hosts in network
10.8.1.0/24 can access the Internet through ISP 1, and hosts in network 10.8.2.0/24
can access the Internet through ISP 2. Configure address pool 1 (205.113.48.1
through 205.113.48.3) and address pool 2 (207.36.64.1 through 207.36.64.3) on the
NAT device. Address pool 1 belongs to ISP 1 and address pool 2 belongs to ISP 2.
When accessing the Internet, hosts in network 10.8.1.0/24 use the IP addresses in
address pool 1, and hosts in network 10.8.2.0/24 use the IP addresses in address
pool 2. Thus, hosts in different private network segments can access the Internet
through different ISPs, and can be provided with differentiated services.
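The policy described above boils down to selecting an address pool by source subnet. The following sketch uses the subnets and pools from this section; the dict-and-loop selection logic itself is an illustrative assumption, not the switch's PBR implementation.

```python
import ipaddress

# Source subnet -> ISP-specific address pool (values from the text).
POLICY = [
    (ipaddress.ip_network("10.8.1.0/24"),
     ["205.113.48.1", "205.113.48.2", "205.113.48.3"]),   # pool 1, ISP 1
    (ipaddress.ip_network("10.8.2.0/24"),
     ["207.36.64.1", "207.36.64.2", "207.36.64.3"]),      # pool 2, ISP 2
]

def select_pool(src_ip):
    """Return the NAT address pool for a source address, or None."""
    addr = ipaddress.ip_address(src_ip)
    for subnet, pool in POLICY:
        if addr in subnet:
            return pool
    return None                       # no matching policy

pool = select_pool("10.8.1.25")       # hosts in 10.8.1.0/24 use pool 1
```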
Figure 10 Multi-ISP network using policy-based routing
3.3 Multi-Instance VPN-Public NAT
As shown in the following figure, each PE has its own address pool for NAT
translation and supports MPLS encapsulation.
This networking mode mainly applies to enterprises, where users can assign IP
addresses independently.
Figure 11 Multi-instance NAT
The hosts of each VPN can access the Internet through NAT configured on the local
PE. When receiving a packet from a CE, the corresponding PE matches it against the
configured ACL to determine whether it is destined for the Internet. If so, the PE
translates the source IP address, adds a public MPLS label, and sends the packet out.
If the packet is destined for a host in another site of the same VPN, the PE
encapsulates the corresponding private and public labels and sends it out. In this way,
hosts in different sites of a VPN can access each other and the Internet over a
common link without any interference.
3.4 Multi-Instance VPN-VPN NAT
The same private IP addresses can be used in different branches of a government
network or an enterprise network. Besides accessing the Internet, VPN users may
need to access an authorized server that is usually placed in a VPN for security.
Other VPNs use RT to control the access to the server.
Figure 12 Multi-instance NAT
As shown in Figure 12, hosts in VPN 1 and VPN 2 need to access the Internet, as
well as a public server in the VPN named Server. To implement this application,
configure NAT on the PEs connected to VPN 1 and VPN 2 respectively, and
configure ACL rules to achieve address translation for packets from VPN 1 and VPN
2 to the Internet and VPN Server. Communication between different sites of VPN 1 is
enabled through Layer 3 forwarding.
4 H3C S9500 Characteristics
4.1 Overview
4.1.1 Use of Network Processor
The S9500 uses NAT boards to implement NAT functions. Because one S9500 can
have multiple NAT boards, you need to specify the NAT board number when
configuring a NAT entry or an internal server. An NP (network processor) is used as the core packet
processing chip on a NAT board. The NP is programmable and scalable to provide
flexible services.
4.1.2 Large Capacity, High Performance
Since the high-performance NP is used to process data packets, NAT of the S9500
features large NAT table capacity and powerful processing capabilities. The NAT
table can accommodate a maximum of 1.2 M NAT entries, the rate of link setup can
reach 150 Kpps, and the bidirectional packet forwarding rate through NAT can reach
3.0 Mpps. With 64-byte packets, this corresponds to a bidirectional translation rate
of about 1.5 Gbps. Because the NP processes traffic on a per-packet basis, larger
packets yield even higher forwarding throughput.
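The 1.5 Gbps figure follows directly from the quoted packet rate, as this small calculation shows:

```python
# Checking the quoted figures: 3.0 Mpps bidirectional, 64-byte packets.
pps = 3_000_000          # packets per second
packet_bits = 64 * 8     # bits per 64-byte packet
throughput_gbps = pps * packet_bits / 1e9
# throughput_gbps == 1.536, i.e. roughly the 1.5 Gbps quoted above
```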
4.1.3 Support for Access to Internal Servers
Through configuring an internal server (a mapping between private IP address/port
number and public IP address/port number) on the NAT device, you can allow public
hosts to access the internal server in a private network. Additionally, the H3C S9500
supports the AnyServer feature, which enables public hosts to access any port of a
protocol on the internal server (ICMP does not use ports). This helps simplify internal
server configuration on the NAT device.
4.1.4 Support for Static Address Translation
Through static address translation, a private address can be mapped to a fixed public
address. Thus, hosts in the private network can access public networks using a fixed
public address. In addition, static address translation supports point-to-point
applications by enabling a public host to directly access a private host.
4.1.5 Rich ALG Features
S9500 uses software to implement NAT ALG for packets. The S9500 NAT ALG
functions support FTP, TFTP, DNS, ICMP time exceeded/unreachable messages,
LDAP, MSN Messenger 7.0 voice/video, and other commonly used application
software.
4.1.6 Blacklist Function
To prevent a private host from excessively occupying public network bandwidth, you
can limit its total network connections using a NAT blacklist based on link setup rate,
number of connections, or both.
To limit the number of connections, you can set a threshold value. Then, if the
number of connections established from a user exceeds this value, the user is added
into the blacklist and cannot establish new connections. When the existing NAT
entries of the user have aged out, the NAT device waits 30 seconds and then
removes the user from the blacklist to allow the user to establish new connections.
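The connection-count rule above can be sketched as follows. The class, its names, and the way entry aging is signalled are illustrative assumptions; only the threshold check and the 30-second grace period come from the text.

```python
# Sketch of the connection-count blacklist: exceed the threshold and
# you are blacklisted; after all NAT entries age out, a 30-second
# grace period runs before new connections are allowed again.
class ConnectionBlacklist:
    GRACE = 30  # seconds to wait after the user's NAT entries age out

    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.count = {}          # user -> active connection count
        self.unblock_at = {}     # blacklisted user -> earliest unblock time

    def new_connection(self, user, now):
        if user in self.unblock_at:
            if now < self.unblock_at[user]:
                return False                       # still blacklisted
            del self.unblock_at[user]              # grace period over
        if self.count.get(user, 0) >= self.max_connections:
            self.unblock_at[user] = float("inf")   # until entries age out
            return False
        self.count[user] = self.count.get(user, 0) + 1
        return True

    def entries_aged_out(self, user, now):
        # All NAT entries of the user expired: start the 30 s grace timer.
        self.count[user] = 0
        if user in self.unblock_at:
            self.unblock_at[user] = now + self.GRACE

bl = ConnectionBlacklist(max_connections=2)
```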
To limit the link setup rate, the token bucket in the standard single-rate color-blind
mode is adopted. If a private host’s link setup rate exceeds the CIR, it is added into
the blacklist and cannot establish new connections. The user is removed from the
blacklist and can establish new connections when the link setup rate decreases to a
value that makes enough tokens available in the token bucket.
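A single-rate color-blind token bucket, as referenced above, can be sketched like this; the CIR and bucket depth values are illustrative, not S9500 defaults.

```python
# Sketch of the single-rate, color-blind token bucket used to police
# the link setup rate: each new link consumes one token, and tokens
# refill at the committed information rate (CIR).
class TokenBucket:
    def __init__(self, cir, depth):
        self.cir = cir            # tokens (new links) permitted per second
        self.depth = depth        # maximum burst size
        self.tokens = depth       # bucket starts full
        self.last = 0.0

    def conforms(self, now):
        # Refill proportionally to elapsed time, capped at the depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.cir)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1      # one token per new link
            return True
        return False              # exceeds CIR: candidate for the blacklist

tb = TokenBucket(cir=10, depth=2)
```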
4.1.7 Logging Function
NAT entries can be logged to a server when they are established, when they age out,
and when they exceed the specified active time. NAT logging configuration items
include enabling of logging, the log version, source and destination IP addresses,
source and destination port numbers, and the logging mode (flow-begin, and the
interval for sending logs of active flows). When enabling logging, you can specify a
configured ACL to determine which packets need to be logged.
4.1.8 Support for VPN Users
Traditionally, when two VPNs use the same public IP address to access the Internet
through NAT, address conflicts will occur and packets returned from the public
network cannot be sent to the correct VPN. The S9500 NAT multi-instance feature
adds VPN information into NAT entries, allowing multiple VPNs to access the Internet
through a common NAT device without affecting each other.
MPLS and IP networks are also supported to provide various networking modes for
ISPs.
4.1.9 Limit to the Numbers of Users and Connections Within a VPN
If multiple enterprises (VPNs) want to access the Internet through a common NAT
device, you need to specify the maximum numbers of users and connections of each
VPN to prevent any enterprise from excessively using address resources.
4.1.10 NAT for Inter-VPN Communication
In traditional MPLS VPNs, one VPN can access another VPN through RT. However,
if the two VPNs use the same private IP addresses, address conflicts will occur, and
therefore inter-VPN communication cannot be implemented only by using RT. To
satisfy this requirement, the H3C S9500 can translate VPN private IP addresses into
the IP addresses in the NAT address pool.
4.2 NAT Operation Process of the H3C S9500
4.2.1 NAT Single Instance
1. Outbound operation process
(1) Look up the NAT entries. If a match is found, go to Step 3.
(2) Match the packet against the configured ACL to determine whether to perform NAT. If not, the packet is forwarded; if yes, select an address from the address pool. A port is also selected for NAPT translation.
(3) Translate the source IP address and port number of the packet. If ALG processing is needed, the packet is processed by NAT ALG.
(4) Look up the FIB table to forward the packet.
2. Inbound operation process
(1) Look up the NAT entries. If a match is found, go to the next step; otherwise, the packet is discarded.
(2) Translate the destination IP address and port number of the packet. If ALG processing is needed, the packet is processed by NAT ALG.
(3) Look up the FIB table based on the translated private IP address and forward the packet.
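The outbound steps above (entry lookup, ACL decision, allocation, translation) can be sketched as follows. The packet, ACL, pool, and table representations are simple Python stand-ins for illustration, not the NP's data path.

```python
# Sketch of the outbound single-instance NAPT path: look up the entry,
# consult the ACL on a miss, allocate an address/port, then translate.
def outbound(packet, table, acl_permits, pool):
    key = (packet["src_ip"], packet["src_port"])
    mapping = table.get(key)                      # step 1: entry lookup
    if mapping is None:
        if not acl_permits(packet):               # step 2: ACL decision
            return packet                         # forward untranslated
        mapping = pool.pop(0)                     # allocate IP and port
        table[key] = mapping
    translated = dict(packet)                     # step 3: translation
    translated["src_ip"], translated["src_port"] = mapping
    return translated                             # step 4: FIB lookup/forward

table = {}
pool = [("198.51.100.1", 1024), ("198.51.100.1", 1025)]
pkt = {"src_ip": "10.0.0.5", "src_port": 3333, "dst_ip": "198.51.100.99"}
out = outbound(pkt, table, lambda p: p["src_ip"].startswith("10."), pool)
```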
4.2.2 NAT Multi-Instance
NAT multi-instance extends NAT single instance by supporting VPNs. When a private host accesses a public host, NAT multi-instance creates a NAT/NAPT entry, which includes the VPN information. Thus, hosts in different VPNs can use the same private IP addresses. A packet returned from the public network matches the corresponding NAT entry and is forwarded to the VPN specified in the entry.
Copyright ©2007 Hangzhou H3C Technologies Co., Ltd. All rights reserved.
No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of
Hangzhou H3C Technologies Co., Ltd.
The information in this document is subject to change without notice.
S9500 Network Security Technology White Paper
Hangzhou H3C Technologies Co., Ltd. 1/10
H3C S9500 Network Security Technology White Paper
Keywords: Network security, threat
Abstract: With the emergence of more and more network based critical services, security
problems are drawing more and more attention, making network security research a
hotspot in both the computer and telecommunications fields. This document describes
commonly known network attacks and introduces the network security features of the
H3C S9500 series.
Acronyms:
Acronym Full spelling
DoS Denial of Service
Table of Contents
1 Overview .................................................................................................................................. 3
2 Network Security Threats ........................................................................................................ 3
2.1 Definition of Network Security Threat............................................................................ 3
2.2 Classification of Network Security Threats ................................................................... 3
2.3 Security Threats to Network Devices ............................................................................ 4
2.3.1 Threats at Data Transport Level ......................................................................... 4
2.3.2 Threats at Signaling Level .................................................................................. 4
2.3.3 Threats at Device Management Level ................................................................ 4
3 Security Capabilities of H3C S9500 ........................................................................................ 4
3.1 Security Features at Data Transport Level ................................................................... 5
3.1.1 Defense Against Address Scanning.................................................................... 5
3.1.2 Defense Against DoS/DDoS Attacks .................................................................. 5
3.1.3 Broadcast/Multicast Rate Limit ........................................................................... 6
3.1.4 Defense Against MAC Address Table Capacity Attacks ..................................... 6
3.1.5 Support for Static MAC Address Entries and ARP Entries ................................. 6
3.1.6 Powerful ACL Capabilities................................................................................... 7
3.2 Security Features at Signaling Level ............................................................................ 7
3.2.1 Defense Against ARP Attacks ............................................................................. 7
3.2.2 Address Conflicts Detection ................................................................................ 7
3.2.3 Defense Against TC/TCN Attacks ....................................................................... 8
3.2.4 Defense Against Address Embezzlement........................................................... 9
3.2.5 Defense Against Routing Protocol Attacks ......................................................... 9
3.3 Security Features at Device Management Level .......................................................... 9
3.3.1 Support for User Levels ...................................................................................... 9
3.3.2 Secure Remote Management ............................................................................. 9
3.3.3 Security Auditing ............................................................................................... 10
3.3.4 Access Control .................................................................................................. 10
3.3.5 SFTP Service .................................................................................................... 10
1 Overview
With the evolution of Internet technologies and the explosive growth of the Internet in
scale, Internet applications, which started in scientific research fields, have now reached
every walk of life. More and more network based critical services are emerging and
networks have become the new drive for improvement of productivity and life quality.
However, the Internet is based on IP and therefore has inherent problems in areas
such as security, quality of service, and operation mode, of which security is the
most prominent. In addition, the openness of IP networks makes the security
problem even more complicated.
While the simplicity and openness of IP networks boost the rapid development of the
Internet, they also result in security vulnerabilities. Meanwhile, with the development
of technologies and the acceleration of information delivery, the technical difficulty of
launching attacks on IP networks is falling and attack tools are becoming more
automated, enabling more people to launch attacks. The number of network attack
events increases every year and the resulting economic cost grows higher and
higher. Network security threats not only disturb corporations, but also endanger
national information security, casting a shadow on the development of the Internet.
2 Network Security Threats
2.1 Definition of Network Security Threat
A network security threat refers to the destruction of, or unauthorized access to or
modification of, data that is stored on or transferred across networks, servers, and
desktops. Network security threats are usually implemented with specific techniques
or tools and are challenges to network security.
2.2 Classification of Network Security Threats
Security threats on IP networks fall into two categories: those to the security of hosts
(including user’s hosts and application servers) and those to the security of networks,
mainly network devices such as routers and switches. The former generally attack
specific operating systems, primarily Windows. Examples include viruses
and Trojan horses. The latter mainly attack TCP/IP protocols. This white paper
discusses the latter, namely security threats to network devices.
2.3 Security Threats to Network Devices
Network devices provide functions at three levels: the data transport level, the
signaling level, and the device management level. Accordingly, this section describes
security threats to network devices at these three levels.
2.3.1 Threats at Data Transport Level
The network data transport level is responsible for processing and forwarding of data
entering a network device. Functions of this level may be affected by two types of
attacks:
- Attacks based on high traffic or abnormal packets, which are intended to consume a large quantity of CPU resources so that normal traffic cannot be serviced.
- Attacks targeted at user data, which compromise the confidentiality and integrity of user data by sniffing, tampering with, or deleting it.
2.3.2 Threats at Signaling Level
The signaling level maintains operation of network protocols to control routing and
switching of packets. Routing information sniffing and IP address forging are the main
threats at this level. These threats may cause routing information leakage and abuse.
2.3.3 Threats at Device Management Level
The device management level supports remote management of network devices.
Threats at this level come from two aspects: one is the vulnerabilities of the protocols
(such as Telnet and HTTP) for device management, and the other is management
defects such as the leakage of a management account.
3 Security Capabilities of H3C S9500
The H3C S9500 series are high-end routing switches based on the Comware
software platform. They not only inherit all the security features of the Comware
platform, but also incorporate additional security features of their own.
3.1 Security Features at Data Transport Level
3.1.1 Defense Against Address Scanning
When launching an address scanning attack, an attacker sends a large quantity of IP
packets with different destination IP addresses to a target network. In this case, the
network device connecting the target network has to send a great many ARP
requests in order to deliver the attack packets. If no host exists at the destination
address of an attack packet, the network device has to send destination unreachable
notifications as well. When the target network has many hosts and the attack packets
come in great quantities, the CPU and memory resources of the network device may
be depleted, resulting in network service interruption.
The H3C S9500 series support defense against address scanning attacks. When an
H3C S9500 routing switch receives a packet destined for one of its directly connected
network segments, it checks whether an ARP entry is present for the destination
address. If not, it sends an ARP request and adds a drop entry for the destination
address to prevent subsequent packets to that address from impacting the CPU. If it
later receives a response to the ARP request, it removes the drop entry and adds an
ARP entry. A drop entry expires after a specified period of time. This mechanism can
effectively block attack packets while allowing normal traffic.
The H3C S9500 series provide some configuration commands to enable/disable
defense against address scanning attacks.
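The drop-entry mechanism described above can be sketched as follows. The class, its method names, and the 10-second drop-entry lifetime are illustrative assumptions; only the punt-once-then-drop behavior follows the text.

```python
# Sketch of address-scanning defense: an unresolved destination gets a
# temporary drop entry so repeated packets no longer reach the CPU.
class ArpGuard:
    DROP_LIFETIME = 10.0   # seconds (assumed value)

    def __init__(self):
        self.arp = {}      # resolved destination IP -> MAC address
        self.drops = {}    # unresolved destination IP -> drop-entry expiry

    def handle_packet(self, dst_ip, now):
        if dst_ip in self.arp:
            return "forward"
        expiry = self.drops.get(dst_ip)
        if expiry is not None and now < expiry:
            return "drop"                       # filtered before the CPU
        self.drops[dst_ip] = now + self.DROP_LIFETIME
        return "send-arp-request"               # punt once, then drop

    def arp_reply(self, dst_ip, mac):
        self.drops.pop(dst_ip, None)            # remove the drop entry
        self.arp[dst_ip] = mac                  # add the ARP entry
```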
3.1.2 Defense Against DoS/DDoS Attacks
During a denial of service (DoS) attack, an attacker sends large amounts of
connection requests to the target device to deplete its resources, making the device
unable to function normally or even bringing it down. DoS attacks usually aim at
servers, preventing them from serving legitimate users; hence the name denial of
service.
Distributed denial of service (DDoS) is an upgraded version of DoS. A DDoS attack
can compromise multiple devices at the same time and is more destructive in a
greater range.
The H3C S9500 series can better defend themselves against common DoS and
DDoS attacks such as Spoofing, Land, and Smurf, ensuring that when some protocol
is compromised, the others can function normally. Besides, when a server behind an
H3C S9500 routing switch is targeted by a DoS attack, the switch can assign specific
ACL rules to filter attack packets, so as to ensure that the connected server and hosts
can work normally.
3.1.3 Broadcast/Multicast Rate Limit
Broadcast and multicast packets in great quantities can consume a great deal of
network bandwidth and therefore degrade the forwarding performance of network
devices. When a loop exists on a network, broadcast and multicast packets may even
bring the network down.
The H3C S9500 series have powerful broadcast/multicast packet filtering functions.
Using these functions, you can set an absolute broadcast/multicast rate limit for a port
or a limit on the broadcast/multicast rate percentage. You can also configure ACL
rules to limit the rates at which broadcast packets, multicast packets, and unknown
unicast packets can pass a port.
3.1.4 Defense Against MAC Address Table Capacity Attacks
In a MAC address table capacity attack, the attacker sends a great number of frames
with different, forged source MAC addresses to a target device, making the device
learn a lot of useless MAC addresses. As the capacity of a MAC address table is
limited, the device may then be unable to learn the MAC addresses of legitimate
users. During Layer 2 forwarding, the attack packets may be broadcast in the VLAN,
wasting a lot of bandwidth and impacting the hosts connected to the network device.
With the H3C S9500 series, you can set the maximum number of MAC addresses
that a port or VLAN can learn based on the number of hosts connected to the port or
VLAN, preventing a port or VLAN from using up all the MAC address table resources.
When setting the MAC address limit, you can also specify whether the device should
forward packets with unknown source MAC addresses when the limit is reached. This
allows you to prevent too much broadcast traffic in a VLAN from impacting other
devices.
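The per-port learning limit and the unknown-source forwarding flag can be sketched as follows; the class and its names are illustrative, not the switch's configuration model.

```python
# Sketch of the per-port MAC learning limit: known sources are always
# forwarded; once the limit is reached, a flag decides whether frames
# with unknown source MACs are forwarded or dropped.
class PortMacTable:
    def __init__(self, limit, forward_when_full=False):
        self.limit = limit
        self.forward_when_full = forward_when_full
        self.macs = set()

    def receive(self, src_mac):
        if src_mac in self.macs:
            return "forward"                    # already learned
        if len(self.macs) < self.limit:
            self.macs.add(src_mac)              # normal learning
            return "forward"
        # Limit reached and source unknown:
        return "forward" if self.forward_when_full else "drop"

port = PortMacTable(limit=2)
```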
3.1.5 Support for Static MAC Address Entries and ARP Entries
The H3C S9500 series support static MAC address entries and static ARP entries. By
configuring static MAC address entries, you can ensure the correct forwarding of
Layer 2 frames. By configuring static ARP entries, you can bind MAC addresses to
IP addresses, preventing IP addresses from being embezzled.
3.1.6 Powerful ACL Capabilities
In complicated network environments, various kinds of attack packets may
compromise network devices or the attached hosts.
The H3C S9500 series provide powerful ACL capabilities, allowing identification, rate
limiting, and filtering of packets based on fields at the data link layer, network layer,
and transport layer. ACL rules can not only be based on common criteria such as
ICMP type, IGMP type, TCP port number, UDP port number, IP address, and MAC
address, but can also be based on the TTL, VLAN ID, and EXP fields. In addition,
you can configure ACL rules for a device, a port, or a VLAN as required.
3.2 Security Features at Signaling Level
3.2.1 Defense Against ARP Attacks
Although ARP plays a critical role in data forwarding, it provides no authentication
mechanism. Attackers often use forged ARP packets to launch attacks. The H3C
S9500 series support defense against this kind of attack.
After an H3C S9500 routing switch receives an ARP packet, it hashes the source
MAC address of the packet. Besides, it counts the received ARP packets. When it
detects that the CPU is dropping packets and the number of ARP packets from a
MAC address exceeds the limit, it considers the host an ARP attacker and will log the
event, give an alert message, and add a source MAC address drop entry to filter
packets from the host.
3.2.2 Address Conflicts Detection
If an interface of a network device uses the same IP address as a host or another
network device connected to the interface, an address conflict exists. If the network
device cannot detect the address conflict, the ARP entry for the network device on
the other connected hosts may be updated with a wrong MAC address, preventing
those hosts from communicating with the network device normally.
The H3C S9500 series support address conflicts detection. When an H3C S9500
routing switch receives an ARP packet, it checks whether the source IP address of
the packet is the same as that of the interface connecting the network segment. If yes,
it sends an address conflict notification packet to tell the ARP packet sender that the
IP address has been used. At the same time, it sends a gratuitous ARP broadcast
packet, notifying all hosts and network devices on the segment to use the correct
ARP entry for the IP address. An address conflict alert message may also be
generated and logged, so that network administrators know the situation.
3.2.3 Defense Against TC/TCN Attacks
With Spanning Tree Protocol (STP) enabled, if a port of a device on the network
detects an STP state change, it generates a topology change (TC) or topology
change notification (TCN) message. When another device on the network receives
such a TC or TCN message and finds that the network topology has changed, it
needs to remove the MAC address and ARP entries to avoid using the entries for
data forwarding. If there are a lot of TC or TCN messages on a network, MAC
address and ARP entry flushing will occur frequently and large amounts of ARP
requests will then be broadcasted in the VLAN. In this case, Layer 3 packets may be
dropped and the network may not be able to function normally.
The H3C S9500 series can protect the network against TC/TCN attacks. Upon
receiving a TC/TCN packet, an H3C S9500 routing switch removes the MAC address
entries but does not remove the ARP entries. When relearning a MAC address, it
checks whether there is an ARP entry for the MAC address. If so, it directly modifies
the outbound port of the ARP entry. Modifying ARP entries based on MAC addresses
can avoid packet dropping during Layer 3 forwarding.
Frequent topology change may affect the operation stability of all devices on the
network. The H3C S9500 series can deal with this situation. After receiving the first
TC/TCN message, an H3C S9500 routing switch executes a series of processes
accordingly and starts a timer. Before the timer expires, it does not respond to any
more TC/TCN messages. Once the timer expires, it checks whether it has received
any TC/TCN messages during the period. If so, it performs the flushing operation.
This mechanism helps keep the devices working stably.
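The damping behavior described above (act on the first TC/TCN, suppress during the timer, flush once at expiry if more arrived) can be sketched as follows; the class and the hold-down value are illustrative assumptions.

```python
# Sketch of TC/TCN damping: the first message triggers a flush and
# starts a hold-down timer; messages within the hold-down are only
# remembered, and one deferred flush runs when the timer expires.
class TcDamper:
    HOLD_DOWN = 15.0   # seconds (assumed value)

    def __init__(self):
        self.timer_expiry = None
        self.pending = False

    def on_tc(self, now):
        if self.timer_expiry is None or now >= self.timer_expiry:
            self.timer_expiry = now + self.HOLD_DOWN
            self.pending = False
            return "flush"            # first TC: flush MAC entries now
        self.pending = True
        return "suppressed"           # within hold-down: remember only

    def on_timer_expiry(self, now):
        self.timer_expiry = None
        if self.pending:
            self.pending = False
            return "flush"            # TCs arrived during the hold-down
        return "idle"
```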
3.2.4 Defense Against Address Embezzlement
Address embezzlement refers to the situation where an illegal user exploits the IP
address of a legitimate user. In this case, the network device learns a wrong ARP
entry and the legitimate user cannot get online normally.
The H3C S9500 series can protect users against address embezzlement attacks.
With MAC-to-IP address bindings configured, an H3C S9500 routing switch performs
address validation when learning ARP entries and learns only legal ARP entries.
3.2.5 Defense Against Routing Protocol Attacks
Routing protocol attacks send forged routing update packets to routers that do not
perform routing protocol authentication, populating the routing tables with forged
routes. This may even cause the networks to crash. Experienced attackers may
further launch more severe attacks.
The H3C S9500 series support routing protocol authentication:
(1) OSPF: Plaintext/MD5 authentication between neighboring routers and
plaintext/MD5 authentication within an OSPF area.
(2) IS-IS: Level-1 and Level-2 plaintext/MD5 authentication on interfaces,
plaintext/MD5 authentication within an IS-IS area, and plaintext/MD5
authentication in an IS-IS routing domain.
(3) BGP: MD5 authentication between neighboring routers and within a BGP area.
(4) RIPv2: Plaintext/MD5 authentication between neighboring routers.
3.3 Security Features at Device Management Level
3.3.1 Support for User Levels
The H3C S9500 series provide four user levels (visit, monitor, system, and manage)
and support encryption of user passwords and limit of password attempts. If a user
cannot enter the correct password before the limit is reached, the device will give an
alert message.
3.3.2 Secure Remote Management
The H3C S9500 series support the SSH protocol. Network administrators can log in
to a network device by SSH securely.
3.3.3 Security Auditing
The H3C S9500 series provide basic security auditing functions including security
alarm logging and user operation logging.
3.3.4 Access Control
The H3C S9500 series support 802.1x authentication in port-based mode and MAC-
based mode, guaranteeing secure LAN access.
3.3.5 SFTP Service
Secure FTP (SFTP) allows users to log in to devices and perform remote file
management securely. An H3C S9500 routing switch can function as an SFTP server
or client. When it functions as a client, you can log in to a remote device from it to
perform file management.
Copyright ©2007 Hangzhou H3C Technologies Co., Ltd. All rights reserved.
No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of
Hangzhou H3C Technologies Co., Ltd.
The information in this document is subject to change without notice.
H3C S9500 OSPF/IS-IS/BGP GR Technology White Paper V1.00
Keywords: GR
Abstract: Graceful Restart (GR) ensures continuity of packet forwarding and hence key services
when the routing protocol restarts. GR is a highly reliable technology widely used in
active-standby switchover and system upgrade.
Acronyms:
Acronym Full spelling
OSPF Open Shortest Path First
ISIS Intermediate System-to-Intermediate System
BGP Border Gateway Protocol
GR Graceful Restart
Table of Contents
1 Introduction ............................................................................................................................... 3
2 Typical Networking Analysis ...................................................................................................... 3
3 Features.................................................................................................................................... 4
3.1 Terms ............................................................................................................................. 4
3.2 How GR Works ............................................................................................................... 5
3.3 OSPF GR ....................................................................................................................... 5
3.3.1 Standard OSPF GR ................................................................................... 5
3.3.2 Compatible OSPF GR ............................................................................... 9
3.4 ISIS GR ........................................................................................................................ 13
3.5 BGP GR........................................................................................................................ 16
4 H3C S9500 Features............................................................................................................... 18
1 Introduction
The control plane and forwarding plane of a high-end router/switch are separate from
each other. The control plane controls and manages the whole device, discovering
routes and delivering them to the interface boards. The forwarding plane is
dedicated to data forwarding. The respective processors of these two planes are
functionally independent.
Each time the control plane restarts, all the routing protocols have to restart, the
neighbor relationships between the device and the adjacent devices have to be
rebuilt, and all the routing information databases have to be re-synchronized.
Neighbor relationship interruption triggers route recalculation on neighbors, causing
routing flaps and communication failures.
To solve this problem, the IETF proposed enhancements to routing protocols such as
IS-IS, OSPF, BGP, and LDP, improving the original protocol operating flows. With
these enhancements, when the control plane restarts on a device, the device notifies
its neighbors to temporarily preserve its routing information and their adjacencies
with it. After the protocol restarts, the neighbors help the restarting device restore its
routing information in a very short time. During the restart, no routing flaps occur,
and packet forwarding on the network remains normal. These enhancements are
collectively called Graceful Restart (GR).
GR ensures the continuity of packet forwarding and hence key services when the
routing protocol restarts. GR is widely used in active-standby switchover and system
upgrade.
2 Typical Networking Analysis
GR generally works between neighbors, as shown in the following figure:
Figure 1 Typical GR network application
When its control software restarts, Switch A starts GR and notifies its neighbors
Switch B, Switch C, Switch D, and Switch E to start GR. During the GR process,
Switch A finishes synchronizing routing information with its neighbors and the
forwarding services remain uninterrupted.
3 Features
3.1 Terms
GR Restarter
A GR Restarter is a device whose control plane restarts.
GR Helper
A GR Helper is a neighbor device that assists the GR Restarter in synchronizing
routing information during the GR process.
The GR Restarter and GR Helpers must be GR-capable and perform GR capability
negotiation (including GR capabilities and GR time) in advance. If the negotiation
succeeds, when the control plane of a GR-capable device restarts, the neighbor
devices are notified to become GR Helpers and the routes of the GR Restarter remain
unchanged within the GR time.
3.2 How GR Works
GR works with different routing protocols in a similar way, though the GR flows for
the respective protocols vary.
In the following figure, the solid lines indicate that adjacencies are formed between
Switch A and Switch B, and between Switch A and Switch C, while the dotted lines
indicate that Switch A, Switch B, and Switch C are GR-capable and GR capability
negotiation has been completed among them. When its control plane restarts, Switch
A begins to work as a GR Restarter and its forwarding plane remains normal. Switch
B and Switch C begin to work as GR Helpers, with the routes of the GR Restarter
unchanged. Then, the GR Restarter (Switch A) reestablishes neighbor relationships
with the two GR Helpers (Switch B and Switch C) and receives routing information
from them. When the GR Restarter finishes receiving all the routing information, it
calculates the routes and synchronizes the calculation results to the forwarding plane.
After that, the GR process is complete.
Figure 2 Typical GR networking
This GR process is generic to routing protocols. GR processing details vary with
routing protocols. The following sections describe the GR processing mechanisms for
OSPF, IS-IS, and BGP respectively.
3.3 OSPF GR
OSPF GR has two modes: standard mode and compatible mode.
3.3.1 Standard OSPF GR
1. Packet format
Standard OSPF GR uses an Opaque-LSA (Type 9) to notify a neighbor device to
start the GR process. Known as the Grace-LSA, the Opaque-LSA has an Opaque
type of 3 and Opaque ID of 0. The following figure depicts the Grace LSA format.
Figure 3 Grace LSA format
The following figure shows its TLV format:
Figure 4 TLV format (a 2-byte Type field, a 2-byte Length field, and a variable-length Value field)
The RFC defines three types of TLVs:
• Grace Period TLV
The Grace Period TLV has a Type value of 1 and a Length value of 4, and indicates the
maximum time during which a neighbor acts as a GR Helper. If the GR Restarter has
not completed the GR process before this period expires, the neighbor device stops
working as a GR Helper. Grace-LSAs must contain a Grace Period TLV.
• Graceful Restart Reason TLV
A Graceful Restart Reason TLV has a Type value of 2 and Length value of 1, and
describes the graceful restart reason. Possible values of the Value field are 0 for
unknown reason, 1 for software restart, and 2 for software reloading (upgrade).
Grace-LSAs must contain a Graceful Restart Reason TLV.
• IP Interface Address TLV
An IP interface Address TLV has a Type value of 3 and Length value of 4, and
indicates the IP address of the interface sending the Grace-LSA. This IP address
uniquely identifies the restarting device on a broadcast, NBMA, or P2MP network.
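The three TLVs above can be laid out mechanically. The following Python sketch (function names are hypothetical, for illustration only) packs a Grace-LSA body with the mandatory Grace Period and Graceful Restart Reason TLVs plus the IP Interface Address TLV, each padded to a 4-byte boundary as LSA TLVs conventionally are:

```python
import socket
import struct

def tlv(t: int, value: bytes) -> bytes:
    # Each TLV: 2-byte Type, 2-byte Length (of the value only),
    # then the value padded to a 4-byte boundary.
    pad = (-len(value)) % 4
    return struct.pack("!HH", t, len(value)) + value + b"\x00" * pad

def grace_lsa_body(grace_period: int, reason: int, ip: str) -> bytes:
    """Grace-LSA body per RFC 3623: Type 1 = grace period (seconds),
    Type 2 = restart reason, Type 3 = sending interface address."""
    return (tlv(1, struct.pack("!I", grace_period))
            + tlv(2, bytes([reason]))       # 0 unknown, 1 sw restart, 2 upgrade
            + tlv(3, socket.inet_aton(ip)))
```

A 120-second grace period with reason "software restart" from interface 10.0.0.1 thus encodes into three 8-byte TLVs.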
2. Protocol processing flow
Standard OSPF GR works as follows:
Figure 5 RFC 3623 protocol processing flow
1) Once brought up again, an OSPF interface on the GR Restarter sends a Grace-
LSA.
2) Upon receiving the Grace-LSA, the neighbor starts to act as a GR Helper and
sends an ACK to the GR Restarter.
3) Hello packets are exchanged on the broadcast or NBMA network to elect a DR
and BDR.
4) The GR Restarter begins normal LSDB synchronization. The neighbor state
transits from Exstart, through Exchange and Loading to Full. During this
process, the GR Restarter stores received self-originated LSAs, and labels them
as Stale.
5) The GR Helper also begins normal LSDB synchronization. The neighbor state
transits from Exstart, through Exchange and Loading to Full. During this
transition process, the GR Helper operates as in the FULL state, without
generating any new Router LSAs or Network LSAs.
6) When all the neighbor relationships reach the Full state, that is, are restored,
Grace-LSA flushing is initiated.
7) The GR process is complete, and new LSAs are generated and flooded. The
LSAs labeled as Stale but not regenerated are flushed.
3.3.2 Compatible OSPF GR
1. Packet format
• Link-local Signaling (LLS) Block
Compatible OSPF GR extends the OSPF packet format to carry different types of
application data. The following figure shows the extended OSPF packet format:
Figure 6 Extended OSPF packet format (an IP header, an OSPF header plus OSPF data of total length X, authentication data of length Y, and an LLS block of length Z)
The authentication data and LLS block fields are not included in the OSPF packet
length. Currently, only two types of OSPF packets, Type 1 (OSPF Hello) and Type 2
(OSPF DD), contain an LLS Block, which is identified by the L bit (0x10) in the Options
field.
* * DC L N/P MC E *
Figure 7 Option field of OSPF packet
The LLS Block field adopts an extensible TLV structure, defining two types of TLVs:
Extended Options TLV (EO_TLV) and Cryptographic Authentication TLV (CA_TLV),
as shown in the following figures.
Figure 8 EO_TLV format (Type = 1, Length = 4, followed by a 4-byte Extended Options field)
Figure 9 CA_TLV format (Type = 2, Length = 20, followed by a Sequence Number and Auth Data)
An EO_TLV has a Type value of 1 and a Value field with 4-byte Extended Options for
Option extension in OSPF packets.
• OOB
In traditional OSPF, LSDB resynchronization is performed only when neighbor
relationships are reestablished. Normal LSDB synchronization is carried out through
flooding after neighbor relationships are established. OOB (out-of-band) LSDB
resynchronization is carried out in a network where neighbor relationships have been
established and the network topology is stable.
In traditional OSPF, LSDB resynchronization requires the neighbor state machine to
be in the Exstart state. This causes OSPF to generate new Type-1 LSAs (Router
LSAs), triggering route recalculation.
LR_Bit is introduced in the OOB flow for the OOB capability negotiation between
neighbors. LR_Bit is contained in the Extended Option in an EO_TLV. If the device is
OOB-capable, when sending OSPF Hello packets and DD packets, the device sets
the LR_Bit in the Extended Option of the EO_TLV to 0x00000001.
* * * * * * * * * * LR
Figure 10 LR_Bit
In addition, an R_Bit is introduced in the OOB flow to notify neighbor devices to perform
OOB resynchronization. The R_Bit is carried in DD packets sent to neighbors. In a
DD packet with the R_Bit and the I, M, and MS bits set, the sender requests to start
OOB resynchronization. In this case, if the neighbor state machine is Full, the
device sets the neighbor state to ExStart to start LSDB resynchronization.
During OOB resynchronization, the neighbor state is treated as Full regardless of
whether the state is ExStart, Exchange, or Loading; that is, the device operates as if
the neighbor were in the Full state, and therefore Router LSAs and Network LSAs do
not change, keeping the network stable.
• RS bit
In the compatible mode, an RS_Bit is added to the Extended Option of the EO_TLV
to notify the neighbor to start the GR process. The value of the RS_Bit is 0x00000002.
* * * * * * * * * RS LR
Figure 11 RS_Bit
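The LR and RS bits above live in the Extended Options value of a single EO_TLV. A minimal Python sketch (constant names are mine) of how the two hellos in the compatible-mode exchange would encode that TLV:

```python
import struct

LR_BIT = 0x00000001  # OOB resynchronization capability
RS_BIT = 0x00000002  # request the neighbor to start the GR process

def eo_tlv(extended_options: int) -> bytes:
    # EO_TLV: Type = 1, Length = 4, 4-byte Extended Options value.
    return struct.pack("!HHI", 1, 4, extended_options)

# The restarting device's first hello carries both bits set;
# the helper answers with LR set and RS clear.
restart_hello_tlv = eo_tlv(LR_BIT | RS_BIT)
helper_hello_tlv = eo_tlv(LR_BIT)
```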
2. Protocol processing flow
The following figure shows the processing flow of compatible OSPF GR:
Figure 12 Compatible OSPF GR flow
1) Once brought up again, an OSPF interface of the GR Restarter sends a hello
packet containing LLS Block. The RS bit and LR bit in the Extended Options field
of the EO_TLV in the LLS Block are set.
2) Upon receiving the hello packet, the neighbor skips the two-way state, that is, it
keeps the neighbor state unchanged, enters the GR Helper process flow, and
sends back a hello packet with the LR bit on and RS bit off.
3) After receiving the hello packet with LR bit on, the GR Restarter sets the
neighbor state to 2-way and the subsequent flow is the same as that of the
traditional OSPF protocol. Once the DR election is complete, the first DD packet
(with R_bit on) is sent to start the OOB flow. In the hello packet sent after the DR
election, the RS_bit will not be set.
4) After receiving the DD packet with R_bit set and then setting the corresponding
neighbor state to Exstart, the GR Helper also enters the OOB flow.
5) During LSDB resynchronization, the neighbor state transits from Exchange to
Loading and to Full. During this process, the GR Restarter stores the received
self-originated LSAs, and labels them as Stale.
6) When all the neighbor relationships become FULL and all the routing information
is restored, the GR process is complete. LSAs are regenerated and flooded, and
the LSAs labeled as Stale are not regenerated and are flushed directly.
3.4 ISIS GR
1. Packet format
In IS-IS GR, a new TLV, namely, Restart TLV, is added to IIH packets to notify the
neighbor device to enter the GR flow. This new TLV has a Type value of 211. The
following figure illustrates its Value field:
Figure 13 Value field of a Restart TLV (a 1-byte Flags field, a 2-byte Remaining Time field, and a Restarting Neighbor System ID of ID Length bytes)
The one-byte Flags field records necessary state flags during the restart. The
following figure shows the Flags format:
* * * * * SA RA RR
Figure 14 Flags format
Currently, only the last three bits (SA, RA, and RR) are used. When the control
software restarts, the RR (Restart Request) bit of the first IIH packet sent through
each interface must be set. Upon receiving the IIH packet, the neighbor device must
acknowledge the receipt by sending back an IIH packet with the RA (Restart
Acknowledgement) bit set. The SA (Suppress adjacency advertisement) bit is
optional and used to avoid blackhole routes.
The 2-byte Remaining Time field indicates the time in seconds before the neighbor
ages out. This field must be present together with the RA bit. Upon receiving an IIH
packet with the RR bit set from the restarting device, the neighbor device must
immediately acknowledge the receipt by sending back an IIH packet whose RA bit is
set to 1, filling the time in seconds before the corresponding neighbor (the restarting
device) ages out in the Remaining Time field.
The System ID of the restarting device is filled in the Restarting Neighbor System ID
field.
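Since IS-IS TLVs use 1-byte type and length fields, the Restart TLV above is straightforward to lay out. A sketch (helper name is hypothetical) encoding both a restart request and an acknowledgement:

```python
import struct

RR = 0x01  # Restart Request
RA = 0x02  # Restart Acknowledgement
SA = 0x04  # Suppress adjacency Advertisement

def restart_tlv(flags: int, remaining_time: int = 0,
                neighbor_id: bytes = b"") -> bytes:
    """Restart TLV (type 211) carried in IIH packets: a 1-byte Flags
    field, and, in acknowledgements, a 2-byte Remaining Time plus
    the restarting neighbor's System ID."""
    value = bytes([flags])
    if flags & RA:
        value += struct.pack("!H", remaining_time) + neighbor_id
    # IS-IS TLVs use 1-byte type and 1-byte length fields.
    return bytes([211, len(value)]) + value

request = restart_tlv(RR)                     # restarting device's IIH
ack = restart_tlv(RA, 300, bytes(6))          # helper's reply, 6-byte System ID
```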
2. Protocol processing flow
In IS-IS GR, three timers, namely, T1, T2, and T3 are defined.
• T1 timer
Like the IIH timer, the T1 timer is defined on each interface. It defines the interval for
sending IIH packets with the RR bit set and defaults to three seconds. When the
device restarts, a T1 timer is created on each interface and an interface periodically
sends IIH packets with RR bit set. The T1 timer on the interface is not removed until
the interface receives the IIH acknowledge packet with RA bit set and the complete
CSNP packet.
• T2 timer
The T2 timer defines the maximum wait time of LSDB resynchronization and defaults
to 60 seconds. Each LSDB has such a timer.
• T3 timer
The T3 timer defines the maximum restart time in IS-IS. Once the T3 timer expires,
the GR process ends regardless of whether the LSDB resynchronization is complete
and the normal IS-IS processing flow begins. Upon initialization, the T3 timer is set to
65535 seconds. After all interfaces receive the IIH acknowledge packets with the RA
bit set, the T3 timer is reset based on the minimum among the Remaining Time
values of these packets.
The following figure depicts the IS-IS GR working flow:
Figure 15 IS-IS GR flow
1) When IS-IS is re-enabled on the GR Restarter, T2 and T3 timers are enabled
globally. When an interface is brought up again, the T1 timer is started on the
interface (Different from the original protocol flow, when the interface is up, the
T1 timer, instead of the IIH timer, is started), and an IIH packet with the RR bit
set is sent.
2) After receiving the IIH packet, the neighbor leaves the neighbor state of the
sender unchanged and sends back an IIH packet with the RA bit set, filling the
GR Restarter's remaining aging time and System ID in the Remaining Time and
Restarting Neighbor System ID fields of the Restart TLV respectively. If the
interface is a broadcast interface, a DIS election is performed,
which is different from traditional IS-IS DIS election. If it is elected as the DIS, the
interface sends CSNP packets and all LSPs. If the interface is a P2P interface, it
directly sends CSNP packets and all the LSPs.
3) After receiving the IIH packet with the RA bit set and all the CSNP packets, the GR
Restarter removes the T1 timer. Otherwise, the GR Restarter periodically sends
IIH packets with the RR bit set, and does not remove the T1 timer until it has received
the IIH packet with the RA bit set and complete CSNP packets, or until the
maximum number of T1 timer expirations is reached.
4) Once the GR Restarter finds that the LSDB resynchronization at a level is
complete, it removes the T2 timer of the level.
5) After removing all the T2 timers, the GR Restarter removes the T3 timer and
enters the normal IS-IS flow.
3.5 BGP GR
1. Packet format
• Graceful Restart capability
BGP GR defines a new BGP capability, known as the Graceful Restart capability,
with a capability value of 64. The following figure shows its Value field.
Figure 16 BGP GR Capability Value
The R bit identifies the Restart State. When it is set to 1, it indicates that the sender is
restarting, and the receiver can send routing information without waiting for the End-of-
RIB marker from the sender. This prevents deadlock when multiple BGP speakers
wait for each other's End-of-RIB markers.
The Restart Time field indicates the maximum time that routes are held after the peer
is detected down. The <AFI, SAFI, Flags for address family> fields indicate the address
families for which the GR feature is supported. GR can support IPv4 and IPv6 at the
same time.
In BGP GR, an End-of-RIB marker is defined to speed up the BGP GR process.
An Update message with both the reachable NLRI and the withdrawn NLRI empty is
designated as the End-of-RIB marker. After a BGP connection is established, this
marker notifies the peer that the initial route advertisement is complete.
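For IPv4 unicast, an Update with both fields empty is simply the minimum-length Update message: a 19-byte BGP header followed by two zero 2-byte length fields (23 bytes total). A sketch (function name is mine) of how a receiver could recognize it:

```python
import struct

def is_end_of_rib_ipv4(update: bytes) -> bool:
    """True if a BGP message is the IPv4 End-of-RIB marker: an Update
    (type 2) with no withdrawn routes, no path attributes, and no NLRI,
    i.e. exactly 23 bytes long."""
    if len(update) != 23 or update[18] != 2:   # byte 18 is the type field
        return False
    withdrawn_len, attr_len = struct.unpack("!HH", update[19:23])
    return withdrawn_len == 0 and attr_len == 0

# Construct the marker: 16-byte all-ones marker field, length 23, type 2,
# zero withdrawn-routes length, zero path-attribute length.
end_of_rib = (b"\xff" * 16
              + struct.pack("!HB", 23, 2)
              + struct.pack("!HH", 0, 0))
```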
2. Protocol processing flow
The following figure depicts the BGP GR working flow:
Figure 17 BGP GR flow
1) Switch A sends an Open message containing the IPv4 GR capability to the
neighbor.
2) The Open message sent by Switch B also contains the IPv4 GR capability.
3) Switch A restarts, and sends an Open message with R_bit set to request Switch
B to start the GR Helper processing flow; the maximum route holdtime in the
message is 180 seconds.
4) Upon receiving the Open message, Switch B starts the GR Helper processing
flow, labels all the IPv4 routes received from Switch A before as Stale, and holds
the routes for 180 seconds before deletion. Other flows are the same as those of
traditional BGP.
5) Optimal route selection is not performed during this process.
6) After sending all the Update messages, Switch B sends an End-of-RIB to notify
update completion.
7) After receiving all the routing information, Switch A performs optimal route
selection and resends Update messages to notify neighbor devices to update
their routes.
8) Switch B deletes the routes labeled as stale.
4 H3C S9500 GR Characteristics
The routing protocols running on the S9500 series switches have rich GR features
that allow excellent fault tolerance and compatibility. Each protocol can interoperate
with devices of other vendors. OSPF GR, in particular, supports both the standard and
compatible modes and is therefore scalable in GR networking.
Copyright ©2007 Hangzhou H3C Technologies Co., Ltd. All rights reserved.
No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of
Hangzhou H3C Technologies Co., Ltd.
The information in this document is subject to change without notice.
S9500 QoS Technology White Paper
Copyright © 2007 Hangzhou H3C Technologies Co., Ltd. Page 1/15
H3C S9500 QoS Technology White Paper
Key words: QoS, quality of service
Abstract: Ethernet is now widely applied. At present, Ethernet is the leading
technology in various independent local area networks (LANs), and many Ethernet LANs
have become part of the Internet. With the development of Ethernet technology, most
common Internet users access the Internet through Ethernet. To implement end-to-end
QoS throughout the network, you must guarantee QoS on Ethernet. To do this, Ethernet
switching devices must use QoS technology to provide different QoS guarantees for
different types of traffic flows, especially those with higher demands for delay
and jitter guarantees.
Acronyms:
Acronym Full spelling
QoS Quality of Service
Table of Contents
1 Overview................................................................................................................................... 3
2 Basic Networking Structure........................................................................................................ 3
3 Features.................................................................................................................................... 4
3.1 Service Model ................................................................................................................. 4
3.2 Traffic Classification ........................................................................................................ 4
3.3 Traffic Policing................................................................................................................. 5
3.4 Priority Marking............................................................................................................... 6
3.5 Queue Scheduling........................................................................................................... 8
3.6 Congestion Avoidance................................................................................................... 10
3.7 Traffic Shaping.............................................................................................................. 12
3.8 Policy Routing............................................................................................................... 13
4 QoS Processing Procedure on the S9500 Series..................................................................... 14
1 Overview
On traditional packet switching networks, switches and routers treat all packets equally and handle them using the first in first out (FIFO) policy. This service model is called best-effort. It delivers packets to their destinations as best it can, without any guarantee of delay or jitter.
With the development of computer networks, more and more traffic that is sensitive
to bandwidth, delay, and jitter, such as voice, video, and critical data, is transmitted
over networks. This greatly enriches the services carried on a network. On the
other hand, it places higher demands on the quality of service (QoS) of network
transmission.
Ethernet is now widely applied. At present, Ethernet is the leading technology in
various independent local area networks (LANs), and many Ethernet LANs have
become part of the Internet. With the development of Ethernet technology, most
common Internet users access the Internet through Ethernet. To implement
end-to-end QoS throughout the network, you must guarantee QoS on Ethernet.
To do this, Ethernet switching devices must use QoS technology to provide
different QoS guarantees for different types of traffic flows, especially those with
higher demands for delay and jitter guarantees.
2 Basic Networking Structure
Figure 1 Basic networking structure
3 Features
3.1 Service Model
A service model is a set of end-to-end QoS capabilities. The simplest service
model is the best-effort model, which adopts the FIFO policy. It delivers packets to their
destinations as best it can, without any guarantee of delay or jitter. The Diff-Serv
model was introduced to implement QoS for network transmission. The Diff-Serv
model is a multi-service model. It provides QoS services for each packet according to
the QoS parameters specified for the packet, thus satisfying differentiated QoS
demands. The Diff-Serv model can implement end-to-end QoS for critical
services.
The S9500 series support the Diff-Serv model.
3.2 Traffic Classification
To specify different QoS parameters for packets of different levels, the Diff-Serv
model must classify the network traffic first. Traffic classification organizes packets
with different characteristics into different classes using classification rules. A
classification rule is a filter rule configured to meet your management requirements. It
can be very simple. For example, you can use a classification rule to identify traffic
with different priorities according to the ToS field in the IP packet header. It can be
very complicated too. For example, you can use a classification rule to identify the
packets according to the combination of link layer (Layer 2), network layer (Layer 3),
and transport layer (Layer 4) information including MAC addresses, IP protocol,
source addresses, destination addresses, port numbers of applications, and so on.
Generally, traffic classification criteria are limited to the header of an encapsulated
packet; packet payloads are rarely used for traffic classification.
The S9500 series support Layer 2, Layer 3, and Layer 4 ACL rules for traffic
classification. Such ACL rules can classify packets based on source MAC addresses,
destination MAC addresses, VLAN IDs, source IP addresses, destination IP
addresses, source TCP/UDP port numbers, destination TCP/UDP port numbers,
protocol types, IP precedence, ToS precedence, DSCP precedence, and whether
packets are fragmented.
3.3 Traffic Policing
To use limited network resources to provide customers with better services, you can
enable traffic policing on the incoming port for the traffic of the specified customers,
thus making the traffic adapt to the network resources assigned to it. Traffic policing
uses token buckets for traffic control.
Figure 2 Traffic policing
Figure 2 depicts the processing procedure of traffic policing. First, packets are
classified and the packets with the specified characteristics enter the token bucket for
processing. If the token bucket has enough tokens for sending the packets, the
packets can pass through; otherwise, the packets are dropped. In this way, you can
control the traffic of a certain class of packets.
The system puts tokens into the bucket at the set rate. You can set the capacity of
the token bucket. When the token bucket is full, the extra tokens will overflow and the
number of tokens in the bucket stops increasing. When the token bucket processes
packets, if it has enough tokens for sending these packets, the packets are sent, and
at the same time, the corresponding number of tokens are taken out of the bucket. If
the token bucket does not have enough tokens for sending these packets, these
packets are dropped. Therefore, the traffic rate is restricted under the rate of
generating tokens, thus implementing traffic control.
The S9500 series support traffic policing with the granularity of 8 kbps.
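The token-bucket behavior described above can be captured in a few lines. This is a minimal single-rate sketch for illustration; the S9500's hardware implementation (with its 8-kbps granularity) works differently in detail:

```python
import time

class TokenBucket:
    """Single-rate token bucket: tokens accrue at `rate` bytes/sec up to
    `capacity`; a packet conforms only if enough tokens are available."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity          # bucket starts full
        self.last = time.monotonic()

    def conform(self, packet_size: int) -> bool:
        now = time.monotonic()
        # Add tokens for the elapsed interval; overflow is discarded,
        # so the token count never exceeds the bucket capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packet_size:
            self.tokens -= packet_size  # packet passes, tokens consumed
            return True
        return False                    # not enough tokens: drop the packet
```

Because overflow tokens are discarded, the long-term rate of conforming traffic cannot exceed the token-generation rate, while the bucket capacity bounds the permissible burst size.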
3.4 Priority Marking
By marking packets with different priorities, you can identify the service levels of
different packets. The S9500 series can mark ToS precedence, differentiated
services codepoint (DSCP) precedence, and 802.1p precedence for specific packets.
These priority types apply to, and are defined in, different QoS models. The following
part introduces IP precedence, ToS precedence, DSCP precedence, 802.1p
precedence, and EXP precedence.
I. IP precedence, ToS precedence, and DSCP precedence
Figure 3 IP precedence, ToS precedence, and DSCP precedence
As shown in Figure 3 , the ToS field of the IP header contains 8 bits: the first three
bits (0 to 2) represent IP precedence from 0 to 7; the following 4 bits (3 to 6)
represent a ToS value from 0 to 15. In RFC2474, the ToS field of the IP header is
redefined as the DS field, where a DiffServ code point (DSCP) precedence is
represented by the first 6 bits (0 to 5) and is in the range 0 to 63. The remaining 2 bits
(6 and 7) are reserved.
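The bit positions above translate into simple shifts and masks. A small sketch (assuming the ToS/DS byte has already been read from the IP header):

```python
def ip_precedence(tos: int) -> int:
    # IP precedence: the three most significant bits (0-2) of the ToS byte.
    return (tos >> 5) & 0x7

def dscp(tos: int) -> int:
    # RFC 2474: the DSCP is the first 6 bits (0-5) of the re-named DS field;
    # bits 6 and 7 are reserved and ignored here.
    return (tos >> 2) & 0x3F
```

For example, a DS byte of 0xB8 carries DSCP 46 (expedited forwarding), which maps onto IP precedence 5 in the old interpretation.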
II. 802.1p precedence
802.1p precedence lies in Layer 2 packet headers and is applicable to occasions
where the Layer 3 packet header does not need analysis but QoS must be
guaranteed at Layer 2.
Figure 4 802.1Q Ethernet frame format
As shown in the figure above, a host supporting the 802.1Q protocol inserts a 4-byte
802.1Q tag header after the source address field of the original Ethernet frame header
when sending a packet. The 4-byte 802.1Q tag header contains a 2-byte Tag
Protocol Identifier (TPID) with the value 0x8100 and a 2-byte Tag Control Information
(TCI) field. The TPID is a new field defined by IEEE to indicate that the current packet is
802.1Q-tagged. Figure 5 describes the detailed contents of an 802.1Q tag header.
Figure 5 802.1p precedence
In the figure above, the 3-bit priority field in the TCI is the 802.1p priority, in the range
0 to 7. These eight priority values determine which packets are sent preferentially
when congestion occurs. The precedence is called 802.1p precedence because the
applications of this field are defined in detail in the 802.1p specifications.
To provide differentiated services for VLAN VPN or QinQ frames, you must classify
frames by the VLAN IDs or 802.1p precedence in their inner VLAN tags. The 802.1p
precedence of the inner VLAN tag then determines the scheduling priority and drop
precedence of a packet at the egress.
Figure 6 802.1p precedence mapping
III. EXP precedence
Figure 7 MPLS label
In an Ethernet MPLS packet, a shim header sits between the Layer 2 header and the
Layer 3 data. The shim contains a 3-bit EXP field, which can be used to determine the
scheduling priority and drop precedence of the packet: you can classify MPLS
packets by their EXP precedence, or map the DSCP precedence of IP packets to the
EXP precedence, and then use the EXP precedence to determine the scheduling
priority and drop precedence of MPLS packets at the egress.
Figure 8 EXP precedence marking
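The shim layout and one common DSCP-to-EXP mapping (taking the 3 most significant DSCP bits) can be sketched as follows; the function names are illustrative:

```python
def mpls_fields(shim: int):
    """Decompose a 32-bit MPLS shim: 20-bit label, 3-bit EXP, 1-bit S, 8-bit TTL."""
    label = (shim >> 12) & 0xFFFFF
    exp = (shim >> 9) & 0x7
    bottom_of_stack = (shim >> 8) & 0x1
    ttl = shim & 0xFF
    return label, exp, bottom_of_stack, ttl

def dscp_to_exp(dscp: int) -> int:
    """One common mapping: use the 3 most significant DSCP bits as EXP."""
    return (dscp >> 3) & 0x7

assert dscp_to_exp(46) == 5   # DSCP EF maps to EXP 5 under this mapping
```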
3.5 Queue Scheduling
When the network is congested, contention among packets for resources must be
resolved, usually through queue scheduling. The S9500 series support two queue
scheduling algorithms: strict priority (SP) and weighted round robin (WRR).
I. SP queue scheduling algorithm
Figure 9 Diagram for SP queueing
The SP queue scheduling algorithm is designed for critical service applications. The
key characteristic of mission-critical applications is that they require preferential
service to reduce response delay when congestion occurs. Assume that a port has
eight output queues. SP queueing classifies them into eight classes, queue 7 through
queue 0, in descending order of priority.
SP schedules packets in strict priority order. It sends the packets in the
highest-priority queue first and serves a lower-priority queue only when all
higher-priority queues are empty. You can put critical service packets into the
higher-priority queues and non-critical service packets (such as e-mail) into the
lower-priority queues. In this way, critical service packets are sent preferentially, and
non-critical service packets are sent only when no critical service packets are waiting.
The SP mechanism has a disadvantage: when congestion occurs and the
high-priority queues stay occupied for a long time, the packets in the lower-priority
queues are "starved" of service.
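The SP selection rule above can be sketched in a few lines (a simplified model, not device code):

```python
from collections import deque

# Eight queues; index 7 is the highest priority.
queues = [deque() for _ in range(8)]

def sp_dequeue():
    """Strict priority: always serve the highest-priority non-empty queue."""
    for prio in range(7, -1, -1):
        if queues[prio]:
            return queues[prio].popleft()
    return None  # all queues empty

queues[6].append("voice")
queues[1].append("email")
assert sp_dequeue() == "voice"   # the higher-priority queue is served first
assert sp_dequeue() == "email"
```

Note how a steady stream into queue 7 would keep queue 0 waiting indefinitely, which is exactly the starvation problem described above.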
II. WRR queue scheduling algorithm
A switch port supports eight output queues. The WRR queue scheduling algorithm
schedules all the queues in turn, so every queue is assured a certain service time.
Assume there are eight priority queues on a port. WRR assigns each queue a weight
value, w7 through w0, which indicates the proportion of bandwidth the queue obtains.
On a 100-Mbps port, if you configure the WRR weights as 50, 30, 10, 10, 50, 30, 10,
and 10 (corresponding to w7 through w0 in order), even the lowest-priority queue is
guaranteed at least 5 Mbps of bandwidth. This avoids the disadvantage of SP queue
scheduling, where packets in lower-priority queues may be starved of service for a
long time.
Another advantage of WRR queueing is that, although the queues are scheduled in
order, the service time for each queue is not fixed: if a queue is empty, the next
queue is scheduled immediately. In this way, the bandwidth resources are fully used.
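The weighted sharing described above can be verified with a small calculation (illustrative helper, not device configuration):

```python
def wrr_shares(weights, link_mbps):
    """Minimum bandwidth guaranteed to each queue under WRR: each queue
    gets its weight's share of the total weight."""
    total = sum(weights)
    return [link_mbps * w / total for w in weights]

# Weights w7..w0 from the example above, on a 100-Mbps port:
weights = [50, 30, 10, 10, 50, 30, 10, 10]
shares = wrr_shares(weights, 100)
assert min(shares) == 5.0   # the lowest-weight queues still get 5 Mbps
```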
3.6 Congestion Avoidance
When the network is congested, common network devices adopt tail drop: when the
queue length reaches the upper threshold, all newly arriving packets are dropped.
However, if a large amount of TCP traffic is dropped, TCP timeouts occur, triggering
the TCP slow start and congestion avoidance mechanisms and thus reducing TCP
traffic. If a queue drops packets of multiple TCP sessions at the same time, slow start
and congestion avoidance are triggered for all these sessions simultaneously. This is
called global TCP synchronization. In this case, the TCP sessions reduce their traffic
to the queue at the same time, so the traffic sent to the queue falls below the queue's
bandwidth and line utilization drops. Moreover, the traffic sent to the queue is not
stable, fluctuating between the maximum bandwidth and a very small value.
The S9500 series adopt the Weighted Random Early Detection (WRED) mechanism
to avoid global TCP synchronization. You can set the upper threshold and lower
threshold for a queue. When the queue length is smaller than the lower threshold, no
packet is dropped; when the queue length is between the lower threshold and the
upper threshold, WRED begins to drop packets randomly, with the drop probability
increasing as the queue length increases; when the queue length exceeds the upper
threshold, all newly arriving packets are dropped.
WRED drops packets randomly, thus avoiding global TCP synchronization. When a
TCP session slows down after its packets are dropped, the other TCP sessions
continue sending at high rates. Because some TCP sessions always keep sending at
high rates, the link bandwidth is fully utilized.
If the current queue length is compared with the upper threshold and lower threshold
to determine the drop policy, bursty traffic is not fairly treated and proper data
transmission is affected. To solve this problem, WRED compares the average queue
size with the lower threshold and upper threshold to determine the drop policy. The
average queue size reflects the queue size change trend but is not sensitive to bursty
queue size changes, and thus bursty traffic can be fairly treated.
On an S9500 switch, you can set the exponential factor for average queue length
calculation, the upper threshold, the lower threshold, and the drop probability for
packets with different precedence values to provide differentiated drop policies.
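The WRED behavior described above can be sketched as follows; the exponential factor drives a moving average of the queue length, and the drop probability rises linearly between the two thresholds (a simplified model, not the S9500 implementation):

```python
def wred_update_avg(avg, current, exponent):
    """EWMA average queue length; a larger exponent makes the average
    less sensitive to bursty queue size changes."""
    weight = 2 ** -exponent
    return (1 - weight) * avg + weight * current

def wred_drop_prob(avg, lower, upper, max_prob):
    """Drop probability rises linearly between the lower and upper thresholds."""
    if avg < lower:
        return 0.0            # below the lower threshold: no drops
    if avg >= upper:
        return 1.0            # above the upper threshold: drop all new arrivals
    return max_prob * (avg - lower) / (upper - lower)

assert wred_drop_prob(10, 20, 40, 0.1) == 0.0
assert abs(wred_drop_prob(30, 20, 40, 0.1) - 0.05) < 1e-12
assert wred_drop_prob(45, 20, 40, 0.1) == 1.0
```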
When congestion occurs, the S9500 switch drops packets as soon as possible to
release queue resources and try not to assign packets to high-delay queues in order
to eliminate congestion.
An S9500 switch can assign drop levels to packets according to their 802.1p
precedence, that is, color the packets, or assign drop levels through priority marking.
The drop level can be 0, 1, or 2, representing green, yellow, and red respectively.
When congestion occurs, red packets are the first to be dropped and green packets
are the last.
You can set congestion avoidance parameters and thresholds for each queue and
each drop level.
The S9500 series support two drop algorithms:
•	Tail drop: the drop policy for packets of a color (red, yellow, or green, assigned
according to drop levels) is determined by the threshold set for that color. When
the number of packets of a color exceeds the corresponding upper threshold,
the system begins to drop newly arriving packets of that color.
•	WRED drop algorithm: the drop levels are taken into account when packets are
dropped by queue. When the number of packets of a color (red, yellow, or green)
exceeds the lower threshold set for that color, the system begins to drop packets
of that color between the lower threshold and upper threshold according to a
certain slope. When the number of packets of a color exceeds the upper
threshold set for that color, the system drops all packets of that color exceeding
the upper threshold.
3.7 Traffic Shaping
Traffic shaping controls the rate of output traffic, so that the traffic can be sent out at
an even rate. Normally, traffic shaping is applied on a device to adapt its output rate
to the input rate of its connected downstream device so as to avoid unnecessary
packet drop and congestion. It differs from traffic policing mainly in that traffic shaping
buffers packets exceeding the rate limit so that packets are sent out at an even rate,
while traffic policing drops packets exceeding the rate limit. However, traffic shaping
introduces additional delay while traffic policing does not. The S9500 series support
port-based traffic shaping, that is, traffic shaping can be implemented to all traffic on a
port. It also supports queue-based traffic shaping on a port.
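The buffering behavior that distinguishes shaping from policing can be sketched with a token bucket (a simplified single-threaded model; class and parameter names are illustrative):

```python
class TokenBucketShaper:
    """Minimal token-bucket sketch: tokens accrue at `rate` bytes/sec up to
    `burst`; a packet is sent only when enough tokens exist, otherwise it
    waits in a buffer (shaping) instead of being dropped (policing)."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0
        self.buffer = []

    def arrive(self, now, size):
        # Accrue tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if not self.buffer and size <= self.tokens:
            self.tokens -= size
            return "sent"
        self.buffer.append(size)      # buffered for later, not dropped
        return "buffered"

shaper = TokenBucketShaper(rate=1000, burst=1500)
assert shaper.arrive(0.0, 1500) == "sent"       # within the burst allowance
assert shaper.arrive(0.1, 1500) == "buffered"   # only 100 tokens accrued
```

The buffered packet adds delay but is not lost, which is the trade-off against policing noted above.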
3.8 Policy Routing
Figure 10 Policy routing application scenario
The S9500 series can classify packets first and then configure traffic redirecting for a
certain class of packets to implement policy routing. As shown in Figure 10, the
S9500 switch first classifies packets by source and destination IP address to identify
packets whose source IP addresses are private and whose destination IP addresses
are public. You can then use policy routing to redirect such packets to the NAT device
for address translation and then on to the Internet.
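The classification step above (private source, public destination) can be sketched with the standard library; the function name is illustrative:

```python
import ipaddress

def should_redirect_to_nat(src: str, dst: str) -> bool:
    """Policy-routing classifier sketch: match packets with a private
    source address and a globally routable destination address."""
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    return s.is_private and d.is_global

assert should_redirect_to_nat("192.168.1.10", "8.8.8.8")
assert not should_redirect_to_nat("192.168.1.10", "10.0.0.1")
```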
4 QoS Processing Procedure on the S9500 Series
Figure 11 QoS processing procedure on the S9500 series
The S9500 series use traffic classification to classify traffic based on source MAC
addresses, destination MAC addresses, Ethernet types, VLANs, 802.1p priority, IP
protocol, source IP addresses, destination IP addresses, application port numbers,
ICMP packet types, IP precedence, ToS, DSCP, EXP, and the VLAN IDs and 802.1p
priorities in the inner VLAN tags of QinQ frames.
After classifying traffic into different classes, besides simply permitting a class of
packets to pass through or dropping a class of packets, the S9500 series provide a
policy control list (PCL) to perform the following actions for the traffic flows: traffic
policing, traffic accounting, marking QoS parameters (including 802.1p priority, DSCP,
EXP, and drop precedence), traffic mirroring, traffic redirecting, and specifying the
output queue.
After packets are marked with different drop levels through priority mapping, the
congestion avoidance module determines the drop policies for packets based on the
user-defined drop mode and the upper and lower thresholds set for each color. With
tail drop adopted, when the number of packets of a color (red, yellow, or green)
exceeds the upper threshold set for that color, the system begins to drop newly
arriving packets of that color. With WRED adopted, when the number of packets of a
color exceeds the lower threshold set for that color, the system begins to drop
packets of that color between the lower and upper thresholds according to a certain
slope; when the number of packets of a color exceeds the upper threshold set for that
color, the system drops all packets of that color exceeding the upper threshold.
After congestion avoidance is completed, the packets permitted to be forwarded are
assigned to the corresponding queues. The queue scheduling module uses SP or
WRR queue scheduling algorithm to schedule packets. When forwarding packets, the
output port performs traffic shaping for outbound traffic based on the token bucket
size.
Copyright ©2007 Hangzhou H3C Technologies Co., Ltd. All rights reserved.
No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of
Hangzhou H3C Technologies Co., Ltd.
The information in this document is subject to change without notice.
S9500 RPR Technology White Paper (V2.00)
Copyright © 2007 Hangzhou H3C Technologies Co., Ltd. Page 1/21
H3C S9500 RPR Technology White Paper
(Version 2.00)
Keywords: RPR
Abstract: RPR is a MAC layer technology designed for carrying large-capacity data services in
metropolitan area networks (MANs). It is physical layer independent, capable of running
over SONET/SDH, DWDM, and Ethernet. It can provide efficient, flexible network
solutions for broadband IP-based MAN carriers.
Acronyms:
Acronym Full spelling
DWDM Dense Wavelength Division Multiplexing
LSP Label Switched Path
MPLS Multi Protocol Label Switching
RPR Resilient Packet Ring
SDH Synchronous Digital Hierarchy
SONET Synchronous Optical Network
STP Spanning Tree Protocol
VRRP Virtual Router Redundancy Protocol
Table of Contents
1 Overview
2 RPR Features
2.1 Concepts
2.1.2 Span
2.1.3 Edge
2.1.4 Wrapping
2.1.5 Steering
2.1.6 Host
2.1.7 Ringlet Selection Table
2.2 Protocol Processing Mechanism
2.2.1 Data Operations on RPR Stations
2.2.2 Efficient Bandwidth Use
2.2.3 Automatic Topology Discovery
2.2.4 Topology Protection and Self-Recovery
2.2.5 Fairness Algorithm
2.2.6 QoS Guarantee
2.2.7 Others
2.3 RPR Data Frame Format
3 RPR Applications
3.1 Layer 3 Application
3.2 Layer 2 Application
4 RPR Features on the S9500
4.1 Powerful Service Switching Performance
4.2 Complete QoS Capabilities
4.3 Abundant Ring Selection Mechanisms
4.4 Layer 2 Bridging + L2 Tunneling
4.5 Compatibility with Ethernet Protection Mechanisms
4.6 Complete Clock Schemes
4.7 Ease of Configuration
4.8 RPR Implementation on the S9500
1 Overview
Resilient packet ring (RPR) is a MAC layer technology standardized by the IEEE
802.17 working group. It is independent of the physical layer and can run over
SONET/SDH, fast Ethernet, and DWDM.
The RPR technology combines the high reliability of SDH self-recovery with Ethernet
advantages such as economy, high bandwidth, flexibility, and scalability. It provides
bandwidth management with data optimization and high-performance multi-service
transmission on a ring topology.
2 RPR Features
2.1 Concepts
Figure 1 RPR ring topology
2.1.2 Span
In a bidirectional ring, the section between two adjacent stations is called a span.
A span comprises two bidirectional links.
2.1.3 Edge
A span on which data frames are not allowed to pass is called an edge. An edge
can result from fiber cut, signal attenuation, manual switch, or any other error or
protection action.
2.1.4 Wrapping
Wrapping is a protection mode of RPR. In wrapping mode, after a span or station
fails, protected traffic is directed at the point of failure to the opposing ringlet. The
two ringlets thus form a closed single ring around the point of the failure. As the
wrapping allows quick switchover without ringlet selection update, data frame
loss is minimized, but at the price of bandwidth.
2.1.5 Steering
Steering is another protection mode of RPR. Unlike in wrapping mode, in
steering mode, the RPR stations on the ringlet update the ringlet selection upon
detection of an edge. Based on the update result, the protected traffic is steered
to the newly selected ringlet. The steering mode thus avoids the bandwidth
waste with wrapping mode, but as it requires topology reconvergence, it can
cause frame loss and service interruption.
2.1.6 Host
For the purpose of this document, the upper layer of the RPR MAC layer is
referred to as the host. The host receives, processes, and transmits the traffic
destined for the local station.
2.1.7 Ringlet Selection Table
Each RPR station maintains a ringlet selection table, which includes information
such as ringlets and hops to reach other stations on the RPR ring.
When the ring is closed, two paths are available for reaching a destination, of
which the shortest one is selected by default.
2.2 Protocol Processing Mechanism
RPR comprises dual counter-rotating unidirectional ringlets, identified as Ringlet 0
and Ringlet 1, whose links operate at the same rate. The two ringlets can transmit
data at the same time.
Each RPR station is identified by a 48-bit MAC address, as in Ethernet, and the MAC
address of a station must be unique on the RPR ring. The two physical optical
interfaces of an RPR station are regarded as one logical interface from the
perspective of the network layer and the link layer.
2.2.1 Data Operations on RPR Stations
Stations on an RPR ring handle data frames by performing the following
operations:
•	Insert: to place a frame received from outside the RPR ring onto a ringlet.
•	Transit: to pass a frame on to the next station. As the frame is forwarded
directly at the RPR MAC layer, the throughput of the RPR station is improved.
For a multicast or broadcast frame, the RPR station also sends a copy to the
upper layer.
•	Copy: to deliver an inbound frame from the ring to the upper layer. Copying a
frame does not remove it from the ring.
•	Strip: to remove a frame from a ringlet. A station strips a frame if the frame is
destined for or sourced from the local station, or if the time to live (TTL) value of
the frame expires.
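The decision rules above can be sketched as follows (a simplified model; the frame is represented as a hypothetical dict, not an 802.17 frame structure):

```python
def handle_frame(frame, my_mac):
    """Decide what an RPR station does with a frame arriving on the ring.
    `frame` is a hypothetical dict with 'src', 'dst', 'ttl', and 'multicast'."""
    if frame["ttl"] <= 0 or frame["src"] == my_mac:
        return "strip"                   # expired, or traveled back to its source
    if frame["multicast"]:
        return "copy-and-transit"        # deliver locally and pass along
    if frame["dst"] == my_mac:
        return "copy-and-strip"          # destination stripping
    return "transit"                     # forward at the MAC layer

f = {"src": "A", "dst": "C", "ttl": 254, "multicast": False}
assert handle_frame(f, "B") == "transit"
assert handle_frame(f, "C") == "copy-and-strip"
```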
The following figure shows how a unicast data frame is transmitted on an RPR
ringlet.
Figure 2 Unicast traffic forwarding
As shown in the figure, the source station inserts the unicast frame onto Ringlet 0 or
Ringlet 1, the transit stations pass the frame along, and the destination station copies
and strips it.
For a multicast or broadcast frame, the stations on the RPR ring copy and transit the
frame; when the frame travels back to the source, the source station strips it from the
ring.
Figure 3 Multicast/broadcast traffic forwarding
2.2.2 Efficient Bandwidth Use
RPR allows efficient bandwidth use on a ring network:
•	Destination stripping: Different from traditional ring technologies such as
SDH/SONET, where a unicast frame is removed from the ring only after it
travels back to the source station, RPR adopts destination stripping: a unicast
frame is removed from the ring as soon as it reaches the destination station.
•	Spatial reuse: On an RPR ring, frame transmission on any one link is
independent of frame transmission on other links. By supporting concurrent
per-ringlet transmissions, the bandwidth available to the stations on a ringlet
exceeds the individual link capacity. On nonoverlapping segments, concurrent
transfers of independent traffic are allowed; on overlapping segments,
bandwidth is allocated according to a bandwidth fairness algorithm.
•	Automatic bandwidth allocation: Different from the complex static bandwidth
allocation of SDH, RPR supports bursty traffic, allowing fast service
deployment.
•	No redundant bandwidth: Unlike SDH, RPR can transmit frames on both
ringlets without reserving bandwidth for protection; the two ringlets back each
other up to achieve self-recovery.
•	Support for broadcast/multicast: For a broadcast or multicast, only one copy
travels on the ring. The frame is copied and transited at each RPR station and
stripped from the ring when it travels back to the source station.
•	L2 rapid forwarding: As a station processes only the frames destined for it,
forwarding speed is improved.
2.2.3 Automatic Topology Discovery
Each station on an RPR ring uses topology and protection (TP) frames to
broadcast its topology and protection status information. After receiving the
information, other stations update their local topology databases, resulting in a
consistent topology database to be maintained on the ring.
When detecting a protection state change, a station sends eight TP frames at
intervals of 1 to 20 milliseconds (10 milliseconds by default). In addition, the station
sends TP frames periodically at intervals of 50 milliseconds to 10 seconds (100
milliseconds by default). This mechanism enables all stations on the ring to learn of
protection and topology changes in a timely, reliable manner, ensuring timely
protection switchover in addition to topology synchronization.
The automatic topology discovery mechanism of RPR makes an RPR station
plug-and-play: the station obtains the ring topology and is sensed by other stations
automatically.
2.2.4 Topology Protection and Self-Recovery
Figure 4 Path steering upon detection of a fault (A -> B)
Figure 5 Path wrapping upon detection of a fault (A -> B)
RPR can provide protection in response to fault, allowing services to recover
within 50 milliseconds. RPR provides two protection modes: steering and
wrapping.
In steering mode, a station, upon detecting a fault on a ringlet, broadcasts the
protection state change in TP frames on the ring. This also triggers ringlet selection.
When the other stations receive the TP frames, they transition to the corresponding
protection state, recalculate the reachability of the stations on the ring, and update
their ringlet selection tables to select the ringlet that retains connectivity to the
destination stations.
Unlike the steering mode, the wrapping mode does not involve a ringlet selection
update on the entire ring. Instead, the stations at the two sides of a point of failure
transition to the wrapping state upon detecting the failure, while other stations
transmit traffic along the old path. When the protected traffic arrives at the station on
one side of the point of failure, it is directed to the opposing healthy ringlet to reach
the station on the other side of the point of failure. The protected traffic then travels
the original ringlet to reach the destination.
Compared with the steering mode, the wrapping mode provides quicker
protection resulting in less frame loss but requires more bandwidth. To benefit
from both, the RPR implementation of the S9500 adopts the wrap-then-steer
mode. In this mode, RPR starts the wrapping mode once a link fails to ensure
continuity of the ongoing service and switches to the steering mode after the
topology converges to save bandwidth.
2.2.5 Fairness Algorithm
Resources on a ring network are shared among the stations. RPR provides a global
fairness algorithm on the entire ring to guarantee fair sharing and improve bandwidth
efficiency. The fairness algorithm can regulate traffic dynamically to minimize the
likelihood of congestion and to handle large bursts of traffic effectively, ensuring
normal use of the network.
To achieve bandwidth allocation fairness, each RPR station monitors the use of its
bandwidth, and an explicit backpressure mechanism operates between stations. With
this mechanism, a station notifies the upstream source stations of its currently
available capacity, having them regulate their traffic transmission. Thus, bandwidth
allocation fairness is achieved on the ring.
The fairness algorithm of RPR involves the following three aspects:
•	Determining the congestion threshold on a station
•	Determining the rate to advertise to the upstream station
•	Determining the traffic insertion rate on a station
When congestion occurs on a station, the station sends a congestion
advertisement on the ringlet opposite to the data transmission direction to
advertise a fair rate. Receiving the advertisement, the upstream station then
decreases the frame insertion rate down to the advertised fair rate. If congestion
also occurs on the current station, it does the same as its downstream station did.
Bandwidth management regulates low-priority data frames, but not high-priority
data frames or control frames to guarantee high-priority services. The bandwidth
management ability of RPR allows for bandwidth allocation efficiency and
fairness, which are impossible with Ethernet or other ring network technologies
where bandwidth management is not available.
The following figure illustrates how the fairness algorithm works on an RPR ring
comprising stations A, B, C, and D. Suppose the bandwidth of each link is 10
Gbps and traffic travels the outer ringlet.
Figure 6 Bandwidth fairness algorithm
The following is what occurs on the RPR ring:
(1) Both stations C and B send 4000 Mbps traffic to station D. They share
bandwidth on span C–D and represent 8-Gbps bandwidth in total. As the
link bandwidth is 10 Gbps, no congestion is present.
(2) Station A also sends 4000 Mbps traffic to station D. As a result, the total
traffic on span C–D reaches 12 Gbps, exceeding the maximum link
bandwidth (10 Gbps). Congestion thus occurs on span C–D.
(3) With the fairness algorithm, station C performs calculation immediately
after detecting the congestion and decreases the rate at which it puts traffic
onto the ring to 2000 Mbps. At the same time, it sends control frames to
station B backward along the inner ringlet to convey congestion and
fairness algorithm information.
(4) Upon receiving the fairness control frames, station B immediately
decreases its traffic rate and sends fairness control frames on to station A.
Following the fairness algorithm, stations C and B adjust their traffic rates to
3000 Mbps.
(5) After receiving the control frames, station A does the same.
As this process iterates, stations A, B, and C converge on a traffic rate of about 3300
Mbps each, sharing the bandwidth fairly.
In this example, absolute fairness is maintained. RPR, however, allows exclusive
bandwidth allocation and weighted bandwidth allocation. Thus, traffic rate can be
different at each station depending on its fairness weight.
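The convergence described above can be illustrated with an idealized simulation (equal weights, all demands above the fair share; this is a sketch of the idea, not the 802.17 fairness algorithm itself):

```python
def fair_converge(demands, capacity, rounds=20):
    """Iterative sketch of the fairness idea: while the link is congested, a
    fair rate is advertised and each sender caps its insertion rate at it."""
    rates = list(demands)
    for _ in range(rounds):
        if sum(rates) <= capacity:
            break                               # congestion resolved
        fair = capacity / len(rates)            # equal weights assumed
        rates = [min(d, fair) for d in demands]
    return rates

# Stations A, B, and C each offer 4000 Mbps toward D over a 10-Gbps span;
# each converges to one third of the link, about 3333 Mbps.
rates = fair_converge([4000, 4000, 4000], 10000)
assert all(abs(r - 10000 / 3) < 1e-6 for r in rates)
```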
2.2.6 QoS Guarantee
The capabilities of 50 milliseconds self-recovery, efficient bandwidth use, and
S9500 RPR Technology White Paper (V2.00)
Copyright © 2007 Hangzhou H3C Technologies Co., Ltd. Page 11/21
advanced RPR-Fa algorithm enables RPR to provide good QoS guarantee for
services, achieving high reliability, large throughput, low delay, and low loss rate .
RPR services fall into three classes: class A, class B, and class C, in descending
order of priority.
•	Class A: Provides low-jitter bandwidth guarantee to support TDM services. It is
subdivided into subclasses A0 and A1. For the subclass A0 service, bandwidth
is reserved on the entire ring, and unused reserved bandwidth cannot be used
by lower-priority services. For the subclass A1 and class B services, bandwidth
is reclaimable: unused bandwidth can be used by lower-priority services.
•	Class B: Provides low-delay bandwidth guarantee to transmit data in order of
priority. Class B is divided into two subclasses, committed information rate (CIR)
and excess information rate (EIR), that is, B-CIR and B-EIR.
•	Class C: Provides a best-effort service for traditional IP traffic.
RPR uses the Sc field to indicate the priority of an RPR frame.
For traffic to be forwarded, the RPR MAC layer adopts either a single transit
queue or dual transit queues. On a single-queue station, all services are put in a
first in first out (FIFO) queue regardless of their priorities. On a dual-queue
station, services are put either in the high-priority queue or in the low-priority
queue as follows:
•	Class A service: put in the high-priority queue.
•	Class B service: put in the low-priority queue. The subclass B-CIR service has
higher priority than the class C service and is not regulated by the fairness
algorithm. The subclass B-EIR service has the same priority as the class C
service and is regulated by the fairness algorithm.
•	Class C service: put in the low-priority queue.
A committed information rate is allocated for the class B service. Traffic conforming
to the CIR has higher priority than nonconforming (B-EIR) traffic. The RPR MAC
layer controls the traffic transmission order, which depends on the queue model.
- On a dual-queue station
The RPR MAC layer assigns traffic sent by the host to the host queue and traffic
to be forwarded for other stations to the transit queues. The RPR MAC layer
dequeues frames in the following order:
(1) Frames in the high-priority transit queue.
(2) Class A frames from the host, as long as the low-priority transit queue has not crossed a specified depth threshold; once the threshold is crossed, the frames in that queue are sent first.
(3) B-CIR frames.
(4) B-EIR and class C frames, subject to the fairness algorithm.
(5) Frames in the low-priority transit queue, if no higher-priority frames are waiting for transmission.
- On a single-queue station
Frames in the transit queue are transmitted first, regardless of priority. The class C and subclass B-EIR services are regulated by the fairness algorithm.
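The dequeue precedence described above can be illustrated with a minimal model. This sketch is not H3C code: the `DualQueueStation` class, its frame representation, and the omission of the host queue and the low-queue depth threshold are all simplifying assumptions.

```python
from collections import deque

class DualQueueStation:
    """Minimal model of a dual-queue RPR station's transit path.
    Class A frames go to the high-priority transit queue; class B
    and class C frames go to the low-priority transit queue."""

    def __init__(self):
        self.high = deque()  # class A transit frames
        self.low = deque()   # class B and class C transit frames

    def enqueue_transit(self, frame):
        # frame is a dict with a "cls" key: "A", "B", or "C".
        (self.high if frame["cls"] == "A" else self.low).append(frame)

    def dequeue(self):
        # High-priority transit frames are always sent first; host
        # traffic and the depth threshold are not modeled here.
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None
```

On a single-queue station the model degenerates to one FIFO, matching the behavior described above.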
RPR supports bandwidth reservation, providing a strong QoS guarantee for reserved bandwidth. This allows traditional voice services to be carried.
Because the fairness algorithm of RPR does not regulate high-priority services, high-priority traffic is always sent before low-priority traffic. To prevent excessive high-priority traffic from affecting low-priority services, it is recommended that you set a threshold on high-priority traffic.
RPR provides static traffic shaping methods, such as rate limiting, for high-priority and low-priority data frames. For low-priority data frames, RPR also provides dynamic traffic shaping.
These QoS measures ensure that excellent QoS can be provided on an RPR ring even if the host provides no QoS guarantee.
2.2.7 Others
In addition to the features described in the preceding sections, RPR also delivers
features described in this section.
As a physical-layer-independent MAC layer protocol, RPR can run over physical layers such as Ethernet, DWDM, and SONET/SDH.
RPR allows for great bandwidth scalability. For example, you can scale RPR ring bandwidth from 155 Mbps to 10 Gbps, and even to 40 Gbps.
A very important feature of RPR is that it avoids the N² full-mesh issue, achieving full connectivity at the MAC layer for N stations by using only N links.
Compared with SDH, POS and Ethernet, RPR has lower link cost.
RPR is an optimized Ethernet technology. It supports all Ethernet protocols and
services.
RPR supports equipment interoperability at the ring level. For example, you can
connect ATM devices, routers, and TDM devices to the same RPR ring. These
networks share the physical links and total bandwidth of the ring while being
transparent to each other.
RPR provides complete MIB features, giving it a solid operations and maintenance platform for operability and manageability.
2.3 RPR Data Frame Format
Figure 7 RPR frame structure
There are two types of RPR frames: basic data frames and extended data
frames. Suppose station C sends data to station D on the RPR ring shown in the
above figure.
When stations A, B, C and D are located in the same VLAN, the data frames are
extended frames and are forwarded at Layer 2.
When stations C and D belong to a VLAN different from the one to which stations
A and B belong, the data frames are basic frames and are forwarded at Layer 3.
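The same-VLAN rule above can be captured in a few lines. This is an illustrative sketch only; the function name is an assumption, and the real frame-type decision is part of the full forwarding process.

```python
def rpr_frame_type(src_vlan: int, dst_vlan: int) -> str:
    """Return the RPR data frame type used between two stations,
    per the rule described above (illustrative simplification)."""
    # Stations in the same VLAN exchange extended frames (Layer 2).
    if src_vlan == dst_vlan:
        return "extended"
    # Stations in different VLANs exchange basic frames (Layer 3).
    return "basic"
```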
3 RPR Applications
RPR stations can insert both Layer 3 traffic and Layer 2 traffic onto an RPR ring.
For a Layer 3 RPR application, Layer 3 interfaces (VLAN interfaces on the S9500)
must be configured; for a Layer 2 RPR application, RPR logical interfaces must
trunk the VLAN(s) to which the Layer 2 data streams belong.
3.1 Layer 3 Application
Assign the RPR logical interfaces on a ring to the same VLAN, for which a VLAN
interface must be created on each RPR station. Assign the VLAN interfaces IP
addresses in the same network segment. Thus, from the perspective of the
service layer on a station, any other station on the ring is the direct next hop.
On an RPR station, service traffic arriving at other service boards is passed
through the VLAN interface to the RPR board where the service traffic is inserted
onto the RPR ring.
RPR supports various routing protocols. A protection switch, caused by a fiber cut for example, does not result in route reconvergence or MPLS LSP re-establishment. This is because two paths are available to a destination: when one fails, traffic can travel the other to reach the destination. The protection switch may complete within 50 milliseconds, far less than the hello interval of service layer
protocol neighbors. Moreover, the DOWN event of a physical port will not be
notified to the service protocol layer, unless both physical ports come down.
Therefore, using an RPR ring to connect distribution and access services can
provide the reliability and stability that are impossible with STP or other ring
network technologies where path switchover can result in update at the service
layer involving routing, MPLS LSP, ARP, MAC, and so on.
RPR rings support the Virtual Router Redundancy Protocol (VRRP). You can
achieve node protection by assigning two stations on an RPR ring to the same
VRRP group. Thus, when the master station fails or fiber cut occurs on the spans
of the station, the backup station takes over and advertises attribute discovery
(ATD) frames to direct the traffic destined for the VRRP group to it.
As shown in Figure 8 , the two common stations, also the egresses of the RPR
ring at the distribution layer, form a VRRP standby group, which is assigned a
virtual IP address. When the master fails, the backup station takes over to
forward traffic destined for this virtual IP address. The RPR module and the VRRP module work together as follows to ensure that switchover occurs in the VRRP group within 50 milliseconds:
Immediately after detecting that the master has disappeared from the RPR ring
through the fast RPR detection mechanism, the backup station issues the virtual
MAC of the VRRP group locally and advertises throughout the ring that it is now
the master of the VRRP group to direct traffic to it. The process occurs while the
VRRP module is handling the switchover situation.
Figure 8 Network diagram for a Layer 3 application with RPR
3.2 Layer 2 Application
When an RPR station receives a frame belonging to a VLAN carried on it, it
encapsulates the frame in extended RPR frame format and forwards the RPR
frame based on the MAC address onto the ring. Similar to Ethernet ports, RPR
ports can trunk multiple VLANs to provide multi-VLAN access, supporting both
Layer 2 and Layer 3 services.
RPR supports ring interconnect, including ring intersection, ring tangent, and ring link. In ring intersection mode, two RPR rings are interconnected at a single station installed with two RPR boards. In ring link mode, RPR stations are connected with other boards, GE boards for example, to form a redundant connection, ensuring that a redundant path is available when faults occur on both RPR ringlets. In both ring link mode and ring intersection mode, a Layer 2 loop may form. You need to use some mechanism, STP on RPR ports for example, to remove the loops.
Figure 9 Network diagram for an RPR L2 application
You may configure VLAN-based tunnels on an RPR ring for transparent
transmission purpose. As shown in the above figure, you can tunnel VLAN 60
traffic from device A to device B by configuring a tunnel on their connecting RPR
stations. The tunnel created on either station must take the other station as the
destination. In addition, if the VLAN for which the tunnel is created contains only
one access port in addition to the RPR logical port, disable MAC address
learning in the VLAN.
As shown in the above figure, you may configure a tunnel on the RPR station connected to device F to tunnel the traffic from device F to the egress on the 2.5 Gbps ring. As two redundant egresses are available, you can configure the tunnel to switch over to the backup egress station when the primary one fails.
When tunnels are used on the RPR ring, the use of default ringlet selection can
cause a ringlet to be saturated. To address the problem, you can configure
VLAN-based ringlet selection to distribute traffic.
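Conceptually, VLAN-based ringlet selection is a static VLAN-to-ringlet map consulted before the default selection. A minimal sketch, assuming a plain dictionary holds the operator-configured entries:

```python
def select_ringlet(static_map: dict, vlan_id: int, default_ringlet: int = 0) -> int:
    """Return the ringlet (0 or 1) for a VLAN: a configured static
    entry wins; otherwise fall back to the default selection."""
    return static_map.get(vlan_id, default_ringlet)

# Example: offload VLAN 60 onto ringlet 1 while VLAN 50 stays on ringlet 0,
# relieving a saturated ringlet. Values are illustrative.
static_entries = {60: 1, 50: 0}
```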
Tunneling is suitable for point-to-point service interconnection. For multi-point
service interconnection, use Layer 2 bridging. As shown in the above figure,
devices C, D and E are in VLAN 50. To bridge traffic between them through the
RPR ring, you must configure VLAN 50 on the corresponding RPR stations and
enable MAC address learning in the VLAN.
On the 10-Gbps RPR and 2.5-Gbps RPR rings, you can adopt VPLS plus RPR
to achieve multi-point interconnection.
Figure 10 VPLS tunneling on an RPR ring
As shown in the above figure, VPLS tunnels are established on the RPR ring to
implement L2 interconnection between A, B, C and D through MAC address
learning.
4 RPR Features on the S9500
4.1 Powerful Service Switching Performance
According to the RPR protocol, fault detection should complete within 10 milliseconds and service switching within 50 milliseconds. The RPR implementation on the S9500, however, provides service switching within 20 milliseconds, fully satisfying carrier-class requirements and reflecting the strength of RPR.
4.2 Complete QoS Capabilities
The RPR implementation on the S9500 provides complete QoS capabilities for
RPR traffic. It supports the access control list (ACL), rate limiting, traffic shaping,
queuing, and almost all QoS features available on Ethernet. It supports
mappings from COS, EXP and DSCP priorities to RPR priorities. Depending on
customer needs, it can support services of Class A, B and C, providing
bandwidth guarantee and other service differentiation capabilities. In addition,
the RPR implementation on the S9500 uses a fairness algorithm to ensure fair
access of stations to ring bandwidth, allowing for bandwidth use efficiency,
congestion avoidance and congestion pre-warning.
4.3 Abundant Ring Selection Mechanisms
The RPR implementation on the S9500 supports multiple ring selection modes.
By default, dynamic ringlet selection (shortest path selection) is adopted on an
RPR ring. Dynamic ringlet selection results in a ringlet selection table that
contains the shortest paths to other RPR stations on the ring upon topology
convergence. This ringlet selection table does not change when the topology is
stable.
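Dynamic ringlet selection amounts to computing, per destination, which ringlet offers fewer hops, and caching the result at topology convergence. The sketch below is illustrative only: it assumes stations numbered consecutively around the ring and arbitrarily labels the increasing-index direction as ringlet 0.

```python
def shortest_ringlet(n_stations: int, src: int, dst: int) -> int:
    """Pick the ringlet with the shorter path from src to dst on a
    ring of n_stations; ties go to ringlet 0 (an assumption)."""
    hops_r0 = (dst - src) % n_stations  # hops in the ringlet-0 direction
    hops_r1 = (src - dst) % n_stations  # hops in the ringlet-1 direction
    return 0 if hops_r0 <= hops_r1 else 1

def build_selection_table(n_stations: int, src: int) -> dict:
    """Ringlet selection table built once at topology convergence;
    it stays unchanged while the topology is stable."""
    return {dst: shortest_ringlet(n_stations, src, dst)
            for dst in range(n_stations) if dst != src}
```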
In addition, you can configure static ringlet selection entries, which take priority over dynamic entries. For Layer 2 tunneling, VLAN-based ringlet
selection is supported.
4.4 Layer 2 Bridging + L2 Tunneling
According to the RPR protocol, frames destined for or sourced from a MAC address not on the RPR ring are flooded. All the stations on the ring can therefore receive the traffic, wasting bandwidth, affecting the fairness algorithm, and potentially compromising data security.
The RPR implementation on the S9500 supports Layer 2 tunneling, allowing
traffic to be tunneled per VLAN. Ethernet frames in a VLAN can be transparently
transmitted between stations, and the stations do not need to learn MAC
addresses in the VLAN. Layer 2 tunneling is thus suitable for point-to-point
interconnection on the ring.
The Layer 2 bridging function is primarily provided on GE RPR rings, while VLAN-based tunneling is primarily adopted on 10-Gbps or 2.5-Gbps RPR rings. On a 10-Gbps or 2.5-Gbps RPR ring, you can achieve Layer 2 bridging by combining VPLS with RPR. On a GE RPR ring, a maximum of 128K MAC addresses is supported.
4.5 Compatibility with Ethernet Protection Mechanisms
The RPR interfaces on the S9500 support the Spanning Tree Protocol (STP).
They can participate in STP calculation like common Ethernet ports to eliminate
loops. With STP, RPR can remove loops at Layer 2 (if any) on intersecting rings
and ring links, delivering high reliability.
RPR can work with VRRP to achieve backup between stations on an RPR ring, thus improving reliability. On the S9500, a special procedure ensures that the switchover between the master and the backup in a VRRP group on an RPR ring completes within 50 milliseconds. Such node protection mitigates the impact of extreme events such as power failure of RPR boards.
4.6 Complete Clock Schemes
On a 10-Gbps RPR ring, as 10 GE LAN mode is supported, the RPR stations do
not need to synchronize the clocks.
In 10-Gbps POS mode, RPR resolves the clock synchronization loss between
stations by having hardware automatically insert idle frames.
In addition, the S9500 provides a special RPR clock switching mechanism. With this mechanism, a master station uses a clock card to keep synchronization with the reference clock source. The clocks on all other stations on the RPR ring synchronize with the clock of the master station and automatically switch to adapt to topology changes. Thus, on an RPR ring, the clocks on the side toward the master station are locked to a common source.
4.7 Ease of Configuration
On an RPR station on the S9500, traffic transmission is done through a pair of
RPR physical ports. To simplify configuration, an RPR logical interface is used as the equivalent of the pair of RPR physical ports. This logical interface can be configured like a common Ethernet port. You can bring up the service simply by
configuring some VLAN settings on the interface. The two RPR physical ports
are transparent to the service layer. All service-related configurations are done
on the RPR logical interface rather than on the two physical ports respectively.
4.8 RPR Implementation on the S9500
Figure 11 RPR implementation on the S9500
On the S9500, RPR is implemented through an RPR board, which can be GE,
2.5 Gbps, or 10 Gbps. The RPR board exchanges traffic with other boards
through a cross bar. A 10-Gbps RPR board supports 10-Gbps traffic insert/copy
at wire speed and copy of bursty traffic exceeding 10 Gbps. The services
between RPR rings are switched between RPR boards through the cross bar.
Copyright ©2007 Hangzhou H3C Technologies Co., Ltd. All rights reserved.
No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of
Hangzhou H3C Technologies Co., Ltd.
The information in this document is subject to change without notice.
S9500 Active SRPU and Standby SRPU Switchover Technology White Paper
Hangzhou H3C Technologies Co., Ltd. 1/12
H3C S9500 Active SRPU and Standby SRPU
Switchover Technology White Paper
Keywords: Active and standby SRPU switchover, HA, High Availability
Abstract: HA, High Availability, is an indispensable feature of carrier-class devices. Active and
standby SRPU switchover is one of the most important implementations of the HA feature.
This manual introduces the active and standby SRPU switchover implementation on the
S9500 series switches.
Acronyms:
Acronym Full spelling
HA High Availability
Table of Contents
1 Overview .................................................................................................................................. 3
2 Active and Standby SRPU Switchover Mechanism................................................................. 3
2.1 Active and Standby SRPU Switchover Mechanism Overview ...................................... 3
2.1.1 Introduction to the Switchover Process .............................................................. 3
2.1.2 Introduction to the State Machine ....................................................................... 4
2.2 Active and Standby Switchover Triggering Mechanism ................................................ 7
2.2.1 Active and Standby Status Determination........................................................... 7
2.2.2 Active and Standby Switchover Triggering Mechanism...................................... 8
2.3 Registration Mechanism................................................................................................ 9
3 Active and Standby Performance ............................................................................................ 9
3.1 Configuration Layer Active and Standby Performance ................................................. 9
3.2 Protocol Layer Active and Standby Performance ......................................................... 9
3.2.1 Introduction to Graceful Restart ........................................................................ 10
3.2.2 Layer 2 Unicast Forwarding .............................................................................. 11
3.2.3 Layer 2 Multicast Forwarding............................................................................ 11
3.2.4 Layer 3 Unicast Forwarding .............................................................................. 11
3.2.5 Layer 3 Multicast Forwarding............................................................................ 12
3.2.6 MPLS/VPN ........................................................................................................ 12
1 Overview
High Availability (HA) is an indispensable feature of carrier-class devices. The HA feature can be used to achieve a higher degree of system availability. When it detects system faults, HA can quickly and correctly restore normal operation of the system, thus shortening the Mean Time to Repair (MTTR) of the system.
The S9500 series switches support the HA feature; active and standby SRPU switchover, for example, is one of its implementations. Devices supporting HA are generally equipped with two SRPUs, one being the active SRPU and the other the standby SRPU. The active SRPU communicates with the external network to implement the normal functions of each module in the system, while the standby SRPU works as a backup of the active SRPU and does not communicate with the external network. When the active SRPU works abnormally, the system performs a switchover automatically, and the standby SRPU takes over the responsibilities of the active SRPU to ensure normal services.
Refer to H3C S9500 HA Technology White Paper for the introduction to the HA
feature of the S9500 series switches.
2 Active and Standby SRPU Switchover Mechanism
2.1 Active and Standby SRPU Switchover Mechanism Overview
2.1.1 Introduction to the Switchover Process
The switchover process of the active and standby SRPUs involves three phases: backing up data in batches, backing up data in real time, and synchronizing data.
- After the standby SRPU is started, the active SRPU synchronizes the backup data of all modules to the standby SRPU. This process is called backing up data in batches.
- After backup in batches is finished, the system begins the real-time backup process, during which data is backed up to the standby SRPU whenever the backup data on the active SRPU changes.
- After the switchover between the active and standby SRPUs, the standby SRPU becomes the new active SRPU and tells each module to collect and synchronize data from the service boards. This process is called data synchronization. During data synchronization, each module communicates with the service boards to confirm and synchronize hardware status, link layer status, and configuration data, ensuring the consistency of the data and status maintained by the system so that the system works normally after the switchover. Only after data synchronization is completed does the standby SRPU become the active SRPU completely.
2.1.2 Introduction to the State Machine
The active SRPU changes its state in the following order: Wait for standby SRPU insertion, Wait for backup request, Back up data in batches, Back up data in real time, and Synchronize data.
The standby SRPU changes its state in the following order: Be ready, Receive data in batches, and Receive data in real time.
The Synchronize data state lies between the standby state and the active state. In this state, the standby SRPU has become the active SRPU in hardware, but it still needs to collect and synchronize data from the service boards, so it has not yet become the active SRPU completely. The system prompts the following:
System is busy with warm backup, please wait ...
The standby SRPU can completely become the active SRPU only after data synchronization is completed.
The following figure shows the state change of the active and standby SRPUs:
Figure 1 State machine of active and standby SRPU switchover
After being started, the active SRPU is in the Wait for standby SRPU insertion state. With a single SRPU, the state machine stays in this state. The standby SRPU is in the Be ready state after being started. After sending the In position message to the active SRPU, the standby SRPU waits for the state change timer to time out. See Figure 2 for this process.
Upon receiving the In position message from the standby SRPU, the active SRPU changes its state to Wait for backup request.
After the state change timer times out, the standby SRPU sends a Back up data in batches request to the active SRPU and changes its state to Receive data in batches. Upon receiving the request, the active SRPU changes its state to Back up data in batches, begins to collect the data of each module, and synchronizes the data to the standby SRPU.
After the backup of each module on the active SRPU is completed, the active SRPU sends a Backup completed message to the standby SRPU and changes its state to
Back up data in real time. Upon receiving the Backup completed message, the standby SRPU changes its state to Receive data in real time. After entering the real-time backup state, the active and standby SRPUs are in a relatively stable state, as shown in Figure 2.
Figure 2 Active and standby switchover
When detecting that the active SRPU is not in position or is resetting, the standby SRPU becomes the active SRPU. At this time, however, the software on the standby SRPU has not yet met the requirements for becoming the active SRPU. Therefore, the standby SRPU first enters the Synchronize data state, during which the new active SRPU collects and synchronizes data from the service boards. After data synchronization is completed, the new active SRPU becomes the active SRPU completely, and its state machine enters the Wait for standby insertion state. The original active SRPU reboots, becomes the standby SRPU, and enters the Be ready state.
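The transitions above can be summarized as a small event-driven state machine. This is an illustrative model only: the state and event names are paraphrased from the text, and the standby SRPU's self-reset behavior is not modeled.

```python
# Paraphrased standby-side transitions from the description above.
STANDBY_TRANSITIONS = {
    ("be_ready", "timer_expired"): "receive_data_in_batches",
    ("receive_data_in_batches", "backup_completed"): "receive_data_in_real_time",
    ("receive_data_in_real_time", "active_srpu_failed"): "synchronize_data",
    ("synchronize_data", "sync_completed"): "wait_for_standby_insertion",
}

def step(state: str, event: str) -> str:
    """Advance the state machine by one event; unknown (state, event)
    pairs leave the state unchanged (a simplification)."""
    return STANDBY_TRANSITIONS.get((state, event), state)
```

The final transition models the former standby SRPU completing data synchronization and taking over as the active SRPU.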
Caution:
The standby SRPU can only change from the Receive data in real time state to the
Synchronize data state. Before entering the Receive data in real time state, the
standby SRPU resets itself upon detecting that the active SRPU is not in position,
because it has not yet collected all the information of the system.
2.2 Active and Standby Switchover Triggering Mechanism
2.2.1 Active and Standby Status Determination
In case of double SRPUs, whether an SRPU is the active SRPU or the standby SRPU is determined by the hardware during device startup. Generally, the device selects the SRPU with the smaller slot number as the active SRPU. (In case of double SRPUs, the hardware sets a delay time on the SRPU with the bigger slot number to make it start up later.)
Upon initial startup, both SRPUs are in the standby state and perform software startup separately. The SRPU with the smaller slot number sets its state to normal shortly after startup and checks the state of the other SRPU, while the SRPU with the bigger slot number checks whether the other SRPU is normal after a 15-second delay and then sets its own state. Because the SRPU with the smaller slot number becomes normal while the state of the other SRPU is still abnormal, the SRPU with the smaller slot number enters the active state; the SRPU with the bigger slot number, finding the other SRPU normal after its 15-second delay, sets its state to standby. Therefore, with double SRPUs, even if the SRPU with the bigger slot number was the active SRPU before a system reboot, the SRPU with the smaller slot number is the active SRPU after the reboot.
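The slot-number rule reduces to a simple election. A minimal sketch, assuming slot numbers are plain integers; the 15-second startup delay is represented only in a comment.

```python
def elect_active(slot_numbers):
    """Apply the election rule described above: the SRPU in the
    smaller-numbered slot becomes active; the others (delayed at
    startup by the hardware, e.g. by 15 seconds) settle into standby."""
    active = min(slot_numbers)
    standby = sorted(s for s in slot_numbers if s != active)
    return active, standby
```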
Caution:
In case of double SRPUs, you need to ensure that the software and hardware versions
of the active and standby SRPUs are consistent.
2.2.2 Active and Standby Switchover Triggering Mechanism
After entering the Receive data in real time state, the standby SRPU becomes the active SRPU upon detecting a switchover notification. The notification is triggered by a hardware interrupt, and the hardware switchover between the active and standby SRPUs completes in milliseconds. After the hardware switchover, the new active SRPU enters the Synchronize data state.
Active and standby switchover is triggered for the following reasons:
- The active and standby switchover command is executed at the command line.
- The active SRPU works abnormally.
- The active SRPU resets or is removed.
- An abnormal software reboot occurs on the active SRPU, for example, a hardware watchdog reboot because a module occupied the CPU for too long, or a reboot caused by abnormal data access or command access.
Meanwhile, after entering the Back up data in real time state, the active and standby SRPUs periodically send handshake packets.
Figure 3 Active and standby handshake process
Each SRPU sends handshake packets to its peer at an interval of one second.
If the standby SRPU does not receive handshake packets from the active SRPU within 120 seconds, it considers that the connection with the active SRPU has failed and resets itself.
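The handshake supervision amounts to a one-second send timer and a 120-second loss timer. The sketch below uses simulated timestamps rather than real timers, and treating exactly 120 seconds of silence as a failure is an assumption about the boundary case.

```python
HANDSHAKE_INTERVAL_S = 1    # each SRPU sends a handshake every second
HANDSHAKE_TIMEOUT_S = 120   # silence threshold before the standby resets

def standby_should_reset(last_rx_time: float, now: float) -> bool:
    """True when the standby SRPU has heard nothing from the active
    SRPU for the full timeout and should reset itself."""
    return (now - last_rx_time) >= HANDSHAKE_TIMEOUT_S
```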
2.3 Registration Mechanism
The software modules take corresponding actions as the state machine changes, and they register these actions with the system through the registration mechanism. When the state changes, the system invokes the registered actions of the modules in turn, based on their priorities.
The switchover speed of the active and standby state machine depends on the processing time of each module.
For example, if the configuration file is large, the configuration module takes a relatively long time to back up its information in batches, increasing the overall batch backup time.
The data synchronization duration is longer when the device is fully configured than when it is not, because the device management module must collect the board information and port link status information.
3 Active and Standby Performance
3.1 Configuration Layer Active and Standby Performance
Through the batch backup and real-time backup processes, the configuration information on the active SRPU is kept backed up on the standby SRPU. Therefore, at the configuration layer, data is smoothly synchronized during active and standby SRPU switchover.
3.2 Protocol Layer Active and Standby Performance
During the data synchronization process, the forwarding tables on the service boards are not deleted and re-learned, ensuring non-stop forwarding of services. While the SRPU collects and synchronizes data, the original data on
the SRPU remains unchanged, and only the changed data is updated.
3.2.1 Introduction to Graceful Restart
The control software and forwarding software of an S9500 switch are separated from each other. The control plane controls and manages the whole device, discovering routes and delivering them to the interface boards. The forwarding plane is dedicated to data forwarding. Each plane has its own processor, and the two planes are functionally independent. The software on the active SRPU is the control software, which processes user configuration and runs various protocols; for example, it runs routing protocols such as OSPF, IS-IS, and BGP to discover routes and applies these routes to each interface board. The software on each interface board is the forwarding software, which maintains its forwarding table according to the notifications of the active SRPU and forwards data based on the forwarding table.
With this distributed structure, when the control software restarts (because of hardware or software faults) or is reloaded (for a software upgrade), the forwarding services are not interrupted: control software restart or reload does not affect the normal running of the forwarding software. Therefore, as long as the network topology remains stable during the control software restart or reload, data forwarding on the rebooting router is feasible and reliable.
The problem is that each time the control software restarts or is reloaded, all the routing protocols have to restart, the neighbor relationships between the device and its adjacent devices have to be rebuilt, and all the routing information databases have to be re-synchronized. Neighbor relationship interruption triggers route recalculation on the adjacent devices, causing route oscillation and forwarding interruption on the network. To solve this problem, the IETF has proposed a series of enhancements for different routing protocols, such as IS-IS, OSPF, BGP, and LDP. With these enhancements, the original protocol flows are improved. When the control plane restarts on the device, the device notifies its
neighbors to temporarily preserve the routing information and adjacency relationship
with the device. After the protocol restarts, the neighbors will help the restarting
device to update routing information and to restore it to the state prior to the restart in
minimal time. No route flapping occurs during the restart, the packet forwarding path
S9500 Active SRPU and Standby SRPU Switchover Technology White Paper
Hangzhou H3C Technologies Co., Ltd. 11/12
remains the same, and the whole system can forward data continuously. Hence, it is
called “Graceful Restart”, and sometimes called “non-stop forwarding (NSF)”.
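The flow above, seen from a neighbor's side, can be sketched as follows (hypothetical code; the grace period value and data layout are illustrative, not H3C defaults): on a restart notification the neighbor marks the peer's routes stale but keeps forwarding with them, and when the peer re-synchronizes, routes it re-advertises are refreshed while the rest are flushed.

```python
# Hypothetical sketch of Graceful Restart from a helper neighbor's view.

GRACE_PERIOD = 120  # seconds; illustrative value only

class Neighbor:
    def __init__(self):
        self.routes = {}                  # prefix -> {"stale": bool}

    def learn(self, prefix):
        self.routes[prefix] = {"stale": False}

    def peer_restarting(self):
        """Restart notification: preserve routes instead of withdrawing them."""
        for info in self.routes.values():
            info["stale"] = True          # no route recalculation triggered

    def peer_resynced(self, readvertised):
        """Peer is back: refresh re-advertised routes, flush the rest."""
        for prefix in readvertised:
            self.routes[prefix]["stale"] = False
        self.routes = {p: i for p, i in self.routes.items()
                       if not i["stale"]}  # grace timer expiry

n = Neighbor()
n.learn("10.1.0.0/16")
n.learn("10.2.0.0/16")
n.peer_restarting()
assert len(n.routes) == 2                 # still forwarding on both routes
n.peer_resynced(["10.1.0.0/16"])          # only one route re-advertised
assert list(n.routes) == ["10.1.0.0/16"]  # the other is flushed
```

This captures the essential GR contract: the neighbor's forwarding state is preserved across the peer's restart, and only routes that fail to come back within the grace period are removed.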
3.2.2 Layer 2 Unicast Forwarding
In Layer 2 unicast forwarding, the MAC address table concerned resides on the
service board. When an active/standby SRPU switchover starts, the new active SRPU
collects and synchronizes MAC address information from the service board. Because
the original MAC address table on the service board is not deleted during this
process, normal Layer 2 unicast data forwarding on the service board is ensured.
Caution:
For cross-board or cross-chip Layer 2 forwarding, temporary packet loss may occur during data synchronization because the routing table entries on the standby SRPU have not yet been correctly updated.
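The switchover-time synchronization described above can be sketched as a delta merge (hypothetical code): the new active SRPU rebuilds its view from the service board's MAC table, only entries that actually differ are rewritten, and the board's own table is never flushed, so unicast forwarding is undisturbed throughout.

```python
# Hypothetical sketch: the new active SRPU merges the service board's
# MAC table into its own view without ever deleting board entries.

def synchronize(board_macs, srpu_macs):
    """Merge board entries into the SRPU view; return the entries rewritten."""
    changed = []
    for mac, port in board_macs.items():
        if srpu_macs.get(mac) != port:    # update only what differs
            srpu_macs[mac] = port
            changed.append(mac)
    return changed

board = {"0001-0001-0001": 1, "0002-0002-0002": 5}   # board table intact
new_active = {"0001-0001-0001": 1}                   # partial view after switchover
changed = synchronize(board, new_active)
assert changed == ["0002-0002-0002"]      # only the delta was written
assert new_active == board                # SRPU view converges to the board
```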
3.2.3 Layer 2 Multicast Forwarding
For Layer 2 multicast forwarding, the multicast MAC address entries needed for
forwarding are saved on the service board. When an active/standby SRPU switchover
is performed, the original multicast MAC address entries on the service board
remain unchanged while the SRPUs collect data from the interface boards, so Layer
2 multicast streams continue to be forwarded.
Caution:
During the data synchronization process, packet loss may occur because the multicast protocol will update the entries.
3.2.4 Layer 3 Unicast Forwarding
For Layer 3 unicast forwarding, data forwarding is implemented through ARP and the
FIB. During the data synchronization between the active and standby SRPUs, the
original ARP and FIB entries remain unchanged. Meanwhile, because routing
protocols such as OSPF, BGP, and IS-IS support the GR function, the adjacent
neighbors keep their neighbor relationships and routing tables unchanged, so the
forwarding paths of IP packets in the network remain the same, ensuring continuity
of Layer 3 unicast forwarding services.
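The two lookups named above can be sketched briefly (hypothetical code and addresses): the FIB maps a destination to a next hop via longest prefix match, and the ARP table maps that next hop to a MAC address. Both tables live on the interface board and survive an SRPU switchover unchanged.

```python
# Hypothetical sketch of Layer 3 unicast forwarding via FIB + ARP.
import ipaddress

FIB = {"10.0.0.0/8": "192.168.1.1", "10.1.0.0/16": "192.168.1.2"}
ARP = {"192.168.1.1": "000f-e200-0001", "192.168.1.2": "000f-e200-0002"}

def next_hop(dst):
    """Longest prefix match over the FIB."""
    best = max((n for n in FIB
                if ipaddress.ip_address(dst) in ipaddress.ip_network(n)),
               key=lambda n: ipaddress.ip_network(n).prefixlen,
               default=None)
    return FIB.get(best)

hop = next_hop("10.1.2.3")
assert hop == "192.168.1.2"               # the more specific /16 wins
assert ARP[hop] == "000f-e200-0002"       # MAC used to rewrite the frame
```

Because both dictionaries represent state held on the interface board, a control-plane restart on the SRPU leaves every lookup here unaffected, which is the point the section makes.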
3.2.5 Layer 3 Multicast Forwarding
For Layer 3 multicast forwarding, the original multicast streams continue to be
forwarded during data synchronization because the multicast entries already
established on the interface boards remain unchanged.
Caution:
During the data synchronization process, packet loss may occur because the multicast protocol will update the entries.
3.2.6 MPLS/VPN
For MPLS/VPN services, the original MPLS/VPN service streams continue to be
forwarded because the original MPLS/VPN table entries on the interface boards
remain unchanged during data synchronization.
Copyright ©2007 Hangzhou H3C Technologies Co., Ltd. All rights reserved.
No part of this manual may be reproduced or transmitted in any form or by any means without prior written consent of
Hangzhou H3C Technologies Co., Ltd.
The information in this document is subject to change without notice.