Bridges, Switches, Routers - University of California, Berkeley

Chapter 1

Bridges, Switches, Routers

1.1 Introduction

• Packet vs circuit (and virtual circuit) switching

• Network: mesh interconnection of links and switches

– LANs (multiaccess, broadcast or shared medium; Ethernet: 10BT to 1000BT, Cat 3 UTP)

– WANs: switches connected by point-to-point links

• Packet processors: bridges, routers, ATM switches


[Figure 1.1 residue: control path functions (routing, congestion control, reservation) and data path functions with per-packet processing (switching, policing, scheduling).]

Figure 1.1: Packet processor functions may involve the data path or control path.

1.2 Packet processor functions

Routing: creating and distributing information that defines paths between source and destination, and determining the best path

Switching: per-packet forwarding decisions, and sending the packet towards its destination

Other functions: congestion control, reservations, policing, scheduling

Control functions are performed infrequently; datapath functions are performed per packet.


[Figure 1.2 residue: bridged extended LAN with segments L1 to L5 and bridges B1 to B4, and the corresponding graph with link costs 10 and 20. Legend: R = root port for bridge, D = designated port for LAN.]

STP steps shown in the figure:

1. Determine the root bridge, and set its ports in forwarding mode.

2. Each bridge determines its root port, and sets it in forwarding mode.

3. Bridges determine the designated port for each LAN segment.

4. All other ports are in blocked state.

Figure 1.2: Bridged extended LAN and corresponding graph. Bridge forwards frames along spanning tree, according to FDB.

1.3 Transparent bridging IEEE 802.1D

Ethernet LANs broadcast each packet to every device on the LAN. The throughput per host decreases with the number of hosts connected to the LAN. See Problem 1.

Transparent bridging prevents this by interconnecting LAN segments (collision domains) and forwarding unicast packets according to a filtering database (FDB). Broadcast, multicast, and unknown unicast packets are flooded to all LANs, so all segments form a single broadcast domain.

A bridge has two or more ports. Packets from incoming ports are forwarded to outgoing ports according to the FDB, along a spanning tree to prevent loops. See Figure 1.2.

• spanning tree algorithm (STA): one root, then shortest path to root;

• learning process: produces the FDB by relating MAC source address to incoming port and removing unrefreshed entries.

Bridges exchange configuration messages to establish topology, and topology-change messages to indicate that the STA should be rerun.

With a fixed number of bridge ports, throughput per LAN segment decreases with the number of segments in an extended LAN. See Problem 2.
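The learning process and the forward/flood decision can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `LearningBridge`, the `max_age` parameter and its default value are assumptions made here, and the spanning tree is assumed to have been computed already, so that `ports` holds only ports in forwarding state.

```python
import time


class LearningBridge:
    """Toy transparent bridge: learns source MACs, then forwards or floods."""

    def __init__(self, ports, max_age=300.0):
        self.ports = set(ports)  # ports in forwarding state (STA assumed done)
        self.max_age = max_age   # hypothetical aging time, in seconds
        self.fdb = {}            # MAC address -> (port, time last seen)

    def receive(self, src, dst, in_port, now=None):
        """Return the set of ports on which the frame is sent out."""
        now = time.monotonic() if now is None else now
        # Learning: relate the source MAC to the incoming port.
        self.fdb[src] = (in_port, now)
        # Aging: remove entries that have not been refreshed recently.
        self.fdb = {mac: (port, seen) for mac, (port, seen) in self.fdb.items()
                    if now - seen <= self.max_age}
        entry = self.fdb.get(dst)
        if entry is None:
            # Unknown unicast (or broadcast): flood to all other ports.
            return self.ports - {in_port}
        out_port = entry[0]
        # Filter frames whose destination lies on the arrival segment.
        return set() if out_port == in_port else {out_port}
```

A frame to an unknown destination is flooded; once the destination has been seen as a source, later frames go out on a single port or are filtered.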


Figure 1.3: LAN vs VLAN topology

Figure 1.4: VLAN tags

1.4 LAN switches IEEE 802.1Q

A LAN switch is a bridge with as many ports as the number of LAN segments, and with enough capacity to handle traffic on all segments. Problem 2 is solved through VLANs.

A virtual LAN (VLAN) is a collection of LAN segments and attached devices with the properties of an independent LAN. Each VLAN is a separate broadcast domain: traffic on one VLAN is restricted from going to another VLAN. Traffic between VLANs goes through a router.

VLAN tags (4 bytes, carrying the VLAN identifier or VID) are added to MAC frames so switches can forward packets to ports with the same VID. The FDB is augmented to include for each VID the ports (the member set) through which members of that VLAN can be reached.


The member set is derived from VLAN registration information: (i) explicitly by management action, or (ii) by the GARP VLAN registration protocol (GVRP). GARP is the generic attribute registration protocol.

Multicast filtering. A VLAN is a single broadcast domain. If multicast messages are broadcast, the throughput is limited by the slowest link: a switch with 124 10-Mbps ports has a capacity of 1.24 Gbps but can transmit at most six 1.5-Mbps multicast video channels. GARP Multicast Registration Protocol (GMRP) (IEEE 802.1P) allows switches to limit multicast traffic along the spanning tree. (See IGMP.)

• JOIN: host sends this message to express interest in joining a multicast group. The switch adds the port to the multicast group and forwards traffic from the multicast source to these ports. JOIN messages are sent once every JOINTIME timeout.

• LEAVE: message sent by host. The switch removes this port from the multicast group unless another host on that port sends a JOIN message before the LEAVETIME timeout.

• LEAVEALL: message periodically sent by the switch.

When a host sends IP data to a multicast (Class D IP) address, the host inserts the low-order 23 bits of the IP address in the low-order 23 bits of the MAC address. So a NIC that is not part of the group ignores these data.
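As a small illustration of this mapping, the sketch below builds the Ethernet multicast address from a Class D IP address. The fixed high-order prefix 01:00:5e is the standard IPv4 multicast MAC prefix (it is not stated in these notes), and the function name `multicast_mac` is made up for the example.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast (Class D) address to its Ethernet MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a Class D (multicast) address")
    low23 = int(addr) & 0x7FFFFF      # keep the low-order 23 bits of the IP address
    mac = 0x01005E000000 | low23      # place them under the 01:00:5e prefix
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))


print(multicast_mac("224.0.0.1"))      # 01:00:5e:00:00:01
print(multicast_mac("224.128.0.1"))    # same MAC: only 23 of the 28 group bits are copied
```

Because only 23 bits are copied, several IP groups share one MAC address, so the NIC filter is a first pass and the IP layer still has to check the group.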


Figure 1.5: The IP header provides precedence and type of service fields

Quality of service. The 3-bit precedence field allows 8 priority levels. The ToS bits are D (minimize delay), T (maximize throughput), R (maximize reliability), C (minimize cost).

802.1 provides no support for priority. 802.1P provides in-band QoS signalling with 8 CoS levels. A conforming bridge or switch maintains 8 queues. (VLAN tags may also carry priority information.)


1.5 Problems

1. “The throughput per host decreases with number of hosts connected to the LAN.” Formulate two mathematical models, one deterministic and one stochastic, in which this quote is an assertion. Then prove or disprove the assertion. You will have to model LAN speed, host load and throughput.

Hint: Use the M/M/1 model of Section 3.3.

2. Follow Figure 1.2 and propose a graph model for an extended bridged LAN in which bridges may have multiple ports.

(a) Use the graph to formulate two mathematical models, one deterministic and one stochastic, within which one can determine the throughput per LAN segment.

(b) How would you formulate as a mathematical assertion the statement “the throughput per LAN segment increases with the number of ports in the bridge”?

Hint: try the Jackson network model of Section 3.3.

3. Discuss the differences between STP and OSPF in terms of throughput or efficiency in link utilization.

Chapter 2

Processor architecture

2.1 Datapaths

When a packet arrives at a bridge,

• DA is searched in the forwarding table (DA → output ports). If not found, the packet is broadcast to all output ports;

• If found, it is forwarded across the switching fabric to the appropriate output port (or ports for multicast);

• SA is learned and added to the forwarding table;

• During transfer to the fabric the packet may be stored, or dropped if storage is full;

• The packet is stored in the output port queue (usually FIFO) and eventually transmitted.

When a packet arrives at a router,

• DA is searched in the forwarding table. If not found, the packet is dropped;

• If found, the next-hop MAC address is appended, TTL is decremented, a new Header Checksum is calculated, and the packet is forwarded across the switching fabric to the output port or ports;

• During transfer to the fabric the packet may be stored: if storage is full, this (or another) packet may be dropped;

• The packet is stored in the output queue (FIFO or more complex) and eventually transmitted.

When a cell arrives at an ATM switch,

• Its VCI is searched in the forwarding table (VC translation table: (VCI in, Port in) → (VCI out, Port out)). If not found, the cell is dropped;

• If the VCI is policed, the policing function determines if the cell is conformant. If not, it may be dropped. If yes, the cell is forwarded across the switching fabric to the output port;

• During transfer, the cell may be stored: if storage is full, this or another cell may be dropped;

• The cell is stored in the output queue and eventually transmitted. The service discipline may be FIFO or very elaborate.

[Figure 2.1 residue: four panels A, B, C, D. Components shown: CPU, memory, line cards (with a per-card CPU and memory in some panels), and the path taken by a packet.]

Figure 2.1: Basic packet processor architecture

• Throughput in A is limited by CPU speed;

• In B, there is a choice about which CPU to forward the packet to;

• In C, the packet travels the bus only once, so throughput is limited by bus speed;

• In D, several packets can be forwarded through the crossbar at the same time.

General-purpose CPUs are not well suited for applications in which packets flow through. CPUs are better when the same data are examined several times, making use of the cache.


[Figure 2.2 residue: the control path and data path of Figure 1.1, with the per-packet data path expanded into policing, forwarding decision, switching fabric, and scheduling.]

Figure 2.2: Elaboration of datapath functions

2.2 Performance

The packet delay through the switch fabric consists of the time (1) for the forwarding decision, and (2) to transfer the packet across the switch.

Packet delay through the processor consists of the time (1) for the policing decision, (2) for the forwarding decision, (3) to transfer across the switch, and (4) for the output scheduling decision.


[Figure 2.3 residue: timeline showing header arrival time, forwarding decision time, switch transfer time, and output scheduling decision time, with packet size, minimum back-to-back packet size, and packet arrival rate marked.]

Figure 2.3: Delay of switch and packet processor

2.3 Forwarding decision

Criteria: (1) speed of address lookup, which depends on the number of memory references; (2) size of memory.

ATM switches perform direct lookup, Figure 2.4.

The VCI address space is $2^{24}$ = 16 M. Most switches contain $2^{16}$ or fewer entries, since it is the downstream switch that chooses a VCI that fits in the supported address space (PNNI).

For multicast, lookup returns a list of output ports, each with a different VCI.

[Figure 2.4 residue: DRAM addressed by the VCI; the data returned is (port, new VCI).]

Figure 2.4: ATM switches perform direct lookup


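[Figure 2.5 residue: table of network addresses and associated data; input is a 48-bit net address; outputs are the associated data, a hit signal, and the location of the entry ($\log_2 N$ bits for a size-$N$ memory).]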

Figure 2.5: CAM or content addressable memory. The 48-bit MAC address is presented. A successful parallel search asserts the “hit” signal and returns a pointer to the entry where forwarding information for the MAC address is stored.

Bridge. The address space is $2^{48}$, so direct lookup is not possible. Three indirect lookup techniques:

Associative memory. Figure 2.5. Typical CAM size is $N = 1024$ entries. Not suitable for large LANs which support $2^{16} \approx 64{,}000$ entries.


[Figure 2.6 residue: a hashing function maps the 48-bit address to a 16-bit DRAM address; the data returned is the address of one of $N$ linked lists, which together hold $M$ addresses.]

Figure 2.6: A 48-bit address is presented and the hashing function returns a pointer to one of $N$ linked lists. The search through a linked list takes a random time proportional to the length of the list.

Hashing. For large LANs hashing is an option. Suppose the LAN has $M$ hosts. A hashing function, $h$, maps a host's 48-bit address to a forwarding table with, say, $N = 2^{16}$ entries as in Figure 2.6.

Two addresses $x, y$ may collide: $h(x) = h(y)$. The entry points to a linked list of (MAC address, forwarding data) pairs for the MAC addresses that map into the same entry. The list must be searched sequentially to locate the MAC address. The duration of the search is proportional to the length of the list. Suppose $h$ maps the $M$ MAC addresses $x_1, \dots, x_M$ into the $N$ linked lists $j = 1, \dots, N$. Assume that $h(x_1), \dots, h(x_M)$ are independent and uniformly distributed over $\{1, \dots, N\}$.

The length of the $j$th list is the random number

$$n_j = \sum_{i=1}^{M} 1(h(x_i) = j), \quad j = 1, \dots, N. \tag{2.1}$$

Let $\alpha = M/N$. If $\alpha$ is small (number of lists larger than number of addresses), the lists will usually have 0 or 1 element. Problem 3 asks to find the distribution of $n_j$. For $N \gg M$ ($\alpha \ll 1$), the mean length of the list is about $0.5(1 + \alpha)$. However, $n_j$ being random, there is a chance that some lists (and the corresponding search time) may be very large. For real-time applications, you may store forwarding tables in such a way (e.g. as trees) that retrieval has a deterministic bound.
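To get a feel for how uneven the list lengths can be, here is a small simulation sketch under the stated assumption that hash values are independent and uniform. The function name `chain_lengths`, the seed, and the table size used in the example are choices made for illustration only.

```python
import random
from collections import Counter


def chain_lengths(M, N, seed=0):
    """Throw M addresses uniformly at random into N lists; return the list lengths n_j."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(N) for _ in range(M))
    return [counts.get(j, 0) for j in range(N)]


lengths = chain_lengths(M=2**12, N=2**12)
mean_len = sum(lengths) / len(lengths)   # always equals M/N
max_len = max(lengths)                   # the worst-case search length
histogram = Counter(lengths)             # number of lists of each length
print(mean_len, max_len, sorted(histogram.items()))
```

The mean length is fixed at M/N, but the maximum is typically several times larger, which is the motivation for structures with a deterministic search bound.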


Prefix             Outgoing port

128.32.0.0/16      1
128.32.239.0/24    7
128.32.239.3/32    3

Figure 2.7: Forwarding table with CIDR

IP routers. With CIDR, router forwarding table entries are identified by a pair, (route prefix/prefix length), with prefix length between 0 and 32 bits. See Figure 2.7. The entry 128.32.0.0/16 has a 16-bit prefix.

The forwarding decision must find the longest prefix match between the packet's destination IP address and the prefixes in the forwarding table.

CIDR reduces the table size, but the forwarding decision is more complex. See [9].

With declining memory cost, it may be more economical to expand the prefixes and use simpler, exact matching algorithms.
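A minimal sketch of the longest prefix match on the table of Figure 2.7, using a plain linear scan (real routers use tries or similar structures; the scan is only meant to make the rule concrete). The names `TABLE` and `longest_prefix_match` are made up for this example.

```python
import ipaddress

# Forwarding table of Figure 2.7: (prefix, outgoing port).
TABLE = [
    ("128.32.0.0/16", 1),
    ("128.32.239.0/24", 7),
    ("128.32.239.3/32", 3),
]


def longest_prefix_match(dst, table=TABLE):
    """Return the port of the longest matching prefix, or None if nothing matches."""
    addr = ipaddress.ip_address(dst)
    best = None  # (prefix length, port) of the best match so far
    for prefix, port in table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, port)
    return None if best is None else best[1]


print(longest_prefix_match("128.32.239.3"))   # 3 (matches all three, /32 wins)
print(longest_prefix_match("128.32.239.9"))   # 7
print(longest_prefix_match("128.32.1.1"))     # 1
```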


Caching. The forwarding decision delay can be reduced by caching. The idea is that the IP destination addresses of successive packets are correlated.

The cache stores the full source and destination IP address and the corresponding forwarding decision (including perhaps the entire replacement IP header).

When a packet arrives, SA and DA are used to do a full match in the local cache. If the addresses are not there, the packet is forwarded to a central routing processor. A cache replacement rule is needed if there is a cache miss.

The improvement in delay depends on (1) the ratio of cache size to the size of the forwarding table, and (2) the temporal locality. The latter is likely to be higher in a campus router than in an edge router, and larger there than in a core router. See Problem 4.

Multicast. Some routers support multicast. The simplest rule is RPF (reverse-path forwarding): if a multicast packet arrives on port $P$ from source $S$, look up $S$ in the forwarding table. If $P$ is the best port to reach $S$, forward the packet on all ports except $P$.

Switching fabrics. Need some queuing models.


2.4 Problems

1. For a commercial LAN switch, find the various times in Figure 2.3. Also give the throughput. See, for example, www.bcr.com/bcrmag/08/98p25.htm

2. If forwarding decision, switch transfer, and output scheduling can be pipelined, what is the throughput of the processor?

3. Find the (marginal) distribution of the $n_j$ given in (2.1), and calculate the mean length $E n_j$ of a list. Show that for $\alpha \ll 1$ small, the mean is approximately $0.5(1 + \alpha)$.

Find the joint distribution $p(n_1, \dots, n_N)$. Verify that it has the product form:

$$p(n_1, \dots, n_N) = \frac{\prod_{j=1}^{N} p(n_j)}{\sum_{n \in A} \prod_{j=1}^{N} p(n_j)}.$$

Here $A = \{n \mid \sum_j n_j = M\}$, so the denominator is the normalizing constant.

Take $M = N = 2^{16}$. Find the probability that $n_j > 1000$.

Suppose a memory access takes 100 ns, $\alpha = 1$. Consider back-to-back Ethernet packets. What is the average throughput of this switch using the model of Figure 2.3 and ignoring the output scheduling decision delay?

4. The packets arriving at a line card belong to several multiplexed TCP connections.

(a) Formulate a model of packet arrivals with, say, $M$ simultaneous connections and in which connections last a random amount of time with a geometric distribution and mean $T$.

(b) Suppose the size of the cache is $N$. If there is a cache miss, an existing entry is replaced by the missing entry. How would you calculate the hit ratio as a function of $M, N, T$?

(c) Suppose you are given a ‘typical’ trace of the addresses of packet arrivals, but no model of the arrival process. You want to know how big a cache you would need so that the hit ratio is a certain value, say 0.9. What would you do?

(d) The time to search a cache is $T_c$, the time to search the central forwarding table is $T_f$, the hit ratio is $\eta$. How would you decide if it's worth having a cache?

Chapter 3

Queuing

3.1 Discrete time Markov chains

$x = \{x_n, n \ge 0\}$ is a Markov chain with $x_n \in X$ finite or countable, stationary transition probability matrix $P(i,j)$, $i, j \in X$, and initial distribution $\pi_0(i)$, $i \in X$.

So

$$P(x_0 = i_0, \dots, x_n = i_n) = \pi_0(i_0) P(i_0, i_1) \times \cdots \times P(i_{n-1}, i_n) \tag{3.1}$$

for all $n \ge 0$; $i_0, \dots, i_n \in X$.

$\pi_n$ is the marginal distribution of $x_n$ written as a row vector. From (3.1)

$$\pi_n = \pi_0 P^n. \tag{3.2}$$

$\pi$ is invariant if it satisfies the balance equations

$$\pi = \pi P. \tag{3.3}$$

$x$ is irreducible if it goes from any state $i$ to any other state $j$ (with positive probability). Irreducible chains have at most one invariant distribution. The chain is positive recurrent if it has one invariant distribution.

If $x$ is irreducible,

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N} 1(x_n = i) = \pi_i, \quad \text{a.s.}, \ i \in X, \tag{3.4}$$

i.e. $\pi_i$ is the fraction of time $x$ spends in state $i$.

$x$ is aperiodic if $d = 1$, where

$$d = \gcd\{n \ge 1 \mid P^n(i,i) > 0\}, \quad i \in X.$$

If $d > 1$, $x$ is periodic with period $d$.


If $x$ is aperiodic and irreducible, with invariant distribution $\pi$, then for any initial distribution,

$$\lim_{n \to \infty} \pi_n = \pi. \tag{3.5}$$

See Problems 1, 2.
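The convergence (3.5) can be checked numerically by iterating $\pi_{n+1} = \pi_n P$, which is (3.2) applied one step at a time. The two-state matrix below is an arbitrary example chosen here (irreducible and aperiodic); its invariant distribution is $(5/6, 1/6)$ by the balance equations (3.3).

```python
def step(pi, P):
    """One step of (3.2): return the row vector pi P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def invariant(P, pi0, iters=1000):
    """Approximate the invariant distribution by iterating pi_{n+1} = pi_n P."""
    pi = list(pi0)
    for _ in range(iters):
        pi = step(pi, P)
    return pi


# Illustrative two-state chain: balance gives pi(0) * 0.1 = pi(1) * 0.5.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = invariant(P, [1.0, 0.0])
print(pi)   # close to [5/6, 1/6], whatever the initial distribution
```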

Theorem. Suppose $x$ is irreducible and $V : X \to [0, \infty)$. The drift of $V$ at $i$ is

$$\Delta(i) = E\{V(x_{n+1}) - V(x_n) \mid x_n = i\}.$$

Suppose $S$ is a finite subset of $X$ and there are constants $D > 0$, $A < \infty$ so that

$$\Delta(x) < -D, \ x \notin S; \qquad \Delta(x) < A, \ x \in X.$$

Then $x$ is positive recurrent. See Problem 4.


3.2 Continuous-time Markov chains

A random variable $\tau$ is exponentially distributed with rate $\lambda$ if

$$P(\tau > t) = e^{-\lambda t}, \quad t \ge 0.$$

Its mean is

$$E(\tau) = \frac{1}{\lambda},$$

and it is memoryless, $P[\tau > t + s \mid \tau > s] = P(\tau > t)$; $s, t \ge 0$.

A rate matrix $Q = \{q(i,j)\}$ on a countable set $X$ satisfies

$$0 \le q(i,j) < \infty, \quad i \ne j$$

$$-q(i,i) = q(i) := \sum_{j \ne i} q(i,j) < \infty, \quad i \in X.$$


[Figure 3.1 residue: sample path of $x_t$ plotted against $t$.]

Figure 3.1: Constructing a continuous-time Markov chain

Given rate matrix $Q$ and distribution $\pi_0$ on $X$, construct $x = \{x_t, t \ge 0\}$ thus:

1. Select $x_0 = i$ with $P(x_0 = i) = \pi_0(i)$.

2. If $x_0 = i$, select $\tau$ exponential with rate $q(i)$. Let

$$x_t = i, \quad 0 \le t < \tau.$$

3. At $t = \tau$, $x$ takes a jump from $i$ to $j$, independently of $\tau$ and according to

$$P[x_\tau = j \mid x_0 = i, \tau] = \rho(i,j) := \frac{q(i,j)}{q(i)}, \quad i \ne j.$$

4. Return to step 2 with $x_\tau = j$, independently of the process before $\tau$.

Then $x$ is a Markov process with right-continuous sample paths. Figure 3.1 shows a sample path.

$Q$ is regular if, with $\tau_n$ the successive holding times,

$$\sum_{n=0}^{\infty} \tau_n = \infty \quad \text{a.s.}$$

Note,

$$P(x_0 = i, x_t = j) = \pi_0(i) q(i,j) t + o(t), \quad i \ne j$$

$$P(x_0 = i, x_t = i) = \pi_0(i)[1 - q(i) t] + o(t).$$
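The construction translates almost line by line into a simulation: draw an exponential holding time with rate $q(i)$, then jump to $j$ with probability $q(i,j)/q(i)$. The sketch below assumes the rate matrix is given as a nested dictionary of off-diagonal rates; the function name `simulate_ctmc` and this data layout are choices made for the example.

```python
import random


def simulate_ctmc(Q, x0, horizon, seed=0):
    """Simulate a continuous-time Markov chain up to time `horizon`.

    Q maps each state i to a dict {j: q(i, j)} of off-diagonal rates.
    Returns the sample path as a list of (jump time, new state) pairs.
    """
    rng = random.Random(seed)
    t, x = 0.0, x0
    path = [(t, x)]
    while True:
        rates = Q.get(x, {})
        q_i = sum(rates.values())          # total rate q(i) out of state i
        if q_i == 0:
            break                          # absorbing state: no more jumps
        t += rng.expovariate(q_i)          # holding time, exponential with rate q(i)
        if t >= horizon:
            break
        states = list(rates)
        # Jump to j with probability q(i, j) / q(i).
        x = rng.choices(states, weights=[rates[s] for s in states])[0]
        path.append((t, x))
    return path


# Two-state example: rate 1 from state 0 to 1, rate 2 from state 1 to 0.
path = simulate_ctmc({0: {1: 1.0}, 1: {0: 2.0}}, x0=0, horizon=50.0)
```

The returned path is right-continuous by construction: the state recorded at a jump time is the state just after the jump.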


[Figure 3.2 residue: trajectory of $x_t$ against $t$ passing through the sets $S_1, S_2, S_3$ at times $t_1, t_2, t_3$.]

Figure 3.2: A trajectory in $A$ of Theorem

Theorem (Markov property). For any set $A$ of trajectories

$$P[(x_s, s \ge t) \in A \mid x_t = i, x_u, u < t] = P[(x_s, s \ge 0) \in A \mid x_0 = i].$$

Such $A$ is of the form $A = \{x \mid x_{t_k} \in S_k, k = 1, \dots, K\}$,

$0 \le t_1 < \cdots < t_K$, $S_k \subset X$, $K < \infty$. See Figure 3.2.


$Q$ is irreducible if $q(i) > 0$ for all $i \in X$ and if $\rho$ is irreducible, where

$$\rho(i,j) = \begin{cases} \dfrac{q(i,j)}{q(i)}, & i \ne j \\ 0, & i = j \end{cases}$$

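Theorem. Suppose $x$ is a c-t Markov chain with rate $Q$ and initial distribution $\pi$. Then

1. $\pi$ is invariant, $P(x_t = i) = \pi(i)$, $t \ge 0$, $i \in X$, iff it satisfies the balance equation

$$\sum_{i \in X} \pi(i) q(i,j) = 0, \quad j \in X. \tag{3.6}$$

2. $x$ has at most one invariant distribution $\pi$ and then

$$\lim_{t \to \infty} P(x_t = i) = \pi(i), \quad i \in X;$$

$$\lim_{T \to \infty} \frac{1}{T} \int_0^T 1(x_s = i)\,ds = \pi(i), \quad i \in X.$$

3. If $x$ has no invariant distribution,

$$\lim_{t \to \infty} P(x_t = i) = 0, \quad i \in X;$$

$$\lim_{T \to \infty} \frac{1}{T} \int_0^T 1(x_s = i)\,ds = 0, \quad i \in X.$$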


Theorem (Time reversal). Suppose $x$ is stationary, c-t, Markov with rate $Q$, distribution $\pi$. The time-reversed process

$$\tilde{x} = \{\tilde{x}_t := x_{T-t}, \ 0 \le t \le T\}$$

is stationary, Markov, with distribution $\pi$ and rate $\tilde{Q}$ where

$$\tilde{q}(i,j) = \frac{\pi(j) q(j,i)}{\pi(i)}, \quad i, j \in X.$$

Why?

$P(x_0 = i, x_t = j) = \pi(i) q(i,j) t + o(t)$, $i \ne j$, and

$$\begin{aligned} P(\tilde{x}_0 = i, \tilde{x}_t = j) &= \pi(i) \tilde{q}(i,j) t + o(t) \\ &= P(x_T = i, x_{T-t} = j) \\ &= P(x_t = i, x_0 = j) = P(x_0 = j, x_t = i) \\ &= \pi(j) q(j,i) t + o(t). \end{aligned}$$


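[Figure 3.3 residue: state transition diagram on states 0, 1, 2, 3, ... with service rate $\mu$ on the downward transitions, and a sample path of $x_t$ against $t$ with arrivals and departures marked.]

Figure 3.3: Diagrams for M/M/1 system. Arrivals (blue) and departures (red) form Poisson processes.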

3.3 M/M/1 model

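See Figure 3.3. The balance equation (3.6) is

$$\pi(0)\lambda = \pi(1)\mu$$

$$\pi(n)(\lambda + \mu) = \pi(n-1)\lambda + \pi(n+1)\mu, \quad n \ge 1$$

which has a (unique) solution iff $\lambda < \mu$:

$$\pi(n) = (1 - \rho)\rho^n, \quad n \ge 0, \quad \text{with } \rho := \frac{\lambda}{\mu}. \tag{3.7}$$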


The queue $x_t$ is time-reversible, because

$$\pi(i) q(i,j) = \pi(j) q(j,i), \quad i, j \ge 0,$$

so the rate matrix of the time-reversed process, $x_{T-t}$, is the same as that of $x_t$.

So the departures before time $t$ form a Poisson process with rate $\lambda$, independent of $x_t$. Surprise!

The mean queue length is

$$E(x_t) = \sum_{n=0}^{\infty} n\pi(n) = \sum_{n=0}^{\infty} n(1 - \rho)\rho^n = \frac{\rho}{1 - \rho} = \frac{\lambda}{\mu - \lambda}.$$

For $\rho = 0.9$, the mean is 9 packets.

Above,

$\lambda$ = av. number of packet arrivals per sec (Poisson arrivals)

$\mu$ = av. number of packets that can be transmitted per sec

$\rho$ = av. utilization $= P(x_t > 0) = \lambda/\mu$.


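A packet arriving at time $t$ sees $x_t$ packets in queue with

$$\begin{aligned} P[x_t = n \mid \text{packet arrives in } (t, t+\epsilon)] &= \frac{P[\text{packet arrives in } (t, t+\epsilon) \mid x_t = n]\, P(x_t = n)}{P(\text{packet arrives in } (t, t+\epsilon))} \\ &= \frac{\lambda\epsilon\, \pi(n)}{\lambda\epsilon} = \pi(n) \end{aligned}$$

so the average time between a packet's arrival and its departure (including packet service or transmission time) is

$$T = \sum_{n=0}^{\infty} \frac{1}{\mu}(n+1)\pi(n) = \sum_{n=0}^{\infty} \frac{n+1}{\mu}(1 - \rho)\rho^n = \frac{1}{\mu - \lambda}.$$

Alternatively, $T = \dfrac{1 + E x_t}{\mu} = \dfrac{1}{\mu - \lambda}$.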

Example. Consider a 10 Gbps link. Packet lengths are exponentially distributed with mean length 10,000 bits (see footnote 1). So $\mu = 10^{10} \times 10^{-4} = 10^6$ packets/s and $\mu^{-1} = 1\ \mu$s per packet.

Link utilization is 90 percent, i.e. $\rho = 0.9$. Then the average number of packets in buffer is $\rho(1 - \rho)^{-1} = 9$. The average delay faced by a packet including its own service (transmission) time is 10 $\mu$s.

If the packet goes through 10 nodes the average delay is 100 $\mu$s (assuming independence of nodes).

For a 100 Mbps link, with the same packet length distribution, $\rho = 0.9$, $\mu^{-1} = 10^4 \times 10^{-8}$ s $= 100\ \mu$s/packet, and the average delay is 1000 $\mu$s per link.

The probability of 100 or more packets in buffer is

$$\sum_{n \ge 100} \pi(n) = \sum_{n \ge 100} (1 - \rho)\rho^n = \rho^{100} = 0.9^{100} \approx 2.7 \times 10^{-5}.$$

Compare queuing delay with the propagation delay of $3{,}000 \times 5\ \mu$s/km = 15 ms for a 3,000 km link. The possible number of bits in the 3,000 km, 10 Gbps link is $15 \times 10^{-3} \times 10^{10} = 150 \times 10^6$.

Footnote 1: What is a more realistic distribution?
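The numbers in this example are easy to reproduce. The helper below is a sketch written for these notes (the names `mm1` and `tail_prob` are not from any library); it evaluates $\rho/(1-\rho)$, $1/(\mu - \lambda)$, and the geometric tail $\sum_{n \ge k} (1-\rho)\rho^n = \rho^k$.

```python
def mm1(lam, mu):
    """Mean queue length and mean delay of a stable M/M/1 queue."""
    if lam >= mu:
        raise ValueError("unstable: need lam < mu")
    rho = lam / mu
    return {
        "rho": rho,
        "mean_queue": rho / (1 - rho),     # E(x_t)
        "mean_delay": 1 / (mu - lam),      # T, including the packet's own service
    }


def tail_prob(rho, k):
    """P(k or more packets in the system): the geometric tail sums to rho**k."""
    return rho ** k


mu = 10e9 / 10_000            # 10 Gbps link, mean packet length 10,000 bits
m = mm1(0.9 * mu, mu)
print(m["mean_queue"])        # about 9 packets
print(m["mean_delay"])        # about 1e-5 s, i.e. 10 microseconds
print(tail_prob(0.9, 100))    # about 2.7e-5
```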


Alternative formulation. $A = \{A_t, t \ge 0\}$ is a Poisson counting process with rate $\lambda$: the arrival process. Let $S = \{S_t, t \ge 0\}$ be a Poisson counting process with rate $\mu$: the virtual service process. $S$, $A$ are independent.

The queue at $t$ is given by

$$x_t = x_0 + \int_0^t [dA_s - 1(x_{s-} > 0)\,dS_s].$$

The departure counting process is $D$,

$$D_t = \int_0^t 1(x_{s-} > 0)\,dS_s.$$

$D$ is also Poisson. Moreover,

• Future arrivals, $\{A_s - A_t, s \ge t\}$, and current state, $x_t$, are independent;

• Past departures, $\{D_t - D_s, s \le t\}$, and current state, $x_t$, are independent.


[Figure 3.4 residue: switch with output line $i$ of rate $\mu_i$ pkt/sec; external traffic arrives at rate $\gamma_i$ pkt/sec; traffic from the network leaving line $j$ is routed to line $i$ with probability $r(j,i)$.]

Figure 3.4: Parameters of Jackson network

Jackson network. See Figure 3.4. Assumptions:

• Independent, exponential service times with rate $\mu_i$;

• Markovian routing $r(i,j)$;

• Poisson external arrivals at rate $\gamma_i$ packets/sec.

The aggregate arrival rate into node $i$ is $\lambda_i$ where

$$\lambda_i = \gamma_i + \sum_j \lambda_j r(j,i), \quad \text{all } i. \tag{3.8}$$

Let $x_t = (x_t^1, \dots, x_t^J)$ be the queue-length process. This is Markovian. Problem 5 asks to find its rate matrix.


Theorem. Assume $\lambda_i < \mu_i$, all $i$. Then $x$ has an invariant distribution of the product form:

$$\pi(x^1, \dots, x^J) = \pi_1(x^1) \cdots \pi_J(x^J)$$

where

$$\pi_i(n) = (1 - \rho_i)\rho_i^n, \quad n \ge 0, \quad \text{with } \rho_i = \frac{\lambda_i}{\mu_i}.$$

This is a surprising result. The departure process from any node in the Jackson network need not be Poisson, unlike the case of a single M/M/1 system.
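As a sketch of how the traffic equations and the product form are used together: solve for the aggregate rates $\lambda_i$ by fixed-point iteration (this converges when every packet eventually leaves the network), then multiply the per-node geometric terms. The function names, the iteration limits, and the two-node example with feedback are assumptions made for illustration.

```python
def solve_traffic(gamma, r, iters=10_000, tol=1e-12):
    """Solve lam_i = gamma_i + sum_j lam_j r[j][i] by fixed-point iteration."""
    J = len(gamma)
    lam = list(gamma)
    for _ in range(iters):
        new = [gamma[i] + sum(lam[j] * r[j][i] for j in range(J)) for i in range(J)]
        if max(abs(a - b) for a, b in zip(new, lam)) < tol:
            return new
        lam = new
    raise RuntimeError("traffic equations did not converge")


def product_form(lam, mu, x):
    """Invariant probability of queue lengths x = (x^1, ..., x^J)."""
    p = 1.0
    for l, m, n in zip(lam, mu, x):
        rho = l / m
        if rho >= 1:
            raise ValueError("unstable node: need lam_i < mu_i")
        p *= (1 - rho) * rho ** n
    return p


# Hypothetical two-node network: external arrivals at node 0 only, node 0 feeds
# node 1, and node 1 sends half of its packets back to node 0.
gamma = [1.0, 0.0]
r = [[0.0, 1.0],
     [0.5, 0.0]]
lam = solve_traffic(gamma, r)                      # close to [2.0, 2.0]
print(lam, product_form(lam, [4.0, 4.0], (0, 0)))  # both nodes empty: 0.5 * 0.5
```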


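[Figure 3.5 residue: state transition diagram on states 0, 1, 2, 3, ..., $m-1$, $m$, $m+1$ with service rates $\mu, 2\mu, 3\mu, \dots, (m-1)\mu, m\mu, m\mu$; arrivals are routed to the first free of the servers, each of rate $\mu$.]

Figure 3.5: The M/M/m/$\infty$ system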

3.4 Other M/M/m/n models

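M/M/m, the m server case. The received request is routed to the first available of $m$ servers, Figure 3.5. The buffer is infinite. The balance equations are

$$(\lambda + m\mu)\pi(n) = \lambda\pi(n-1) + m\mu\pi(n+1), \quad n \ge m$$

$$(\lambda + n\mu)\pi(n) = \lambda\pi(n-1) + (n+1)\mu\pi(n+1), \quad 0 < n < m$$

$$\lambda\pi(0) = \mu\pi(1).$$

This gives

$$\pi(n) = \begin{cases} \pi(0)\dfrac{(m\rho)^n}{n!}, & n \le m \\[2mm] \pi(0)\dfrac{m^m \rho^n}{m!}, & n > m \end{cases} \tag{3.9}$$

It is assumed that $\rho = \dfrac{\lambda}{m\mu} < 1$. $\pi(0)$ is obtained using $\sum_n \pi(n) = 1$,

$$\pi(0) = \left[\sum_{n=0}^{m-1} \frac{(m\rho)^n}{n!} + \frac{(m\rho)^m}{m!(1 - \rho)}\right]^{-1}.$$

A packet arriving at time $t$ sees all servers busy ($x_t \ge m$) with probability

$$\begin{aligned} P[x_t \ge m \mid \text{packet arrives in } (t, t+\epsilon)] &= \sum_{n \ge m} P[x_t = n \mid \text{packet arrives in } (t, t+\epsilon)] \\ &= \sum_{n \ge m} \frac{P[\text{packet arrives in } (t, t+\epsilon) \mid x_t = n]\, P(x_t = n)}{P(\text{packet arrives in } (t, t+\epsilon))} \\ &= \sum_{n \ge m} \frac{\lambda\epsilon\, \pi(n)}{\lambda\epsilon} = \sum_{n \ge m} \pi(n) = \pi(0)\frac{m^m}{m!} \sum_{n \ge m} \rho^n, \quad \text{from (3.9)}, \\ &= \frac{\pi(0)(m\rho)^m}{m!(1 - \rho)} =: P(\text{queue}). \end{aligned}$$

The expected number of packets waiting in queue (not in service) is

$$N(\text{queue}) = \sum_{n \ge 0} n\pi(n + m) = \frac{\pi(0)(m\rho)^m}{m!} \sum_{n \ge 0} n\rho^n = P(\text{queue})\frac{\rho}{1 - \rho}.$$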


By Little's law (see below), the average waiting time in queue (not in service) is

$$W = \frac{N(\text{queue})}{\lambda},$$

and the total latency (waiting time plus service time) is

$$T = \frac{1}{\mu} + W.$$
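The M/M/m formulas chain together naturally: $\pi(0)$, then $P(\text{queue})$, then $N(\text{queue})$, $W$ and $T$. The sketch below evaluates them in that order; the function name `mmm` and the returned dictionary keys are choices made here. With $m = 1$ it should reproduce the M/M/1 results.

```python
from math import factorial


def mmm(lam, mu, m):
    """Evaluate the M/M/m formulas: pi(0), P(queue), N(queue), W and T."""
    rho = lam / (m * mu)
    if rho >= 1:
        raise ValueError("unstable: need lam < m * mu")
    a = m * rho                                        # offered load lam / mu
    pi0 = 1.0 / (sum(a ** n / factorial(n) for n in range(m))
                 + a ** m / (factorial(m) * (1 - rho)))
    p_queue = pi0 * a ** m / (factorial(m) * (1 - rho))  # arrival finds all servers busy
    n_queue = p_queue * rho / (1 - rho)                # mean number waiting, not in service
    W = n_queue / lam                                  # Little's law applied to the waiting room
    return {"pi0": pi0, "p_queue": p_queue, "n_queue": n_queue,
            "W": W, "T": 1 / mu + W}


one = mmm(lam=0.9, mu=1.0, m=1)    # single server: reduces to M/M/1
two = mmm(lam=1.8, mu=1.0, m=2)    # two servers at the same utilization
print(one["T"], two["T"])          # the two-server system has the smaller latency
```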


3.5 Little’s law

Suppose $A(t)$ is the cumulative arrivals in $[0, t]$ into a stable queueing system, and $x(t)$ is the number of packets in the system (including those in service). Let $D(i) = S_i + W_i$ be the latency of packet $i$. Let $A(t)/t \to \lambda$ be the arrival rate.

Suppose the queue is empty at $t = 0$ and $t = T$. From Figure 3.6, the time average of queue size is

$$\frac{\int_0^T x(t)\,dt}{T} = \frac{\sum_{i=1}^{A(T)} D(i)}{T} = \frac{A(T)}{T} \cdot \frac{\sum_{i=1}^{A(T)} D(i)}{A(T)}.$$

Taking limits as $T \to \infty$, and if time averages equal ensemble averages, we get

$$E(x) = \lambda \times E(D).$$

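[Figure 3.6 residue: cumulative arrivals $A(t)$ and queue size $x(t)$ against $t$, with service times $S_i$ and waiting times $W_i$ marked.]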

Figure 3.6: Calculations for Little’s law


3.6 PASTA

We have used the PASTA property (Poisson arrivals see time averages) several times.

Consider a stationary queuing system with deterministic service time of 3 and periodic arrivals (period 10). A sample path with arrivals at 1, 2, 3, 11, 12, 13, 21, 22, 23, $\dots$ and queue process $x(t)$ is shown in Figure 3.7.

Let $\pi(n)$ be the probability that $x(t) = n$ at any time $t$, and let $p(n)$ be the probability that an arriving packet sees $n$ packets in queue. For this system,

$$\pi(0) = 1/10, \ \pi(1) = 4/10, \ \pi(2) = 4/10, \ \pi(3) = 1/10$$

$$p(0) = p(1) = p(2) = 1/3,$$

so the two probabilities are not the same.
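These two distributions can be recomputed directly from one period of the sample path. The sketch below assumes FIFO service and uses the fact that $x(t)$ is constant between integer times, so sampling at half-integers gives exact fractions; the helper names are made up for the example.

```python
from collections import Counter
from fractions import Fraction

ARRIVALS = [1, 2, 3]   # arrival times within one period
SERVICE = 3            # deterministic service time
PERIOD = 10


def departures(arrivals, service):
    """FIFO departure times for a single server."""
    out, free_at = [], 0
    for a in arrivals:
        start = max(a, free_at)      # wait until the server is free
        free_at = start + service
        out.append(free_at)
    return out


DEPARTURES = departures(ARRIVALS, SERVICE)   # [4, 7, 10]


def queue_at(t):
    """Number of packets in the system at time t."""
    return sum(a <= t for a in ARRIVALS) - sum(d <= t for d in DEPARTURES)


# Time averages: sample the middle of each unit interval of the period.
time_counts = Counter(queue_at(k + 0.5) for k in range(PERIOD))
pi = {n: Fraction(c, PERIOD) for n, c in time_counts.items()}

# What arrivals see: the queue just before each arrival instant.
seen_counts = Counter(queue_at(a - 0.5) for a in ARRIVALS)
p = {n: Fraction(c, len(ARRIVALS)) for n, c in seen_counts.items()}

print(sorted(pi.items()))   # pi(0)=1/10, pi(1)=2/5, pi(2)=2/5, pi(3)=1/10
print(sorted(p.items()))    # p(0)=p(1)=p(2)=1/3
```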

[Figure 3.7 residue: queue process $x(t)$ over one period, time axis marked 1 to 11.]

Figure 3.7: PASTA property does not hold in this deterministic queuing system

Consider an M/G/1 system, with stationary probabilities $\pi(n)$. Let $p(n)$ be the probability that an arrival sees $n$ packets in queue. Then,

$$\begin{aligned} p(n) &= P[x(t) = n \mid \text{packet arrives in } (t, t+\epsilon)] \\ &= \frac{P(x(t) = n)\, P(\text{packet arrives in } (t, t+\epsilon))}{P(\text{packet arrives in } (t, t+\epsilon))} \\ &= P(x(t) = n) = \pi(n) \end{aligned}$$

using Bayes' rule, independence of arrivals after $t$ from $\{x(s), s \le t\}$, and independence of service times.


[Figure 3.8 residue: remaining waiting time $W(t)$ against $t$, with service times $S_i$, waiting times $W_i$, and area(2) marked.]

Figure 3.8: Deriving Pollaczek-Khinchin formula

3.7 Pollaczek-Khinchin formula

Consider an M/G/1 system with independent service times $S$, $ES = \mu^{-1}$, $ES^2 < \infty$, and Poisson arrivals with rate $\lambda$. Let $W(t)$ be the remaining waiting time, i.e. the amount of time needed to serve the packets in the system at $t$. Let $S_i$ and $W_i$ be the service time and waiting time of packet $i$, see Figure 3.8. The time average of waiting time is

$$\frac{1}{T} \int_0^T W(t)\,dt = \frac{1}{T} \sum_{i=0}^{A(T)} \text{area}(i).$$

$\text{area}(i)$ is the area contributed by packet $i$, so $\text{area}(i) = \frac{1}{2} S_i^2 + S_i W_i$. Substituting and taking limits as $T \to \infty$,

$$EW = \lambda\left(\frac{1}{2} ES^2 + E(\text{waiting time faced by arriving packet})\, ES\right).$$

By PASTA, $E(\text{waiting time faced by arrival}) = EW$. So,

$$EW = \frac{\lambda ES^2}{2(1 - \lambda ES)} = \frac{\lambda ES^2}{2(1 - \rho)},$$

where $\rho = \lambda/\mu$ is the utilization.

Note: The formula

$$E \sum_{i=0}^{A(T)} \text{area}(i) = EA(T)\, E\,\text{area}(i)$$

involving a random sum of $A(T)$ terms is sometimes called Wald's formula. A general version of Wald's formula is a consequence of the fact that $\{A(t) - \lambda t, t \ge 0\}$ is a martingale. See Problem 8.

Determinism minimizes waiting

In general, $ES^2 = (ES)^2 + \sigma^2$, so

$$W = \frac{\lambda((ES)^2 + \sigma^2)}{2(1 - \rho)} \ge \frac{\lambda(ES)^2}{2(1 - \rho)} = \frac{\lambda}{2\mu(\mu - \lambda)}$$

where the last expression is the waiting time for a deterministic service time (e.g. ATM cells).
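A quick numerical comparison of exponential and deterministic service under the Pollaczek-Khinchin formula. The function name `pk_wait` and the example rates are choices made here; the second moments used are $2/\mu^2$ for exponential service and $1/\mu^2$ for deterministic service.

```python
def pk_wait(lam, mean_s, second_moment_s):
    """Pollaczek-Khinchin mean waiting time: lam * E[S^2] / (2 * (1 - rho))."""
    rho = lam * mean_s
    if rho >= 1:
        raise ValueError("unstable: need lam * E[S] < 1")
    return lam * second_moment_s / (2 * (1 - rho))


mu, lam = 1.0, 0.9
w_exp = pk_wait(lam, 1 / mu, 2 / mu ** 2)   # exponential service: E[S^2] = 2 / mu^2
w_det = pk_wait(lam, 1 / mu, 1 / mu ** 2)   # deterministic service: E[S^2] = 1 / mu^2
print(w_exp, w_det)   # about 9.0 and 4.5: deterministic service halves the wait
```

The deterministic value agrees with $\lambda / (2\mu(\mu - \lambda))$, and the exponential value plus one service time gives back the M/M/1 delay $1/(\mu - \lambda)$.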


3.8 Problems

1. How does (3.2) follow from (3.1)?

2. Give examples of Markov chains $x$ with the following properties:

(a) $x$ is irreducible and has no invariant distribution;

(b) $x$ is finite with more than one invariant distribution;

(c) $x$ is finite but not irreducible;

(d) $x$ is infinite and positive recurrent;

(e) $x$ is finite, irreducible and (3.5) does not hold.

3. Show that if $X$ is finite the convergence in (3.5) is geometrically fast, i.e. $|\pi_n - \pi| < c\,\epsilon^n$ for some $0 \le \epsilon < 1$.

4. A packet processor takes 1 $\mu$s to forward one packet. Packet arrivals are iid. In 1 $\mu$s, $i$ packets arrive with probability $\alpha_i$, $i \ge 0$. Let $x_n$ be the number of packets at the beginning of the $n$th $\mu$s in the (infinite) buffer.

(a) Show that $x = \{x_n, n \ge 0\}$ is a Markov chain.

Hint: express the evolution of $x$ as a stochastic dynamical system of the form

$$x_{n+1} = f(x_n, w_n),$$

where $w = \{w_n, n \ge 0\}$ is an independent process. Show that in this case $x$ is always Markov, and if $w$ is iid, $x$ has stationary transition probabilities.

(b) Show that $x$ is irreducible.

(c) Write the balance equations.

(d) Find conditions on the $\alpha_i$ so that $x$ is positive recurrent. How would you find the expected forwarding delay faced by a packet?

(e) Give an example of the $\alpha_i$ so that $x$ is not positive recurrent. What happens to the queue size in this case?

5. Find the rate matrix of the queue length process $\{x_t\}$ of the Jackson network in Figure 3.4.

6. In the feedback network in Figure 3.9, at each link a packet leaves the system with probability 0.5. For what values of $\gamma$ (in terms of $\mu_1, \mu_2$) is the system stable?

7. In Figure 1.2 suppose the bridges have a throughput of 1 Gbps, L1, L2, L4 and L5 are 10 Mbps LANs and L3 is a 100 Mbps LAN. Suppose traffic originating in each LAN is 50 percent of LAN capacity.

Suppose 90 percent of the traffic originating in LAN Li is destined for a station in the same LAN whereas 10 percent is destined for a station in LAN Lj, selected randomly.

(a) Can the network support this traffic?

(b) By what factor can the traffic increase, before the network becomes unstable?

[Figure 3.9 residue: two links with rates $\mu_1$ and $\mu_2$; each branch is taken with probability 0.5.]

Figure 3.9: Network for Problem 6

8. Let $x(n)$, $n \ge 0$, be a Bernoulli sequence with $P(x(n) = 1) = p$. Let $N$ be a random number defined below. For each case explain why or why not

$$E \sum_{n=0}^{N} x(n) = p\,EN.$$

(a) $N$ = constant a.s.

(b) $N = \arg\min\{n \mid x(n+1) = 1\}$.

(c) $N = \arg\min\{n \mid x(n) = 1\}$.

(d) $N = \arg\min\{n \mid \sum_{i=0}^{n} x(i) > 100\}$.

Chapter 4

Switching

4.1 Packet switching

• Architectures

• IQ/HOL

• VoQ

• SQ


[Figure 4.1 residue, panel titles: IQ: HOL blocking (a packet behind the head of line is blocked); OQ: faster switch; VoQ: matching; SQ: reduces buffer size.]

Figure 4.1: Packet switch architectures

4.1.1 Architectures

Second generation PRIZMA architecture is $32 \times 32$, with 2 Gbps ports ... all on one chip [15]


[Figure 4.2 residue: input queues feeding a virtual HOL queue for output 1, with arrivals $A_t$ from nonblocked queues and occupancy $X_t$; a timeline showing read, HOL arrivals, $X_t$, $A_t$, $\min(1, X_t)$ and $X_{t+1}$; a plot of average delay in cell times (0 to 10) against $\rho$ (0.1 to 1.1).]

Figure 4.2: Virtual HOL queue

4.1.2 Input queues

Assume:

• discrete time $t$, independent arrivals, uniform destination with prob $\lambda/N$

• $N$ large so total number of port 1 arrivals is Poisson

$$P(n \text{ port 1 arrivals}) = e^{-\lambda} \frac{\lambda^n}{n!}$$

Virtual HOL queue of $X_t$ port 1 packets at head of queue:

$$X_{t+1} = (X_t - 1)^+ + A_t = X_t + A_t - 1(X_t > 0), \quad t \ge 0, \tag{4.1}$$

where $A_t$ = number of new port 1 packets that come to the head of unblocked queues.

Suppose the equilibrium probability that a queue is unblocked is $\phi$. Then $A_t$ is Poisson with mean $\rho = \lambda\phi$. From (4.1), $E(X) = E(X) + E(A) - P(X > 0)$, so $P(X > 0) = E(A) = \rho$. Square (4.1), and take expectations,

$$0 = E(A^2) + \rho + 2\rho E(X) - 2E(X) - 2\rho^2. \tag{4.2}$$

Since $E(A^2) = \rho + \rho^2$,

$$E(X) = \frac{2\rho - \rho^2}{2(1 - \rho)} =: \beta.$$

Also $E(\#\text{ blocked queues}) = N(1 - \phi) = N E(X_t - 1)^+ = N(\beta - \rho)$.

For $\lambda = 1$ this gives $\beta = 1$, $\rho = 2 - \sqrt{2} = 0.58$.

That is, 42% of switch bandwidth is not utilized.

Quick upper bound. Same switch, but at the end of each cycle the IQs are flushed. With $\lambda = 1$, switch throughput is

$$\sum_j P(X_j > 0) = \sum_j [1 - P(X_j = 0)] = N\left[1 - \left(1 - \frac{1}{N}\right)^N\right].$$

Per port throughput is

$$1 - \left(1 - \frac{1}{N}\right)^N \xrightarrow{N \to \infty} 1 - \frac{1}{e} = 0.63.$$

4.1.3 Virtual output queue

• Each input port has $N$ VoQs, one per output port.

• If several input ports have packets for the same destination, which one should be served?

• Assume iid arrivals $A_{ij}(t)$ with rates $\lambda = \{\lambda_{ij}\}$ such that

$$\sum_i \lambda_{ij} < 1, \ \forall j = 1, \dots, N; \qquad \sum_j \lambda_{ij} < 1, \ \forall i = 1, \dots, N$$

• Service $S(t) = \{S_{ij}(t)\}$ such that

$$\sum_i S_{ij} \le 1, \ \forall j = 1, \dots, N; \qquad \sum_j S_{ij} \le 1, \ \forall i = 1, \dots, N$$

Note: If equality holds above, $S(t)$ is a permutation over $\{1, \dots, N\}$.

• Queue lengths $L_{ij}(t)$ such that

$$L(t+1) = [L(t) - S(t)]^+ + A(t).$$

Question: Given $\lambda$, find $S(t)$, based on past arrivals $A(s)$, $s \le t$, and $L(t)$, so that $L$ is stable.

Conjecture: There always exists a stabilizing matching $S(t)$.


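[Figure: above, a bipartite graph G between inputs I and outputs O with edge weights w_ij, and a maximum size matching M; below, the counterexample with all demands 0.5 and its maximum size matchings.]

Figure 4.3: Matching in bipartite graphs (above) and counterexample (below)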

Matching: Let G = [V, E] be a bipartite graph, i.e. $V = I \cup O$ and $E \subseteq I \times O$. $M \subseteq E$ is a matching if no two edges in M have a common vertex. Edges can be weighted.

Max size matching: M has the maximum number of edges. Best algorithm has running time $O(N^{5/2})$.

Max weight matching: $\sum \{w_{ij} \mid (i,j) \in M\}$ is maximum. Best algorithm has running time $O(N^3 \log N)$.

Conjecture: Maximum size matching is stabilizing.

Counterexample

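Take N = 3, $\lambda_{11} = \lambda_{12} = 0.5$, $\lambda_{21} = \lambda_{32} = 0.5$.

Suppose $L_{11}(t) > 0$, $L_{12}(t) > 0$. Suppose $A_{21}(t) = A_{32}(t) = 1$ (which happens with probability 0.25). Then there are 3 max size matchings and input 1 will be selected with probability 2/3. So,

$$\text{Prob}(S_{11}(t) + S_{12}(t) = 1) \le 0.25 \times 2/3 + 0.75 \times 1 = 1/6 + 3/4 = 11/12 < 1.$$

Input 1 receives packets at total rate $\lambda_{11} + \lambda_{12} = 1$ but is served at rate at most 11/12, i.e. $L_{11}(t) + L_{12}(t) \to \infty$.

Suppose all rates are $0.5 - \delta$. We still get instability with max size matching for $\delta > 0$ small.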
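The count of maximum size matchings in the counterexample can be confirmed by brute force. The sketch below enumerates edge subsets of the request graph with edges (1,1), (1,2), (2,1), (3,2); it assumes, as the argument above does, that one of the maximum size matchings is picked uniformly at random. The helper name `max_size_matchings` is illustrative.

```python
from fractions import Fraction
from itertools import combinations

def max_size_matchings(edges):
    """Return all maximum-cardinality matchings of a bipartite edge list (brute force)."""
    for size in range(len(edges), 0, -1):
        found = []
        for subset in combinations(edges, size):
            inputs = {i for i, _ in subset}
            outputs = {j for _, j in subset}
            # A matching uses each input and each output at most once.
            if len(inputs) == size and len(outputs) == size:
                found.append(subset)
        if found:
            return found
    return []

edges = [(1, 1), (1, 2), (2, 1), (3, 2)]   # (input, output) pairs with backlog
matchings = max_size_matchings(edges)
serving_input_1 = [m for m in matchings if any(i == 1 for i, _ in m)]

p_both_arrive = Fraction(1, 4)
p_input1_served = Fraction(len(serving_input_1), len(matchings))
bound = p_both_arrive * p_input1_served + (1 - p_both_arrive) * 1
print(len(matchings), p_input1_served, bound)
```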


Try providing more service if queue lengths are large, i.e. choose

$$S(t) = \arg\max\Big\{\sum_{i,j} L_{ij}(t) S_{ij} \;\Big|\; S \text{ is a permutation}\Big\}.$$

Intuition. In continuous time approximation,

$$\frac{d}{dt} L(t) = A(t) - S(t), \quad \text{neglecting negatives}.$$

So (below L, A, S are vectors or matrices as appropriate)

$$\frac{d}{dt} |L(t)|^2 = 2 L(t)^T (A(t) - S(t)) \tag{4.3}$$

$$\frac{d}{dt} E\{|L(t)|^2 \mid L(t)\} = 2 L(t)^T (\lambda - S(t)) \tag{4.4}$$

So choose

$$S^*(t) = \arg\max\{L(t)^T S \mid S \text{ is a permutation}\}.$$

Recall

Theorem The assignment problem

$$\max\; c^T x \quad \text{subject to} \quad \forall j,\ \sum_i x_{ij} \le 1; \qquad \forall i,\ \sum_j x_{ij} \le 1; \qquad \forall i,j,\ x_{ij} \ge 0$$

has an optimum solution that is a permutation.

Suppose in (4.4) that $\sum_i \lambda_{ij} \le 1 - \delta$ and $\sum_j \lambda_{ij} \le 1 - \delta$. Then from the Theorem,

$$L(t)^T (\lambda - S^*(t)) \le -\delta |L(t)|,$$

so with policy $S^*(t)$,

$$\frac{d}{dt} E\{|L(t)|^2 \mid L(t)\} \le -\delta |L(t)|,$$

from which stability follows by [16].
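For a small switch, the maximum weight permutation can be found by exhaustive search, which makes the policy concrete. The sketch below is a brute-force illustration (factorial time, so only for tiny N), not one of the efficient algorithms mentioned earlier; the queue-length matrix is a made-up example and ports are numbered from 0.

```python
from itertools import permutations

def max_weight_permutation(L):
    """Return (perm, weight) maximizing sum_i L[i][perm[i]] over all permutations.

    perm[i] is the output served for input i (0-based).
    """
    n = len(L)
    best_perm, best_weight = None, -1
    for perm in permutations(range(n)):
        weight = sum(L[i][perm[i]] for i in range(n))
        if weight > best_weight:
            best_perm, best_weight = perm, weight
    return best_perm, best_weight

# Hypothetical VoQ occupancies L[i][j] for a 3 x 3 switch.
L = [[3, 0, 1],
     [2, 5, 0],
     [0, 4, 4]]
perm, weight = max_weight_permutation(L)
print(perm, weight)
```

Here the longest queues happen to lie on the diagonal, so the identity permutation is selected.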


[Figure: inputs 1, ..., N and outputs 1, ..., N attached to a shared bus; a shared memory holds queues 1, ..., N; the path taken by a packet from input 1 to output N is marked.]

Figure 4.4: Switch with time-division shared bus and centralized shared memory

4.2 Shared queue

This architecture is used in most low speed packet processors: a time-division bus with a centralized memory shared by all input and output lines (figure 4.4). Up to N packets may arrive at one time and up to N may be read at one time, so the memory bandwidth must be 2N times the line rate. Assuming a 100 ns DRAM access time and a 53-byte-wide bus gives a total bandwidth of 530 MB/s × 8 = 4240 Mbps. For a 16-port ATM switch, this gives a line rate of 4240/32 ≈ 132 Mbps.
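The arithmetic above generalizes to other memory and port parameters. A minimal sketch, using the numbers quoted in the text (100 ns access, 53-byte bus, 16 ports); the function name is illustrative.

```python
def shared_memory_rates_mbps(access_time_ns: float, bus_width_bytes: int, ports: int):
    """Return (total memory bandwidth, sustainable per-port line rate) in Mbps.

    One bus-wide word moves per access; each port needs one write and one read
    per cell time, hence the division by 2 * ports.
    """
    total_mbps = bus_width_bytes * 8 * 1000 / access_time_ns  # bits/ns -> Mbps
    return total_mbps, total_mbps / (2 * ports)

total, per_port = shared_memory_rates_mbps(100, 53, 16)
print(total, per_port)
```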

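Let

$X_t$ = size of the 1-list at the beginning of slot t

$A_t$ = number of 1-packets arriving in slot t, $0 \le A_t \le N$

$$X_{t+1} = (X_t - 1)^+ + A_t = X_t + A_t - 1(X_t > 0) \tag{4.5}$$

Following the same argument that led to (4.2) gives

$$EX = \frac{\lambda + \sigma^2 - \lambda^2}{2(1-\lambda)},$$

where $EA = \lambda = P(X_t > 0)$ and $EA^2 = \sigma^2 + \lambda^2$. For the Poisson case, $\sigma^2 = \lambda$, so

$$EX = \frac{2\lambda - \lambda^2}{2(1-\lambda)}. \tag{4.6}$$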

Shared vs separate queue

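Suppose the buffer for queue i is sized at $EX(i) + k\sigma(i)$, where $\sigma(i)$ is the standard deviation of $X(i)$. Then, assuming the $X(i)$ are independent so that their variances add,

$$\text{separate buffer size} = \sum_i [EX(i) + k\sigma(i)]$$

$$\text{shared buffer size} = \sum_i EX(i) + k\Big[\sum_i \sigma(i)^2\Big]^{1/2}$$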
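A small numerical comparison shows the saving from sharing. The sketch below evaluates both sizing rules; the means, standard deviations and safety factor k are hypothetical values chosen only for illustration, and the shared formula assumes independent queues.

```python
import math

def separate_buffer_size(means, sigmas, k):
    """Sum of per-queue allocations E X(i) + k * sigma(i)."""
    return sum(m + k * s for m, s in zip(means, sigmas))

def shared_buffer_size(means, sigmas, k):
    """Sum of means plus k standard deviations of the total (independent queues)."""
    return sum(means) + k * math.sqrt(sum(s * s for s in sigmas))

# Hypothetical example: 16 identical queues with mean 2 and standard deviation 3.
means, sigmas, k = [2] * 16, [3] * 16, 3
separate = separate_buffer_size(means, sigmas, k)
shared = shared_buffer_size(means, sigmas, k)
print(separate, shared)
```

With identical queues the safety margin grows like the square root of N in the shared case instead of linearly in N.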


4.3 Output queue

In an output queued switch the switch fabric must run at N times the line rate, and the output memory must run at (N + 1) times the line rate. The queue length in port 1 is given by (4.5).


4.4 Problems

1. Assume that $A_t$ is Poisson in (4.1) or (4.5). The mean queue size is given by (4.6).

(a) Is $X_t$, $t \ge 0$, Markov? Why?

(b) If $X_t$ is stationary, how would you find $\pi(n) = P(X_t = n)$?

Chapter 5

Matching

Crossbar switches need a controller to schedule the switch. The controller must find a good match, e.g. longest queue first, oldest cell first, etc.

It is too expensive to run a centralized matching algorithm with complexity $O(N^2)$ or $O(N^3)$. (A 40-byte packet at a line speed of 1 Gbps amounts to 360 ns/packet.)

So one may have to be satisfied with a maximal matching, found using distributed algorithms. Note that for a fully-connected bipartite graph, a maximal matching is also maximum.

In case of QoS, the matching must satisfy some preferences.


Man #   Preference list     Woman #   Preference list
1       1 2 3 4             1         1 3 4 2
2       2 1 4 3             2         3 4 1 2
3       3 2 4 1             3         2 1 4 3
4       3 4 2 1             4         1 2 3 4

5.1 The dating game

Consider a dating game with N men and N women and the preferences in the table above. The SMP (stable marriage problem) algorithm by Gale and Shapley finds a "stable" match, e.g.

(1, 1), (2, 4), (3, 2), (4, 3).

The algorithm is

• iterative: it proceeds in a sequence of proposals and (tentative) accepts

• upon termination, returns a matching (i, p(i))

• guaranteed to be stable.

A matching is unstable if it contains pairs (i, p(i)), (j, p(j)) such that

i prefers p(j) to p(i) and p(j) prefers i to j.

Then (i, p(j)) is a blocking pair.

A stable matching has no blocking pair.


The GS algorithm (GSA). Say that a man or woman is

• free, if she/he is not engaged or matched to any man/woman

• engaged, if she/he is temporarily matched to some man/woman

• matched, if she/he is terminally matched


[Flowchart, as text:]

BEGIN: all are free.

1. Is some man m free? If no, END.

2. If yes, m proposes to w, the first woman on his list he has not yet proposed to.

3. Is w free? If yes, w becomes engaged to m; go to 1.

4. If no, w is currently engaged to m'. Does w prefer m to m'? If yes, match w and m and set m' free; if no, m continues free. Go to 1.

Figure 5.1: The GS algorithm


Algorithm will terminate. No man can be rejected by all women, because a woman can reject a man only if she is engaged, and once she is engaged, she stays engaged. So if every woman rejects m, all N women are engaged to N distinct men other than m, which is impossible since there are only N − 1 other men. Alternatively: in each iteration, a man makes worse choices and a woman makes better choices.

GSA finds a stable matching. Suppose (i, p(i)), (j, p(j)) are matched but i prefers p(j) to p(i) and p(j) prefers i to j. Then i must have proposed to p(j) before proposing to p(i), and p(j) must have rejected i in favor of, say, k, whom she prefers to i. But women make better and better choices, so p(j)'s final match j is at least as good for her as k, who is better than i. Hence p(j) prefers j to i, a contradiction.

Number of iterations is bounded by $N^2$: there are N men and each makes at most N proposals.

There may be more than one stable matching. Suppose m1 prefers w1 to w2, m2 prefers w2 to w1; w1 prefers m2 to m1, w2 prefers m1 to m2.

Then {(m1, w1), (m2, w2)} and {(m2, w1), (m1, w2)} are both stable matches.
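The men-proposing procedure can be sketched in a few lines and run on the 4 × 4 preference table given earlier. The function names `gale_shapley` and `blocking_pairs` and the dictionary layout are choices made for this illustration; the order in which free men are processed is arbitrary.

```python
from collections import deque

def gale_shapley(men_prefs, women_prefs):
    """Men-proposing Gale-Shapley. Returns a dict man -> woman."""
    # rank[w][m] = position of m in w's list (lower is better).
    rank = {w: {m: r for r, m in enumerate(prefs)} for w, prefs in women_prefs.items()}
    next_choice = {m: 0 for m in men_prefs}  # index of the next woman m will propose to
    fiance = {}                              # woman -> man she is currently engaged to
    free_men = deque(men_prefs)
    while free_men:
        m = free_men.popleft()
        w = men_prefs[m][next_choice[m]]
        next_choice[m] += 1
        if w not in fiance:
            fiance[w] = m                    # w was free: she accepts tentatively
        elif rank[w][m] < rank[w][fiance[w]]:
            free_men.append(fiance[w])       # w trades up; her old partner is free again
            fiance[w] = m
        else:
            free_men.append(m)               # w rejects m; he stays free
    return {m: w for w, m in fiance.items()}

def blocking_pairs(match, men_prefs, women_prefs):
    """List pairs (m, w) who both prefer each other to their assigned partners."""
    husband = {w: m for m, w in match.items()}
    pairs = []
    for m, prefs in men_prefs.items():
        for w in prefs:
            if w == match[m]:
                break                        # remaining women rank below m's partner
            if women_prefs[w].index(m) < women_prefs[w].index(husband[w]):
                pairs.append((m, w))
    return pairs

men = {1: [1, 2, 3, 4], 2: [2, 1, 4, 3], 3: [3, 2, 4, 1], 4: [3, 4, 2, 1]}
women = {1: [1, 3, 4, 2], 2: [3, 4, 1, 2], 3: [2, 1, 4, 3], 4: [1, 2, 3, 4]}
match = gale_shapley(men, women)
print(sorted(match.items()))
print(blocking_pairs(match, men, women))
```

On this table the result is the match quoted in the text, and the blocking-pair check comes back empty.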


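[Figure: three panels of a 4 × 4 switch with inputs 1 to 4 on the left and outputs 1 to 4 on the right, showing the request, grant and accept steps, with the accept pointers a1, a3 and grant pointers g2, g4 marked.]

Figure 5.2: 4 × 4 RRM showing a1, a3, g2, g4 pointers with L(1, 1), L(1, 2), L(3, 2), L(3, 4) > 0.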

5.2 Round-robin matching

Each input i maintains an accept pointer $a_i$. Each output j maintains a grant pointer $g_j$.

RRM cycle.

Step 1 Each i requests all j with L(i, j) > 0.

Step 2 Each j grants the next requesting input $G_j$ at or after the current pointer value $g_j$ (in round-robin order, wrapping around modulo N), i.e.

$$G_j = \min\{i \mid i \ge g_j \wedge L(i,j) > 0\};$$

then increments $g_j \leftarrow G_j + 1$.

Step 3 Each i accepts the next granted output $A_i$ at or after the current pointer value $a_i$, i.e.

$$A_i = \min\{j \mid j \ge a_i \wedge G_j = i\}.$$

If a grant has been accepted, i increments $a_i \leftarrow A_i + 1$.

Figure 5.2 illustrates one RRM cycle. Initially, all $a_i = 1$ and all $g_j = 1$. The input requests are

1 → 1, 1 → 2, 3 → 2, 3 → 4, 4 → 4.

So we have the following steps (primes denote updated pointer values):

Output j   G_j   g'_j        Input i   A_i   a'_i
1          1     2           1         1     2
2          1     2           2         ∅     1
3          ∅     1           3         4     1
4          3     4           4         ∅     1

At the end of this cycle, the match is {(1, 1), (3, 4)}, and the pointer values are given above.
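One RRM cycle is short enough to write out directly. The sketch below assumes 1-based port numbers and pointers that wrap around modulo N, and reproduces the 4 × 4 example; `rrm_cycle` and its argument layout are illustrative names, not part of the notes.

```python
def rrm_cycle(requests, g, a, n):
    """One request-grant-accept cycle of RRM on an n x n switch.

    requests: set of (input, output) pairs with L(i, j) > 0
    g: dict output -> grant pointer, a: dict input -> accept pointer
    Returns (match, new_g, new_a) without modifying g or a.
    """
    def first_at_or_after(pointer, candidates):
        # Round-robin choice: scan pointer, pointer + 1, ..., wrapping modulo n.
        for step in range(n):
            port = (pointer - 1 + step) % n + 1
            if port in candidates:
                return port
        return None

    new_g, new_a = dict(g), dict(a)
    grants = {}                                   # output j -> granted input G_j
    for j in range(1, n + 1):
        requesters = {i for (i, jj) in requests if jj == j}
        gj = first_at_or_after(g[j], requesters)
        if gj is not None:
            grants[j] = gj
            new_g[j] = gj % n + 1                 # RRM: advance even if not accepted
    match = set()
    for i in range(1, n + 1):
        granted = {j for j, gi in grants.items() if gi == i}
        ai = first_at_or_after(a[i], granted)
        if ai is not None:
            match.add((i, ai))
            new_a[i] = ai % n + 1
    return match, new_g, new_a

requests = {(1, 1), (1, 2), (3, 2), (3, 4), (4, 4)}
ones = {p: 1 for p in range(1, 5)}
match, new_g, new_a = rrm_cycle(requests, ones, ones, 4)
print(sorted(match), new_g, new_a)
```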


5.2.1 Analysis of RRM

Under heavy load, the grant counters may get synchronized, reducing utilization. Consider N = 2, L(i, j) > 0 for all i, j. Then it is possible for g1 = g2 always, as follows (primes denote updated pointer values; matches are written (input, output)).

Cycle   a1 a2   g1 g2   G1 G2   g'1 g'2   A1 A2   a'1 a'2   Match
1       1  1    1  1    1  1    2   2     1  ∅    2   1     (1, 1)
2       2  1    2  2    2  2    1   1     ∅  1    2   2     (2, 1)
3       2  2    1  1    1  1    2   2     2  ∅    1   2     (1, 2)
4       1  2    2  2    2  2    1   1     ∅  2    1   1     (2, 2)

At the end of the fourth cycle the situation repeats. Throughput is 50 percent.

Of course the following TDM cycle is also possible, and has throughput of 100 percent.

Cycle   a1 a2   g1 g2   G1 G2   g'1 g'2   A1 A2   a'1 a'2   Match
1       1  2    1  2    1  2    2   1     1  2    2   1     {(1, 1), (2, 2)}
2       2  1    2  1    2  1    1   2     2  1    1   2     {(1, 2), (2, 1)}

Under heavy load, if the grant counters get synchronized at any time (i.e. have the same value), they'll stay synchronized forever.

Under light load, the grant counters will be randomly distributed. The probability that some input i is not served is

$$P[G_j \ne i, \forall j] = \left(\frac{N-1}{N}\right)^N \to e^{-1} \approx 0.37,$$

so an input is served with probability tending to $1 - e^{-1} \approx 0.63$.


[Figure: 2 × 2 switch with request rates (1,1) = 1, (1,2) = 1, (2,1) = 1 and resulting acceptance rates µ(1,1) = 1/4, µ(1,2) = 3/4, µ(2,1) = 3/4.]

Figure 5.3: PIM can be unfair under heavy load

5.3 Parallel iterative matching, PIM

Step 1 Each unmatched input i sends requests to every output j such that L(i; j) > 0.

Step 2 Each j randomly picks Gj from received requests.

Step 3 Each i randomly accepts one of received grants.

The I in PIM means that this cycle is repeated to improve match.

5.3.1 Analysis of PIM

It appears that with uniform iid traffic, PIM achieves a maximal match in 3 iterations.

In heavy load, every input makes requests. The probability that i receives no grant in one round equals

$$P[\forall j,\ G_j \ne i] = \left(\frac{N-1}{N}\right)^N \to e^{-1} \approx 0.37,$$

so an input receives at least one grant with probability tending to $1 - e^{-1} \approx 0.63$.

PIM can be unfair. Figure 5.3 gives a 2 × 2 case where the request rate from input i to output j is $\lambda(i,j)$. So requests 1 → 1, 1 → 2, 2 → 1 are made in each slot. The grant rates $\gamma(j,i)$ from output j to input i will therefore be $\gamma(1,1) = \gamma(1,2) = 0.5$, $\gamma(2,1) = 1$. So input 1 will accept output 1 with probability $\mu(1,1) = 0.25$, and output 2 with probability $\mu(1,2) = 0.75$. Input 2 is granted output 1 with probability 0.5 in the first iteration and, when input 1 turns down output 1's grant (probability 0.25), picks it up in the next iteration, so input 2 will accept output 1 with probability $\mu(2,1) = 0.75$.

Thus even though arrival rates for output port 1 are equal at input ports 1 and 2, the acceptance rates are not the same.
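The acceptance rates in this example can be computed exactly by enumerating every random grant and accept choice. The sketch below assumes uniform random choices at outputs and inputs and that already matched ports drop out of later iterations; `pim_probabilities` is a name invented for this illustration.

```python
from fractions import Fraction
from itertools import product

def pim_probabilities(requests, iterations):
    """Exact probability that each (input, output) pair ends up matched by PIM."""
    result = {}

    def recurse(free_in, free_out, rounds, weight):
        live = [(i, j) for (i, j) in requests if i in free_in and j in free_out]
        if rounds == 0 or not live:
            return
        outputs = sorted({j for _, j in live})
        choices = [[i for (i, j) in live if j == out] for out in outputs]
        p_grant = Fraction(1)
        for c in choices:
            p_grant /= len(c)                      # each output grants uniformly
        for granted in product(*choices):          # one granted input per output
            offers = {}
            for out, i in zip(outputs, granted):
                offers.setdefault(i, []).append(out)
            inputs = sorted(offers)
            for accepted in product(*(offers[i] for i in inputs)):
                p = weight * p_grant
                for i in inputs:
                    p /= len(offers[i])            # each input accepts uniformly
                new_in, new_out = set(free_in), set(free_out)
                for i, out in zip(inputs, accepted):
                    result[(i, out)] = result.get((i, out), Fraction(0)) + p
                    new_in.discard(i)
                    new_out.discard(out)
                recurse(new_in, new_out, rounds - 1, p)

    recurse({i for i, _ in requests}, {j for _, j in requests}, iterations, Fraction(1))
    return result

requests = {(1, 1), (1, 2), (2, 1)}
mu = pim_probabilities(requests, iterations=2)
print(sorted(mu.items()))
```

With a single iteration input 2 would obtain output 1 only half the time; the second iteration raises this to 3/4, matching the figure.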


5.4 iSLIP matching

The detailed reference is [4]. RRM suffers from synchronization of the grant counters. iSLIP modifies RRM slightly so that a grant counter is incremented only if the grant is accepted. So step 2 of RRM is modified.

Step 2 Each j grants the next requesting input $G_j$ at or after the current pointer value $g_j$, i.e.

$$G_j = \min\{i \mid i \ge g_j \wedge L(i,j) > 0\};$$

then increments $g_j \leftarrow G_j + 1$ only if $G_j$ accepts output j.


5.4.1 Analysis of iSLIP

Consider the situation L(i, j) > 0 for all i, j. In contrast with RRM, inputs 1 and 2 share outputs in TDM fashion.

Cycle   a1 a2   g1 g2   G1 G2   A1 A2   g'1 g'2   a'1 a'2   Match
1       1  1    1  1    1  1    1  ∅    2   1     2   1     (1, 1)
2       2  1    2  1    2  1    2  1    1   2     1   2     {(1, 2), (2, 1)}
3       1  2    1  2    1  2    1  2    2   1     2   1     {(1, 1), (2, 2)}
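The difference between the two pointer-update rules can be seen by simulating a fully loaded 2 × 2 switch. The sketch below runs a single request-grant-accept iteration per slot with all pointers starting at 1; the `islip` flag selects whether a grant pointer moves only when its grant is accepted. Function names and the slot count are arbitrary choices for the illustration.

```python
def schedule_cycle(requests, g, a, n, islip):
    """One request-grant-accept cycle; updates pointers g and a in place."""
    def pick(pointer, candidates):
        # Round-robin choice starting at the pointer, wrapping modulo n.
        for step in range(n):
            port = (pointer - 1 + step) % n + 1
            if port in candidates:
                return port
        return None

    grants = {}
    for j in range(1, n + 1):
        gj = pick(g[j], {i for (i, jj) in requests if jj == j})
        if gj is not None:
            grants[j] = gj
    match = set()
    for i in range(1, n + 1):
        ai = pick(a[i], {j for j, gi in grants.items() if gi == i})
        if ai is not None:
            match.add((i, ai))
            a[i] = ai % n + 1
    for j, gj in grants.items():
        # RRM always advances the grant pointer; iSLIP only on acceptance.
        if (gj, j) in match or not islip:
            g[j] = gj % n + 1
    return match

def total_matches(islip, slots, n=2):
    """Count matched (input, output) pairs over `slots` cycles under full load."""
    requests = {(i, j) for i in range(1, n + 1) for j in range(1, n + 1)}
    g = {j: 1 for j in range(1, n + 1)}
    a = {i: 1 for i in range(1, n + 1)}
    return sum(len(schedule_cycle(requests, g, a, n, islip)) for _ in range(slots))

slots = 100
print("RRM  ", total_matches(islip=False, slots=slots), "of", 2 * slots)
print("iSLIP", total_matches(islip=True, slots=slots), "of", 2 * slots)
```

RRM stays synchronized at one match per slot, while iSLIP loses only the first slot before settling into the TDM pattern.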


5.4.2 Priority iSLIP

Suppose there are P priority levels. Then each input i maintains P × N VoQs, with $L_p(i,j)$ the buffer occupancy of priority p and output j. Then i gives strict priority, i.e. serves $L_p(i,j)$ only if $L_q(i,j) = 0$ for all q > p. Each input maintains a counter $a_{p,i}$ and each output maintains a counter $g_{p,j}$ for each priority level.

Step 1 Each i selects the highest priority level P(i, j) with a non-empty queue to output j.

Step 2 Output j determines the highest priority level $P(j) = \max_i P(i,j)$. The output then chooses one input among those inputs that have requested at level P(j). The output maintains a separate pointer $g_{p,j}$ for each level, and chooses input $G_{p,j}$ among the requests at level P(j) in the same round-robin scheme. The output notifies each input whether or not its request is granted. The pointer is updated, $g_{p,j} \leftarrow G_{p,j} + 1$, only if the granted input $G_{p,j}$ accepts output j.

Step 3 If input i receives any grants, it determines the highest priority level among them, say p. The input then chooses one grant among those at this level. This is done according to the counter $a_{p,i}$, which is then incremented, $a_{p,i} \leftarrow a_{p,i} + 1$. The input then notifies each output whether or not its grant was accepted.


5.4.3 Threshold iSLIP

It may be better to select a weighted maximal match with weights corresponding to queue length. If queue lengths are quantized in threshold levels $t_1 < t_2 < \cdots < t_P$, then priorities may be assigned accordingly: priority p when $t_p \le L(i,j) < t_{p+1}$.

5.4.4 Weighted iSLIP

Suppose bandwidth from i to j is to be shared according to the ratio $f(i,j) = n(i,j)/d(i,j)$, subject to $\sum_i f(i,j) < 1$, $\sum_j f(i,j) < 1$.

In iSLIP each counter is an ordered circular list $S = \{1, \ldots, N\}$. Now expand the list at output j to $S(j) = \{1, \ldots, W(j)\}$, where W(j) is the lowest common denominator of the fractions, i.e. the least common multiple of $\{d(i,j)\}$, and input i appears $W(j) \times n(i,j)/d(i,j)$ times in the list.
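The list expansion can be sketched for a single output. The shares below are hypothetical, and the notes do not say how an input's repeated entries should be ordered within the list, so this sketch simply groups them by input; `weighted_grant_list` is an illustrative name.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def lcm(x: int, y: int) -> int:
    return x * y // gcd(x, y)

def weighted_grant_list(shares):
    """shares: dict input -> Fraction f(i, j) for one output j.

    Returns (W, slots): W is the least common multiple of the denominators and
    slots lists each input W * f(i, j) times (grouped by input, an assumption).
    """
    w = reduce(lcm, (f.denominator for f in shares.values()), 1)
    slots = []
    for i in sorted(shares):
        slots.extend([i] * int(shares[i] * w))
    return w, slots

# Hypothetical shares at one output: the sum is 7/8 < 1, so one slot stays unused.
shares = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8)}
w, slots = weighted_grant_list(shares)
print(w, slots)
```

Because `Fraction` reduces n/d to lowest terms, W here may be smaller than the least common multiple of unreduced denominators, but the per-input proportions are unchanged.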


[Figure: the state of the input queues ($N^2$ bits) feeds N grant arbiters, whose outputs feed N accept arbiters, whose outputs are stored in the decision register.]

Figure 5.4: Interconnection of the input and output arbiters to construct the iSLIP scheduler

5.4.5 Implementation

Figure 5.4 shows how the iSLIP scheduler for an N × N switch is constructed from the input and output arbiters.

• The state memory records whether an input queue is empty. From this memory, an $N^2$-bit wide vector presents N bits to each of the N output grant arbiters, representing Step 1 (request).

• The grant arbiters select a single input among the contending requests to implement Step 2 (grant).

• The grant decisions are presented to the N accept arbiters, each of which selects at most one output on behalf of each input to implement Step 3 (accept).

• The final decision is stored in the decision registers and the values of the $a_i$ and $g_j$ pointers are updated. The decision register is used to notify each input which cell to transmit and to configure the crossbar switch.

Chapter 6

Network processors

Figure 6.1 is a logical diagram of how a network processor (NP, also called NPU) fits in a system design. The NP is located between the physical layer (MAC or framer) and the switch fabric. In the figure the Serializer/Deserializer (SERDES) is the interface between the NP and the switch fabric. The framer or MAC presents a packet to the NP, which must examine it, parse it, do the necessary edits and database lookups to enforce various policies at layers 3-7 (forwarding, queuing, labels), and exchange messages with the switch controller. The NP is in the data path.


Figure 6.1: Location of NP in a logical diagram. Source [17].

6.0.6 NP operation

Figure 6.2 shows a generic block diagram. Data of multiple physical interfaces or the switch fabric are transferred to/from the NP. The bitstream processors receive the serial stream of packet data and extract the information needed to process the data, such as MAC or IP source/destination address, TOS bits, TCP port numbers, MPLS or VLAN tags. The packet is then written into the packet buffer memory. The extracted information is fed to the processor complex, the programmable unit of the NP. Under program control, the processor may extract additional information from the packet and submit relevant information to the search engine, which looks up the MAC or IP address, classifies the packet, or does a VCI/VPI lookup using the routing/bridging tables. Upon packet transmission through the bitstream processor, the necessary modifications to the packet header are performed.

[Figure: blocks for the bitstream processors (to/from PHY or switch fabric), processor complex, search engine, hardware assists, buffer manager/scheduler, packet buffer memory, routing and bridging tables, and a general purpose CPU.]

Figure 6.2: Generic NP architecture. Source [15].

Figure 6.3: Time to process 40B packets at different line rates. Source [17]

6.0.7 Speed of operations

Figure 6.3 shows the time available to process back-to-back 40B packets at different line speeds. At 1 Gbps, the time to process one packet is 360 ns. Using 10-ns SRAM permits a maximum of 36 memory accesses per packet. Thus faster line rates can be accommodated only by processing several packets simultaneously in a pipelined or parallel fashion.


6.0.8 Packet buffer memory

For the architecture of figure 6.2, each packet header byte may traverse the memory interface at least four times:

• write inbound packet

• read header into processor complex

• write back to memory

• read for outbound transmission

So for 40-byte back-to-back packets the required memory interface capacity is 10-120 Gbps for line rates of 2.5-40 Gbps.

Chapter 7

Distributed Switch

The single switch fabric architectures cannot scale beyond 32 ports. Hence the need for distributed architectures. We'll study blocking and routing properties.

7.1 Blocking

A switch network is a graph of switches, each with a set of input and output ports as in Figure 7.1. There is a set of N input nodes and a set of M output nodes. Each internal link has a capacity of 1. A configuration C is a set of input-output pairs with routes,

$$C = \{(i_1, j_1, r_1), \ldots, (i_k, j_k, r_k)\},$$

with distinct inputs and outputs and disjoint routes $r_1, \ldots, r_k$ connecting $i_1$ to $j_1$, ..., $i_k$ to $j_k$.

A distributed switch (DS) is strictly non-blocking (SNB) if, given a configuration C and a pair (i, j) not in C, there exists a route from i to j disjoint from the routes in C. It is rearrangeably non-blocking (RNB) if, given any partial permutation of input-output pairs, there is a configuration that includes those pairs.

We first study modular architectures.

We first study modular architectures.

[Figure: a network of switches with input nodes 1, ..., N, output nodes 1, ..., M, and internal links of capacity 1.]

Figure 7.1: A distributed switch is a network of switches with a certain number of input and output ports, N input nodes and M output nodes

7.2 Clos network

This is a 3-stage network as illustrated in Figure 7.2. The Clos network is specified by 5 numbers (IN, N1, N2, N3, OUT). There are N1, N2 and N3 switches in the first, second and third stage, respectively. The number of input-output ports and the connectivity of the switches are as shown.

Theorem A Clos network with RNB switch modules is RNB iff

$$N_2 \ge \max\{\text{IN}, \text{OUT}\}.$$

A Clos network with SNB switch modules is SNB iff

$$N_2 \ge \text{IN} + \text{OUT} - 1.$$

The total number of input lines is IN × N1. The total number of output lines is OUT × N3.

The Clos network in the figure is SNB. It has 9 input lines and 8 output lines.
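The two conditions of the theorem and the line counts are easy to check mechanically. The sketch below encodes a Clos specification as a small tuple type and evaluates the network Clos(3, 3, 5, 4, 2) of the figure; the class and method names are illustrative, and the checks presuppose that the individual modules are themselves RNB or SNB.

```python
from typing import NamedTuple

class Clos(NamedTuple):
    ins: int   # IN: input lines per first-stage switch
    n1: int    # number of first-stage switches
    n2: int    # number of middle-stage switches
    n3: int    # number of third-stage switches
    outs: int  # OUT: output lines per third-stage switch

    def is_rnb(self) -> bool:
        """Rearrangeably non-blocking iff N2 >= max(IN, OUT)."""
        return self.n2 >= max(self.ins, self.outs)

    def is_snb(self) -> bool:
        """Strictly non-blocking iff N2 >= IN + OUT - 1."""
        return self.n2 >= self.ins + self.outs - 1

    def input_lines(self) -> int:
        return self.ins * self.n1

    def output_lines(self) -> int:
        return self.outs * self.n3

net = Clos(3, 3, 5, 4, 2)
print(net.is_snb(), net.is_rnb(), net.input_lines(), net.output_lines())
```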

[Figure: Clos (3, 3, 5, 4, 2), with IN = 3, N1 = 3, N2 = 5, N3 = 4, OUT = 2.]

Figure 7.2: A Clos network is fully specified by (IN, N1, N2, N3, OUT)

7.3 Recursive construction

We can recursively construct an N × N SNB switch with N = p × q input and output lines as in Figure 7.3. The result is a (p, q, 2p − 1, q, p) switch. It is SNB if each module is SNB.


[Figure: N = p × q lines on each side; q planes of p × (2p − 1) input modules, 2p − 1 planes of q × q middle modules, and q planes of output modules.]

Figure 7.3: Recursive construction of a SNB Clos network


[Figure: N = p × q lines on each side; q planes of p × p input modules, p planes of q × q middle modules, and q planes of p × p output modules.]

Figure 7.4: Recursive construction of a RNB Clos network

Figure 7.4 is an N × N RNB switch if each module is RNB.


[Figure: N inputs and N outputs; outer columns of 2 × 2 switches around two N/2 × N/2 subnetworks, giving 2 log2 N − 1 stages of N/2 2 × 2 switches.]

Figure 7.5: The Benes switch

Figure 7.5 is an N × N RNB switch made up of 2 × 2 switch modules.


[Figure: 4 × 4 Benes switch with connections 1 → 1 and 4 → 4 as shown; it cannot accommodate 2 → 3.]

Figure 7.6: Benes switch is not SNB

Figure 7.6 shows that a Benes switch is not SNB.


Figure 7.7 illustrates an algorithm to rearrange existing connections in order to accommodate a new connection.

Question 1: Can you supply a proof?

Question 2: Is there an algorithm to accommodate new connections in an arbitrary network of Figure 7.1?


[Figure: rearrangement steps numbered 1 to 5.]

Figure 7.7: Algorithm to add a new connection for a RNB switch


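In a Benes switch, feasible flows may require multiple paths. Figures 7.8 and 7.9 show this. Note:

$$\begin{bmatrix} e & 0 & 1-e & 0 \\ e & 0 & 0 & 1-e \\ 1-2e & 0 & e & e \\ 0 & 1 & 0 & 0 \end{bmatrix} = (1-2e) \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} + e \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} + e \begin{bmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}$$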
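The decomposition of the flow matrix into weighted permutation matrices can be verified with exact arithmetic. In the sketch below the three permutations are the ones forced by the coefficients (1 − 2e), e, e for small e; the value e = 1/10 and the helper names are arbitrary choices for the check, and columns are 0-based in the code.

```python
from fractions import Fraction

def permutation_matrix(cols):
    """cols[r] is the column (0-based) holding the 1 in row r."""
    n = len(cols)
    return [[1 if cols[r] == c else 0 for c in range(n)] for r in range(n)]

def combine(terms, n):
    """Sum coeff * permutation_matrix(cols) over (coeff, cols) pairs."""
    total = [[Fraction(0)] * n for _ in range(n)]
    for coeff, cols in terms:
        p = permutation_matrix(cols)
        for r in range(n):
            for c in range(n):
                total[r][c] += coeff * p[r][c]
    return total

e = Fraction(1, 10)
flow = [[e, 0, 1 - e, 0],
        [e, 0, 0, 1 - e],
        [1 - 2 * e, 0, e, e],
        [0, 1, 0, 0]]
terms = [(1 - 2 * e, (2, 3, 0, 1)),  # rows 1..4 map to columns 3, 4, 1, 2
         (e, (0, 3, 2, 1)),          # rows 1..4 map to columns 1, 4, 3, 2
         (e, (2, 0, 3, 1))]          # rows 1..4 map to columns 3, 1, 4, 2
print(combine(terms, 4) == flow)
```

Each permutation corresponds to one unsplit routing pattern, so a flow matrix needing three of them is carried over multiple paths.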


[Figure: 4 × 4 Benes switch with link flows labeled e, 1 − e, 1 − 2e, 2e and 1.]

Figure 7.8: Split flow 1


[Figure: 4 × 4 Benes switch with link flows labeled e, 1 − e, 1 − 2e, 2e and 1.]

Figure 7.9: Split flow 2


[Figure: network with unit-capacity links joining sources 1, 2, 3 to sinks 1, 2, 3.]

Figure 7.10: Max flow for single commodity is 3 and flows are integers; in multi-commodity case, max flows are 0.5 and non-integer

In a Clos switch, permutations can be achieved without splitting flows. In a general multi-commodity case this is not so. Figure 7.10 shows that if this is a single commodity problem, the maximum flow is 3 and all flows are 1 (integer).

However, if the flows are 1 → 1, 2 → 2, 3 → 3, the max flows are 0.5 each, and not integer.


[Figure: two copies of the network of figure 7.10 in parallel, each carrying flows of 0.5.]

Figure 7.11: Two copies of figure 7.10 are connected in parallel. Achieving flows of 1, 1, 1 requires splitting

Figure 7.11 shows that a feasible permutation may require splitting flows. The green and cyan flows must be connected in parallel similarly to the red flow.

Bibliography

[1] J. Walrand and P. Varaiya. Chapter 12, Switching. High performance communication networks, 2nd edition, 2000.

[2] M.J. Karol, M. Hluchyj and S. Morgan. Input vs output queueing on a space-division packet switch. IEEE Trans Comm, COM-35(12): 1347-56, Dec. 1987.

[3] T.E. Anderson, S. Owicki, J. Saxe and C.P. Thacker. High-speed scheduling for local area networks. ACM Trans Computer Systems, 11(4): 319-52, Nov. 1993.

[4] N. McKeown. iSLIP: a scheduling algorithm for input-queued switches. IEEE Trans Networking, 7(2), April 1999.

[5] N. McKeown, V. Anantharam and J. Walrand. Achieving 100% throughput in an input-queued switch. Proc. Infocom '96, vol 1: 296-302.

[6] B. Prabhakar and N. McKeown. On the speedup required for combined input and output queued switching. Automatica, 35(12), Dec. 1999.

[7] J.F. Hayes, R. Breault and M.K. Mehmet-Ali. Performance analysis of a multicast switch. IEEE Trans Comm, COM-39(4): 581-87, April 1991.

[8] B. Prabhakar, N. McKeown and R. Ahuja. Multicast scheduling for input-queued switches. J. Selected Areas in Comm, 15(5): 855-66, June 1997.

[9] M. Waldvogel, G. Varghese, J. Turner and B. Plattner. Scalable high speed IP routing lookups. ACM Sigcomm '97, September 1997.

[10] A. Demers, S. Keshav and S. Shenker. Analysis and simulation of a fair queueing algorithm. ACM Sigcomm '89, Computer Communication Review, 19(4): 1-12, 1989.

[11] A. Parekh and R. Gallager. A generalized processor sharing approach to flow control of integrated services networks: the single node case. IEEE Trans Networking, 1(3): 344-57, June 1993.

[12] A. Parekh and R. Gallager. A generalized processor sharing approach to flow control of integrated services networks: the multiple node case. IEEE Trans Networking, 2(2): 137-50, April 1994.

[13] S. Floyd and V. Jacobson. Random early detection. IEEE Trans Networking, 1(4): 397-413, August 1993.

[14] I. Stoica, S. Shenker and H. Zhang. Core-stateless fair queuing: achieving approximately fair bandwidth allocations in high speed networks. ACM Sigcomm '98, 1998.

[15] W. Bux, W.E. Denzel, T. Engbersen, et al. Technologies and building blocks for fast packet forwarding. IEEE Communications Magazine, 39(1): 70-77, January 2001.

[16] P.R. Kumar and S. Meyn. Stability of queuing networks and scheduling policies. IEEE Trans. Automatic Control, 40(2), February 1995.

[17] A. Deb. Building a network-processor based system. Integrated Communications Design, December 2000. Available at www.icdmag.com.
