
Design and Analysis of Terabit IP Switch
Project Report

Tungfai Chan, Jiannmin Ho, Jiacheng Hu, Heungwing Law
Information Networking Institute

Carnegie Mellon University
Pittsburgh, PA 15213

{tungfai, jiannmin, jiacheng, hlaw}@andrew.cmu.edu


Abstract

This report¹ describes the design and analysis of a terabit IP switching system supporting 512 OC-48 (2.4 Gb/s) links. The architecture provides scalability, with essentially the same cost function for configurations of up to 4096 ports [1]. The system also supports one-to-many multicast and quality of service (QoS) with optimal throughput and utilization, without large buffers or a complex design.

1. Introduction

Over the last three decades, data communication speeds have grown from 56 kb/s (ARPANET) to roughly 1 Gb/s today, a gain of more than a factor of 100 per decade. With the wide deployment of fiber optics, a speed of 1 terabit per second is only a short way down the road. As link bandwidth improves, the bottleneck is shifting to the switching system. At the same time, a variety of applications, including classical data applications, image retrieval, and real-time audio and video, has emerged in recent years, each demanding different latency and bandwidth. To facilitate the integration of these services, highly efficient switching systems that can support large networks with a wide range of requirements in a cost-effective way are in great demand.

These technology “push” and application “pull” effects motivate our design. The designed system has the following distinguishing characteristics.

Scalable design: the design can scale up to 4096 ports with essentially the same per-port cost, with complexity of O(N log N) [7].

Multicast switching: the system can support cost-effective one-to-many multicast without complicating the internal design.

Early Packet Discard: the system employs an efficient packet discarding technique. The algorithm allows our switch to achieve ideal throughput with small buffer requirement.

As mentioned above, one of the distinguishing features of our design is its scalability and cost effectiveness; but why are large switching systems of value in the first place? Can't a network of arbitrary size be built using switches with modest numbers of ports? The answer is that larger switches have an inherent cost and performance advantage when they are used to build large-scale networks [1]. If we are limited to smaller switches, the number of ports increases dramatically as the network grows. For efficient switch architectures, the total system cost is dominated by the per-port cost. Moreover, as the network grows, topologies such as hierarchical tree structures become more complex.

¹ This work was a course project of 18-757 Broadband Network, taught by Professor Hyong Kim at Carnegie Mellon University, Pittsburgh, PA.

This report is organized as follows. The conceptual design is discussed in Section 2. The simulator implemented for performance analysis of the design is discussed in Section 3. Simulation results and analysis are presented in Section 4. Section 5 gives the optimal parameters we concluded for the design, followed by future work and conclusions in Sections 6 and 7 respectively.

2. Structures and Design

Figure 1 shows the overall organization of the design, which consists of three main components. The Input Ports at the left receive packets from the incoming links and perform packet fragmentation; they then buffer the cells that are waiting to enter the switching fabric. Multicast handling also takes place in this component. The Output Ports are responsible for resequencing cells received from the switching network and performing packet reassembly; buffers queue up fragmented cells as well as reassembled packets waiting to be transmitted onto the outgoing links. The Switch Fabric is made up of switching elements, each with eight inputs, eight outputs, and a common buffer to resolve local contention. Each element switches cells to the proper output using information contained in the cell header, or distributes them for load balancing. Load balancing is performed in the first k - 1 stages of the 2k - 1 stages (where k = log8 of the number of ports). Adjacent switch elements employ a flow-control mechanism to regulate the flow of cells between successive stages, eliminating the possibility of cell loss inside the Switch Fabric. This approach reduces the amount of memory in the fabric, while larger buffers are required at the Output Ports.


Figure 1: Switch Architecture

2.1 Switching Fabric

Our current simulated design has 512 ports, with each port running at 2.4 Gb/s (OC-48). In this subsection, we focus primarily on the topology and the internal switching elements employed.

2.1.1 Switching Network

The switching network uses a Beneš network topology. We construct a 512-port network by taking eight copies of the 64-port network and adding a first and a fifth stage, one on either side, with 64 elements apiece. Output j of the i-th switch element in the first stage is connected to input i of the j-th 64-port network. Similarly, output j of the i-th 64-port network is connected to input i of the j-th switch element in the fifth stage. For N = 8^k, the Beneš network constructed from eight-port switch elements has 2k - 1 stages. Here k = log8 512 = 3, giving 5 stages, which is the best possible scaling characteristic. In [8], it is shown that when dynamic load distribution is performed in the first k - 1 = 2 stages of a Beneš network, the load on the internal data paths of the switching network cannot exceed the load on the external ports; that is, the network achieves ideal load balancing [1].

The Beneš topology allows a simple routing algorithm in the last k = 3 stages. We employ the self-routing scheme used in banyan networks, so no central control is required to route a cell through the switching network. Each of the last 3 stages looks at the corresponding 3 bits of the destination port number to determine the local output port. One drawback of a Beneš topology based on 8-port switch elements is that it grows in factors of 8; in other words, the next larger switch has 4096 ports.

Several planes can be run in parallel to achieve the desired fabric bandwidth. The signal lines, which carry the cell header, the congestion signal, and the more-to-send signal, are shared across the planes, and each cell is divided evenly among the planes; in other words, every plane performs exactly the same operation. For example, if the cell size is 64 bytes and there are 4 planes, each plane is responsible for 16 bytes. This has the advantage that increasing the number of planes does not complicate the design. Our current design runs at a speedup of 2 with 2 planes in parallel, which means each plane runs at 2.4 Gb/s. To support such bandwidth, 8 parallel chip-to-chip links of roughly 300 Mb/s each are employed for internal transmission. Some facts of the current design are listed below.

Stages: 5
Elements at each stage: 64
Number of planes: 2
Total number of elements: 640

2.1.2 Switching Element Design

As mentioned above, each switching element has 8 input and 8 output ports running at 2.4 Gb/s. To reduce the chance of head-of-line (HOL) blocking, a shared buffer is used inside each element. Two types of elements are used in the switching system: the Distributor and the Routing Element.

2.1.2.1 Distributor

This type of element is used in the first 2 stages. The distributor takes the packets at its inputs and distributes them evenly among its outputs, providing load balancing for the input ports of the switch elements in the next stage. In this way, the buffers are used efficiently and cells do not build up in one element while others sit empty. Most importantly, the distributor ensures that cells can be delivered over different routes, even though the last 3 stages have a banyan network pattern [3][4]. The algorithm used to achieve the distribution is Round Robin: when the distributor receives cells from its input ports, it forwards each one to the next available output port (one whose congestion signal is not asserted) in a monotonically increasing order. Because Round Robin by itself is deterministic,


each element uses (element position modulo 8) as its starting position. This prevents the situation in which all elements choose 0 as the starting index and forward all their cells into contiguous elements in the next stage; such clustering of cells quickly induces congestion and increases latency. The simulations show that the distributor is particularly useful for handling bursty traffic.
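To make this policy concrete, the following is a minimal C++ sketch of the round-robin output selection with a staggered starting index; the class and names are illustrative rather than taken from the simulator.

#include <array>
#include <optional>

// Illustrative sketch (not the simulator's code) of the distributor's
// round-robin output selection across its 8 output ports.
class Distributor {
public:
    explicit Distributor(int positionInStage)
        // Stagger the starting index (position modulo 8) so that neighbouring
        // elements do not all start at output 0 and cluster their cells.
        : next_(positionInStage % 8) {}

    // Returns the next output whose congestion signal is not asserted,
    // or std::nullopt if all downstream elements are congested.
    std::optional<int> pickOutput(const std::array<bool, 8>& congested) {
        for (int tried = 0; tried < 8; ++tried) {
            int port = (next_ + tried) % 8;
            if (!congested[port]) {
                next_ = (port + 1) % 8;   // continue the round robin from here
                return port;
            }
        }
        return std::nullopt;              // hold the cell this cycle
    }

private:
    int next_;
};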

2.1.2.2 Routing Element

This type of element provides the self-routing function of the system and is used in the last 3 stages. It looks at the corresponding 3 bits in the cell header to determine the local output port. For example, if the destination port is 309 (100110101 in binary), the third stage looks at the first 3 bits (100) and forwards the cell to local port 4; the fourth stage looks at the middle 3 bits (110) and forwards the cell to local port 6; the last stage uses the last 3 bits (101) to send the cell to local port 5.
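As an illustration of this self-routing step (not the simulator's code), the 3-bit field for each routing stage can be extracted from the 9-bit destination port as follows; the function name is ours.

// routingStage = 0 for the 3rd overall stage, 1 for the 4th, 2 for the 5th.
int localOutput(int destPort, int routingStage) {
    // Destination ports are 9 bits wide (512 ports); take the most
    // significant 3-bit group first.
    int shift = 3 * (2 - routingStage);
    return (destPort >> shift) & 0x7;
}

// Example from the text: destination port 309 = 100110101 (binary)
//   localOutput(309, 0) == 4   (bits 100)
//   localOutput(309, 1) == 6   (bits 110)
//   localOutput(309, 2) == 5   (bits 101)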

A congestion signal is used to notify the previous stage not to send any cells in the next cycle. This prevents buffer overflow in the current element and eliminates the possibility of cell loss. We introduce the following algorithm to assign the signal.

in_count[x]: number of cells currently buffered from input port x

Algorithm:
  after receiving all incoming cells of the current cycle and forwarding all possible cells:
    count the number of free buffer slots left
    if (free slots > switch size)
      congestion[x] = false for all input ports x
    else
      for the number of free slots, in two passes:
        pass 1: set congestion[x] = false where moretosend[x] == true and in_count[x] == 0
        pass 2: assign the remaining free slots as congestion[x] = false where moretosend[x] == true

Basically, this algorithm gives higher priority to input ports that have no cells inside the buffer at that instant. The intent is to bring greater diversity of cells into the buffer and avoid head-of-line blocking. The reasoning is that when a switching element has fewer than 8 free buffer slots, it is entering a congestion state, which implies there is head-of-line blocking within the buffer. Since a shared buffer is used, the only situation in which HOL blocking occurs is when cells are destined for the same local output port. Bringing in cells from different input ports helps balance the mix of cells and thus can reduce HOL blocking. In the worst case, this algorithm still performs no worse than not using it.
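The following C++ sketch mirrors the congestion-signal assignment described above; it is a minimal illustration with our own names, not the simulator's implementation.

#include <array>

constexpr int kFanout = 8;

void assignCongestion(int freeSlots,
                      const std::array<int, kFanout>& inCount,     // cells buffered per input port
                      const std::array<bool, kFanout>& moreToSend, // upstream has a cell ready
                      std::array<bool, kFanout>& congestion) {
    congestion.fill(true);                      // default: hold every input back
    if (freeSlots > kFanout) {                  // enough room for a full round of cells
        congestion.fill(false);
        return;
    }
    int grants = freeSlots;
    // First pass: favour inputs that have nothing buffered yet, to keep the
    // shared buffer diverse and reduce head-of-line blocking.
    for (int x = 0; x < kFanout && grants > 0; ++x) {
        if (moreToSend[x] && inCount[x] == 0) {
            congestion[x] = false;
            --grants;
        }
    }
    // Second pass: hand any remaining slots to the other inputs with cells to send.
    for (int x = 0; x < kFanout && grants > 0; ++x) {
        if (moreToSend[x] && congestion[x]) {
            congestion[x] = false;
            --grants;
        }
    }
}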

2.1.3 Flow Control Mechanism

The flow control mechanism we employ eliminates cell loss within the fabric and greatly improves throughput. We adopt binary feedback together with Early Packet Discard [1] as the primary flow control model. As mentioned above, the congestion signal notifies the previous-stage element not to send in the next clock cycle; this signal provides feedback to the previous stage regarding the current status. Once the previous stage becomes congested, it follows the same model. In this fashion the signal propagates backward until it eventually reaches the input queue. When the input queue becomes congested, dropping takes place at the port, not in the fabric. This mechanism brings two advantages to the switching system. First, the backward pressure ensures that no cell loss occurs within the switching fabric and pushes the dropping responsibility to the input port, which maintains goodput. When any cell of a packet is dropped, the whole packet becomes useless; if a cell were dropped in the fabric, there would be no way to notify the other elements to discard the remaining cells of that packet, because there is no central controller. This has two downsides: it wastes valuable bandwidth that could carry other cells, and it consumes buffer space at the output, where the orphaned cells wait to be reassembled. It is therefore preferable to drop packets as a whole at the input, where they have not yet consumed any resources. Second, it centralizes the dropping algorithm in the input port, which makes the switching system more configurable and reduces the complexity of the internal switching elements even if sophisticated dropping methods are used.

2.2 Input Port

The design of the input port turned out to be more complicated than initially expected because the system is input-queue oriented. In this section, besides presenting the final design, we discuss the evolution of the design and the rationale behind the model.

2.2.1 Initial Design

At the beginning, the function of our input queue was very simple and straightforward. It receives the incoming packet, segments it into fixed-size cells (Section 2.2.5), adds a header to each cell, and stores the


cells in the buffer. It is a 'First In, First Out' queue. When there is no room to accommodate an incoming packet, the packet is dropped.

2.2.2 Dropping algorithm

From our first round of simulation results, we found that in many cases the input-queue latency dominates the total latency, so we tried to find ways to reduce it. The most straightforward way is to reduce the buffer size, but this increases the drop rate, which is not an effect we want. Therefore, we chose to implement the 'Drop-head' algorithm in our input queue to replace the original 'Drop-tail' algorithm. The main principle of the algorithm is to drop the oldest un-sent packet. We adopted this algorithm for two reasons. First, we assume that the majority of the traffic is IP packets that will be re-transmitted after a timeout, so it is realistic to drop the oldest packet. Second, if the oldest packet is under transmission, it is pointless to drop its remaining un-sent cells, because the cells already sent could then never be reassembled into a whole packet; in this case, we drop the second-oldest packet as a whole. From the simulation results, we observed that the average latency of the dropped packets is almost twice the overall average latency, and the total latency decreased to varying extents. These results suggest the implementation achieved its goal. (A slight surprise is the small improvement in drop rate. Our explanation is that when a big packet arrives at the input queue, 'drop head' has a chance to drop a small packet and accommodate the big one, whereas 'drop tail' simply drops the incoming big packet.)
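A minimal sketch of the 'drop head' policy follows; the types and the handling of an un-droppable arrival are our assumptions, not the simulator's exact code.

#include <deque>

struct QueuedPacket {
    int cells;       // packet length in cells
    int cellsSent;   // > 0 once transmission into the fabric has begun
};

// Frees room for an arriving packet by dropping the oldest packet that has
// not started transmission. Returns true once enough room has been freed.
bool dropHead(std::deque<QueuedPacket>& queue, int& freeCells, int incomingCells) {
    while (freeCells < incomingCells) {
        // Find the oldest packet that is still completely un-sent.
        auto victim = queue.end();
        for (auto it = queue.begin(); it != queue.end(); ++it) {
            if (it->cellsSent == 0) { victim = it; break; }
        }
        if (victim == queue.end())
            return false;              // nothing droppable; drop the arrival instead
        freeCells += victim->cells;    // reclaim the victim's buffer space
        queue.erase(victim);
    }
    return true;
}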

2.2.3 Variation

While considering the fairness of our dropping policy in the input queue, we came up with another variation: 'drop head with priority'. The main difference from the pure 'drop head' implementation is that if the incoming packet cannot find a buffered packet of lower or equal priority to drop, the incoming packet itself is dropped. The goal of this variation is to ensure a lower drop rate for higher-priority packets.

2.2.4 Multicast

We spent considerable time surveying different methods of implementing multicasting, as it is one of the major goals of this project. In our

original design, the output port had a recycling path to transmit the packet back to the input queue, and based on this mechanism we could multicast by recycling the packet [8]. According to the multicast ID, the internal switch would duplicate the packet for different destinations. This idea seemed neat until we considered the routing table (it is large and expensive to put in every internal switching element, and keeping it consistent is also a big concern), the redundant traffic needed to make the duplicate copies, and fairness. What if a recycled packet finds the input queue full: can we send only part of the multicast packet? We came up with many variations of the original design, but none of them satisfied us. Finally, we arrived at a completely new design.

Our implementation puts the multicast table in the input queue. The multicast table has the following format.

m-cast ID dest-1 dest-2 ... dest-n

Based on the m-cast ID, the input queue sends the first cell of the packet to dest-1, dest-2, ..., dest-n sequentially, then the second cell, and so on. Since multicast and unicast packets share the same input queue, we do not want a unicast packet to wait in the queue for a long time just because a multicast packet is in front of it. Therefore, in our design we separate these two types of traffic.

Every multicast packet occupies the input queue much longer than a unicast packet does, so we define a uni_multi_ratio to prevent multicast packets from dominating the whole switch. The basic scheduling algorithm uses a counter to maintain a fixed ratio between multicast and unicast traffic; if no packet of the scheduled kind is waiting to be sent, a packet of the other kind can be sent instead. The algorithm for unicast is as follows.

if (unicast's turn) {
    if (unicast queue not empty)
        send unicast packet
    else
        send multicast packet
}

Multicast packets follow the same algorithm.
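The counter-based selection can be sketched in C++ as below. This is illustrative code with our own names; we assume here that a uni_multi_ratio of r gives r - 1 unicast turns per multicast turn, which matches the reading of ratio 2 as 1:1 in Section 4.6.

#include <queue>

struct Cell {};

class InputScheduler {
public:
    explicit InputScheduler(int uniMultiRatio) : ratio_(uniMultiRatio) {}

    // Called once per cycle; returns true if a cell was sent.
    bool sendOne(std::queue<Cell>& unicast, std::queue<Cell>& multicast) {
        bool unicastTurn = (counter_ % ratio_) != 0;   // e.g. ratio 4 -> 3 unicast : 1 multicast
        ++counter_;
        std::queue<Cell>* first  = unicastTurn ? &unicast  : &multicast;
        std::queue<Cell>* second = unicastTurn ? &multicast : &unicast;
        // No wasted cycle: if the scheduled class is empty, serve the other one.
        if (!first->empty())  { first->pop();  return true; }
        if (!second->empty()) { second->pop(); return true; }
        return false;
    }

private:
    int counter_ = 0;
    int ratio_;
};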

This algorithm has the following advantages:

1) Easy to implement. All we have to do is keep a link to the unicast queue and the multicast queue and, according to the counter, send a packet from the corresponding queue.

2) It greatly simplifies the design of the internal switch and output port. The internal switching elements do not have to distinguish multicast from unicast packets, and they need neither a copy function nor a routing table. We also do not need a recycling path from the output queue to the input queue.

3) Guaranteed bandwidth for unicast and multicast. Multicast packets will not overwhelm the switch, and starvation won’t occur in this method.

4) Limited overhead. Our design incurs no extra traffic to transmit multicast packets, and the latency stays within a reasonable range.

5) No wasted cycles. Whenever packets are waiting in either queue, at least one of them can be sent; the scheduling algorithm never wastes a chance to send.

2.2.5 Fixed Size Cell

Every packet coming into the input port is fragmented into a number of fixed-size cells. Each cell has a 64-byte payload. The header contains the input port, packet identifier, sequence number, and destination port. The header format is:

Input Port: 9 bits
Packet ID: 9 bits
Seq. Num: 5 bits
Dest. Port: 9 bits

The input port and destination port contribute 9 bits each because there are 512 ports in the switching system. Five bits suffice for the sequence number because the largest packet entering the switch is 1536 bytes, which fragments into 24 cells. The number of bits for the packet ID is flexible; it probably does not need 512 identifiers, and spare bits could be used for other control signals. Altogether the header contributes 4 bytes of overhead per 64-byte payload, which equals 5.88%.
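A minimal sketch of packing these fields into the 4-byte header is shown below; the ordering of the fields within the 32-bit word is our assumption.

#include <cstdint>

// Pack the 9 + 9 + 5 + 9 = 32 header bits into one word.
std::uint32_t packHeader(std::uint32_t inputPort,  // 0..511 (9 bits)
                         std::uint32_t packetId,   // 9 bits
                         std::uint32_t seqNum,     // 0..23  (5 bits)
                         std::uint32_t destPort) { // 0..511 (9 bits)
    return (inputPort & 0x1FF) << 23 |
           (packetId  & 0x1FF) << 14 |
           (seqNum    & 0x1F)  << 9  |
           (destPort  & 0x1FF);
}

// The routing elements only need the destination port field.
std::uint32_t destPortOf(std::uint32_t header) { return header & 0x1FF; }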

2.2.6 Further improvement

However hard we try to reduce the latency induced by long waiting times in the input queue, the nature of a multicast packet (every cell stays in the queue until the copy for the last destination is sent) still causes much longer latency than for unicast packets. Here are two improvements we tried in our simulation to reduce the average latency.

2.2.6.1 Increase Traffic Ratio

Increase the scheduling ratio for multicast packets. Since multicast latency is much longer than unicast latency, giving multicast packets more chances to send decreases their average latency. The price is an increase in the average latency of unicast traffic. This method is therefore most useful under light multicast traffic, where it improves the total average latency significantly without hurting unicast latency very much.

2.2.6.2 Send Whole Packet

Send the whole packet to each destination in turn. As shown in Figure 2, the first destination receives all of its cells and the packet can be sent onward immediately, but the whole packet occupies the buffer much longer, until the copy for the last destination has been sent, which means a slightly higher drop rate. However, if each multicast packet has only a few destinations, this method is a good trade.

Figure 2: Send Whole packet

2.3 Output Port

In our input-queueing switch design, the output port is relatively straightforward and simple. Its basic tasks are cell reassembly and quality-of-service (QoS) scheduling. The former is trivial, so we focus on the scheduling algorithm here.

2.3.1 Scheduling Algorithm

In order to achieve cost effectiveness and QoS support, we adopt the Binary Scheduling Wheels algorithm with Fast Forward (BSW) from [2]. Since IP is a connectionless protocol with no flows defined for the traffic, the output scheduler uses the packet type (the ToS field in the IP header) to differentiate traffic types. Thus, the output port aggregates traffic into classes and performs scheduling on a per-class basis (DiffServ) rather than per-flow (IntServ).

Figure 3: Binary Scheduling Wheels (adapted from [2])

Each traffic class is assigned a power-of-2 weight and, during overload periods, shares the link bandwidth in proportion to its weight. Instead of forwarding as many cells as the weight specifies once a queue is selected, the BSW algorithm places queues on scheduling wheels with different weights and alternates among the wheels. Figure 3 shows an example with W binary scheduling wheels. Each small box in the figure represents a list node containing a queue identifier that identifies a non-empty per-traffic-class queue. Once a scheduling wheel is selected, all queues on that wheel may forward one packet to the output [2]. The algorithm is shown below.

Initially:
    PreviousCounter = 0;
    CurrentCounter = 0;
    Mask: bit i = 1 if scheduling wheel i is non-empty; 0 otherwise;

Loop:
    CarryIn = position of the least significant 1 bit of Mask;
    PreviousCounter = CurrentCounter;
    CurrentCounter = CurrentCounter + CarryIn;
    ChangingBits = PreviousCounter XOR CurrentCounter;
    CurrentMask = Mask;
    While ((CurrentMask AND ChangingBits) != 0)
        CurrentWheel = position of the least significant 1 bit of (CurrentMask AND ChangingBits);
        Serve all queues in scheduling wheel CurrentWheel;
        CurrentMask[CurrentWheel] = 0;
        If (scheduling wheel CurrentWheel becomes empty)
            Mask[CurrentWheel] = 0;
    If (a new queue is added to an empty scheduling wheel j)
        Mask[j] = 1;

(adapted from [2])

2.4 Physical Dimension

Four types of module components are used to assemble the switching fabric: 1) functional module boards, each with 8 switching elements and other control or buffer chips; 2) stage-to-stage interface cards; 3) interface cards for the port cards; and 4) cooling fans. The estimated total dimension of the switching fabric is 480 mm x 600 mm x 760 mm.

Adopted Port Card: Lucent WaveStar™ cross-connect system
Adopted Routing Switching Element: IBM 3209K4080

3. Simulator

To facilitate different simulation configurations, all parameters are defined as execution arguments; as a result, we do not have to re-compile the simulation code every time we try a different configuration.

In addition, we define a 'statistic cycle' in the simulation, so we can collect several sets of statistical data sequentially in one simulation run.

3.1 Traffic Module and Input Port

Every input queue is associated with a single traffic module, so they are discussed together in this section. At the beginning of every 1000 cycles (1000 x 100 ns/cycle), we fetch all events that occurred during that period from the traffic module and put them into a traffic queue associated with each individual input port. Then, at every cycle, the input port checks whether any event occurs in the traffic queue; if so, it segments the packet into cells and stores them in the buffer. After that, the input queue puts a cell at the entry of the switch if any cell is waiting to be sent.

3.2 Switching Fabric

The switching fabric is designed with a modular architecture. We use has-a relationships to make future changes or improvements easier to adopt. The objects are: SwitchFabric has-a SwitchStage, which has-a SwitchElement. Each switching element knows its position in terms of stage number and position within the stage. These variables are used to calculate the outgoing links during cell


forwarding. Most functionality is pushed down to the element level, so that future changes can be applied easily and the modifications are transparent to the upper levels; most importantly, it keeps debugging simpler. The main routine that drives the whole switching-system simulation is SwitchElement::processPackets(). This function is invoked by the upper levels (stage and fabric); it causes the switching element to retrieve the headers from the incoming links, process them according to the algorithms defined above, and place them on the appropriate outgoing links.

The inter-element communication links are implemented using an array of link[] objects. Each link object is responsible for header transmission between stages, from the input ports, and to the output ports. Since our design has 512 input and output ports, there is a static array of 512 header slots within each link object; each slot simulates an actual chip-to-chip link. Each element can simply retrieve the cell headers from the corresponding links at its stage. For example, consider a switching element at stage 2, position 5: its 8 input links are link[0].slot[40-47]. A simplified example using a 4x4 switch is shown in Figure 4. However, finding the outgoing link indexes is not as trivial as the incoming ones, as it depends entirely on the topology. We have implemented a routine outLinkIndex() to compute the indexes.

int SwitchElement::outLinkIndex(int port) {
    // pos = element position within its stage, stage = 0..4, size = 8 (element fan-out)
    int group_id = pos / size;
    int group_output = (pos % size) * size + port;
    switch (stage) {
    case 0:
    case 3:
        return group_output * size + group_id;
    case 1:
    case 2:
        return (port * size) + (pos % size) + (pos / size) * size * size;
    case 4:
        return (pos * size + port);
    }
    return -1;   // not reached for valid stages 0-4
}

Figure 4: Simplified Example

3.3 Output Port

As mentioned in Section 2, the output port is responsible for storing fragmented cells and reassembling packets. Because cells arrive in a different order, resequencing would require searching through the queue every time a cell comes out of the switching fabric. Searching is time-consuming and would be the bottleneck of the simulation, so we adopted a hash-table approach in the simulation design.

Each packet is uniquely identified by its input port and packet identifier. A simple hash function provides an index into the hash table. The maximum number of buckets is the number of packet identifiers times the number of input ports. Though this contributes a significant part of the memory usage, the usage is fixed regardless of the output queue size. Most importantly, the computational cost for each incoming cell is O(1), because all that is required is to compute the hash index and go to the specific bucket. When a complete packet can be reassembled, that bucket is cleared and the packet is inserted into the output event queue.
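The direct-indexed table can be sketched as follows; the constants come from the header format in Section 2.2.5, while the structure and names are illustrative rather than the simulator's.

#include <vector>

constexpr int kPorts     = 512;   // 9-bit input port
constexpr int kPacketIds = 512;   // 9-bit packet identifier
constexpr int kMaxCells  = 24;    // 1536-byte packet / 64-byte cells

struct ReassemblyBucket {
    int cellsReceived = 0;
    int cellsExpected = 0;
    // payload storage omitted
};

// The (input port, packet ID) pair maps to a unique bucket, so each incoming
// cell is placed in O(1) time with no collisions.
inline int bucketIndex(int inputPort, int packetId) {
    return inputPort * kPacketIds + packetId;
}

// One table entry per possible (port, packet ID) pair: 512 x 512 buckets.
std::vector<ReassemblyBucket> table(kPorts * kPacketIds);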

The output event queue is used to schedule the outgoing packets; this is also where the weighted fair queueing takes place. To ensure accuracy, we use silent_count to keep track of the number of cycles a packet needs to be put onto the outgoing link. This count decrements each cycle, and during this period all other packets must wait until the count reaches zero again. Though in reality the output port is not synchronized with the fabric, using this cycle approximation greatly simplifies the simulation and reduces the computational cost.
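A minimal sketch of this pacing, with our own names, is given below; it only illustrates the silent_count bookkeeping, not the full weighted fair queueing.

#include <queue>

struct OutputEvent { int cells; };   // cycles needed to put the packet on the link

class OutputLink {
public:
    void push(const OutputEvent& e) { events_.push(e); }

    // Called once per simulation cycle.
    void tick() {
        if (silentCount_ > 0) { --silentCount_; return; }   // link still busy
        if (!events_.empty()) {
            silentCount_ = events_.front().cells;           // occupy the link for this many cycles
            events_.pop();
        }
    }

private:
    std::queue<OutputEvent> events_;
    int silentCount_ = 0;
};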

3.4 Statistic Data collection


Every cycle, we use integer variables, the 'operation variables', to collect all statistical data such as latency, queue size, throughput, and drop counts. Every 5000 cycles, we store these data into double variables, the 'store variables', and reset the 'operation variables'. The reasons for having two sets of similar variables are:

1) Integer operations are much faster than floating-point operations, so we use integer variables for the frequent accounting operations.

2) The range of an integer variable is limited, so we must transfer the data into the double variables before the 'operation variables' overflow.

To get more stable and accurate information, we also define a 'statistic cycle' and collect data once per statistic cycle; to ensure the data are stable and accurate, we can skip the first set of data. Probably because our simulations run long enough (at least 1,000,000 cycles per simulation), the difference between the data from the first statistic cycle and the rest is not very significant.

3.5 Multicast Simulation

Since the traffic module does not provide multicast information, we simulate multicast with the following procedure. First, we redefine the maximum destination port address to be higher than the total number of ports; any packet whose destination port address is greater than the number of real ports is treated as a multicast packet. Taking our switch as an example, the ports are numbered 0 to 511 and we can define the maximum port number to be 519. In this case, under uniform traffic, 8/520 of the incoming packets become multicast packets, and any packet with a destination port address greater than 511 is taken as multicast. Next, for each multicast packet we randomly generate how many destinations it will be sent to, and we also randomly generate the destination port addresses. Finally, we keep a counter to decide which kind of packet can be sent in the current cycle.
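A sketch of this destination generation is shown below; it is illustrative only, and the fan-out range of 2 to 16 and the use of std::mt19937 are our assumptions.

#include <random>
#include <vector>

constexpr int kRealPorts = 512;   // ports 0..511
constexpr int kMaxPort   = 520;   // addresses 512..519 mark multicast (8/520 of packets)

std::vector<int> destinationsFor(int rawDest, std::mt19937& rng) {
    if (rawDest < kRealPorts)
        return {rawDest};                                  // unicast: one real destination
    // Multicast: pick a random fan-out and random real destination ports.
    std::uniform_int_distribution<int> fanout(2, 16);      // fan-out range is an assumption
    std::uniform_int_distribution<int> port(0, kRealPorts - 1);
    std::vector<int> dests(fanout(rng));
    for (int& d : dests) d = port(rng);
    return dests;
}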

4. Simulation Result and Analysis

Because analytical models of our switch are complex, we use the computer simulator to evaluate the performance of the switch under a number of scenarios by tweaking different parameters.

4.1 Internal Buffer Size in Each Switching Element versus Three Traffic Loads

In this scenario, we use uniform output port distribution traffic to evaluate our switch. The simulation conditions are as follows.

Speedup: 2
Input Queue: 512 slots
Output Queue: 2048 slots (1 slot = 64 bytes)
Hot Spot: 0.001953 (uniform probability)
Burst Length: 50000
Load: 0.5/0.7/0.9
Buffer: 8, 16, 32, 64, 128, 256 slots
Simulation time: 100 ms

The performance of the switch is evaluated by its throughput, average latency, and average drop rate; the following diagrams show these three performance results. As shown in the throughput diagram, Figure 5, the throughputs for the different traffic loads are almost equal to the offered loads once the internal buffer size is increased to 32 cells or more. This is consistent with the average drop rate diagram, Figure 6, which shows each average drop rate approaching zero when the buffer size exceeds 32 cells. The larger the buffer deployed inside each element, the more cells can be accommodated in the switching fabric; therefore the drop rate is reduced and the throughput approaches 100%. The reason the average latency approaches zero when the buffer size exceeds 32 cells can be explained by its two contributions, the average input-queue latency and the average switching-fabric latency, shown in Figures 8 and 9. The dominant contribution is the input-queuing latency: more buffering inside the switching fabric prevents congestion from propagating back to the input queue and delaying downstream cells from entering the switch fabric.

Figure 5. Throughput versus buffer size

However, the way the switching-fabric latency varies with buffer size was unexpected. As the buffer size increases from 8 to 16 or 32 cells, the latency increases because more cells can wait inside the buffer for a longer time before being transmitted. The decrease in latency beyond a buffer size of 16 or 32 cells is not what standard queueing models would predict; however, it reveals one feature of our switch design, the backward congestion signal.

Figures 10 and 11 show histograms of the queue-size distribution for the five stages of the switching fabric, using 32-cell and 64-cell buffer sizes under a load of 0.9. From the 64-cell diagram, we can see that the buffer seldom fills up to its 64-cell capacity. After decreasing the buffer size to 32 cells, however, congestion clearly shows up at the 5th stage and becomes progressively worse back toward the 3rd stage. This results from the congestion-control signal prohibiting cells from the previous stage from entering the newly congested stage. The prohibited cells accumulate at the previous-stage elements and cause still more cells to be held back at the stages before that (which is why most cells accumulate in the 3rd stage). The effect is similar to a reduced service rate at each output port of each switching element in the routing stages. It also indicates that congestion in the routing stages takes a long time to recover to the previous uncongested state, since during congestion the self-routing elements suffer not only internal blocking but also lower service rates. Thus, in the case of a 32-cell buffer, the routing elements in the 3rd stage are in a state of congestion, and it also propagates back to the 2nd stage; each cell takes a longer time to get through the switching fabric. In the case of a 64-cell buffer, even though there is still some chance of the occupancy exceeding 32 cells, the transient time should be shorter than in the 32-cell case because the buffer is large enough to absorb the transient burst traffic. Therefore, congestion only infrequently propagates back to the previous stage.

Figure 6. Average drop rate versus buffer size
Figure 7. Total average latency versus buffer size
Figure 8. Switching fabric latency versus buffer size
Figure 9. Average input queue latency versus buffer size

This situation could be analyzed approximately using a concatenated input-queueing switch model; however, we skip that analysis due to its difficulty and lack of time, and leave it as future work. Under this simulation condition, the 64-cell buffer size is shown to be the best choice, but only for this specific condition.

4.2 Hot Spot

In this scenario, we use non-uniform output port distribution traffic to evaluate our switch. The traffic to one specific output port is five times higher than that to the other output ports. The simulation conditions are as follows.

Speedup: 2
Input Queue: 512 slots
Output Queue: 2048 slots (1 slot = 64 bytes)
Hot Spot: 0.01 (higher probability to port 0)
Burst Length: 50000
Load: 0.5/0.7/0.9
Buffer: 8, 16, 32, 64, 128, 256 slots
Simulation time: 100 ms

The throughputs shown in Figure 12 are almost the same, only about 0.4, for all three loads. The higher the traffic load the input ports offer, the higher the resulting drop rate. From Figure 13, we find that the overall latency decomposes into two contributions, input-queuing latency and switching-fabric latency. From an 8-cell to a 32-cell buffer size, the first part of the total average latency decreases because of the former contribution; beyond a 32-cell buffer size, the second part of the total average latency increases because of the latter contribution.

Figure 10. Queue size histogram for 32-cell buffer size
Figure 11. Queue size histogram for 64-cell buffer size
Figure 12. Throughput versus buffer size with hot spot
Figure 13. Average drop rate versus buffer size with hot spot

As for the input-queuing contribution: as the buffer size increases, the switching fabric has more space to accommodate bursty traffic and the internally blocked traffic resulting from the hot spot. Incoming cells from the input ports can be transmitted into the switching fabric with less waiting time in the input queues. Thus, this latency decreases as the buffer size increases.

From Figure 14, we can observe that within the switching fabric, the larger the buffer size, the higher the latency. This again indicates that the routing elements handle congestion poorly (due to internal blocking in this case), and a larger buffer only increases the queue size in each element during the longer congestion period. We can also expect that a buffer size beyond some larger threshold (greater than the 256-slot maximum in our simulation) would be needed to overcome the hot-spot problem; however, we skipped this because of time and memory constraints.

4.3 Burst Length

In this scenario, we use six values of burst length with uniform and non-uniform destination traffic to evaluate the throughput and average latency of our switch. We use a 64-cell buffer size for all conditions, since it resulted in full throughput and almost zero latency. The simulation conditions are as follows.

Speedup: 2
Input Queue: 512 slots
Output Queue: 2048 slots (1 slot = 64 bytes)
Hot Spot: 0.01 (higher probability to port 0)
Burst Length: 20000/50000/100000/150000/200000/500000
Load: 0.9
Buffer: 64 slots
Simulation time: 100 ms

From Figure 15, there is no difference from the previous results under either uniform or non-uniform traffic conditions. As for the latency, the latency without the hot spot is almost zero, the same as the previous result, and the latency with the hot spot is also almost the same as before. We find that burst length has no negative performance effect on our switch in the hot-spot scenario under this simulation condition. Since burst traffic is distributed by the first two distribution stages, the incoming traffic does not concentrate in specific switching elements once the bursty cells have entered the switching fabric.

4.4 Traffic Load

In this scenario, five traffic loads with uniformly distributed destinations and a 64-cell buffer size are used to evaluate the switch. The simulation conditions are as follows.

Speedup: 2
Input Queue: 512 slots
Output Queue: 2048 slots (1 slot = 64 bytes)
Hot Spot: 0.001953 (uniform probability)
Burst Length: 50000
Buffer: 64 slots
Load: 0.1/0.3/0.5/0.7/0.9
Simulation time: 100 ms

Figure 14. Average latency versus buffer size with hot spot
Figure 15. Throughput versus burst length
Figure 16. Throughput versus load

The throughputs are almost equal to the offered traffic loads. We find that a 64-cell buffer size, together with the other simulation conditions, is quite suitable for producing 100% throughput in the switch. Please refer to Figure 16.

4.5 Average latency of packets under uniform traffic with multicast support

In this scenario, we look at the average packet latency under uniform traffic when the multicast function is turned on. The parameters are as follows:

Speedup: 2
Input Queue: 512 slots
Output Queue: 2048 slots (1 slot = 64 bytes)
Hot Spot: 0.001953 (uniform probability)
Burst Length: 50000
Load: 0.9
Buffer: 64 slots
Unicast:Multicast Scheduling: 3:1
Unicast:Multicast Ports: 512:8/16/32/64/128
Simulation time: 100 ms

In Figure 17, the x-axis shows the ratio of multicast ports to unicast ports, and the y-axis is the average latency in clock cycles. We can see that as more multicast packets come in, the average latency increases, and the major increase in latency comes from the incoming multicast packets themselves.

As described before, a multicast packet remains in the input queue until the copy for the last destination in its multicast list is switched out. The significant increase in the latency of multicast packets is expected, because the scheduling ratio sets a bound on multicast bandwidth. The increase in the latency of unicast packets arises because, as multicast packets increase, they use up the guaranteed bandwidth specified by the scheduling ratio. Under a low multicast traffic load, a unicast packet can use not only its allocated bandwidth but also the bandwidth left unused by multicast; as multicast traffic increases, multicast reclaims its allocated bandwidth, so the latency for unicast increases slightly. In this simulation the ratio is 3:1, which implies guaranteed bandwidth shares of 75% for unicast and 25% for multicast. However, the increase in the latency of unicast traffic is bounded thanks to its guaranteed share.

4.6 The effect of scheduling between multicast and unicast packets on average latency

Speedup: 2
Input Queue: 512 slots
Output Queue: 2048 slots (1 slot = 64 bytes)
Hot Spot: 0.001953 (uniform probability)
Burst Length: 50000
Load: 0.9
Buffer: 64 slots
Unicast:Multicast Scheduling: 1/3/5/7/9:1
Unicast:Multicast Ports: 512:64
Simulation time: 100 ms

In this scenario, we look at how the scheduling between multicast and unicast packets affects average latency. The scheduling ratio defines a bound on the bandwidth allocated to unicast and multicast. It is not a strict bound; rather, it specifies a guaranteed bandwidth, the minimum reserved for each traffic type. In Figure 18, the x-axis is the uni_multi_ratio, which indicates how often a multicast packet is served. As the ratio

Figure 17: Multicast Ports vs Latency
Figure 18: Scheduling vs Latency


increases, the scheduler favors unicast traffic more, so the latency of multicast increases. Conversely, unicast traffic has lower latency because it has more chances to enter the switching fabric. A uni_multi_ratio of 2 (a 1:1 ratio, where multicast and unicast have equal chances to be served) gives the lowest combined latency, because the two classes share the bandwidth equally. It is therefore reasonable to choose this ratio as the optimal design parameter for our switch.

4.7 The effect of Drop Head vs. Drop Tail

Speedup: 2
Input Queue: 512 slots
Output Queue: 2048 slots (1 slot = 64 bytes)
Hot Spot: 0.001953 (uniform probability)
Burst Length: 50000
Load: 0.5/0.7/0.9
Buffer: 8, 16, 32, 64, 128, 256 slots
Dropping: Drop Head algorithm
Simulation time: 100 ms

In this scenario, we look at how Drop Head and Drop Tail affect average latency under uniform traffic. In Figure 19, the thick lines show the average latency when Drop Head is used under different loads, and the thin lines show the average latency when Drop Tail is used. We can see that using Drop Head as the dropping algorithm reduces the total average latency significantly.

5. Optimal parameters

Throughout a series of simulations, we concluded that using a 64-slot buffer inside each switching element and input port achieves optimal and cost-effective results. In this section we take our concluded optimal parameters and discuss the switch's performance in various external environments. The uni_multi_ratio is set to 2, which means that multicast and unicast packets have an equal chance of being switched; in the simulations, this gives the lowest total average latency.

Figure 19: Buffer Size vs Latency
Figure 20: Even Traffic
Figure 21: Hotspot Traffic
Figure 22: Throughput and Drop Rate

No priority scheduling is used in our final design. The reason is that searching for a lower-priority packet in the input queues is an expensive operation; it also complicates the implementation, which conflicts with our scalability goal.

5.1 Varying load under evenly distributed traffic

Under evenly distributed traffic, our switch obtains optimal throughput most of the time. A very slight packet drop can be observed when the switch is fully loaded (100% load).

5.2 Varying hotspot rate under same load

This graph shows the effect of unevenly distributed traffic on our switch. Under uniform traffic (hotspot = 0.00195312), throughput is 100% and no packet is dropped. When unevenly distributed traffic arrives, throughput drops significantly and the drop rate increases at the same time.

5.3 Varying multicast traffic under same load

5.3.1 Throughput and Drop rate

Under the same load (90%) and varying multicast traffic, we can see that our switch obtains maximum throughput with a very low drop rate under uniform traffic (top and bottom lines). The throughput higher than 1 is due to multicast packets, which have more than one output.

When the multicast traffic is unevenly distributed, throughput drops and the drop rate increases (middle two lines). This result is reasonable.

5.3.2 Average latency

Figure 23 shows the average latency of multicast and unicast packets. The dotted lines indicate traffic with a hotspot and the solid lines indicate uniform traffic; the thick lines are multicast packets and the thin lines are unicast packets. We can see that multicast packets with hotspot traffic have the highest latency. Also, as more input ports carry multicast packets, the average latency with hotspot traffic decreases while the average latency with uniform traffic increases.

Figure 23: Latency

6. Future Work

Although we spent much time designing, simulating, re-designing, and re-simulating, our switch is still rather crude and further refinement is needed.

1) Scaling up the switch design. One of our major goals was to come up with a scalable design. We think our switch scales well, because the control logic is simple and each port is largely independent. In the short term, we could increase the bandwidth from the current 1.2 Tb/s to 9.6 Tb/s by adding more input ports and two more switching stages. However, due to time pressure, we did not have a chance to simulate this and verify whether the idea works.

2) Providing better Quality of Service. In the current input queue, we give higher-priority packets a lower drop rate, but it is a relative guarantee, not an absolute one. In the future, we would like to provide better Quality of Service in the switch: not only drop-rate priority but also latency and bandwidth guarantees. Since our switch is an input-queued switch, many improvements can be implemented in the input queue.

3) Further improvement of multicast packet latency. The latency of multicast packets is still very long compared with that of unicast packets. One point we could take advantage of is the spare bandwidth inside the switch: since the fabric has an internal speedup of 2, making several copies inside the switch may not cause serious problems, and the multicast latency could be reduced to nearly that of unicast packets. Consequently, we could redesign our multicast mechanism to make use of a copy function in the internal switching elements.

4) Improving performance under non-uniform traffic. Hot-spot traffic has always been the most significant problem for our switch. Even though we tried increasing the buffer size, the drop rate and throughput do not improve substantially under the current structure. We did not expect this to be a problem in our initial design. Therefore, future work should focus on improving performance under non-uniform traffic. One idea we discussed but did not have time to implement is an adaptive scheduling and routing algorithm in the input queue and switching elements: if the switch can sense non-uniform traffic and adjust its behavior, we might have a chance to improve the poor performance for non-uniform traffic.

7. Conclusion

Large-scale deployment of networking will require switching systems that can cost-effectively support thousands or tens of thousands of users in a single location. In this report, we have designed an IP switch architecture, based on gigabit technology components, that is suitable for such large configurations, offering essentially constant per-port cost and complexity with a system capacity of 1.2 Tb/s. The design supports one-to-many multicast, and it uses an efficient form of Early Packet Discard to maintain high effective throughput during overload.

References

[1] Chaney, T., Fingerhut, J.A., Flucke, M., and Turner, J.S., “Design of a Gigabit ATM Switch,” Infocom, 1997.

[2] Chen, Y., and Turner J.S., “Design of a Weighted Fair Queueing Cell Scheduler for ATM Networks,” Proceedings of Globecom, 1998.

[3] Kim, H.S., and Leon-Garcia, A., “A Self-Routing Multistage Switching Network for Broadband ISDN,” IEEE J. Select. Areas Commun., vol. 8, pp. 459-466, Apr. 1990.

[4] Kim, H.S., “Design and Performance of Multinet Switch: A Multistage ATM Switch Architecture with partially Shared Buffers,” IEEE/ACM Trans. On Networking, vol. 2, no. 6, pp. 571-580, Dec. 1994.

[5] McKeown, N., “Fast Switched Backplane for a Gigabit Switched Router,” Business Communications Review, Dec. 1997

[6] Partridge, C. et. al., "A 50-Gb/s IP Router," IEEE/ACM Transactions on Networking, vol. 6, no. 3, pp 237-248, June 1998.

[7] Turner, J.S., and Yamanaka, N., “Architectural Choices in Large Scale ATM Switches,” IEICE Transactions, 1998

[8] Turner, J.S., “An Optimal Nonblocking Multicast Virtual Circuit Switch,” Proceedings of Infocom, June 1994.
