
The Switch Reordering Contagion: Preventing a Few Late Packets from Ruining the Whole Party

Ori Rottenstreich, Pu Li, Inbal Horev, Isaac Keslassy, Senior Member, IEEE, and

Shivkumar Kalyanaraman, Fellow, IEEE

Abstract: Packet reordering has now become one of the most significant bottlenecks in next-generation switch designs. A switch practically experiences a reordering delay contagion, such that a few late packets may affect a disproportionate number of other packets. This contagion can have two possible forms. First, since switch designers tend to keep the switch-flow order, i.e., the order of packets arriving at the same switch input and departing from the same switch output, a packet may be delayed due to packets of other flows with little or no reason. Further, within a flow, if a single packet is delayed for a long time, then all the other packets of the same flow will have to wait for it and suffer as well. In this paper, we suggest solutions against this reordering contagion. We first suggest several hash-based counter schemes that prevent inter-flow blocking and reduce reordering delay. We further suggest schemes based on network coding to protect against rare events with high queueing delay within a flow. Last, we demonstrate using both analysis and simulations that the use of these solutions can indeed reduce the resequencing delay. For instance, resequencing delays are reduced by up to an order of magnitude using real-life traces and a real hashing function.

Index Terms: Switching theory, packet-switching networks

    1 INTRODUCTION

    1.1 Switch Reordering

Packet reordering is emerging as a significant design issue in next-generation high-end switch designs. While it was easier to prevent in older switch designs, which were more centralized, the increasingly distributed nature of switch designs and the growth in port numbers are making it harder to tackle [1].

Fig. 1 illustrates a simplified multi-stage switch architecture, with input ports, middle elements, and output ports. It could for instance be implemented using either a Clos network or a load-balanced switch [2]. Variable-size packets arriving to the switch are segmented into fixed-size cells. Each such cell is load-balanced to one of the middle elements, either round-robin or uniformly-at-random. It is later sent to its appropriate output. There, in the output resequencing buffer, it waits for late packets from other middle elements, so as to depart the switch in order.
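To make the dispatch step concrete, the following sketch (our own illustration, not part of the paper; the cell size, element count and names are assumptions) shows how an input port might segment a packet into fixed-size cells and spread them over the middle elements either round-robin or uniformly at random.

import random

CELL_SIZE = 64           # bytes per fixed-size cell (assumed value)
NUM_MIDDLE_ELEMENTS = 8  # number of middle-stage elements (assumed value)

def segment(packet: bytes, cell_size: int = CELL_SIZE):
    """Split a variable-size packet into fixed-size cells (last cell padded)."""
    cells = [packet[i:i + cell_size] for i in range(0, len(packet), cell_size)]
    if cells and len(cells[-1]) < cell_size:
        cells[-1] = cells[-1].ljust(cell_size, b'\0')
    return cells

class InputPort:
    def __init__(self, policy: str = "round_robin"):
        self.policy = policy
        self.next_element = 0  # round-robin pointer

    def dispatch(self, packet: bytes):
        """Yield (middle_element, cell) pairs for one packet."""
        for cell in segment(packet):
            if self.policy == "round_robin":
                element = self.next_element
                self.next_element = (self.next_element + 1) % NUM_MIDDLE_ELEMENTS
            else:  # uniformly at random
                element = random.randrange(NUM_MIDDLE_ELEMENTS)
            yield element, cell

# Example: a 200-byte packet becomes 4 cells spread over 4 middle elements.
port = InputPort()
print([elem for elem, _ in port.dispatch(b'x' * 200)])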

For instance, assume that a group of cells arrives in order to the switch, all belonging to the same switch flow, i.e., sharing the same input and output. When one of these cells arrives at the output, it can only depart if every cell that preceded it has already departed. If some earlier cells still wait to be transmitted out of their middle elements, then the later cells cannot depart the switch, because they must wait for the earlier ones to be sent in order.

More generally, packet reordering occurs when there is a difference in delay between middle elements. In practice, this can happen for many reasons: e.g., a temporary congestion at a middle element following the sudden arrival of a few high-fanout multicast packets from different inputs, a periodic maintenance algorithm, or a flow-control backpressure message that temporarily stops transmissions out of a middle element but has not yet stopped arrivals to it [3].

Because of packet reordering, a large delay at a single middle element can affect the whole switch. For instance, assume that a given middle element temporarily stops transmitting cells because of a flow-control message. Consider an arbitrary switch flow, i.e. a set of packets sharing the same switch input and output. If the switch-flow cells are load-balanced in round-robin order across the middle elements, and the switch flow has at least as many cells as there are middle elements, then at least one cell will be stuck at the blocked middle element. In turn, this will block the whole switch flow, since all other subsequent cells are waiting for it at the output resequencing buffer. Therefore, the switch will behave as if all the middle elements were blocked. Instead of affecting only the fraction of the traffic that traverses the blocked middle element, the event potentially affects the whole traffic.

Reordering causes many issues in switch designs. The cells that are waiting at the output resequencing buffer cause both increased delays and increased buffering needs. In fact, in current switch designs, the resequencing buffer sizes are typically pushing the limits of available on-chip memory. In addition, reordering buffers cause implementation complexity issues. This is because the output resequencing structures, shown in Fig. 1, are conceptually implemented as linked lists.

O. Rottenstreich and I. Keslassy are with the Department of Electrical Engineering, Technion, Haifa 3200003, Israel. E-mail: {or@tx, isaac@ee}.technion.ac.il.

P. Li is with ASML Netherlands B.V., De Run 6501, 5504 DR Veldhoven, The Netherlands. E-mail: [email protected].

I. Horev is with the Department of Mathematics and Computer Sciences, Weizmann Institute of Science, Rehovot 761001, Israel. E-mail: [email protected].

S. Kalyanaraman is with IBM Research, Bangalore 560045, India. E-mail: [email protected].

Manuscript received 19 Apr. 2012; revised 07 Oct. 2012; accepted 12 Nov. 2012; published online 9 Dec. 2012; date of current version 29 Apr. 2014.

Recommended for acceptance by V. Eramo. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TC.2012.288



Therefore, longer resequencing buffers imply that designers need to insert cells into increasingly longer linked lists at a small constant average time.

There are two ways to address these reordering issues. First, it is possible to change routing. One commonly-proposed solution to fight reordering is to hash each flow to a random middle switch [4]-[6]. However, while this indeed prevents reordering, it also achieves a significantly lower throughput, especially when dealing with bigger flows. In addition, other solutions rely on a careful synchronization of packets to avoid reordering [2], [4]. Yet these solutions are often too complex for typical high-performance routers with high internal round-trip times and large packet rates. In this paper, we assume that routing is given, and do not try to improve it in any way.

A second approach, which is often adopted by switch vendors, is to use significant speedups on the switch fabrics (typically 2X, but published numbers go up to 5X [6], [7]). However, this approach also wastes a large part of the bandwidth. Our goal is to suggest ways of scaling routers in the future without further increasing this speedup.

    1.2 Reordering Contagion and Flow Blocking

In this paper, we divide reordering effects into two different types, and suggest algorithms to deal with each type. More precisely, we consider two types of reordering contagion, i.e. cases where the large delay of one cell can significantly affect the delay of many other cells:

A contagion within flows, e.g. when a large packet is segmented into many cells, and one late cell makes all other cells wait.

A contagion across flows, e.g. when a late cell of one flow makes cells of other flows wait.

The first type is easy to understand. The paper suggests dealing with it by using network coding, so as to not have any irreplaceable cell.

The second type is due to flow blocking, as explained below, and is tackled using hash-based counter schemes.

Today, switches guarantee that packets belonging to the same switch flow, i.e. arriving at the same switch input and departing from the same switch output, will depart in the same order as they arrived. For switch vendors as well as for their customers, breaking this switch-flow order guarantee is not an option.

To the best of our knowledge, all current switch designers provide this guarantee.1 This is the case even though no standard requires it, and in fact the IPv4 router standard does not even forbid packet reordering (Section 2.2.2 in [9]). Today, it is a significant customer requirement that needs to be addressed (it is well known that a major core-router vendor lost market share because its routers experienced some limited reordering).

However, this guarantee has been increasingly hard to address in high-end multi-stage switches because of the complex and often-overlooked flow blocking interactions. Within a switch flow, let a flow be the set of all packets that also share the same (source, destination) IP address pair. Then flow blocking happens when, in order to satisfy the switch-flow order guarantee, packets from one flow wait for late packets from another flow within the same switch flow.

To illustrate this flow blocking phenomenon, Table 1 details the packets from Fig. 1. Assume that even though the first packets belong to the same switch flow, they do not all belong to the same flow: they span five different flows (shown with a circle, a pentagon, a triangle and two further symbols in Fig. 1), each defined by its own (source, destination) IP pair. An additional packet belongs to a different switch flow and can independently depart the switch. As previously mentioned, the packets that reach the output in switch-flow order can depart immediately. However, the packets that reach the output ahead of earlier packets of the same switch flow are out of switch-flow order, and therefore need to wait for those earlier packets at the output. They can only depart from the switch when the late packets arrive, and are blocked meanwhile. In particular, note that these packets are blocked by a late packet from a different flow: this is flow blocking.

TABLE 1: An Example of Packets with Their Flow IDs and Switch Flow IDs

The first packets belong to the same switch flow, i.e., share the same switch input and switch output. One additional packet belongs to a second switch flow. In addition, the packets of the first switch flow come from 5 different flows, i.e., have 5 possible (source, destination) IP address pairs.

Fig. 1. Switch-flow blocking in the switch architecture.

1. However, since switch internal details are often kept confidential, we could not find a publicly available reference. [8] (in Chapter 4), [7], and [6] (in Appendix B) refer to the use of internal sequence numbers to maintain packet order, but none of these detail how the sequence numbers are formed.


    1.3 Our Contributions

In this paper, we propose solutions for the two different types of the reordering contagion. We first suggest hash-based schemes to separate between flows more accurately and reduce the inter-flow blocking, and then suggest a coding scheme to reduce the intra-flow reordering delay contagion.

First, instead of providing a switch-flow order guarantee, we only provide a flow order guarantee, so that packets of the same flow are still maintained in order, but are not constrained with respect to packets of different flows. As mentioned previously, this is compatible with all known standards. We then suggest schemes that use this new order constraint to reduce resequencing delays in the output resequencing buffers. Unlike most previous papers on switch reordering, these schemes do not change the internal load-balancing algorithm, and therefore do not change the internal packet reordering either. In other words, our resequencing schemes are end-to-end in the switch, in the sense that they only affect the input and output buffers, and not any element in-between, so as to be transparent to the switch designer's internal routing and flow-control algorithms. Our schemes use various methods to increasingly distinguish between flows and decrease flow blocking. Our first scheme, Hashed-Counter, uses a hash-based counter algorithm. This scheme extends to switch fabrics the idea suggested in [10] for network processors. It replaces each single flow counter with hashed flow counters, and therefore effectively reduces flow blocking within large switch flows to flow blocking within smaller flow aggregates.

Then, our second scheme, the Multiple-Counter scheme, uses the same counter structure, but replaces the single hash function by several hash functions to distinguish even better between flows and therefore reduce flow blocking.

Fig. 2 illustrates the Hashed-Counter scheme and the Multiple-Counter scheme in comparison with a simple counter-based scheme called the Baseline scheme. The single counter is replaced with an array of counters in the Hashed-Counter and the Multiple-Counter schemes. While a packet from a given flow is hashed to a single counter in the Hashed-Counter scheme, it is hashed to several counters in the Multiple-Counter scheme.

Finally, our last scheme, a variable-increment variant of the Multiple-Counter scheme, attempts to reduce flow blocking even further by using variable-increment counters based on Bh sequences. All these schemes effectively attempt to reduce hashing collisions between different flows within the same switch flow. Note that in the worst case, even if all hashes collide, all these schemes guarantee that they will not perform worse than the currently common scheme, which uses the same counter for all flows and therefore has the worst flow blocking within any switch flow.

However, if a packet is delayed in a long queue, these counter-based schemes cannot prevent it from affecting many packets within its flow. Therefore, we suggest to use network coding to reduce reordering delay. We introduce several possible network coding schemes and discuss their effects on reordering delays. In particular, we show the existence of a time-constrained network coding problem that is not necessarily related to channel capacity optimality.

Finally, using simulations, we show how the schemes can significantly reduce the total resequencing delay, and analyze the impact of architectural variables on the switch performance. For instance, resequencing delays are reduced by up to an order of magnitude using real-life traces and a real hashing function.

Note that since they do not affect routing, our suggested schemes are general and can apply to a variety of previously-studied multi-stage switch architectures, including Clos, PPS (Parallel Packet Switch), and load-balanced switch architectures [2], [11]-[14]. They can also apply to fat-tree data center topologies with several stages [15]-[17], and more generally to any network topology with inputs, outputs, and load-balancing with reordering in-between. While we do not expand on these for the sake of readability, we believe that our schemes can decrease resequencing delay in all these potential architectures. (For instance, to avoid reordering, data center links are currently oversubscribed by factors of 1:5 or more [15]-[17]. If reordering did not incur such a high resequencing delay, it might be easier to efficiently load-balance packets and fully utilize link capacities.)

    1.4 Related Work

Resequencing schemes for switches are usually divided into two main categories. First, counter-based schemes, which rely on sequence numbers. For instance, Turner [18] describes an implementation of a counter-based scheme that corresponds to the Baseline scheme, as described later.

Second, timestamp-based schemes, which rely on timestamps derived from a common clock. Henrion [19] introduces such a scheme with a fixed time threshold. Turner [20] presents an adaptive timestamp-based scheme with congestion-based dynamic thresholds. However, while timestamp-based schemes can be simpler to implement, their delays can become prohibitive when internal delays are highly variable, as explained in [20], because most packets experience a worst-case delay. Therefore, we restrict this paper to counter-based schemes.

Several schemes for load-balanced switches [12], [13] attempt to prevent any reordering within the switch. [2], [4], [21] provide an overview of such schemes. In particular, in the AFBR (Application Flow-Based Routing) approach, as well as in [5] and [6], packets belonging to the same hashed flow are forwarded through the same route, thus preventing resequencing but also changing the flow paths and obtaining low throughput in the worst case. For instance, if there are a few large flows, the achieved balancing is only partial.

Fig. 2. Illustration of the Baseline scheme, the Hashed-Counter scheme and the Multiple-Counter scheme. The single flow counter from the Baseline scheme is replaced with hashed flow counters in the Hashed-Counter scheme and the Multiple-Counter scheme. A packet from a given flow is hashed to a single counter in the Hashed-Counter scheme and to several counters in the Multiple-Counter scheme.

Resequencing schemes have also been considered in network processors. Wu et al. [22] describe a hardware mechanism in which each flow gets its own counter. The mechanism remembers all previous flows and sequentially adds new flows to the list of chains. It then matches a packet from an existing flow with the right SRAM (Static Random-Access Memory) entry using a TCAM (Ternary Content-Addressable Memory) lookup. Meitinger et al. [10] are the first to suggest the use of hashing for mapping flows into counters in network processors in order to reduce reordering. They further discuss the tradeoff between the large number of counters and the possible collisions. However, all these works on network processors only consider a single input and a single output, and therefore do not consider the complexity introduced by the switch flows. In addition, the load-balancing might be contained in network processors with stateful algorithms, while it is not in switches.

Packet reordering is of course also studied in many additional networking contexts, such as the parallel download of multiple parts of a given file [23], or in ARQ (Automatic Repeat reQuest) protocols [24]. Recently, Leith et al. [25] considered the problem of encoding packets with independent probabilistic delays such that the probability that a packet can be decoded within a given decoding deadline is maximized.

2 THE HASHED-COUNTER SCHEME

    2.1 Background

A commonly used scheme for preserving switch-flow order is to keep a sequence-number counter for each switch flow. In this scheme, denoted as the Baseline scheme, each packet arriving at a switch input and destined to a switch output is assigned the corresponding sequence number. As illustrated in Fig. 3(a), in each switch input we keep one counter per switch output. Then, the switch output simply sends packets from this switch input according to their sequence number. Referring to the example of Table 1, all of the first six packets share the same counter, as they belong to the same switch flow. Assume that they get sequence numbers 1 through 6 in their arrival order. Then the packet with sequence number 1 can depart without waiting, because its sequence number is the smallest. The packets that reach the output ahead of lower-numbered packets need to wait for those packets, and in the end the departure order follows the sequence numbers. Likewise, the remaining packet uses a different counter, since it belongs to a second switch flow.
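For concreteness, here is a minimal sketch of the Baseline bookkeeping described above (our own illustration; the class and method names are hypothetical): the input side tags each packet with a per-(input, output) sequence number, and the output side releases the packets of each switch flow strictly in that order.

import heapq
from collections import defaultdict

class BaselineInput:
    """One sequence counter per switch flow, i.e. per (input, output) pair."""
    def __init__(self):
        self.counters = defaultdict(int)   # (input_port, output_port) -> next seq

    def tag(self, input_port, output_port, packet):
        seq = self.counters[(input_port, output_port)]
        self.counters[(input_port, output_port)] += 1
        return (input_port, output_port, seq, packet)

class BaselineOutput:
    """Release packets of each switch flow strictly in sequence-number order."""
    def __init__(self):
        self.expected = defaultdict(int)   # switch flow -> next expected seq
        self.pending = defaultdict(list)   # switch flow -> heap of (seq, packet)

    def receive(self, tagged):
        input_port, output_port, seq, packet = tagged
        flow = (input_port, output_port)
        heapq.heappush(self.pending[flow], (seq, packet))
        released = []
        # Drain the head of the heap while it matches the expected sequence number.
        while self.pending[flow] and self.pending[flow][0][0] == self.expected[flow]:
            released.append(heapq.heappop(self.pending[flow])[1])
            self.expected[flow] += 1
        return released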

It would seem natural to similarly preserve flow order by keeping a counter for each flow. However, the potential number of (source, destination) IPv4 flows going through a high-performance switch is too large to keep a counter for each.

We could also devise a solution in which counters would only be kept for the most recent flows. But such a solution might be complex to maintain and not worth its cost and complexity. For instance, a 10 Gbps line with an average 400 B packet size would have to keep up to 3 million flows for the last second. If each flow takes 64 bits to store the (source, destination) IPv4 addresses and 10 bits for the counter, we would need more than 200 Mb of memory, thus requiring expensive off-chip DRAM (Dynamic Random-Access Memory).
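As a quick sanity check on this estimate, the following back-of-the-envelope computation (our own; it assumes 64 bits for the IPv4 address pair and 10 bits per counter, as stated above) reproduces the order of magnitude.

# Rough estimate of per-flow state for a 10 Gbps line over one second.
line_rate_bps = 10e9            # 10 Gbps
avg_packet_bits = 400 * 8       # 400 B average packet
packets_per_sec = line_rate_bps / avg_packet_bits    # ~3.1 million packets/s

bits_per_flow = 64 + 10         # (src, dst) IPv4 pair + 10-bit counter
total_bits = packets_per_sec * bits_per_flow         # worst case: every packet a new flow

print(f"{packets_per_sec / 1e6:.1f} M flows/s, {total_bits / 1e6:.0f} Mb of state")
# -> roughly 3.1 M flows/s and ~230 Mb of state, i.e. more than 200 Mb, hence off-chip DRAM.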

Instead, the following algorithms rely on hashing to reduce the number of needed counters by several orders of magnitude, in exchange for a small collision probability.

    2.2 Scheme Description

Fig. 4(a) illustrates how the Hashed-Counter scheme is implemented in the input port. Each input port holds one array of packet counters per output port. For a given output port, the Hashed-Counter scheme thus uses an array of counters, instead of the single counter of the Baseline scheme.
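A minimal sketch of this idea as we read it (hypothetical names; the array size and the hash function are our own assumptions): a packet's flow identifier is hashed to pick one counter in the per-output array, and the output then resequences independently per (switch flow, counter index) instead of per switch flow.

import hashlib
from collections import defaultdict

NUM_COUNTERS = 16   # counters per (input, output) array (assumed value)

def flow_hash(flow_id: tuple, m: int = NUM_COUNTERS) -> int:
    """Map a (source IP, destination IP) pair to one of m counter indices."""
    digest = hashlib.sha1(repr(flow_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % m

class HashedCounterInput:
    def __init__(self):
        # (input, output, counter index) -> next sequence number
        self.counters = defaultdict(int)

    def tag(self, input_port, output_port, flow_id, packet):
        idx = flow_hash(flow_id)
        key = (input_port, output_port, idx)
        seq = self.counters[key]
        self.counters[key] += 1
        # The output resequences each (input, output, idx) list separately,
        # so a late packet only blocks the flows that hash to the same counter.
        return (input_port, output_port, idx, seq, packet)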

Fig. 3. Baseline scheme.

Fig. 4. Hashed-Counter scheme.


The last column illustrates schematically the processing complexity needed to insert a packet in the resequencing buffer. The Baseline scheme only needs to search through one list to find the packet position. Likewise, the Hashed-Counter scheme finds the correct list using a single hash function, then goes through the list. However, the Multiple-Counter scheme needs to do so for several lists accessed in parallel. In addition, the variable-increment Multiple-Counter scheme needs another hash function and parallel table lookups to interpret the variable-increment information.

Incidentally, note that these complexity measures might be misleading. For instance, the size of the Baseline scheme list is on average many times larger than the lists of the Hashed-Counter, Multiple-Counter and variable-increment Multiple-Counter schemes, since their total resequencing buffer size is smaller and they maintain many lists. Even though finding the largest of several elements on parallel lists takes more time on average than finding a single one, in practice this factor actually makes it significantly easier to reach high access speeds with the Hashed-Counter scheme than with the Baseline scheme.

6 NETWORK CODING

    6.1 Coding against Rare Events

While the counter schemes above can reduce the reordering delay, the (worst-case) total packet delay is necessarily lower-bounded by the (worst-case) queueing delay. This is because the total delay of a packet is composed of its queueing delay and its reordering delay. If the first packet in a flow of 100 packets is delayed in an extremely long queue, then all the other packets will have to wait for it and suffer as well.

Therefore, reordering delay is vulnerable to rare events. To solve this problem, we suggest to consider an intriguing idea: using network coding to reduce reordering delay. While network coding has often been used in the past, it has mainly been destined to address packet losses, and reordering has often only been a minor side effect [28], [29]. We suggest here to use it exclusively to reduce reordering delay by addressing the vulnerability to rare events. Interestingly, we will show that in some sense, the total delay of a packet can be smaller than its queueing delay, because network coding enables the switch output to reconstruct and send it before it actually arrived at the output.

Consider the switch architecture shown in Fig. 1, and assume for simplicity that all packets have the same size. To implement network coding, in each input port we add one packet buffer per switch flow, i.e. one packet buffer for each (input, output) pair. Then, for each switch flow, we keep computing the running XOR (exclusive or) function of the previous packets. After a given number of slots (or packets), we send a redundant protection packet that contains this XOR of the previous packets, and re-initialize the XOR computation.

Fig. 8 illustrates an example of use of the protection XOR packet. In the input port, the XOR packet is generated from the previous three packets, i.e. it carries their bitwise XOR. One of these three packets is delayed by a relatively long time, so that it arrives at the output port last. Without the XOR packet, the other two packets would have to wait for it in order to depart the switch. However, when using network coding, the late packet is simply recovered by taking the XOR of the protection packet and the two packets that did arrive, because XORing the protection packet with the other two original packets cancels them out and leaves exactly the missing packet. As we can see, the other packets can depart the switch before the delayed packet even arrives at the output. Thus, in this example, the XOR packet mainly helps reducing the late packet's queueing delay, and we do not really care about its impact on channel capacity, as in typical network-coding examples.
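The recovery step can be sketched in a few lines (our own illustration; the three-packet frame and equal-size packets are assumptions taken from the example): the input accumulates a running XOR over a switch flow, and the output rebuilds a missing packet from the protection packet and the packets that did arrive.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-size cells."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_protection(packets):
    """Running XOR over the packets of one frame (all assumed the same size)."""
    acc = bytes(len(packets[0]))
    for p in packets:
        acc = xor_bytes(acc, p)
    return acc

# Three equal-size packets of one switch flow.
p1, p2, p3 = b'AAAA', b'BBBB', b'CCCC'
protection = make_protection([p1, p2, p3])

# Suppose p2 is stuck in a congested middle element: the output holds p1, p3
# and the protection packet, and recovers p2 without waiting for it.
recovered = xor_bytes(xor_bytes(protection, p1), p3)
assert recovered == p2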

To further reduce the total delay, more XOR packets can be generated. However, there is no free lunch. The XOR packets will increase the traffic load in the switch, which will result in a higher packet queueing delay in the central stage. Therefore, there should exist an optimal point beyond which XORs no longer help.

To implement the network coding, in each input port we keep one packet buffer per output port. In this packet buffer we keep the cumulative XOR of the packets of the current switch flow, which might include packets of several flows. The total number of packet buffers over all input ports is therefore the product of the numbers of input and output ports.

Since for each recovered packet we would also like to obtain the values of the corresponding counters, the packet buffer should also keep the cumulative XOR of the corresponding counters of the packets, in addition to their data. Thus the required memory size for each pair of input and output ports is the sum of the cell size and the total size of the counters assigned to a single packet. This memory overhead is smaller when the fixed size of the cells is smaller.

After a delayed packet is recovered and it finally arrives at the output, it is simply dropped. We can observe that such a situation has occurred based on the counters at the switch output.

    6.2 Network Coding Schemes

There are several possible network coding schemes to decide when the protection XOR packets should be sent. First, as shown in Fig. 9(a), a simple coding scheme is to generate an XOR packet every fixed number of slots, by taking the XOR of the packets in those slots. The XOR packet is then inserted following the last of the slots it covers, and these slots together with the protection slot make up a frame. In its header, the protection XOR packet contains the sequence numbers of the first and last packets in the frame, so that the output will be able to know which packets it is supposed to protect.

    Fig. 8. Network coding example.


However, this scheme can practically be further improved in two directions. First, if the traffic load of the flow is light, the number of packets in a frame could be low, and the frame could even be empty. In that case, the contribution of the XOR packets does not justify the high relative overhead and additional delay that they cause. It would make more sense to insert a protection XOR packet after every fixed number of packets instead of after every fixed number of slots. Fig. 9(b) illustrates such a scheme, where the blank boxes stand for the regular packets and the dark boxes for the protection XOR packets. The scheme overhead is then clearly fixed: one protection packet for each such group of regular packets.

Interestingly, another possible improvement to the first scheme could be to use XORs that protect a variable number of slots, e.g. with cumulative coding. For instance, frames could be grouped into larger mega-frames. Within each mega-frame, at every protection slot, the protection XOR packet cumulatively covers not only the last frame, but the whole time since the start of the mega-frame, excluding other XOR packets. Therefore, there is a global coding instead of a mere local coding. Fig. 9(c) illustrates this cumulative coding scheme with a mega-frame consisting of three frames, each frame having several regular slots and one protection slot.
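The sketch below (our own, hedged reading of the two improvements; frame sizes and names are assumptions) contrasts a protection packet emitted after every fixed number of data packets with a cumulative variant that, inside a mega-frame, protects everything sent since the mega-frame started.

class PerPacketCoder:
    """Emit one protection packet after every `k` data packets."""
    def __init__(self, k: int = 10):
        self.k, self.acc, self.count = k, None, 0

    def send(self, packet: bytes):
        self.acc = packet if self.acc is None else bytes(a ^ b for a, b in zip(self.acc, packet))
        self.count += 1
        out = [packet]
        if self.count == self.k:              # end of frame
            out.append(("XOR", self.acc))     # protection covers the last k packets
            self.acc, self.count = None, 0
        return out

class CumulativeCoder(PerPacketCoder):
    """Same cadence, but the XOR keeps accumulating until the mega-frame ends."""
    def __init__(self, k: int = 10, frames_per_megaframe: int = 3):
        super().__init__(k)
        self.frames_left = frames_per_megaframe
        self.total_frames = frames_per_megaframe

    def send(self, packet: bytes):
        out = super().send(packet)
        if len(out) == 2:                     # a protection packet was just emitted
            self.frames_left -= 1
            if self.frames_left > 0:
                # keep the running XOR across frames: restore it from the emitted packet
                self.acc = out[1][1]
            else:                             # mega-frame ends: start afresh
                self.frames_left = self.total_frames
        return out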

This last scheme is interesting in that it illustrates a coding tradeoff. On the one hand, we would like the coding to be efficient, so in some sense we would like to wait as long as possible until the end of the mega-frame and then use several protection packets, e.g. using a simple Reed-Solomon or Hamming code with fixed packet dependencies. However, if we only release the protection packets at the end of the mega-frame, the coding becomes useless, because the protected packets will probably have already arrived at the output. Therefore, the protection scheme needs some form of time-constrained coding, in the sense that the protection packets need to be close to the packets they protect. That is why the schemes displayed above are more effective for reordering delay than more complex schemes that might be closer to the Shannon capacity bounds.

7 PERFORMANCE ANALYSIS

7.1 Delay Models

We now want to model the performance of the schemes under different delay models. We assume simple Bernoulli i.i.d. arrivals in slotted time, so that the probability of having a packet arrival for a given switch flow at a given slot is equal to a fixed rate. We analyze the performance of the schemes based on several delay models for the queueing delay, which is experienced by each packet in the middle switch elements.

First, we will consider a Rare-Event delay model in which the queueing delay is either some large value with a small probability, or 0 with the complementary probability. This delay distribution models the impact of low-probability events in which some packets experience very large delays, and these delays impact other packets. For instance, this could be the case of a middle element with a significant temporary congestion. It could also model a failure probability for middle elements, with a timeout value at the output resequencing buffer and a negligible queueing time. After the timeout, an absent packet is declared lost, and the following packets can be released.

We then consider a General delay model with an arbitrary cumulative distribution function of the queueing delay. Then, we apply the analysis to a Geometric delay model, in which the queueing delay is geometrically distributed. The Geometric delay model will help us get some intuition on the reordering delay in a switch under a given load.
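For intuition, here is a small sampling sketch of the two parameterized delay models (our own illustration; the symbols T, p and q are placeholders we introduce, not the paper's notation).

import random

def rare_event_delay(T: int = 100, p: float = 0.01) -> int:
    """Queueing delay is a large value T with small probability p, else 0."""
    return T if random.random() < p else 0

def geometric_delay(q: float = 0.9) -> int:
    """Geometrically distributed queueing delay over the nonnegative integers:
    Pr(delay = k) = (1 - q) * q**k."""
    k = 0
    while random.random() < q:
        k += 1
    return k

samples = [geometric_delay() for _ in range(100_000)]
print(sum(samples) / len(samples))   # empirical mean, close to q / (1 - q) = 9 here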

In these models, we only take into account the queueing delay in the middle elements and the resequencing delay in the resequencing buffer, so the total delay is the sum of the two. We neglect the propagation delays, as well as the additional queueing delays in the inputs and outputs. We also assume uniformly-distributed hash functions. For the sake of readability, the performance of the variable-increment Multiple-Counter scheme is presented here only for part of the delay models.

7.2 Delay Distributions

7.2.1 Rare-Event Delay Model

We now want to analyze the Baseline, Hashed-Counter, Multiple-Counter and variable-increment Multiple-Counter schemes. Note that the performances of the last three schemes depend on the flow-size distribution. For instance, if a switch flow consists of a single large flow, then there is no point in adding counters to distinguish between flows. We have also developed a full model based on the flow sizes. However, for the sake of readability, we will only present below the much simpler case in which all flows have negligible size.

In Delay Model 1, the queueing delay takes one of two possible values: it is equal to 0 with high probability, and to some large value otherwise. The queueing delay is chosen at random independently for each packet. If it is nonzero, we say that the packet is delayed. Using this delay model, the total delay is upper-bounded by the large delay value. The reason is that the queueing delay is either 0 or that large value, so that many time slots after its arrival at the switch input, a packet is guaranteed to have reached the output; in addition, due to the same upper bound on the queueing delay, all its earlier packets are also guaranteed to reach the output no later than that time. More generally, the higher the queueing delay of a packet, the smaller its resequencing delay; in particular, in this delay model, if a packet experiences the maximum queueing delay then its resequencing delay is necessarily zero.

We present the CDF (Cumulative Distribution Function) of the total delay for the different counting schemes. To calculate this function, we have to know the probability of the arrival of

    Fig. 9. Network coding schemes.


with some probability, and in that case is only delayed beyond the considered time with some probability; the result follows by multiplying, over all such possible slots, the probabilities that there is no late packet from each slot.

ii. The result follows again directly from the above when considering a single counter out of the array.

    iii. The proof is again exactly the same as in the previoustheorem, and follows from the previous result.

    7.2.3 Geometric Delay Model

Now we focus on the performance evaluation under a special case of the General delay model, in which the queueing delay is geometrically distributed over the nonnegative integers. The results for this case are summarized in the following theorem.

Theorem 3 (Geometric Delay Model).

i. Using the Baseline scheme,

ii. Using the Hashed-Counter scheme,

iii. Using the Multiple-Counter scheme,

Proof. The results follow from the previous theorem using the expression of the geometrically-distributed delay model.

    8 SIMULATIONS

We fully developed from scratch a switch simulator that can run several use-cases, including the geometric delay model, the Clos switch and the data center switch. To better mimic practical network behavior, we further added the possibility of feeding the flow source with real-life traces [30]. Simulations for each use-case are conducted as described below. Throughout this section, delays are given in units of time slots. In the simulations, we rely on real-life 64-bit mix hash functions [31]. In addition, for the evaluation of the variable-increment Multiple-Counter scheme, we use a dedicated variable-increment sequence, which is a Bh sequence.
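For reference, a commonly circulated variant of a 64-bit integer mix hash attributed to T. Wang is sketched below; whether this exact variant matches the function of [31] used in the simulations is an assumption on our part.

MASK64 = (1 << 64) - 1

def mix64(key: int) -> int:
    """A widely circulated 64-bit integer mix hash (one of several published variants)."""
    key = (~key + (key << 21)) & MASK64
    key = key ^ (key >> 24)
    key = (key + (key << 3) + (key << 8)) & MASK64   # key * 265
    key = key ^ (key >> 14)
    key = (key + (key << 2) + (key << 4)) & MASK64   # key * 21
    key = key ^ (key >> 28)
    key = (key + (key << 31)) & MASK64
    return key

# Example: map a (source, destination) IPv4 pair to one of 1024 counter bins.
src, dst = 0x0A000001, 0x0A000002
print(mix64((src << 32) | dst) % 1024)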

    8.1 Geometric Delay Model Simulations

We run simulations for the geometric delay model discussed in Section 7.2. In the simulation, the number of flows is large enough that the assumption that the probability that two arbitrary packets share the same flow is negligible still holds. We generate the switch flow using Bernoulli i.i.d. traffic with a fixed arrival rate, a fixed parameter of the geometric delay, and a fixed number of total counters. In the Multiple-Counter scheme and the variable-increment Multiple-Counter scheme, we use several hash functions.

Fig. 10(a) depicts the simulation results, compared with the theoretical results from (8), (9) and (10). The simulation results match the theoretical results quite well. Furthermore, we can see that changing the ordering guarantee from switch-flow order preservation to flow order preservation significantly decreases the average and standard deviation of both delays. Using the Hashed-Counter scheme, the average total delay is drastically reduced from 19.09 to 9.35 time slots, an improvement of 51%. In the Multiple-Counter scheme the average total delay is reduced by an additional 8.3% to 8.58 without any additional memory.

Fig. 10(b) shows the average resequencing delay as a function of the number of counters. The Hashed-Counter and the Multiple-Counter schemes significantly reduce the resequencing delay of the Baseline scheme, which is of course not affected by the number of counters. For instance, with a small number of counters, the average resequencing delay for the Baseline, Hashed-Counter and Multiple-Counter schemes is 11.7315, 2.0312 and 1.2442, respectively. Furthermore, with more counters, the delay of the Hashed-Counter scheme drops to 0.2731 while the Multiple-Counter scheme reduces it to only 0.0302.

    Fig. 10. Geometric delay model.


    8.2 Switch Simulations

We now run simulations with the switch structure from Fig. 1. Therefore, the delay experienced in the middle switch elements does not follow a specific delay model as above, but is instead incurred by other packets. We keep fixed numbers of input ports, output ports and middle elements, yielding many switch flows going through different paths. We set the total number of counters for the Hashed-Counter and Multiple-Counter schemes, and adjust it for the variable-increment Multiple-Counter scheme to account for its larger memory requirements. We also assume a uniform traffic matrix. The presented results are based on long simulations.

We start by comparing the performance of the hash-based counter schemes by using 4096 flows per (input, output) pair. The total load is fixed, and each flow is generated using Bernoulli i.i.d. traffic. In the Multiple-Counter and variable-increment Multiple-Counter schemes, we use several hash functions.

Fig. 11(a) compares the results of the switch simulations to the theoretical results from (8), (9) and (10), similarly to Fig. 10(a). For the Baseline scheme the optimal fit was reached using one parameter value, and for the Hashed-Counter and Multiple-Counter schemes using another. The discrepancy between the theory and the simulation can be explained when we consider that in the switch simulations the assumption that the flows are mice of negligible size does not hold.

Fig. 11(b) plots the average resequencing delay in the switch as a function of the traffic load. As the load increases and the delays become larger, the impact of the Hashed-Counter, Multiple-Counter and variable-increment Multiple-Counter schemes becomes increasingly significant. The variable-increment Multiple-Counter scheme results in the lowest resequencing delay. For instance, at high traffic load, the Multiple-Counter scheme achieves a 24.2% reduction of the resequencing delay obtained by the Hashed-Counter scheme, while the variable-increment Multiple-Counter scheme further reduces it by an additional 16%.

Fig. 11(c) shows the PDF (Probability Density Function) of the number of packets which are blocked from exiting the resequencing buffers for the Hashed-Counter, Multiple-Counter and variable-increment Multiple-Counter schemes. The hash-based schemes vastly improve on the average number of waiting packets in the Baseline scheme: the Baseline scheme has an average of 22.47 packets per time slot, while the Hashed-Counter scheme has approximately 5. The Multiple-Counter scheme reduces this number to 4.6 packets per time slot, and finally the variable-increment Multiple-Counter scheme has only 4.07 packets per time slot.

8.3 Switch Simulations Using Real-Life Traces

We now conduct experiments using real-life traces recorded on a single direction of an OC192 backbone link [30]. We use a real hash function [31] to match each (source, destination)

    Fig. 11. Switch simulation results.


IP-address flow with a given counter bin. In the Multiple-Counter and variable-increment Multiple-Counter schemes, we use several hash functions.

As expected, Fig. 11(d) shows how the Hashed-Counter scheme drastically reduces the resequencing delay on this real-life traffic, from 1 to 0.1. Again, the reduction in total delay is more modest, from 2.4 to 1.5 time slots.

    8.4 Network Coding Simulations

Fig. 12 presents simulation results of the CDF of the total delay for the Rare-Event delay model. We assume a fixed traffic load and that the flows of the packets are uniformly distributed among a set of 4096 possible flows. The network coding parameter was set such that a protection XOR packet was inserted every 10 packets. In order to have a fair comparison, we assumed that when the network coding is used, and the load is therefore increased by a factor of 1.1, the maximal queueing delay of the Rare-Event model is increased by the same factor. Thus, the maximal queueing delay is 110. Fig. 12(a) presents the CDF of the total delay without network coding and Fig. 12(b) shows the CDF with coding.

A CDF of 0.9 is achieved in the Baseline scheme without coding only for a total delay of 90. For the Baseline scheme with coding and for all other schemes, this CDF is achieved even for a delay of 0. Without network coding, the mean of the total delay for the Baseline, Hashed-Counter, Multiple-Counter and variable-increment Multiple-Counter schemes was 36.12, 3.33, 1.53 and 1.14, respectively. Using the network coding, the mean delay was significantly reduced to 3.30, 0.41, 0.23 and 0.17, an improvement of approximately an order of magnitude.

In the delay models from Section 7 and in the simulations so far, we have assumed that a packet can leave the switch immediately when it is available at the switch output and in order. In practice, in many switches at most one packet can leave a given switch output in a single time slot. Thus a packet that is ready to leave the switch at the same time as a second packet might experience additional delay. We denote this delay the exit delay and assume that the switch gives precedence to packets with earlier arrival times. Fig. 12(c) presents the CDF of the sum of the total delay and the exit delay in the Rare-Event delay model (without network coding). The mean of this sum of delays for the Baseline, Hashed-Counter, Multiple-Counter and variable-increment Multiple-Counter schemes was 94.00, 6.64, 2.50 and 1.52, respectively. With network coding the means were reduced to 42.24, 1.30, 0.54 and 0.37, respectively. (We do not present the full CDF for this case with network coding due to space limits.)

8.5 Data Center Simulations Using Real-Life Traces

We also perform simulations on a 5-stage data center structure described in [17]. The input ports are fed with a real-life trace as in Section 8.3. The total number of counters is set for the Hashed-Counter and Multiple-Counter schemes, and adjusted for the variable-increment Multiple-Counter scheme to account for its memory requirements. In the Multiple-Counter and variable-increment Multiple-Counter schemes, several hash functions are used.

Fig. 13 illustrates the simulation results of the resequencing delay. As expected, the Hashed-Counter scheme noticeably reduces the resequencing delay, from 16.3 to 6.73 time slots. In this data-center simulation, the improvement from the Hashed-Counter scheme to the Multiple-Counter scheme is more significant: the average resequencing delay is decreased from 6.73 to 3.30 time slots (a 51.0% improvement). The reason is that as a packet goes through more stages of switch elements, the variance of the queueing delay is larger and the reordering caused by other flows becomes more severe. As in the previous simulations, the best results are obtained by the variable-increment Multiple-Counter scheme, for which the resequencing delay is only 1.84 time slots.

    9 CONCLUSION

In this paper, we provided schemes to deal with packet reordering, an emerging key problem in next-generation switch designs.

Fig. 12. CDF of the total delay and of the total delay + exit delay for the Rare-Event delay model with and without network coding. The results are based on simulations.

Fig. 13. CDF of the resequencing delay for a data center, using a real-life trace (based on simulation).


We argued that current packet order requirements for switches are too stringent, and suggested only requiring flow order preservation instead of switch-flow order preservation.

We then suggested several schemes to reduce reordering. We showed that hash-based counter schemes can help prevent inter-flow blocking. Then, we also suggested schemes based on network coding, which are useful against rare events with high queueing delay, and identified a time-constrained coding problem. We also pointed out an inherent reordering delay unfairness between elephants and mice, and suggested several mechanisms to correct this unfairness. We finally demonstrated in simulations reordering delay gains by factors of up to an order of magnitude.

In future work, we intend to further investigate the optimality of our schemes. We intend to find whether there are fundamental lower bounds on the average delay caused by reordering in a switch, given any possible scheme.

    ACKNOWLEDGMENTS

The authors thank Alex Shpiner for his helpful suggestions. This work was supported in part by the European Research Council Starting Grant 210389, in part by the Jacobs-Qualcomm fellowship, in part by an Intel graduate fellowship, in part by a Gutwirth Memorial fellowship, in part by an Intel research grant on Heterogeneous Computing, in part by the Intel ICRI-CI Center, in part by the Technion Funds for Security Research, and in part by the Hasso Plattner Center for Scalable Computing.

    REFERENCES

[1] F. Abel et al., "Design issues in next-generation merchant switch fabrics," IEEE/ACM Trans. Netw., vol. 15, no. 6, pp. 1603-1615, 2007.

[2] I. Keslassy et al., "Scaling internet routers using optics," ACM SIGCOMM, vol. 33, no. 4, pp. 189-200, 2003.

[3] N. Chrysos et al., "End-to-end congestion management for non-blocking multi-stage switching fabrics," in Proc. IEEE/ACM Architectures Netw. Commun. Syst. (ANCS), 2010, p. 6.

[4] I. Keslassy, The Load-Balanced Router. Saarbrücken, Germany: VDM Verlag, 2008.

[5] Cisco Catalyst 6500 Documentation [Online]. Available: http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.2SXF/native/configuration/guide/cef.html.

[6] Juniper T Series Core Routers Architecture Overview [Online]. Available: http://www.juniper.net/us/en/local/pdf/whitepapers/2000302-en.pdf.

[7] Cisco CRS-1 Overview [Online]. Available: http://cseweb.ucsd.edu/varghese/crs1.ppt.

[8] Cisco CRS Carrier Routing System 16-Slot Line Card Chassis System Description [Online]. Available: http://www.cisco.com/en/US/docs/routers/crs/crs1/16-slot-lc/system-description/reference/guide/sysdsc.pdf.

[9] F. Baker. (1995, Jun.). RFC 1812: Requirements for IP Version 4 Routers [Online]. Available: http://www.faqs.org/rfcs/rfc1812.html.

[10] M. Meitinger et al., "A hardware packet resequencer unit for network processors," in Proc. Architecture Comput. Syst. (ARCS), 2008, pp. 85-97.

[11] S. Iyer and N. McKeown, "Analysis of the parallel packet switch architecture," IEEE/ACM Trans. Netw., vol. 11, no. 2, pp. 314-324, Apr. 2003.

[12] C.-S. Chang, D.-S. Lee, and Y.-S. Jou, "Load balanced Birkhoff-von Neumann switches, part I: One-stage buffering," Comput. Commun., vol. 25, no. 6, pp. 611-622, 2002.

[13] C.-S. Chang, D.-S. Lee, and C.-M. Lien, "Load balanced Birkhoff-von Neumann switches, part II: Multi-stage buffering," Comput. Commun., vol. 25, no. 6, pp. 623-634, 2002.

[14] W. Shi, M. H. MacGregor, and P. Gburzynski, "Load balancing for parallel forwarding," IEEE/ACM Trans. Netw., vol. 13, no. 4, pp. 790-801, Aug. 2005.

[15] A. G. Greenberg et al., "VL2: A scalable and flexible data center network," in Proc. ACM SIGCOMM, 2009, pp. 51-62.

[16] M. Al-Fares, A. Loukissas, and A. Vahdat, "A scalable, commodity data center network architecture," in Proc. ACM SIGCOMM, 2008, pp. 63-74.

[17] R. N. Mysore et al., "PortLand: A scalable fault-tolerant layer 2 data center network fabric," in Proc. ACM SIGCOMM, 2009, pp. 39-50.

[18] J. S. Turner, "Resequencing cells in an ATM switch," Comput. Sci. Dept., Washington Univ., Washington, DC, Tech. Rep. WUCS-91-21, Feb. 1991.

[19] M. Henrion, "Resequencing system for a switching node," U.S. Patent 5,127,000, Jun. 1992.

[20] J. S. Turner, "Resilient cell resequencing in terabit routers," in Proc. Allerton Conf. Commun. Control Comput., 2003, pp. 1183-1192.

[21] S. Kandula et al., "Dynamic load balancing without packet reordering," SIGCOMM Comput. Commun. Rev., vol. 37, no. 2, pp. 51-62, 2007.

[22] B. Wu et al., "A practical packet reordering mechanism with flow granularity for parallelism exploiting in network processors," in Proc. IEEE Int. Parallel Distrib. Process. Symp. (IPDPS), 2005, p. 133.

[23] Y. Nebat and M. Sidi, "Parallel downloads for streaming applications: A resequencing analysis," Perform. Eval., vol. 63, no. 1, pp. 15-35, 2006.

[24] Y. Xia and D. N. C. Tse, "Analysis on packet resequencing for reliable network protocols," Perform. Eval., vol. 61, no. 4, pp. 299-328, 2005.

[25] D. J. Leith and D. Vasudevan, "Coding packets over reordering channels," in Proc. IEEE Int. Symp. Inf. Theory (ISIT), 2010, pp. 2458-2462.

[26] S. Graham, "Bh sequences," in Analytic Number Theory, Springer, 1996, pp. 431-449.

[27] O. Rottenstreich, Y. Kanizo, and I. Keslassy, "The variable-increment counting Bloom filter," in Proc. IEEE Conf. Comput. Commun. (INFOCOM), 2012, pp. 1880-1888.

[28] O. Tickoo et al., "LT-TCP: End-to-end framework to improve TCP performance over networks with lossy channels," in Proc. IEEE/ACM Int. Workshop Quality of Service (IWQoS), 2005, pp. 81-93.

[29] V. Sharma et al., "MPLOT: A transport protocol exploiting multipath diversity using erasure codes," in Proc. IEEE Conf. Comput. Commun. (INFOCOM), 2008, pp. 121-125.

[30] C. Shannon et al. CAIDA Anonymized 2008 Internet Trace [Online]. Available: http://imdc.datcat.org/collection/.

[31] T. Wang. Integer Hash Function [Online]. Available: http://www.concentric.net/~Ttwang/tech/inthash.htm.

Ori Rottenstreich received the BS degree in computer engineering (summa cum laude) and the PhD degree from the Electrical Engineering Department, Technion, Haifa, Israel, in 2008 and 2014, respectively. His research interest includes exploring novel coding techniques for networking applications. He is a recipient of the Google Europe Fellowship in Computer Networking, the Andrew Viterbi graduate fellowship, the Jacobs-Qualcomm fellowship, the Intel graduate fellowship, and the Gutwirth Memorial fellowship. He also received the Best Paper Runner Up Award at the IEEE Infocom 2013 conference.

Pu Li received the BS and MSc degrees from Zhejiang University, China, and Technical University Eindhoven, The Netherlands, in 2007 and 2009, respectively. He was in the Department of Electrical Engineering, Technion, between November 2009 and July 2010. He is now working as a new technology introduction engineer in ASML, The Netherlands. He conducted multiple research projects in the Chinese University of Hong Kong, TNO Netherlands, and Philips Research Europe, in the field of communication networks. His research interests include wireless technologies, high-performance switching networks, and network-on-chip.


Inbal Horev received the BSc degree in both physics and electrical engineering from the Technion - Israel Institute of Technology, Haifa, Israel. She is an MSc student at the Department of Mathematics and Computer Sciences, Weizmann Institute of Science, Rehovot, Israel. Her current research fields include active learning, computer vision, and geometry processing.

Isaac Keslassy (M'02-SM'11) received the MS and PhD degrees in electrical engineering from Stanford University, California, in 2000 and 2004, respectively. He is currently an associate professor in the Electrical Engineering Department, Technion, Israel. His research interest includes the design and analysis of high-performance routers and multi-core architectures. He is the recipient of the European Research Council Starting Grant, the Alon Fellowship, the Mani Teaching Award, and the Yanai Teaching Award.

Shivkumar Kalyanaraman received the BTech degree from the Indian Institute of Technology, Madras, India, the MS and PhD degrees from Ohio State University, Columbus, and the executive MBA degree from Rensselaer Polytechnic Institute (RPI), Troy, New York. He is a senior manager at IBM Research, Bangalore, India. Previously, he was a professor in the Department of Electrical, Computer and Systems Engineering at RPI. His research in IBM is at the intersection of emerging wireless technologies, smarter energy systems, and IBM middleware and systems technologies. He is an ACM Distinguished Scientist.
