
Understanding the Packet Processing Capability of Multi-Core Servers

Norbert Egi‡, Mihai Dobrescu†, Jianqing Du†, Katerina Argyraki†, Byung-Gon Chun§, Kevin Fall§, Gianluca Iannaccone§, Allan Knies§, Maziar Manesh§, Laurent Mathy‡, Sylvia Ratnasamy§

§ Intel Research, † EPFL, ‡ Lancaster University

Abstract

Compared to specialized network equipment, software routers running on commodity servers allow programmers to rapidly build and (re)program networks using the software and hardware platforms they tend to be most familiar with—that of the general-purpose computer. Unfortunately, the Achilles' heel of software routers has been performance; commodity servers have traditionally proven incapable of high-speed packet processing, thereby motivating an entire industry around the development of specialized network hardware and software.

However, recent advances in PC technology promise significant speed-ups for applications amenable to parallelization; router workloads appear ideally suited to exploit these advances. This leads us to question whether it is now (or soon will be) plausible to scale software routers to current high-speed networks. As a first step toward answering this question, we study the packet-processing capability of current commodity multi-core servers: we identify performance bottlenecks, evaluate tradeoffs between performance and programmability, and discuss what changes are needed to further scale the packet-processing capability of general-purpose servers.

1 Introduction

To what extent are general-purpose processors capable of high-speed packet processing? The answer to this question could have significant implications for how future network infrastructure is built. To date, the development of network equipment (switches, routers, various middleboxes) has focused primarily on achieving high performance for relatively limited forms of packet processing. However, as networks take on increasingly sophisticated functionality (e.g., data loss protection, application acceleration, intrusion detection), and as major ISPs compete in offering new services (e.g., video, mobility support services), there is an increasing need for network equipment that is programmable and extensible. Both industry and research have indeed started tackling the issue [4, 10, 12, 13, 15].

In current networking equipment, high performance and programmability are competing goals—if not mutually exclusive. On the one hand, we have high-end switches and routers that offer very high performance but, because they rely on specialized and closed hardware and software, are notoriously difficult to extend, program, or otherwise experiment with. On the other hand, we have "software routers," where packet processing is performed in software running on general-purpose platforms; these are easily programmable, but have so far been suitable only for low-packet-rate environments such as small enterprises [12].

The challenge of building network infrastructure that is programmable and capable of high performance can be approached from one of two extreme starting points.

One approach would be to start with existing high-end, specialized devices and retrofit programmability into them. For example, some router vendors have announced plans to support limited APIs that will allow third-party developers to change/extend the software part of their products (which does not typically involve core packet processing) [4, 15]. A larger degree of programmability is possible with network-processor chips, which offer a "semi-specialized" option, i.e., implement only the most expensive packet-processing operations in specialized hardware and run the rest on conventional processors. While certainly an improvement, we note that, in practice, network processors have proven hard to program: in the best case, the programmer needs to learn a new programming paradigm; in the worst, she must be aware of (and program to avoid) low-level issues like resource contention during parallel execution or expensive memory accesses [19, 21].1

From the opposite end of the spectrum, a different approach would be to start with software routers and optimize their packet-processing performance. The allure of this approach is that it would allow developers to build and program networks using the operating systems and hardware platforms they tend to be most familiar with. Such networks also promise greater extensibility: data and control plane functionality can be modified as desired through a software-only upgrade, and router developers are spared the cost and burden of hardware design and development.

1 This remains true even with recent commercial products [2, 3] that combine specialized and general-purpose processors; based on discussion with vendors, these products do allow some portion of the traffic to be processed in software running on general-purpose processors but offer no performance guarantees for such processing.


By contrast, in network equipment that implements per-packet processing functionality using custom ASICs, a change to the router data plane could require a hardware redesign and upgrade.

In addition, a network infrastructure built from commodity servers would allow networks to inherit the many desirable properties of the PC-based ecosystem, such as lower costs due to large-volume manufacturing, a widespread supply/support chain, rapid advances in semiconductor technology, state-of-the-art power management features, and so forth. In other words, if feasible, this approach could enable a network infrastructure that is built and programmed in much the same way as end-systems are today. The challenge, of course, lies in scaling this approach to high-speed networks.

It is perhaps too early to know which approach to router programmability is superior. In fact, it is likely that each approach offers different tradeoffs between programmability and performance, and these tradeoffs will cause each to be adopted where appropriate. As yet, however, there has been little research exposing what tradeoffs are achievable.

A legitimate question at this point is whether the performance requirements for network equipment are just too high and our exploration is a fool's errand. The bar is indeed high. In terms of individual link/port speeds, 10Gbps is already widespread and 40Gbps is being deployed at major ISPs; in terms of aggregate switching speeds, carrier-grade routers range from 10Gbps up to 92Tbps! Two developments, however, lend us hope.

The first is a recent research proposal [16] that presents a solution whereby a cluster of N servers can be interconnected to achieve aggregate switching speeds of N×R bps, provided each server can process packets at a rate on the order of R bps. This result implies that, in order to scale software routers, it is sufficient to scale a single server to individual line speeds (10-40Gbps) rather than aggregate speeds (40Gbps-92Tbps). This reduction makes for a much more plausible target.

Second, we expect that emerging technology trends will favor packet-processing workloads. For example, packet processing appears naturally suited to exploiting the tremendous computational power that multicore processors offer parallel applications. Similarly, I/O bandwidth has gained tremendously by the transition from PCI-X to PCIe, which has led to the introduction of 10GigE NICs into the PC market [5]. Finally, the arrival of multiprocessor architectures with multiple integrated memory controllers is expected to significantly boost system memory bandwidth.

While there is widespread awareness of these advances in server technology, we find little comprehensive evaluation of whether/how these translate into performance improvements for packet-processing workloads. Hence, our goal in this paper is to understand the packet-processing capability of modern server-class PCs. Specifically, we focus on the following questions:

1. What packet-processing performance are current commodity multi-core servers capable of?

2. What hardware bottleneck limits performance?

3. How effective is network software at exploiting the available hardware resources?

4. What tradeoffs between performance and per-packet processing are feasible?

5. What (hardware or software) architectural changes might further improve performance?

We tackle the above questions in this paper. We start by describing our experimental setup in Section 2 and then address questions 1-5 in Sections 3-7, respectively.

As we shall see, answering even our seemingly straightforward questions requires a surprising amount of sleuthing. Modern processors and operating systems are both beasts of great complexity. And while current hardware and software offer extensive hooks for measurement and system profiling, these can be equally overwhelming. For example, Intel processors offer over 400 performance counters that can be programmed for detailed tracing of everything from branch mispredictions to I/O data transactions. Part of our contribution is thus a methodology for performing such an evaluation. Our study adopts a top-down approach in which we start with black-box testing and then recursively identify and drill down into only those aspects of the overall system that merit further scrutiny.

Our study thus differs from prior work on software routers in two ways: (1) we focus on recent advances in PC technology (e.g., multicore, PCIe, non-uniform memory architectures) that, as we will see, affect the nature of the performance bottlenecks we encounter, and (2) we delve "inside the box" to better understand the black-box performance figures.

Our results show that although modern multi-core servers do offer commendable performance, there is still a need for additional improvement. We explain why current multi-core servers still fall short and show that: (1) on the hardware side, the required improvements are well aligned with emerging server technology trends, and (2) the improvements required on the software side are achievable with relatively incremental changes to operating systems and NIC firmware.

Finally, we note that even though our study stemmed from an interest in programmable networks, packet processing is an instance of the more general class of stream-based applications (real-time video delivery, stock trading, continuous query processors); our findings should apply equally to such contexts.


Sockets: 2
Cores/Socket: 4
Clock rate: 1.6GHz
FSB speed: 1.066GHz
Cache: 8MB
Memory: 8GB DDR2 667MHz
PCIe 1.1 slots: 1 x 8-lane; 3 x 4-lane

Table 1: Xeon Server Characteristics

Figure 1: Traditional shared bus architecture

2 Experimental Setup

We start by briefly describing our hardware and software platforms, as well as the workloads used in our tests.

Hardware We focus our study on an Intel Xeon-based server, because it is reportedly the most widely deployed multi-core server today and because of our access to detailed profiling tools for this platform (we briefly discuss alternate architectures in §7). Fig. 1 presents a high-level view of the Xeon shared-bus architecture: multiple processing cores2 are arranged in "sockets" and all communication between the sockets, memory, and I/O devices is routed through the "chipset" (a.k.a. the "northbridge"), which includes the memory and I/O-bus controllers. There are three main system buses: the Front Side Bus (FSB) interconnects the different sockets, as well as each socket and the chipset; the PCIe bus connects I/O devices to the chipset via one or more channels known as "lanes"; and the memory bus connects the memory to the chipset.

For our experiments, we use a mid-level server with hardware specifications as listed in Table 1. We equip the server with four quad-port 1Gbps NICs, i.e., a total of sixteen 1Gbps ports. To source/sink traffic, we use two additional machines, each connected to 8 of the server's 16 ports. With this setup, we can generate a maximum input traffic load of 16Gbps.

Software Our server runs Linux 2.6.19 with SMP Click in polling mode—i.e., the CPUs poll for received packets rather than being interrupted when new packets arrive [23]. We instrument our server with a performance monitoring tool similar to Intel VTune [8], and a chipset-specific tool that monitors memory-bus usage.3

2 We use the terms "CPU," "core" and "processor" interchangeably.

Figure 2: Abilene distribution of packet sizes (100-byte bins)


Workloads A packet-processing workload can be roughly characterized by (1) the distribution of packet sizes, (2) the incoming packet rate, measured in packets/sec, and (3) the type of processing required per packet (e.g., encryption, classification, etc.).

With regard to packet size, we consider both synthetic workloads, in which every packet has a fixed size of P bytes, as well as a trace-driven workload generated from the "Abilene-I" packet trace collected on the Abilene network [9] (see Fig. 2 for the distribution of packet sizes). For our fixed-size workloads, we consider a range of packet sizes but focus primarily on two extreme cases: (1) all min-sized packets (P = 64B) and (2) all large packets (P = 1024B). Although unrealistic, a workload of all min-sized packets is important, as it has historically been the reference benchmark used by network equipment vendors.

With regard to packet rate, we repeat our tests under increasing input packet rates until we arrive at the maximum input rate that the server can sustain without dropping packets. This maximum loss-free forwarding rate, or MLFFR, is the primary performance metric we report (similar to previous studies [23]). As appropriate, we report the MLFFR in terms of either bits-per-second (bps) or packets-per-second (pps). For lightweight packet-processing applications—e.g., simple forwarding or IP routing—we use an MLFFR of 10Gbps as our informal minimum target for "good" performance. This seems a reasonable target since links of this speed are currently widely deployed by ISPs, are still in the process of being deployed within larger enterprises and data centers, and are typically not yet being deployed at the very edges of the network (homes, small businesses, etc.). In addition, as mentioned in §1, achieving a per-server processing rate of 10Gbps would enable a cluster-based approach to further scaling software routers, as described in [16].4
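The MLFFR search itself can be automated. Below is a minimal Python sketch of one way to do it, assuming a hypothetical traffic generator hook (the offer_load callable, the bounds, and the tolerance are illustrative values, not the ones used in our testbed).

def find_mlffr(offer_load, lo_pps=0.0, hi_pps=20e6, tol_pps=10e3):
    """Binary-search the maximum loss-free forwarding rate (MLFFR).

    offer_load(rate_pps) -> True if the device forwarded every packet at
    that offered rate, False if it dropped any. This callable stands in
    for a real traffic generator; it is a hypothetical interface.
    """
    # Invariant: lo_pps is known loss-free, hi_pps is known (or assumed) lossy.
    while hi_pps - lo_pps > tol_pps:
        mid = (lo_pps + hi_pps) / 2.0
        if offer_load(mid):
            lo_pps = mid        # no loss: MLFFR is at least mid
        else:
            hi_pps = mid        # loss observed: MLFFR is below mid
    return lo_pps

# Toy stand-in: pretend the server sustains 7 Mpps for 64B packets.
if __name__ == "__main__":
    print(find_mlffr(lambda pps: pps <= 7.0e6))   # converges to ~7,000,000 pps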

3 Although our tools are proprietary, many of the measures they report are derived from public performance counters and, in these cases, our tests are reproducible. In an extended technical report, we will present in detail how our measures, where possible, can be derived from the public performance counters available on Intel processors.



Finally, to select what types of packet processing to test, we start from the following observation: what makes packet processing challenging (compared to other applications) is the need to rapidly stream large volumes of cache-unfriendly data in and out of the system. Hence, we start with a minimal forwarding test that reveals the fundamental capability of the server to move packets between input and output ports: traffic arriving at port i is just forwarded to port j—there is no routing-table lookup nor any form of packet processing. This simple test exercises the minimal subset of operations that any packet-processing workload typically incurs—consequently, the performance achieved for this workload serves as an upper bound on the achievable performance for all packet-processing workloads.

Having profiled this baseline workload in detail (in §4 and §5), we then consider (in §6) four specific router applications: (1) full IP routing, (2) IPsec packet encryption, (3) filtering based on Access-Control Lists, and (4) monitoring flow statistics.

A Note on Scheduling When multiple cores share a packet-processing workload, there are two potential points of contention: (1) the same network-interface queue may be accessed by multiple cores, resulting in synchronization overhead; (2) the same packet may be handled by multiple cores with different caches, resulting in unnecessary cache misses and memory accesses. Both can be avoided by assigning (1) the polling or update of each queue to a single core and (2) all components that handle the same packet (including queues) to a single core. In our tests, we satisfy these rules by using fixed and mutually exclusive input-output port pairs—traffic from port 1 is always sent to port 2 (and vice versa), from port 3 to port 4, and so on—and assigning each pair to a different core. However, these rules can be easily satisfied for any traffic matrix, if we have multiple (as many as cores) queues per network interface. Indeed, modern NICs support as many as 32 independent receive and transmit queues [5, 6]; together with receive-side scaling [7], this allows multiple cores to process packets from/to the same interfaces without contention.

We verify this with the following simple experiment: first, two cores send traffic, each through a different output port; then, two cores send traffic through the same output port. When we use one transmit queue for both cores, the MLFFR drops by approximately 30% compared to the one-port-per-core case (due to the locking overhead); in contrast, using two transmit queues, one per core, yields the same MLFFR as the one-port-per-core case.
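The port pairing described above amounts to a static mapping from queues to cores. The Python sketch below only illustrates that mapping rule (port 2k paired with port 2k+1, each pair pinned to one core); it is not the Click configuration we actually run, and the port/core counts are just the values of our testbed.

def pair_ports_to_cores(num_ports, num_cores):
    """Assign fixed, mutually exclusive input/output port pairs to cores.

    Port 2k forwards to port 2k+1 and vice versa; each pair is served by
    exactly one core, so no queue or packet is ever touched by two cores.
    Returns a list of (in_port, out_port, core) tuples.
    """
    assignment = []
    for k in range(num_ports // 2):
        core = k % num_cores              # one pair per core while cores last
        a, b = 2 * k, 2 * k + 1
        assignment.append((a, b, core))
        assignment.append((b, a, core))
    return assignment

if __name__ == "__main__":
    for in_p, out_p, core in pair_ports_to_cores(num_ports=16, num_cores=8):
        print(f"port {in_p:2d} -> port {out_p:2d}  handled by core {core}")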

4 Note that we use NICs with multiple 1Gbps ports to drive our server to rates of 10Gbps (and higher); we do so because of their low cost, availability of Click-compatible drivers, and ease of traffic generation. For our focus on the internal forwarding capability of a server, using ten 1Gbps ports is largely equivalent to using a single 10Gbps high-speed NIC.

Figure 3: Maximum loss-free forwarding rate for different workloads. "Abilene" corresponds to the packet-size distribution of the Abilene-I trace. All other workloads consist of same-size packet streams.

Figure 4: Maximum forwarding rate as a function of the number of cores with 64B packets.


3 Black-Box Forwarding Performance

Before presenting our results, it is worth noting the historical trend in software-router forwarding performance: the early NSF nodes achieved forwarding rates of 1K packets/sec (circa 1986), Click [23] achieved a maximum forwarding rate of 333Kpps (1999), which SMP Click [18] improved to 500Kpps (2001). Our experiments use the SMP Click software on recent hardware.

We start by measuring our server's MLFFR under minimal forwarding. Fig. 3 shows the results: (1) For 512B and larger packets, as well as the real-world Abilene workload, the server scales to 16Gbps; this is the maximum input traffic load we can generate, meaning that performance is, in this case, limited by the number of NICs our server can fit—we do not hit any bottleneck inside the server. (2) In contrast, for 64B packets, the server saturates at around 3.6Gbps or 7Mpps; the figure suggests that the problem is the packet rate (bottom figure) rather than the bit rate (top figure).

Fig. 3 shows the server's MLFFR when all 8 cores are forwarding traffic; we are also interested in understanding how performance scales with the number of cores. Thus, we repeat the above test with only a subset of the cores enabled for each test run. Fig. 4 shows the resulting MLFFR for the 64B packet workload as a function of the number of active cores (and, therefore, interfaces). We see two clear trends: (1) the MLFFR scales close to linearly up to 6 cores—the ideal behavior; (2) beyond this, the increase in MLFFR tapers off, suggesting that some system component is approaching saturation. We investigate potential bottlenecks in the following section.

4 Hardware Bottleneck Analysis

The results from the previous section show that, although forwarding performance scales well for large packet sizes, it falls far short of our target 10Gbps for small packet sizes. Hence, in this section, we look inside our server and search for the bottleneck that limits forwarding performance for a worst-case 64B packet workload.

Approach We look for the bottleneck through a process of elimination, starting with the four major system components—the CPUs and the three system buses—and drilling deeper as and when it appears warranted. For each system component, our typical methodology is to examine the maximum load the component sustains for the 64B packet workload, then try to construct a (potentially unrelated) benchmark that imposes even greater load on the component in question. If we succeed, we conclude that there is "room for growth" on that component and, hence, it is unlikely to be the bottleneck. For now, we only try to discover which system component is the bottleneck without explaining the load we see on each component—we leave that for the next section.

CPU The traditional metric of CPU utilization reveals little in our test setup, because Click operates in polling mode, where the CPUs are always 100% utilized. Fig. 4 already suggests that the CPUs are unlikely to be the bottleneck, as forwarding performance levels off despite the addition of more cores. In addition, we look at the number of empty polls (the number of times the CPU polls for packets to process but none are available in memory): even at saturation rate (3.4Gbps for 64B packets), we still see a non-trivial number of empty polls—approximately 62,000 per second, per core. Hence, we eliminate CPU processing as a candidate bottleneck.

System buses Our tools allow us to directly measure the load in bps on the FSB and memory bus; the load difference between these two buses gives us an estimate for the PCIe load. Note that this is not always a good estimate, as FSB bandwidth can be consumed by inter-socket communication, which does not figure on the memory bus; however, this does not happen in our particular setup (with each input/output port pair consistently served by the same socket), which yields little inter-socket communication.

Figure 5: Bus bandwidths for 64B packets.

Figure 6: FSB bus utilization for 64B and 1024B packets.

Fig. 5 shows the load on the FSB and memory buses for 64B packets under increasing input rates. Next, we examine each of these loads from Fig. 5 more closely.

FSB Under the covers, the FSB consists of separate data and address buses, and our tools allow us to separately measure the utilization of each. Fig. 6 shows the results: while it is clear that the data bus is under-utilized, it is not immediately obvious whether this is the case for the address bus as well. To gauge the maximum attainable utilization on each bus, we wrote a simple benchmark (we will refer to it as the "stream" benchmark from now on) that creates and writes to a very large array. This benchmark consumes 50Gbps of FSB bandwidth, which translates into 37% data-bus and 74% address-bus utilization. These numbers are well above the utilization levels from our packet-forwarding workload, which means that the latter does not saturate the FSB. Hence, we conclude that the FSB is not the bottleneck.

PCIe Unlike the FSB, where all operations are in fixed-size units (64B—a cache line), the PCIe bus supports variable-length transfers; hence, if the PCIe bus is the bottleneck, this could be due either to the incoming bit rate or to the requested operation rate (which depends on the incoming packet rate). We easily reject the former because, for 1024B packets, our minimal forwarding test consumes 36Gbps of PCIe bandwidth—more than the 20Gbps consumed by the same test for 64B packets. Hence, the PCIe bit rate is not the problem.

Regarding the PCIe packet rate: from Fig. 3, we know that, when all four NICs are forwarding traffic, the MLFFR settles around 1.7Mpps per NIC. Given that PCIe slots are independent from each other, if we can successfully drive the packet rate on a single NIC beyond this rate, then we will have shown that the packet rate of our workload does not saturate the PCIe bus. To this end, we run an experiment where we first send 64B packets through a single pair of input/output ports (at the highest possible rate we can send through this pair), then gradually add more port pairs until the server starts dropping packets. At that point, a single NIC sustains approximately 3.5Mpps—well above 1.7Mpps. Hence, the PCIe packet rate is not the problem either.

Memory This leaves us with the memory bus as the only potential culprit. There are two possibilities:

One is that our workload requires more aggregate memory bandwidth than our server can provide. We easily reject this, as the stream benchmark (described above) consumes 51Gbps of memory-bus bandwidth; the 33Gbps maximum consumed by our 64B minimal-forwarding workload is roughly 35% lower. One could argue that the difference is due to the different memory-access patterns followed by the two workloads: the forwarding workload follows an irregular access pattern compared to the nicely in-sequence pattern of the stream benchmark; it is possible that this difference allows the stream benchmark to benefit more from techniques like memory prefetching. We also reject this: we modify the stream benchmark so that it writes to random (rather than sequential) locations; this "randomized stream" benchmark does see lower memory throughput, but the drop is modest—from 51Gbps to 46Gbps, still above the 33Gbps from the forwarding workload. Hence, aggregate memory bandwidth is not the bottleneck.

The other possibility is that our workload's memory-access pattern makes sub-optimal use of the physical memory space: the memory system is internally organized as a grid of "ranks" and "banks," accessed through multiple memory "channels." The memory controllers map an address to a channel and rank-bank element based on a subset of address bits—the channel-select, rank-select and bank-select bits—chosen so that the logical address space is well distributed across physical memory elements. Accesses to different rank-bank elements (even within one channel) can proceed in parallel, but accesses to a single rank-bank element are serialized. If, for some reason, our workload ends up accessing only a small subset of the rank-bank elements, it fails to leverage the full parallelism of the memory system.

Our server has two memory controllers, each controlling two memory channels; each memory controller sees a grid of 4 ranks × 4 banks (2 ranks × 4 banks per channel), and our chipset tool reports the traffic to each rank-bank pair. Fig. 7 shows the distribution of memory traffic over the grid, aggregated across both controllers, for (1) the stream benchmark (Fig. 7(a)), (2) 1024B packets at 16Gbps (Fig. 7(b)), and (3) 64B packets at the saturation rate of 3.4Gbps (Fig. 7(c)). While memory traffic is perfectly balanced for the stream benchmark (and reasonably balanced for the 1024B packet workload), for the 64B packet workload it is concentrated on four rank-bank elements, two on each controller (recall that the figure shows the load aggregated across the two controllers).

This result suggests that the bottleneck is the bandwidth to the individual rank-bank elements that, for some reason, end up carrying most of the 64B packet workload. To verify this, we measure the maximum attainable bandwidth to a single rank-bank element: we use a simple "one-location stream" benchmark that runs multiple threads on all cores, all of which continuously read and write a single location in memory. The result is 7.2Gbps of memory traffic (all on a single rank-bank pair), which is almost equal to the maximum rank-bank load recorded at saturation for the 64B packet workload. We should note that both the CPUs and the FSB are under-utilized during the one-location stream test. Hence, we conclude that the bottleneck is the memory system, not because it lacks the necessary capacity, but because of the load imbalance across the physical memory organization.

Explaining the Imbalance Fig. 7 shows that, for 1024B packets, the memory load is better distributed than for 64B packets; this leads us to suspect that the imbalance is related to the manner in which packets are laid out onto the rank-bank grid. To further investigate, we run an experiment where we maintain a constant packet rate and measure the resulting memory-load distribution for different packet sizes (64B to 1500B); we choose a (low) packet rate of 400Kpps so that we can maintain it for all packet sizes. Fig. 8 shows the outcome: ignoring the load on {bank 2, rank 3},5 as packet size increases, the additional memory load is distributed over increasing numbers of rank-bank pairs.

This observation leads us to the following theory: the default packet-buffer size in Linux is 2KB; each such buffer spans the entire rank-bank grid, which would allow high memory throughput if we were using the entire 2KB allocation; however, our 64B packet workload ends up using only one rank-bank element on each memory channel, leading to the two spikes we see in Fig. 7(c).

5 We see this load even when the server is forwarding no packets at all, and it drops with increasing packet rate; we believe it is due to the normal polling-mode operations of Click and the operating system.


Figure 7: Memory load distribution across banks and ranks. (a) Stream benchmark; (b) 1024B-packet workload; (c) 64B-packet workload; (d) 64B packets with 1KB buffers.

To verify this, we reverse-engineer the algorithm used by the memory controllers to map addresses to channels, ranks, and banks: we find that the 7th and 8th address bits select the channel, the 9th bit selects the rank, while the 10th and 11th bits select the bank. This means that a contiguous 2KB address range indeed covers the entire rank/bank grid, while a 64B range can fit into a single rank-bank element; hence, if multiple 2KB buffers are allocated contiguously, and all buffers contain a 64B packet each at the same offset, a 64B packet workload may even hit a single rank-bank element.
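The effect of this bit mapping can be made concrete with a small Python sketch. The bit positions are exactly those stated above, but the buffer base, the packet offset, and the buffer counts are made-up illustrative values, and a real controller may interleave addresses in additional ways.

def rank_bank_of(addr):
    """Apply the reverse-engineered mapping: bits 7-8 select the channel,
    bit 9 the rank, bits 10-11 the bank."""
    channel = (addr >> 7) & 0x3
    rank = (addr >> 9) & 0x1
    bank = (addr >> 10) & 0x3
    return channel, rank, bank

def touched_elements(buf_size, pkt_offset=128, num_bufs=256, base=0):
    """Which (channel, rank, bank) elements do 64B packets hit when every
    packet sits at the same offset inside contiguous buffers of buf_size?"""
    return {rank_bank_of(base + i * buf_size + pkt_offset)
            for i in range(num_bufs)}

if __name__ == "__main__":
    print(len(touched_elements(2048)))   # 2KB buffers: only 2 distinct elements
    print(len(touched_elements(1024)))   # 1KB buffers: 4 distinct elements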

As a confirmation, we repeat our earlier experiment with 64B packets from Fig. 7(c), but now change the default buffer allocation size to 1KB. If a 2KB address space spans the entire grid, then 1KB should span half the grid and, hence, the two "spikes" in Fig. 7(c) should split into 4 spikes; Fig. 7(d) shows that this is indeed the case. Finally, if our conjecture that this imbalance is the performance bottleneck is right, then reducing the imbalance should translate to higher packet-forwarding rates; happily, using 1KB buffers increases the MLFFR by 22.2%, from 3.6Gbps to 4.4Gbps (or 7 to 8.2 Mpps).

A general solution to this issue (inspired by similar approaches in hardware routers [26]) would be to have packet buffers of various sizes (e.g., 64B, 256B, 1024B and 2048B). This can be implemented by creating multiple descriptor rings—one per buffer size; on receiving an incoming packet, the NIC would have to determine the packet size and use the appropriate descriptor ring. Unfortunately, implementing this architecture requires modifying the NIC firmware and internal design to make it aware of (and use) the transmit/receive queues that depend on the packet size. Nonetheless, our experiment with 1KB buffers clearly shows the cause of (and potential to remedy) the problem of imbalanced memory-load distribution when forwarding small packets.
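A rough sketch of the buffer-selection logic such a NIC (or driver) would need is shown below, in Python. The size classes and the pool abstraction are assumptions for illustration; we could not modify the actual NIC firmware to try this.

import bisect

# Hypothetical buffer size classes, one descriptor ring/pool per class.
SIZE_CLASSES = [64, 256, 1024, 2048]

def pick_pool(pkt_len):
    """Return the smallest buffer class that fits the packet, so that a
    64B packet no longer occupies (and maps like) a 2KB buffer."""
    i = bisect.bisect_left(SIZE_CLASSES, pkt_len)
    if i == len(SIZE_CLASSES):
        raise ValueError("packet larger than the largest buffer class")
    return SIZE_CLASSES[i]

if __name__ == "__main__":
    for length in (60, 64, 300, 1024, 1500):
        print(length, "->", pick_pool(length), "byte buffer")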

The Next Bottleneck Having overcome our initial bottleneck, we look for the subsequent one that limits further performance improvement. We find that it is the FSB address bus—at the improved 4.4Gbps MLFFR, we record an FSB address-bus utilization of 73%; discussions with architects and prior work [28] reveal that an address bus is regarded as saturated at approximately 75% utilization.

Figure 8: Memory load distribution across banks and ranks for different packet sizes and a fixed packet rate.

Unfortunately, this is a fairly fundamental bottleneck for shared-bus servers: the load on the address bus scales rapidly with input traffic rate, because, in addition to regular CPU-driven transactions, the address bus also carries cache snoop messages caused by DMA transactions from I/O to memory—for our workload, which has little inter-CPU communication, the load on the FSB address bus effectively scales in proportion to the number of transactions on the memory bus. In §7, we discuss two approaches to moving past this bottleneck—one based on hardware technology trends and the other based on an improved software architecture.

Summary This section crafted special-purpose benchmarks to estimate the loads and rates that can be sustained by individual system components. We then compared these to the corresponding rates measured for our 64B packet workload at saturation. Table 2 summarizes our findings: for each component, we define the "room for growth" as the percentage increase in usage that could be accommodated on the component before we hit the upper bound recorded for our custom benchmark. For example, for the stream benchmark, we measured 51Gbps of maximum aggregate memory bandwidth; for our 64B packet workload, at saturation and with 1KB buffers, we measured 46Gbps of aggregate memory bandwidth; thus, ignoring other bottlenecks (such as the per rank-bank load), there is room to increase memory-bus usage by about 10% ((51-46)/46) before hitting the 51Gbps upper bound.


system component   | attainable limit | load w/ 64B packets, | load w/ 64B packets, | room to grow
                   |                  | 2KB buffers          | 1KB buffers          | (w/ 1KB buffers)
1 rank-bank        | 7.2 Gbps         | 7.168 Gbps           | 3.6 Gbps             | 100%
FSB (address bus)  | 74%              | 62%                  | 73%                  | 0%
aggregate memory   | 51 Gbps          | 33 Gbps              | 46 Gbps              | 10%
PCIe               | 37 Gbps          | 19 Gbps              | 24 Gbps              | 65%

Table 2: Room for growth on each of the system components (the FSB address bus is reported as % utilization; all other rows are in Gbps).

Missing from this table is an estimate of the available CPU cycles; we evaluate this in §6, where we also evaluate the extent to which this leftover capacity accommodates additional processing in the context of various applications.
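The room-for-growth figures in Table 2 are simple ratios; a minimal Python sketch of the calculation, using only the aggregate-memory example worked out above:

def room_to_grow(attainable_limit, measured_load):
    """Percentage increase in usage that fits before hitting the limit,
    i.e. (limit - load) / load, as used for Table 2."""
    return 100.0 * (attainable_limit - measured_load) / measured_load

# Aggregate memory bus: 51Gbps attainable (stream benchmark) vs. 46Gbps
# measured at saturation for 64B packets with 1KB buffers.
print(round(room_to_grow(51.0, 46.0)))   # -> 11, i.e. the "about 10%" quoted above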

5 Software Efficiency

The previous section measured the load on each system component but made no attempt to explain it. We now try to deconstruct these measured loads as a way of assessing the packet-forwarding efficiency of our system. For this, we attempt to understand how the external input traffic translates into load on the internal system components of the server. Our results here are based on fixed-size packet workloads, repeated for different packet sizes: P = 64, 128, 256, 512 and 1024 bytes.

Bus Overheads We start by looking at the three system buses. Fig. 5 plotted the load on the system buses for increasing input traffic rates. To calibrate our expectations, we note that, for our minimal forwarding test, each incoming packet must result in at least the following set of operations:

1. the incoming packet is DMA-ed from the network card (NIC) to main memory (incurring one transaction on the PCIe and memory bus).

2. the packet is DMA-ed from memory to the NIC (one transaction on the memory and PCIe bus).

Note that we do not include any transfer of the packet to/from the CPU since, for our minimal forwarding test, the CPU does not even need to read packet headers to determine the output port. Thus, by this argument, a single packet traverses each of the memory and PCIe buses at least twice and does not traverse the FSB (for minimal forwarding). Thus, for a line rate of R bps we estimate a minimum required load of 2R, 2R, and 0 on the memory, PCIe and FSB buses, respectively. These are the minimum and unavoidable bus loads required to just bring packets in and out of the system. As is clear from Fig. 5, each bus sees loads that are significantly higher than this minimum, which indicates (not surprisingly) that all three buses incur an extra per-packet overhead, beyond just moving packets around.

Figure 9: Per-packet overhead ratio as a function of incoming packet rate for different packet sizes.

The extent of this overhead is indicative of the efficiency of packet processing with respect to bus usage. We quantify this overhead as the number of extra per-packet transactions (i.e., transactions that are not due to moving packets between NIC and memory) performed on each bus as follows:

(measured load − estimated load) / (packet rate × transaction size)

where the estimated load is the minimum required load we estimated above, and the transaction size is 64B, since this is the unit of transactions on the memory and FSB bus. Fig. 9 plots this number for the FSB and memory bus as a function of the packet rate and size; the PCIe overhead is simply the difference between the other two. So, the FSB and PCIe overheads start around 6, while the memory-bus overhead starts around 12; all overheads drop slightly as the packet rate increases.
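Concretely, the overhead ratio plotted in Fig. 9 can be computed as in the Python sketch below; the numbers in the usage line are purely illustrative, not our measurements.

def extra_transactions_per_packet(measured_bps, pkt_rate_pps, pkt_size_bytes,
                                  min_traversals=2, transaction_bytes=64):
    """Extra per-packet bus transactions beyond the minimum needed to move
    packets: (measured load - estimated minimum load) / (packet rate * 64B).

    min_traversals is 2 for the memory and PCIe buses (one DMA in, one DMA
    out per packet) and 0 for the FSB under minimal forwarding.
    """
    estimated_bps = min_traversals * pkt_rate_pps * pkt_size_bytes * 8
    return (measured_bps - estimated_bps) / (pkt_rate_pps * transaction_bytes * 8)

# Illustrative numbers only: at 1 Mpps of 64B packets, an observed memory-bus
# load of ~7.2 Gbps would correspond to about 12 extra 64B transactions/packet.
print(extra_transactions_per_packet(7.2e9, 1e6, 64))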

These overheads make sense once we consider the transactions for book-keeping socket-buffer descriptors. For each packet transfer from NIC to memory, there are three such transactions on each of the FSB and PCIe bus: the NIC updates the corresponding socket-buffer descriptor, as well as the descriptor ring (two PCIe and memory-bus transactions); the CPU reads the updated descriptor, writes a new (empty) descriptor to memory, and updates the descriptor ring accordingly (three FSB and memory-bus transactions); finally, the NIC reads the new descriptor (one PCIe and memory-bus transaction). Each packet transfer from memory to NIC involves similar transactions and, hence, descriptor book-keeping accounts for the 6 extra per-packet transactions we measure on the FSB and PCIe bus—and, hence, the 12 extra transactions measured on the memory bus. The slight overhead drop as the packet rate increases is due to the cache that optimizes the transfer of multiple (up to four) descriptors with each 64B transaction (each descriptor is 16B); this optimization kicks in more often at higher packet rates.

We should note that these overheads are surprisingly high, especially for small packets: for 1024B packets, 12 per-packet transactions on the memory bus translate into a 37.5% overhead; for 64B packets, this represents an overhead of 600%! This suggests that software architectures that reduce these overheads could significantly improve performance; we discuss this in Section 7.
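The accounting above can be tallied directly; the Python sketch below just reproduces that bookkeeping and the resulting overhead percentages.

# Per direction (NIC->memory or memory->NIC), as described in the text:
# the NIC writes the descriptor and the ring, the CPU reads the descriptor,
# writes a fresh one and updates the ring, and the NIC reads the new one.
PER_DIRECTION = {"pcie": 2 + 1, "fsb": 3, "memory": 2 + 3 + 1}
TRANSACTION_BYTES = 64

def descriptor_overhead_percent(pkt_size_bytes):
    """Descriptor bookkeeping bytes relative to the bytes needed just to
    move the packet in and out of memory (2 x packet size)."""
    extra_tx = 2 * PER_DIRECTION["memory"]        # both directions: 12
    extra_bytes = extra_tx * TRANSACTION_BYTES    # 768B on the memory bus
    return 100.0 * extra_bytes / (2 * pkt_size_bytes)

print({bus: 2 * n for bus, n in PER_DIRECTION.items()})   # {'pcie': 6, 'fsb': 6, 'memory': 12}
print(descriptor_overhead_percent(1024))                  # 37.5 (%)
print(descriptor_overhead_percent(64))                    # 600.0 (%)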

CPU Overhead To quantify the CPUs' efficiency in processing packets, we record the instructions-per-packet (IPP) and cycles-per-instruction (CPI) consumed. For our minimal forwarding workload, we record an IPP of 877 and a CPI of 1.25. We record these at an input packet rate well below the MLFFR to avoid FSB saturation—FSB saturation would result in a high CPI (as the CPUs would wait longer for FSB accesses to complete); such a high CPI would reflect the (in)efficiency of the FSB rather than that of the CPUs. The low IPP indicates that, as expected, our minimal forwarding workload requires little CPU processing, while the low CPI shows that these instructions are efficiently dispatched and executed. The following section looks at packet-processing applications that make greater demands on CPU resources.

6 Packet Processing and Programmability

So far, our test workload involved minimal per-packet processing. We now examine certain trade-offs between performance and additional per-packet processing (i.e., over and above minimal forwarding). Such additional processing can lead to a need for more (1) CPU cycles and/or (2) memory or I/O bandwidth—hence, how much "room" is available for such processing depends on the system capacity that is left over after minimal forwarding. We first quantify this leftover capacity in an application-agnostic context (§6.1) and then examine specific applications (§6.2).

6.1 Available Capacity

CPU cycles We have already mentioned that we cannot use CPU utilization as our metric—our servers operate in polling mode, so their CPUs are always 100% utilized. Instead, we estimate the number of available CPU cycles as the number of CPU cycles "wasted" in empty polling: we select an instruction I that involves only CPU processing, and invoke this instruction k times per packet; given a certain workload, we increase k until the number of empty polls approaches zero while maintaining the same forwarding rate. Let kmax denote the number of invocations of I at this point; we compute the available cycles-per-packet as kmax·Ci, where Ci is the number of cycles required to execute instruction I. For I, we select the instruction that reads a 64-bit Model Specific Register and consumes 223 cycles on our system.
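The estimation loop is straightforward; a minimal Python sketch, in which the run_workload hook and the toy saturation point are hypothetical stand-ins for the actual testbed:

MSR_READ_CYCLES = 223   # cycles consumed by the reference instruction I

def estimate_spare_cycles(run_workload, k_max_search=10_000):
    """Increase the per-packet busy-work k until empty polls approach zero
    while the forwarding rate is still sustained; spare cycles ~= k_max * C_I.

    run_workload(k) -> (empty_polls_per_sec, rate_sustained) is a
    hypothetical hook into the testbed, not a real API.
    """
    for k in range(1, k_max_search):
        empty_polls, sustained = run_workload(k)
        if not sustained:
            return (k - 1) * MSR_READ_CYCLES   # backed off one step
        if empty_polls == 0:
            return k * MSR_READ_CYCLES          # k_max reached
    raise RuntimeError("no saturation point found")

# Toy stand-in: empty polls fall to zero at k = 40 and the rate holds.
print(estimate_spare_cycles(lambda k: (max(0, 40 - k), True)))   # 40 * 223 = 8920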

Figure 10: Spare cycles per packet on Xeon server.


Fig. 10 shows the resulting available cycles per packet for increasing incoming traffic rates and two packet sizes; these roughly correspond to the available instructions per packet, since we typically record cycles-per-instruction values of 1.0-1.3 in our tests. As a point of reference, recall that our minimal forwarding workload uses 877 instructions; in what follows we also consider applications that make full use of the available CPU cycles. Finally, we note that these numbers can be viewed as conservative, since our 1.6GHz CPUs are relatively slow by current standards (e.g., quad-core 3.4GHz Xeon chips are already available), while servers with 32 cores are reported to be due for release within the next two years.

Bus Bandwidths We evaluated the available memory and I/O capacity in Section 4 (summarized in Table 2). To study the extent to which this leftover capacity can be exploited by realistic applications, we now look at four different packet-processing applications.

6.2 Packet Processing Applications

We consider the following applications for evaluation:

1. IP routing
2. IPsec packet encryption
3. Filtering based on Access Control Lists (ACLs)
4. Monitoring flow statistics

Our selection is intended to represent packet-processing applications that are both commonly found in current deployments and sufficiently diverse in their computational needs. For example, IP routing and filtering can require multiple accesses to data structures other than the packet in question; packet encryption involves CPU-intensive processing of the packet being processed but little manipulation of additional data structures; and so forth.


Test         | Description
small_table  | 4 routing-table entries, 64B packets, random destination addresses
small_table+ | as above, but with 1KB packet buffers
big_table    | 256K routing-table entries, 64B packets, random destination addresses

Table 3: Experiments with full IP forwarding.

Our applications are all implemented in Click, and we run our tests for the synthetic 64B and 1024B packet workloads; results for the Abilene workload were identical to those for 1024B packets.

(1) IP routing We augment our minimal forwarding Click configuration for full IP routing, which includes a checksum calculation, updating packet header fields, and performing a longest-prefix lookup of the packet destination address in the IP routing table. The IP lookup algorithm we use is the D-Lookup algorithm from [22]. We run three different test cases as described in Table 3, and Fig. 11 shows the corresponding MLFFR.
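For readers unfamiliar with the lookup step, the Python sketch below shows a generic longest-prefix match over an IPv4 table. It is not the D-Lookup algorithm of [22], just a plain dictionary-based illustration of the operation that full IP routing adds over minimal forwarding; the routes in the usage example are made up.

import ipaddress

def build_table(routes):
    """routes: iterable of (CIDR prefix, next hop). Group entries by prefix
    length so lookup can try the longest lengths first."""
    table = {}
    for prefix, nh in routes:
        net = ipaddress.ip_network(prefix)
        table.setdefault(net.prefixlen, {})[int(net.network_address)] = nh
    return table

def lookup(table, dst):
    """Longest-prefix match: mask the destination with each prefix length,
    longest first, and return the first hit."""
    d = int(ipaddress.ip_address(dst))
    for plen in sorted(table, reverse=True):
        key = d & (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF
        if key in table[plen]:
            return table[plen][key]
    return None   # no default route configured

if __name__ == "__main__":
    t = build_table([("10.0.0.0/8", "port1"), ("10.1.0.0/16", "port2"),
                     ("0.0.0.0/0", "port0")])
    print(lookup(t, "10.1.2.3"))   # port2 (the /16 wins over the /8)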

In the first small_table test, our aim is to measure the effect of processing packet headers (which our minimal forwarding test didn't do), and hence we use a very small routing table, which ensures that all route lookups hit the cache. Relative to minimal forwarding, we see that full IP routing lowers the MLFFR for this test by about 38%; this is because IP routing requires that the CPUs read/write packet headers from/to memory, and these additional memory accesses result in earlier saturation of our memory rank-bank bottleneck. To validate this, the small_table+ test repeats small_table with packet-buffer sizes of 1KB, and we see that (as expected) this improves throughput to roughly 5.6Mpps; the bottleneck is now the FSB. The third big_table experiment uses a standard full-sized IP routing table containing 256K entries. We see that using a larger routing table does not affect performance significantly, suggesting that our (8MB) cache absorbs most route lookups. For 1024B packets, IP routing scales to 16Gbps, the maximum input traffic we can generate.

(2) IPsec packet encryption In this test, we implement IPsec encryption of the packet contents, as typically used by VPNs. For workloads of 64B and 1024B packets, we record an MLFFR of 1.52Gbps (1.66Mpps) and 3.0Gbps (349Kpps), respectively. Not surprisingly, we find that the bottleneck for this test is the CPU, because of the computationally intensive encryption. This is clear from Fig. 12, which plots the MLFFR for increasing numbers of cores: we see that the MLFFR scales linearly from 1 up to 8 cores (rather than saturating beyond 4-6 cores as with minimal forwarding). The 1024B packet workload performs worse than the 64B one, since the per-packet cost of encryption is higher.

Figure 11: IP routing performance with 64B packets.

Figure 12: Packet encryption performance.


(3) Filtering based on ACLs In this test, the router stores a pre-defined set of rules that are defined on packet header fields. When a packet arrives, its header is matched against each rule in sequence; a packet that matches any rule is dropped, otherwise it is forwarded to the appropriate output port. For uniformity, we construct our input traffic workload such that packets do not match any rule, thus forcing a complete traversal of the rule table for every packet. Fig. 13 shows the resultant MLFFR for different numbers of rules. We see that, for small rule-table sizes and 64B packet workloads, performance is lower than minimal forwarding and essentially identical to that of IP routing. The reason performance drops is the same as for IP routing, namely that the CPU now reads the actual packet headers. For larger 1024B packets and small rule tables, performance is again only limited by the total traffic we can feed the server and not by any bottleneck within the server itself. Finally, we see that as the number of rules increases, the cost of the linear traversal of the rule table eventually causes performance to drop (e.g., at 1000 rules), and the CPU is the bottleneck at this point.
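The filter itself is the textbook linear scan; a minimal Python sketch follows, with made-up rule fields and values chosen only for illustration.

# Each rule matches on a subset of header fields; None means wildcard.
RULES = [
    {"proto": 6,  "dst_port": 23},            # drop telnet
    {"proto": 17, "dst_port": 161},           # drop SNMP
    {"src_ip": "192.0.2.0", "proto": None},   # drop a specific source
]

def matches(rule, pkt):
    return all(v is None or pkt.get(k) == v for k, v in rule.items())

def filter_packet(pkt, rules=RULES):
    """Walk the rule list in order; any hit drops the packet. Our test
    traffic is built so that no packet matches, forcing a full traversal."""
    for rule in rules:
        if matches(rule, pkt):
            return "drop"
    return "forward"

print(filter_packet({"proto": 6, "dst_port": 80, "src_ip": "198.51.100.7"}))
# -> "forward", after scanning every rule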

(4) Monitoring Flow Statistics Our final application is a per-flow packet counter, a basic building block for traffic monitoring and measurement systems. For this, we maintain a hash table that stores an entry for every traffic flow, as defined by the standard 5-tuple (source and destination IP addresses and ports, and protocol field). When a packet arrives, its corresponding flow entry is updated (and created if it doesn't exist).


Figure 13: Filtering performance (64B and 1024B packets).

Figure 14: Per-flow monitoring performance.

Fig. 14 shows the packets per second forwarded as a function of the number of cores. We see that performance is similar to the case of IP routing and filtering, and can be attributed to the same causes and bottlenecks.
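The per-flow counter is a standard hash-table update on the 5-tuple; a minimal Python sketch (the packet fields in the usage line are invented for illustration):

from collections import defaultdict

# flow key: (src IP, dst IP, src port, dst port, protocol)
flow_packets = defaultdict(int)
flow_bytes = defaultdict(int)

def account(pkt):
    """Update (creating on first sight) the counters for the packet's flow."""
    key = (pkt["src"], pkt["dst"], pkt["sport"], pkt["dport"], pkt["proto"])
    flow_packets[key] += 1
    flow_bytes[key] += pkt["len"]

account({"src": "10.0.0.1", "dst": "10.0.0.2",
         "sport": 1234, "dport": 80, "proto": 6, "len": 64})
print(len(flow_packets), sum(flow_bytes.values()))   # 1 flow, 64 bytes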

7 Projected Improvements and Summary

Our results identified two major causes that limit the forwarding performance of our server: (1) the shared nature of the FSB, which causes it to be easily saturated, and (2) the inordinately high per-packet overheads that result from current packet-descriptor handling. In this section, we discuss two approaches to tackling these issues. We stress that the results presented here are not intended to be comprehensive or conclusive; our intent is merely to offer preliminary evidence of the potential benefits of the solutions we discuss.

Replacing shared buses by point-to-point links The first "solution" is simply to use server architectures that altogether eschew the use of a shared FSB. Fortunately, this is exactly the design philosophy being adopted by newer server architectures. For example, Fig. 15 shows the high-level "mesh" architecture to be supported by the emerging generation of multi-core servers [11].6

6 These architectures are essentially the multi-socket configuration for the recently released single-socket Nehalem platform. Unfortunately, the currently available single-socket machines do not offer the benefit of multiple memory controllers; the multi-socket machines are expected to be available by late 2009.

Figure 15: Mesh server architectures based on point-to-point inter-socket links

These mesh architectures bring two fundamental changes relative to the shared-bus one. First, the FSB is replaced by a mesh of dedicated high-speed point-to-point links between sockets (e.g., AMD's Direct Connect(tm) or Intel's QuickPath(tm) links), thus removing a potential bottleneck for inter-CPU communication. Second, the single external memory controller (previously shared across CPUs) is replaced with a memory controller integrated within each CPU; this leads to a dramatic increase in aggregate memory bandwidth, as each CPU now has a dedicated link to a portion of the memory space. We use the term "k-mesh" to refer to a mesh architecture with k sockets; the figure shows a k = 2 and a k = 4 socket example. We note that the mesh architecture described above is inspired by earlier designs of AMD Opteron-based PCs [1] but extends them in two key ways to improve scalability: (1) sockets are connected in a full mesh rather than a grid, and (2) each socket has its own I/O link.7

Compared to the shared-bus Xeon, in a mesh architecture no single bus sees its load grow with the aggregate of memory transactions. Instead, each inter-socket link only carries the load driven by the CPUs and local memory systems it interconnects. Thus, for the same external line rate, we can expect the per-bus load to be lower in a mesh architecture than in a shared-bus one.

Using reasoning similar to that from Section 5, one can argue that for the 4-mesh, each packet contributes at least 2 memory-bus transactions, 0 inter-socket communication transactions, and 2 PCIe transactions.

7We repeated our minimal-forwarding test on a Sun x4500 with 4 dual-core 3.0GHz Opteron 8222 CPUs and 16GB of system memory but achieved a performance of only 6.4Mpps, lower than with our Xeon server. We suspect the poor performance is due to the two scaling liabilities mentioned above but lack the necessary tools to carefully analyze performance on this machine. We thus continued with our focus on Intel mesh-based servers.



Figure 16: Nehalem v. Xeon throughput for different workloads (64B and 1024B packets); one panel reports Gbps, the other Mpps.

In a quad-socket server, we have 4 memory buses, 6 inter-socket links, and 4 PCIe links; hence, assuming uniform load distribution across the system, a line rate of R bps leads to a minimum bus load of R/2, 0, and R/2 on each of the memory, inter-socket, and PCIe buses, compared to 2R, 0, and 2R in the shared-bus case, representing a 4-fold load reduction. Generalizing the argument, one might hope that a k-socket mesh yields a k-fold reduction in per-bus loads and hence a k-fold performance improvement.
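Restating this accounting explicitly (a back-of-the-envelope estimate assuming uniform load distribution and the per-packet transaction counts given above):

\[
\text{per-link load (k-mesh)}:\quad \frac{2R}{k}\ \text{(memory)},\qquad 0\ \text{(inter-socket)},\qquad \frac{2R}{k}\ \text{(PCIe)}
\]
\[
\text{per-bus load (shared bus)}:\quad 2R\ \text{(FSB/memory)},\qquad 2R\ \text{(PCIe)}
\]
\[
\text{reduction} \;=\; \frac{2R}{2R/k} \;=\; k \qquad (\text{here } k = 4\text{, giving } R/2 \text{ per bus}).
\]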

As a sanity check on these estimates, we obtained access to an evaluation prototype of a 2-socket Intel Nehalem(tm) platform [14]. The server's hardware characteristics are specified in Table 4, and it runs the same software installation as our Xeon server. The Nehalem server takes three 10Gbps cards, each connected to an additional machine for sourcing/sinking traffic, through which we can generate a maximum of approximately 27.5Gbps.

We are restricted to black-box testing due to our limited access to the machine, its prototype status, and because the server does not yet support our monitoring tools. We measure the MLFFR for the same minimal forwarding test and workloads as before. Fig. 16 shows the results (for convenience, we also replot the Xeon rates): we see that for 64B packets, the 2-socket Nehalem does indeed more than double the throughput achieved by the Xeon; for larger packets, performance is once again limited only by the maximum input traffic we can generate.

While comprehensive testing is required to validate our estimates, these results suggest that packet-forwarding workloads do benefit from the trend towards increasingly parallel server architectures and that forwarding performance appears to scale as expected with the degree of parallelism.

Reducing per-packet overheads As suggested by Section 5, one approach to improving forwarding performance, even with a shared-bus server, would be to reduce the per-packet descriptor overhead. This might be achieved by having a single descriptor summarize n packets; n can be fairly small (≤ 10) so as not to affect NIC storage requirements, and can even be made adaptive based on the incoming packet rate. Amortization can, however, impose increased delay, which can be controlled by having a timeout that regulates the maximum time the NIC can wait before transferring packets. Setting the interval to a small multiple (e.g., 2n times) would guarantee an acceptable delay penalty. Such a scheme would represent a fairly modest change to existing NIC and software architectures and does not affect the programmability of the system; yet it does require modifying NIC logic to which we have no access and hence cannot experiment with. Instead, in what follows, we use our measurements from the previous sections to estimate the potential performance improvement such a scheme might offer.
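To illustrate the kind of NIC-side change we have in mind, the sketch below shows one possible (entirely hypothetical) form of the batching logic; the structure layout, batch size, and timeout constant are placeholders we chose for illustration, not an existing NIC interface.

/* Hypothetical aggregate-descriptor logic on the NIC side: instead of one
 * DMA'd descriptor per packet, a single descriptor summarizes up to
 * BATCH_N packets, and a timeout bounds the added latency. */
#include <stdint.h>

#define BATCH_N        10      /* packets summarized per descriptor */
#define TIMEOUT_CYCLES 20000   /* illustrative bound on waiting time */

struct agg_descriptor {
    uint8_t  count;                /* packets covered by this descriptor */
    uint16_t length[BATCH_N];      /* per-packet lengths */
    uint64_t buffer_addr[BATCH_N]; /* per-packet host buffer addresses */
};

struct batch_state {
    struct agg_descriptor cur;
    uint64_t first_arrival;        /* cycle count when batch was opened */
};

/* Called for each received packet. Returns 1 when the aggregate descriptor
 * should be written to the host; that single transfer is what amortizes the
 * per-packet descriptor transactions on the FSB and PCIe buses. */
static int on_packet_rx(struct batch_state *s, uint64_t buf, uint16_t len,
                        uint64_t now_cycles)
{
    if (s->cur.count == 0)
        s->first_arrival = now_cycles;
    s->cur.buffer_addr[s->cur.count] = buf;
    s->cur.length[s->cur.count] = len;
    s->cur.count++;

    if (s->cur.count == BATCH_N ||
        now_cycles - s->first_arrival >= TIMEOUT_CYCLES)
        return 1;   /* flush: DMA one descriptor covering count packets */
    return 0;       /* keep accumulating */
}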

The relative performance improvement we can expect depends on the reduction in the per-component overheads (due to amortization) and the room-for-growth available on each system component. From the per-packet bus overheads we recorded earlier (§5, Fig. 9), we can compute the reduction in per-bus load that would result from using amortized descriptors. For instance, from Fig. 9 we can approximate the load on the memory bus as 2 · bit rate + 10 · packet rate · transaction size (taking the average of 8x to 12x in Fig. 9); were we to transfer, say, n = 10 descriptors with a single transaction, this load would reduce to 2 · bit rate + packet rate · transaction size, which corresponds to a 4-fold reduction in the load on the memory bus. Since FSB address-bus utilization scales with memory transactions, we likewise expect a 4-fold reduction in load on the FSB address bus. Applying a similar line of reasoning to the PCIe, we can show that, for 64B packets, descriptor amortization stands to reduce PCIe load by a factor of 2.5.
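As an explicit check of the 4-fold figure for the memory bus (our arithmetic, under the assumptions that packets are 64B, so bit rate = 512 · packet rate, and that descriptor-related transactions are roughly cache-line-sized at 64B = 512 bits):

\[
\text{before: } 2\cdot\text{bit rate} + 10\cdot\text{packet rate}\cdot\text{transaction size} = (2\cdot 512 + 10\cdot 512)\cdot\text{packet rate} = 6144\cdot\text{packet rate}
\]
\[
\text{after: } 2\cdot\text{bit rate} + 1\cdot\text{packet rate}\cdot\text{transaction size} = (2\cdot 512 + 1\cdot 512)\cdot\text{packet rate} = 1536\cdot\text{packet rate}
\]
\[
\frac{6144}{1536} = 4.
\]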

We next consider the "room-for-growth" available on each system component. This was evaluated in §4 and summarized in Table 2; we showed a room-for-growth of 0%, 10%, and 65% (a factor improvement of 1.0, 1.1, and 1.65) on the FSB, memory, and PCIe buses respectively.

Thus, combining the effect of reduced overheads with the available room-for-growth, we estimate that one can expect to accommodate a factor increase of 4 (4 × 1.0), 4.4 (4 × 1.1), and 4.12 (2.5 × 1.65) on each of the FSB, memory, and PCIe buses respectively. Since the maximum improvement that can be accommodated on all buses is a factor of 4, we argue that a 10-to-1 amortization of descriptor-related overheads could offer a 4-fold improvement in the minimal forwarding rate, from 4.4Gbps to 17.6Gbps. Note that one could make a similar argument for mesh-based servers as well, thus combining the benefits of parallelism and more efficient packet descriptor architectures.



Sockets: 2
Cores/Socket: 4
Clock rate: 2.4 GHz
Inter-socket BW: 256 Gbps
Memory: DDR3 667 MHz
PCIe slots: 2x8, 1x4
Cache: 16MB

Table 4: Hardware characteristics of the experimental 2-socket Nehalem server

Figure 17: Summary of minimal forwarding performance (Gbps). Bars show 64B and Abilene-trace workloads for the Xeon and Nehalem servers, plus an estimate for 64B packets with amortized descriptors.

Summary We return to our questions from Section 1 and recap our findings.

Q1: What packet-processing performance are current commodity multi-core servers capable of?

We found that for realistic packet-size workloads, current multi-core architectures with unmodified software stacks easily scale to meet our target rate of 10Gbps, but fall short for worst-case workloads consisting of only small packets. Fig. 17 presents our key performance results for minimal forwarding.

Q2: What hardware bottlenecks limit performance?

We found that the initial bottleneck was at the memory system, due to an unfortunate combination of packet layout and memory-chip organization. After addressing this issue, we found that the next bottleneck was at the FSB address bus.

Q3: How effective is network software at exploiting the available hardware resources?

We showed that the "chatty" manner in which packet descriptors are managed imposes inordinately high per-packet overhead on the memory and FSB buses.

Q4: What tradeoffs between performance and per-packet processing are feasible?

We presented performance results (with bottleneck analysis) for four representative packet-processing applications in Section 6; in each case, our results were consistent with (and aided by!) our findings from the study of our baseline minimal forwarding workload.

Q5: What (hardware or software) architectural changes might further improve performance?

We discussed the use of new mesh-based server architectures as a way of avoiding the bottleneck due to the FSB and presented initial results showing a more than two-fold performance increase from a prototype mesh server. We also discussed the use of aggregate packet descriptors to improve the efficiency of network software; back-of-the-envelope analysis suggests this might offer a significant performance increase, e.g., a 4x improvement using descriptors that summarize 10 packets.

Taken together, these findings lead us to be cautiously optimistic about the ultimate viability of scaling software routers. They show that, although current commodity servers do offer commendable performance, just having multiple cores is not enough. Instead, the parallelism in CPUs should be accompanied by a corresponding parallelism in memory capacity and by more efficient packet descriptor architectures. Fortunately, the former requirement is well aligned with current server technology trends, and the latter requires only incremental changes to NICs and operating systems.

Finally, we note that the above analyses are all based on min-sized packet workloads. However, this workload is a particularly punishing one for software routers; for realistic workloads, even current off-the-shelf hardware and software scales to over 10Gbps. Moreover, this workload has remained important more due to its traditional role in benchmarking commercial network equipment than because of its realism. This leads us to suggest that it might be valuable for the community to revisit its traditional emphasis on min-sized packet workloads as the benchmark for router performance.

8 Related and Future Work

Click [23] and Scout [25] explored questions of architecting router software for extensibility; SMP Click [18] extends the early Click architecture to better exploit multiprocessor PCs. These efforts focused primarily on software design for extensibility and stop at black-box performance evaluations. Our work focuses primarily on performance: we assume Click's software architecture and take an in-depth look at the scaling limits of recent multi-core servers.

There is an extensive body of work on benchmarking various application workloads on general-purpose servers. The vast majority of this work is in the context of end-system workloads and (consequently) benchmarks such as TPC-C. There is also a large body of work on packet processing using a combination of general-purpose processors and specialized hardware (e.g., see [24] and the references therein). Most recently, Turner et al. describe a Supercharged PlanetLab Platform [27] for high-performance overlays that combines general-purpose servers with network processors (for slow- and fast-path processing respectively); they achieve forwarding rates of up to 5Gbps for 130B packets.



We focus instead on general-purpose servers, and our results suggest that these offer competitive performance.

In an earlier study, Bianco et al. also measured the routing performance of an Intel Xeon-based server, albeit a single-core one, equipped with a PCI-X (rather than PCIe) I/O bus. Not surprisingly, they report different bottlenecks: the (single) CPU and the PCI-X bus; these bottlenecks were inferred through black-box testing [17].

Recent work by some of our authors [20] explores the use of commodity hardware to build virtual routers. That work focuses primarily on questions of virtualization rather than detailed performance analysis. Finally, our work also builds on a recent position paper making the case for cluster-based software routers [16]; the paper identifies the need to scale servers to line rate but does not explore the issue of performance in any detail.

We plan to extend our work along three main directions. First, we are discussing the possibility of implementing the aggregate descriptor scheme with Intel NIC development groups. Second, we hope to repeat our analysis on the multi-socket Nehalem servers once available [14]. Finally, we are currently working to build a cluster-based router prototype as described in earlier work [16].

9 Conclusion

A long-held perception has been that general-purpose processors are incapable of high-speed packet forwarding, motivating an industry around the development of specialized (and often expensive) network equipment. However, it is valuable to recalibrate our perceptions as modern PC technology evolves. In this paper, we revisit old questions about the scalability of in-software packet processing in the context of recent and emerging server technology trends. Our results show that although current multi-core servers do offer commendable performance, there is still a need for additional improvement. We reveal why this is the case and show that: (1) on the hardware side, the improvements required are well aligned with emerging server technology trends and (2) the improvements required on the software side are achievable with relatively incremental changes to operating systems and NIC firmware. We hope that our results, taken together with the growing need for more flexible network infrastructure, will spur further exploration into the role of commodity servers and operating systems in building future networks.

References

[1] AMD Direct Connect Architecture. amd.com.
[2] Arista Networks. aristanetworks.com.
[3] Cavium Networks. caviumnetworks.com.
[4] Cisco Opening Up IOS. networkworld.com/news/2007/121207-cisco-ios.html.
[5] Intel 10 Gigabit XF SR Server Adapters. intel.com.
[6] Intel Gigabit Quad Port Adapter. intel.com.
[7] Intel I/O Acceleration Technology. intel.com/technology/ioacceleration.
[8] Intel VTune Analyzer. intel.com.
[9] National Laboratory for Applied Network Research. http://pma.nlanr.net.
[10] NetFPGA. stanford.edu/NetFPGA/.
[11] Next-Generation Intel Microarchitecture. intel.com/technology/architecture-silicon.
[12] Open Source Networking. vyatta.com/.
[13] Workshop on Programmable Routers for Extensible Services. sigcomm.org/workshops.
[14] Intel Demonstrates Industry's First 32nm Chip and Next-Generation Nehalem Microprocessor Architecture, 2007. intel.com/pressroom/releases/20070918corp_a.htm.
[15] Juniper Open IP Solution Development Program, 2007. juniper.net/company/presscenter/pr/2007/pr-071210.html.
[16] K. Argyraki et al. Can Software Routers Scale? In ACM SIGCOMM PRESTO Workshop, Aug. 2008.
[17] Bianco et al. Click vs. Linux: Efficient Open-Source IP Network Stacks for Software Routers. In IEEE Workshop on High Performance Switching, 2005.
[18] B. Chen and R. Morris. Control of Parallelism in a Multiprocessor PC Router. In USENIX Technical Conference, 2001.
[19] D. Comer. Network Processors. cisco.com/web/about/ac123/ac147/archived/ipj_7-4/.
[20] N. Egi, A. Greenhalgh, M. Handley, M. Hoerdt, F. Huici, and L. Mathy. Towards High Performance Virtual Routers on Commodity Hardware. In ACM CoNEXT, Dec. 2008.
[21] R. Ennals et al. Task Partitioning for Multi-Core Network Processors. In ICCC, 2005.
[22] P. Gupta, S. Lin, and N. McKeown. Routing Lookups in Hardware at Memory Access Speeds. In IEEE Infocom, Mar. 1998.
[23] Kohler et al. The Click Modular Router. ACM ToCS, 18(3):263–297, Aug. 2000.
[24] Mudigonda et al. Reconciling Performance and Programmability in Routers. In SIGCOMM, 2007.
[25] Spalink et al. Building a Software-Based Router Using Network Processors. In SOSP, 2001.
[26] Cisco Systems. Introduction to Cisco IOS.
[27] J. Turner et al. Supercharging PlanetLab: A High Performance Overlay Platform. In SIGCOMM, 2007.
[28] B. Veal and A. Foong. Performance Scalability of a Multi-Core Web Server. In ACM ANCS, 2007.
