
Maintaining High Bandwidth under Dynamic Network Conditions

Dejan Kostić, Ryan Braud, Charles Killian, Erik Vandekieft, James W. Anderson, Alex C. Snoeren and Amin Vahdat∗

Department of Computer Science and Engineering, University of California, San Diego

{dkostic, rbraud, ckillian, evandeki, jwanderson, snoeren, vahdat}@cs.ucsd.edu

Abstract

The need to distribute large files across multiple wide-area sites is becoming increasingly common, for instance, in support of scientific computing, configuring distributed systems, distributing software updates such as open source ISOs or Windows patches, or disseminating multimedia content. Recently a number of techniques have been proposed for simultaneously retrieving portions of a file from multiple remote sites with the twin goals of filling the client's pipe and overcoming any performance bottlenecks between the client and any individual server. While there are a number of interesting tradeoffs in locating appropriate download sites in the face of dynamically changing network conditions, to date there has been no systematic evaluation of the merits of different protocols. This paper explores the design space of file distribution protocols and conducts a detailed performance evaluation of a number of competing systems running in both controlled emulation environments and live across the Internet. Based on our experience with these systems under a variety of conditions, we propose, implement and evaluate Bullet′ (Bullet prime), a mesh-based high-bandwidth data dissemination system that outperforms previous techniques under both static and dynamic conditions.

1 Introduction

The rapid, reliable, and efficient transmission of a data object from a single source to a large number of receivers spread across the Internet has long been the subject of research and development spanning computer systems, networking, algorithms, and theory. Initial attempts at addressing this problem focused on IP multicast, a network-level primitive for constructing efficient IP-level trees to deliver individual packets from the source to each receiver. Fundamental problems with reliability, congestion control, heterogeneity, and deployment limited the widespread success and availability of IP multicast.

∗This research is supported in part by the Center for Networked Systems, Hewlett Packard, IBM, and Intel. Vahdat and Snoeren are supported by NSF CAREER awards (CCR-9984328, CNS-0347949).

Building on the experience gained from IP multicast, a number of efforts then focused on building application-level overlays [2, 9, 10, 12, 24] where a source would transmit the content, for instance over multiple TCP connections, to a set of receivers that would implicitly act as interior nodes in a tree built up at the application level.

While certainly advancing the state of the art, tree-based techniques face fundamental reliability and performance limitations. First, bandwidth down an overlay tree is guaranteed to be monotonically decreasing. For instance, a single packet loss in a TCP transmission high up in a tree can significantly impact the throughput to all participants rooted at the node experiencing the loss. Hence, in a large overlay, participants near the leaves are likely to experience more limited bandwidth. Further, because each node only receives data from its parent, the failure of a single node can dramatically impact overall system reliability and may cause scalability concerns under churn as a potentially large number of nodes attempt to rejoin the tree. Reliability in overlay trees is perhaps even more of a concern than in IP-level multicast, since end hosts are more likely to fail or to suffer from performance anomalies relative to IP routers in the interior of the network.

Thus, "third-generation" data dissemination infrastructures have most recently focused on building application-level overlay meshes [3, 5, 13, 25]. The idea here is that end hosts self-organize into a general mesh by selecting multiple application-level peers. Each peer transmits a disjoint set of the target underlying data to the node, enabling the potential for greater resiliency to failure and increased performance. Reliability is improved because the failure of any single peer will typically only reduce the transmitted bandwidth by 1/n, where n is the total number of peers being maintained by a given node. In fact, with a sufficient number of peers, the failure of any one peer may go unnoticed because the remaining peers may be able to quickly make up for the lost bandwidth.


This additional redundancy also enables each node to regenerate its peer set in a more leisurely manner, eliminating the "reconnection storm" that can result from the failure of a node in an overlay tree. Mesh-based approaches may also improve overall bandwidth delivered to all nodes as data can flow to receivers over multiple disjoint paths. It is straightforward to construct scenarios where two peers can deliver higher aggregate bandwidth to a given node than any one can as a result of bottlenecks in the middle of the network.

While a number of these mesh-based systems have demonstrated performance and reliability improvements over previous tree-based techniques, there is currently no detailed understanding of the relative merits of existing systems nor an understanding of the set of design issues that most significantly affect end-to-end system performance and reliability. We set out to explore this space, beginning with our own implementation of Bullet [13]. Our detailed performance comparison of these systems both live across PlanetLab [20] and in the ModelNet [28] emulation environment led us to identify a number of key design questions that appear to most significantly impact overall system performance and reliability, including: i) whether data is pushed from the source to receivers or pulled by individual receivers based on discovering missing data items, ii) peer selection algorithms where nodes determine the number and identity of peers likely to have a good combination of high bandwidth and a large volume of missing data items, iii) the strategy for determining which data items to request and in what quantity based on changing network conditions, iv) the strategy employed by the sender to transmit blocks to its current set of peers based on the data items currently queued for transmission, and v) whether the data stream should be encoded using, for instance, erasure codes [17] or transmitted unmodified.

By examining these questions in detail, we designed and implemented Bullet′ (pronounced as Bullet prime), a new mesh-based data dissemination protocol that outperforms existing systems, including Bullet, SplitStream, and BitTorrent, by 25-70% in terms of download times, under a variety of network conditions. Thus, the principal contribution of this work is an exploration and initial enumeration of the design space for a broad range of mesh-based overlays and the design, implementation, and evaluation of a system that we believe embodies some of the "best practices" in this space, enabling good performance and reliability under a variety of network conditions, ranging from the static to the highly dynamic.

It is becoming increasingly important to synchronize updates to a logical data volume across the Internet, such as synchronizing FTP mirrors or updating installed software in distributed testbeds such as PlanetLab and the Grid. Today, the state of the art for disseminating updates consists of having all clients contact a central repository known to contain the most current version of the data to retrieve any updates. Unfortunately, limited bandwidth and CPU at any central repository limit the speed and reliability with which the data can be synchronized across a large number of sites. We designed and implemented Shotgun, a set of extensions to the popular rsync tool to enable clients to synchronize their state with a centralized server orders of magnitude more quickly than previously possible.

The rest of this paper is organized as follows. Section 2 describes the design space for all systems doing large-scale file distribution, Section 3 describes the specific implementation and architecture of Bullet′, and Section 4 presents the results of our experiments testing our experimental strategies and comparisons with other systems, including an evaluation of Shotgun. Section 5 outlines the related work in the field, and the paper concludes in Section 6.

2 Design Principles and Patterns

Throughout the paper, we assume that the source transmits the file as a sequence of blocks that serve as the smallest transfer unit. Otherwise, peers would not be able to help each other until they had downloaded the entire file. In addition, we concentrate on the case where the source of the file is initially the only node that has the file, and wishes to disseminate it to a large group of receivers as quickly as possible. This usage scenario corresponds to a flash crowd retrieving a popular file over the Internet. In our terminology for peering relationships, a sender is a node sending data to a receiver.

The design of any large-scale file distribution system can be realized as careful decisions made regarding the fundamental tenets of peer-to-peer applications and data transfer. As we see them, these tenets are push vs. pull, encoding data, finding the right peers, methods for requesting data from those peers, and serving data to others in an effective manner. Additionally, an important design consideration is whether the system should be fair to participants or focus on being fast. In all cases, however, the underlying goal of downloading files is always to keep your incoming pipe full of new data. This means that nodes must prevent duplicate data transmission, restrict control overhead in favor of distributing data, and adapt to changing network conditions. In the following sections we enumerate, in more detail, these fundamental decisions that all systems disseminating objects must make.

2.1 Push or Pull

Options for distributing the file can be categorized by whether data is pushed from sources to destinations, pulled from sources by destinations, or a hybrid of the two. Traditional streaming applications typically choose the push methodology because data has low latency, is coming at a constant rate, and all participants are supposed to get all the data at the same time. To increase capacity, nodes may have multiple senders, and in a push system, they must then devote overhead to keeping their senders informed about what they have to prevent receipt of duplicates. Alternately, systems can use a pull mechanism where receivers must first learn what data exists and which nodes have it, and then request it. This has the advantage that receivers, not senders, are in control of the size and specification of the data outstanding to them, which allows them to control and adapt to their environments more easily. However, this two-step process of discovery followed by requesting means additional delay before receiving data and extra control messages in some cases.

2.2 Encoded or Unencoded

The simplest way of sending the file involves just sending the original, or unencoded, blocks into the overlay. An advantage of this approach is that receivers typically do not have to fit the entire file into physical memory to sustain high performance. Incoming blocks can be cached in memory, and later written to disk. Blocks only have to be read when the peers request them. Even if the blocks are not in memory and have to be fetched from the disk, pipelining techniques can be used to absorb the cost of disk reads. As a downside, sending the unencoded file might expose the "last block" problem of some file distribution mechanisms, when it becomes difficult for a node to locate and retrieve the last few blocks of the file.

Recently, a number of erasure-correcting codes that implement the "digital fountain" [4] approach were suggested by researchers [15, 16, 17]. When the source encodes the file with these codes, any (1 + ε)n correctly received encoded blocks are sufficient to reconstruct the original n blocks, with a typically low reception overhead (ε) of 0.03-0.05. These codes hold the potential of removing the "last block" problem, because there is no need for a receiver to acquire any particular block, as long as it recovers a sufficient number of distinct blocks. In this paper, we assume that only the source is capable of encoding the file, and do not consider the potential benefits of network coding [1], where intermediate nodes can produce encoded packets from the ones they have received thus far.
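To make the encoding concrete, below is a minimal C sketch of fountain-style encoding, in which each encoded block is the XOR of a randomly chosen subset of original blocks. The uniform degree distribution shown is a placeholder; practical rateless codes [17] draw degrees from a carefully tuned distribution to achieve the 3-5% reception overhead cited above, and all names here are illustrative.

#include <stdlib.h>
#include <string.h>

/* Sketch: build one encoded block as the XOR of `degree` distinct
 * original blocks. A real rateless code would also convey the chosen
 * indices (e.g., via a shared PRNG seed) so the receiver can decode. */
void encode_block(unsigned char **orig, int n, int block_size,
                  unsigned char *out)
{
    int *idx = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++)
        idx[i] = i;

    int degree = 1 + rand() % n;   /* placeholder degree distribution */
    memset(out, 0, block_size);

    for (int d = 0; d < degree; d++) {
        /* partial Fisher-Yates shuffle: pick a distinct random index */
        int j = d + rand() % (n - d);
        int tmp = idx[d]; idx[d] = idx[j]; idx[j] = tmp;
        for (int i = 0; i < block_size; i++)
            out[i] ^= orig[idx[d]][i];
    }
    free(idx);
}

Decoding inverts this process: a received degree-1 block immediately recovers an original block, which can then be XOR-ed out of higher-degree blocks; this is why, as noted below, a shortage of degree-1 blocks stalls decoding.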

Based on the publicly available specification, we implemented rateless erasure codes [17]. Although it is straightforward to implement these codes with low CPU encoding and decoding overhead, they exhibit some performance artifacts that are relevant from a systems perspective. First, the reconstruction of the file cannot make significant progress until a significant number of the encoded blocks is successfully received. Even with n received blocks (i.e., corresponding to the original file size), only 30 percent of the file content can be reconstructed [14]. Further, since the encoded blocks are computed by XOR-ing random sets of original blocks, the decoding stage requires random access to all of the reconstructed file blocks. This pattern of access, coupled with the bursty nature of the decoding process, causes increased decoding time due to excessive disk swapping if all of the decoded file blocks cannot fit into physical memory¹. Consequently, the source is forced to transmit the file as a series of segments that can fit into the physical memory of the receivers. Even if all receivers have homogeneous memory sizes, this arrangement presents a few problems. First, the source needs to decide when to start transmitting encoded blocks that belong to the next segment. If the file distribution mechanism exhibits considerable latency between the time a block is first generated and the time when nodes receive it, the source might send too many unnecessary blocks into the overlay. Second, receivers need to simultaneously locate and retrieve data belonging to multiple segments. Opening too many TCP connections can also affect overall performance. Therefore, the receivers have to locate enough senders hosting the segments they are interested in, while still being able to fill their incoming pipe.

We have observed a 4 percent overhead when encoding and decoding files of tens of MBs in size. Although mathematically possible, it is difficult to make this overhead arbitrarily small via parameter settings or a large number of blocks. Since the file blocks should be sufficiently large to overcome the fixed overhead due to per-block headers, we cannot use an excessive number of blocks. Similarly, since a file consisting of a large number of blocks may not fit into physical memory, a segment may not have enough blocks to reduce the decoding overhead. Finally, the decoding process is sensitive to the number of recovered degree-1 (unencoded) blocks, which are generated with relatively low probability (e.g., 0.01). These blocks are necessary to start the decoding process, and without a sufficient number of them the decoding process cannot complete.

In the remainder of the paper, we optimistically assume that the entire file can fit into physical memory, and quantify the potential benefits of using encoding at the source in Section 4.6.

¹We are assuming a memory-efficient implementation of the codes that releases the memory occupied by an encoded block when all the original blocks that were used in its construction are available. Performance can be even worse if the encoded blocks are kept until the file is reconstructed in full.


2.3 Peering Strategy

A node receives the file by peering with neighbors and receiving blocks from them. To enable this, the node requires techniques for learning about nodes in the system, selecting peers that have useful data, and determining the ideal set of peers that can optimize the incoming bandwidth of useful data. Ideally, a node would have perfect and timely information about the distribution of blocks throughout the system and would be able to download any block from any other peer, but any such approach requiring global knowledge cannot scale. Instead, the system must approximate the peering decisions. It can do this by using a centralized coordination point, though constantly updating that point with the blocks each node has received would also not scale, while also adding a single point of failure. Alternatively, a node could simply maintain a fixed set of peers, though this approach would suffer from changing network and node conditions or an initially poor selection of peers. One might also imagine using a DHT to coordinate the location of nodes with given blocks. While we have not tested this approach, we reason that it would not perform well due to the extra overhead required for locating the nodes holding each block. Overall, a good approach to picking peers is one that neither requires nodes to maintain global knowledge nor to communicate with a large number of nodes, but still manages to locate peers that have a lot of data to give. A good peering strategy will also allow the node to maintain a set of peers that is small enough to minimize control overhead, but large enough to keep the pipe full in the face of changing network conditions.

2.4 Request Strategy

In either a push or pull based system, a decision has to be made about which blocks should be queued to send to which peers. In a pull based system, this takes the form of a request for a block; we therefore call this the request strategy. For the request strategy, we need to answer several important questions.

First, what is the best order for requesting blocks? For example, if all nodes make requests for the blocks in the same order, the senders in peering relationships will be sending the same data to all the peers, and there would be very little opportunity for nodes to help each other speed up the overall progress.

Second, for any given block, more than one of the senders might have it. How does the node choose the sender that is going to provide this particular block, or in a pull based system, how can senders avoid queuing the same block for the same node? Further, should a node request a block immediately after it learns about its existence at a sender, or should it wait until some of its other peers acquire the same block? There is a tradeoff, because reacting quickly might bring the block to this node sooner, and make it available for its own receivers to download sooner. However, the first sender to have this block might not be the best one to serve it; in this case it might be prudent to wait a bit longer, because the block download time from this sender might be high due to low available bandwidth.

Third, how much data should be requested from any given sender? Requesting too few blocks might not fill the node's incoming pipe, whereas requesting too much data might force the receiver to wait too long for a block that it could have requested and received from some other node.

Finally, where should the request logic be placed? One option is to have the receiver make explicit requests for blocks, which comes at the expense of maintaining data structures that describe the availability of each block, the time of requests, etc. In addition, this approach might incur considerable CPU overhead for choosing the next block to retrieve. If this logic is in the critical path, the throughput in high-bandwidth settings may suffer. Another option is to place the decision-making at the sender. This approach makes the receiver simpler, because it might just need to keep the sender up-to-date with a digest of the blocks it currently has. Since a node might implicitly request the same block from multiple senders, this approach is highly resilient. On the other hand, duplicate blocks could be sent from multiple senders if senders do not synchronize their behavior. Further, message overhead will be higher than in the receiver-driven approach due to the digests.

2.5 Sending Strategies

All techniques for distributing a file will need a strategy for sending data to peers to optimize the performance of the distribution. We define the sending strategy as "given a set of data items destined for a particular receiver, in what order should they be sent?" This is differentiated from the request strategy in that it is concerned with the order of queued blocks rather than which blocks to queue. Here, we separate the strategy of the single source from that of the many peers, since for any file distribution to be successful, the source must share all parts of the file with others.

Source

In this paper, we consider the case where the source is available to send file blocks for the entire duration of a file distribution session. This puts it in a unique position to be able both to help all nodes and to affect the performance of the entire download. As a result, it is especially important that the source not send the same data twice before sending the entire file once. Otherwise it may prevent fast nodes from completing because it is still "hoarding" the last block. This can be accomplished by splitting the file in various ways to send to its peers. The source must also consider what kinds of reliability and retransmission it should use and what happens in the face of peer failure.

Non-source

There are several options for the order in which to send nodes data. All of them fulfill the near-term goal of keeping a node's pipe full of useful data. But a forward-thinking sender has some other options available to it that help future downloads by making it easier for nodes to locate disjoint data. Consider a node A that will send blocks 1, 2, and 3 to receivers B and C. If it sends them in numerical order each time, B and C will both get block 1 first. The utility to nodes who have peered with both B and C is just 1, though, since there is only 1 new block available. But if node A sends blocks in different orders, the utility to peers of B and C is doubled. A good sending strategy will create the greatest diversity of blocks in the system.

2.6 Fair-first or Fast-first

An important consideration for the design of a file distribution system is whether the highest priority is that it be fair to network participants, or fast. Some protocols, like BitTorrent and SplitStream, have some notion of fairness and try to make all nodes both give and take. BitTorrent does this by using tit-for-tat in its sending and requesting strategies, assuring that peers share and share alike. SplitStream does this by requiring that each node forward at least as much as it receives. It is a nice side effect that in a symmetric network, this equal-sharing strategy can be close to optimal, since all peers can offer as much as they can take, and one peer is largely as good as the next.

But this decision fundamentally precludes the possibility that some nodes have more to offer and can be exploited to do so. It is clear that there exist scenarios with tradeoffs between being fair and being fast. The appropriate choice largely depends on the use of the application, financial matters, and the mindset of the users in the system. Experience with some peer-to-peer systems indicates that there exist nodes willing to give more than they take, while other nodes cannot share fairly due to network configurations and firewalls. In the remainder of the paper, we assume that nodes are willing to cooperate both during and after the file download, and do not consider enforcing fairness.

3 Implementation and Architecture

After a careful analysis of the design space, we came to the following conclusions about how Bullet′ was to be structured:

• Bullet′ would use a hybrid push/pull architecture (Section 2.1) where the source pushes and the receivers pull.

• The number of parameters that could be tweaked by the end user to increase performance must be minimized.

• Our system would support both unencoded and source-encoded file transmission, as described in Section 2.2. We will quantify the potential benefits of encoding.

3.1 Architectural Overview

[Figure 1 shows the Bullet′ architecture as a diagram of the source S and nodes A-E, with numbered steps: 1: control tree, 2: RanSub Collect, 3: RanSub Distribute, 4: pushed blocks, 5: RanSub subset {B,C}, 6: file info, 7: request, 8: pulled blocks.]

Figure 1: Bullet′ architectural overview. Gray lines represent a subset of peering relationships that carry explicitly pulled blocks.

Figure 1 depicts the architectural overview of Bullet′. We use an overlay tree for joining the system and for transmitting control information (shown in thin dashed lines, as step 1). We use RanSub [12], a scalable, decentralized protocol, to distribute changing, uniformly random subsets of file summaries over the control tree (steps 2 and 3). The source pushes the file blocks to children in the control tree (step 4). Using RanSub, nodes advertise their identity and their block availability. Receivers use this information (step 5) to choose a set of senders to peer with, receive their file information (step 6), request (step 7) and subsequently receive (step 8) file blocks, effectively establishing an overlay mesh on top of the underlying control tree. Moreover, receivers make local decisions on the number of senders as well as the amount of outstanding data, adjusting the overlay mesh over time to conform to the characteristics of the underlying network. Meanwhile, senders keep their receivers updated with descriptions of their newly received file blocks (step 6). The specifics of our implementation are described below.

3.2 Implementation

We have implemented Bullet′ using MACEDON. MACEDON [22] is a tool that simplifies the development of large-scale distributed systems by allowing us to specify the overlay network algorithm without simultaneously implementing code to handle data transmission, timers, etc. The implementation of Bullet′ consists of a generic file distribution application, the Bullet′ overlay network algorithm, the existing basic random tree and RanSub algorithms, and the library code in MACEDON.

3.2.1 Download Application

The generic download application implemented for Bullet′ can operate in either encoded or unencoded mode. In the encoded mode, it operates by generating continually increasing block numbers and transmitting them using macedon multicast on the overlay algorithm. In unencoded mode, it operates by transmitting the blocks of the file once using macedon multicast. This makes it possible for us to compare download performance over the set of overlays specified in MACEDON. In both the encoded and unencoded cases, the download application also supports a request function from the overlay network to provide the block of a given sequence number, if available, for transmission. The download application also supports the reconstruction of the file from the data blocks delivered to it by the overlay. The download application is parameterized and can take the block size and file name as parameters.

3.2.2 RanSub

RanSub is the protocol we use for distributing uniformly random subsets of peers to all nodes in the overlay. Unlike a centralized solution or one that requires state on the order of the size of the overlay, RanSub is decentralized and scales better in both the number of nodes and the state maintained. RanSub requires an overlay tree to propagate random subsets, and in Bullet′ we use the control tree for this purpose. RanSub works by periodically distributing a message to all members of the overlay, and then collecting data back up the tree. At each layer of the tree, data is randomized and compacted, assuring that all peers see a changing, uniformly random subset of data. The period of this distribute-and-collect cycle is configurable, but for Bullet′ it is set to 5 seconds. Also, by decentralizing the random subset distribution, we can include more application state, an abstract that nodes can use to make more intelligent decisions about which nodes would be good to peer with.
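As an illustration of the per-layer "randomize and compact" step, the following C sketch merges a node's own summary with the entries collected from its children and keeps a fixed-size random sample via reservoir sampling. A faithful RanSub implementation weights entries by subtree size so that the final subset is uniform over all descendants; the equal-weight sampling and all names here are simplifying assumptions.

#include <stdlib.h>

#define SUBSET_SIZE 10   /* assumed size of the distributed random subset */

typedef struct {
    int node_id;
    /* per-node application state, e.g., a summary of blocks held */
} Summary;

/* Reservoir-sample up to SUBSET_SIZE entries from the pooled summaries
 * (this node's own entry plus everything reported by its children). */
int CompactSubset(const Summary *pool, int pool_len, Summary *out)
{
    int kept = 0;
    for (int i = 0; i < pool_len; i++) {
        if (kept < SUBSET_SIZE) {
            out[kept++] = pool[i];
        } else {
            int j = rand() % (i + 1);
            if (j < SUBSET_SIZE)
                out[j] = pool[i];   /* replace with decreasing probability */
        }
    }
    return kept;   /* number of entries forwarded up the tree */
}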

3.3 Algorithms

3.3.1 Peering Strategy

In order for nodes to fill their pipes with useful data, it is imperative that they be able to locate and maintain a set of peers that can provide them with good service. In the face of bandwidth changes and unstable network conditions, keeping a fixed number of peers is suboptimal (see Section 4.4). Not only must Bullet′ discard peers whose service degrades, it must also adaptively decide how many peers it should be downloading from/sending to. Note that peering relationships are not inherently bidirectional; two nodes wishing to receive data from each other must establish peering links separately. Here we use the term "sender" to refer to a peer a node is receiving data from and "receiver" to refer to a peer a node is sending data to.

Each node maintains two variables, namely MAX_SENDERS and MAX_RECEIVERS, which specify the maximum number of senders and receivers the node wishes to have. Initially, these values are set to 10, the value we experimentally chose for the released version of Bullet. Bullet′ also imposes hard minimum/maximum limits of 6 and 25 on the number of senders and receivers. Each time a node receives a RanSub distribute message containing a random subset of peers and the summaries of their file content, it evaluates its current peer sets and decides whether it should add or remove both senders and receivers. If the node has its current maximum number of senders, it decides whether it should either "try out" a new connection or close a current connection based on the number of senders and the bandwidth received when the last distribute message arrived. A similar mechanism is used to handle adding/removing receivers, except in this case Bullet′ uses outgoing bandwidth instead of incoming bandwidth. Figure 2 shows the pseudocode for managing senders.

Once MAX_SENDERS and MAX_RECEIVERS have been modified, Bullet′ calculates the average and standard deviation of the bandwidth received from all of its senders. It then sorts the senders in order of least bandwidth received to most, and disconnects itself from any sender who is more than 1.5 standard deviations away from the mean, so long as it does not drop below the minimum number of connections (6).


void ManageSenders() {
  if (size(senders) != MAX_SENDERS)
    return;

  if (num_prev_senders == 0) {
    // try to add a new peer by default
    MAX_SENDERS++;
  } else if (size(senders) > num_prev_senders) {
    if (incoming_bw > prev_incoming_bw)
      // bandwidth went up, try adding a sender
      MAX_SENDERS++;
    else
      // adding a new sender was bad
      MAX_SENDERS--;
  } else if (size(senders) < num_prev_senders) {
    if (incoming_bw > prev_incoming_bw)
      // losing a sender made us faster, try losing another
      MAX_SENDERS--;
    else
      // losing a sender was bad
      MAX_SENDERS++;
  }
}

Figure 2: ManageSenders() Pseudocode

This way, Bullet′ is able to keep only the peers who are the most useful to it. A fixed minimum bandwidth was not used so as not to hamper nodes who are legitimately slow. In addition, the slowest sender is not always closed, since if all of a peer's senders are approximately equal in terms of the bandwidth they provide, then none of them should be closed.

A nearly identical procedure is executed to remove receivers who are potentially limiting the outgoing bandwidth of a node. However, Bullet′ takes care to sort receivers based on the ratio of the bandwidth they are receiving from this particular sender to their total incoming bandwidth. This is important because we do not want to close peers who are getting a large fraction of their bandwidth from a given sender. We chose the value of 1.5 standard deviations because 1 would lead to too many nodes being closed, whereas 2 would only permit very few peers to ever be closed.
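The trimming pass described above can be summarized by the following C sketch, which assumes the per-sender bandwidths have already been sorted from least to most; close_sender() and the array layout are illustrative, not Bullet′'s actual interfaces.

#include <math.h>

#define MIN_SENDERS 6   /* hard lower limit from Section 3.3.1 */

extern void close_sender(int i);   /* assumed hook to drop a peering link */

void TrimSlowSenders(const double *bw, int n)
{
    double mean = 0.0, var = 0.0;

    for (int i = 0; i < n; i++)
        mean += bw[i];
    mean /= n;
    for (int i = 0; i < n; i++)
        var += (bw[i] - mean) * (bw[i] - mean);
    double sd = sqrt(var / n);

    /* bw[] is sorted ascending: walk from the slowest sender upward */
    for (int i = 0; i < n && n - i > MIN_SENDERS; i++) {
        if (mean - bw[i] > 1.5 * sd)
            close_sender(i);   /* more than 1.5 deviations below the mean */
        else
            break;   /* later senders are even closer to the mean */
    }
}

Note that when all senders provide roughly equal bandwidth, the 1.5-deviation test fails for every sender and none are closed, matching the behavior described above.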

3.3.2 Request Strategy

We considered four different request strategies when designing Bullet′. All of the strategies make local decisions on either the unencoded or source-encoded file. Given a per-peer list representing blocks that are available from that peer, the following are possible ways to order requests:

• First encountered This strategy simply arranges the lists based on block availability. That is, blocks that are just discovered by a node will be requested after blocks the node has known about for a while. As an example, this might correspond to all nodes proceeding in lockstep in terms of download progress. The resulting low block diversity this causes in the system could lead to lower performance.

• Random This method randomly orders each list with the intent of improving the block diversity in the system. However, there is a possibility of requesting a block that many other nodes already have, which does not help block diversity. As a result, this strategy might not significantly improve the overall system performance.

• Rarest The rarest technique is the first that looks at block distributions among a node's peers when ordering the lists. Each list is ordered with the least represented blocks appearing first. This strategy has no method for breaking ties in terms of rarity, so it is possible that blocks quickly go from being under-represented to well represented when a set of nodes makes the same deterministic choice.

• Rarest random The final strategy we describe is an improvement over the rarest approach in that it chooses uniformly at random from the blocks that have the highest rarity. This strategy eliminates the problem of deterministic choices leading to suboptimal conditions.

In order to decide which strategy worked best, we implemented all four in Bullet′. We present our findings in Section 4.3.
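For concreteness, a minimal C sketch of the rarest-random choice follows: it counts how many of a node's senders hold each still-needed block, then picks uniformly at random among the blocks tied at the minimum count. The counting layout is hypothetical; Bullet′ actually maintains per-peer availability lists as described above.

#include <stdlib.h>

/* have_count[b]: number of our senders that hold block b
 * needed[b]:     nonzero if we still need block b
 * Returns a block id, or -1 if nothing is left to request. */
int PickRarestRandom(const int *have_count, const int *needed,
                     int num_blocks)
{
    int *cand = malloc(num_blocks * sizeof(int));
    int min_count = -1, num_cand = 0;

    for (int b = 0; b < num_blocks; b++) {
        if (!needed[b] || have_count[b] == 0)
            continue;                    /* already held, or unavailable */
        if (min_count < 0 || have_count[b] < min_count) {
            min_count = have_count[b];   /* strictly rarer block found */
            num_cand = 0;                /* reset the set of ties */
        }
        if (have_count[b] == min_count)
            cand[num_cand++] = b;
    }

    int pick = (num_cand > 0) ? cand[rand() % num_cand] : -1;
    free(cand);
    return pick;
}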

3.3.3 Flow Control

Although the rarest random request strategy enables Bullet′ to request blocks from peers in a way that encourages block diversity, it does not specify how many blocks a node should request from its peers at once. This choice presents a tradeoff between control overhead (making requests) and adaptivity. On one hand, a node could request one block at a time, not requesting another one until the first arrived. Although stopping and waiting would provide maximum insulation from changing network conditions, it would also leave pipes underutilized due to the round trip time involved in making the next request. At the other end of the spectrum is a strategy where a node would request everything that it knew about from its peers as soon as it learned about it. In this case, the control traffic is reduced and the node's pipe from each of its peers has a better chance of being full, but this technique has major flaws when network conditions change. If a peer suddenly slows down, the node will find itself stuck waiting for a large number of blocks to come in at a slow rate.


void ManageOutstanding(sender, block) {
  // start with current value
  sender->desired = sender->requested + 1;

  if (block->wasted <= 0 || block->in_front <= 1)
    sender->desired -=
      0.4*(block->wasted*sender->bandwidth/block_size);

  if (block->wasted <= 0 && block->in_front > 1)
    sender->desired -= 0.226*(block->in_front - 1);
}

Figure 3: Pseudocode for setting the maximum number of per-peer outstanding blocks.

We have experimented with canceling blocks that arrive "slowly", and found that in many cases these blocks are "in-flight" or in the sender's socket buffer, making it difficult to effectively stop their retrieval without closing the TCP connection.

As seen in Section 4.5, using a fixed number of outstanding blocks will not perform well under a wide variety of conditions. To remedy this situation, Bullet′ employs a novel flow control algorithm that attempts to dynamically change the maximum number of blocks a node is willing to have outstanding from each of its peers. Our control algorithm is similar to XCP's [11] efficiency controller, the feedback control loop for calculating the aggregate feedback for all the flows traversing a link. XCP measures the difference in the rates of incoming and outgoing traffic on a link, and computes the total number of bytes by which the flows' congestion windows should increase or decrease. XCP's goal is to maintain 0 packets queued on the bottleneck link. For the particular values of the control parameters α = 0.4, β = 0.226, the control loop is stable for any link bandwidth and delay.

We start with the original XCP formula and adapt it. Since we want to keep each pipe full while not risking waiting for too much data in case the TCP connection slows down, our goal is to maintain exactly 1 block in front of the TCP socket buffer for each peer. With each block it sends, the sender measures and reports two values to the receiver, which runs the algorithm depicted in the pseudocode of Figure 3. The first value is in_front, corresponding to the number of queued blocks in front of the socket buffer when the request for the particular block arrives. The second value is wasted, and it can be either positive or negative. If it is negative, it corresponds to the time that was "wasted" and could have been occupied by sending blocks. If it is positive, it represents the "service" time this block has spent waiting in the queue. Since this time includes the time to service each of the in_front blocks, we take care not to double count the service time in this case. To convert the wasted (service) time into units applicable to the formula, we multiply it by the bandwidth measured at the receiver, and divide by the block size to derive the additional (reduced) number of blocks the receiver could have requested. Once we decide to change the number of outstanding blocks, we mark a block request and do not make any adjustments until that block arrives. This technique allows us to observe any changes caused by our control algorithm before taking any additional action. Further, just matching the rate at which the blocks are requested with the sending bandwidth in an XCP manner would not saturate the TCP connection. Therefore, we take the ceiling of the non-integer value for the desired number of outstanding blocks whenever we increase this value.
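To illustrate with purely hypothetical numbers: suppose a request arrives at a sender that has been idle, so wasted = -20 ms, the receiver measures 800 KB/s from that peer, and blocks are 16 KB. The correction is -0.4 × (-0.020 s × 800 KB/s / 16 KB) = +0.4 blocks, so the desired number of outstanding blocks rises from, say, 4 to 4.4 and, after taking the ceiling on an increase, the receiver allows 5 blocks outstanding. Conversely, a positive wasted (queueing) time pushes the value back down.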

Although Bullet′ knows how much data it should request and from whom, a mechanism is still needed that specifies when the requests should be made. Initially, the number of blocks outstanding for each peer starts at 3, so when a node gains a sender it will request up to 3 blocks from the new peer. Conceptually, this corresponds to a pipeline of one block arriving at the receiver, one more in flight, and the request for the third block reaching the sender. Whenever a block is received, the node re-evaluates the potential of this peer and requests up to the new maximum outstanding.

3.3.4 Staying Up-To-Date

One small detail we have deferred until now is how nodes become aware of what their peers have. Bullet′ nodes use a simple bitmap structure to transmit diffs to their peers. These diffs are incremental, such that a node will only hear about a particular block from a peer once. This approach helps to minimize wasted bandwidth and decouples the size of the diff from the size of the file being distributed. Currently, a diff may be transmitted from node A to B in one of two circumstances: either because B has nothing requested of A, or because B specifically asked for a diff to be sent. The latter occurs when B is about to finish requesting all of the blocks A currently has. An interesting effect of this mechanism is that diff sending is automatically self-clocking; there are no fixed timers or intervals where diffs are sent at a specific rate. Bullet′ automatically adjusts to the data consumption rates of each individual peer.
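A hedged C sketch of this incremental diff mechanism follows: per receiver, the node tracks which blocks it has already advertised and sends only the newly set bits. The fixed-size bitmap and field names are assumptions for illustration.

#include <string.h>

#define MAP_BYTES 128   /* assumed bitmap size: 8 * 128 = 1024 blocks */

typedef struct {
    unsigned char have[MAP_BYTES];   /* blocks this node holds */
    unsigned char told[MAP_BYTES];   /* blocks already advertised to peer */
} PeerView;

/* Fill diff[] with bits set in `have` but never yet advertised, and
 * mark them as advertised. Returns nonzero if the diff is non-empty,
 * so each block is announced to a given peer at most once. */
int BuildDiff(PeerView *v, unsigned char *diff)
{
    int newbits = 0;
    for (int i = 0; i < MAP_BYTES; i++) {
        diff[i] = (unsigned char)(v->have[i] & ~v->told[i]);
        v->told[i] |= diff[i];
        newbits |= diff[i];
    }
    return newbits != 0;
}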

3.3.5 Sending Strategy

As mentioned previously, Bullet′ uses a hybrid push/pull approach for data distribution where the source behaves differently from everyone else. The source takes a rather simple approach: it sends a block to each of its RanSub children iteratively until the entire file has been sent. If a block cannot be sent to one child (because the pipe to it is already full), the source will try the next child in a round-robin fashion until a suitable recipient is found.


[Figure 4 plots the CDF of node download times (percentage of nodes vs. download time in seconds) for the optimal physical link speed, the MACEDON TCP feasible + startup estimate, BulletPrime, Bullet, BitTorrent, and MACEDON SplitStream.]

Figure 4: Comparison of Bullet′ to BitTorrent, Bullet, SplitStream and the optimal case for a 100MB file size under random network packet losses. The legend is ordered top to bottom, while the lines are ordered from left to right.

In this manner, the source never wastes bandwidth forcing a block on a node that is not ready to accept it. Once the source has made each of the file blocks available to the system, it advertises itself in RanSub so that arbitrary nodes can benefit from having a peer with all of the file blocks.
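The source's send loop can be sketched in C as below. can_send() and send_block() stand in for the transport layer (e.g., a test of whether the TCP socket to that child is writable); a real implementation would block on socket readiness rather than spinning.

extern int  can_send(int child);             /* room in child's pipe? (assumed) */
extern void send_block(int child, int blk);  /* enqueue one block (assumed) */

/* Send each block exactly once before repeating any data, skipping
 * children whose pipes are full, so no block is forced on a node that
 * is not ready to accept it. */
void SourceSendLoop(int num_children, int num_blocks)
{
    int child = 0;
    for (int b = 0; b < num_blocks; ) {
        int tried = 0;
        while (tried < num_children && !can_send(child)) {
            child = (child + 1) % num_children;   /* round robin */
            tried++;
        }
        if (tried == num_children)
            continue;   /* all pipes full; a real sender would wait here */
        send_block(child, b++);
        child = (child + 1) % num_children;
    }
}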

From the perspective of non-source nodes, determining the order in which to send requested blocks is equally simple. Since Bullet′ dynamically determines the number of outstanding requests, nodes should always have approximately one outstanding request at the application level on any peer at any one time. As a result, the sender can simply serve requests in FIFO order, since there is not much of a choice to make among so few blocks. Note that this approach would not be optimal for all systems, but since Bullet′ dynamically adjusts the number of requests to have outstanding for each peer, it works well.

4 Evaluation

To conduct a rigorous analysis of our various design tradeoffs, we needed to construct a controlled experimental environment where we could conduct multiple tests under identical conditions. Rather than attempt to construct a necessarily small isolated network testbed, we present results from experiments using the ModelNet [28] network emulator, which allowed us to evaluate Bullet′ on topologies consisting of 100 nodes or more. In addition, we present experimental results over the wide-area network using the PlanetLab [20] testbed.

[Figure 5 plots the CDF of node download times (percentage of nodes vs. download time in seconds) for BulletPrime, Bullet, BitTorrent, and MACEDON SplitStream.]

Figure 5: Comparison of Bullet′ to BitTorrent, Bullet, and SplitStream for a 100MB file size under synthetic bandwidth changes and random network packet losses.

4.1 Experimental Setup

Our ModelNet experiments make use of 25 2.0- and 2.8-GHz Pentium 4s running Xeno-Linux 2.4.27 and interconnected by 100-Mbps and 1-Gbps Ethernet switches. In the experiments presented here, we multiplex one hundred logical end nodes running our download applications across the 25 Linux nodes (4 per machine). ModelNet routes packets from the end nodes through an emulator responsible for accurately emulating the hop-by-hop delay, bandwidth, and congestion of a given network topology; a 1.4-GHz Pentium III running FreeBSD-4.7 served as the emulator for these experiments.

All of our experiments are run on a fully interconnected mesh topology, where each pair of overlay participants is directly connected. While admittedly not representative of actual Internet topologies, it allows us maximum flexibility to affect the bandwidth and loss rate between any two peers. The inbound and outbound access links of each node are set to 6 Mbps, while the nominal bandwidth on the core links is 2 Mbps. In an attempt to model the wide-area environment [21], we configure ModelNet to randomly drop packets on the core links with probability ranging from 0 to 3 percent. The loss rate on each link is chosen uniformly at random and fixed for the duration of an experiment. To approximate the latencies in the Internet [7, 21], we set the propagation delay on the core links uniformly at random between 5 and 200 milliseconds, while the access links have one millisecond of delay.

For most of the following sections, we conduct identical experiments in two scenarios: a static bandwidth case and a variable bandwidth case.


[Figure 6 plots the CDF of node download times (percentage of nodes vs. download time in seconds) for the rarest-random, random, and first-encountered request strategies.]

Figure 6: Impact of the request strategy on Bullet′ performance while downloading a 100 MB file under random network packet losses.

Our bandwidth-change scenario models changes in the network bandwidth that correspond to correlated and cumulative decreases in bandwidth from a large set of sources, as seen from any vantage point. To effect these changes, we decrease the bandwidth on the core links with a period of 20 seconds. At the beginning of each period, we choose 50 percent of the overlay participants uniformly at random. For each participant selected, we then randomly choose 50 percent of the other overlay participants and decrease the bandwidth on the core links from those nodes to 50 percent of the current value, without affecting the links in the reverse direction. The changes we make are cumulative; i.e., it is possible for an unlucky node pair to have 25% of the original bandwidth after two iterations. We do not alter the physical link loss rates that were chosen during topology generation.
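This change model can be restated as the following C sketch of one 20-second round; set_core_bw() is a hypothetical stand-in for however the emulated topology is reconfigured, not part of ModelNet's actual interface.

#include <stdlib.h>

#define NODES 100

extern void set_core_bw(int from, int to, double bw);  /* assumed hook */

double core_bw[NODES][NODES];   /* current core bandwidth, from -> to */

/* One round: ~50% of nodes are selected; for each, the bandwidth on
 * core links from ~50% of the other nodes is cut in half, cumulatively
 * and without touching the reverse direction. */
void BandwidthChangeRound(void)
{
    for (int dst = 0; dst < NODES; dst++) {
        if (rand() % 2)
            continue;                          /* select ~50% of nodes */
        for (int src = 0; src < NODES; src++) {
            if (src == dst || rand() % 2)
                continue;                      /* ~50% of the others */
            core_bw[src][dst] *= 0.5;          /* cumulative decrease */
            set_core_bw(src, dst, core_bw[src][dst]);
        }
    }
}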

4.2 Overall Performance

We begin by studying how Bullet′ performs overall, using the existing best-of-breed systems as comparison points. For reference, we also calculate the best achievable performance given the overhead of our underlying transport protocols. Figure 4 plots the results of downloading a 100-MB file on our ModelNet topology using a number of different systems. The graph plots the cumulative distribution function of node completion times for four experimental runs and two calculations. Starting at the left, we plot download times that are optimal with respect to access link bandwidth in the absence of any protocol overhead. We then estimate the best possible performance of a system built using MACEDON on top of TCP, accounting for the inherent delay required for nodes to achieve maximum download rate. The remaining four lines show the performance of Bullet′ running in the unencoded mode, Bullet, BitTorrent, and our MACEDON SplitStream implementation, in roughly that order.

[Figure 7 plots the CDF of node download times (percentage of nodes vs. download time in seconds) for peer set sizes of 14, dynamic, 10, and 6.]

Figure 7: Bullet′ performance under random network packet losses for static peer set sizes of 6, 10, and 14, and for the dynamic peer set sizing case, while downloading a 100 MB file.

Bullet′ clearly outperforms all the other schemes by approximately 25%. The slowest Bullet′ receiver finishes downloading 37% faster than in the other systems. Bullet′'s performance is even better in the dynamic scenario (faster by 32%-70%), shown in Figure 5.

We set the transfer block size to 16 KB in all of our experiments. This value corresponds to BitTorrent's sub-piece size of 16 KB, and is also shared by Bullet and SplitStream. For all of our experiments, we make sure that there is enough physical memory on the machines hosting the overlay participants to cache the entire file content in memory. Our goal is to concentrate on distributed algorithm performance and not worry about swapping file blocks to and from the disk. The Bullet and SplitStream results are optimistic, since we do not perform encoding and decoding of the file. Instead, we set the encoding overhead to 4% and declare the file complete when a node has received enough file blocks.

4.3 Request Strategy

Heartened by the performance of Bullet′ with respect to other systems, we now focus our attention on the various critical aspects of our design that we believe contribute to Bullet′'s superior performance. Figure 6 shows the performance of Bullet′ using three different peer request strategies, again using the CDF of node completion times. In this case each node is downloading a 100 MB file. We argue that the goal of a request strategy is to promote block diversity in the system, allowing nodes to help each other. Not surprisingly, we see that the first-encountered request strategy performs the worst.


[Figure 8 plots the CDF of node download times (percentage of nodes vs. download time in seconds) for peer set sizes of 14, dynamic, 10, and 6.]

Figure 8: Bullet′ performance under synthetic bandwidth changes and random network packet losses for static peer set sizes of 6, 10, and 14, and for the dynamic peer set sizing case, while downloading a 100 MB file.

The rarest-random strategy performs best among the strategies considered for 70% of the receivers, while for the slowest nodes, the random strategy performs better. When a receiver is downloading from senders over lossy links, higher loss rates increase the latency of block availability messages due to TCP retransmissions and use of the congestion avoidance mechanism. Consequently, choosing the next block to download uniformly at random does a better job of improving diversity than the rarest-random strategy that operates on potentially stale information.

4.4 Peer Selection

In this section we demonstrate the impossibility of choosing a single optimal number of senders and receivers for each peer in the system, arguing for a dynamic approach. In Figure 7 we contrast Bullet′'s performance with 10 and 14 peers (for both senders and receivers) while downloading a 100 MB file. The system configured with 14 peers outperforms the one with 10 because, in a lossy topology like the one we are using, having more TCP flows makes a node's incoming bandwidth more resilient to packet losses. Our dynamic approach is configured to start with 10 senders and receivers, but it closely tracks the performance of the system with the number of peers fixed at 14 for 50% of the receivers. Under synthetic bandwidth changes (Figure 8), our dynamic approach matches, and sometimes exceeds, the performance of the static setups.

For our final peering example, we construct a 100-node topology with ample bandwidth in the core (10Mbps, 1ms latency links), 800 Kbps access links, and no random network packet losses.


Figure 9: Bullet′ performance in the absence of bandwidth changes and random packet losses for static peer set sizes of 10 and 14, and the dynamic peer set sizing case, while downloading a 10 MB file in a topology with constrained access links.

Figure 9 shows that, unlike in the previous experiments, Bullet′ configured for 14 peers performs worse than in a setup with 10 peers. Having more peers in this constrained environment forces more throughput-maximizing TCP connections to compete for bandwidth. In addition, maintaining more peers requires sending more control messages, further decreasing system performance. Our dynamic approach tracks, and sometimes exceeds, the performance of the better static setup.

These cases clearly demonstrate that no statically configured peer set size is appropriate for a wide range of network environments; a well-tuned system must dynamically determine the appropriate peer set size.
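As a rough illustration of dynamic sizing only (the decision rule and parameters below are assumptions, not Bullet′'s actual algorithm), a controller can perturb the sender set size once per measurement epoch and keep the change only if the node's incoming bandwidth improves:

class PeerSetSizer:
    """Hypothetical hill-climbing controller for the sender set size."""

    def __init__(self, initial: int = 10, lo: int = 2, hi: int = 25):
        self.size, self.lo, self.hi = initial, lo, hi
        self.last_bw = 0.0
        self.last_action = +1  # +1 = added a peer, -1 = removed one

    def epoch(self, measured_bw: float) -> int:
        """Called once per epoch with the incoming bandwidth observed
        since the last call; returns the new target set size."""
        if measured_bw >= self.last_bw:
            action = self.last_action   # the last change helped: repeat it
        else:
            action = -self.last_action  # the last change hurt: reverse it
        self.size = max(self.lo, min(self.hi, self.size + action))
        self.last_bw, self.last_action = measured_bw, action
        return self.size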

4.5 Outstanding Requests

We now explore how to determine the optimal number of per-peer outstanding requests. Other systems use a fixed number of outstanding blocks; for example, BitTorrent tries to maintain five outstanding blocks from each peer by default. For the experiments in this section, we use an 8KB block and configure the Linux kernel to allow large receiver window sizes. In our first topology, there are 25 participants, interconnected with 10Mbps links with 100ms latency. In Figure 10 we show Bullet′'s performance when configured with 3, 6, 9, 15, and 50 per-peer outstanding blocks for up to 5 senders. The number of outstanding requests refers to the total number of block requests to any given peer, including blocks that are queued for sending, and blocks and requests that are in flight. As we can see, the dynamic technique closely tracks the performance of the cases with a large number of outstanding blocks.



Figure 10: Bullet′ performance with neither bandwidth changes nor random network packet losses for 3, 6, 9, 15, and 50 outstanding blocks, and for the dynamic queue sizing case, while downloading a 100 MB file.

Having too few outstanding requests is not enough to fill the bandwidth-delay product of high-bandwidth, high-latency links.
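This sizing rule can be made concrete with a small sketch: keep enough requests in flight to cover the connection's bandwidth-delay product. Here the rate and RTT are assumed to come from passive measurement of the TCP connection; the function and its parameters are illustrative, not our implementation.

import math

def outstanding_budget(rate_bps: float, rtt_s: float,
                       block_size: int = 8 * 1024, floor: int = 3) -> int:
    """Per-peer outstanding-request budget sized to the connection's
    bandwidth-delay product."""
    bdp_bytes = (rate_bps / 8.0) * rtt_s
    return max(floor, math.ceil(bdp_bytes / block_size))

# 10 Mbps at 100 ms RTT: ~125 KB in flight, i.e. 16 blocks of 8 KB.
print(outstanding_budget(10_000_000, 0.100))  # -> 16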

Although it is tempting to simplify the system by requesting the maximum number of blocks from each peer, Figure 11 illustrates the penalty of requesting more blocks than is required to saturate the TCP connection. In this experiment, we instruct ModelNet to drop packets uniformly at random with probability ranging between 0 and 1.5 percent on the core links. Due to losses, TCP achieves lower bandwidths, requiring less data in flight for maximum performance. Under these loss-induced TCP throughput fluctuations, our dynamic approach outperforms all static cases. Figure 12 provides more insight into this case. For this experiment, we have 8 participants including the source, with 6 nodes receiving data from the source and reconciling among each other over 10Mbps, 1ms latency links. We use 8KB blocks and disable peer management. The 8th node downloads only from the 6 peers, over dedicated 5Mbps, 100ms latency links. Every 25 seconds, we choose another peer from these 6 and reduce the bandwidth on its link toward the 8th node to 100Kbps. These cascading bandwidth changes are cumulative, i.e., in the end the 8th node has only 100Kbps links from its peers. Our dynamic scheme outperforms all fixed sizing choices for the slowest, 8th node by 7 to 22 percent. Placing too many requests on a connection to a node that suddenly slows down forces the receiver to wait too long for these blocks to arrive, instead of retrieving them from some other peer.


Figure 11: Bullet′ performance under random network packet losses while downloading a 100 MB file. The CDF line for 9 outstanding blocks is omitted for readability.

4.6 Potential Benefits of Source Encoding

The purpose of this section is to quantify the potential benefits of encoding the file at the source. Towards this end, Figure 13 shows the average block inter-arrival times among the 99 receivers while downloading a 100MB file. To improve the readability of the graph, we do not show the maximum block inter-arrival times, which follow a similar trend. A system that has a pronounced "last-block" problem would exhibit a sharp increase in the block inter-arrival time for the last several blocks. To quantify the potential benefits of encoding, we first compute the overall average block inter-arrival time tb. We then consider the last twenty blocks and calculate the cumulative overage of the average block inter-arrival time over tb. In this case the overage amounts to 8.38 seconds. We contrast this value to the potential increase in download time due to a fixed 4 percent encoding overhead, which is 7.60 seconds, while optimistically assuming that downloads using source encoding would not exhibit any deviation in the download times of the last few blocks. We conclude that encoding at the source in this scenario would not be of clear benefit in improving the average download time. This finding can be explained by the presence of a large number of nodes that hold a particular block and are available to send it to other participants. Encoding at the source or within the network can be useful when the source becomes unavailable soon after sending the file once, and under node churn [8].
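This back-of-the-envelope comparison can be written out as a short calculation. The sketch below is illustrative (the function name and the interpretation of "overage" as the excess of the last twenty inter-arrival times over the mean are ours), and it approximates a node's total download time as the sum of its inter-arrival times:

def encoding_benefit(inter_arrival: list[float],
                     encoding_overhead: float = 0.04) -> tuple[float, float]:
    """Return (overage of the last 20 blocks, added cost of encoding).
    Encoding only pays off if the first value clearly exceeds the second."""
    tb = sum(inter_arrival) / len(inter_arrival)            # mean inter-arrival
    overage = sum(t - tb for t in inter_arrival[-20:])      # last-block penalty
    encoding_cost = encoding_overhead * sum(inter_arrival)  # e.g. 7.60 s above
    return overage, encoding_cost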

4.7 PlanetLab

This section contains results from the deployment of Bullet′ over the PlanetLab [20] wide-area network testbed.



Figure 12: Bullet′ performance under synthetic bandwidth changes while downloading a 100 MB file. CDF lines for 3 and 6 outstanding blocks are omitted because the 8th node takes considerably more time to finish in these cases.

For our first experiment, we chose 41 nodes for our deployment, with no two machines deployed at the same site. We configured Bullet′, Bullet, and SplitStream (the MACEDON SplitStream implementation) to use a 100KB block size. Bullet and SplitStream did not perform the file encoding/decoding; instead, we mark a download as successful when the required number of distinct file blocks is received, including the fixed 4 percent overhead that an actual encoding scheme would incur. We see in Figure 14 that Bullet′ consistently outperforms the other systems in the wide area. For example, the slowest Bullet′ node completes the 50MB download approximately 400 seconds sooner than BitTorrent's slowest downloader.

4.8 Shotgun: a Rapid Synchronization Tool

This section presents Shotgun, a set of extensions to the popular rsync [27] tool that enables clients to synchronize their state with a centralized server orders of magnitude more quickly than previously possible. At a high level, Shotgun works by wrapping appropriate interfaces to Bullet′ around rsync. Shotgun is useful, for example, for a user who wants to run an experiment on a set of PlanetLab [20] nodes. Because each node only has local storage, the user must somehow copy (using scp or rsync) her program files to each node individually. Each time she makes any change to her program, she must re-copy the updated files to all the nodes she is using. If the experiment spans hundreds of nodes, then copying the files requires opening hundreds of ssh connections, all of which compete for bandwidth.


Figure 13: Block inter-arrival times for a 100 MB file under random network packet losses in the absence of bandwidth changes. The block numbers on the X axis correspond to the order in which nodes retrieve blocks, not the actual block numbers.

To use Shotgun, the user simply starts shotgund, the Shotgun multicast daemon, on each of her nodes. To distribute an update, the user runs shotgun sync, providing as arguments a path to the new software image, a path to the current software image, and the host name of the source of the Shotgun multicast tree (this can be the local machine). Next, shotgun sync runs rsync in batch mode between the two software image paths, generating a set of logs describing the differences, and records the version numbers of the old and new files. shotgun sync then archives the logs into a single tar file and sends it to the source, which rapidly disseminates it to all the clients using the multicast overlay. Each client's shotgund downloads the update and then invokes shotgun sync locally to apply the update if the update's version is greater than the node's current version.
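The sender side of this workflow can be sketched using rsync's standard batch mode; in the Python sketch below, the file names, the helper function, and the omission of version bookkeeping are our assumptions, and the hand-off to the overlay is left abstract.

import pathlib
import subprocess
import tarfile

def make_update(new_image: str, old_image: str,
                batch_prefix: str = "shotgun_batch") -> str:
    """Capture the delta between two local trees with rsync batch mode
    and archive the resulting batch files for dissemination."""
    # --only-write-batch records what would turn old_image into new_image
    # without actually modifying old_image.
    subprocess.run(["rsync", "-a", f"--only-write-batch={batch_prefix}",
                    new_image + "/", old_image + "/"], check=True)
    archive = batch_prefix + ".tar"
    with tarfile.open(archive, "w") as tar:
        for p in pathlib.Path(".").glob(batch_prefix + "*"):
            if p.name != archive:  # skip the archive we are creating
                tar.add(p)
    return archive  # hand this file to the overlay for dissemination

# A receiver unpacks the archive and applies the delta with, e.g.:
#   rsync --read-batch=shotgun_batch -a old_image/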

Running an rsync instance for each target node overloads the source node's CPU with a large number of rsync processes, all competing for the disk, CPU, and bandwidth. We therefore attempted to experimentally determine the number of simultaneous rsync processes that gives the best overall performance using the staggered approach. Figure 15 shows that Shotgun outperforms rsync (1, 2, 4, 8, and 16 parallel instances) by two orders of magnitude. Another interesting result from this graph is that the constraining factor for PlanetLab nodes is the disk, not the network: on average, most nodes spent twice as much time replaying the rsync logs locally as they spent downloading the data.



Figure 14: Comparison of Bullet′ to Bullet, BitTorrent, and SplitStream for a 50MB file on PlanetLab.

5 Related Work

Overcast [10] constructs a bandwidth-optimized overlay tree among dedicated infrastructure nodes. An incoming node joins at the source and probes for acceptable bandwidth under one of its siblings to descend down the tree. A node's bandwidth and reliability are determined by the characteristics of the network between itself and its parent, and are lower than the performance and reliability provided by an overlay mesh.

Narada [9] constructs a mesh based on a k-spanner graph and uses bandwidth and latency probing to improve the quality of the mesh. It then employs standard routing algorithms to compute per-source forwarding trees, in which a node's performance is still defined by connectivity to its parent. In addition, the group membership protocol limits the scale of the system.

Snoeren et al. [26] use an overlay mesh to send XML-encoded data. The mesh is structured by enforcing k parents for each participant. The emphasis of this primarily push-based system is on reliability and timely delivery, so nodes flood the data over the mesh.

In the FastReplica [6] file distribution system, the source of a file divides the file into n blocks, sends a different block to each of the receivers, and then instructs the receivers to retrieve the blocks from each other. Since the system treats every node pair equally, overall performance is determined by the transfer rate of the slowest end-to-end connection.

BitTorrent [3] is a system in wide use on the Internet for distribution of large files. Incoming nodes rely on a centralized tracker to provide a list of existing system participants and the system-wide block distribution for random peering. BitTorrent enforces fairness via a tit-for-tat mechanism based on bandwidth. Our inspection of the BitTorrent code reveals hard-coded constants for request and peering strategies, potentially limiting the adaptability of the system to a variety of network conditions relative to our approach. In addition, the tracker presents a single point of failure and limits system scalability.


Figure 15: The aggregate completion time for a varying number of parallel rsync processes for an update with 24MB of deltas on 40 PlanetLab nodes.


SplitStream [5] aims to construct an interior-node-disjoint forest of k Scribe [24] trees on top of a scalable peer-to-peer substrate [23]. The content is split into k stripes, each of which is pushed along one of the trees. This system takes into account the physical inbound and outbound link bandwidth of a node when determining the number of stripes a node can forward. It does not, however, consider the overall end-to-end performance of an overlay path. Therefore, system throughput might be decreased by congestion that does not occur on the access links.

In CoopNet [18], the source of the multimedia content locally computes random or node-disjoint forests of trees in a manner similar to SplitStream. The trees are primarily designed for resilience to node departures, with network efficiency as a secondary goal.

Slurpie [25] improves upon the performance of BitTorrent by using an adaptive downloading mechanism to scale the number of peers a node should have. However, it does not have a way of dynamically changing the number of outstanding blocks on a per-peer basis. In addition, although Slurpie has a random backoff algorithm that prevents too many nodes from going to the source simultaneously, nodes can connect to the webserver and request arbitrary blocks. This increases the minimum amount of time it takes for all blocks to be made available to the Slurpie network, leading to an increased minimum download time.

Avalanche [8] is a file distribution system that uses network coding [1]. The authors demonstrate the usefulness of having all system participants produce encoded blocks in scenarios where the source departs soon after sending the file once, and on specific network topologies. There are no live evaluation results of the system, but it is likely that Avalanche would benefit from the techniques outlined in this paper. For example, Avalanche participants will have to choose a number of sending peers that fills their incoming pipes. In addition, receivers will have to carefully negotiate the transfers of randomly produced encoded blocks to avoid bandwidth waste due to blocks that do not aid in file reconstruction, while keeping the incoming bandwidth from each peer high.

CoDeploy [19] builds upon CoDeeN, an existing HTTP Content Distribution Network (CDN), to support dissemination of large files. In contrast, Bullet′ operates without infrastructure support and achieves bandwidth rates (7Mbps on average with a source limited to 10Mbps) that exceed CoDeploy's published results.

Young et al. [29] construct an overlay mesh of k edge-disjoint minimum cost spanning trees (MSTs). Their algorithm for distributed tree construction uses overlay link metrics such as latency, loss rate, or bandwidth, which are determined by a potentially long and bandwidth-consuming probing stage. The resulting trees might start resembling a random mesh if links have to be excluded in an effort to reduce the probing overhead. In contrast, Bullet′ builds a content-informed mesh and completely eliminates the need for probing because it uses transfers of useful information to adapt to the characteristics of the underlying network.

6 Conclusions

We have presented Bullet′, a system for distributing large files across multiple wide-area sites under a wide range of dynamic network conditions. Through a careful evaluation of design-space parameters, we have designed Bullet′ to keep its incoming pipe full of useful data from a dynamic set of peers, with a minimal amount of control overhead. In the process, we have defined the important aspects of the general data dissemination problem and explored several possibilities within each aspect. Our results validate that each of these tenets of file distribution is an important consideration in building file distribution systems. Our experience also shows that strategies with tunable parameters might perform well in a certain range of conditions, but once outside that range they break down and perform worse than if they had been tuned differently. To combat this problem, Bullet′ employs adaptive strategies that self-tune over time, performing well in a much wider range of conditions, including many scenarios of dynamically changing conditions. Additionally, we have compared Bullet′ with BitTorrent, Bullet, and SplitStream; in all cases, Bullet′ outperforms the other systems.

Acknowledgments

We would like to thank David Becker for his invaluable help with our ModelNet experiments and Ken Yocum for his comments on our flow control algorithm. In addition, we thank our shepherd, Atul Adya, and our anonymous reviewers, who provided excellent feedback.

References

[1] Rudolf Ahlswede, Ning Cai, Shuo-Yen Robert Li, and Raymond W. Yeung. Network Information Flow. IEEE Transactions on Information Theory, vol. 46, no. 4, 2000.

[2] Suman Banerjee, Bobby Bhattacharjee, and Christopher Kommareddy. Scalable Application Layer Multicast. In Proceedings of ACM SIGCOMM, August 2002.

[3] BitTorrent. http://bitconjurer.org/BitTorrent.

[4] John W. Byers, Michael Luby, Michael Mitzenmacher, and Ashutosh Rege. A Digital Fountain Approach to Reliable Distribution of Bulk Data. In Proceedings of ACM SIGCOMM, 1998.

[5] Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, Animesh Nandi, Antony Rowstron, and Atul Singh. SplitStream: High-bandwidth Content Distribution in Cooperative Environments. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 2003.

[6] Ludmila Cherkasova and Jangwon Lee. FastReplica: Efficient Large File Distribution within Content Delivery Networks. In 4th USENIX Symposium on Internet Technologies and Systems, March 2003.

[7] Frank Dabek, Jinyang Li, Emil Sit, Frans Kaashoek, Robert Morris, and Chuck Blake. Designing a DHT for Low Latency and High Throughput. In Proceedings of the USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI), San Francisco, March 2004.

[8] Christos Gkantsidis and Pablo Rodriguez Rodriguez. Network Coding for Large Scale Content Distribution. In Proceedings of IEEE INFOCOM, 2005.

[9] Yang-hua Chu, Sanjay G. Rao, Srinivasan Seshan, and Hui Zhang. Enabling Conferencing Applications on the Internet using an Overlay Multicast Architecture. In Proceedings of ACM SIGCOMM, August 2001.

[10] John Jannotti, David K. Gifford, Kirk L. Johnson, M. Frans Kaashoek, and James W. O'Toole, Jr. Overcast: Reliable Multicasting with an Overlay Network. In Proceedings of Operating Systems Design and Implementation (OSDI), October 2000.


[11] Dina Katabi, Mark Handley, and Charles Rohrs. Internet Congestion Control for High Bandwidth-Delay Product Networks. In Proceedings of ACM SIGCOMM, August 2002.

[12] Dejan Kostic, Adolfo Rodriguez, Jeannie Albrecht, Abhijeet Bhirud, and Amin Vahdat. Using Random Subsets to Build Scalable Network Services. In Proceedings of the USENIX Symposium on Internet Technologies and Systems, March 2003.

[13] Dejan Kostic, Adolfo Rodriguez, Jeannie Albrecht, and Amin Vahdat. Bullet: High Bandwidth Data Dissemination Using an Overlay Mesh. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October 2003.

[14] Maxwell N. Krohn, Michael J. Freedman, and David Mazieres. On-the-Fly Verification of Rateless Erasure Codes for Efficient Content Distribution. In Proceedings of the IEEE Symposium on Security and Privacy, Oakland, CA, 2004.

[15] Michael Luby. LT Codes. In The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002.

[16] Michael G. Luby, Michael Mitzenmacher, M. Amin Shokrollahi, Daniel A. Spielman, and Volker Stemann. Practical Loss-Resilient Codes. In Proceedings of the 29th Annual ACM Symposium on the Theory of Computing (STOC '97), pages 150-159, New York, May 1997.

[17] Petar Maymounkov and David Mazieres. Rateless Codes and Big Downloads. In Proceedings of the Second International Peer-to-Peer Symposium (IPTPS), March 2003.

[18] Venkata N. Padmanabhan, Helen J. Wang, and Philip A. Chou. Resilient Peer-to-Peer Streaming. In Proceedings of the 11th ICNP, Atlanta, Georgia, USA, 2003.

[19] KyoungSoo Park and Vivek S. Pai. Deploying Large File Transfer on an HTTP Content Distribution Network. In Proceedings of the First Workshop on Real, Large Distributed Systems (WORLDS '04), 2004.

[20] Larry Peterson, Tom Anderson, David Culler, and Timothy Roscoe. A Blueprint for Introducing Disruptive Technology into the Internet. In Proceedings of ACM HotNets-I, October 2002.

[21] PingER site-by-month history table. http://www-iepm.slac.stanford.edu/cgi-wrap/pingtable.pl.

[22] Adolfo Rodriguez, Charles Killian, Sooraj Bhat, Dejan Kostic, and Amin Vahdat. MACEDON: Methodology for Automatically Creating, Evaluating, and Designing Overlay Networks. In Proceedings of the USENIX/ACM Symposium on Networked Systems Design and Implementation (NSDI), San Francisco, March 2004.

[23] Antony Rowstron and Peter Druschel. Pastry: Scalable, Distributed Object Location and Routing for Large-scale Peer-to-Peer Systems. In Middleware 2001, November 2001.

[24] Antony Rowstron, Anne-Marie Kermarrec, Miguel Castro, and Peter Druschel. SCRIBE: The Design of a Large-scale Event Notification Infrastructure. In Third International Workshop on Networked Group Communication, November 2001.

[25] Rob Sherwood, Ryan Braud, and Bobby Bhattacharjee. Slurpie: A Cooperative Bulk Data Transfer Protocol. In Proceedings of IEEE INFOCOM, 2004.

[26] Alex C. Snoeren, Kenneth Conley, and David K. Gifford. Mesh-Based Content Routing Using XML. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), October 2001.

[27] Andrew Tridgell. Efficient Algorithms for Sorting and Synchronization. PhD thesis, 1999.

[28] Amin Vahdat, Ken Yocum, Kevin Walsh, Priya Mahadevan, Dejan Kostic, Jeff Chase, and David Becker. Scalability and Accuracy in a Large-Scale Network Emulator. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), December 2002.

[29] Anthony Young, Jiang Chen, Zheng Ma, Arvind Krishnamurthy, Larry Peterson, and Randolph Y. Wang. Overlay Mesh Construction Using Interleaved Spanning Trees. In Proceedings of IEEE INFOCOM, 2004.
