Thesis - Electrical and Computer Engineering - University of Virginia

A STUDY OF APPLICATIONS FOR

OPTICAL CIRCUIT-SWITCHED NETWORKS

A Thesis

Presented to

the faculty of the School of Engineering and Applied Science

University of Virginia

In Partial Fulfillment

of the requirements for the Degree

Master of Science

Computer Science

by

Xiuduan Fang

May 2006

APPROVAL SHEET

This thesis is submitted in partial fulfillment of the requirements for the degree of

Master of Science

Computer Science

Xiuduan Fang

This thesis has been read and approved by the examining committee:

Malathi Veeraraghavan (Advisor)

Marty Humphrey (Chair)

Alfred Weaver

Accepted for the School of Engineering and Applied Science:

Dean, School of Engineering and Applied Science

May 2006

Abstract

The networking community has made a significant investment in GMPLS networks, which are

connection-oriented networks that support dynamic call-by-call bandwidth sharing. Currently,

GMPLS switches are call blocking and GMPLS control-plane protocols only support immediate

requests for bandwidth. This thesis first addresses the suitability of different types of applications

for GMPLS networks. Using the Erlang-B formula, we reason that GMPLS networks are well

suited for applications in which the required per-circuit bandwidth is on the order of

one-hundredth of the shared link capacity.

Then, we propose two applications for the GMPLS network, CHEETAH, which we have de-

ployed as part of an NSF-sponsored project. The first is a web transfer application, for which we

design and implement a software package called WebFT. We integrate the CHEETAH end-host

software modules into WebFT to provide deterministic data-transfer services transparently to users.

The CHEETAH network provides connection-oriented services in addition to the connectionless

service offered by the Internet. This “add-on” design allows the WebFT package to provide normal

web access to non–CHEETAH clients through the Internet while simultaneously serving CHEE-

TAH clients on dedicated circuits. The experiments conducted on the CHEETAH testbed show

that WebFT can achieve low-variance, end-to-end transfer delays at different circuit rates and low

transfer delays when high-speed circuits are possible.

The second application is parallel file transfers on CHEETAH. We identify that two factors

limit file-transfer throughput on networks with a high bandwidth-delay product: TCP’s congestion-

control algorithm and end-host limitations. We propose a general cluster solution to overcome these

two factors. The solution uses GridFTP striped transfer and Parallel Virtual File System, version

2 (PVFS2) to transfer data amongst multiple hosts in parallel over dedicated circuits. To minimize

end-host network and disk contention, we modify the GridFTP and PVFS2 code so that each pair

of sending and receiving hosts is responsible only for blocks located on their local disks, which

results in improved throughput.

Acknowledgments

I am indebted to my advisor, Professor Malathi Veeraraghavan, for her consistent guidance and

support. Professor Veeraraghavan has tirelessly guided me, teaching me how to do research in a

systematic way. She has spent significant time on improving my writing skills. She has been and

will always be an excellent role model for me.

I am also grateful to all the other members in our research group, Dr. Xuan Zheng, Xiangfei

Zhu, Zhanxiang Huang, Tao Li, and Anant P. Mudambi, for all their help.

I am especially grateful to my grandmother, my parents, my brother Kevin, and my husband

Lin for their continuous love and support. Without them, I could not have achieved what I have

achieved today.

Finally, this work was carried out under the sponsorship of NSF ITR-0312376, NSF EIN-

0335190, and DOE DE-FG02-04ER25640 grants.

Contents

Acknowledgments

1 INTRODUCTION

2 BACKGROUND
  2.1 CO Networking
    2.1.1 CO Networks and GMPLS Control-Plane Protocols
    2.1.2 Existing Switches, Gateways, and Networks
  2.2 CHEETAH Network
    2.2.1 CHEETAH Concept and Network
    2.2.2 CHEETAH End-Host Software

3 ANALYTICAL MODELS OF GMPLS NETWORKS
  3.1 Bandwidth Sharing Model
    3.1.1 Model for Applications in which Call-Holding Time is Independent of Per-Circuit Bandwidth
    3.1.2 Model for Applications in which Call-Holding Time is Dependent on Per-Circuit Bandwidth
  3.2 Numerical Results
    3.2.1 Applications in which Call-Holding Time is Independent of Per-Circuit Bandwidth
    3.2.2 Applications in which Call-Holding Time is Dependent on Per-Circuit Bandwidth
  3.3 Conclusions

4 WEB TRANSFER APPLICATION ON CHEETAH
  4.1 WebFT Design
    4.1.1 WebFT Architecture
    4.1.2 CGI Scripts
    4.1.3 The WebFT Sender
    4.1.4 The WebFT Receiver
  4.2 Experimental Testbed and Results
  4.3 Conclusions

5 PARALLEL FILE TRANSFERS ON CHEETAH
  5.1 Introduction
  5.2 Background
    5.2.1 FTP and GridFTP
    5.2.2 PVFS2
  5.3 The Single-Host Solution
  5.4 The General-Case Cluster Solution
    5.4.1 The Splitting Degree
    5.4.2 Design
    5.4.3 Implementation: Modifications to PVFS2
    5.4.4 Implementation: Modifications to GridFTP
    5.4.5 Experimental Results
  5.5 The Specific Cluster Solution for TSI
  5.6 Conclusions

6 CONCLUSIONS AND FUTURE WORK
  6.1 Conclusions
  6.2 Future Work

Bibliography

List of Figures

2.1 Distributed call-setup process progressing hop-by-hop
2.2 CHEETAH concept
2.3 CHEETAH experimental testbed
2.4 CHEETAH end-host software
3.1 Call-based sharing model for any single link of a switch
3.2 A bandwidth sharing model for file transfers
3.3 Plots of Pb vs. m for U = 40%, 60%, 80%, and 90%
3.4 Plots of ρ vs. m and ρ/m vs. m
3.5 Plots of Pb vs. χ and U vs. χ for m = 10, 100, and 1000, N · λ0 = 50 and 100, α = 1.1, and k = 1.25 MB
3.6 Plot of N · λ0 vs. χ for m = 10, 100, and 1000, U = 60% and 80%, α = 1.1, and k = 1.25 MB
3.7 Plots of N vs. m for U = 40%, 60%, 80%, and 90%
4.1 WebFT architecture
4.2 The flow of events from running CGI scripts
4.3 The flow chart for the WebFT sender
4.4 CHEETAH testbed for WebFT
4.5 The web page to test WebFT
5.1 The single-host solution vs. the general-case cluster solution
5.2 The model and flow chart of third-party control
5.3 The model and flow chart of GridFTP striped transfer
5.4 PVFS system architecture
5.5 A model of using GridFTP partial file transfer to implement the transferring step
5.6 A model of using GridFTP striped transfer to implement the transferring step
5.7 A snippet of pvfs2-fs2.conf, the PVFS2 configuration file on sunfire6
5.8 A part of the output for pvfs2-fs-dump
5.9 The content of an s KB file
5.10 A part of the output for the command more testfile/pvfs2cp2 | grep connect
5.11 A part of the output of the command more testfile/pvfs2cp2 | grep writev | more
5.12 The pvfs2-fs-dump output for the test 1000M file
5.13 A snippet from the file pvfs2cp
5.14 A part of the output for the strace command
5.15 A snippet of the source code for PINT_cached_config_get_next_io()
5.16 The commands to start GridFTP servers on sunfire
5.17 A part of the debug output for the GridFTP striped transfer
5.18 The tcptrace outputs for GridFTP striped transfer before we modified GridFTP code
5.19 The tcptrace outputs for GridFTP striped transfer after we modified GridFTP code
5.20 The specific cluster solution for TSI

List of Tables

2.1 A classification of networks that reflects sharing modes
4.1 Average throughputs and delays at a variety of circuit rates
5.1 A summary of possible approaches to implement the general-case cluster solution
5.2 The logical server numbers for the physical I/O servers
5.3 The file descriptors and IP addresses for sunfire6 through sunfire10
5.4 The data-distribution pattern for /pvfs2/test 1000M

List of Abbreviations

API application programming interface

AS autonomous system

CHEETAH Circuit-switched High-speed End-to-End Transport ArcHitecture

CGI Common Gateway Interface

CL connectionless

CN compute node

CO connection-oriented

C-TCP Circuit-TCP

DNS Domain Name Server

DRAGON Dynamic Resource Allocation via GMPLS Optical Networks

FTP File Transfer Protocol

GbE Gigabit Ethernet

Gb/s gigabit per second

GB gigabyte

GFP Generic Framing Procedure

GMPLS Generalized Multiprotocol Label Switching

GPFS General Parallel File System

GSR Gigabit Switch Router

GT Globus Toolkit

I/O Input/Output

ION I/O node

IP Internet Protocol

KB kilobyte

LAN Local Area Network

LMP Link Management Protocol

MAN Metropolitan Area Network

Mb/s megabit per second

MB megabyte

MPLS Multiprotocol Label Switching

MSPP Multi-Service Provisioning Platform

MTU Maximum Transmission Unit

NCSU North Carolina State University

NFS Network File System

NIC network interface card

OC Optical Carrier

OCS Optical Connectivity Service

ORNL Oak Ridge National Laboratory

PCI–X Peripheral Component Interconnect Extended

PVFS2 Parallel Virtual File System, version 2

QoS Quality of Service

RAID redundant array of inexpensive disks

RD routing decision

RSVP–TE Resource ReSerVation Protocol–Traffic Engineering

RTP Research Triangle Park

RTT round-trip delay time

SDM Space Division Multiplexing

SLR Southern Light Rail

SNMP Simple Network Management Protocol

SONET Synchronous Optical Network

SOX Southern Crossroads

TB terabyte

TCP Transmission Control Protocol

TDM Time Division Multiplexing

TE traffic engineering

TSI Terascale Supernova Initiative

VC virtual circuit

VLSR Virtual Label Switch Router

WAN Wide Area Network

WDM Wavelength Division Multiplexing

Chapter 1

INTRODUCTION

The networking community has made a significant investment in connection-oriented (CO) net-

working. Allowing the reservation of bandwidth in the form of a dedicated circuit, or virtual circuit

(VC), through a CO network prior to data transfers, this networking mode is recognized for its

ability to offer service guarantees at some cost of utilization and fairness.

A number of optical CO testbeds, some of which use Generalized Multiprotocol Label Switch-

ing (GMPLS), have been deployed for research and educational purposes. These include CA-

NARIE’s CA*net 4 [11], OMNInet [34], SURFnet [49], UKLight [55], DOE’s UltraScience net

[41], Dynamic Resource Allocation via GMPLS Optical Networks (DRAGON) [46], and Circuit-

switched High-speed End-to-End Transport ArcHitecture (CHEETAH) [13]. Further software

projects to enable the use of MPLS tunnels across Internet2 [26] and across the Department of

Energy’s ESnet [15] are also underway.

Most of these networks are primarily designed for large-scale scientific applications. Some of

these applications require high-bandwidth circuits and long call-holding times. To create large-

scale circuit or VC networks, we need to extend the usage of these networks beyond scientific

applications to millions of users. Thus, we need to identify and design more applications to use

these networks efficiently.

The first goal of this thesis is to determine what applications are well served by GMPLS net-

works, which currently only support immediate-request calls. We use the Erlang-B formula to

analyze the suitability of different types of applications. The study of application suitability for

GMPLS networks identifies applications suited to these networks in general, and specifically the

CHEETAH testbed.
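As a rough illustration of the kind of reasoning the Erlang-B formula supports (this is not the model developed in Chapter 3), the sketch below evaluates the blocking probability of a link that can carry m simultaneous circuits. The function name erlang_b and the choice of an offered load equal to 80% of m are our own illustrative assumptions.

```python
def erlang_b(m: int, offered_load: float) -> float:
    """Erlang-B blocking probability for m circuits and a given offered load (in Erlangs).

    Uses the standard recursion B(0) = 1, B(k) = a*B(k-1) / (k + a*B(k-1)),
    which avoids the large factorials of the closed-form expression.
    """
    if m < 0 or offered_load < 0:
        raise ValueError("m and offered_load must be non-negative")
    b = 1.0  # with zero circuits every call is blocked
    for k in range(1, m + 1):
        b = offered_load * b / (k + offered_load * b)
    return b


if __name__ == "__main__":
    # Assumed scenario: offered load fixed at 80% of the number of circuits.
    for m in (10, 100, 1000):
        a = 0.8 * m
        print(f"m={m:5d}  offered load={a:7.1f} Erlangs  Pb={erlang_b(m, a):.3e}")
```

Running the loop shows how, at the same per-circuit offered load, blocking falls as the link is divided into more (and therefore smaller) circuits.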

Then, we study two applications for CHEETAH. The first is a web transfer application, where

we present a solution to improve web performance by leveraging CHEETAH without requiring

modifications to existing web server and client software. We implement a CGI-based software pack-

age called WebFT. WebFT is integrated with the CHEETAH end-host software modules to provide

deterministic data-transfer services transparently to users. With dedicated circuits on CHEETAH,

WebFT can achieve low-variance, end-to-end transfer delays at different circuit rates and low trans-

fer delays when high-speed circuits are possible.

The second application is parallel file transfers on CHEETAH, where we study how to achieve

multi-Gb/s throughput for bulk data transfers over WANs. We identify two factors that limit

throughput to hundreds of Mb/s: TCP’s congestion-control algorithm and end-host limitations.

Then, we present a cluster solution over dedicated circuits, using GridFTP striped transfer and Par-

allel Virtual File System, version 2 (PVFS2) to achieve multiple-host parallelism, and thus, improve

overall throughput.

The rest of this thesis is organized as follows. In Chapter 2, we provide background information

on a class of call-blocking CO networks and the CHEETAH experimental testbed. In Chapter 3, we

explore the suitability of different types of applications for call-blocking CO networks. In Chap-

ter 4, we design and implement a software package, called WebFT, to improve web performance

through CHEETAH. In Chapter 5, we propose a cluster solution using GridFTP striped transfer and

PVFS2 for parallel file transfers. Finally, we present our conclusions and list future-work items in

Chapter 6.

Chapter 2

BACKGROUND

In this chapter, we first review different types of GMPLS networks and control-plane protocols. We

point out that current GMPLS implementations use a call-blocking approach. Then, we briefly

describe existing equipment and networks in which CO services can be enabled. Finally, we give an

overview of the CHEETAH network and CHEETAH end-host software because all the work in this thesis has

been conducted as a part of the CHEETAH project.

2.1 CO Networking

Networks are commonly classified by scale into Local Area Networks (LANs), Metropolitan Area

Networks (MANs), Wide Area Networks (WANs), wireless networks, home networks, and inter-

networks [50]. This classification, however, misses the critical aspect of networking—resource

sharing. To reflect how resources are shared in networks, Veeraraghavan and Karol gave a classification

of networks based on both switching type and networking type, as shown in Table 2.1 [56]. In

this section, we focus on the CO networking mode and, more specifically, on a class of call-blocking

GMPLS networks.

2.1.1 CO Networks and GMPLS Control-Plane Protocols

Table 2.1: A classification of networks that reflects sharing modes

Networking type     | Circuit-switched (multiplexing/switching type) | Packet-switched (multiplexing/switching type)
Connectionless      | Not an option                                  | e.g., IP networks; Ethernet networks
Connection-oriented | e.g., Telephone network, SONET/SDH, WDM        | e.g., X.25, ATM, MPLS

There are two types of CO networks: packet-switched and circuit-switched (see Table 2.1).

Packet-switched CO networks include

• “Intserv” IP networks [8]

• Multiprotocol Label Switched (MPLS) [42] and Asynchronous Transfer Mode (ATM) net-

works

• IEEE 802.1p and 802.1q Virtual LAN (VLAN) Ethernet switch based networks [25]

Circuit-switched networks include

• Time-Division Multiplexed (TDM) SONET/SDH networks

• All-optical Wavelength Division Multiplexed (WDM) networks

• Space-Division Multiplexed (SDM) Ethernet switch based networks (an SDM connection is

created by mapping two ports into an untagged VLAN)

The GMPLS control-plane protocols are defined as a “common control plane” for these differ-

ent types of CO networks even though their data-plane protocols differ significantly. This common

control plane consists of:

1. Link Management Protocol (LMP) [29]

2. Open Shortest Path First–Traffic Engineering (OSPF–TE) routing protocol [27]

3. Resource Reservation Protocol–Traffic Engineering (RSVP–TE) signaling protocol [3]


These three protocols are designed to be implemented in a control processor at each network

switch. Each of these protocols provides an increasing degree of automation, and a corresponding

decreasing dependence upon manual network administration. This triple combination serves as an

excellent basis on which to create large-scale CO networks, in which switches can cooperate in a

completely automated fashion to respond to requests for end-to-end bandwidth. We consider each

protocol in a little more detail below, starting with LMP.

Primarily, the LMP module automatically establishes and manages the control channels between

adjacent nodes, discovers and verifies data-plane connectivity, and correlates data-plane

link properties. In GMPLS networks, there could be multiple data-plane links between two adjacent

nodes, and the control channel could be established on a separate physical link from any of the

data-plane links. A mechanism is required to automatically discover these data-plane links, verify

their properties, combine them into a single traffic-engineering (TE) link, and correlate data-plane

links to the control channel. Thus, LMP contributes to our plug-and-play goal for CO networks by

minimizing manual administration.
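The bookkeeping that results from this discovery and verification step can be pictured with a minimal sketch. It does not model the LMP message exchange itself; the class names DataLink and TELink, the helper bundle_te_link, and the choice to sum component capacities are illustrative assumptions of ours.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataLink:
    link_id: str
    capacity_mbps: float
    verified: bool = False  # set once connectivity verification has succeeded


@dataclass
class TELink:
    neighbor: str
    control_channel: str  # may run over a physical link separate from the data links
    components: List[DataLink] = field(default_factory=list)

    def total_capacity_mbps(self) -> float:
        # Assumed aggregation rule: the TE link advertises the sum of its components.
        return sum(link.capacity_mbps for link in self.components)


def bundle_te_link(neighbor: str, control_channel: str,
                   discovered: List[DataLink]) -> TELink:
    """Combine the verified data-plane links toward one neighbor into a single TE link."""
    verified = [link for link in discovered if link.verified]
    return TELink(neighbor, control_channel, verified)
```

For example, bundling two verified links and one unverified link toward the same neighbor yields a TE link with only the two verified components, all associated with one control channel.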

The OSPF–TE routing protocol software module, located at a switch, enables the switch to

send topology, reachability, and the loading conditions of its interfaces to other switches, and re-

ceive corresponding information from them. This data-dissemination process allows the route com-

putation module at the switch to determine the next-hop switch toward which to direct a connection

setup (this module could be part of the signaling-protocol module or could be used to pre-compute

routing data ahead of when call-setup requests arrive). As a routing protocol, its value in creating

large-scale connectionless networks has already been observed with the success of the Internet. Ad-

mittedly, being a link-state protocol, it is only used intra-domain—that is, within the network of an

organization, referred to as an autonomous system (AS). Even within this intra-domain context, it

organizes the AS as a two-layer hierarchy, meaning that the AS is partitioned into self-contained

areas interconnected by a backbone area. In conjunction with the path-vector inter-domain

routing protocol, Border Gateway Protocol (BGP), we have a highly decentralized automated

mechanism to spread routing information, which was critical to the scaling of the Internet.
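The route computation that this link-state information feeds can be sketched as a constrained shortest-path search: links whose advertised available bandwidth is below the request are skipped, and Dijkstra's algorithm runs on what remains. This is a minimal sketch, not the algorithm of any particular switch; the topology encoding and the function name cspf are assumptions made for illustration.

```python
import heapq
from typing import Dict, List, Optional, Tuple

# topology[u][v] = (link cost, available bandwidth) for the directed link u -> v
Topology = Dict[str, Dict[str, Tuple[float, float]]]


def cspf(topology: Topology, src: str, dst: str, bandwidth: float) -> Optional[List[str]]:
    """Return the least-cost path from src to dst using only links that can
    carry `bandwidth`, or None if no such path exists."""
    dist = {src: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, src)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        if u == dst:
            break
        for v, (cost, available) in topology.get(u, {}).items():
            if available < bandwidth:
                continue  # constraint: prune links without enough bandwidth
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return path
```

The first hop of the returned path is the next-hop switch toward which the connection setup would be directed.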


Finally, an RSVP–TE signaling engine at a switch manages the bandwidth of all the interfaces

on the switch, and programs the data-plane switch hardware to enable it to forward demultiplexed

incoming user bits or packets as and when they arrive. Given that dynamic bandwidth sharing in

CO networks is controlled by the signaling engine, the call-handling performance of this engine is

critical to the scaling of CO networks. The faster the response times of signaling engines, the lower

the cost to an application to release and reacquire bandwidth as and when needed. This allows

applications to hold circuits only for the duration of their communication bursts, which, in turn,

improves link utilization. The need for high call-handling performance from signaling engines can

be met with a completely automated and distributed bandwidth-management implementation. This

will allow for both temporal and spatial scalability (i.e., shorter call-holding times and networks

with large numbers of switches and hosts).
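A small worked calculation makes the link between signaling speed and utilization concrete. The sketch assumes, as a simplification of ours, that a circuit's bandwidth is held for the setup delay plus the transfer time, so that the useful fraction of the holding time is transfer time divided by holding time. The function name and the example numbers are illustrative only.

```python
def circuit_efficiency(file_bytes: float, rate_bps: float, setup_s: float) -> float:
    """Fraction of the circuit holding time spent carrying data, assuming the
    circuit is held for setup_s seconds of signaling plus the transfer time."""
    transfer_s = file_bytes * 8 / rate_bps
    return transfer_s / (transfer_s + setup_s)


if __name__ == "__main__":
    # Assumed example: a 125 MB burst on a 1 Gb/s circuit (1 s of transfer time).
    for setup in (1.0, 0.1, 0.01):
        eff = circuit_efficiency(125_000_000, 1e9, setup)
        print(f"setup delay {setup:5.2f} s -> efficiency {eff:.3f}")
```

Under this assumption, shortening the setup delay raises the efficiency of short-lived circuits, which is what makes holding circuits only for individual bursts attractive.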

An RSVP–TE engine implemented in a control card at a switch executes three steps when it

receives a connection setup Path message (i.e., a request for bandwidth), as shown in Fig. 2.1.

[Figure: two adjacent GMPLS switches. In the control plane of each switch, a Path message (BW, D) arriving from the previous switch on the path passes through route lookup, bandwidth and label management, and switch fabric configuration (which acts on the data plane), and is then sent to the next switch on the path. BW: bandwidth; D: destination address.]

Figure 2.1: Distributed call-setup process progressing hop-by-hop

1. Route computation: Based on the destination address to which the connection is requested

(D, in the example shown in Fig. 2.1), the RSVP–TE engine determines the next-hop switch


toward which to route the connection or a subset of switches on the end-to-end path within

its area of its domain. Constrained Shortest Path First (CSPF) algorithms can only be exe-

cuted intra-area because of the intra-area scope of bandwidth related parameters in OSPF–TE

messages.

2. Bandwidth and label management: If the switch is in a position to only compute the next-hop

switch in the route computation phase, then it needs to check if there is sufficient bandwidth

on a link connected to the next-hop switch. If it performs CSPF to determine a part of the

end-to-end route (i.e., the subset of switches on the path within its area of its domain), then

this step of bandwidth management is integrated with the partial route computation. But at

subsequent switches within the area, this step is required to check if there is sufficient band-

width available on the link to the next-hop indicated in the partial source route passed within

the Path signaling message (see Fig. 2.1 for how Path messages travel hop-by-hop). This

is because local conditions can change between the last routing protocol update, which pro-

vided the data used in the CSPF computation, and the arrival of the call being set up. Typical

implementations use a call-blocking approach where calls are simply rejected if sufficient

bandwidth is not available. Label management is the selection of labels to be used on incoming

and outgoing switch interfaces. In the data plane, labels can be either explicit (e.g., labels

used within packet headers in VC networks) or implicit (e.g., time

slots, wavelengths, or interface identifiers in TDM, WDM, and SDM networks). In the control

plane, labels are explicit in both types of switches, with the labels identifying time slots,

wavelengths and interface identifiers to be used for the connection across a circuit switch.

These labels are used in the next step.

3. Switch fabric configuration: This step is needed to configure the switch fabric to forward

user data as and when they arrive. This function maps incoming labels associated with input

interfaces to outgoing labels on appropriate outgoing interfaces. In packet switches, there is

an additional step to program the scheduler to enable it to serve packets arriving on the VC

being set up at the requested bandwidth level.
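The three steps above can be sketched for a single switch as follows. This is a deliberately simplified model: route lookup is a static table, labels are plain integers chosen locally while the Path message is processed, and the "fabric" is a dictionary. The class and method names are hypothetical, and the actual RSVP–TE label-exchange mechanics are not modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class OutInterface:
    capacity: float        # total bandwidth on the link, e.g. in Mb/s
    reserved: float = 0.0  # bandwidth already allocated to admitted calls
    next_label: int = 1    # next unused label on this interface


@dataclass
class Switch:
    routes: Dict[str, str]               # destination -> outgoing interface name
    interfaces: Dict[str, OutInterface]
    # fabric maps (incoming interface, incoming label) -> (outgoing interface, outgoing label)
    fabric: Dict[Tuple[str, int], Tuple[str, int]] = field(default_factory=dict)

    def handle_path(self, in_if: str, in_label: int, dest: str,
                    bw: float) -> Optional[Tuple[str, int]]:
        # Step 1: route computation (here a simple next-hop lookup).
        out_if = self.routes.get(dest)
        if out_if is None:
            return None
        # Step 2: bandwidth and label management; the call is blocked if
        # the outgoing link cannot carry the requested bandwidth.
        iface = self.interfaces[out_if]
        if iface.reserved + bw > iface.capacity:
            return None
        iface.reserved += bw
        out_label = iface.next_label
        iface.next_label += 1
        # Step 3: switch fabric configuration.
        self.fabric[(in_if, in_label)] = (out_if, out_label)
        return out_if, out_label
```

A return value of None stands for a rejected call; otherwise the returned interface and label identify where the Path message, and later the user data, would go next.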


Fig. 2.1 does not show the rest of the call-setup procedure: the continued hop-by-hop propagation

of the Path message and the Resv message returning in the opposite direction, which

implicitly confirms successful connection setup. Detailed procedures are also defined in RSVP–TE

for call-setup failure.

As mentioned in step 2, the bandwidth-management procedure implemented in most GMPLS

switches is based on call blocking. In other words, if the requested bandwidth is not available when

a call arrives, the call request is rejected. There is support for preemption, but if no existing call is

preemptable (because of priority levels), then the call is blocked.
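The admission rule just described, including preemption, can be sketched for one link as below. It is a minimal sketch under our own assumptions: each call has a single priority value (lower numbers are more important), only strictly less important calls may be preempted, and the least important ones are preempted first. The names Call and Link are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    call_id: str
    bandwidth: float
    priority: int  # assumed convention: lower number = more important


class Link:
    def __init__(self, capacity: float) -> None:
        self.capacity = capacity
        self.calls: List[Call] = []

    def free(self) -> float:
        return self.capacity - sum(c.bandwidth for c in self.calls)

    def admit(self, new: Call) -> bool:
        """Admit the call if bandwidth is free or can be reclaimed by
        preempting less important calls; otherwise block it."""
        if new.bandwidth <= self.free():
            self.calls.append(new)
            return True
        # Candidate victims: strictly less important calls, least important first.
        victims = sorted((c for c in self.calls if c.priority > new.priority),
                         key=lambda c: c.priority, reverse=True)
        chosen: List[Call] = []
        reclaimed = 0.0
        for c in victims:
            if self.free() + reclaimed >= new.bandwidth:
                break
            chosen.append(c)
            reclaimed += c.bandwidth
        if self.free() + reclaimed < new.bandwidth:
            return False  # no preemptable call frees enough bandwidth: call is blocked
        for c in chosen:
            self.calls.remove(c)
        self.calls.append(new)
        return True
```

Note that a blocked call is simply rejected; nothing is queued, which is the defining property of the call-blocking approach.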

The counterpart call-queuing model, though analyzed in textbooks [44], is seldom imple-

mented. This is because a call traversing multiple links requires a simultaneous allocation of

bandwidth on all these links. A distributed call-queuing model requires a call (an RSVP–TE Path

message) to wait in a queue until resources become available at the first switch, and then to join a

queue at the next switch in a hop-by-hop manner as shown in Fig. 2.1. Resources allocated to a call

at upstream switches will lie unused while the Path messages are queued at downstream switches.

Parallelizing this wait time by simultaneously queuing the call at multiple switches will decrease

wasted bandwidth, but not eliminate it. Therefore, call queuing is seldom implemented.

The RSVP–TE and OSPF–TE control-plane protocols do not support advance reservations of

bandwidth. For example, there are no objects defined in RSVP–TE to specify a future start time in

a Path message. Nor are there parameters defined in OSPF–TE to report future loading conditions

in the TE link state advertisements. Hence, these GMPLS control-plane protocols only support

immediate-request or on-demand calls.

2.1.2 Existing Switches, Gateways, and Networks

The most common network switches today are Ethernet switches, IP routers and SONET/SDH

switches. The first two are primarily connectionless packet switches; however, Ethernet switches

have VLAN capabilities with limited Quality of Service (QoS) support. A VLAN is constructed

by programming the switch to include two or more ports. It can be tagged or untagged. In tagged

mode, all Ethernet frames are tagged with a VLAN header that includes a VLAN ID. Frames


tagged with the same VLAN ID are treated in the same manner; that is, they are forwarded to all

the ports belonging to that VLAN. An untagged VLAN with two ports is essentially an SDM circuit

because all Ethernet frames arriving on either port are sent exclusively to the other port. No frames

arriving on other ports are forwarded to ports in an untagged VLAN. Ethernet switches available

from Extreme Networks, Dell, Cisco, Intel, Foundry, and Force 10, just to name a few vendors,

have these capabilities. Thus, the data-plane capabilities required to create circuits or VCs through

Ethernet switches are now available. However, control-plane software used to set up and release

circuits dynamically is not implemented within these switches. The DRAGON project has developed a

software module called the Virtual Label Switch Router (VLSR), which implements the RSVP–TE

and OSPF–TE protocols. It runs on an external Linux host connected to the Ethernet switch [46] and

manages the bandwidth of the switch. It issues Simple Network Management Protocol (SNMP) [7]

commands to create the VLANs for admitted connections. With this external software, the Ethernet

switches become fully equipped CO switches.
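The forwarding behaviour that makes a two-port untagged VLAN act as an SDM circuit can be illustrated with a toy model. It ignores MAC learning and QoS, excludes the arrival port from the output set, and uses hypothetical class and method names; it is not the VLSR software, and it does not show the SNMP commands VLSR issues.

```python
from typing import Dict, Optional, Set


class EthernetSwitch:
    """Toy model of VLAN-based forwarding (no MAC learning, no QoS)."""

    def __init__(self) -> None:
        self.members: Dict[int, Set[int]] = {}   # VLAN ID -> member ports
        self.untagged_vlan: Dict[int, int] = {}  # port -> VLAN ID applied to untagged frames

    def add_tagged_vlan(self, vlan_id: int, ports: Set[int]) -> None:
        self.members[vlan_id] = set(ports)

    def add_untagged_vlan(self, vlan_id: int, ports: Set[int]) -> None:
        self.members[vlan_id] = set(ports)
        for port in ports:
            self.untagged_vlan[port] = vlan_id

    def forward(self, in_port: int, vlan_tag: Optional[int] = None) -> Set[int]:
        """Return the ports a frame is sent to, given its arrival port and
        (for tagged frames) the VLAN ID carried in its header."""
        vlan_id = vlan_tag if vlan_tag is not None else self.untagged_vlan.get(in_port)
        ports = self.members.get(vlan_id, set())
        if in_port not in ports:
            return set()  # frames from non-member ports never enter the VLAN
        return ports - {in_port}
```

With an untagged VLAN containing exactly ports 1 and 2, every frame arriving on port 1 leaves on port 2 and vice versa, and frames from any other port are never delivered to either of them.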

IP routers are equipped with MPLS engines and RSVP–TE signaling software for dynamic

control of MPLS VCs. Both Cisco and Juniper routers support MPLS.

SONET/SDH and WDM switches are circuit switches in which time slots and wavelengths

are respectively mapped from incoming to outgoing interfaces. Some of these switches now sup-

port RSVP–TE and OSPF–TE control-plane implementations. For example, Sycamore SONET

switches implement these protocols. Examples of WDM switches that implement GMPLS control-

plane protocols include Movaz and Calient WDM equipment.

In addition to supporting pure CO-switching functionality, some of this equipment can be used

as gateways to interconnect different types of networks. Before describing the gateway functional-

ity of these pieces of equipment, we establish some terminology.

We define the term network to consist of switches and endpoints (data-sourcing and sink-

ing entities) interconnected by shared communication links, on which the sharing (multiplexing)

mechanism is the same on all links. Further, we define the term switch as an entity in which all

links (interfaces) support the same (single) form of multiplexing (referred to as switching capabil-

ity [45]). For example, a SONET switch is one in which all interfaces carry TDM signals formatted


according to the SONET multiplexing standards, and a SONET network is one in which all the

switches are SONET switches. Typical endpoints in a SONET network are IP routers with SONET

line cards; these nodes are endpoints in the SONET network as they source and sink data carried on

to the SONET network.

We use the term internetwork to denote an interconnection of networks (referred to as multi-

region networks) [45]. Entities (nodes) that interconnect networks necessarily need the ability to

support interfaces with different types of multiplexing capabilities, minimally two. We use the term

gateways to refer to such nodes. An IP router is a gateway in the connectionless Internet with

different line cards implementing the protocols of the networks to which they are connected. The

gateway functionality is achieved by the IP implementation within the router examining IP datagram

headers to determine how to route a packet from an incoming network to an appropriate outgoing

network. In contrast, gateways in a CO internetwork move data from one network to another using

circuit or VC techniques. For example, Ethernet cards in a Sycamore SN16000 implement the

Generic Framing Procedure (GFP) Ethernet-to-SONET encapsulation to map all frames received

on any of its Ethernet ports into a port on a SONET line card, which connects this gateway node

to a SONET network. In this scenario, the circuit is a simple SDM circuit. We thus refer to these

gateways as circuit or VC gateways to contrast them with packet-based IP routers. An example of

a VC gateway is a Cisco GSR 12008, which supports line cards that can be programmed to map all

frames arriving on a specific VLAN into an MPLS tunnel set up on one of its other ports. It thus

interconnects a VLAN-based CO network to an MPLS-based CO network.

While the data-plane capabilities for extracting data from one type of multiplexed connection

and sending it on to a different type of multiplexed connection are available, the control-plane capa-

bilities for controlling such circuits or VCs are not yet standardized, and hence, not implemented.

Finally, as for current CO network deployments, SONET/SDH and WDM networks are al-

ready in widespread deployment. However, the dynamic bandwidth provisioning capability sup-

ported by the GMPLS control-plane protocols, while available on some switches in deployment, is

not yet made available to users. Similarly, the Abilene backbone of Internet2 and DOE’s ESnet have

routers with built-in MPLS and RSVP–TE capabilities. There are ongoing research projects [22,24]

to enable the use of dynamically requested VCs through these networks, including CHEETAH [13],

a SONET-based network, and DRAGON [46], a WDM-based network. Both CHEETAH and

DRAGON are call-blocking and immediate-request GMPLS networks.

2.2 CHEETAH Network

Our research group has deployed the CHEETAH network as part of an NSF-sponsored project

proposed to provide high-speed, end-to-end connectivity on a call-by-call basis. In this section, we

review the CHEETAH concept and the current experimental testbed. We also describe the end-host

software needed in CHEETAH-connected computers.

2.2.1 CHEETAH Concept and Network

CHEETAH is a networking solution to provide end-host applications access to end-to-end CO ser-

vices, while preserving the connectionless services already available to them via the Internet. In

other words, CHEETAH is designed as an add-on service to existing Internet connectivity, and

further, it leverages the services of the latter.

As shown in Fig. 2.2, end hosts are equipped with two Ethernet Network Interface Cards (NICs).

The primary NICs (NIC I) in the end hosts are connected to the public Internet through the usual

LAN Ethernet switches or IP routers, while the secondary NICs (NIC II) are connected to Ethernet

ports on Ethernet-to-SONET circuit gateways.

Figure 2.2: CHEETAH concept (each end host has NIC I attached through IP routers to the packet-switched Internet and NIC II attached through an Ethernet-SONET gateway to the optical circuit-switched CHEETAH network)

Ethernet-to-SONET circuit gateways, in turn, are connected to wide-area SONET circuit-

switched networks, in which both circuit gateways and pure SONET switches are equipped with

GMPLS protocols to support call-by-call dynamic bandwidth sharing. End-to-end CHEETAH cir-

cuits (as shown in the dashed line in Fig. 2.2) are set up dynamically between end hosts with

RSVP–TE signaling messages being processed at each intermediate gateway or switch in a hop-by-

hop manner.

The add-on design of the CHEETAH network brings two benefits:

1. Connectivity to the Internet allows a CHEETAH end host to communicate with other non–

CHEETAH hosts on the Internet while it communicates with another CHEETAH end host

through a dedicated CHEETAH circuit.

2. Applications can selectively choose to request CHEETAH circuits only when the Internet

path is estimated to provide a lower service quality than the CHEETAH circuit, and further

fall back to the Internet path if the CHEETAH circuit-setup attempt fails due to an unavail-

ability of circuit resources on the CHEETAH network.

Currently, the CHEETAH network consists of three Ethernet-to-SONET circuit gateways,

which are Sycamore SN16000 switches, deployed at MCNC in Research Triangle Park (RTP),

NC, Southern Crossroads (SOX) and Southern Light Rail (SLR) in Atlanta, GA, and Oak Ridge

National Laboratory (ORNL) in Oak Ridge, TN. The testbed layout is shown in Fig. 2.3. Hosts,

running Linux, are connected via Gigabit Ethernet (GbE) NICs to the SN16000 switches. The cir-

cuits, set up and released dynamically, consist of Ethernet segments from the hosts to the switches

mapped to Ethernet-over-SONET segments between the switches. The GbE signal is mapped to a

21-OC1 virtually concatenated SONET signal to create an end-to-end 1 Gb/s dedicated circuit.
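As a rough check on the choice of 21 OC-1 members, the arithmetic below assumes the standard STS-1 payload capacity of 48.384 Mb/s available to virtual concatenation; this value is not stated in this thesis and is used here only for illustration.

```latex
\[
21 \times 48.384\ \text{Mb/s} = 1016.064\ \text{Mb/s} \ge 1000\ \text{Mb/s},
\qquad
20 \times 48.384\ \text{Mb/s} = 967.68\ \text{Mb/s} < 1000\ \text{Mb/s}
\]
```

Under this assumption, 21 is the smallest number of members that can carry a full-rate GbE signal.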


Figure 2.3: CHEETAH experimental testbed (Sycamore SN16000 switches with control, OC192, and crossconnect cards; sites at ORNL, TN; SOX/SLR, GA; and MCNC/NCSU, NC; end hosts zelda1 to zelda5 and wukong; Juniper routers connecting to the Internet)

2.2.2 CHEETAH End-Host Software

We have developed a software package for Linux hosts, called CHEETAH end-host software,

to enable the automatic use of CHEETAH circuits. Wherever possible, our goal is to integrate li-

braries of this CHEETAH end-host software into application software modules to make CHEETAH

services transparent to human users.

The CHEETAH end-host software architecture is shown in Fig. 2.4. The Optical Connectivity

Service (OCS) client module is used to determine whether the correspondent end host (the called

party) is on the CHEETAH network. It does this by sending a TXT query to a Domain Name

System (DNS) server. The TXT resource record is a generic type supported by DNS to allow users to store

any data about hosts. The TXT data we store for a CHEETAH end host consist of an indication that

it is a CHEETAH end host, along with the IP and MAC addresses of the host’s secondary NIC.
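To make the OCS lookup concrete, the following minimal sketch parses the text of a TXT record after it has been retrieved. The `key=value` layout, the field names `cheetah`, `ip`, and `mac`, and the names `CheetahRecord` and `parse_ocs_txt` are all hypothetical; the thesis does not specify the actual encoding of the TXT data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheetahRecord:
    """Secondary-NIC addresses advertised for a CHEETAH end host."""
    ip: str
    mac: str


def parse_ocs_txt(txt: str) -> Optional[CheetahRecord]:
    """Parse TXT data in a HYPOTHETICAL 'key=value' layout.

    Assumed layout (not from the thesis): "cheetah=1 ip=<addr> mac=<addr>".
    Returns None when the record does not mark the host as a CHEETAH host.
    """
    # Keep only whitespace-separated tokens that look like key=value pairs.
    fields = dict(item.split("=", 1) for item in txt.split() if "=" in item)
    if fields.get("cheetah") != "1":
        return None
    if "ip" not in fields or "mac" not in fields:
        return None
    return CheetahRecord(ip=fields["ip"], mac=fields["mac"])
```

A caller would treat a `None` result as "not on the CHEETAH network" and use the Internet path.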

Figure 2.4: CHEETAH end-host software (on each end host, the CHEETAH software comprises the OCS client, routing decision, RSVP-TE client, and C-TCP modules alongside the application and TCP/IP; NIC 1 connects to the Internet and NIC 2 to the CHEETAH network)

The routing decision (RD) module answers queries from applications as to whether to attempt

a circuit setup. It makes these decisions by using collected measurements about the two paths, the

Internet path and the CHEETAH path, along with the size of the file to be transferred.
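The kind of comparison the RD module makes can be sketched as follows. This is only an illustrative rule, not the RD module's actual algorithm: the function name, its parameters, and the simple transfer-time comparison are assumptions introduced here.

```python
def should_request_circuit(
    file_size_bytes: float,
    internet_throughput_bps: float,
    circuit_rate_bps: float,
    setup_delay_s: float,
) -> bool:
    """Return True if the circuit path is estimated to finish sooner.

    Hypothetical rule: compare the estimated Internet transfer time with
    the circuit setup delay plus the transfer time at the circuit rate.
    """
    bits = 8.0 * file_size_bytes
    t_internet = bits / internet_throughput_bps
    t_circuit = setup_delay_s + bits / circuit_rate_bps
    return t_circuit < t_internet
```

With this rule, small files stay on the Internet path because the setup delay dominates, while large files are sent over a circuit.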

The RSVP–TE client module is used to initiate the setup and release of CHEETAH circuits

[59]. Parameters provided to this module include the secondary NIC IP address of the destination

to which a circuit is being requested and the desired bandwidth. The Sycamore switches in the

CHEETAH network receive these RSVP–TE messages, process them and set up circuits if the

requested bandwidth is available to the specified destination. It is a distributed switch-by-switch

signaling procedure.

The Circuit-TCP (C-TCP) module is the transport protocol that we have developed for CHEE-

TAH circuits [33]. Given that the bandwidth of a dedicated circuit is known before a file transfer

starts, any changes in the sending rate will either cause the circuit to remain idle or cause the receiver

buffer to fill up. Since neither option is desirable, we created our C-TCP module by essentially removing

the congestion-control algorithms of TCP, which were designed to keep adjusting the sending rate based

on IP network conditions. Congestion control is disabled selectively, only for TCP connections

traversing the secondary NIC, which is used for CHEETAH circuits.

TCP connections traversing the primary NIC connected to the Internet continue using the standard

TCP code.

Corresponding to each CHEETAH software module is a library providing application program-

ming interfaces (APIs) to invoke the services of each module. These libraries are expected to be

linked into applications using the CHEETAH software and network.

Chapter 3

ANALYTICAL MODELS OF GMPLS NETWORKS

In Chapter 2, we reasoned that GMPLS networks are call-blocking networks that only support

immediate-request calls. One important question is, what applications, if any, are suitable for GM-

PLS networks. This chapter addresses this problem. First, we present bandwidth sharing models for

two types of applications, ones in which the per-circuit bandwidth and mean call-holding time are

independent and ones in which they are dependent (file transfers). Then, we provide numerical results

for both models. Finally, we conclude that, for both types of applications, GMPLS networks are well suited

when the required per-circuit bandwidth is on the order of one-hundredth the shared link capacity.

3.1 Bandwidth Sharing Model

The switch model used in our analysis is illustrated in Fig. 3.1, in which calls originating from hosts

on the N links (e.g., the N Ethernet links connecting hosts to Ethernet interfaces on a gateway)

share the link capacity C on link L (e.g., the SONET/SDH/WDM/MPLS link out of a gateway).

Figure 3.1: Call-based sharing model for any single link of a switch (N input links, numbered 1 to N, sharing link L of capacity C)

We assume that call-setup requests arrive according to a Poisson process with rate λ, since many

call-arrival processes observable in practice can be modeled as Poisson processes [44]. Further, we

assume that call-holding times follow arbitrary distributions with a mean call-holding time denoted

as 1/µ. To understand the types of applications that can be supported on GMPLS circuit-switched

networks, we make a simplifying assumption that all calls are of the same type—that is, they need

the same amount of bandwidth. This allows us to treat link L as a link of m circuits, where each

circuit is of capacity C/m.

We ask two questions about the suitability of applications for GMPLS networks:

1. Are applications that require high-bandwidth circuits more or less desirable than applications

that require low-bandwidth circuits? (See footnote 1.)

2. Are applications that generate calls with long mean holding times more or less desirable than

calls with short mean holding times?

The first question is related to m, the number of circuits. The larger the per-circuit bandwidth, the

smaller the m for a given link capacity C. The second question is related to the mean call-holding

time, 1/µ.

For applications such as remote visualization and video conferencing, the mean holding time is

independent of the per-circuit bandwidth. On the other hand, for file transfers, commonly identified

as an application suitable for high-speed circuits [57], m and 1/µ are related. The larger the per-

circuit bandwidth (the smaller the m), the lower the mean call-holding time, 1/µ. We describe

models for these two cases in the following subsections, respectively.

3.1.1 Model for Applications in which Call-Holding Time is Independent of Per-

Circuit Bandwidth

Given our assumptions, we can model link L as an M/G/m/m system [44]. The call-blocking

probability in this model is given by the well-known Erlang-B formula:

\[ P_b = \frac{\rho^m / m!}{\sum_{i=0}^{m} \rho^i / i!} \tag{3.1} \]

where ρ, the offered traffic load, is given by ρ = λ/µ. Although this is a time-tested model for

telephony traffic, we find it equally useful for our current problem of identifying applications suited to

GMPLS networks.

Footnote 1: In this chapter, we only use the word “circuits,” but the same model and analysis hold for virtual circuits as well.

Assume that the number of calls per second arriving on each of the N ports that are destined for

link L is λ′. Thus, from Fig. 3.1, the aggregate call-arrival rate for link L, λ, is given by:

\[ \lambda = N \cdot \lambda' \tag{3.2} \]

The utilization of link L, U , is given by:

\[ U = \frac{\rho}{m}\,(1 - P_b) \tag{3.3} \]
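Equations (3.1) and (3.3) can be evaluated numerically with the short sketch below. It uses the standard recursive form of the Erlang-B formula, which is algebraically equivalent to (3.1) but avoids large factorials; the function names are illustrative and not part of the thesis.

```python
def erlang_b(rho: float, m: int) -> float:
    """Call-blocking probability Pb of (3.1) for offered load rho and m circuits."""
    b = 1.0  # with zero circuits, every call is blocked
    for n in range(1, m + 1):
        # Recursion B(n) = rho*B(n-1) / (n + rho*B(n-1))
        b = rho * b / (n + rho * b)
    return b


def utilization(rho: float, m: int) -> float:
    """Link utilization U of (3.3): carried load per circuit."""
    return rho / m * (1.0 - erlang_b(rho, m))
```

For example, `erlang_b(10.474, 10)` evaluates to about 0.236, with a corresponding utilization of about 0.80.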

3.1.2 Model for Applications in which Call-Holding Time is Dependent on Per-

Circuit Bandwidth

File-transfer applications belong in this category. Given that the GMPLS switch operates in a call-

blocking mode even when used for this category of applications, equations (3.1)–(3.3) apply here

as well. If file sizes are too small, the overhead incurred in call-setup delay will significantly reduce

link utilization (since call-setup delays could exceed file-transfer delays). Therefore, Veeraragha-

van’s team [57] proposed using an RD module at end hosts to decide, based on the file size and

other metrics, whether to request a circuit for a particular file transfer, or whether to simply use the

Internet connectivity.

Fig. 3.2 illustrates a model for the file transfer application. We use a settable parameter

crossover file size, χ, to model the behavior of the RD module, wherein files larger than χ are

routed to the CO network.

Figure 3.2: A bandwidth sharing model for file transfers (each of the N end hosts generates file transfers at rate λ0; its routing decision (RD) module passes calls at rate λ′ to the switch, giving an aggregate rate λ on link L of capacity C)

We assume that file sizes are distributed according to the Pareto distribution with the probability

density function:

\[ f(x) = \frac{\alpha k^{\alpha}}{x^{\alpha+1}}, \quad x \ge k \tag{3.4} \]

where α is the shape parameter (the larger the α, the higher the probability of small file sizes),

and k is the scale parameter, denoting the minimum file size. Crovella [14] characterized web file

sizes as following this distribution and suggested α in the range from 1.0 to 1.3 and a value for k of

1000 bytes.

Given that only files larger than χ are routed to the CO network, using (3.4), we derive the mean

file size, E[X |(X ≥ χ)], as

\[ E[X \mid X \ge \chi] = \frac{\alpha \chi}{\alpha - 1} \tag{3.5} \]
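One way to obtain (3.5) from (3.4), assuming α > 1 so that the mean is finite, is to divide the partial mean above χ by the tail probability P(X ≥ χ):

```latex
\begin{align*}
P(X \ge \chi) &= \int_{\chi}^{\infty} \frac{\alpha k^{\alpha}}{x^{\alpha+1}}\,dx
               = \left(\frac{k}{\chi}\right)^{\alpha}, \\
E[X \mid X \ge \chi] &= \frac{1}{P(X \ge \chi)} \int_{\chi}^{\infty} x\,\frac{\alpha k^{\alpha}}{x^{\alpha+1}}\,dx
               = \left(\frac{\chi}{k}\right)^{\alpha} \cdot \frac{\alpha k^{\alpha}\,\chi^{1-\alpha}}{\alpha-1}
               = \frac{\alpha\chi}{\alpha-1}.
\end{align*}
```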

We then estimate the mean call-holding time, 1/µ, as

\[ \frac{1}{\mu} = T_{prop} + E[T_{emission}] \tag{3.6} \]

where Tprop is the one-way propagation delay, and

\[ E[T_{emission}] = \frac{E[X \mid X \ge \chi]}{C/m} = \frac{\alpha \chi}{\alpha - 1} \cdot \frac{m}{C} \tag{3.7} \]

By neglecting Tprop, we can approximate:

\[ \frac{1}{\mu} = \frac{\alpha \chi}{\alpha - 1} \cdot \frac{m}{C} \tag{3.8} \]

capturing the inter-dependence of m and 1/µ. We justify neglecting Tprop as follows. E[Temission]

should be larger than Tprop because the latter is incurred as part of call-setup delay, and to maintain

a high link utilization, mean call-setup delay should be much smaller than E[Temission], which means

that Tprop is much smaller than E[Temission].


From Fig. 3.2, we can derive the call-arrival rate at link L as:

\[ \lambda = N \cdot \lambda' = N \cdot \lambda_0 \cdot P(X \ge \chi) = N \cdot \lambda_0 \cdot \left(\frac{k}{\chi}\right)^{\alpha} \tag{3.9} \]

Combining (3.9) with the mean holding time from (3.8), we get

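\[ \rho = \frac{\lambda}{\mu} = N \cdot \lambda_0 \cdot \frac{\alpha}{\alpha - 1} \cdot \frac{k^{\alpha}}{\chi^{\alpha-1}} \cdot \frac{m}{C} \tag{3.10} \]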
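The file-transfer model of (3.8) to (3.10) can be sketched numerically as below. The sketch assumes that χ and k are given in bytes and C in bits per second, so a factor of 8 converts file sizes to bits; this unit handling is an assumption, as is the function name. The defaults α = 1.1 and k = 1.25 MB are the values used for the plots later in this chapter.

```python
def file_transfer_load(n_lambda0, chi_bytes, m, c_bps, alpha=1.1, k_bytes=1.25e6):
    """Return (lam, holding_s, rho) for the file-transfer model.

    n_lambda0 : aggregate file-arrival rate N*lambda0 in calls/s
    chi_bytes : crossover file size in bytes
    m, c_bps  : number of circuits and link capacity in bits/s
    """
    # (3.9): only files larger than chi reach the CO network
    lam = n_lambda0 * (k_bytes / chi_bytes) ** alpha
    # (3.5), converted from bytes to bits (assumed unit conversion)
    mean_bits = 8.0 * alpha * chi_bytes / (alpha - 1.0)
    # (3.8): mean holding time at the per-circuit rate C/m, neglecting Tprop
    holding_s = mean_bits * m / c_bps
    # (3.10): offered load in Erlangs
    return lam, holding_s, lam * holding_s
```

For example, `file_transfer_load(100, 6e6, 100, 10e9)` gives an offered load of roughly 94 Erlangs.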

3.2 Numerical Results

3.2.1 Applications in which Call-Holding Time is Independent of Per-Circuit Band-

width

Assume that the link capacity C = 10 Gb/s. This is a reasonable value if the switch is a SONET

or MPLS switch. For WDM switches, if the number of wavelengths on link L is 100, then a more

reasonable value for C would be 1 Tb/s because each wavelength is typically engineered to support

10 Gb/s. We will consider this number later in this chapter. For now, we consider C = 10 Gb/s.

We study the effect of changing m from 1 to 1000; in other words, the per-circuit bandwidth

varies inversely from 10 Gb/s down to 10 Mb/s. We obtain numerical results corresponding to four different

fixed values of U: 40%, 60%, 80%, and 90%. Since we have two equations, (3.1) and (3.3), if

we fix two parameters, U and m, then the other two variables, ρ and Pb, become fixed as well. We

use the following iterative algorithm to obtain these values. First, we observe that for a given m, U

increases as ρ increases; we also conducted experiments to confirm this observation. We then start

by setting ρ = m and compute the corresponding Pb and U. If the current U is larger

than the given U, meaning that ρ is too large, we decrease ρ in steps of ∆ρ = 0.001 until the corresponding

U in the current iteration is smaller than the given U; otherwise, we increase ρ in steps of ∆ρ until the

corresponding U in the current iteration is larger than the given U. Next, we compare the current U

with its neighbor from the previous iteration and keep whichever is closer to the given U. Finally,

we compute the corresponding Pb. Fig. 3.3 plots Pb vs. m.
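The search for ρ at a given U and m can be sketched as follows. Note that this sketch substitutes bisection for the fixed-step search described above; it relies on the same observation that U increases with ρ, but it is not the procedure used to produce the thesis results. Function names are illustrative.

```python
def erlang_b(rho: float, m: int) -> float:
    """Erlang-B blocking probability (3.1), computed recursively."""
    b = 1.0
    for n in range(1, m + 1):
        b = rho * b / (n + rho * b)
    return b


def utilization(rho: float, m: int) -> float:
    """Link utilization (3.3)."""
    return rho / m * (1.0 - erlang_b(rho, m))


def solve_load(target_u: float, m: int, tol: float = 1e-9):
    """Find rho such that utilization(rho, m) == target_u, with 0 < target_u < 1.

    Returns (rho, Pb). Uses bisection, which is valid because U is
    monotonically increasing in rho for a fixed m.
    """
    lo, hi = 0.0, float(m)
    while utilization(hi, m) < target_u:  # widen the bracket if needed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if utilization(mid, m) < target_u:
            lo = mid
        else:
            hi = mid
    rho = 0.5 * (lo + hi)
    return rho, erlang_b(rho, m)
```

Calling `solve_load(0.8, 10)` yields a blocking probability of about 23.6%, in line with the value quoted for m = 10 at 80% utilization.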


Figure 3.3: Plots of Pb vs. m for U = 40%, 60%, 80%, and 90%: (a) m ∈ [1,100]; (b) m ∈ [101,1000]

From Fig. 3.3a, we see that at small values of m, it is hard to achieve high utilization combined

with low call-blocking probability. Consider m = 10, which corresponds to a per-circuit allocation

of 1 Gb/s per call (e.g., for HDTV applications). To run the link at an 80% utilization level, the

corresponding call-blocking probability will be a high 23.62%. Fig. 3.3b shows that at

large values of m, both high utilization and low call-blocking probability are achievable.

The effect of traffic load ρ is not obvious from Fig. 3.3. Therefore, we plot the traffic load ρ

vs. m and ρ/m vs. m in Fig. 3.4. From Fig. 3.4a, we see that ρ should be engineered to be high

when m is high. We also see that, as m increases, Pb decreases and ρ/m approaches U according to

(3.3). For example, when U = 60%, ρ/m approaches 0.6, reaching this value when m = 80. Thus,

ρ is typically close to and less than m when Pb is low (close to 0) and U is high (close to 1). For

example, at a fixed value of U = 80%, when m = 100, ρ = 80.35, Pb = 0.4%, and when m = 1000,

ρ = 800, Pb ≈ 0.

Figure 3.4: Plots of ρ vs. m and ρ/m vs. m for U = 40%, 60%, 80%, and 90%: (a) ρ vs. m; (b) ρ/m vs. m

From the two graphs (Figs. 3.3 and 3.4) we see that if we want to operate the link at a given

value of call-blocking probability, and a given value of utilization, the number of circuits, m, and

traffic load, ρ, become fixed. An alternative starting point is that a given application has a fixed

capacity requirement, which means that m is fixed. If we further assume that λ′, the call-arrival

rate per port, and mean call-holding time, 1/µ, are intrinsic to the application, then we can only

adjust the aggregate traffic load ρ by engineering N to achieve a given call-blocking probability or

utilization. But these graphs show us that once m is set, if m is small, we are highly limited in our

ability to achieve both high utilization and low call-blocking probability.

Having understood the influences of all the important variables in this model, ρ, m, Pb and U , let

us now consider three applications. The first application is a high-bandwidth application (m = 10),

the second, a low-bandwidth application (m = 1000) and finally, an intermediate-level bandwidth

application (m = 100).

High-bandwidth applications: When m = 10—that is, when the application requires a per-

circuit bandwidth of 1 Gb/s—we can achieve a target 80% utilization, only by operating the link at

a high call-blocking probability of 23.62%. Such a high call-blocking probability could be unac-

ceptable to users. We conclude that applications requiring a high per-circuit capacity relative to

the shared link capacity are unsuitable for the immediate-request call-blocking mode of bandwidth

sharing offered by GMPLS networks in situations where high utilization and low call-blocking probability

are important. Since, as discussed in Section 2.1.1, call queuing is not an option, it appears

that we need a book-ahead mechanism for such applications.

We then ask whether the above answer is dependent on the mean call-holding time. In other

words, when m is small, do we require a book-ahead mechanism only if the mean call-holding time

is large or do we need such a mechanism even if the mean call-holding time is small? For example,


in a doctor’s office, where there are three to four doctors per office (m is 3 or 4), since the mean

holding times (appointment lengths) are fairly high, on the order of 20-30 minutes, we use a book-ahead

mechanism. If the mean holding time is on the order of 1-2 minutes (e.g., at a bank teller),

could an immediate-request approach work? The answer is that it would if there were space to wait.

In other words, if the queuing system has a waiting buffer, high-bandwidth calls that have short

mean holding times could be handled without a reservation system. Unfortunately, as explained in

Section 2.1.1, queuing models are not suitable for calls. Therefore, for applications that require

high bandwidth (i.e., m is small, irrespective of the mean call-holding time), our conclusion of

needing a book-ahead mechanism holds.

Low-bandwidth applications: At the other extreme, consider large values of m, say m = 500

to m = 1000. For example, in a video-telephony application with motion JPEG cameras operating

at 25 frames/sec (motion-JPEG used instead of MPEG to meet the stringent delay requirements of

telephony), we could allocate 10 Mb/s on an MPLS-shared 10 Gb/s link, in which case m = 1000.

At these high values of m, call-blocking probability of almost 0 and utilization levels close to 1 are

achievable as seen in Fig. 3.3b; however, the required traffic load is high (close to m) as noted in

our analysis of Fig. 3.4.

Whether and how such traffic loads can be engineered depends upon the second important

factor, mean call-holding time. At a traffic load ρ = 500, if the mean call-holding time is small (say

3 minutes for a video-telephony call, which is the number typically quoted as the mean duration of

telephony calls), the aggregate call-arrival rate, λ, needs to be about 2.8 calls/sec. Say on average

each end host makes 1 call every two hours, which means λ′ in (3.2) is about 0.5 calls/hour. This

means that we need N to be 20160 to obtain an aggregate ρ of 500 Erlangs. In other words, we

need calls from 20160 end hosts to be multiplexed (perhaps through a multi-level hierarchy of

switches) into the switch shown in Fig. 3.1, destined to share link L’s capacity. This is a high level

of aggregation requiring switches with large numbers of ports. Since line cards (the more the ports,

the more the line cards) drive up the cost of switches, our conclusion is that to achieve a high

utilization with low-bandwidth applications that have short durations and low call-arrival rates,

we need to equip the switch with a large number of line cards to generate sufficient traffic, which


could be expensive.

Consider what happens if the mean call-holding time, 1/µ, is larger, say 2 hours, and mean

call-arrival rate is still low at 1 per 2 hours. This means the number of ports, N, feeding traffic into

the shared link can be 540. Building switches with this order of line cards is more feasible. We thus

conclude that the immediate-request, call-blocking mode of bandwidth sharing in GMPLS networks

can be used for low-bandwidth applications that have relatively long durations and low call-arrival

rates. There is an upper limit on mean call-holding time, because if it is very large, unless the call-arrival

rate is very low, ρ will become very large, causing a high call-blocking probability.

Intermediate-bandwidth applications: Finally, consider an intermediate level, where m is in

the range of 100. As seen from Fig. 3.3, call-blocking probabilities are very small when m = 100

even at utilizations of 90%. Now consider the question of mean call-holding times. If we again use

the video-conferencing application or eScience remote-visualization applications where the per-

circuit bandwidth is 100 Mb/s on a 10 Gb/s link (which means m = 100), and mean call-holding

times are in the 2-hour range, the required aggregate call-arrival rate is 40 per hour. If each port of

the switch offers a load of 1 call per 5 hours, we need N to be 200, which is an acceptable number

from a switch-cost perspective. Clearly, the higher the mean holding time, the smaller the N, and

hence, the more preferable the application. This result again is surprising: calls with long holding

times are preferable to calls with short holding times in a call-blocking mode of operation.
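The port-count arithmetic used in these examples follows from ρ = N · λ′ · (1/µ) and can be sketched as below; the function name is illustrative. For the intermediate-bandwidth case, ρ = 80 Erlangs is the load implied by 40 calls per hour with 2-hour holding times. For the video-telephony case, the exact result is 20000 ports; the figure of 20160 quoted above comes from first rounding λ up to 2.8 calls/sec.

```python
def ports_needed(rho_erlangs: float, calls_per_hour_per_port: float,
                 holding_hours: float) -> float:
    """Number of ports N such that N * lambda' * (1/mu) equals rho."""
    per_port_load = calls_per_hour_per_port * holding_hours  # Erlangs per port
    return rho_erlangs / per_port_load


# Intermediate-bandwidth case: 1 call per 5 hours per port, 2-hour calls
n_intermediate = ports_needed(80.0, 1.0 / 5.0, 2.0)

# Video-telephony case: 1 call per 2 hours per port, 3-minute calls
n_low_bandwidth = ports_needed(500.0, 0.5, 3.0 / 60.0)
```

The two results differ by two orders of magnitude, which illustrates why longer holding times reduce the required degree of aggregation.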

In summary, applications suitable for present-day GMPLS networks are those in which the

per-circuit capacity is on the order of 1/100th of the shared link capacity and whose holding times are on the order of tens of

minutes or higher.

3.2.2 Applications in which Call-Holding Time is Dependent on Per-Circuit Band-

width

As described in the model in Section 3.1.2, 1/(mµ) is constant if we neglect Tprop, and hence the

two questions raised at the start of Section 3.1 seem to reduce to one question. But if we study

the system at certain fixed values of m, say m = 10,100,1000 (as in Section 3.2.1), we have a

new parameter χ, the crossover file size, with which to manipulate the mean call-holding time 1/µ.


Therefore, in this section, we study the effect of χ on various metrics, such as ρ, Pb, U , and N ·λ0,

which represents the total call-arrival rate for all files whose sizes are greater than k.

Fig. 3.5 plots the two metrics, Pb, and U , against χ for fixed values of m and N ·λ0. The influence

of χ on ρ is interesting because two factors operate in opposing directions. As χ increases, at a given

m, the mean call-holding time, 1/µ, increases. But from (3.9), we see that λ is proportional to χ^{−α}

and hence decreases as χ increases. Since α is larger than 1, λ decreases at a rate faster than 1/µ

increases. As a result, ρ decreases with increasing χ. Decreasing ρ is the reason why Pb and U drop

with increasing χ.
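The net effect of the two opposing factors can be written out directly from (3.8) and (3.9), holding m, C, α, k, and N · λ0 fixed:

```latex
\[
\rho = \lambda \cdot \frac{1}{\mu}
\;\propto\; \chi^{-\alpha} \cdot \chi
= \chi^{-(\alpha-1)},
\qquad
\frac{d\rho}{d\chi} < 0 \quad \text{for } \alpha > 1 .
\]
```

As an illustration, with α = 1.1, doubling χ multiplies ρ by 2^{−0.1} ≈ 0.93, so the load falls slowly as the crossover file size grows.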

Figure 3.5: Plots of Pb vs. χ and U vs. χ for m = 10, 100, and 1000, N ·λ0 = 50 and 100, α = 1.1, and k = 1.25 MB: (a) Pb vs. χ; (b) U vs. χ

In Fig. 3.5, we hold N ·λ0 constant. But to see the effect of χ on the required call-arrival rate, we

plot N ·λ0 against χ for a set of given U in Fig. 3.6. From (3.10), we see that N ·λ0 is proportional

to χ^{α−1}. Therefore, N ·λ0 increases as χ increases. From this set of graphs, we see that we should

select a smaller χ so that the required N ·λ0 is not too large. If N ·λ0 is large, and the per-host call-

arrival rate, λ0, is low, it means that we need to engineer our switches with a large number of ports.

Another interesting result seen in this set of plots is that, unlike the results in Section 3.2.1, where

as m is increased, the required traffic load increases, here we see in Fig. 3.6 that, as m increases, the

required load N ·λ0 decreases.


Figure 3.6: Plot of N ·λ0 vs. χ for m = 10, 100, and 1000, U = 60% and 80%, α = 1.1, and k = 1.25 MB

We further plot Fig. 3.7 to contrast the effects of m on N for non-file-transfer applications and

file-transfer applications by fixing U and χ. As shown in Fig. 3.4a, ρ increases as m increases.

For non-file-transfer applications, since m and 1/µ are independent and 1/µ is constant, λ and N

increase with increasing ρ. We can also derive that the trend of N vs. m is the same as that of ρ vs.

m (see Fig. 3.4a and Fig. 3.7a). In other words, for m at a small value, the curve has a higher slope

than that for m at a large value. In particular, for m at a high value, the curve has an approximately

constant slope of (U ·µ)/λ′ (see Fig. 3.7a). But for file-transfer applications, 1/(mµ) is a constant

for a fixed χ, C, and α. From (3.10), we can see that the trend of N vs. m is the same as that of

ρ/m vs. m as shown in Fig. 3.4b. In particular, for large m, the curve for N vs. m is flat for a given

U (see Fig. 3.7b). Thus, for file transfers, we can allocate smaller amounts of bandwidth per call,

which means that m can be larger to achieve lower Pb and higher U without increasing N, provided the user

can tolerate the longer holding time.

Figure 3.7: Plots of N vs. m for U = 40%, 60%, 80%, and 90%: (a) non-file-transfer applications with λ′ = 0.5 call/s and 1/µ = 0.8 s; (b) file-transfer applications with λ0 = 0.5 call/s, α = 1.1, k = 1.25 MB, and χ = 8 MB

Repeating the questions asked in Section 3.2.1, we consider whether high-bandwidth circuits

can be used for file transfers. We reach the same answer as in Section 3.2.1 if m = 10. Fig. 3.5 shows

that the call-blocking probability is quite high (at 10% even at large χ) when m = 10. Furthermore,

Fig. 3.6 shows that a higher N ·λ0 load is required to achieve a certain U when m = 10 than when

m is larger. Therefore, we conclude that high-bandwidth circuits, such as m = 10, are not suitable

even for the file-transfer application, unless latency requirements dictate its use.

We see from Fig. 3.5 that using low-bandwidth circuits (m = 1000) does not reduce Pb or

increase U significantly if appropriate values of χ are selected, although it does not increase N

either (see Fig. 3.7b). Given the natural advantage of lower delay from using a lower m for file transfers,

we focus the rest of our analysis on the intermediate-bandwidth m = 100 case.

Now we consider the question of what crossover file size, χ, to select when m = 100. From

Fig. 3.5, we see that χ should be in the range from 6 MB to 29 MB to meet a utilization higher than

80% and a call-blocking probability lower than 5%. We observe that χ cannot be too large, because

if it is, then U decreases and the required call-arrival rate, N ·λ0, becomes large as seen in Fig. 3.6.

On the other hand, if it is too small, then Pb becomes too high.
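The trade-off in choosing χ can be explored with the sketch below, which evaluates (3.10), (3.1), and (3.3) at a given crossover size. It assumes χ and k are in bytes (1 MB taken as 10^6 bytes) and C in bits per second, hence the factor of 8; these unit conventions and the function names are assumptions. The defaults match the m = 100, N · λ0 = 100 case with α = 1.1 and k = 1.25 MB.

```python
def erlang_b(rho: float, m: int) -> float:
    """Erlang-B blocking probability (3.1), computed recursively."""
    b = 1.0
    for n in range(1, m + 1):
        b = rho * b / (n + rho * b)
    return b


def operating_point(chi_bytes, n_lambda0=100.0, m=100, c_bps=10e9,
                    alpha=1.1, k_bytes=1.25e6):
    """Return (rho, Pb, U) for a crossover file size chi_bytes."""
    # (3.10), with file sizes converted from bytes to bits
    rho = (n_lambda0 * alpha / (alpha - 1.0)
           * k_bytes ** alpha / chi_bytes ** (alpha - 1.0)
           * 8.0 * m / c_bps)
    pb = erlang_b(rho, m)
    return rho, pb, rho / m * (1.0 - pb)


small = operating_point(2e6)    # chi too small: blocking is high
lower = operating_point(6e6)    # near the lower end of the quoted range
upper = operating_point(29e6)   # near the upper end of the quoted range
```

Under these assumptions, χ = 6 MB gives a blocking probability just under 5% and χ = 29 MB gives a utilization of about 80%, consistent with the range read from Fig. 3.5.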

To achieve a low call-blocking probability and high utilization, just as we need to choose a

fairly large m (e.g., m = 100) in Section 3.2.1, here we see the need for a fairly high call-arrival

rate, N · λ0 (e.g., N · λ0 = 100). At an aggregate value N · λ0 of 100 calls/sec, we also see that χ

should be in the range from 6 MB to 29 MB. This means that the mean holding time is in the range

of 0.5 s to 2.3 s since the per-circuit rate is 100 Mb/s when m = 100. These mean call-holding times

are significantly smaller than the numbers we consider in Section 3.2.1, where even a mean call-holding

time of 3 minutes results in a need for a large number of ports. We see from Fig. 3.5 that

lowering N ·λ0 can lower utilization significantly. To engineer an N ·λ0 rate of 100 calls/sec, if λ0

is 1 call every 10 s, it means that we require N to be 1000. This is not a small number and requires a

cascade of switches to build up this load. For example, if the bottleneck link is an enterprise access

link, it requires multiple aggregations from switches internal to the enterprise, whose links can be

run at lower utilization levels, so that the aggregate traffic load for the enterprise access link is high

enough to achieve a high utilization at an acceptable Pb.

Next, we note that the very low mean call-holding times require high-speed signaling engines

to reduce call-setup delays so that they approach round-trip propagation delays and circuit

utilization stays high. Our work on hardware-accelerated signaling [58] shows the feasibility of implementing

an RSVP-TE subset in hardware, which reduces per-switch call processing delays from

the 100 ms range we measured on Sycamore switches to the order of microseconds.

Finally, we note that, although a link capacity of 10 Gb/s is appropriate for SONET/SDH and

MPLS shared links, it is low for a WDM link. If we assume that the shared link supports 100 wave-

lengths, using a typical data rate of 10 Gb/s, link capacity is 1 Tb/s and the per-circuit bandwidth

is 10 Gb/s. Media-immersive applications (a category in which the mean call-holding time is
independent of m) could consume such high levels of end-to-end capacity, but for the file-transfer
application, file sizes would have to increase significantly before WDM networks with
GMPLS control-plane protocols become usable.

3.3 Conclusions

In this chapter, we analyzed the call-blocking mode of operation to determine the types of appli-

cations suitable for GMPLS networks by dividing them into two categories: those for which the

per-circuit capacity is independent of the holding time, and those for which these two variables

are directly related, such as file transfers. We concluded the following for the first category. First,

applications that require high-bandwidth circuits relative to the link capacity (e.g., where the ratio

is one-tenth, say 1 Gb/s circuits on a 10 Gb/s link) are not suitable. Second, applications that
require low-bandwidth circuits but have short holding times (on the order of a few minutes) require a
high degree of aggregation, which incurs the expense of a large number of line cards. Ideal applications
require per-circuit rates on the order of one-hundredth the link capacity and have long holding
times. In the second category of applications, we found that the first conclusion for the first category
still holds; however, the second does not, because the number of line cards remains almost constant
when m is high. In this category of applications, we also found that calls need to have very
short call-holding times (on the order of seconds).

Chapter 4

WEB TRANSFER APPLICATION ON CHEETAH

In this chapter, we describe our implementation of a software package, called WebFT, as an applica-

tion for CHEETAH [16]. WebFT accomplishes web transfers across CHEETAH without changing

existing web client and web server software by integrating the CHEETAH end-host software mod-

ules into Common Gateway Interface (CGI) and other external modules.

The main reasons why we chose web transfers as a showcase for CHEETAH are three-fold.

First, web-based applications have become ubiquitous [19] and there is significant interest in im-

proving web performance. Although solutions such as web caching focus on the problems of over-

loaded web servers [9, 17], we focus on improving network performance. Second, according to

the analysis in Chapter 3, the CHEETAH network can be operated at a low call-blocking probability

and a high utilization if circuits are on the order of one-hundredth the shared link capacity, for

example, 100 Mb/s on a 10 Gb/s link, and a circuit of 100 Mb/s is suitable for either many small

web file transfers or a single bulk web transfer. Third, many new types of web-based applications,

such as large-file downloads, high-quality video streaming, and remote visualization, require
high-throughput, low-jitter, and deterministic data transfers. These applications need QoS-guaranteed

network connectivity. The connectionless sharing mode of the current Internet is inadequate to

provide such connectivity. We contend that the lack of rate-guaranteed network connectivity is hin-

dering these web-based applications from being developed and deployed. An answer to this need

lies in some of the newer networking technologies—for example, CO networking technologies,

currently under development and deployment. CO networks, such as CHEETAH and DRAGON,

allow for the reservation of bandwidth in the form of a dedicated circuit or VC through the networks

prior to data transfer.

This chapter determines how we can leverage these new CO technologies to improve the per-

formance of web applications. We first describe the WebFT software design and implementation.

Then, we show our experimental results and reason that WebFT can achieve low-variance, end-to-

end transfer delays at different circuit rates and low transfer delays when high-speed circuits are

possible.

4.1 WebFT Design

A primary goal of the WebFT software design is to provide deterministic data-transfer services to

clients connected to a web server via the CHEETAH network. WebFT leverages the coexistence

of two paths between a web client and a web server—that is, through the Internet and through

the CHEETAH network. It allows clients that have network connectivity to the circuit-switched

CHEETAH network to connect to the WebFT server and download web content (e.g., large files or

streamed video) through dedicated end-to-end circuits, while simultaneously providing normal web

access to other non–CHEETAH clients through the Internet. The dedicated nature of the circuits

allows for user data to be streamed unhindered from a web server to a web client via the CHEETAH

network. This results in low-variance transfer delays.

Another goal of the WebFT software design is not to impose any special requirements with
regard to the operating system or the web server or client software packages executed on the client

and server hosts. We leverage the CGI technology to achieve this goal [32].

4.1.1 WebFT Architecture

The WebFT architecture is shown in Fig. 4.1. On the web server side, WebFT includes two CGI

scripts, download.cgi and redirection.cgi, and a process called WebFT sender. Download.cgi is em-

bedded into web pages as a hyperlink, with the name of the file to be served as a parameter. When

the user clicks the download.cgi hyperlink on the web page through any typical web client, the web
server receives an HTTP message causing download.cgi to be initiated. Download.cgi, in turn, initiates
the WebFT sender process, which communicates with the WebFT receiver process on the client
host to transfer the data from the server side to the client side. By leveraging the CGI technology,
we avoid requiring any software upgrades to either web servers or web browsers.

Figure 4.1: WebFT architecture. [Diagram labels: on the web client, a web browser (e.g., Mozilla), the
WebFT receiver with the RSVP-TE and C-TCP APIs, and an RSVP-TE daemon; on the web server, a web
server (e.g., Apache), the CGI scripts (download.cgi and redirection.cgi), the WebFT sender with the
OCS, RD, RSVP-TE, and C-TCP APIs, and the OCS, RD, and RSVP-TE daemons. The URL and response
pass between browser and web server; control messages travel via the Internet and data transfers
via a circuit.]

Integrated into the WebFT sender and receiver are libraries provided with the CHEETAH end-

host software module described in Section 2.2. Through interaction with the CHEETAH end-host

software modules, the WebFT sender determines whether to use the Internet path or attempt to set

up a CHEETAH circuit, and if deemed appropriate, initiates the setup of a circuit. It then transfers

the user data, and initiates the release of the circuit. If, for some reason, the user data cannot be

transferred via the CHEETAH network (e.g., the client host is not connected to CHEETAH, the file

size is too small, which makes it inefficient to use a circuit, or bandwidth is not available on the

CHEETAH network), the WebFT sender process exits and redirection.cgi is invoked to transfer the

file via the Internet.

4.1.2 CGI Scripts

CGI defines an approach for a web server to interact with external programs, which are often
referred to as CGI programs or CGI scripts. Fig. 4.2, adapted from Writing CGI Applications with Perl
by Meltzer and Michalski [32], shows the flow of events while running CGI scripts.


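Figure 4.2: The flow of events from running CGI scripts. [Diagram labels: WWW client, HTTP web
server, CGI, and gateway programs that run the CGI scripts. Step 1 is the HTTP request from client
to server, steps 2 to 5 pass between the server, CGI, and the gateway programs, and step 6 is the
HTTP response from server to client.]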

The WebFT package contains two CGI scripts developed in Perl5 on the server side:
download.cgi and redirection.cgi. On receiving a request from a client, the web server invokes the
download.cgi script with one input parameter, the requested file name. Download.cgi obtains the
client’s primary IP address by querying the REMOTE_ADDR environment variable. It then calls

the WebFT sender process and passes the client’s primary IP address and the requested file name to

the WebFT sender process. If the WebFT sender returns indicating a failure to transfer the file over

the CHEETAH network, download.cgi calls redirection.cgi to initiate a normal download of the file

via the Internet.
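The control flow of download.cgi described above can be sketched as follows. The original scripts are written in Perl5; this sketch uses Python only for illustration. The sender path, the convention that exit status 0 means success, the plain-text success response, and the use of an HTTP redirect to reach redirection.cgi are all assumptions of ours, not details taken from the WebFT implementation.

```python
import os
import subprocess
import sys
from urllib.parse import parse_qs, quote

SENDER_PATH = "/usr/local/bin/webft_sender"  # hypothetical location of the sender binary


def run_sender(client_ip: str, file_name: str) -> bool:
    """Run the sender; exit status 0 is assumed to mean the circuit transfer succeeded."""
    result = subprocess.run([SENDER_PATH, client_ip, file_name])
    return result.returncode == 0


def handle_request(environ: dict, sender=run_sender) -> str:
    """Return the CGI response text for one download request."""
    client_ip = environ.get("REMOTE_ADDR", "")
    params = parse_qs(environ.get("QUERY_STRING", ""))
    file_name = params.get("file", [""])[0]
    if not client_ip or not file_name:
        return "Status: 400 Bad Request\r\nContent-Type: text/plain\r\n\r\nmissing file parameter\n"
    if sender(client_ip, file_name):
        return "Content-Type: text/plain\r\n\r\nTransferred over the CHEETAH circuit.\n"
    # Sender reported failure: fall back to a normal download over the Internet path.
    target = "/cgi-bin/redirection.cgi?file=" + quote(file_name)
    return "Status: 302 Found\r\nLocation: " + target + "\r\n\r\n"


if __name__ == "__main__":
    sys.stdout.write(handle_request(dict(os.environ)))
```

Passing the sender in as a parameter keeps the decision logic testable without a CHEETAH host.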

4.1.3 The WebFT Sender

The WebFT sender is integrated with APIs for the four basic CHEETAH end-host software mod-

ules. Thus, it interacts with the CHEETAH software daemons, including the OCS daemon, the RD

daemon, and the RSVP–TE daemon, as shown in Fig. 4.1. The flowchart for the WebFT sender is

shown in Fig. 4.3. Once the sender is initiated by the download.cgi script, it calls the OCS client

module to determine whether the client host is reachable via the CHEETAH network. If the answer

is yes, the OCS client module returns the IP address and the MAC address of the client’s secondary

NIC (the one connected to the CHEETAH network).

The WebFT sender then establishes a TCP connection through the host primary NIC via the

Internet to the WebFT receiver, which is running as a daemon on a well-known port in the client

host. Once the TCP connection is successfully established, the receiver sends back a desired
CHEETAH circuit rate (based on its receiving capability) and a C-TCP listening port number for the data
transfer on the CHEETAH circuit.

Figure 4.3: The flow chart for the WebFT sender. [Flow chart: can the client be reached via the
CHEETAH network (OCS)? If yes, request a CHEETAH circuit (RD); if yes, set up a circuit (RSVP-TE
client); if setup succeeds, send the file via C-TCP, release the circuit (RSVP-TE client), and
return Success. A No or Fail outcome at any decision returns Failure.]

Then, the WebFT sender process calls the RD module (passing the client host’s primary IP

address, secondary IP address, client’s desired circuit rate, and file size as arguments) to deter-

mine whether to attempt a CHEETAH circuit setup. The RD module chooses between the two

options based on the loading conditions of the two networks (the Internet and the CHEETAH

circuit-switched network), the round-trip delay time (RTT), and the file size. If it returns a de-

cision to attempt a CHEETAH circuit setup, the WebFT sender process calls the RSVP–TE client

module (passing the client’s primary and secondary IP addresses and the circuit rate), asking it to

initiate circuit setup.


If the circuit setup is successful, the WebFT sender process calls the C-TCP send() subroutine,

passing the following arguments: the circuit rate, the client’s secondary IP address, the C-TCP

port number on which the client is ready to accept an incoming C-TCP connection on the circuit,

and the file name. The C-TCP send() subroutine opens a socket and connects to the client through

the secondary NIC and the CHEETAH circuit. The file is transferred on the dedicated CHEETAH

circuit at a rate equal to the circuit rate.

Once the data transfer is completed, the WebFT sender process invokes the RSVP–TE client

APIs to initiate release of the CHEETAH circuit. Finally, it returns a Success indication to the

download.cgi script.

If, during the above-mentioned procedure, the OCS client module determines that the client host

does not have CHEETAH connectivity, or the RD module decides that it is better to use the Internet

path, or the circuit setup initiated by the RSVP–TE client module fails, the WebFT sender process

immediately returns a Failure indication to the download.cgi script. The download.cgi process then

calls redirection.cgi to download the file via the Internet as mentioned in Section 4.1.2.
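The sender procedure of Fig. 4.3 can be summarized in code. This is a minimal sketch under stated assumptions: the CHEETAH modules are represented by caller-supplied functions whose names and signatures are hypothetical, and releasing the circuit in a finally clause is our design choice rather than documented WebFT behavior.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class SenderHooks:
    """Stand-ins for the CHEETAH end-host modules; all signatures are hypothetical."""
    ocs_lookup: Callable[[str], Optional[str]]          # primary IP -> secondary IP, or None
    query_receiver: Callable[[str], Tuple[float, int]]  # primary IP -> (rate in Mb/s, C-TCP port)
    rd_decide: Callable[[str, str, float, int], bool]   # True means "attempt a circuit"
    setup_circuit: Callable[[str, str, float], bool]    # RSVP-TE client: True on success
    ctcp_send: Callable[[float, str, int, str], None]   # rate, secondary IP, port, file name
    release_circuit: Callable[[str, str], None]


def webft_send(client_ip: str, file_name: str, file_size: int, hooks: SenderHooks) -> bool:
    """Return True (Success) if the file went over a circuit, else False (Failure)."""
    secondary_ip = hooks.ocs_lookup(client_ip)
    if secondary_ip is None:
        return False  # the client is not reachable via CHEETAH
    rate, ctcp_port = hooks.query_receiver(client_ip)
    if not hooks.rd_decide(client_ip, secondary_ip, rate, file_size):
        return False  # RD prefers the Internet path
    if not hooks.setup_circuit(client_ip, secondary_ip, rate):
        return False  # circuit setup failed
    try:
        hooks.ctcp_send(rate, secondary_ip, ctcp_port, file_name)
    finally:
        # Release the circuit even if the transfer raises an error.
        hooks.release_circuit(client_ip, secondary_ip)
    return True
```

Every False return corresponds to one of the three Failure branches in the flow chart, after which download.cgi falls back to redirection.cgi.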

4.1.4 The WebFT Receiver

To avoid manual intervention, the WebFT receiver is designed to run as a daemon on a well-known

port in the background on the client host and to process incoming connection requests from the

WebFT sender automatically. The WebFT receiver is completely independent of web browser soft-

ware, and therefore does not require any modification to the latter. All clients connected to the

CHEETAH network are configured to run this daemon.

The WebFT receiver forks a child process to handle each request for a TCP connection from the

WebFT sender through the primary NIC. The forked WebFT receiver process accepts the TCP
connection from the WebFT sender and sends it a pre-computed desired circuit rate.
The circuit rate is typically computed based on the disk access

rate of the client host because with today’s technology, disk access rate is usually the bottleneck for

file transfers. The forked WebFT receiver process also sends the listening C-TCP port number for

the data transfer through the secondary NIC on the CHEETAH circuit.


The WebFT receiver includes the API libraries associated with the RSVP–TE client and C-TCP

modules of the CHEETAH end-host software. The RSVP–TE client module API library accepts

circuit setup requests from the CHEETAH network and the C-TCP module API library accepts

incoming C-TCP connection requests from the WebFT sender to transfer user data. After a data

transfer is completed, the forked child process terminates and returns to the parent WebFT receiver

process.
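The receiver's side of the control exchange can be sketched as follows. Several things here are assumptions of ours: the JSON wire format, the port numbers, the rate constants, and the use of a thread per connection in place of the forked child processes described above. The sketch only shows how a receiver might derive and report its desired circuit rate and C-TCP port.

```python
import json
import socketserver

DISK_WRITE_RATE_MBPS = 700.0      # assumed value, measured offline for this host
SECONDARY_NIC_RATE_MBPS = 1000.0  # assumed GbE secondary NIC
CONTROL_PORT = 5000               # hypothetical well-known control port
CTCP_PORT = 5001                  # hypothetical C-TCP listening port


def desired_rate_mbps(disk_rate: float, nic_rate: float) -> float:
    """The slower of disk and NIC bounds what the receiver can absorb."""
    return min(disk_rate, nic_rate)


def encode_offer(rate_mbps: float, ctcp_port: int) -> bytes:
    """Encode the reply as one JSON line (the wire format is an assumption)."""
    return (json.dumps({"rate_mbps": rate_mbps, "ctcp_port": ctcp_port}) + "\n").encode()


def decode_offer(line: bytes) -> tuple:
    """Inverse of encode_offer, as the sender would use it."""
    msg = json.loads(line.decode())
    return float(msg["rate_mbps"]), int(msg["ctcp_port"])


class OfferHandler(socketserver.StreamRequestHandler):
    """Handles one control connection from a WebFT sender."""

    def handle(self) -> None:
        rate = desired_rate_mbps(DISK_WRITE_RATE_MBPS, SECONDARY_NIC_RATE_MBPS)
        self.wfile.write(encode_offer(rate, CTCP_PORT))


def serve(port: int = CONTROL_PORT) -> None:
    """Run the daemon; threads stand in for forked children in this sketch."""
    with socketserver.ThreadingTCPServer(("0.0.0.0", port), OfferHandler) as server:
        server.serve_forever()
```

Calling serve() starts the daemon; the encode and decode helpers can be exercised on their own without opening a socket.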

4.2 Experimental Testbed and Results

The Linux implementation of WebFT described in the previous section has been tested on the

CHEETAH experimental testbed. This section presents and discusses these results.

The portion of the CHEETAH testbed relevant to our experiments is shown in Fig. 4.4. We chose two PCs,

zelda3 and wukong, which are located in Atlanta, GA and RTP, NC, respectively. Zelda3 is a

Dell PowerEdge 2850 with dual 2.8 GHz Xeon processors and 2 GB memory. Wukong is a Dell

PowerEdge 1850 with a 2.8 GHz Xeon processor and 1 GB memory. Both of them have an 800 MHz

front side bus and a PERC4 RAID-0 controller with two 146 GB SCSI disks. The RTT between

zelda3 and wukong is 24.7 ms for the Internet path and 8.6 ms for the CHEETAH circuit. We loaded

the Apache HTTP server 2.0 on zelda3 and ran a web client on wukong.

Figure 4.4: CHEETAH testbed for WebFT. [Diagram labels: zelda3 and wukong, each with NIC I and
NIC II; IP routers and the Internet; the CHEETAH network with Sycamore SN16000 switches at
Atlanta, GA and MCNC, NC.]

We opened the Mozilla web browser on wukong and entered the URL
http://130.207.252.133/Webapplication.htm (see footnote 2); the web page downloaded from the server
is shown in Fig. 4.5. After we clicked the hyperlink Download test.rm in Fig. 4.5, which was
linked to http://130.207.252.133/cgi-bin/download.cgi?file=test.rm, a circuit was established at a
rate of 1 Gb/s from zelda3 to wukong, as illustrated by the dashed line in Fig. 4.4. The 1.6 GB file
test.rm was downloaded from zelda3 to wukong with a delay of about 19 s (excluding the
time for circuit setup and release) at a throughput of about 680 Mb/s. The throughput was lower
than the circuit rate because of the slow disk writing rate of wukong, which was approximately
700 Mb/s. Circuit setup across the two SONET switches took approximately 170 ms and circuit
release took 9 ms.

Figure 4.5: The web page to test WebFT

Table 4.1 gives the average throughput and delay (excluding the time for circuit setup and

release) to download test.rm via WebFT for lower-rate circuits. We show the results of using lower-

rate circuits to make the point that, if the web server (e.g., zelda3 in our experiment) has a GbE

secondary NIC and it needs to simultaneously support multiple web downloads, it needs to allo-

cate smaller bandwidth levels per download. It is also worth mentioning that the delay variance

is negligible because circuits provide dedicated end-to-end bandwidth and the C-TCP transport

protocol maintains a fixed sending rate closely matched to the circuit rate. In contrast, the delay

varies significantly on the Internet because concurrent traffic has a significant effect on any single

download [57].

2 130.207.252.133 is the primary NIC IP address of zelda3.


Table 4.1: Average throughputs and delays at a variety of circuit rates

Circuit rate (Mb/s)    Average throughput (Mb/s)    Average delay (s)
700                    602.5                        21.2
600                    515.4                        25.0
500                    412.7                        31.0
400                    337.3                        37.9
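As a consistency check on Table 4.1, the reported delays can be compared with the time needed to move the file at the reported average throughput. This is a sketch of our own, not part of the original measurements; it assumes the file size is exactly 1.6 x 10^9 bytes, which the text gives only as "1.6 GB".

```python
FILE_SIZE_BITS = 1.6e9 * 8  # test.rm, taking 1.6 GB as 1.6e9 bytes (assumption)

# (circuit rate in Mb/s, average throughput in Mb/s, average delay in s) from Table 4.1
TABLE_4_1 = [
    (700, 602.5, 21.2),
    (600, 515.4, 25.0),
    (500, 412.7, 31.0),
    (400, 337.3, 37.9),
]


def expected_delay_s(throughput_mbps: float) -> float:
    """Transfer time for the whole file at a given average throughput."""
    return FILE_SIZE_BITS / (throughput_mbps * 1e6)


def efficiency(circuit_rate_mbps: float, throughput_mbps: float) -> float:
    """Fraction of the reserved circuit rate that the transfer achieved."""
    return throughput_mbps / circuit_rate_mbps


for rate, throughput, delay in TABLE_4_1:
    print(f"{rate} Mb/s: efficiency {efficiency(rate, throughput):.2f}, "
          f"expected {expected_delay_s(throughput):.1f} s, reported {delay} s")
```

Under this assumption the computed delays agree with the reported ones to within about 0.2 s, and each transfer achieves roughly 83% to 86% of its circuit rate.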

From this experiment, we conclude that, for web downloads that require deterministic charac-

teristics (e.g., streamed data or web-based gaming applications), guaranteed services provided by

CO networks are indeed useful. Further, for large web downloads, the variability introduced by

the connectionless nature of the Internet could cause significantly large delays, especially on long

propagation-delay paths. Circuits are a better option for such downloads as well.

4.3 Conclusions

In this chapter, we described a new web-based file transfer software package, called WebFT, to

leverage new CO networking technologies that are increasingly available today. Specifically, we

used a wide-area experimental CO network testbed called CHEETAH, which we deployed as part

of an NSF-sponsored project. We integrated CHEETAH end-host software APIs into the WebFT

package to provide CHEETAH-related services transparently to users. By leveraging the CGI
technology, the WebFT package is completely independent of the web server and browser software, and
therefore does not require any modifications to either. We tested WebFT on the experimental
CHEETAH testbed using the Apache HTTP server and the Mozilla web browser (note: WebFT is

also usable with other web servers and web browsers as long as CGI is supported). Our experi-

mental results showed that WebFT can provide deterministic data services to CHEETAH clients on

dedicated end-to-end circuits, because it uses a new C-TCP transport protocol that is capable of

providing reliable end-to-end data transfers at the circuit rate.

Chapter 5

PARALLEL FILE TRANSFERS ON CHEETAH

5.1 Introduction

Today, scientists carry out experiments collaboratively on a global scale. These large-scale
scientific efforts are popularly termed e-Science. E-Science projects share geographically distributed

and heterogeneous resources, such as computational systems, scientific instruments, databases, net-

works, and software. In particular, they need to share large volumes of data (terabytes or petabytes

or even larger) amongst geographically distributed applications. For example, scientists at NCSU,

who are the primary users of CHEETAH and the primary team members of the Terascale Supernova

Initiative (TSI) [54], run their simulations on a Cray X1E, located at ORNL. Each simulation cre-

ates a multi-TB dataset. These datasets are then downloaded from the Cray X1E to a local cluster,

called orbitty, for analysis. The scientists need access to the latest dataset as soon as it is created.

Currently, they use either the Logistical Runtime System (LoRS) tool [31] or bbcp [6] for these

bulk file transfers and achieve throughput in the range of 200 Mb/s to 400 Mb/s. Given that no link

has bandwidth lower than 1 Gb/s on the network path from the Cray X1E to orbitty (e.g., the back-

bone bandwidth of Internet2 is OC192), we should be able to achieve at least 1 Gb/s throughput.

In this chapter, we study the use of parallel file transfers on CHEETAH to support a broad class of

e-Science projects, including TSI.

To achieve multi-Gb/s throughput, we need to analyze why current solutions are limited to

hundreds of Mb/s. We have identified two factors for this poor performance. First, TCP’s
congestion control algorithm does not work well in networks with a high bandwidth-delay product.

On detecting congestion (through a packet loss or by receiving triple duplicate acknowledgments),

the TCP sender will drop its sending rate immediately and slowly increase its rate as packets get

through the network successfully. This process takes time to regain the full transfer speed. Second,

end hosts are themselves bottlenecks. Read–write speeds of hard disks are commonly hundreds

of Mb/s, which are lower than network bandwidth (several Gb/s). Therefore, hard disks create a

severe bottleneck. In addition, Baker and Feng [4] pointed out another possible limiting factor, the

PC I/O bus. Even without any other bottleneck, such as hard disks, a host that connects a 10 Gb/s

NIC through a 133 MHz, 64-bit Peripheral Component Interconnect Extended (PCI-X) bus can only

achieve a peak bandwidth of 133 MHz × 64 b = 8.512 Gb/s.
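Both limiting factors lend themselves to a quick calculation. The sketch below is ours and rests on a textbook simplification: after a loss, the congestion window is halved and then grows by one segment per round trip. The 1 Gb/s rate, 24.7 ms RTT, and 1460-byte segment size are illustrative inputs, not measurements of any TCP implementation.

```python
def bus_peak_bps(clock_hz: float, width_bits: int) -> float:
    """Peak bandwidth of a parallel bus that moves width_bits per clock cycle."""
    return clock_hz * width_bits


def window_segments(rate_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """Congestion window, in segments, needed to fill the path (bandwidth-delay product)."""
    return rate_bps * rtt_s / (mss_bytes * 8)


def aimd_recovery_s(rate_bps: float, rtt_s: float, mss_bytes: int) -> float:
    """Time to regrow the window from W/2 to W at one segment per RTT."""
    return window_segments(rate_bps, rtt_s, mss_bytes) / 2 * rtt_s


if __name__ == "__main__":
    print(bus_peak_bps(133e6, 64) / 1e9)       # PCI-X peak bandwidth in Gb/s
    print(aimd_recovery_s(1e9, 0.0247, 1460))  # illustrative path: 1 Gb/s, 24.7 ms RTT
```

With these inputs a single loss costs on the order of 26 s before the sender is back at full rate, which illustrates why loss-free reserved circuits help on high bandwidth-delay paths.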

To overcome the effects of these two factors, several solutions have been proposed. Most file-

transfer programs, such as GridFTP and bbcp, allow a user to employ multiple TCP streams to

mitigate the first factor. We propose the use of CO networks, such as CHEETAH, to overcome this

first limitation. Specifically, we reserve bandwidth (e.g., multiple Gb/s) from end host to end host

and thus avoid packet loss.

To reduce the second limitation, one possible solution is to equip each end host with high-

speed hardware, including high-speed CPUs, I/O buses, hard disks, and NICs. In this solution,

we concentrate on making each end host faster. Thus, we refer to this approach as a “single-host

solution.” Alternatively, we can relieve the end-host bottleneck by leveraging parallelism amongst

multiple end hosts, which we term a “cluster solution.” There are two variations of the cluster

solution based on whether the source file is located on a single-host file system, or distributed in

blocks across a multi-host file system, such as PVFS:

1. Non-split source file: The file is not split and is located on a file system in a single host.

2. Split source file: The file is split into multiple parts and these parts are distributed across

disks of multiple hosts.

The case of non-split source file is more general than the case of split source file. Thus, we term

the former “general case,” and the latter “special case.” For the general case, we need to carry out


the following steps:

1. Splitting: partition a large file located at a single host (on one or more disks) into multiple

parts, and load each part onto a separate host. We refer to the number of parts as the “splitting

degree.”

2. Transferring: transfer the parts to receiving hosts in parallel

3. Assembling: assemble the parts into a large file
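The splitting and assembling steps listed above can be sketched in a few lines. This is an in-memory illustration with function names of our own choosing; a real tool would work on files and distribute the parts to separate hosts.

```python
from typing import List, Tuple


def plan_parts(file_size: int, splitting_degree: int) -> List[Tuple[int, int]]:
    """Return (offset, length) for each part; the first parts absorb any remainder."""
    base, extra = divmod(file_size, splitting_degree)
    parts, offset = [], 0
    for i in range(splitting_degree):
        length = base + (1 if i < extra else 0)
        parts.append((offset, length))
        offset += length
    return parts


def split(data: bytes, splitting_degree: int) -> List[Tuple[int, bytes]]:
    """Splitting: cut the data into contiguous parts tagged with their offsets."""
    return [(off, data[off:off + n]) for off, n in plan_parts(len(data), splitting_degree)]


def assemble(parts: List[Tuple[int, bytes]], file_size: int) -> bytes:
    """Assembling: place each part at its offset, in whatever order the parts arrive."""
    out = bytearray(file_size)
    for off, chunk in parts:
        out[off:off + len(chunk)] = chunk
    return bytes(out)
```

Tagging each part with its offset means the transferring step may deliver parts in any order without affecting the assembled result.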

For the special case, where the file is already partitioned into blocks and distributed across multiple

hosts, we do not need the steps of splitting and assembling. All that is required is a file-transfer

tool, such as GridFTP, which supports striped file transfers for files that are striped across disks on

different hosts in a parallel file system. Fig. 5.1 illustrates the framework of the single-host and the

general-case cluster solutions.

Figure 5.1: The single-host solution vs. the general-case cluster solution. [Diagram: (a) the
single-host solution, a file transfer from source to sink; (b) the general-case cluster solution, in
which the original source is split across hosts 1 to n, the parts are transferred to hosts 1' to n',
and the parts are assembled at the original sink.]

In this chapter, we describe our design and implementation of these single-host and cluster

solutions. First, we briefly review the software tools of GridFTP and PVFS2 because we use these

tools in our general-case cluster solution. Next, we discuss the usage of the single-host and the


general-case cluster solutions. Finally, we describe a specific-case solution for moving datasets in

the TSI project.

5.2 Background

In this section, we briefly review File Transfer Protocol (FTP) and then describe how GridFTP

extends FTP to include the new features of multi-streaming, partial file transfer, and striping. We

also provide a brief overview of PVFS.

5.2.1 FTP and GridFTP

GridFTP is a data-transfer protocol proposed for fast data transfers on the Grid [1, 2]. It extends

FTP [36] by adding features for partial file transfer, multi-streaming, striping, and Globus-based

security. It has been implemented by the Globus Alliance as a component of the Globus Toolkit

(GT) [18, 20].

In the cluster solution, we mainly use the GridFTP functionalities of third-party control, partial

data transfer, multi-streaming, and especially striped data transfer. Before we describe GridFTP’s
extensions to FTP, we give an overview of FTP, focusing on its third-party control feature (see footnote 1).

There are two kinds of TCP connections in FTP: control connections and data connections. All

FTP commands are transferred over the control connection, while user data are transferred over the

data connection. The default port number of the control connection on the FTP server is 21 and that

of the data connection is 20.

Third-party control provided in FTP allows a user to transfer files between two other hosts. To

implement this feature, FTP provides two commands, PASV and PORT. PASV has no argument

and is an abbreviation for passive. Just as the term “passive” implies, PASV requests an FTP server

to wait for a data connection rather than to initiate one on receiving a data transfer command.

PORT takes a host–port pair as its argument, which specifies the data port to be used in a data
connection.

1 Although RFC 959 [36] specifies this feature, it does not refer to the feature as “third-party control.” Instead, the
GridFTP specification [1] introduces the term “third-party control.”


Figure 5.2: The model and flow chart of third-party control. [Diagram: FTP client C and FTP servers
A and B. 1. control connection; 2. PASV; 3. host-port pair; 4. PORT <host-port pair>; 5. response to
PORT; 6. B initiates a data connection to A.]

Fig. 5.2 shows the model and flow chart of third-party control. First, an FTP client on a third

party, denoted as C, establishes control connections to two FTP servers, denoted as A and B. C

forwards all FTP commands, such as user and password, between A and B via the control connec-

tions. Then, C sends a PASV command to A. On receiving PASV, A listens on a data port, which it

selects to be a number distinct from the well-known port number, 20, returns to C a host–port pair

(host provides A’s IP and port is the one on which A listens for a connection), and waits for a data

connection. Then, C sends a PORT command to B with the host–port pair as the argument. After B

receives the PORT command, it initiates a data connection to A at the port on which A waits for a

connection.
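The relay that C performs between A and B amounts to parsing one reply and building one command. The sketch below assumes the common form of the PASV reply, in which the host and port appear as six comma-separated numbers in parentheses; the function names are ours.

```python
import re
from typing import Tuple


def parse_pasv_reply(reply: str) -> Tuple[str, int]:
    """Extract (host, port) from a reply such as
    '227 Entering Passive Mode (192,168,0,5,19,136)'."""
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if not reply.startswith("227") or match is None:
        raise ValueError("not a PASV reply: " + reply)
    h1, h2, h3, h4, p1, p2 = (int(g) for g in match.groups())
    # The port is sent as two bytes, high byte first.
    return f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2


def build_port_command(host: str, port: int) -> str:
    """Build the PORT command that tells the other server where to connect."""
    return "PORT {},{},{}\r\n".format(host.replace(".", ","), port // 256, port % 256)
```

In the third-party flow, C would call parse_pasv_reply on A's answer and send the output of build_port_command to B over the second control connection.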

FTP has three transfer modes:

1. Stream mode: transmit data as a stream of bytes

2. Block mode: transmit data as a series of data blocks. Each block is identified by a 3-byte

header, which contains two fields: 1-byte descriptor and 2-byte length. The descriptor field

indicates whether the block is a special block, for example, the last block that ends a file. The

length field specifies the length of the block.

3. Compressed mode: transmit compressed data

All these modes transfer data in sequence and do not support partial file transfer.

GridFTP extends the block mode by adding an offset field in the block header to support out-of-

sequence data delivery. With this extended block mode, GridFTP can do partial file transfer, which

transfers portions of files rather than complete files. This extended block mode is also fundamental


to the GridFTP features of multi-streaming and striping. These two features leverage parallelism to

speed up file transfers. Specifically, the feature of multi-streaming supports multiple TCP streams in

parallel between each pair of sending and receiving hosts. In contrast, the feature of GridFTP striped

transfer stripes data across multiple sending hosts and transfers these stripes in parallel to multiple

receiving hosts. Thus, GridFTP striped transfer leverages multiple-host parallelism and relieves the

bottleneck caused by end-host limitations. We describe below how GridFTP implements striped

transfer in detail.
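To make the role of the offset field concrete, the following sketch packs the 3-byte FTP block header described above and an extended header that adds an offset. The descriptor codes are those of RFC 959. The extended layout used here (1-byte descriptor, 64-bit length, 64-bit offset) is our assumption for illustration and should be checked against the GridFTP specification [1] before being relied on.

```python
import struct

# RFC 959 block-mode descriptor codes
DESC_EOR, DESC_EOF, DESC_ERRORS, DESC_RESTART = 128, 64, 32, 16

FTP_HEADER = struct.Struct("!BH")   # 1-byte descriptor, 2-byte length
EXT_HEADER = struct.Struct("!BQQ")  # assumed layout: descriptor, 64-bit length, 64-bit offset


def pack_ftp_block(descriptor: int, payload: bytes) -> bytes:
    """Plain block mode: the payload must be shorter than 65536 bytes."""
    return FTP_HEADER.pack(descriptor, len(payload)) + payload


def pack_ext_block(descriptor: int, offset: int, payload: bytes) -> bytes:
    """Extended block mode: the header also records where the payload belongs."""
    return EXT_HEADER.pack(descriptor, len(payload), offset) + payload


def unpack_ext_block(block: bytes):
    """Return (descriptor, offset, payload) for one extended block."""
    descriptor, length, offset = EXT_HEADER.unpack_from(block)
    start = EXT_HEADER.size
    return descriptor, offset, block[start:start + length]


def reassemble(blocks, file_size: int) -> bytes:
    """Write each block at its offset, so arrival order does not matter."""
    out = bytearray(file_size)
    for block in blocks:
        _, offset, payload = unpack_ext_block(block)
        out[offset:offset + len(payload)] = payload
    return bytes(out)
```

Because every block carries its own offset, blocks from different TCP streams or different data nodes can be written as they arrive, which is what makes multi-streaming and striping possible.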

Figure 5.3: The model and flow chart of GridFTP striped transfer. [Diagram: a third party C running
globus-url-copy; a receiving front end A and a sending front end B, each a GridFTP server that reaches
its data nodes (1 to n on the receiving side, 1' to n' on the sending side) through internal IPC. On
each side a parallel file system stripes blocks across the data nodes: blocks 1, n+1, ... on the first
node, blocks 2, n+2, ... on the second, and blocks n, 2n, ... on the n-th. 1. control connection;
2. SPAS; 3. a list of host-port pairs; 4. SPOR <host-port pairs>; 5. response to SPOR; 6. initiate
data connections from sending data nodes to receiving ones.]


Fig. 5.3 shows the model of GridFTP striped transfer (see footnote 2). Multiple pairs of end hosts, termed
“data nodes” and typically located in two clusters, participate in a single data transfer that is
controlled by two GridFTP servers, termed “front ends,” and a third party, which runs
globus-url-copy (a GridFTP client tool provided by GT). Each front end acts as the single GridFTP control

server on each cluster to coordinate file transfers between data nodes. Each data node moves the

parts of the file assigned to it to its peer.

To support GridFTP striped transfer, GridFTP defines two commands, SPAS and SPOR, which

extend PASV and PORT, respectively. If a front end receives a SPAS command, it requests all its

data nodes to wait for data connections and returns a list of host–port pairs for these data nodes. In

contrast, if a front end receives a SPOR command with a list of host–port pairs, it notifies its data

nodes to initiate data connections to the hosts specified in the SPOR command’s argument list.
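The SPAS and SPOR exchange can be modeled without any networking. The following toy model is ours: the class, host names, and port numbers are hypothetical, and pairing the i-th sending node with the i-th receiving node is a simplifying assumption that matches the equal-node-count case used in our explanation. It does not reproduce the wire syntax of either command.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

HostPort = Tuple[str, int]


@dataclass
class FrontEnd:
    """Toy front end; data_nodes lists the hosts it controls (names are made up)."""
    data_nodes: List[str]
    base_port: int = 50000
    connections: List[Tuple[str, HostPort]] = field(default_factory=list)

    def spas(self) -> List[HostPort]:
        """Receiving side: every data node listens; return one host-port pair per node."""
        return [(host, self.base_port + i) for i, host in enumerate(self.data_nodes)]

    def spor(self, targets: List[HostPort]) -> None:
        """Sending side: each data node connects to one of the listed targets."""
        for i, host in enumerate(self.data_nodes):
            self.connections.append((host, targets[i % len(targets)]))


def third_party_striped_transfer(receiver: FrontEnd, sender: FrontEnd) -> List[Tuple[str, HostPort]]:
    """What the third party does over its two control connections."""
    pairs = receiver.spas()    # SPAS and the returned list of host-port pairs
    sender.spor(pairs)         # SPOR with that list as its argument
    return sender.connections  # data connections from sending to receiving nodes
```

The third party only relays the list of host-port pairs; the data connections themselves run directly between data nodes.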

Comparing Fig. 5.2 with Fig. 5.3, we see that the flow chart for GridFTP striped transfer is

similar to that for third-party control provided in FTP. The additional features in GridFTP striped

transfer are as follows. First, it involves many data nodes. Second, it uses SPAS and SPOR
instead of PASV and PORT. Third, it is required to be unidirectional, which means that SPAS is paired
with a receiving front end and SPOR with a sending one. In contrast, FTP does not have any

such restriction. Fourth, a front end communicates with its data nodes through an internal Inter-

process Communication (IPC) protocol, which is unspecified in the GridFTP specification. Finally,

although there are multiple data connections between sending and receiving data nodes, there are

only two control connections between two front ends and a third party.

In addition, as shown in Fig. 5.3, GridFTP striped transfer requires that end hosts on each cluster

have access to the file, which means that the file needs to be managed by a parallel file system.

Furthermore, the underlying parallel file system must deliver a high read–write throughput to avoid

becoming a bottleneck itself. Currently, General Parallel File System (GPFS) [21] and PVFS2 are

two popular parallel file systems. We use PVFS2 in our experiments because PVFS2 is open-source

software allowing us to make any required modifications whereas GPFS is a commercial product.

2 Unless otherwise mentioned, the number of sending hosts is equal to that of receiving hosts. Although the two
numbers are not required to be equal, we make them equal to simplify our explanation.


5.2.2 PVFS2

Clemson University and Argonne National Laboratory jointly developed PVFS (or PVFS1) [12, 37],

which has been released and supported under a GNU General Public License since 1998. The PVFS

team aimed to design and implement a parallel I/O system that handles the performance disparity

between I/O devices and processors, and addresses the scalability problem of Network File System

(NFS).

NFS is a distributed file system developed by Sun Microsystems, Inc. It is a client–server

application and allows a user to conveniently access files on a remote computer [48]. An NFS

server stores all files in a central location, which causes a scalability problem when the number of

clients exceeds the performance capacity of the machine exporting the file system. We can equip an

NFS server with more memory, a faster CPU, and higher-speed NICs, but being a central node, it

can still run out of resources. As the number of client nodes increases, each client receives a smaller

portion of the overall bandwidth for file I/O. Another problem is availability. If an NFS server goes

down, all its client nodes have to wait until the server recovers.

Unlike NFS, which is a central data storage system, PVFS uses storage on multiple computers

to create a large high-performance parallel file system. PVFS physically distributes a single file

across multiple disks in multiple nodes. For example, it stripes a file over the local disks in multiple

I/O servers using a simple round-robin style as in RAID0. Fig. 5.4 shows the system architecture

for PVFS1.3 It is still a client–server file system. Each host may play one or more of the following

three roles:

1. compute nodes (CN or clients), where applications run

2. I/O nodes (ION or I/O servers), where files are stored

3. metadata server or management node (MGR), where metadata operations are handled

PVFS1 can have one and only one management node.

3 This figure is adapted from the PVFS1 user guide [37].


Figure 5.4: PVFS system architecture

A second version of PVFS, PVFS2, has several new features [38, 39]. For example, it allows

for several management nodes, which eliminates the possible bottleneck caused by a single man-

agement node in PVFS1. But it uses the same principles as PVFS1 to create a parallel file system.
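The round-robin placement described in this subsection can be sketched as follows. This is a minimal illustration rather than PVFS code: the function name `locate` and its parameters are hypothetical, and the sketch assumes a fixed stripe size and a known starting server.

```python
def locate(offset, stripe_size, num_servers, start=0):
    """Map a byte offset of a striped file to (server index, local offset).

    Assumes simple RAID0-style round-robin striping with a fixed stripe
    size; `start` is the index of the server that holds the first stripe.
    """
    stripe = offset // stripe_size            # global stripe number
    server = (start + stripe) % num_servers   # stripes rotate over the servers
    local_stripe = stripe // num_servers      # stripes already on that server
    return server, local_stripe * stripe_size + offset % stripe_size


if __name__ == "__main__":
    KB = 1024
    for off in (0, 64 * KB, 320 * KB, 321 * KB):
        print(off, locate(off, 64 * KB, 5))
```

With five servers and 64 KB stripes, every sixth stripe returns to the same server, which is why neighboring blocks on one server lie 320 KB apart in the file.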

5.3 The Single-Host Solution

The single-host solution leverages high-speed hardware to avoid the end-host bottleneck. Specif-

ically, we concentrate on the bottleneck created by hard-disk I/O. The other PC hardware compo-

nents, such as NICs, PCI-X buses, memory buses, and CPUs, are also possible bottlenecks, but as

Hurwitz and Feng [23] pointed out, these components are not the primary bottlenecks and they are

kept updated by new technologies. For example, a new PCI Express ×16 implementation will achieve

a peak bandwidth of 64 Gb/s [10] and thus will remove the possible bottleneck caused by the I/O

bus. To relieve the disk bottleneck, we can equip sending and receiving hosts with redundant arrays

of inexpensive disks (RAIDs). However, what is the peak write speed for a RAID?4 Is the hardware solution feasible, scalable, and cost-effective? In this section, we address these questions after providing a brief overview of RAID.

4 In this section, we only use write speed for our comparison because write speed is lower than read speed.

Patterson, Gibson and Katz [35] formally defined RAID levels one through five and showed

that RAID outperformed single large expensive disks by an order of magnitude in speed, reliability,

scalability, and other metrics. Currently, the most commonly used RAID levels are RAID0 and

RAID5. A RAID0 stripes data evenly across all member disks without any parity or redundancy. A

RAID5 stripes data, including parity information, across all member disks.

Assume that the number of disks is M and that each disk has an equal write speed of x. If I/O

operations are ideally split into equal-sized blocks and these blocks are distributed evenly across

the M disks, then these I/O operations can be carried out concurrently on all member disks. Since

all M disks for RAID0 contain data, the maximum write speed for RAID0 is M · x. In contrast, for

RAID5, one disk contains parity information for the I/O operations, and thus, the maximum speed

is (M− 1) · x. In practice, as the number of hard disks connected to a RAID controller increases,

the write speed may not increase proportionally because the RAID controller itself becomes the

bottleneck. Currently, over 1 Gb/s read–write speeds are achievable for RAIDs. Barclay, Chong,

and Gray [5] reported that an 8-disk 3ware Escalade 8508 controller saturated at 1.8 Gb/s read

and 1.6 Gb/s write. An 8-disk Areca ARC-1120 controller, configured as RAID5, was reported to

saturate at 6.0 Gb/s read and 3.6 Gb/s write [53]. Therefore, the hardware solution is feasible.
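The ideal write-speed bounds above can be written down directly. The sketch below is illustrative only: the function name is hypothetical, the per-disk speed of 0.5 Gb/s is an assumed value, and controller saturation (which the measurements above show to matter in practice) is ignored.

```python
def max_write_speed(level, num_disks, disk_speed):
    """Ideal aggregate write speed of a RAID, ignoring controller limits.

    RAID0: all M disks carry data, so the bound is M * x.
    RAID5: one disk's worth of capacity carries parity, so it is (M - 1) * x.
    """
    if level == 0:
        return num_disks * disk_speed
    if level == 5:
        return (num_disks - 1) * disk_speed
    raise ValueError("only RAID0 and RAID5 are modeled")


if __name__ == "__main__":
    # Hypothetical per-disk write speed of 0.5 Gb/s, eight disks.
    print(max_write_speed(0, 8, 0.5))  # 4.0
    print(max_write_speed(5, 8, 0.5))  # 3.5
```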

In light of the designs of RAID0 and RAID5, the theoretical disk utilization is 100% for RAID0 and (M − 1)/M for RAID5. Assume that each hard disk is a 146 GB SCSI disk.

To accommodate 2 TB data, we need at least (2 TB)/(146 GB) = 15 hard disks for RAID0 and

even more for RAID5. To manage an array of more than 15 hard disks, we need a high-end RAID

host adapter with an I/O processor and memory to off-load the intensive RAID5 XOR parity com-

putation. Given the trends in communication bandwidth growth from 1 Gb/s to tens of Gb/s, I/O

performance is likely to lag behind network performance for the near-term future. Hence, we con-

clude that although the single-host solution is feasible for fast file transfers, it is neither scalable nor

cost-effective.


5.4 The General-Case Cluster Solution

In this section, we describe the cluster solution for the general case of non-split source files at the

sending end. First, we address the problem of determining an appropriate value for the splitting

degree. Second, we discuss possible approaches to implement the general-case cluster solution and

explain why we use GridFTP and PVFS2 to implement it. We also present our specific require-

ments for GridFTP and PVFS2 to minimize network-and-disk contention. Then, we describe our

modifications to GridFTP and PVFS2 to meet these requirements. Finally, we provide experimental

results after we modified GridFTP and PVFS2.

5.4.1 The Splitting Degree

As mentioned in Section 5.1, the general-case cluster solution needs to first partition the source file.

One important question is to determine an appropriate value for the splitting degree.

First, we should select the splitting degree such that the cluster solution transfers a source

file faster than an approach without splitting. Let the size of the source file be x, the splitting

degree be d (d ≥ 1, where d = 1 means that the file is not split), and the number of pairs of

sending and receiving hosts be n (see Fig. 5.1b). Assume that the 2 ·n hosts have the same hardware

and software configurations and thus have the same processing power. Let the disk I/O for each

host be r for reading and w for writing. Let the time to split and load the file, and the time to

assemble the file be Tsplit and Tassemble, respectively. Tsplit and Tassemble are serial in nature because

the splitting and assembling steps involve a single source or sink. We assume that Tsplit and Tassemble

are independent of the splitting degree d. Since hosts at the sending cluster are typically co-located

in one geographic location, we ignore the RTT delay for inter-host communication. Similarly, we

ignore the RTT delay amongst receiving hosts. Thus, we estimate Tsplit and Tassemble as follows:

$$T_{split} = T_{assemble} = \frac{x}{r} + \frac{x}{w} \qquad (5.1)$$

Let the time to transfer the whole file from a single host at the sending site to a single host at the receiving site be $T_{transfer}$. Assume that we evenly split the file into d parts. If d < n, it takes $T_{transfer}/d$ to transfer these parts in parallel. Otherwise, the time is $T_{transfer}/n$ because we do not benefit by increasing d to be larger than n. Hence, we have the following equation to guide us in our selection of the splitting degree:

$$T_{split} + \frac{T_{transfer}}{\min(d,n)} + T_{assemble} < T_{transfer} \qquad (5.2)$$

The speedup for the general-case cluster solution is

$$\mathrm{speedup} = \frac{T_{transfer}}{T_{split} + \dfrac{T_{transfer}}{\min(d,n)} + T_{assemble}} \qquad (5.3)$$

Combining (5.1), (5.2), and (5.3), we reason that to get the largest speedup, we should select

the splitting degree such that

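$$d = \begin{cases} n & \text{if } n > \dfrac{T_{transfer}}{T_{transfer} - 2\left(\frac{x}{r} + \frac{x}{w}\right)} \\[2ex] 1 & \text{otherwise} \end{cases} \qquad (5.4)$$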
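The threshold in (5.4) follows from (5.1) and (5.2) in a few steps. The short derivation below is a sketch that assumes $d \le n$, so that $\min(d,n) = d$, and that $T_{transfer} > 2(x/r + x/w)$, so that the division in the last step preserves the inequality.

```latex
\begin{align*}
T_{split} + \frac{T_{transfer}}{d} + T_{assemble} &< T_{transfer}
  && \text{(5.2) with } \min(d,n) = d \\
2\left(\frac{x}{r} + \frac{x}{w}\right) + \frac{T_{transfer}}{d} &< T_{transfer}
  && \text{by (5.1)} \\
\frac{T_{transfer}}{d} &< T_{transfer} - 2\left(\frac{x}{r} + \frac{x}{w}\right) \\
d &> \frac{T_{transfer}}{T_{transfer} - 2\left(\frac{x}{r} + \frac{x}{w}\right)}
\end{align*}
```

Because the speedup in (5.3) grows with $\min(d,n)$, the largest useful choice is $d = n$, which gives the first case of (5.4).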

In addition, the requirement $T_{transfer} > 2\left(\frac{x}{r} + \frac{x}{w}\right)$ should be met; otherwise, the splitting and assembling operations take longer than the transferring operation. The two conditions, $n > \frac{T_{transfer}}{T_{transfer} - 2(x/r + x/w)}$ and $T_{transfer} > 2(x/r + x/w)$, determine whether we should split the source file, that is, whether we should use the general-case cluster solution. If the file transfer is carried out over the Internet, $T_{transfer}$ increases significantly as RTT increases and/or network congestion increases. Consequently, the probability of meeting these two conditions increases.
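The selection rule can be checked numerically. The following sketch encodes (5.1), (5.3), and (5.4); the function names and the sample values (an 8 Gb file, 1 Gb/s disk read and write rates, four host pairs) are assumptions chosen only for illustration.

```python
def speedup(x, r, w, t_transfer, d, n):
    """Speedup of the general-case cluster solution, Eq. (5.3)."""
    t_split = t_assemble = x / r + x / w                   # Eq. (5.1)
    total = t_split + t_transfer / min(d, n) + t_assemble
    return t_transfer / total


def choose_splitting_degree(x, r, w, t_transfer, n):
    """Return n if splitting pays off according to Eq. (5.4), else 1."""
    overhead = 2 * (x / r + x / w)                         # T_split + T_assemble
    if t_transfer > overhead and n > t_transfer / (t_transfer - overhead):
        return n
    return 1


if __name__ == "__main__":
    # Hypothetical numbers: x in Gb, r and w in Gb/s, times in seconds.
    x, r, w, n = 8.0, 1.0, 1.0, 4
    for t in (100.0, 30.0):
        d = choose_splitting_degree(x, r, w, t, n)
        print(t, d, round(speedup(x, r, w, t, d, n), 2))
```

In this example a slow 100 s transfer justifies d = 4 with a speedup of about 1.75, whereas a 30 s transfer is shorter than the 32 s of splitting and assembling, so the file is left unsplit.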

In contrast, if the file is transferred over a CO network, such as CHEETAH, bandwidth is re-

served for the file transfer and thus, there is no congestion during data flow. Assume that a circuit

of rate b is reserved between each pair of the sending and receiving hosts. Since we do not benefit

by reserving a circuit faster than w, b should be no larger than w even if the maximum bandwidth rate is larger than w. If b < w, $T_{transfer}$ depends on b. Hence, we estimate $T_{transfer}$ as follows:

$$T_{transfer} = \frac{x}{\min(b,w)} \qquad (5.5)$$


Thus, to use the cluster solution, we should at least satisfy

$$\frac{x}{\min(b,w)} > 2\left(\frac{x}{r} + \frac{x}{w}\right) \implies b < \frac{rw}{2(r+w)} \qquad (5.6)$$

However, if the circuit bandwidth is high, then the probability of meeting the condition (5.6) is

low or even zero. This argues against the cluster solution on CHEETAH. But note that during

the previous analysis, we assume that the three steps of splitting, transferring, and assembling are

carried out separately. If we pipeline them, then we can decrease the total delay. For example,

while we split some parts and load them to sending hosts, we can transfer these available parts to

receiving hosts without waiting for the splitting step to be finished. Additionally, if we use PVFS2 to manage files and the starting point is an already-split file, the cluster solution has value even on CHEETAH.
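To see how restrictive condition (5.6) is, the bound can be evaluated for sample disk rates. The sketch below is illustrative: the function names are hypothetical and the rates of 1 Gb/s are assumed values, not measurements from this work.

```python
def transfer_time(x, b, w):
    """Estimated transfer time over a reserved circuit, Eq. (5.5)."""
    return x / min(b, w)


def max_useful_circuit_rate(r, w):
    """Upper bound on b below which splitting can pay off, Eq. (5.6)."""
    return (r * w) / (2 * (r + w))


def splitting_can_pay_off(b, r, w):
    """True if the circuit rate b satisfies condition (5.6)."""
    return b < max_useful_circuit_rate(r, w)


if __name__ == "__main__":
    r = w = 1.0                                # assumed disk rates in Gb/s
    print(max_useful_circuit_rate(r, w))       # 0.25
    print(splitting_can_pay_off(0.2, r, w))    # True
    print(splitting_can_pay_off(0.5, r, w))    # False
```

With equal read and write rates, the circuit must be slower than one quarter of the disk rate before the non-pipelined cluster solution can win, which illustrates why a high-rate circuit argues against it.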

5.4.2 Design

In this section, we propose possible approaches to implement the three steps of the general-case

cluster solution. We discuss their advantages and disadvantages and decide to use GridFTP striped

transfer and PVFS2.

There are several possible approaches to splitting and assembling a file. The first approach is

to use the functionalities of partial transfer and third-party control provided by some file transfer

tools. For example, we could use GridFTP. However, there are two problems with this approach. Firstly,

disk space of the whole file size should be allocated on each host. Thus, this implementation is not

suitable for a large file which cannot even reside on a single host. Secondly, this approach is serial

in nature and consumes much time as we mentioned in Section 5.4.1. Thus, the overall speedup is

significantly affected even though the transferring step has a theoretical speedup of min(d,n).

Alternatively, we can write a socket program to implement splitting and assembling and thus

overcome the first space problem of using GridFTP partial transfer. However, this approach still

has significant overhead for splitting and assembling.

The best approach is to use PVFS2 to manage files. PVFS2 provides a tool, pvfs2-cp, to transfer


files between PVFS2 and other file systems, such as NFS, Linux ext2, and Linux ext3. Thus, we can

use it to assemble a PVFS2 file, which is distributed across multiple I/O servers, into a non-split one

stored in the other file systems, and vice versa. PVFS2 automatically manages partitioning. From

a user’s point of view, a file can be accessed as though it was stored in a single central location.

Hence, we can avoid assembling if a user chooses to access a file in PVFS2. We can even avoid

splitting if files are initially created in PVFS2. Thus, we choose to use PVFS2 to manage files and

we use pvfs2-cp to split or assemble a file if necessary (i.e., if a file is not originally managed by PVFS2, or if users need to access the file via a non-PVFS2 file system).

After deciding to use PVFS2 for splitting and assembling, we study the approaches to transmit-

ting parts of a file. The first approach is to use GridFTP partial transfer (or any file transfer tools

that provide the functionality of partial transfer) to transfer partitions from one PVFS2 to another

PVFS2 in parallel but independently. To achieve highest throughput, we should avoid unnecessary

network–and–disk contention in each PVFS2 system by making all GridFTP servers responsible

for moving only the data blocks located in their local disks. For example, we should avoid the

following scenario: a GridFTP server reads a non-local data block and sends the block to its peer

receiver, which then has to move the block using PVFS2 to a disk of another host. To avoid such

network–and–disk contention, we should meet the following two conditions:

1. The software should know a priori how data are striped in PVFS2.

2. PVFS2 I/O servers and GridFTP servers run on the same hosts and GridFTP servers are

responsible only for their local data blocks.

Provided that the first condition holds, the second condition becomes trivial. However, PVFS2 does

not provide any explicit utility to examine data distribution. Therefore, to meet the first condition,

we investigated how PVFS2 works and modified PVFS2 code. We will describe our modifications

to PVFS2 in Section 5.4.3. Fig. 5.5 shows a model of using GridFTP partial file transfer to imple-

ment the transferring step, where for each data block, there is a GridFTP control connection and a

GridFTP data connection responsible for transmitting the block between the two PVFS2 systems.
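Once the striping pattern is known, the byte ranges that each GridFTP server should move with partial transfers follow mechanically. The sketch below is a hypothetical helper, not part of GridFTP or PVFS2; it assumes round-robin striping that starts at server index 0 and a fixed stripe size.

```python
def local_ranges(server_index, num_servers, stripe_size, file_size):
    """Yield (offset, length) byte ranges stored on one I/O server.

    Each range corresponds to one partial transfer that the co-located
    GridFTP server would be responsible for.
    """
    offset = server_index * stripe_size
    while offset < file_size:
        # The last stripe of the file may be shorter than stripe_size.
        yield offset, min(stripe_size, file_size - offset)
        offset += num_servers * stripe_size


if __name__ == "__main__":
    # Toy file of 700 bytes, 64-byte stripes, five servers.
    for s in range(5):
        print(s, list(local_ranges(s, 5, 64, 700)))
```

Restricting every server to the ranges produced for its own index is what keeps each block on a local disk at both ends of the transfer.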


[Figure: two PVFS2 systems, each consisting of n hosts that run both a PVFS2 I/O server and a GridFTP server (hosts 1 through n and 1' through n'); the blocks stored on each host are moved to the peer host by GridFTP partial file transfers.]

Figure 5.5: A model of using GridFTP partial file transfer to implement the transferring step

The second approach is to use GridFTP striped transfer. Similar to the first approach, to achieve

highest throughput, we should also minimize network–and–disk contention in each PVFS2 system.

For this target, we should meet the following two conditions besides the two conditions for the first

approach:

1. GridFTP stripes data across data nodes in the same sequence as PVFS2 does across PVFS2

I/O servers.

2. GridFTP and PVFS2 have the same stripe size.

We can easily meet the second condition by setting the stripe-size parameters for GridFTP and

PVFS2 to have the same value. We will address how we modified GridFTP code to meet the first

condition in Section 5.4.4.
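The effect of violating either condition can be estimated with a small model. The sketch below is an illustration under stated assumptions, not GridFTP or PVFS2 code: the function names and host labels are hypothetical, and both systems are assumed to stripe round-robin from the first entry of their respective orders.

```python
def owner(offset, order, stripe_size):
    """Host that holds the byte at `offset` under round-robin striping."""
    return order[(offset // stripe_size) % len(order)]


def nonlocal_fraction(file_size, unit, pvfs2_order, pvfs2_stripe,
                      gridftp_order, gridftp_stripe):
    """Fraction of `unit`-sized pieces that GridFTP assigns to a data node
    other than the PVFS2 I/O server that stores them."""
    offsets = range(0, file_size, unit)
    bad = sum(
        owner(o, pvfs2_order, pvfs2_stripe) != owner(o, gridftp_order, gridftp_stripe)
        for o in offsets
    )
    return bad / len(offsets)


if __name__ == "__main__":
    hosts = ["h0", "h1", "h2", "h3", "h4"]   # hypothetical host labels
    print(nonlocal_fraction(640, 64, hosts, 64, hosts, 64))         # 0.0
    print(nonlocal_fraction(640, 64, hosts, 64, hosts[::-1], 64))   # 0.8
    print(nonlocal_fraction(640, 64, hosts, 64, hosts, 128))        # 0.8
```

In this toy case, either a reversed sequence or a doubled stripe size sends 80% of the data through a host that does not store it, which is the contention the two conditions are meant to rule out.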

Fig. 5.6 shows the model of using GridFTP striped transfer to implement the transferring step.

Unlike the first transferring approach, which is composed of many independent parallel partial

transfers, this approach has only a single file transfer involving many hosts (see Section 5.2.1). As

shown in Fig. 5.6, there are only two control connections between a third party and two front ends.

In addition, for each pair of sending and receiving data nodes, there is only a single data connection.


[Figure: a third party C runs globus-url-copy and holds one control connection to the receiving front end A and one to the sending front end B; each front end is a GridFTP server that reaches its data nodes through internal IPC; data nodes 1 through n and 1' through n' run on the PVFS2 I/O servers, each holding blocks such as Block 1 and Block n+1, and each pair of sending and receiving data nodes is joined by a single data connection.]

Figure 5.6: A model of using GridFTP striped transfer to implement the transferring step

Comparing Fig. 5.5 with Fig. 5.6, we see that the approach using GridFTP striped transfer is more

natural and has less overhead to establish and release connections. For these reasons, we choose

to use GridFTP striped transfer to implement the transferring step. In conclusion, we use GridFTP

striped transfer and PVFS2 to implement the general-case cluster solution. For convenience, we

summarize the above-described approaches in Table 5.1.

5.4.3 Implementation—Modifications to PVFS2

As mentioned in Section 5.4.2, to minimize network–and–disk contention in the general-case clus-

ter solution, we need to know how a file is striped in PVFS2. In this subsection, we describe our

modifications to PVFS2 to obtain data distribution information.


Table 5.1: A summary of possible approaches to implement the general-case cluster solution

Steps                  | Approach                      | Pros                                                                    | Cons
splitting & assembling | GridFTP partial file transfer | -                                                                       | wastes disk space, consumes significant overhead to split and assemble
splitting & assembling | socket program                | avoids wasting disk space                                               | consumes significant overhead to split and assemble
splitting & assembling | pvfs2-cp                      | avoids wasting disk space, avoids assembling or even splitting overhead | -
transferring           | GridFTP partial file transfer | -                                                                       | many independent transfers, which incur much overhead to set up and release connections
transferring           | GridFTP striped transfer      | a single file transfer                                                  | -

We installed two PVFS2 1.0.1 systems on a 22-node cluster, called sunfire. Sunfire1 through

sunfire22 are all equipped with two Intel(R)-Xeon 2.80 GHz CPUs, and 1 GB RAM, and are con-

nected to a 24-port GbE switch. They run Redhat Linux 9 and are the clients of an NFS server,

called centurion. We loaded each PVFS2 system on five sunfire hosts. For the first PVFS2 system,

we configured sunfire1 through sunfire5 as the I/O servers and compute nodes, and sunfire1 as the

only metadata server. For the second PVFS2 system, we configured sunfire6 through sunfire10 as

the I/O servers and compute nodes, and sunfire6 as the only metadata server. The configuration file

for the second PVFS2 is shown in Fig. 5.7. In this subsection, we carried out the experiments in the

second PVFS2 system unless otherwise mentioned.

Unlike PVFS1, which provides the utility of pvstat to examine physical file-distribution param-

eters (e.g., the index of the starting I/O node, the number of I/O servers, and the stripe size) [43],

PVFS2 1.0.1 does not provide any direct utility to inspect data distribution. We reported this prob-

lem to the pvfs2-user mailing list and were advised to use the tool pvfs2-fs-dump, which displays

information about the contents of the file system.5 However, the output by pvfs2-fs-dump does not

explicitly illustrate how files are striped. The output is not only hard to comprehend, but also is

5 See http://www.beowulf-underground.org/pipermail/pvfs2-users/2005-April/000622.html.


    ...
    <MetaHandleRanges>
        Range sunfire6 4-715827885
    </MetaHandleRanges>
    <DataHandleRanges>
        Range sunfire10 715827886-1431655767
        Range sunfire6 1431655768-2147483649
        Range sunfire7 2147483650-2863311531
        Range sunfire8 2863311532-3579139413
        Range sunfire9 3579139414-4294967295
    </DataHandleRanges>
    ...

Figure 5.7: A snippet of pvfs2-fs2.conf, the PVFS2 configuration file on sunfire6

verbose when the PVFS2 file system contains myriad files. Fig. 5.8 shows a part of the output of the pvfs2-fs-dump command. For each file in PVFS2, pvfs2-fs-dump provides the handle number, the type (Metafile or Datafile), and the I/O or metadata server number.

    ...
    File: test_500M
        handle = 715827830, type = Metafile, server = 0
        handle = 3579139362, type = Datafile, server = 3
        handle = 4294967244, type = Datafile, server = 4
        handle = 1431655716, type = Datafile, server = 0
        handle = 2147483598, type = Datafile, server = 1
        handle = 2863311480, type = Datafile, server = 2
    File: test_2000M
        handle = 715827861, type = Metafile, server = 0
        handle = 2863311500, type = Datafile, server = 2
        handle = 3579139382, type = Datafile, server = 3
        handle = 4294967264, type = Datafile, server = 4
        handle = 1431655736, type = Datafile, server = 0
        handle = 2147483608, type = Datafile, server = 1
    ...

Figure 5.8: A part of the output for pvfs2-fs-dump

We wanted answers to the

following questions. First, the I/O server numbers and metadata server numbers are logical numbers. It is unclear how PVFS2 matches the logical server numbers with the physical servers. Second, the order of the server numbers is not deterministic; for example, the file test_500M is striped in the order 3, 4, 0, 1, and 2 whereas the file test_2000M is striped in the order 2, 3, 4, 0, and 1. How is

this order determined? Does it indicate the round-robin sequence of the I/O servers where the files

are distributed? Finally, the output of pvfs2-fs-dump does not provide any information about the

data stripe size. The default stripe size is 64 KB, but can a user set the stripe size?

The first question was easy to answer. Sunfire6 is the only metadata server (see Fig. 5.7).

Therefore, as a metadata server, sunfire6 has the logical number 0 (see Fig. 5.8). By combining the

handle numbers in Fig. 5.8 and the handle ranges for each data server in Fig. 5.7, we determined

physical servers corresponding to logical numbers (see Table 5.2). In other words, by combining

the output of pvfs2-fs-dump command and the contents of the pvfs2-fs2.conf file, we determined the

identification of the physical servers corresponding to logical numbers of I/O nodes.

Table 5.2: The logical server numbers for the physical I/O servers

Physical I/O server | Logical number
sunfire10           | 0
sunfire6            | 1
sunfire7            | 2
sunfire8            | 3
sunfire9            | 4
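The matching procedure that yields Table 5.2 can be reproduced with a short script. The handle ranges and datafile handles below are copied from Fig. 5.7 and Fig. 5.8 (file test_500M); the function and variable names are hypothetical, and the script is only a sketch of the manual procedure, not a PVFS2 utility.

```python
# DataHandleRanges from pvfs2-fs2.conf (Fig. 5.7): host, first handle, last handle.
DATA_HANDLE_RANGES = [
    ("sunfire10", 715827886, 1431655767),
    ("sunfire6", 1431655768, 2147483649),
    ("sunfire7", 2147483650, 2863311531),
    ("sunfire8", 2863311532, 3579139413),
    ("sunfire9", 3579139414, 4294967295),
]

# (handle, logical server number) for the datafiles of test_500M (Fig. 5.8).
TEST_500M_DATAFILES = [
    (3579139362, 3),
    (4294967244, 4),
    (1431655716, 0),
    (2147483598, 1),
    (2863311480, 2),
]


def server_for_handle(handle):
    """Return the physical host whose handle range contains `handle`."""
    for host, low, high in DATA_HANDLE_RANGES:
        if low <= handle <= high:
            return host
    raise LookupError(f"handle {handle} is outside all data handle ranges")


def logical_to_physical(datafiles):
    """Build the logical-number-to-host mapping from pvfs2-fs-dump entries."""
    return {logical: server_for_handle(handle) for handle, logical in datafiles}


if __name__ == "__main__":
    for logical, host in sorted(logical_to_physical(TEST_500M_DATAFILES).items()):
        print(logical, host)
```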

To answer the other two questions, we wrote a program, called filegenerator, to create a file

such that the file stores the striping information. Consider an s-KB file with the format shown in

Fig. 5.9. We used the strace command to trace the system calls called by the utility pvfs2-cp. We

describe our trace results below.

    | 1a...a | 2a...a | ... | sa...a |
      1024 B   1024 B   ...   1024 B

Figure 5.9: The content of an s-KB file
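A generator for files in the format of Fig. 5.9 takes only a few lines. The sketch below is not the filegenerator program used in this work, whose source is not shown here; it assumes that each 1024-byte block begins with its 1-based decimal index and is padded with the character a, and the helper names are hypothetical.

```python
import os
import tempfile


def generate(path, size_kb, block_size=1024):
    """Write size_kb blocks; block i starts with the decimal index i and is
    padded with 'a' up to block_size bytes."""
    with open(path, "wb") as f:
        for index in range(1, size_kb + 1):
            label = str(index).encode("ascii")
            f.write(label + b"a" * (block_size - len(label)))


def read_block(path, index, block_size=1024):
    """Return block number `index` (1-based) so that its label can be inspected."""
    with open(path, "rb") as f:
        f.seek((index - 1) * block_size)
        return f.read(block_size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test_4K")
        generate(path, 4)
        print(os.path.getsize(path))      # 4096
        print(read_block(path, 3)[:8])    # b'3aaaaaaa'
```

Because every kilobyte carries its own index, the first bytes of any buffer seen in a trace reveal which part of the file that buffer holds.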

First, we used filegenerator to create a 1000 MB file, called test_1000M, in the directory /tmp/ on sunfire10. Then, we issued the command strace pvfs2-cp -t /tmp/test_1000M /pvfs2/test_1000M -o testfile/pvfs2cp2 to copy the file into PVFS2 and to save the strace output into the file, called


    [xf4c@sunfire10 xf4c]$ more testfile/pvfs2cp2 | grep connect
    ...
    connect(4, sa_family=AF_INET, sin_port=htons(3334), sin_addr=inet_addr("128.143.63.248"), 16) = -1 EINPROGRESS (Operation now in progress)
    connect(6, sa_family=AF_INET, sin_port=htons(3334), sin_addr=inet_addr("128.143.63.216"), 16) = -1 EINPROGRESS (Operation now in progress)
    connect(7, sa_family=AF_INET, sin_port=htons(3334), sin_addr=inet_addr("128.143.63.226"), 16) = -1 EINPROGRESS (Operation now in progress)
    connect(8, sa_family=AF_INET, sin_port=htons(3334), sin_addr=inet_addr("128.143.63.224"), 16) = -1 EINPROGRESS (Operation now in progress)
    connect(9, sa_family=AF_INET, sin_port=htons(3334), sin_addr=inet_addr("128.143.63.225"), 16) = -1 EINPROGRESS (Operation now in progress)
    ...

Figure 5.10: A part of the output for the command more testfile/pvfs2cp2 | grep connect

testfile/pvfs2cp2. Next, we identified the file descriptors used in the I/O servers on sunfire by typing the command more testfile/pvfs2cp2 | grep connect. From Fig. 5.10 (footnote 6), we determined the file

descriptors used in sunfire6 through sunfire10 by matching IP addresses from Fig. 5.10 with the

names of these machines. The results are shown in Table 5.3. Further, we used the command,

more testfile/pvfs2cp2 | grep writev | more, to determine how the file was distributed across the I/O

servers. Fig. 5.11 shows a small part of the output for this command, where we saw that the distance

between neighboring blocks on the same host was 320 KB (e.g., 385-65, 321-1, etc.). Since each

Table 5.3: The file descriptors and IP addresses for sunfire6 through sunfire10

File descriptor | IP address     | Host name
4               | 128.143.63.248 | sunfire10
6               | 128.143.63.216 | sunfire6
7               | 128.143.63.226 | sunfire9
8               | 128.143.63.224 | sunfire7
9               | 128.143.63.225 | sunfire8

6 We configured the I/O servers and the metadata server to listen on the default TCP port number 3334.


    writev(4, ..., "65aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, "385aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, ...
    writev(7, ..., "1aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, "321aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, ...
    writev(6, ..., "129aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, "449aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, ...
    writev(8, ..., "193aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, "513aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, ...
    writev(9, ..., "257aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, "577aaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 65536, ...

Figure 5.11: A part of the output of the command more testfile/pvfs2cp2 | grep writev | more

stripe was 64 KB, and there were five I/O servers, neighboring blocks were 64×5 = 320 KB apart.

Combining Fig. 5.11 and Table 5.3, we summarized the data-distribution pattern for test_1000M in Table 5.4. Thus, test_1000M was distributed cyclically across sunfire9, sunfire10, sunfire6, sunfire7, and sunfire8. Finally, we examined the output of pvfs2-fs-dump for test_1000M, as shown in Fig. 5.12. Combining Fig. 5.12 and Table 5.2, we found that the I/O-server sequence given by pvfs2-fs-dump was also sunfire9, sunfire10, sunfire6, sunfire7, and sunfire8. Therefore, we concluded that

Table 5.4: The data-distribution pattern for /pvfs2/test_1000M

File descriptor | Host name | Starting offset for each block
4               | sunfire10 | 65, 385, 705, ... 1023745
6               | sunfire6  | 129, 449, 769, ... 1023809
7               | sunfire9  | 1, 321, 641, 961, ... 1023681
8               | sunfire7  | 193, 513, 833, ... 1023873
9               | sunfire8  | 257, 577, 897, ... 1023937


    ...
    File: test_1000M
        handle = 715827870, type = Metafile, server = 0
        handle = 4294967284, type = Datafile, server = 4
        handle = 1431655756, type = Datafile, server = 0
        handle = 2147483638, type = Datafile, server = 1
        handle = 2863311520, type = Datafile, server = 2
        handle = 3579139402, type = Datafile, server = 3
    ...

Figure 5.12: The pvfs2-fs-dump output for the test_1000M file

pvfs2-fs-dump shows the round-robin sequence of the I/O servers for file distribution.7
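The entries of Table 5.4 can be regenerated from the observed server sequence, which is a useful consistency check on the conclusion above. In the sketch below, the sequence and the 64 KB stripe size come from the trace; the function name is hypothetical, and block labels are the 1-based kilobyte indices written by filegenerator.

```python
# I/O-server sequence observed for test_1000M (from Fig. 5.11 and Table 5.3).
ORDER = ["sunfire9", "sunfire10", "sunfire6", "sunfire7", "sunfire8"]


def block_labels(host, order, file_kb, stripe_kb=64):
    """Return the starting labels (1-based KB indices) of the stripes that
    land on `host` under round-robin striping in the given order."""
    position = order.index(host)
    first = 1 + position * stripe_kb
    return range(first, file_kb + 1, len(order) * stripe_kb)


if __name__ == "__main__":
    file_kb = 1000 * 1024                     # the 1000 MB file test_1000M
    for host in ORDER:
        labels = block_labels(host, ORDER, file_kb)
        print(host, list(labels[:3]), "...", labels[-1])
```

The first and last labels computed for each host agree with the corresponding row of Table 5.4.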

For the third question on the stripe size, we first used filegenerator to create a 128 KB file, called test_128K. Then, we typed the command strace pvfs2-cp -s 131072 -t /tmp/test_128K /pvfs2/test_128K2 -o pvfs2cp, which specified the stripe size as 128 KB in the -s option. Fig. 5.13

shows a part of the strace output, where the stripe size was 64 KB instead. Thus, we concluded that

    writev( 4,...," 1aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"...)
    ...
    writev( 6,...," 65aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"...)
    ...

Figure 5.13: A snippet from the file pvfs2cp

in PVFS2 1.0.1, pvfs2-cp has a bug of ignoring the -s option.8 To change the default stripe size,

we investigated the PVFS2 1.0.1 source code. We found that the statement that specifies the default

stripe size (64 KB) is located in the program $PVFS2dir/src/io/description/Dist-simple-stripe.c (footnote 9), as shown below:

    static PVFS_simple_stripe_params simple_stripe_params = { 65536 /* strip size */ };

7 We repeated the procedure for many files and found that the result always holds. The PVFS2 team also confirmed this result.

8 We reported this problem to the pvfs2-developer mailing list and were notified that this problem would be fixed in the future.

9 $PVFS2dir denotes where PVFS2 is installed.


By setting the parameter simple_stripe_params, we can change the default stripe size and thus overcome the problem of pvfs2-cp ignoring the -s option. For example, we set simple_stripe_params=1048576 and recompiled the code. Then, we used pvfs2-cp to copy test_1000M into PVFS2 and used strace to observe the system calls made by pvfs2-cp. Fig. 5.14 shows a part of the strace output, where test_1000M was distributed across the I/O servers with the 1 MB stripe size.

    writev( 4,...," 1aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"...,)
    ...
    writev( 6,...," 1025aaaaaaaaaaaaaaaaaaaaaaaaaaaa"...,)
    ...
    writev( 7,...," 2049aaaaaaaaaaaaaaaaaaaaaaaaaaaa"...,)
    ...
    writev( 8,...," 3073aaaaaaaaaaaaaaaaaaaaaaaaaaaa"...,)
    ...
    writev( 9,...," 4097aaaaaaaaaaaaaaaaaaaaaaaaaaaa"...,)
    ...

Figure 5.14: A part of the output for the strace command

Finally, we addressed the problem that PVFS2 stripes files across the I/O servers in a nondeterministic sequence. We found that inside the program $PVFS2dir/src/common/misc/pint-cached-config.c, there is a function, PINT_cached_config_get_next_io(), which chooses a random I/O server and then uses the order specified in pvfs2-fs2.conf to distribute a file, as shown in Fig. 5.15. The

reason that PVFS2 was designed to stripe data with a random starting I/O server is load balanc-

ing. But in our general-case cluster solution, we need to predict how a file is striped to minimize

network-and-disk contention. Hence, we modified the statement in Fig. 5.15 that assigns jitter, namely jitter = (rand() % num_io_servers), into jitter = -1 and obtained a predictable (fixed) order of data distribution. In other words, a file is distributed

across all the I/O servers according to the logical order specified in pvfs2-fs2.conf. Thus, for the

second PVFS2, the sequence is sunfire10, sunfire6, sunfire7, sunfire8, and sunfire9; and for the first

PVFS2, the sequence is sunfire1, sunfire2, sunfire3, sunfire4, and sunfire5. Consequently, given the

information of stripe size, we can exactly figure out how a file is striped across the I/O servers.
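The effect of the jitter change can be modeled in a few lines. This is a deliberately simplified model and not the PVFS2 code: it assumes that the server cursor starts at the first entry of pvfs2-fs2.conf and that the first loop in Fig. 5.15 advances the cursor jitter + 1 times; the persistent cursor state of the real implementation is not modeled, and the function name is hypothetical.

```python
import random

# Logical order of the I/O servers in pvfs2-fs2.conf for the second PVFS2 system.
CONFIG_ORDER = ["sunfire10", "sunfire6", "sunfire7", "sunfire8", "sunfire9"]


def striping_sequence(config_order, jitter):
    """Server sequence used for a new file in the simplified model.

    The loop `while (jitter-- > -1)` runs jitter + 1 times, so the first
    server is found jitter + 1 positions after the head of the list.
    """
    n = len(config_order)
    start = (jitter + 1) % n
    return config_order[start:] + config_order[:start]


if __name__ == "__main__":
    # Original behavior: a random starting point for each new file.
    jitter = random.randrange(len(CONFIG_ORDER))
    print("random:", striping_sequence(CONFIG_ORDER, jitter))
    # Modified behavior: jitter = -1 skips the loop, giving the configured order.
    print("fixed: ", striping_sequence(CONFIG_ORDER, -1))
```

Under this model, jitter = 3 reproduces the sequence observed for test_1000M (sunfire9 first), while jitter = -1 always yields the order listed in the configuration file.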


    /* PINT_cached_config_get_next_io()
     * returns the address of a set of servers that should be used to
     * store new pieces of file data. This function is responsible for
     * evenly distributing the file data storage load to all servers.
     */
    int PINT_cached_config_get_next_io(...)
    {
        ...
        num_io_servers = PINT_llist_count(
            cur_config_cache->fs->data_handle_ranges);

        /* pick random starting point */
        jitter = (rand() % num_io_servers);
        while(jitter-- > -1)
        {
            cur_mapping = PINT_llist_head(cur_config_cache->data_server_cursor);
            ...
            cur_config_cache->data_server_cursor = PINT_llist_next(
                cur_config_cache->data_server_cursor);
        }

        while(num_servers)
        {
            ...
            cur_config_cache->data_server_cursor = PINT_llist_next(
                cur_config_cache->data_server_cursor);
            data_server_bmi_str = PINT_config_get_host_addr_ptr(
                config, cur_mapping->alias_mapping->host_alias);
            ...
        }
        ...

Figure 5.15: A snippet of the source code for PINT_cached_config_get_next_io()

5.4.4 Implementation—Modifications to GridFTP

GridFTP stripes data across data nodes according to a data-connection sequence, termed “stripe

index,” in the range of 0 to n− 1. To meet the condition that GridFTP stripes data across data

nodes in the same sequence as PVFS2 does across PVFS2 I/O servers, we first need to answer

the question: what is the stripe index for each pair of sending and receiving data nodes? In other


words, in GridFTP striped transfer, how and in what order are sending data nodes matched with

receiving ones? The GridFTP specification [1] does not address this question. In this section, we

first investigate how sending and receiving data nodes are matched. Our experimental results show

that the matching is nondeterministic and thus, we cannot avoid the network-and-disk contention

unless we modify GridFTP code. Then, we describe how to modify the GridFTP code to get a

deterministic matching sequence between sending and receiving data nodes.

We installed the GridFTP package provided by GT3.9.5 on sunfire. This GridFTP package

contains the functionality of GridFTP striped transfer. We started GridFTP servers on sunfire1

through sunfire10 such that sunfire1 and sunfire6 are front ends and the other eight hosts are data

nodes. Fig. 5.16 shows the commands. With the -r option, we specified that the data nodes for

sunfire1 were ordered as sunfire2 through sunfire5 and those for sunfire6 were sunfire7 through

sunfire10. The -dn option marks a GridFTP server as a data node. We expected sunfire2 through sunfire5 to be matched with sunfire7 through sunfire10 in the order given by the -r options, so that sunfire2 would communicate with sunfire7, sunfire3 with sunfire8, and so on. However, the following results show that GridFTP striped transfer does not work in this ideal way.

[xf4c@sunfire1 xf4c]$ /home/xf4c/gt3.9.5/sbin/globus-gridftp-server -aa -p 50001 -r sunfire2:5001,sunfire3:5001,sunfire4:5001,sunfire5:5001

[xf4c@sunfire6 etc]$ /home/xf4c/gt3.9.5/sbin/globus-gridftp-server -aa -p 50002 -r sunfire7:5001,sunfire8:5001,sunfire9:5001,sunfire10:5001

[xf4c@sunfire2 xf4c]$ /home/xf4c/gt3.9.5/sbin/globus-gridftp-server -aa -p 50001 -dn

...

[xf4c@sunfire5 xf4c]$ /home/xf4c/gt3.9.5/sbin/globus-gridftp-server -aa -p 50001 -dn

[xf4c@sunfire7 xf4c]$ /home/xf4c/gt3.9.5/sbin/globus-gridftp-server -aa -p 50001 -dn

...

[xf4c@sunfire10 xf4c]$ /home/xf4c/gt3.9.5/sbin/globus-gridftp-server -aa -p 50001 -dn

Figure 5.16: The commands to start GridFTP servers on sunfire

We started globus-url-copy on a third party, sunfire11, to use the functionality of GridFTP striped


transfer (by turning on the -stripe option). The command is as follows:

[xf4c@sunfire11 xf4c]$ $GLOBUS_LOCATION/bin/globus-url-copy -vb -dbg

-stripe ftp://sunfire1:50001/home/xf4c/testfile/test_1G

ftp://sunfire6:50002/home/xf4c/testfile/test_1G1 2>dbg1.txt

We turned on the debug mode with the -dbg option so that we could obtain the details. Fig. 5.17

shows a part of the debug output. By examining the information in Fig. 5.17 below and Table 5.3 on page 57, we saw that the host–port pairs returned by the SPAS command were ordered sunfire10, sunfire9, sunfire8, sunfire7, rather than in the sequence sunfire7 through sunfire10 specified by the -r option for sunfire6.

The result of SPAS:
debug: sending command: SPAS
debug: response from ftp://sunfire6:50002/home/xf4c/testfile/test_1G1: 229-Entering Striped Passive Mode.
128,143,63,248,185,185
128,143,63,226,186,31
128,143,63,225,185,170
128,143,63,224,186,15
229 End

Figure 5.17: A part of the debug output for the GridFTP striped transfer
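Each entry in the SPAS reply uses the host–port notation of FTP [36]: the four bytes of the IP address followed by the two bytes of the port number, so the port equals p1 · 256 + p2. The following minimal Python sketch decodes the four entries shown in Fig. 5.17. The function name decode_host_port is our own choice for illustration and is not part of GridFTP.

```python
def decode_host_port(entry: str) -> tuple:
    """Decode one 'h1,h2,h3,h4,p1,p2' entry into an (ip, port) pair."""
    fields = [int(f) for f in entry.split(",")]
    if len(fields) != 6 or not all(0 <= f <= 255 for f in fields):
        raise ValueError(f"malformed host-port entry: {entry!r}")
    ip = ".".join(str(f) for f in fields[:4])
    port = fields[4] * 256 + fields[5]  # high byte first, then low byte
    return ip, port


# Entries copied from the SPAS reply in Fig. 5.17, in the order returned.
spas_reply = [
    "128,143,63,248,185,185",
    "128,143,63,226,186,31",
    "128,143,63,225,185,170",
    "128,143,63,224,186,15",
]

decoded = [decode_host_port(entry) for entry in spas_reply]
for ip, port in decoded:
    print(f"{ip}:{port}")
```

The decoded ports (47545, 47647, 47530, and 47631) are the receiving-side ports of the data connections to sunfire10, sunfire9, sunfire8, and sunfire7 in Fig. 5.18.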

Before the GridFTP striped transfer, we also started tcpdump [51] to capture the GridFTP traffic

amongst sunfire1 through sunfire10. After the transfer was finished, we used tcptrace [52] to ana-

lyze the captured traffic. Fig. 5.18 shows the tcptrace outputs for sunfire7–10. The GridFTP data

connections were between sunfire4 and sunfire10, sunfire3 and sunfire9, sunfire2 and sunfire8, and

sunfire5 and sunfire7. Thus, when the sending front end, sunfire1, executed the SPOR command, it

did not require its data nodes (sunfire2 through sunfire5) to establish connections sequentially with

the hosts returned by the SPAS command (sunfire10, sunfire9, sunfire8, sunfire7). We repeated the

experiment several times, and found that neither SPAS nor SPOR follows the sequence specified by

the -r option. Hence, we could not predict how data connections were established between multiple

data nodes.


[xf4c@sunfire10 tcptrace-6.6.7]$ tcptrace /tmp/sunfire10.log
280048 packets seen, 280020 TCP packets traced
elapsed wallclock time: 0:00:00.652783, 429006 pkts/sec analyzed
trace file elapsed time: 0:08:30.409906
TCP connection info:
1: sunfire6.cs.Virginia.EDU:47763 - sunfire10.cs.Virginia.EDU:5001 (a2b) 221> 187< (complete)
2: sunfire4.cs.Virginia.EDU:4878 - sunfire10.cs.Virginia.EDU:47545 (c2d) 186099> 93513< (complete)

[xf4c@sunfire9 tcptrace-6.6.7]$ tcptrace /tmp/sunfire9.log
278903 packets seen, 278885 TCP packets traced
elapsed wallclock time: 0:00:00.891238, 312938 pkts/sec analyzed
trace file elapsed time: 0:07:27.005080
TCP connection info:
1: sunfire6.cs.Virginia.EDU:47764 - sunfire9.cs.Virginia.EDU:5001 (a2b) 212> 174< (complete)
2: sunfire3.cs.Virginia.EDU:47586 - sunfire9.cs.Virginia.EDU:47647 (c2d) 185247> 93252< (complete)

[xf4c@sunfire8 tcptrace-6.6.7]$ tcptrace /tmp/sunfire8.log
279503 packets seen, 279482 TCP packets traced
elapsed wallclock time: 0:00:00.745197, 375072 pkts/sec analyzed
trace file elapsed time: 0:07:50.749054
TCP connection info:
1: sunfire6.cs.Virginia.EDU:47765 - sunfire8.cs.Virginia.EDU:5001 (a2b) 215> 180< (complete)
2: sunfire2.cs.Virginia.EDU:48556 - sunfire8.cs.Virginia.EDU:47530 (c2d) 185827> 93260< (complete)

[xf4c@sunfire7 tcptrace-6.6.7]$ tcptrace /tmp/sunfire7.log
275137 packets seen, 275109 TCP packets traced
elapsed wallclock time: 0:00:01.237319, 222365 pkts/sec analyzed
trace file elapsed time: 0:08:30.410378
TCP connection info:
1: sunfire6.cs.Virginia.EDU:47766 - sunfire7.cs.Virginia.EDU:5001 (a2b) 209> 167< (complete)
2: sunfire5.cs.Virginia.EDU:47577 - sunfire7.cs.Virginia.EDU:47631 (c2d) 182995> 91738< (complete)

Figure 5.18: The tcptrace outputs for GridFTP striped transfer before we modified GridFTP code
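The mismatch can also be checked mechanically. The short Python sketch below compares the pairing we expected from the -r options with the data connections read off Fig. 5.18. The variable names are ours, and the host lists are simply transcribed from the text and the figure.

```python
# Data-node order given by the -r option on each front end.
senders = ["sunfire2", "sunfire3", "sunfire4", "sunfire5"]      # for sunfire1
receivers = ["sunfire7", "sunfire8", "sunfire9", "sunfire10"]   # for sunfire6

# Expected matching: the i-th sender talks to the i-th receiver.
expected = dict(zip(senders, receivers))

# Data connections observed in the tcptrace output (Fig. 5.18).
observed = {
    "sunfire4": "sunfire10",
    "sunfire3": "sunfire9",
    "sunfire2": "sunfire8",
    "sunfire5": "sunfire7",
}

# Map each mismatched sender to (expected receiver, observed receiver).
mismatches = {
    s: (expected[s], observed[s])
    for s in senders
    if expected[s] != observed[s]
}

for sender, (want, got) in mismatches.items():
    print(f"{sender}: expected {want}, observed {got}")
```

In this run, none of the four sending data nodes connected to the receiver that the -r order suggested.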


These nondeterministic data connections between sending and receiving data nodes make it hard to deploy the general-case cluster solution on CHEETAH, where we need to reserve bandwidth before a data transfer. Given the nondeterminism, we would need to reserve bandwidth between every pair of sending and receiving hosts; there are n · (n − 1) such pairs in total. We would waste, and could even run out of, bandwidth if we reserved it for all possible pairs. We can solve this problem by reserving bandwidth between the two cluster switches and allowing any host connected to one switch to communicate with any host connected to the other. However, to minimize network-and-disk contention, we still have to make the data connections deterministic.

We studied the GridFTP source code in GT3.9.5 and modified the implementation of the SPAS

and SPOR commands. For the SPAS command, we first obtained the IP addresses of the data nodes specified in the -r option for the receiving front end. Next, we sorted the list of host–port pairs generated by the original SPAS implementation so that it follows the order of those IP addresses. Finally, we let SPAS return the sorted list to the third party negotiating the GridFTP striped transfer.

Thus, the argument for the SPOR command sent to the sending front end was also sorted by the

order of the IP addresses of the receiving data nodes. For the SPOR command, we requested

sending data nodes specified in the -r option for a sending front end to initiate data connections

sequentially to receiving data nodes specified in the argument of the SPOR command. In this

way, sending and receiving data nodes are matched according to their sequences in the -r option for

sending and receiving front ends. Additionally, their data connections have ascending stripe indexes

from 0 to n−1. Hence, it is easy to let GridFTP stripe data across data nodes in the same sequence

as PVFS2 does across PVFS2 I/O servers. We only need to set the -r option such that GridFTP data

nodes have the same sequence as PVFS2 I/O servers.
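The modified behavior can be sketched as follows. This is an illustrative Python model under stated assumptions, not the GridFTP C code we changed: the function names sort_by_r_order and match_data_nodes are hypothetical, host names stand in for IP addresses, and the sample host–port list reuses the unsorted order and ports observed in Figs. 5.17 and 5.18.

```python
def sort_by_r_order(host_ports: list, r_order: list) -> list:
    """Reorder (host, port) pairs to follow the -r order of the receiving front end."""
    rank = {host: i for i, host in enumerate(r_order)}
    return sorted(host_ports, key=lambda hp: rank[hp[0]])


def match_data_nodes(senders: list, receivers: list) -> list:
    """Pair the i-th sender with the i-th receiver; i serves as the stripe index."""
    if len(senders) != len(receivers):
        raise ValueError("sender and receiver counts differ")
    return [(i, s, r) for i, (s, r) in enumerate(zip(senders, receivers))]


# Host-port pairs in the order the unmodified SPAS returned them.
unsorted_spas = [
    ("sunfire10", 47545),
    ("sunfire9", 47647),
    ("sunfire8", 47530),
    ("sunfire7", 47631),
]
receiver_r_order = ["sunfire7", "sunfire8", "sunfire9", "sunfire10"]
sender_r_order = ["sunfire2", "sunfire3", "sunfire4", "sunfire5"]

sorted_spas = sort_by_r_order(unsorted_spas, receiver_r_order)
matching = match_data_nodes(sender_r_order, sorted_spas)

for stripe_index, sender, (host, port) in matching:
    print(f"stripe {stripe_index}: {sender} -> {host}:{port}")
```

With both lists in -r order, stripe index i always joins the i-th sending data node to the i-th receiving data node, which is the property the PVFS2 layout relies on.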

5.4.5 Experimental Results

We tested the general-case cluster solution on sunfire. In this section, we present the experimental

results to show that network–and–disk contention is minimized after we modified GridFTP and

PVFS2.

There are two PVFS2 file systems on sunfire (see Section 5.4.3 on page 53). The I/O servers for the first PVFS2 file system are ordered as sunfire1 through sunfire5. The I/O servers for the second are ordered as sunfire10, followed by sunfire6 through sunfire9.

We started GridFTP front ends on sunfire1 and sunfire10 and GridFTP data nodes on sunfire1

through sunfire10. The data nodes for sunfire1 were ordered as sunfire1 through sunfire5 and those

for sunfire10 were sunfire10, sunfire6 through sunfire9.

Then, we started globus-url-copy on sunfire11 to conduct a file transfer between two PVFS2

systems on the sunfire cluster. The command is as follows:

[xf4c@sunfire11 xf4c]$ $GLOBUS_LOCATION/bin/globus-url-copy -vb -dbg

-stripe ftp://sunfire1:50001/pvfs2/test_1G

ftp://sunfire10:50002/pvfs2/test_1G1 2>dbg1.txt

Fig. 5.19 shows the tcptrace outputs for sunfire6 through sunfire10, where we saw that connections were established between sunfire1 and sunfire10, sunfire2 and sunfire6, sunfire3 and sunfire7, sunfire4 and sunfire8, and sunfire5 and sunfire9. Hence, the data connections were established according to the sequences specified in the -r options for sunfire1 and sunfire10. Note that in Fig. 5.19, we omitted some TCP-connection information for sunfire6 through sunfire10 to save

space. These connections were not essential for the purpose of our experiment. They were either

for the communication between the PVFS2 metadata server and the PVFS2 I/O servers or for the

communication between the GridFTP front end and the GridFTP data nodes. Moreover, they only

contained a comparatively small number of packets. There were no other connections amongst the

PVFS2 I/O servers. In other words, each data node transfers only the data located on its local disk.

Thus, we minimized network-and-disk contention. We repeated the test many times and did not

find any exceptions to our original results.

Since we avoided unnecessary network-and-disk contention, we expected that the general-case

cluster solution would have a speedup of n (n=5 in our experiment) over normal GridFTP transfer

involving only a single source–sink pair. Surprisingly, we found that the cluster solution gained

only a small speedup. The reason for the poor performance is that PVFS2 had a much lower read–

write speed than NFS and Linux ext2 on sunfire. Thus, we need to continue working on PVFS2 or

try other parallel file systems (e.g., GPFS) to get a high read–write throughput.


[xf4c@sunfire6 xf4c]$ tcptracescript sunfire6.log
181171 packets seen, 181163 TCP packets traced
...
TCP connection info:
1: sunfire8.cs.Virginia.EDU:44786 - sunfire6.cs.Virginia.EDU:3334 (a2b) 1565> 796<
...
7: sunfire6.cs.Virginia.EDU:44721 - sunfire9.cs.Virginia.EDU:3334 (m2n) 2> 1<
8: sunfire2.cs.Virginia.EDU:58306 - sunfire6.cs.Virginia.EDU:56735 (o2p) 121641> 50070< (complete)
9: sunfire7.cs.Virginia.EDU:44734 - sunfire6.cs.Virginia.EDU:3334 (q2r) 1571> 791<
10: sunfire9.cs.Virginia.EDU:45156 - sunfire6.cs.Virginia.EDU:3334 (s2t) 1549> 789<

[xf4c@sunfire7 xf4c]$ tcptracescript sunfire7.log
176887 packets seen, 176879 TCP packets traced
...
9: sunfire3.cs.Virginia.EDU:57513 - sunfire7.cs.Virginia.EDU:56871 (q2r) 121617> 52921< (complete)
...

[xf4c@sunfire8 xf4c]$ tcptracescript sunfire8.log
155197 packets seen, 155189 TCP packets traced
...
17: sunfire4.cs.Virginia.EDU:57002 - sunfire8.cs.Virginia.EDU:56999 (ag2ah) 105821> 46770< (complete)
...

[xf4c@sunfire9 xf4c]$ tcptracescript sunfire9.log
181769 packets seen, 181760 TCP packets traced
...
10: sunfire5.cs.Virginia.EDU:56857 - sunfire9.cs.Virginia.EDU:56905 (s2t) 123475> 55980< (complete)

[xf4c@sunfire10 xf4c]$ tcptracescript sunfire10.log
177961 packets seen, 177954 TCP packets traced
...
7: sunfire1.cs.Virginia.EDU:44346 - sunfire10.cs.Virginia.EDU:58105 (m2n) 122541> 53132< (complete)
...

Figure 5.19: The tcptrace outputs for GridFTP striped transfer after we modified GridFTP code


5.5 The Specific Cluster Solution for TSI

As mentioned in Section 5.1, in the TSI project, scientists at NCSU need to download multi-TB datasets from the Cray X1E at ORNL to orbitty at the local site. These datasets are stored as separate 10 GB files on the Cray disks. We do not have permission to access the Cray directly. The

current file-transfer solutions, bbcp or LORS, use one intermediate hop to transfer the files to a

storage depot, TSILN, before moving them to orbitty. These solutions use only a single source and

sink to transfer data, and achieve a throughput of 200 Mb/s to 400 Mb/s.

We can improve the throughput by using a specific cluster solution as follows. Given that the

dataset is composed of many (e.g., about 200) separate files, we move these files from the Cray

X1E to five machines connected to CHEETAH, called zelda1 through zelda5. Then, we transfer the

files on CHEETAH circuits established between the five machines zelda1 through zelda5 and five

computing nodes of orbitty. Any file transfer tool can be used to carry out the transfers in parallel.

Fig. 5.20 shows the network configuration for this approach. This solution employs pipelining of

file movement between the Cray and the zelda hosts, and file movement between the zelda and

orbitty clusters. Since we have to move 200 files, but only have five hosts at each end, parallelism is

achieved at a file level rather than at a block level as described in the general cluster solution with

PVFS2 and GridFTP.

[Figure 5.20 is a network diagram. Recoverable labels: orbitty at NCSU (compute-0-0 through compute-0-19, disk-0-0 through disk-4-0, controller-0 (rudi), controller-1 (orbitty), monitoring host); zelda at ORNL (zelda1 through zelda5); Dell 5424 and Dell 5224 switches; CHEETAH; LAN; X1E at ORNL.]

Figure 5.20: The specific cluster solution for TSI

On a 1-Gb/s circuit between zelda5 at ORNL and compute-0-2 at NCSU, we achieved a disk–

to–disk throughput of 720 Mb/s using ftp. Thus, with five pairs of parallel independent transfers,

we expect an aggregate throughput of 3.6 Gb/s.
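The file-level parallelism and the expected aggregate rate can be illustrated with a small Python sketch. It assumes a round-robin assignment of files to host pairs, exactly 200 files with made-up names, and that each of the five pairs independently sustains the 720 Mb/s we measured on one circuit; none of these choices is prescribed by the solution itself.

```python
NUM_PAIRS = 5          # five sender/receiver host pairs (from the text)
PER_PAIR_MBPS = 720    # measured disk-to-disk rate on one 1-Gb/s circuit
NUM_FILES = 200        # "about 200" files; the exact count is an assumption

# Hypothetical file names, used only to make the assignment visible.
files = [f"file_{k:03d}" for k in range(NUM_FILES)]

# Round-robin assignment: file k goes to host pair k mod NUM_PAIRS.
assignment = {pair: [] for pair in range(NUM_PAIRS)}
for k, name in enumerate(files):
    assignment[k % NUM_PAIRS].append(name)

# Expected aggregate rate if the pairs do not interfere with each other.
aggregate_mbps = NUM_PAIRS * PER_PAIR_MBPS

for pair, names in assignment.items():
    print(f"pair {pair}: {len(names)} files, first is {names[0]}")
print(f"expected aggregate throughput: {aggregate_mbps / 1000} Gb/s")
```

Each pair receives 40 files, and the five independent transfers add up to 5 × 720 Mb/s = 3.6 Gb/s.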

5.6 Conclusions

In this chapter, we described the single-host and cluster-based solutions to achieve throughput

above 1 Gb/s over WANs. We reasoned that the hardware solution created by equipping end hosts

with high-speed hardware is feasible but neither scalable nor cost-effective. Then, we proposed a

general-case cluster solution, which uses PVFS2 and GridFTP to transfer data between multiple

end hosts in parallel. By requiring GridFTP servers to transfer only the data blocks located on their local disks, we minimize end-host network-and-disk contention. To achieve this, we modified the source code of PVFS2 to force a fixed data-block distribution, and changed the implementation of the GridFTP SPAS and SPOR commands. Finally, we presented a solution for fast file transfers in the TSI project. We measured a disk-to-disk throughput of 720 Mb/s on a single 1-Gb/s circuit; by reserving bandwidth and conducting transfers in parallel between five pairs of senders and receivers, we expect an aggregate throughput of 3.6 Gb/s.

Chapter 6

CONCLUSIONS AND FUTURE WORK

We summarize the thesis in this chapter. We also discuss the future work needed to advance our

present research.

6.1 Conclusions

In this thesis, we studied applications for optical circuit-switched networks. In Chapter 2, we

reviewed different types of GMPLS networks and reasoned that they are call-blocking networks

that only support immediate-request calls. We also described CHEETAH as an example of GMPLS

networks. Then, in Chapters 3 through 5, we concentrated on three topics on applications for

GMPLS networks.

First, in Chapter 3, we addressed an important question: what applications are suitable to run on

GMPLS networks to achieve both high utilization and low call-blocking probability? We presented

single-link bandwidth sharing models for two categories of applications: those for which the per-

circuit capacity and the holding time are independent, and those for which they are directly related

(e.g., file transfers). For the two categories of applications, we concluded that ideal applications on

GMPLS networks require bandwidth on the order of one-hundredth the link capacity as per-circuit

rates. The first category of applications should have long call-holding times to keep the number of line cards small. In contrast, the second category of applications needs to have short call-holding times (on the order of seconds).


Second, according to the conclusions in Chapter 3, we believe that web file transfers can use

CHEETAH efficiently. Thus, in Chapter 4, we designed and implemented a new web-based file-

transfer software package, called WebFT. We integrated CHEETAH end-host software APIs into

the WebFT package to provide CHEETAH-related services transparently to users. By leveraging

CGI, the WebFT package is completely independent of the web server and browser software, and

therefore, does not require any modifications to the latter. We also tested WebFT on CHEETAH and

our experimental results showed that WebFT can provide deterministic data services to CHEETAH

clients on dedicated end-to-end circuits.

Finally, in Chapter 5, we explained that TCP’s congestion-control algorithm and end-host lim-

itations made it hard to achieve a throughput above 1 Gb/s across long-RTT WANs. Then, we

described another parallel file-transfer application to overcome the two factors that limit through-

put. We used PVFS2 and GridFTP to implement a general-case cluster solution, where a source file

is not split. We also modified the PVFS2 and GridFTP code to avoid unnecessary end-host network-and-disk contention, and thus maximized throughput. Furthermore, for the TSI project, where

a source file is already split into many parts, we presented a specific cluster solution, which used

several pairs of parallel independent transfers to get multi-Gb/s throughput.

6.2 Future Work

We list several significant directions in which we would like to advance this study:

• Analytical models of GMPLS networks: We used single-link bandwidth sharing models to

analyze the suitability of applications in GMPLS networks. We assumed that there was only

a single class of applications sharing networks. We plan to extend the analytical models to

multiple classes based on the multi-class call-blocking model presented by Kaufman [28].

We also plan to extend our models to multiple links and then to network models by referring

to the work done by Ramesh et al. [40] and Li et al. [30].

• Web transfer application on CHEETAH: Currently, only hosts directly connected to

CHEETAH can use WebFT to improve web performance. We plan to design and imple-


ment a web application using partial-path circuits such that non–CHEETAH hosts can also

use CHEETAH. We will use the proxy software, Squid [47], to break up a long-distance con-

nectionless path into a partial circuit through CHEETAH, and two low-RTT connectionless

sub-paths. Using this approach, we can avoid congested connectionless links and reduce RTT.

Thus, non–CHEETAH hosts can use CHEETAH to improve web performance. In addition,

we can leverage web caching protocols provided by Squid to further improve web perfor-

mance. We will also extend our partial-path circuit models to include other CO networks and

reduce RTT on a national or even global scale.

• Parallel file transfers on CHEETAH: We will test the general-case cluster solution on

CHEETAH. We will work on PVFS2 or try GPFS to overcome the barrier of low I/O throughput caused by end hosts. For the TSI project, if we can directly access the Cray, we will remove the intermediate step that moves data from the Cray to zelda. We will apply the general-case cluster solution directly to a single-step file transfer between the Cray and orbitty.

Bibliography

[1] ALLCOCK, W. GridFTP: Protocol extensions to FTP for the Grid. Global Grid Forum Rec-

ommendation GFD.20, Mar. 2003.

[2] ALLCOCK, W., BRESNAHAN, J., KETTIMUTHU, R., LINK, M., DUMITRESCU, C., RAICU,

I., AND FOSTER, I. The Globus striped GridFTP framework and server. In Proceedings of

Super Computing 2005 (Nov. 2005).

[3] AWDUCHE, D., BERGER, L., GAN, D., LI, T., SRINIVASAN, V., AND SWALLOW, G.

RSVP-TE: Extensions to RSVP for LSP tunnels. RFC 3209, Dec. 2001.

[4] BAKER, M., AND FENG, W. 10-Gigabit Ethernet helps relieve network bottlenecks for

bandwidth-intensive applications. Dell Power Solutions (Mar. 2004), 113–116.

[5] BARCLAY, T., CHONG, W., AND GRAY, J. A quick look at Serial ATA (SATA) disk perfor-

mance. Technical Report MSR-TR-2003-70, Oct. 2003.

[6] bbcp. http://www.slac.stanford.edu/~abh/bbcp/ .

[7] BELL, E., SMITH, A., LANGILLE, P., RIJHSINGHANI, A., AND MCCLOGHRIE, K. Defini-

tions of managed objects for bridges with traffic classes, multicast filtering and virtual LAN

extensions. RFC 2674, Aug. 1999.

[8] BRADEN, R., ZHANG, L., BERSON, S., HERZOG, S., AND JAMIN, S. Resource ReSerVation Protocol (RSVP)-version 1 functional specification. IETF RFC 2205, Sept. 1997.


[9] BRESLAU, L., CAO, P., FAN, L., PHILLIPS, G., AND SHENKER, S. Web caching and zipf-

like distributions: Evidence and implications. In Proceedings of IEEE INFOCOM’99 (Mar.

1999).

[10] BREWER, J., AND SEKEL, J. PCI Express technology. Dell white paper, Feb. 2004.

[11] CANARIE’s CA*net 4. http://www.canarie.ca/canet4/index.html .

[12] CARNS, P. H., LIGON III, W. B., ROSS, R. B., AND THAKUR, R. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference (Atlanta, GA, Oct. 2000), pp. 317–327.

[13] CHEETAH. http://cheetah.cs.virginia.edu .

[14] CROVELLA, M., AND BESTAVROS, A. Self-similarity in World Wide Web traffic: Evidence and possible causes. IEEE/ACM Transactions on Networking 5, 6 (Dec. 1997).

[15] The Energy Sciences Network (ESnet). http://www.es.net/ .

[16] FANG, X., ZHENG, X., AND VEERARAGHAVAN, M. Improving Web performance through

new networking technologies. In IEEE ICIW’06 (Feb. 2006).

[17] FLORESCU, D., VALDURIEZ, P., YAGOUB, K., AND ISSARNY, V. Caching strategies for

data-intensive Web sites. In Proceedings of the International Conference on Very Large Data

Bases (VLDB) (Sept. 2000).

[18] FOSTER, I., AND KESSELMAN, C. A metacomputing infrastructure toolkit. IEEE Commun.

Mag. 11(2) (1997), 115–128.

[19] GARZOTTO, F. Ubiquitous Web applications. In Proceedings of the 5th East European Conference on Advances in Databases and Information Systems (Springer-Verlag, London, Sept. 2001).

[20] The Globus Alliance. http://www.globus.org/ .


[21] General Parallel File System (GPFS). http://www-1.ibm.com/servers/eserver/clusters/software/gpfs.html .

[22] GUOK, C. ESnet On-demand Secure Circuits and Advance Reservation System (OSCARS).

http://www.es.net/oscars/index.html .

[23] HURWITZ, J., AND FENG, W. End-to-end performance of 10-Gigabit Ethernet on commodity

systems. IEEE Micro 24, 1 (2004).

[24] HWANG, S.-Y., AND RIDDLE, R. Bandwidth Reservation for User Work (BRUW), May

2003.

[25] Virtual bridged Local Area Networks, May 2003.

[26] Internet2. http://www.internet2.net .

[27] KATZ, D., KOMPELLA, K., AND YEUNG, D. Traffic engineering (TE) extensions to OSPF

version 2. RFC 3630, Sept. 2003.

[28] KAUFMAN, J. S. Blocking in a shared resource environment. IEEE Transactions on Commu-

nications 29 (Oct. 1981), 1474–1481.

[29] LANG, J. Link Management Protocol (LMP). IETF RFC 4204, Oct. 2005.

[30] LI, C. Y., WAI, P. K. A., AND LI, V. O. K. The decomposition of a blocking model for

connection-oriented networks. IEEE/ACM Trans. Netw. 12, 3 (2004), 549–558.

[31] Logistical Runtime System (LoRS). http://loci.cs.utk.edu/lors/ .

[32] MELTZER, K., AND MICHALSKI, B. Writing CGI Applications with Perl. Addison-Wesley,

Reading, MA, 2001.

[33] MUDAMBI, P., ZHENG, X., AND VEERARAGHAVAN, M. A transport protocol for dedicated

end-to-end circuit. In IEEE ICC2006 (June 2006).

[34] OMNInet. http://www.icair.org/omninet/ .


[35] PATTERSON, D. A., GIBSON, G. A., AND KATZ, R. H. A case for redundant arrays of

inexpensive disks (RAID). In Proceedings of the International Conference on Management

of Data (SIGMOD) (June 1988).

[36] POSTEL, J., AND REYNOLDS, J. File Transfer Protocol (FTP). IETF RFC 959, Oct. 1985.

[37] The Parallel Virtual File System project. http://www.parl.clemson.edu/pvfs/ .

[38] PVFS2 DEVELOPMENT TEAM. Parallel Virtual File System, version 2 (PVFS2). http://www.pvfs.org/pvfs2/pvfs2-guide.html , Sept. 2003.

[39] Parallel Virtual File System, version 2 (PVFS2). http://www.pvfs.org/pvfs2/ .

[40] RAMESH, S., ROUSKAS, G. N., AND PERROS, H. G. Computing blocking probabilities in

multi-class wavelength routing networks with multicast calls. IEEE Journal on Selected Areas

in Communications 20 (Jan. 2002), 89–96.

[41] RAO, N. S. V., WING, W. R., CARTER, S. M., AND WU, Q. Ultrascience net: Network

testbed for large-scale science applications. IEEE Commun. Mag. 43, 11 (Nov. 2005), 12–17.

[42] ROSEN, E., VISWANATHAN, A., AND CALLON, R. Multiprotocol label switching architec-

ture. RFC 3031, Jan. 2001.

[43] ROSS, R. B., CARNS, P. H., LIGON III, W. B., AND LATHAM, R. Using the Parallel Virtual File System. http://www.parl.clemson.edu/pvfs/user-guide.html , July 2002.

[44] SCHWARTZ, M. Telecommunication networks: protocols, modeling and analysis. Addison-

Wesley, Boston, MA, 1986.

[45] SHIOMOTO, K., PAPADIMITRIOU, D., ROUX, J.-L. L., VIGOUREUX, M., AND BRUN-

GARD, D. Requirements for GMPLS-based multi-region and multi-layer networks

(MRN/MLN). IETF Internet Draft, Oct. 2005.

[46] SOBIESKI, J., LEHMAN, T., AND JABBARI, B. Dynamic Resource Allocation via GMPLS

Optical Networks (DRAGON). http://dragon.east.isi.edu/ .


[47] Squid. http://www.squid-cache.org/ .

[48] SUN MICROSYSTEMS INC. NFS: Network File System protocol specification. IETF RFC

1094, Mar. 1989.

[49] SURFnet. http://www.surfnet.nl/info/en/home.jsp .

[50] TANENBAUM, A. S. Computer Networks, fourth ed. Prentice Hall PTR, Upper Saddle River,

New Jersey, 2002.

[51] Tcpdump public repository. http://www.tcpdump.org .

[52] Tcptrace – Official Homepage. http://jarok.cs.ohiou.edu/software/tcptrace/ .

[53] Tekram Systems Co., Ltd. http://www.tekram.com/ .

[54] TSI. http://www.phy.ornl.gov/tsi/ .

[55] UKLight. http://www.uklight.ac.uk/ .

[56] VEERARAGHAVAN, M., AND KAROL, M. Internetworking connectionless and connection-

oriented networks. IEEE Commun. Mag. (Dec. 1999), 130–138.

[57] VEERARAGHAVAN, M., ZHENG, X., LEE, H., GARDNER, M., AND FENG, W. CHEETAH:

Circuit-switched High-speed End-to-End Transport Architecture. In Proc. of Opticomm 2003

(Dallas, TX, Oct. 2003).

[58] WANG, H., VEERARAGHAVAN, M., KARRI, R., AND LI, T. Design of a high-performance

RSVP-TE signaling hardware accelerator. IEEE JSAC 23, 8 (Aug. 2005), 1588–1595.

[59] ZHU, X., ZHENG, X., VEERARAGHAVAN, M., LI, Z., SONG, Q., HABIB, I., AND RAO, N.

S. V. Implementation of a GMPLS-based network with end host initiated signaling. In IEEE

ICC2006 (June 2006).
