
Cloud-Scale Data Center Network Architecture

Cheng-Chun Tu
Advisor: Tzi-cker Chiueh

September 10, 2011


Abstract

A cloud-scale data center network imposes unique requirements that differ from those of the traditional network architecture, which is based on a combination of Layer 2 Ethernet switches and Layer 3 routers. The state of the art shows that today's Layer 3 plus Layer 2 model brings significant configuration overhead and fails to meet several critical requirements of virtualized data centers. Because Ethernet offers a high performance-to-cost ratio and ease of configuration, we argue that it is desirable to build the cloud-scale data center network relying only on Ethernet technology.

The ITRI (Industrial Technology Research Institute, Taiwan) container computer is a modular computer designed to be a building block for constructing cloud-scale data centers. Rather than using a traditional data center network architecture, the ITRI container computer's internal interconnection fabric, called Peregrine, is specially architected to meet the scalability, fast fail-over, and multi-tenancy requirements of these data centers. Peregrine is an all-Layer 2 network designed to support up to one million Layer 2 end points, provide quick recovery from any single network link/device failure, and incorporate dynamic load-balancing routing to make the best use of all physical network links. In addition, Peregrine features a unique private IP address reuse mechanism that allows virtual machines assigned the same IP address to run on it simultaneously without interfering with one another. Finally, the Peregrine architecture is implementable using only off-the-shelf commodity Ethernet switches. This report describes the design and implementation of a fully operational Peregrine prototype, which is built on a folded Clos physical network topology, and the results and analysis of a performance evaluation study based on measurements taken on the prototype.


Contents

1 Introduction
  1.1 Characteristics of Cloud-Scale Data Centers
    1.1.1 Scale-out model
    1.1.2 Virtualization
    1.1.3 Multi-tenancy
  1.2 Requirements on Cloud-Scale Data Center Networks
    1.2.1 Any-to-any connectivity with non-blocking fabric
    1.2.2 Virtual machine mobility
    1.2.3 Fast fail-over
    1.2.4 Support for Multi-tenancy
    1.2.5 Load balancing routing

2 Current Data Center Network Architecture
  2.1 Hybrid design: Layer 2 plus Layer 3
  2.2 Limitations of standard Ethernet
    2.2.1 Revisiting the classic Ethernet
    2.2.2 Scalability issues of Ethernet
  2.3 Mapping the L2 + L3 design to cloud-scale requirements

3 All Layer 2 Network
  3.1 Design Issues
  3.2 Standards and Industrial Solutions
    3.2.1 Link Aggregation Protocols
    3.2.2 ECMP: Equal-Cost Multi-Path
    3.2.3 TRILL and RBridge
    3.2.4 802.1aq: Shortest Path Bridging
    3.2.5 Cisco FabricPath
    3.2.6 Brocade VCS
    3.2.7 Juniper QFabric
    3.2.8 OpenFlow
  3.3 Academic Solution
    3.3.1 PortLand
    3.3.2 VL2
    3.3.3 Monsoon

4 Peregrine: An All-Layer-2 Container Computer Network Architecture
  4.1 Introduction
  4.2 ITRI Container Computer
  4.3 Two-Stage Dual-Mode Packet Forwarding
  4.4 Fast Fail-Over
  4.5 Load Balancing Routing

5 Peregrine Implementation and Performance Evaluation
  5.1 Prototype Implementation
  5.2 Network Initialization
  5.3 Effectiveness of Load Balancing Routing
  5.4 Packet Forwarding Performance
  5.5 Fail-over Latency
  5.6 Conclusion


Chapter 1

Introduction

A cloud-scale data center is a facility where a large number of computer systems and associated components are housed together. A cloud-scale data center provides power management, cooling, a data communication infrastructure, management interfaces, and security. In recent years, cloud-scale data centers have been built to provide environments for a variety of applications that handle the core business and critical operational data of companies. Business applications such as on-line financial transaction processing, multimedia content delivery, and computationally intensive workloads are critical to business revenue. Companies rely heavily on the data center because it provides centralized control and management for business applications. As such, the data center is a key component that needs to be carefully designed to meet growing performance requirements.

1.1 Characteristics of Cloud-Scale Data Centers

1.1.1 Scale-out model

Cloud-scale data centers require a high-performance network interconnect at the scale of tens of thousands of servers. To connect such a large number of hosts, traditional data centers form a tree-like physical topology with progressively more expensive and specialized high-end devices when moving up the network hierarchy. This model is called the scale-up design, and it suffers from limited scalability. For example, communication between hundreds of racks requires a high-capacity backplane, which is usually provided by the highest-end IP switches/routers. The existence of a few critical components in the cloud-scale data center demands extra-high reliability, periodic upgrades, and great maintenance effort. Over the years, the scale-up model has been replaced by the scale-out model, which aggregates a large number of low-cost commodity devices to achieve the same functionality provided by a few expensive, specialized components.


Optimization is no longer implemented in a few specific components but rather realized as a system-wide design. The use of commoditized devices brings the flexibility that a deployment can easily scale up to a large number of nodes or shrink to individual needs. Because of the large scale of data centers, the cost difference between using commodity devices and non-commodity, high-end devices can amount to billions of dollars for cloud service providers [2, 26].

1.1.2 Virtualization

Virtualization has proven to be a good solution for providing low-cost hosting in the cloud-scale data center. Machines running virtualization software can easily consolidate multiple applications/OSes from a variety of vendors onto a single physical server. While this trend seems to be confined to the server side, it has a direct influence on the underlying network. The following lists the characteristics of the cloud-scale data center network that arise from virtualization.

1. Unpredictable bandwidth requirements: With a typical virtualized server having more than four network interface cards, the density of virtual machines per physical machine becomes higher. The adoption of virtual machine migration software, e.g., VMware vMotion, to dynamically move servers on demand across the network greatly increases the volume of network traffic. Moreover, hypervisors deployed in the data center, such as VMware's, usually require a broader Layer 2 domain, which is incompatible with the traditional data center network design. Dynamic provisioning and virtual machine migration make it very difficult to determine where traffic is coming from and where it is going. With an increasing number of virtual machines running on and moving between different physical servers, application traffic becomes unpredictable and congestion becomes common.

2. Security policy enforcement: In the cloud-scale, virtualized data center, operators need to maintain policy changes and manage configurations across devices. The conventional design applies physical separation to protect or isolate different data center elements. For example, the tiers of the multi-tier model - web server, application, and database - are often separated physically at different locations. Today, all the tiers may run on one single physical machine. As a result, virtualization breaks down the multi-tier security model. Moreover, while migrating a virtual machine from one server to another becomes easy, there are often dependent devices that need to be reconfigured as well, such as firewalls and intrusion prevention systems.

3. Management overhead: The boundary between the network, storage, and security teams becomes blurred. For example, when moving virtual machines from one server to another, application bandwidth requirements and security policies need to be properly reconfigured at multiple network devices.


These include routers, switches, and load balancers. Moreover, the hypervisor comes with a software-implemented switch or virtual bridge, which runs on the server side, hidden from network management. Managing these elements becomes an issue, and network management tools today are only beginning to support the concept of virtualization.

1.1.3 Multi-tenancy

One important characteristic of cloud computing is that most information technology users no longer need to own their hardware and software infrastructure. They either pay for their IT infrastructure usage on demand or get it for free from the cloud service providers. In this new ecology, the cloud service providers, who own the cloud-scale data centers, must consolidate the physical data center infrastructure (PDC) and dynamically create multiple illusions of resources assigned to each individual tenant. The basic resource abstraction for a tenant is a virtual data center (VDC), which consists of

• One or multiple virtual machines, each equipped with a virtual CPU, a specific amount of physical memory, and a virtual disk of a specific capacity.

• A guaranteed minimum network bandwidth to the Internet.

• A set of management policies and configurations, including firewall, network configuration, disk backup policy, etc.

• The binary images for the OS and applications to be deployed on the virtual machines.

Tenants either fully rely on the VDC as their IT infrastructure or partially integrate their on-premise data centers with the cloud VDC. The latter requires a seamless integration of the remote resources in the cloud and the local resources hosted by the tenant.
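
To make the VDC abstraction above concrete, the following minimal sketch (in Python, with entirely hypothetical field names and values not taken from this report) shows how a tenant's VDC request might be represented before a provisioning system maps it onto physical resources.

```python
from dataclasses import dataclass, field
from typing import List, Dict

@dataclass
class VMSpec:
    vcpus: int          # number of virtual CPUs
    memory_mb: int      # physical memory reserved for the VM
    disk_gb: int        # virtual disk capacity
    image: str          # binary image for the OS and applications

@dataclass
class VDCSpec:
    tenant: str
    vms: List[VMSpec]
    min_internet_bw_mbps: int  # guaranteed minimum bandwidth to the Internet
    policies: Dict[str, str] = field(default_factory=dict)  # firewall, backup, etc.

# Example (hypothetical) request for one tenant.
vdc = VDCSpec(
    tenant="tenant-A",
    vms=[VMSpec(vcpus=2, memory_mb=4096, disk_gb=100, image="ubuntu-10.04.img")],
    min_internet_bw_mbps=100,
    policies={"firewall": "allow tcp/80,tcp/443", "backup": "daily"},
)
```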

Having identified the characteristics of and challenges for the cloud-scale data center network, in the next section we list the general requirements for a data center network architecture and map these requirements to the various solutions presented in Chapter 2.

1.2 Requirements on Cloud-Scale Data Center Networks

A cloud-scale data center network architecture needs to support a wide range of application types, each with different requirements. A large number of servers and applications, increasing management complexity, and high availability and reliability requirements are all key issues a data center framework needs to address. Before embarking on the design of the network architecture, we carefully reviewed the related research literature, studied possible use cases, and came up with the following requirements:


1.2.1 Any-to-any connectivity with non-blocking fabric

Servers are typically equipped with multiple NICs to provide high aggregate throughput for applications running on VMs. With the unprecedented bandwidth requirements of today's applications, the underlying physical topology should guarantee rich connectivity between the NICs of the physical machines and should have no or low oversubscription, where oversubscription refers to the ratio of the aggregate bandwidth that end hosts can demand to the aggregate bandwidth actually provisioned through the network. In other words, 1:1 oversubscription means any arbitrary host in the data center can communicate with any other host at the full bandwidth of its network interface. Rich connectivity also means having multiple candidate paths between two end points so that routing protocols can pick the optimal path, which implies a lower chance of application performance degradation caused by network congestion and/or device failures. The ultimate goal is to create a logically single switch with a non-blocking internal fabric that scales up to the requirements of a cloud-scale data center: 100,000 or more ports and approaching one million virtual machines.
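
As a back-of-the-envelope illustration of the oversubscription ratio defined above (a minimal sketch; the 40-server, 4-uplink numbers echo the example given later in Section 2.3, everything else is hypothetical):

```python
def oversubscription(num_hosts: int, host_bw_gbps: float,
                     num_uplinks: int, uplink_bw_gbps: float) -> float:
    """Ratio of bandwidth hosts can demand to bandwidth provisioned upstream."""
    demand = num_hosts * host_bw_gbps          # worst-case traffic leaving the rack
    capacity = num_uplinks * uplink_bw_gbps    # what the uplinks can actually carry
    return demand / capacity

# 40 servers with 1G NICs behind a ToR switch that has four 1G uplinks:
print(oversubscription(40, 1, 4, 1))   # -> 10.0, i.e., 10:1 oversubscribed
# A non-blocking (1:1) rack would need uplink capacity equal to host capacity:
print(oversubscription(40, 1, 40, 1))  # -> 1.0
```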

1.2.2 Virtual machine mobility

Virtual machine (VM) migration is one of the strategies for dynamically reassigning VMs to PMs, driven by the cloud resource provisioning system. To maximize efficiency, a virtual machine should be able to migrate transparently to any physical machine in the data center for load balancing or energy saving purposes. More specifically, transparent migration means that the IP address of the VM and its security policies remain unchanged, without affecting any service running on the VM after migration. The cloud-scale network architecture should therefore create the illusion of a single Layer 2 network as an unrestricted VM migration domain.

1.2.3 Fast fail-over

With the trend of using commodity hardware [2], failures in the cloud-scale data center will be common. In the conventional design, detecting failures and recovering from them depends on Layer 2 or Layer 3 routing protocols, e.g., IS-IS and OSPF, which are broadcast-based and require seconds to recover. For example, the spanning tree protocol (STP) periodically probes the network state and takes tens of seconds to recalculate the tree when links or switches fail. However, a variety of network- and data-intensive applications hosted in the data center, e.g., on-line financial transaction processing, execute in milliseconds and definitely cannot tolerate a network meltdown of several seconds. The fail-over latency includes detecting the failure event, reporting it to the management server, possibly some root cause analysis, and finally taking the recovery action. We expect the cloud-scale data center network to provide a fast recovery mechanism on the scale of milliseconds to minimize the impact on application performance.


1.2.4 Support for Multi-tenancy

Cloud service providers offer their resources to customers in an on-demand fashion, and customers use cloud resources to dynamically extend their existing IT infrastructure. Multi-tenancy support for a cloud-scale data center means that the cloud infrastructure can be shared by different customers (tenants) on top of a large pool of physical resources. More specifically, the physical resources should be able to be partitioned into multiple logical, independent network resources that can be dynamically allocated to each customer. Moreover, the cloud service provider should offer a flexible and seamless deployment approach for customers to integrate or extend their on-premise IT infrastructure into the cloud. The most straightforward way to integrate a cloud-based virtual data center and an on-premise physical data center (PDC) into a seamless data center is to ensure they share the same IP address space, i.e., IP addresses are allocated and reclaimed by a single entity, and to connect them with a pair of properly configured VPN gateways.

1.2.5 Load balancing routing

Designing effective routing strategies for a cloud-scale data center depends on understanding its traffic patterns. The authors of [12] collected socket-level logs from 1,500 servers and analyzed one month of data. They identify two traffic patterns, named Work-Seeks-Bandwidth and Scatter-Gather, from servers supporting MapReduce-style jobs as well as a distributed replicated block storage layer for persistent storage. Another study [7] indicates that the variability in data center traffic is not amenable to concise summarization and hence engineering routes for just a few giant flows from traffic metrics is unlikely to work well. While it is still too early to claim whether data center network traffic has predictable patterns, there are a few important considerations that the routing protocols should take into account:

1. Topology awareness: Cloud-scale data centers usually deploy a mesh-like topology with multiple redundant links interconnecting the ToR, aggregation, and core layers. The routing algorithm should understand the topology and efficiently use the rich connectivity.

2. Multipathing: The routing protocol should be able to establish multiple paths between two end hosts to increase the aggregate bandwidth and avoid link congestion at particular hot spots.

3. Many-to-one/many-to-many traffic patterns: With the massive deployment of MapReduce-like applications in the cloud, the routing protocols should preferably be optimized for this special many-to-one/many-to-many traffic pattern.

4. Traffic engineering: Since every workload in the data center is controlled by the cloud service provider, it is practical for the cloud IT staff to measure and analyze the traffic characteristics.


By classifying different requirements, e.g., low latency or high throughput, the routing protocol can engineer routes according to individual needs.

5. Resilience: The routing protocol should be resilient to any change in the topology and respond in time to the various events that affect the current routing decisions.


Chapter 2

Current Data Center Network Architecture

Given the above requirements, this chapter first presents the design of the conventional data center network, which is a Layer 2 plus Layer 3 design. We then map the requirements onto the conventional design and discuss the resulting issues and their causes.

2.1 Hybrid design: Layer 2 plus Layer 3

Ethernet has become one of the most popular LAN technologies in many environments, including enterprise networks, campus networks, and data centers. Even Internet service providers use Ethernet in their backbone networks to carry traffic between multiple sites. Because of Ethernet's high performance-to-cost ratio and ease of configuration, almost every computer system today is equipped with one or more Ethernet network interface cards, and Ethernet has become the dominant networking technology. To achieve ease of configuration, or the plug-and-play property, one of Ethernet's fundamental design decisions is to use a broadcast model for querying and locating specific services. For example, the Address Resolution Protocol (ARP) [21] uses broadcast to discover the mapping from a target IP address to its MAC address, and the DHCP protocol depends on broadcast to locate the DHCP server, which then assigns an IP address configuration to the client. Although the broadcast model brings many conveniences, it restricts a LAN to only hundreds of hosts [17] and thus does not scale to a large network.

To deal with Ethernet's limited scalability, a large network today is composed of multiple Ethernet LANs interconnected by IP routing, the so-called Layer 2 plus Layer 3 (L2+L3) solution. In this design, the size of an Ethernet LAN is usually restricted to a few hundred hosts and each LAN forms an IP subnet. An IP subnet is a subset of the network identified by an IP prefix representing the network identification. Each host in the subnet is assigned a host number.


[Figure: a hierarchical L2+L3 topology in which racks of servers connect to ToR switches, ToR switches connect to aggregation switches and load balancers within a Layer 2 network, and access routers connect the Layer 2 domains through core routers to the Internet in a Layer 3 network.]

Figure 2.1: Layer 2 plus Layer 3 hybrid network architecture design for data centers. (ToR: Top-of-Rack switch, AS: aggregation L2 switch, LB: load balancer, AR: access L3 router, CR: core L3 router)

The host number combined with the IP prefix is the host's IP address. A router typically contains many interfaces, and each interface is associated with an IP subnet. The information for determining the outgoing interface from an IP prefix is maintained in the routing table, a data structure that maps IP prefixes to outgoing interfaces.
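
As a minimal sketch of the routing-table lookup just described (the prefixes and interface names below are made up for illustration), an IP router picks the outgoing interface whose prefix most specifically matches the destination address:

```python
import ipaddress

# Hypothetical routing table: IP prefix -> outgoing interface.
routing_table = {
    ipaddress.ip_network("10.1.0.0/16"): "eth0",    # a /16 subnet behind eth0
    ipaddress.ip_network("10.1.2.0/24"): "eth1",    # a more specific /24 behind eth1
    ipaddress.ip_network("0.0.0.0/0"): "uplink",    # default route toward the core
}

def lookup(dst: str) -> str:
    """Longest-prefix match: among all matching prefixes, the longest wins."""
    addr = ipaddress.ip_address(dst)
    matches = [(net.prefixlen, intf)
               for net, intf in routing_table.items() if addr in net]
    return max(matches)[1]

print(lookup("10.1.2.7"))   # -> eth1 (the /24 is more specific than the /16)
print(lookup("10.1.9.9"))   # -> eth0
print(lookup("8.8.8.8"))    # -> uplink (only the default route matches)
```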

Figure 2.1 shows the conventional L2+L3 network architecture for a large campus or data center network. The network is a hierarchy from the core routers at the top down to the ToR (Top-of-Rack) switches and servers. There are usually 20 to 40 servers per rack, and each server is equipped with multiple NICs connected to ToR switches. ToR switches connect to the aggregation switches (AS), through which server-to-server traffic crossing racks flows. Firewall and server load balancing techniques are applied at this layer to optimize the network and secure applications. The bottom left of the figure forms a single Layer 2 domain. In order to scale to a large number of nodes, another layer of the network, Layer 3 routing, is deployed to interconnect multiple Layer 2 domains. The access routers (AR) connect to the aggregation switches (AS) downstream and to the core routers (CR) for traffic coming from and going to the Internet.


2.2 Limitations of standard Ethernet

2.2.1 Revisiting the classic Ethernet

Ethernet, standardized as IEEE 802.3, is a family of frame-based network technologies for local area networks (LANs). A LAN consists of multiple Ethernet bridges and hosts. Each host in an Ethernet is assigned a unique 48-bit MAC (Media Access Control) address. An Ethernet bridge connects multiple hosts and bridges to form a multi-hop network and maintains a data structure called the forwarding table, a map from destination MAC addresses to outgoing ports on the bridge. When a frame arrives on a particular port, the switch automatically associates that port with the source MAC address in the frame, a process named source port learning. The bridge then forwards packets by looking up the forwarding table using the packet's destination MAC address to decide the outgoing port. If the destination MAC address is not present in the table, the bridge broadcasts the packet to all ports except the receiving port, resulting in a domain-wide flood.
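
The following minimal sketch (not taken from this report; port numbers and MAC strings are hypothetical) illustrates the source-learning and flood-on-miss behavior just described:

```python
class LearningBridge:
    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.fwd_table = {}  # MAC address -> port learned from source frames

    def receive(self, in_port: int, src_mac: str, dst_mac: str):
        # Source port learning: remember which port this source lives behind.
        self.fwd_table[src_mac] = in_port
        if dst_mac in self.fwd_table:
            return [self.fwd_table[dst_mac]]          # known: forward to one port
        # Unknown destination (or broadcast): flood to all ports except ingress.
        return [p for p in range(self.num_ports) if p != in_port]

bridge = LearningBridge(num_ports=4)
print(bridge.receive(0, "aa:aa", "bb:bb"))  # dst unknown -> flood to [1, 2, 3]
print(bridge.receive(1, "bb:bb", "aa:aa"))  # aa:aa was learned on port 0 -> [0]
```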

Another cause of flooding is a frame sent to the broadcast MAC address, i.e., ff:ff:ff:ff:ff:ff. Ethernet bridges, if not properly connected, suffer from broadcast storms caused by loops in the physical topology. Unlike IP packets, Ethernet frames do not carry a TTL (Time To Live) field. When a broadcast frame enters an Ethernet with a loop in its topology, the frame is repeatedly replicated and forwarded to other bridges. This generates an unbounded number of frames in the LAN and blocks all other network traffic, resulting in a network meltdown.

The IEEE 802.1D STP (Spanning Tree Protocol) aims to solve the loop problem. Given an arbitrary network topology, the bridges running STP coordinate among themselves and form a unique tree. STP first automatically elects a root bridge as the root node of the tree and then collectively computes the spanning tree by calculating each bridge's distance to the root. STP converges when every link is either in the forwarding state or in the blocking state, and the bridges are then only allowed to forward frames on ports in the forwarding state. Coupled with the broadcast-based delivery model, this design gives Ethernet one of its most attractive features: plug-and-play simplicity. When Ethernet switches and hosts are connected together, Ethernet is able to discover the topology automatically and learn host addresses and locations on the network with little configuration.
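
A minimal sketch of the idea behind STP (a simplification, not the actual 802.1D message exchange): elect the bridge with the lowest ID as the root, keep only shortest-path links toward the root, and block everything else.

```python
from collections import deque

def spanning_tree(adjacency: dict):
    """adjacency: bridge id -> set of neighboring bridge ids (undirected)."""
    root = min(adjacency)                 # 802.1D elects the lowest bridge ID
    parent, visited = {}, {root}
    queue = deque([root])
    while queue:                          # BFS builds shortest-hop paths to root
        node = queue.popleft()
        for nbr in sorted(adjacency[node]):
            if nbr not in visited:
                visited.add(nbr)
                parent[nbr] = node
                queue.append(nbr)
    tree = {frozenset((child, par)) for child, par in parent.items()}
    all_links = {frozenset((a, b)) for a in adjacency for b in adjacency[a]}
    blocked = all_links - tree            # redundant links end up blocked
    return tree, blocked

# A four-switch topology with loops, loosely in the spirit of Figure 2.2.
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
tree, blocked = spanning_tree(adj)
print(sorted(tuple(sorted(l)) for l in tree))     # forwarding links
print(sorted(tuple(sorted(l)) for l in blocked))  # blocked (unused) links
```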

2.2.2 Scalability issues of Ethernet

This enchanting plug-and-play property of Ethernet has proven successful over the past decades. Unfortunately, it does not come without a cost. We now discuss the fundamental design model of Ethernet and why it does not meet the requirements of a cloud-scale data center.


[Figure: a four-switch topology (SW1-SW4) containing loops; with STP enabled, the redundant links are blocked so that the active topology is equivalent to a single tree rooted at SW1.]

Figure 2.2: Classic STP loop prevention in Ethernet.

Limited forwarding table size

When a frame arrives, an Ethernet switch determines the outgoing interface by looking up the forwarding table. The forwarding table contains one entry per destination MAC address. Each entry is associated with an aging time; once it expires, the entry becomes invalid and can be reused for a new entry. However, under heavy load with a large and diverse set of hosts communicating with each other, the forwarding table may fill up and be unable to hold new entries. In that case, incoming frames whose destination MAC addresses are not present in the table are flooded to all ports, causing traffic storms. So why not increase the size of the forwarding table? The reason the forwarding table does not grow with the size of the network is cost. Traditionally the table is stored in Content Addressable Memory (CAM), a specialized hardware device that uses the stored contents as the key for retrieving the data associated with those contents. CAM provides very fast table lookups but is expensive (4-5 times as much as conventional RAM of the same size) and has limited storage capacity.

As the data center network grows, the number of distinct MAC addresses traversing a switch explodes. In a data center, servers equipped with more than four Ethernet cards are prevalent; moreover, with servers running hypervisors, each virtual machine is associated with a globally unique MAC address. Even worse, the adoption of virtual machine migration makes it impossible for the network administrator to provision the number of entries a particular switch should maintain in its table. As a result, the number of MAC addresses can grow rapidly in a short time at unpredictable locations, degrading overall network performance.

STP as a solution to loop prevention

As discussed before, Ethernet suffers from broadcast storms caused by loops, and STP is adopted to solve this problem. STP detects and breaks loops by blocking the redundant links, resulting in an underutilized network. Figure 2.2 shows an example. On the left, the topology contains loops; without STP, a broadcast packet could cause disastrous infinite looping of frames. With STP enabled, the protocol first automatically elects a root bridge, in this case SW1.


It then discovers all the redundant links that need to be blocked in the topology. The gray dashed lines in the figure represent links in the blocking state, meaning no data packet can pass through them. The resulting topology is a single tree (rightmost figure) without the looping problem, but it suffers from the following drawbacks:

1. Single tree for all traffic

2. Single path for unicast and multicast

3. 50% of bandwidth unused

Slow fail-over

When links fail in an STP-enabled network, the spanning tree needs to be rebuilt. Network traffic during this unconverged period is discarded, forcing network services to pause. Broadcast storms might also happen during this time and cause the network to shut down. The convergence time of STP ranges from a few seconds to several minutes, depending on the size of the network and the protocol variant. In a data center environment, this can cause serious problems. Although the Rapid Spanning Tree Protocol (RSTP) provides a shorter convergence time by recalculating only a subset of the links in the tree, in a mesh-like network RSTP still blocks a large portion of the links. The authors of [14] built a simulator for RSTP and evaluated its behavior under a mesh topology. They showed that RSTP takes multiple seconds to converge on a new spanning tree and concluded that RSTP does not scale to a large data center.

Broadcast overhead

Protocols based on broadcast introduce overhead to the data center network. Ethernet uses broadcast as the control messaging mechanism for higher-layer protocols such as DHCP and ARP (Address Resolution Protocol) [21]. In Ethernet, to find the destination MAC address of the receiving end, the source first sends an ARP broadcast querying for the MAC address of the destination IP address; the destination node then replies to the source node with the MAC address of its receiving interface card. [14] shows that ARP traffic presents a significant burden on a large network: although each host caches IP-to-MAC address mappings, based on their results a million hosts in a data center would create 239 Mbps of ARP traffic arriving at each host at peak, which might cause congestion on bottleneck links and consume frame processing time. DHCP is another broadcast-based protocol, used ubiquitously to assign IP addresses in a LAN dynamically. When a host boots up, it broadcasts DHCP discovery frames to all nodes in the same domain, and the DHCP server responds with an available IP address for the requesting node. As with ARP, every broadcast frame must be processed by every end host, and broadcast frame processing cannot be offloaded to the network interface card.


2.3 Mapping the L2 + L3 design to cloud-scale requirements

While the traditional L2 + L3 design has served the Internet for decades, when applied to the cloud-scale data center it manifests limitations that fail to meet the requirements of the cloud-scale data center network. In this section, we walk through using the traditional L2 + L3 design to build a cloud-scale data center and examine the burdens encountered.

1. Any-to-any connectivity with non-blocking fabric: A cloud-scale fabric must connect thousands of hosts as well as meet an oversubscription of 1:1. The traditional L2 + L3 design typically does not form a mesh-like network meeting the non-blocking property because (1) if the fabric is built using only Layer 2 switches, STP simply blocks the redundant links and only a single tree is used for forwarding, and (2) if it is built using Layer 3 routers, it imposes great configuration overhead and is more expensive than using commodity Layer 2 switches. As a result, the traditional hierarchical design is unable to meet the low oversubscription requirement between layers.

For example, although server-to-server communication within the same rack has an oversubscription of 1:1, traffic across racks usually has a higher oversubscription ratio, and the ratio increases rapidly as traffic moves up through the layers of the hierarchy. Uplinks from servers to ToRs are typically 1:5 to 1:20 oversubscribed; for example, 40 1G NICs may connect to a ToR switch with only four 1G uplinks. Paths that route through the core layer might suffer from oversubscription of 1:80 to 1:240 [7]. The high oversubscription ratio constrains workload placement by preventing idle servers from being assigned, and thus greatly degrades the performance of data-intensive applications.

2. Virtual machine mobility: VM migration needs to be transparent to applications, which means the IP address must remain the same. More specifically, this requires that migration happen only within the same Layer 2 domain, because migrating to another Layer 2 network requires reconfiguring the IP address and subnet mask to match the target Layer 2 network. The L2+L3 design, which connects multiple small Layer 2 domains with Layer 3 routing, therefore restricts a VM to migrating only within its current Layer 2 domain. Techniques such as using VLANs (Virtual Local Area Networks) to extend a domain virtually to another physical location, or tunneling techniques such as IP-in-IP, can increase VM mobility if configured properly. However, this usually requires error-prone manual labor and results in a high turnaround time. Moreover, misconfiguration of VLANs can cause a serious network meltdown.

3. Fast fail-over: Conventional Ethernet relies on the Spanning Tree Protocol to guarantee loop-free packet forwarding.


In a network that operates normally, if a link or switch goes down, STP automatically picks a backup path from the redundant links and the network re-converges; after a few seconds, the network operates normally again. One problem with using STP as the fail-over mechanism is that its re-convergence time is too long, e.g., several seconds. Mission-critical applications such as financial transactions that execute in milliseconds cannot wait several seconds for a network disruption. Needless to say, Layer 3 routing protocols such as link-state routing impose even higher fail-over latency.

4. Support for Multi-tenancy: Multi-tenancy support requires the cloud-scale data center to provide a group of logically isolated virtual resources, such as a VDC (Virtual Data Center), for each tenant. Under the L2 + L3 architecture, VLANs are one option for partitioning an Ethernet network and creating multiple virtual Layer 2 domains, one per tenant. Unlike a physical LAN, a VLAN can be defined logically and allows end hosts to be grouped together even if they are physically located on separate switches. The group shares the same broadcast domain and has the same attributes as a physical LAN. This is achieved by assigning each group a VLAN ID; frames coming from the group are tagged with the ID when entering the VLAN and untagged when leaving. Although VLANs offer the flexibility to create a logical LAN, they come with some limitations. First, the subnet and IP address range for each VLAN must be provisioned and managed. Second, the Layer 3 routers connecting the VLANs need to be properly configured, and inter-VLAN traffic needs to be routed through a gateway. Moreover, only a limited number of VLAN IDs, i.e., 4096, can be used.

The last and most important limitation is that there is no way, short of significant effort, to provide each tenant with its own private 24-bit IP address space, because routing packets with the same destination IP address but destined to different hosts on the same physical network is almost impossible. Figure 2.3 illustrates the difficulty using IP-in-IP tunneling. The cloud service provider hosts two customers, A and B, and offers each of them the 24-bit private IP address space, e.g., 10.0.0.0. Accidentally, both customers assign IP2 as the IP address of one of their VMs, and both VMs are located in the routing domain of router R3. When IP1 from customer B and IP3 from customer A each try to communicate with their VM with address IP2, the next-hop router R2 encapsulates the IP header with an outer IP header destined to router R3. Inside the data center fabric, the packets are routed by the outer IP header. As soon as the packets arrive at R3, R3 decapsulates the IP-in-IP packets and forwards them to the outgoing interface determined by its routing table.

Although IP-in-IP tunneling separates the routing domains of the cloud service provider and the customers, the problem still exists at the edge router (R3) when the destination IP addresses of different customers are the same, in this case IP2.


[Figure: customer A's VDC1 and customer B's VDC2 each contain a VM with address IP2 behind router R3; packets from IP3 (customer A) and IP1 (customer B) are encapsulated with an outer IP header (IP10/IP20) by R2, routed across the data center fabric through R1/R4, and decapsulated at R3, where both inner packets carry destination address IP2.]

Figure 2.3: Multi-tenancy problem caused by IP address space reuse.

This implies that R3 must apply some non-standard technique, e.g., VRF (Virtual Routing and Forwarding), in order to route to the two hosts belonging to different VDCs.

5. Load balancing routing: Although IP routing protocols can to some extent consider the load of each path and do some balancing, within a Layer 2 domain the spanning tree protocol (STP) creates a single tree for packet forwarding and all the redundant links are blocked. This is, in fact, a waste of available bandwidth and increases the likelihood of unbalanced link utilization. Although configuring per-VLAN spanning trees (PVST) can improve load balance and overall throughput [22], using this effectively and configuring it dynamically requires periodically probing the network and reassigning the spanning tree roots.

In this chapter, we revisited classic Ethernet's broadcast-based service model and the spanning tree protocol. We argued that standard Ethernet does not scale to a large number of hosts because of this outdated model, and we evaluated the L2 + L3 design against the cloud-scale data center network requirements. In the next chapter we present several academic and industrial solutions that aim to build a single large Layer 2 network fabric by solving the scalability issues of Ethernet mentioned above.


Chapter 3

All Layer 2 Network

We define an all-Layer 2 network to be a scalable network architecture that carries traffic based on Ethernet technologies. The network employs only commodity Ethernet switches as dumb packet forwarding engines: forwarding decisions are made from the Ethernet header and its corresponding entry in the forwarding table. There is only a single subnet, and the only routers in this network are the gateways connecting to the WAN.

3.1 Design Issues

Before jumping into the various proposed solutions, we first list the common design issues network architects need to address when building a single, large-scale Ethernet network.

1. Physical network topology: Since the traditional tree topology imposes high oversubscription, the solution requires a physical topology design that is non-blocking, easy to manage, and extendable to a large number of nodes.

2. Addressing: The solution should solve the limited forwarding table problem by assigning MAC addresses in a way that both delivers frames and minimizes forwarding table usage. The addressing techniques should be supported by commodity switches with minimal modifications.

3. Routing: If STP is disabled in order to make efficient use of all links, the all-Layer 2 network must provide its own routing techniques, including topology discovery, loop prevention, and load balancing.

4. Fail-over: The design should include failure detection, either centralized or distributed dissemination of failure information, and a recovery mechanism that quickly brings the network back to normal operation.


[Figure: SW1 and SW2 are stacked/trunked into a single logical switch with synchronized state; SW3 and SW4 each split their uplinks across SW1 and SW2, so STP sees a loop-free topology with a single logical link to the upper layer.]

Figure 3.1: SMLT: Split Multi-Link Trunking (left) and STP's view (right).

3.2 Standards and Industrial Solutions

3.2.1 Link Aggregation Protocols

LACP: Link Aggregation Control Protocol

IEEE 802.3ad, also called link aggregation, MLT (Multi-Link Trunking), or NIC bonding, is an industry standard for combining multiple parallel network links into a single logical connection. The benefits are increased aggregate throughput and redundancy in case a link fails. Using LACP, a switch learns the identity of neighboring switches capable of supporting LACP and the capabilities of each port. It then groups similarly configured ports into a single logical link. Packets destined to the logical link (also called a trunk) are distributed to one of the links in the group based on a hashing algorithm. If one of the links in the group fails, the traffic previously carried over that failed link moves to the remaining links within the same group.

LACP is commonly deployed in cloud data centers and enterprise networks. For example, a fat-tree topology has higher bandwidth demand when moving up the hierarchy. A ToR switch can group four of its 10GE ports with four ports on its aggregation switch, creating a 40 Gbps uplink within a single spanning tree (STP) domain. Without LACP, grouping multiple links between two switches results in the redundant links being blocked by STP. LACP avoids this limitation because, from STP's point of view, the aggregated group of links is a single logical entity, so no member link is blocked.
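
As a minimal sketch of how a trunk distributes traffic over its member links (the exact hash inputs are vendor-specific; hashing the source/destination MAC addresses is just one common choice, and the addresses below are made up):

```python
import zlib

def pick_member_link(src_mac: str, dst_mac: str, num_links: int) -> int:
    """Hash the frame's address pair onto one member link of the trunk.
    All frames of the same src/dst pair stay on the same link (no reordering)."""
    key = (src_mac + dst_mac).encode()
    return zlib.crc32(key) % num_links

# Four 10GE links aggregated into one 40G trunk:
print(pick_member_link("00:11:22:33:44:55", "66:77:88:99:aa:bb", 4))
print(pick_member_link("00:11:22:33:44:56", "66:77:88:99:aa:bb", 4))
```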

SMLT: Split Multi-Link Trunking

SMLT is an enhancement to LACP that removes the limitation that all the physical links in a group must terminate on the same switch. With LACP, the aggregated physical links can only connect a single pair of switches, which increases cabling complexity and limits the flexibility of the network design. SMLT combines the bandwidth of multiple Ethernet ports while splitting the links across multiple switches.

This offers not only larger aggregate bandwidth but also a more reliable network design, because when one of the participating switches fails, the remaining switches can still forward the traffic for the group.


[Figure: hosts A1...An, B1...Bn, C1...Cn, and D1...Dn attach to a lower tier of switches (R5-R8); each lower-tier switch hashes flows onto its four uplinks (1-4) toward the upper tier (R1-R4) using Layer 3 ECMP routing.]

Figure 3.2: 4-way Equal-Cost Multi-Path setup.

Figure 3.1 shows an example topology and its STP view. The left side of the figure shows that SW1 and SW2 first form a single logical switch by using stacking or trunking protocols, depending on the switch vendor; state such as the forwarding table and port status is synchronized between SW1 and SW2. SW3 and SW4 each split two of their ports to connect to SW1 and SW2, forming a multi-link trunking topology. Although the physical topology contains a loop, no link is blocked. The right side of the figure shows STP's view of this topology: because the spanning tree protocol is transparent to the redundant links inside the trunk, no link is blocked.

3.2.2 ECMP: Equal-Cost Multi-Path

Equal-cost multi-path routing (ECMP) is a load balancing routing technique, described in RFC 2991 [24], that spreads flows over multiple equal-cost best paths to a single destination. ECMP load-balances flows such as TCP or UDP connections, potentially increasing the bandwidth between two endpoints by spreading traffic over multiple paths. Figure 3.2 shows a four-way ECMP design. Assume a flow is established from A1 to B1 and a packet from A1 enters the next hop (SW5). The default hashing algorithm hashes flows based on the Layer 3 source and destination IP addresses and the Layer 4 port numbers, and determines the outgoing interface from one of the four uplinks, say the one toward SW2. The upper-layer switch SW2 then deterministically forwards the packet to SW6, which delivers it to B1. The resulting effect is that the traffic between SW5 and SW1-SW4 is randomly distributed over the uplinks, flow by flow.

However, ECMP has some limitations. If certain elephant flows are present, or multiple flows are accidentally hashed to the same outgoing interface, congestion and unbalanced link utilization can still happen [3]. Although per-packet load balancing over multiple paths gives the best bandwidth utilization, it is usually deprecated for the following reasons.


Architecture                  SMLT                    ECMP
Protocol layer                Layer 2 forwarding      Layer 3 IP routing
Load balance granularity      coarser, per-host       finer, per-flow or per-packet
Path diversity exploration    grouping ports          IGP, BGP routing protocols

Table 3.1: A comparison of Trunking and ECMP.

Figure 3.3: Forwarding Paradigm of TRILL (From NIL Data Communications).

First, per-packet load balancing is likely to increase the number of out-of-order packets, increasing the workload on the receiving hosts. For example, TCP treats out-of-order delivery as an indication of network congestion; it decreases its window size, and the throughput of the TCP connection drops. Second, per-packet load balancing requires process switching on all but the highest-end routers, and process switching is 8 to 12 times more processor-intensive than fast switching. Finally, randomly spreading packets over multiple links makes it harder for IT staff to debug the network.

Table 3.1 shows a comparison of SMLT and ECMP. SMLT and ECMP also share some common limitations, listed below:

1. Upstream hash collision: There is usually a limited number of ports or paths in a group. When a hashing algorithm is applied per host or per flow, there is a high probability that multiple large or medium-sized flows are assigned to the same link and cause link congestion (see the sketch after this list).

2. Downstream collision: Both ECMP and SMLT balance traffic at the edge device in the upstream direction, toward the upper-layer devices. Two flows coming from different hosts might still end up using the same downstream link, where congestion can then happen.
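
The following minimal sketch (a toy simulation with made-up flow counts, not a result from this report) illustrates why hash collisions are likely even with only a handful of flows: hashing k flows onto n uplinks frequently maps two or more flows to the same link.

```python
import random
from collections import Counter

def flows_per_uplink(num_flows: int, num_uplinks: int, seed: int = 0):
    """Randomly hash flows to uplinks and count how many land on each."""
    rng = random.Random(seed)
    return Counter(rng.randrange(num_uplinks) for _ in range(num_flows))

# 8 flows hashed onto 4 uplinks: some uplinks carry several flows, some none.
print(flows_per_uplink(8, 4))

# Probability that 4 flows on 4 uplinks all pick distinct links:
p_no_collision = 1.0
for i in range(4):
    p_no_collision *= (4 - i) / 4
print(p_no_collision)   # 0.09375 -> collisions are the common case
```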


3.2.3 TRILL and RBridge

TRILL (Transparent Interconnection of Lots of Links) [25], invented by Radia Perlman to remove certain deficiencies of bridged Ethernet, is an IETF protocol implemented by devices called routing bridges (RBridges) [20]. TRILL is not intended to solve the scalability problems of the modern data center but to avoid the disadvantages of standard bridging. As mentioned before, IP routing does not have problems such as loops in the topology or STP's inefficiency. TRILL aims to avoid these problems by adding routing capability to the Layer 2 bridge.

The TRILL design makes no assumption about the physical network topology, and TRILL switches can coexist with standard Ethernet switches, making TRILL deployable incrementally. For addressing and routing, RBridges in the data center run a link-state routing protocol such as IS-IS, so all RBridges have the topology information and are able to compute shortest-path routes to each other. An RBridge learns the MAC addresses of end hosts by inspecting the packets originating on its links, and this information is distributed to all other RBridges so that every RBridge knows the appropriate egress RBridge for a destination. This approach is not scalable to a very large number of end hosts. When packets enter the RBridge network, the ingress RBridge determines the RBridge nickname associated with the destination MAC address and encapsulates the outgoing packets. The RBridge header includes a hop count, the egress RBridge nickname, and the ingress RBridge nickname. At each intermediate RBridge on the way to the egress RBridge, the hop count is decremented, which prevents transient loops during convergence.

Figure 3.3 shows an example of TRILL forwarding. The ingress RBridge receives the user's MAC frame and encapsulates it in a TRILL header carrying the nicknames of ingress RBridge A and egress RBridge C. The TRILL datagram gets a new outer MAC header, which is looked up and rewritten every time the packet is forwarded by an RBridge; the hop count field in the TRILL header, HopC, is decremented while the rest of the TRILL header stays unchanged. To coexist with standard switches, an RBridge uses a standard Layer 2 header with its own protocol type, with the RBridge header appended after the standard Ethernet header and followed by the payload. When the egress RBridge C receives the packet, the encapsulating RBridge header is removed so that the RBridge network is transparent to the destination host S. In brief, the design can be thought of as transparently connecting the links while avoiding the disadvantages of bridging by using routing. TRILL focuses on the issues of looping and STP inefficiency, but it does not directly solve cloud-scale data center network problems such as the limited forwarding table size, nor does it meet all the requirements listed earlier.
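
A minimal sketch of the encapsulation and per-hop handling described above (the field layout is simplified; the nicknames and MAC strings are hypothetical, and real RBridges do this in hardware):

```python
from dataclasses import dataclass

@dataclass
class TrillFrame:
    outer_dst_mac: str     # next-hop RBridge; rewritten at every hop
    outer_src_mac: str     # transmitting RBridge; rewritten at every hop
    egress_nickname: str   # RBridge C in the example; fixed end to end
    ingress_nickname: str  # RBridge A in the example; fixed end to end
    hop_count: int         # decremented per hop, like an IP TTL
    inner_frame: bytes     # the user's original Ethernet frame, untouched

def forward(frame: TrillFrame, next_hop_mac: str, my_mac: str) -> TrillFrame:
    """Transit RBridge behavior: rewrite outer MACs, decrement hop count."""
    if frame.hop_count <= 0:
        raise ValueError("hop count exhausted; frame dropped to break a loop")
    return TrillFrame(next_hop_mac, my_mac,
                      frame.egress_nickname, frame.ingress_nickname,
                      frame.hop_count - 1, frame.inner_frame)

f = TrillFrame("mac:B", "mac:A", "C", "A", hop_count=8,
               inner_frame=b"original ethernet frame")
f = forward(f, next_hop_mac="mac:C", my_mac="mac:B")
print(f.hop_count, f.outer_dst_mac)   # 7 mac:C
```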

3.2.4 802.1aq: Shortest Path Bridging

Shortest Path Bridging (SPB) is an IEEE 802.1aq draft intended to serve as both a carrier and an enterprise solution. There are two modes of SPB multipath bridging: Shortest Path Bridging MAC-in-MAC (SPBM) and Shortest Path Bridging VLAN (SPBV).


Figure 3.4: Forwarding Paradigm of Shortest Path Bridging, SPB (from NIL Data Communications).

SPBV uses 802.1ad (Q-in-Q) forwarding, while SPBM uses 802.1ah (Provider Backbone Bridging, or MAC-in-MAC) forwarding. For the purposes of this report, we focus on SPBM.

SPBM reuses the Provider Backbone Bridging (PBB) 802.1ah [4] MAC-in-MAC technique. The ingress switch takes the customer's MAC frame and encapsulates it in an 802.1ah MAC frame. The 802.1ah frame header includes a service identifier (I-SID), which abstracts the service from the network by mapping one or multiple VLANs to an I-SID. SPB automatically constructs shortest paths through the network to extend LAN connectivity, and the frame is forwarded based on the backbone destination MAC address. Throughout the forwarding process, the frame remains unchanged. This alleviates the limited forwarding table size problem, since only the edge switches need to learn the users' MAC addresses.

For routing, SPB uses IS-IS as the link-state routing protocol to build the network topology and selects shortest paths according to the link metrics. Traffic (unicast and multicast) is then assigned to those paths. For load balancing, SPB computes 16 shortest-path source trees. Figure 3.4 shows an example of SPBM forwarding. The 802.1aq edge switch (switch U) takes the user's MAC frame and encapsulates it in an 802.1ah (MAC-in-MAC) frame whose destination MAC address is that of the egress switch (switch C). Throughout the backbone forwarding process, the frame remains untouched and the destination MAC address is unchanged. When the frame arrives at the egress switch C, it is decapsulated and delivered to the host S. Table 3.2 shows a comparison of TRILL and SPB.
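
A minimal sketch of the MAC-in-MAC encapsulation at the edge (field names are simplified and the VLAN-to-I-SID mapping values are hypothetical; only the backbone addresses and the I-SID are modeled):

```python
from dataclasses import dataclass

# Hypothetical mapping of customer VLANs to backbone service identifiers.
VLAN_TO_ISID = {10: 20010, 20: 20020}

@dataclass
class MacInMacFrame:
    b_dst_mac: str         # backbone destination: the egress edge switch (e.g., C)
    b_src_mac: str         # backbone source: the ingress edge switch (e.g., U)
    i_sid: int             # service identifier derived from the customer VLAN
    customer_frame: bytes  # original frame; untouched across the backbone

def encapsulate(customer_frame: bytes, customer_vlan: int,
                ingress_switch_mac: str, egress_switch_mac: str) -> MacInMacFrame:
    """Edge behavior: core switches only ever see backbone MACs and the I-SID."""
    return MacInMacFrame(egress_switch_mac, ingress_switch_mac,
                         VLAN_TO_ISID[customer_vlan], customer_frame)

f = encapsulate(b"customer ethernet frame", customer_vlan=10,
                ingress_switch_mac="mac:U", egress_switch_mac="mac:C")
print(f.b_dst_mac, f.i_sid)   # mac:C 20010
```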

3.2.5 Cisco FabricPath

Cisco FabricPath [23] is a Cisco NX-OS technology that combines the configuration simplicity and flexibility of Layer 2 with the convergence and scale of Layer 3. The idea is to create simple, scalable, and efficient Layer 2 domains that are applicable to data centers.


Architecture             TRILL                              SPB
Topology                 general                            general
Addressing               TRILL header                       802.1ah MAC-in-MAC
Routing                  link-state, IS-IS                  link-state, IS-IS
Load balancing           N x transit hash                   16 x ECMP, source-node-based trees
Loop prevention          hop count and RPFC (Reverse        RPFC
                         Path Forwarding Check)
Fail-over latency        depends on IS-IS                   depends on IS-IS
Multicast or broadcast   single tree                        source-node-based spanning trees
Compatibility            new header; requires new ASICs     traditional Ethernet switching
                         on every RBridge                   with 802.1ah-capable hardware

Table 3.2: Solution strategies comparison of TRILL and SPB [10].

FabricPath increases server-to-server bandwidth with multiple active paths and creates a non-blocking architecture to improve performance. Cisco claims that FabricPath is a superset of TRILL and will support TRILL once it is standardized.

Regarding topology design, FabricPath makes no assumption about the physical topology. An example from the FabricPath documentation [11, 23] suggests a two-layer topology (aggregation/spine and access switches) with 16-way ECMP. When the two switch layers are combined with 16-port 10-Gbps PortChannels, FabricPath can provide a data center fabric with 2.56 Tbps of bandwidth between switches. As for routing and addressing, similar to TRILL, the control plane of FabricPath is built on top of the IS-IS (Intermediate System-to-Intermediate System) routing protocol, and the routing table is computed for both multicast and unicast destinations. Frames in FabricPath are forwarded along the shortest path, which reduces the overall network traffic load and increases efficiency. Frames are always forwarded to known addresses, which means no flooding, in contrast to standard Ethernet. Moreover, FabricPath frames include a TTL field similar to IP, which prevents loops in the bridged Layer 2 network. To address the issue of limited forwarding table size, MAC addresses in FabricPath are learned selectively, only at the edge, reducing MAC address table usage in the aggregation switches. The detection of link or switch failures depends on the IS-IS routing protocol. Between the access and aggregation layers, FabricPath load-balances traffic using multi-way ECMP, which can use all the available links between any two devices.

3.2.6 Brocade VCS

The Brocade VCS protocols remove the need for STP and allow all equal-cost paths to be active, resulting in no single point of failure.



Brocade VCS enables organizations to preserve existing network designs and cabling and to achieve active-active server connections without using Spanning Tree Protocol (STP). The Brocade VDX design utilizes a single standard LAG, consisting of multiple 10 GbE connections, that logically allows two switches to look like a single switch to the core routers. The underlying protocol, Transparent Interconnection of Lots of Links (TRILL), provides active multipath support, so that a rack server sees only a single ToR switch.

3.2.7 Juniper QFabric


Juniper's virtual Layer 2 switching architecture, called QFabric [1], aims to distribute the control and data planes to the edge. QFabric makes the network itself behave like a single switch. Inside every switch is a mesh-like fabric that is completely flat and provides any-to-any connectivity between ports. The design of QFabric, shown in Figure 3.5, has three basic components:

1. QF/Node: The QFabric Node provides access into and out of the fabric. The node devices are typically line cards that reside within a chassis switch, forming a high-density edge device.

2. QF/Interconnect: This is the backplane of QFabric. The design enables any-to-any connectivity, where every device is a single hop away from any other device. The initial release of the QFabric architecture offers the interconnection of up to 128 QF/Node edge devices, creating a single fabric capable of supporting 6,000 10GbE ports.


3. QF/Director: This component provides the control and management services for the fabric. The Director has an exclusive out-of-band control plane network for carrying control traffic between QF/Node and QF/Interconnect devices. Moreover, it allows the data center network to appear as a single switch, providing simplicity of management to network operators.

Figure 3.5: QFabric's design: Node, Interconnect, and Director.

In conclusion, QFabric provides a flattened, non-blocking network fabric that supports high-speed server-to-server communication. It allows the data center network to appear as a single switch, which dramatically reduces the cost of managing multiple switches. QFabric further separates the control plane and data plane to eliminate any single point of failure in the system. By adding ports to existing switches, the QFabric architecture can scale the data center network with minimal additional management and operation overhead.

3.2.8 OpenFlow

OpenFlow [13] is a programmable protocol designed to manage and direct traffic among OpenFlow switches from different vendors. Typically, the basic job of a networking device (bridge or router) is to make forwarding/routing decisions (control plane) and subsequently forward the data (data plane). The control plane runs control protocols, e.g., STP, RSTP, routing protocols, and MAC address learning. These control protocols program the forwarding information into the data plane, which can be a simple lookup table in TCAM (Ternary Content Addressable Memory). Packets are then forwarded to the outgoing interface by looking up the table. Typically, the control plane uses a communication protocol to program the forwarding information into the data plane. Vendors today offer varying degrees of programmability and proprietary protocols on their switches and routers for this purpose. As a result, global and unified network resource management and traffic engineering are limited by the inconsistencies across devices from multiple vendors. OpenFlow's goal is to provide an open and standard protocol to program the forwarding tables in switches and routers from different vendors. OpenFlow consists of three parts:

1. Flow tables installed on switches

2. OpenFlow controller

3. The OpenFlow protocol, which the controller uses to create secure channels to the switches.

The flow table on each switch is controlled by the OpenFlow controller via the secure channel. The controller imposes flow policies onto the flow table. Paths through the network can be optimized according to flow-specific characteristics, such as SLAs, end-to-end latency, or throughput. When an OpenFlow switch receives a frame, it first checks whether there is a matching entry in its flow table. If not, the switch forwards the frame to the controller. The controller makes forwarding decisions based on various fields in the frame, such as the source/destination MAC, IP, or port numbers. This part of the logic can also be used as a firewall to block or restrict certain network flows. Once the controller decides the forwarding policy, it programs the information into the switch's flow table via the secure channel.

In conclusion, with OpenFlow, network operators can slice off a portion of the network devices (switches or routers) in an OpenFlow network and create virtual networks for researchers to develop new protocols. In addition, OpenFlow offers fine-grained, flow-level forwarding control without the restrictions of IP routes or STP.
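To make the reactive forwarding model described above concrete, the sketch below shows a flow-table lookup with a controller fallback. It is a simplified illustration, not a real OpenFlow API: the Switch and Controller classes, the (source MAC, destination MAC) match key, and the policy table are all assumptions of this sketch.

# Minimal sketch of reactive OpenFlow-style forwarding; class and field names
# are illustrative, not an actual OpenFlow controller framework.

from typing import Dict, Optional, Tuple

Match = Tuple[str, str]          # (src_mac, dst_mac) -- simplified match key

class Controller:
    def __init__(self, policy: Dict[str, int]):
        self.policy = policy     # dst_mac -> output port, decided centrally

    def packet_in(self, match: Match) -> Optional[int]:
        # A real controller could also consult SLAs, latency targets, or
        # firewall rules here; this sketch only looks up the destination MAC.
        return self.policy.get(match[1])

class Switch:
    def __init__(self, controller: Controller):
        self.flow_table: Dict[Match, int] = {}
        self.controller = controller

    def receive(self, src_mac: str, dst_mac: str) -> Optional[int]:
        match = (src_mac, dst_mac)
        if match in self.flow_table:             # fast path: flow entry hit
            return self.flow_table[match]
        port = self.controller.packet_in(match)  # miss: ask the controller
        if port is not None:
            self.flow_table[match] = port        # controller installs the entry
        return port

sw = Switch(Controller({"00:00:00:00:00:02": 3}))
print(sw.receive("00:00:00:00:00:01", "00:00:00:00:00:02"))   # 3, via controller
print(sw.receive("00:00:00:00:00:01", "00:00:00:00:00:02"))   # 3, flow table hit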

3.3 Academic Solutions

3.3.1 PortLand

The goal of PortLand [16] is to create a scalable, easily manageable, and fault-tolerant data center Layer 2 network fabric. It proposes a three-tier hierarchical topology and a scalable Layer 2 routing and forwarding protocol.

PortLand considers building the data center over a three-stage (core-aggregation-edge) fat-tree topology. Edge and aggregation switches are grouped into pods, and each pod connects to every core switch. PortLand overcomes the limited forwarding table size problem by assigning a hierarchical pseudo MAC address (PMAC) to each host, representing its location (expressed as pod.position.port.vmid). The design keeps PMACs transparent to hosts: a host remains unmodified and continues to use its original MAC address (AMAC). When communication starts, a host sending out an ARP request receives the PMAC of the destination host (Figure 3.6, steps 1-3). The forwarding of subsequent packets is based on the destination host's PMAC, which means host AMACs never consume forwarding table entries except at the edge switches. When the edge switch associated with the destination receives a frame, it performs PMAC-to-AMAC header rewriting so that the destination host is unaware of the existence of PMACs. This requires a centralized server, named the fabric manager, to keep all the PMAC-to-IP-address mappings, and it requires edge switches to maintain PMAC-to-AMAC mappings.

Figure 3.6: PortLand packet forwarding and Actual MAC (AMAC) to Pseudo MAC (PMAC) mapping.


When a packet enters the PortLand network fabric, since the destination's location is encoded in its PMAC, switches can forward the packet based on the PMAC. This design requires switches to discover their own locations using PortLand's Location Discovery Protocol (LDP), and an edge switch automatically learns PMAC-to-AMAC mappings by observing incoming packets. PortLand guarantees loop-free forwarding by preventing a switch from forwarding a packet to an upward-facing port if the packet was received from an upper-level switch in the hierarchy. Switches running LDP detect switch and link failures by exchanging liveness messages. The fabric manager keeps a fault matrix for the links and updates it with new information; upon a change, it informs all affected switches of the failure, and these switches then recalculate their forwarding tables based on the new topology. The authors consider the multipath and load-balancing problem orthogonal to this work and suggest flow-hashing ECMP as one choice for achieving flow-level load balancing.
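To illustrate the idea of location-encoded addresses, the following sketch packs pod.position.port.vmid into a 48-bit pseudo MAC and decodes it again. The particular field widths (16/8/8/16 bits) are assumptions for illustration, not PortLand's actual bit layout.

# Minimal sketch of PortLand-style PMAC encoding (pod.position.port.vmid).
# The 16/8/8/16-bit field split is an assumption used only for illustration.

def encode_pmac(pod: int, position: int, port: int, vmid: int) -> str:
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))

def decode_pmac(pmac: str) -> dict:
    value = int(pmac.replace(":", ""), 16)
    return {"pod": value >> 32, "position": (value >> 24) & 0xFF,
            "port": (value >> 16) & 0xFF, "vmid": value & 0xFFFF}

pmac = encode_pmac(pod=5, position=2, port=7, vmid=12)
print(pmac, decode_pmac(pmac))
# An edge switch would keep a small PMAC<->AMAC table and rewrite headers at
# the boundary, so core and aggregation switches only ever see location-encoded
# PMACs and need far fewer forwarding table entries.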

3.3.2 VL2

The main ideas behind VL2 are to create a large virtual Layer 2 domain using commodity hardware, a non-oversubscribed Clos network topology, and a load-balancing mechanism based on randomization. VL2 takes a scale-out approach and designs its topology by aggregating the capacity of a large number of commodity switches. The topology follows a three-layer design: intermediate, aggregation, and ToR (Top of Rack). The links between the intermediate and aggregation switches form a Clos [5] network, providing rich path diversity and no oversubscription.


VL2 leverages two different IP address families: location-specific IP addresses (LAs) and application-specific IP addresses (AAs). LAs are hierarchically assigned to all the switches, and the switches run an IP-based link-state routing protocol. AAs are allocated to applications and remain unchanged when an application migrates to another location. The idea is to create the illusion of a large IP subnet (the AA address space) while the underlying network routes packets by LA. Similar to PortLand, this requires a directory system to maintain the AA-to-LA mappings.

To route between servers, VL2 deploys a layer-2.5 agent at each server, which intercepts ARP requests and redirects them to the directory system. The directory system responds with the LA associated with the destination AA, and the agent encapsulates outgoing packets with that LA. To distribute load evenly, VL2 applies Valiant Load Balancing (VLB) by randomly choosing an intermediate switch and encapsulating its LA into the outgoing packets as well. In brief, as shown in Figure 3.7, an outgoing packet is encapsulated with an intermediate switch LA and a ToR switch LA. Along the forwarding path, the intermediate switch and the ToR switch decapsulate the packet and forward it to the next hop. VL2 also uses ECMP [9] to distribute traffic across equal-cost paths. The combination of VLB and ECMP prevents any single link from being heavily loaded in the data center. In addition, VL2 uses a link-state routing protocol to detect switch/link failures and to maintain the switch-level topology.

Figure 3.7: VL2: packet encapsulation, decapsulation, and VLB.
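The sketch below captures the VLB idea under simplifying assumptions: the directory contents, the names, and the nested-dictionary stand-in for IP-in-IP headers are illustrative rather than VL2's actual packet format.

# Minimal sketch of VL2-style VLB encapsulation (illustrative data only).

import random

directory = {"10.0.0.8": {"tor_la": "20.1.1.1"}}     # AA -> ToR locator (LA)
intermediate_las = ["20.0.0.1", "20.0.0.2", "20.0.0.3"]

def vlb_encapsulate(payload: bytes, dst_aa: str) -> dict:
    tor_la = directory[dst_aa]["tor_la"]             # from the directory system
    mid_la = random.choice(intermediate_las)         # Valiant Load Balancing
    # Outer header targets the random intermediate switch, the next header the
    # ToR switch, and the innermost destination is the application address.
    return {"outer_dst": mid_la,
            "inner": {"dst": tor_la,
                      "inner": {"dst": dst_aa, "payload": payload}}}

pkt = vlb_encapsulate(b"hello", "10.0.0.8")
print(pkt["outer_dst"], pkt["inner"]["dst"])
# The intermediate switch strips the outer header, the ToR strips the next one,
# and the destination server finally receives a packet addressed to its AA.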

3.3.3 Monsoon

Monsoon [8] aims to create a scalable, load-balancing, all-Layer-2 network architecture. Monsoon uses a three-layer approach (ToR, ingress/egress, and intermediate) to scale its Layer 2 domain to 100,000 servers. Each ToR switch has two 10-Gbps uplink ports, while the ingress/egress and intermediate switches have 144 10-Gbps ports each. The design has a 1:1 oversubscription ratio and is well suited for load-balanced routing. To route between servers, the source node needs two pieces of information: first, the list of MAC addresses of the servers responsible for handling the destination IP address, and second, the MAC addresses of the ToR switches to which each of those servers is connected. A Monsoon agent on each server replaces the user-level ARP functionality, obtains this information from a Monsoon directory server, and encapsulates every outgoing packet. The directory server maps a server's IP address to a list of (server MAC address, ToR switch MAC address) pairs. When communication starts, outgoing frames are encapsulated with the MAC addresses of the ToR switch and an intermediate switch and sent out. The intermediate and ToR switches, which support MAC-in-MAC tunneling, decapsulate the frame, and eventually the frame arrives at the destination host unaltered. With MAC-in-MAC tunneling, the switches in Monsoon forward frames based only on switch MAC addresses, which solves the limited forwarding table size problem.

To distribute workload in the data center, Monsoon provides a mechanism called load spreading. Load spreading is achieved by creating a VIP (Virtual IP) shared by a set of servers (a server pool). When requests arrive, the directory server replies with the list of MAC addresses associated with the VIP and uses MAC rotation to provide efficient server-to-server forwarding. The sender uses consistent hashing to select a destination host from the MAC address list. When a server fails, Monsoon leverages the existing data center health service to remove servers from, or add them back to, the server pools.

Issue               PortLand                      VL2                            Monsoon
Topology            Two-layer, multi-root         Three-layer, Clos network      Three-layer, multi-root
Addressing          Hierarchical; location        Flat; IP-in-IP; Location       Flat; MAC-in-MAC
                    encoded into MAC address      Address (LA) and               encapsulation; source
                                                  Application Address (AA)       routing
Routing             Location Discovery            Link-state routing,            Centralized routing based
                    Protocol, shortest path,      shortest path, route by IP     on the 4D architecture
                    route by MAC
Load balancing      Flow-hashing ECMP             VLB + flow-based ECMP          VLB + MAC rotation
Loop prevention     Packets traveling down        Depends on IP routing          N/A
                    cannot travel back up         protocol
Fail-over latency   60-80 ms; centrally           Seconds; depends on routing    Based on ECMP to detect
                    controlled and notified       protocol convergence time      failures
ARP/DHCP handling   Redirect ARP at the switch    Disable ARP and DHCP;          Redirect ARP and DHCP
                                                  replaced by a user-level
                                                  agent

Table 3.3: Summary of the solution strategies of the PortLand, VL2, and Monsoon architectures.


Chapter 4

Peregrine: An All-Layer-2 Container Computer Network Architecture

4.1 Introduction

Cloud computing ushers in an era in which most information technology users do not need to own the system hardware and software infrastructure on which their day-to-day IT applications run. They either pay for their IT infrastructure usage on demand or get it for free, e.g., through subsidies from advertisers. Although the concept of decoupling the use of IT infrastructure from its ownership has only started to gain traction in the enterprise space within the last three years, it has been quite common and popular in the consumer space. In this new ecology, it is the cloud service providers that build and own the IT infrastructures on which third-party or their own cloud applications run and deliver services, and that along the way get reimbursed for the value provided to their respective users.

The name of the game behind most cloud computing business models is economy of scale. By consolidating IT infrastructures within an organization (private cloud) or across multiple organizations (public cloud), both the capital expense (software licensing cost, hardware acquisition cost, etc.) and the operational expense (human system administration and support cost, energy usage cost, etc.) can be significantly reduced. In addition, by exploiting statistical multiplexing, a consolidated IT infrastructure can be made more capable, flexible, and robust than the sum of the parts from which it is consolidated. Although the IT infrastructure consolidation brought forth by cloud computing has many benefits, it also escalates the scalability issues of IT infrastructure to a new level. One such issue is the scalability of a cloud data center's network architecture. This paper describes the design, implementation, and evaluation of a data center network called Peregrine, which is specifically designed for a container computer built at the Industrial Technology Research Institute (ITRI) in Taiwan.

The ITRI container computer is designed to be a modular building block for constructing a cloud data center computer, which in general is composed of multiple container computers connected by a data center network, is interfaced with the public Internet through one or more IP routers, and is designed as an integrated system whose hardware components, such as servers and switches, are stripped of unnecessary functionalities, whose resources are centrally configured, monitored, and managed, and which encourages system-wide optimizations to reach better global design tradeoffs. A key design decision of the ITRI container computer is to use only commodity hardware, including compute servers, network switches, and storage servers, and to leave high availability and performance optimization to the system software. Another key decision is to design a new data center network architecture from the ground up to meet the unique requirements imposed by a cloud data center computer.

Before embarking on the design of the network architecture for the ITRI container computer, we carefully reviewed the related research literature, studied possible use cases, and came up with the following requirements:

1. There is only one network, which supports communications among programs, data storage accesses, and interactions with the Internet.

2. The network must be buildable from mainstream commodity Layer-2 switches for lower cost and better manageability.

3. The network must be able to support up to one million end points, each of which corresponds to a virtual or physical machine.

4. The fail-over latency for any single network link/device failure must be lower than 100 msec.

5. The loads on the network's physical links must be balanced.

6. The network must support private IP address reuse, i.e., multiple instances of the same private IP address can co-exist simultaneously.

The first requirement dictates that the ITRI container computer should not use a separate SAN for storage data accesses, and that its network must interface seamlessly with the container computer's Internet edge logic component. The second requirement mandates that only mainstream rather than high-end enterprise-grade Ethernet switches be used and that the modifications required on these switches be minimized. The last requirement is included specifically to support Amazon EC2-like IaaS (Infrastructure as a Service) cloud services, where multiple virtual data centers are multiplexed on a physical data center and each virtual data center is given the full private IP address space 10.X.X.X, so that a customer's virtual data center can seamlessly inter-operate with its existing on-premise physical data centers without any network/system reconfiguration, such as IP address re-assignment.

A natural choice for building an all-Layer-2 data center network is the standard Ethernet architecture. Unfortunately, because conventional Ethernet is based on a spanning tree architecture, it cannot satisfy the third and fourth requirements. Moreover, because the number of forwarding table entries in most mainstream Ethernet switches is between 16,000 and 64,000, such switches are ill-equipped to meet the third requirement. Finally, IP address reuse is considered a run-time configuration error and is thus impossible to support in standard Ethernet networks.

Figure 4.1: System architecture of the ITRI container computer and its various system components.

Peregrine satisfies all the requirements mentioned above. It uses a two-stage dual-mode packet forwarding mechanism to support up to 1M end points using only mainstream Ethernet switches. It incorporates load-aware routing to make the best use of all the physical network links, and it proactively provisions primary and backup routes to anticipate potential network failures. Peregrine supports private IP address reuse through a protected address translation mechanism similar to virtual address translation. Finally, Peregrine requires only about 100 lines of code change on mainstream Ethernet switches.

4.2 ITRI Container Computer

Figure 4.1 shows the logical system architecture of the ITRI container computer. The ITRI container computer is physically housed in an ISO-standard 20-foot (6.096 meter) shipping container and consists of 12 server racks lined up on both sides of the container with an access aisle in the middle, where each server rack holds up to 96 current-generation X86 CPUs and 3TB of DRAM. Twelve JBOD (Just a Bunch Of Disks) storage servers, each packed with 40 disks, are installed in the container computer. Together with the local disks directly attached to the compute server nodes, the container boasts more than 1 petabyte of usable disk space.

The ITRI container computer uses a single 380VDC power distribution network to distribute power to all of its hardware devices, avoiding the power efficiency loss due to unnecessary conversion between AC and DC. The PDU on each server rack is capable of supporting 25 kilowatts of power. The cooling subsystem uses a combination of air and liquid cooling technologies and is specifically designed to achieve an annual average PUE of 1.2 in a subtropical climate such as Taiwan's, where PUE is defined as the total amount of energy consumption divided by that of the IT equipment alone.

Figure 4.2: The physical topology of the ITRI container computer's network is a modified Clos network.


The ITRI container computer is designed to subsume all hardware functionalities seen in a typical data center, and thus includes support for all Internet edge logic such as NAT (network address translation), VPN (virtual private networking), traffic shaping, and server/network load balancing, which is implemented on general-purpose server clusters rather than commercial proprietary appliances.

To support lights-out management, the ITRI container computer incorporates a comprehensive SNMP-based environmental monitoring and control subsystem to protect itself, including a fire-and-smoke detection system backed by a clean-agent gas fire suppression subsystem, a physical security alarm subsystem, and an early earthquake detection system that proactively shuts the computer down in the event of an earthquake. The container computer is designed to sustain a Richter scale 6.0 earthquake with no operational impact.

The ITRI container computer's network is a modified Clos network, as shown in Figure 4.2. Every rack contains 48 server nodes, each with four 1GE NICs, and includes four top-of-rack (TOR) switches, each with 48 1GE ports and four 10GE ports. A virtual switch inside every server node is connected to the server node's four NICs, which in turn are connected to the four TOR switches in the same rack. The four 10GE uplinks on each TOR switch are connected to four different regional switches, each of which has 48 10GE ports. To improve the performance of storage accesses, each storage server has four 10GE NICs and is directly connected to four different regional switches. In total, five regional switches are used in the ITRI container computer. Peregrine is designed to connect multiple ITRI container computers, but doing so requires another layer of core switches to establish the necessary connectivity.

4.3 Two-Stage Dual-Mode Packet Forwarding

Because the number of addressable hosts in a single IP subnet of an enterprise network rarely exceeds 5,000, the number of forwarding table entries on a large percentage of mainstream enterprise-grade Ethernet switches is no larger than 32,000. Coupled with the fact that Ethernet switches forward packets based on their destination address, mainstream Ethernet switches cannot be used to build a network with a million end points or hosts, because they cannot afford to allocate a forwarding table entry for each and every host.

Peregrine solves this problem using a two-stage forwarding scheme. Hosts in a Peregrine network are partitioned into disjoint groups, each of which is "proxied" by a dedicated intermediary. That is, every intermediary is capable of reaching every host in its group in one hop. To send a packet to a destination D, the source S first identifies the intermediary associated with D, uses the MAC address of D's intermediary as the packet's destination address, and embeds D's MAC address elsewhere inside the packet. This process is known as MAC-in-MAC (MIM) encapsulation. When S sends this MIM packet out, it reaches D's intermediary first; the intermediary, knowing that it is an MIM packet, extracts the embedded MAC address of D, replaces the packet's destination address with it, and dispatches the packet to the normal packet forwarding process. The intermediary is said to perform MIM decapsulation in this case. The packet eventually arrives at D because D's intermediary can always reach D in one hop.

Whenever a VM moves from one physical machine to another, the VM's routing state in the network must be modified accordingly so that packets destined to this VM can still reach it after its migration. Two-stage forwarding simplifies routing state migration in a way similar to Mobile IP [19]: when a VM moves to a new PM, Peregrine changes the VM's intermediary to the intermediary covering the new PM and informs all parties previously communicating with the VM of this change. During the transition period, the old intermediary sends back a "host unreachable" ICMP message on behalf of the migrated VM whenever it receives packets destined to the VM.

With two-stage forwarding, an intermediary switch only needs to allocate forwarding table entries for the other intermediaries and for all the hosts in its own group. Let G denote the number of hosts in an intermediary's group. In a 1,000,000-node Peregrine network, the number of forwarding table entries needed by every intermediary switch is thus 1,000,000/G + G. A non-intermediary switch only needs to allocate forwarding table entries for the other intermediaries. In the context of the ITRI container computer, there are two possible choices for the intermediary: the TOR switch or the virtual switch inside every server node. When the intermediary is a TOR switch, G is 480, because a TOR switch is connected to 48 server nodes and every server node is assumed to run 10 virtual machines; the required number of forwarding table entries per switch is then about 2,600, which is well below the common forwarding table size limit of 16,000. When the intermediary is a virtual switch, G becomes 10, and the required number of forwarding table entries per switch becomes about 100,000, which is higher than the limit. However, choosing TOR switches as intermediaries requires modifications to these switches, although the modification effort is relatively minor, about 100 lines of code, as most modern switches support the ability to trap certain types of packets and handle them separately. More problematically, there is a serious performance penalty associated with MIM decapsulation, because existing switches are not designed to perform this function in the data plane and therefore have to support it in the control processor, whose packet processing rate can easily be 3 to 4 orders of magnitude slower than the data plane's packet forwarding rate. Hopefully, as more commercial switches support the OpenFlow standard [13], which provides the flexibility of customized packet processing in the data plane, this performance penalty could be significantly reduced.
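As a quick sanity check using the numbers above, the per-switch entry counts for the two intermediary choices are:

\[
\frac{1{,}000{,}000}{480} + 480 \approx 2{,}083 + 480 \approx 2{,}600,
\qquad
\frac{1{,}000{,}000}{10} + 10 = 100{,}010 \approx 100{,}000 .
\]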

Although two-stage forwarding provides the generality of exchanging packets between any two nodes in a 1,000,000-node network, this generality comes with a potential performance cost. To mitigate this overhead, we propose an optimization called dual-mode forwarding, which allows a source to send packets directly to those destinations with which it communicates frequently, and indirectly to the rest. Because the number of nodes with which a given node, say X, is expected to communicate frequently is small even in a 1,000,000-node network, say 100, one can allocate forwarding table entries for these nodes on the switches along the paths between X and them, to speed up these communications. This is possible because many entries in the switch forwarding tables remain unused after two-stage forwarding is adopted.

More generally, Peregrine dynamically measures the traffic volume from every host to every other host, i.e., the traffic matrix, and sorts the resulting measurements into a list in decreasing order. Starting from the head of this list, Peregrine allocates a forwarding table entry in every switch on the path from the source to the destination of the entry, and continues down the list until either the entry's traffic volume is too low to be worthwhile or the occupancy ratio of any forwarding table on the path exceeds a certain threshold. Initially, every node is reachable indirectly via its associated intermediary. As more traffic load information becomes available, Peregrine gradually builds up direct routes between node pairs with heavy communication. Note that for a given node X, some nodes may find it worthwhile to build a direct route to X, while others may continue to use the original indirect route.
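A minimal sketch of this greedy promotion policy is given below. The traffic-matrix representation, the occupancy threshold, the table-size constant, and the minimum-volume cutoff are illustrative assumptions, not the prototype's actual values.

# Minimal sketch of dual-mode route promotion: walk host pairs in decreasing
# traffic order and install direct-route entries until the traffic becomes too
# small or a switch table on the path gets too full. Constants are examples.

from typing import Dict, List, Tuple

TABLE_SIZE = 16000          # assumed forwarding-table capacity per switch
OCCUPANCY_LIMIT = 0.9       # stop filling a table beyond 90% occupancy
MIN_VOLUME = 1.0            # Mbps below which a direct route is not worthwhile

def promote_direct_routes(traffic: Dict[Tuple[str, str], float],
                          path_of: Dict[Tuple[str, str], List[str]],
                          table_used: Dict[str, int]) -> List[Tuple[str, str]]:
    promoted = []
    for pair, volume in sorted(traffic.items(), key=lambda kv: kv[1], reverse=True):
        if volume < MIN_VOLUME:
            break                          # remaining entries are even smaller
        path = path_of[pair]
        if any(table_used[sw] + 1 > OCCUPANCY_LIMIT * TABLE_SIZE for sw in path):
            break                          # a table on the path is already too full
        for sw in path:
            table_used[sw] += 1            # allocate one entry per switch on the path
        promoted.append(pair)              # this pair now uses a direct route
    return promoted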

4.4 Fast Fail-Over

Peregrine is designed to reduce the fail-over delay of any single switch/port/link failure to under 100 msec. To achieve this aggressive goal, for a given node X, Peregrine pre-computes a primary and a backup route from every other node to X, where the primary route and backup route are node-disjoint and link-disjoint excluding the two end points, assuming the underlying physical network connectivity provides enough redundancy. Whenever a network link or device fails, the primary routes provisioned on the failed device or link are identified, and the nodes that are using these primary routes are notified to switch to their corresponding backup routes. The fail-over delay of a network device/link failure thus consists of the time to detect the failure, the time to identify the affected primary routes and the nodes currently using them, and the time to inform these affected nodes to switch from primary to backup routes.

Because Peregrine uses conventional Ethernet switches, and Ethernet switches forward packets based on their destination address, the only way to forward packets destined to a given node X along different routes is to assign multiple MAC addresses to X, each representing a distinct route to reach X. At start-up time, Peregrine installs the pre-computed primary/backup routes to every host in the switches' forwarding tables. At run time, switching from the primary to the backup route of a given host is simply a matter of using the host's backup MAC address rather than its primary MAC address.

To enable fast fail-over, for a given host X, Peregrine pre-computes two disjoint paths from each of the other hosts to X. One simple way to achieve this is to compute two disjoint spanning trees (primary and backup), each rooted at X and spanning all other hosts. Whenever a network failure affects the primary spanning tree of host X, all other hosts are informed to switch to the backup spanning tree of X by using X's backup MAC address to reach X. That is, X is reachable by the rest of the world either through its primary MAC address or through its backup MAC address, but never both. The main advantage of this design is that it greatly simplifies the bookkeeping of availability state for each host. However, there are two disadvantages. First, any failure that affects even a small portion of a given host's primary spanning tree renders the entire spanning tree unusable; the collateral damage of this coarse-grained fail-over strategy may be too severe. For example, a NIC-to-TOR link failure may disable many spanning trees. Second, although a Clos network provides rich connectivity, ensuring that a given node's primary and backup spanning trees are completely disjoint can be difficult and greatly reduces the flexibility of load-balancing routing (described below). Therefore, Peregrine adopts a fine-grained fail-over approach called node-pair path (NPP) fail-over, which requires Peregrine to keep track, for each host X, of whether each of its communicating hosts currently uses X's primary or backup MAC address to reach X. This design allows two different hosts to reach X using X's primary and backup MAC addresses simultaneously. With the NPP design, although the primary paths from all other hosts to X still form one spanning tree and the backup paths another, these two trees are no longer required to be disjoint; only the primary and backup paths between each node pair must be disjoint. By removing the requirement that the primary and backup trees of a given host be disjoint, the flexibility and efficiency of Peregrine's routing algorithm are significantly increased.


4.5 Load Balancing Routing

The traditional Ethernet architecture does not support dynamic routing that can accommodate fluctuating workload patterns; only Layer-3 routers provide such support. By exploiting the capability of populating the forwarding tables on switches, Peregrine supports load-balancing packet routing, which takes the following factors into account. First, the importance of different physical links in a data center network differs, even when the physical network topology is symmetric. For example, a physical link may be more critical than another because it is used by many hosts to access a storage server. Peregrine computes the notion of link criticality [6] and uses it to avoid choosing more critical links early on, so as to eventually achieve network-wide load balance. Second, the number of hops on the route between two hosts is an important quality indicator, because it determines the network latency as well as the amount of load injected into the network.

A simple and effective load-balancing routing scheme is to compute a large number of paths between every source and destination pair <s, d> and to distribute the traffic from s to d equally among these paths. However, this algorithm is infeasible because it would require a large number of forwarding table entries for each host. Instead, we can run this routing algorithm statically and use its result to steer the direction of more practical routing algorithms. More concretely, we compute up to N shortest paths for every possible source/destination pair <s, d>, equally distribute the traffic between s and d on these N paths, and then compute the link criticality of a physical link l with respect to <s, d> as

    theta_l(s, d) = M / N,

where M is the number of paths between s and d that go through link l. The expected load from s to d on link l is then theta_l(s, d) * TM(s, d), where TM(s, d) represents the bandwidth demand from s to d, and the total expected load on link l is thus

    theta_l = sum over all (s, d) of theta_l(s, d) * TM(s, d).

Finally, we define the cost of link l as

    cost(l) = theta_l / R_l,

where R_l is the residual capacity of link l, and we avoid choosing links with higher cost as much as possible when computing routes.

Given a traffic matrix, each of whose entries represents the bandwidth demand from one host to another, Peregrine first sorts the entries in decreasing order and then computes paths for them in that order. That is, host pairs with higher bandwidth demands are routed earlier. To compute the primary path for a host pair <s, d>, Peregrine computes the K shortest paths from s to d, filters out those paths that cross switches whose forwarding tables are already full, and picks the path whose sum of link costs is minimum. After removing the links on the primary path from s to d, Peregrine repeats the same process to calculate the backup path. Once the primary and backup paths from s to d are computed, the residual capacity of every link on these two paths is reduced by TM(s, d), and the expected load and cost of the other links are adjusted accordingly.
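The following sketch outlines this primary/backup path selection using networkx's shortest-simple-paths generator. The value of K, the cost dictionary keyed by link, and the link-disjoint (rather than fully node-disjoint) backup computation are simplifying assumptions of the sketch, not the prototype's exact algorithm.

# Minimal sketch of cost-guided primary/backup path selection (networkx-based).

from itertools import islice
import networkx as nx

K = 8   # assumed number of candidate shortest paths per host pair

def pick_path(G: nx.Graph, s: str, d: str, cost: dict, full_switches: set):
    candidates = []
    try:
        for path in islice(nx.shortest_simple_paths(G, s, d), K):
            if any(sw in full_switches for sw in path[1:-1]):
                continue                       # a forwarding table on the path is full
            links = [frozenset(e) for e in zip(path, path[1:])]
            candidates.append((sum(cost.get(e, 0.0) for e in links), path))
    except nx.NetworkXNoPath:
        return None
    return min(candidates)[1] if candidates else None

def primary_and_backup(G: nx.Graph, s: str, d: str, cost: dict, full_switches: set):
    primary = pick_path(G, s, d, cost, full_switches)
    if primary is None:
        return None, None
    H = G.copy()
    H.remove_edges_from(zip(primary, primary[1:]))   # force a link-disjoint backup
    backup = pick_path(H, s, d, cost, full_switches)
    return primary, backup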

Whenever a link experiences congestion because of traffic load fluctuations, Peregrine identifies all source-destination pairs whose primary or backup path passes through this link, deducts their measured bandwidth demands from the measured costs of the links on those primary paths, and applies the same routing algorithm to compute a new primary or backup path for each of these source-destination pairs, this time using their measured bandwidth demands.


Chapter 5

Peregrine Implementation and Performance Evaluation

5.1 Prototype Implementation

The current Peregrine implementation on the ITRI container computer, as shown in Figure 5.1, consists of a kernel agent that performs MIM encapsulation and is installed in Dom0 of every Xen physical machine, a central directory server (DS) that performs generalized IP-to-MAC address look-up, and a central route algorithm server (RAS) that constantly collects the traffic matrix, runs the load-balancing routing algorithm based on the traffic matrix, and populates the switches with the resulting routing state.

With two-stage dual-mode packet forwarding, there are up to four ways to reach a Peregrine host X:

• Route directly to X using X's primary MAC address,

• Route directly to X using X's backup MAC address,

• Route to X's primary intermediary and then to X using X's primary MAC address, and

• Route to X's backup intermediary and then to X using X's backup MAC address.

The first two possibilities exist only for those hosts that are directly reachable. Accordingly, there are four MAC addresses associated with each Peregrine host: its primary MAC address, its backup MAC address, its primary intermediary's MAC address, and its backup intermediary's MAC address. Traditionally, translating a host's IP address to its MAC address is done via the ARP protocol, which is incompatible with Peregrine's design because it relies on broadcast-based queries and unicast-based responses. Instead, Peregrine adopts a centralized directory service (DS) architecture, as shown in Figure 5.1, in which every ARP query about an IP address A is transparently intercepted by Peregrine's kernel agent and redirected to the DS, which responds with the four MAC addresses associated with A and the availability status of the four routes to reach A.

Figure 5.1: The software architecture of the current Peregrine prototype, which consists of a kernel agent installed on every physical machine, a central directory server (DS) for IP-to-MAC address look-up, and a central route algorithm server (RAS) for route computation and routing state population.

Peregrine does not require modifications to the header structure of Ethernet packets. To perform MIM encapsulation for an outgoing packet, the Peregrine agent puts the primary or backup intermediary's MAC address in the packet's destination address field and packs the MAC addresses of the sending and receiving hosts into the packet's source address field. This means that every Peregrine host's MAC address is only 24 bits long, rather than 48 bits. In addition, the MAC addresses of all VMs in a Peregrine network are centrally allocated, and every VM is assigned two MAC addresses.
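The sketch below shows one way such a header could be packed: the destination field carries the intermediary's 48-bit MAC, while the source field carries the two 24-bit host MACs back to back. The helper names and the zero-padded expansion back to 48 bits at the intermediary are assumptions of this sketch, not the agent's actual implementation.

# Minimal sketch of MIM address packing as described above.

def pack_mim_addresses(intermediary_mac: bytes, src_host: int, dst_host: int) -> bytes:
    """Return the 12 address bytes of an Ethernet header, MIM style."""
    assert len(intermediary_mac) == 6
    assert 0 <= src_host < 2**24 and 0 <= dst_host < 2**24
    dst_field = intermediary_mac                                  # stage 1: to the intermediary
    src_field = src_host.to_bytes(3, "big") + dst_host.to_bytes(3, "big")
    return dst_field + src_field

def decapsulate_at_intermediary(addresses: bytes) -> bytes:
    """Rewrite the destination with the embedded 24-bit host MAC (stage 2)."""
    src_field = addresses[6:12]
    embedded_dst = src_field[3:6]
    new_dst = b"\x00\x00\x00" + embedded_dst        # hypothetical 48-bit expansion
    new_src = b"\x00\x00\x00" + src_field[0:3]
    return new_dst + new_src

hdr = pack_mim_addresses(bytes.fromhex("020000000010"), src_host=0x000101, dst_host=0x000202)
print(hdr.hex(), decapsulate_at_intermediary(hdr).hex())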

The centralized IP-to-MAC address mapping architecture also enables Peregrine to support private IP address reuse, which allows multiple virtual data centers (VDCs) to run on a single Peregrine network while giving each VDC the same private IP address space (e.g., 10.X.X.X). When a VM in a VDC issues an ARP query for an IP address, Peregrine consults the DS using both the IP address and the ID of the VDC, which disambiguates the same IP address simultaneously used by multiple VDCs.
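A minimal sketch of the directory look-up keyed on (VDC ID, IP address) is shown below; the class and method names are illustrative, not the DS's actual interface.

# Minimal sketch of a directory-service table keyed on (VDC ID, private IP),
# which lets the same 10.x.x.x address be reused by different virtual data centers.

from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class MacBinding:
    primary_mac: str
    backup_mac: str
    primary_intermediary_mac: str
    backup_intermediary_mac: str

class DirectoryService:
    def __init__(self) -> None:
        self._table: Dict[Tuple[int, str], MacBinding] = {}

    def register(self, vdc_id: int, ip: str, binding: MacBinding) -> None:
        self._table[(vdc_id, ip)] = binding        # the VDC ID disambiguates reused IPs

    def resolve(self, vdc_id: int, ip: str) -> Optional[MacBinding]:
        return self._table.get((vdc_id, ip))

ds = DirectoryService()
ds.register(7, "10.0.0.5", MacBinding("02:00:00:00:00:01", "02:00:00:00:00:02",
                                      "02:00:00:00:00:a1", "02:00:00:00:00:a2"))
ds.register(9, "10.0.0.5", MacBinding("02:00:00:00:00:11", "02:00:00:00:00:12",
                                      "02:00:00:00:00:b1", "02:00:00:00:00:b2"))
assert ds.resolve(7, "10.0.0.5") != ds.resolve(9, "10.0.0.5")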

Figure 5.2 gives an example illustrating the MAC address look-up for two-stage dual-mode packet forwarding. When VM3 sends out an ARP query for VM6's IP address (step 1), the Peregrine agent installed in Dom0 of VM3's physical machine (PM1) intercepts the query and submits the resulting query to the directory server (DS) (step 2). The DS looks up its database and sends back the four MAC addresses associated with VM6 and their availability status (step 3) to the Peregrine agent on PM1, which creates and sends a legitimate ARP reply to VM3 and caches the reply to answer future ARP queries for VM6's IP address. Once VM3 receives VM6's MAC address, it forms the associated packet and sends it out. In Peregrine, all packets from a DomU VM pass through the Peregrine agent in Dom0 of the corresponding physical machine. For each passing packet, the Peregrine agent consults the ARP cache with the packet's destination IP address and rewrites the packet's destination MAC address field based on the look-up result. For example, in the case of Figure 5.2, VM6 can be reached in four ways: (1) Indirect Primary: the Peregrine agent on PM1 performs MIM encapsulation with the MAC address of VM6's primary intermediary and VM6's primary MAC address, and sends the packet out (step 4). When the packet arrives at VM6's primary intermediary, i.e., SW3, the switch decapsulates the MIM packet and forwards the resulting packet to VM6 (step 5). (2) Indirect Backup: everything works in the same way as the Indirect Primary case, except that it is VM6's backup intermediary, SW4, that relays the packet. (3) Direct Primary: the destination MAC address of the outgoing packet is VM6's primary MAC address. (4) Direct Backup: the destination MAC address of the outgoing packet is VM6's backup MAC address.

Figure 5.2: MAC address translation for two-stage dual-mode forwarding.

Figure 5.3 illustrates how Peregrine's fast fail-over mechanism works. Initially, VM6's primary and backup MAC addresses, mac1 and mac2, are pre-populated by the RAS on the switches along the two disjoint routes (step 1). The primary route to VM6 goes through SW2 and SW3, while the backup route goes through SW1 and SW4. Whenever a link along the primary path from VM3 to VM6 goes down, an SNMP trap is sent from the link's adjacent switch to the RAS (step 2), which determines the source-destination pairs affected by the link failure and passes this information to the DS (step 3). The DS then informs the source hosts that their associated destination hosts are reachable only via their backup MAC addresses, in this case by sending an ARP entry update to PM1 (step 4) indicating that packets from VM3 to VM6 should use mac2, VM6's backup MAC address, as the destination MAC address. After that, all packets destined to VM6 from VM3 go through the backup path (step 5).

Figure 5.3: Switching from the direct primary route to the direct backup route upon a link failure.

Upon a link/switch failure, the DS only needs to update those physical machines that currently cache ARP entries invalidated by the failure, because it keeps track of which physical machines cache which ARP entries. The DS performs these ARP cache updates using unicast. For a given VM, the number of physical machines caching its ARP entry is expected to be relatively small. Therefore the DS allocates enough space to record at most M caching machines for a given ARP entry, where M is tentatively set to 50. For a very popular VM that communicates with a large number of physical machines, a special flag is set in its ARP entry, and any modification to the entry triggers an ARP update to every physical machine.

5.2 Network Initialization

To run the Peregrine architecture on commodity Ethernet switches, these switches are required to block broadcast packets, disable unicast packet flooding, and turn off the IEEE 802.1D spanning tree protocol (STP). When the ITRI container computer starts up, the switches on its network are first put in the standard STP mode and are then configured to satisfy Peregrine's requirements. However, turning off STP on the switches one by one could easily lead to packet looping and thus broadcast storms. To solve this problem, Peregrine uses the following algorithm to convert the network from the standard STP mode to the Peregrine mode:

1. Statically configure all the switches in such a way that the switch directly attached to the Peregrine server called the RAS, which is also responsible for route computation, is the root of the initial spanning tree when the container computer network starts up.


2. Construct a list of switches by doing a breadth-first search of all the switches in the initial spanning tree, and reverse the list.

3. Starting from the RAS, visit each switch in the reversed list, turning off broadcast packet forwarding, unicast packet flooding, and STP on it, and populating its static forwarding table with the results of the load-balancing routing algorithm.

Essentially, this algorithm starts the switch reconfiguration process from the leaves of the initial spanning tree, which is rooted at the switch attached to the RAS, and ensures that all switches not yet reconfigured remain reachable through the initial spanning tree.
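A minimal sketch of this conversion order is given below; the tree representation and the per-switch reconfiguration placeholder are assumptions, not the production tooling.

# Minimal sketch of the STP-to-Peregrine conversion order: BFS the initial
# spanning tree from the switch next to the RAS, then reconfigure switches in
# reverse BFS order so every not-yet-converted switch stays reachable.

from collections import deque
from typing import Dict, List

def conversion_order(tree: Dict[str, List[str]], root: str) -> List[str]:
    """tree maps each switch to its children in the initial spanning tree."""
    order, queue = [], deque([root])
    while queue:
        sw = queue.popleft()
        order.append(sw)
        queue.extend(tree.get(sw, []))
    return list(reversed(order))           # leaves first, RAS-side switch last

def convert(switch: str) -> None:
    # Placeholder for the per-switch steps described above: disable broadcast
    # forwarding, disable unicast flooding, turn off STP, install static routes.
    print(f"reconfiguring {switch}")

spanning_tree = {"root": ["tor1", "tor2"], "tor1": ["leaf1"], "tor2": []}
for sw in conversion_order(spanning_tree, "root"):
    convert(sw)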

We used two racks in the ITRI container computer as the evaluation testbed for the Peregrine prototype. The testbed consists of four 48-port TOR switches, each with four 10GE uplinks, two 48-port 10GE regional switches, and 48 physical machines. Each physical machine is equipped with eight 2.53GHz Intel Xeon CPU cores, 40GB of DRAM, and four GE NICs, and runs CentOS 5.5 with Linux kernel 2.6.18. Two physical machines are used to deploy the RAS and the DS. The MIM kernel agent is installed on all other physical machines. Each physical machine is connected to four TOR switches via a separate 1GE NIC, and each TOR switch in turn is connected to the regional switches via separate 10GE links. No firmware modifications are required on the regional or TOR switches.

5.3 Effectiveness of Load Balancing Routing

We used a simulation approach to evaluate the effectiveness of Peregrine's load-balancing routing algorithm. The simulated test network spans 52 physical machines with 384 links. Each physical machine is connected to four TOR switches via a separate 1GE NIC, and each TOR switch in turn is connected to four regional switches via a separate 10GE link. To derive realistic input network traffic loads, we started with packet traces collected from the Lawrence Berkeley National Lab campus network [18]. Each packet trace spans a period of 300 to 1800 seconds from different subnets, with a total of around 9000 end hosts. We assumed each packet trace represents a VM-to-VM traffic matrix in a virtual data center, with the VMs assigned to PMs randomly. Because the ITRI container computer is designed to support multiple virtual data centers running concurrently, we created multiple multi-VDC traffic matrices, each constructed by randomly combining five VM-to-VM traffic matrices into one. In total, 17 multi-VDC 300-second traces were created and replayed on the simulated network.

Given a multi-VDC packet trace, we used the first half of the trace to derive its traffic matrix and compute routes for the communicating physical machines, and we replayed the second half of the trace on the simulated network using the resulting routes. The metric used to measure the effectiveness of a routing algorithm is the congestion count, Nc, during the trace replay period. For every second of the input trace, we placed the load of every communicating host pair during that second on the links along the pair's route in the simulated network. During this replay process, whenever a host pair's load is placed on a link whose capacity (Mbits/sec) is already exceeded, the congestion count is incremented by one. Figure 5.4 compares the congestion counts of the full link criticality-based routing (FLCR) algorithm, which is load-aware, and the random shortest-path routing (RSPR) algorithm, which is load-insensitive, using the 17 multi-VDC packet traces described above as inputs. These two algorithms represent the two extremes of Peregrine's routing algorithm (Section 4.5): FLCR corresponds to setting Z to 100, whereas RSPR corresponds to setting Z to 0.
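A minimal sketch of the congestion-count computation is shown below; the input data structures are assumptions used only to make the metric precise, not the simulator's actual code.

# Minimal sketch of the congestion count metric Nc: replay per-second loads of
# each communicating host pair onto the links of its route and count every
# placement that lands on a link whose capacity is already exceeded.

from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

def congestion_count(per_second_loads: List[Dict[Pair, float]],
                     routes: Dict[Pair, List[str]],
                     capacity_mbps: Dict[str, float]) -> int:
    nc = 0
    for second in per_second_loads:                    # one dict per trace second
        used = defaultdict(float)                      # Mbits placed on each link
        for pair, load in second.items():
            for link in routes[pair]:
                if used[link] > capacity_mbps[link]:   # link already over capacity
                    nc += 1
                used[link] += load
    return nc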

Figure 5.4: Congestion count (left Y axis) and additional traffic load (right Y axis) comparison between the full link criticality-based routing algorithm and the random shortest-path routing algorithm, using multiple multi-VDC packet traces as inputs.

Figure 5.5: Congestion count ratio between full link criticality-based routing and random shortest-path routing under different degrees of skewedness in the input traffic matrix.



As expected, FLCR outperforms RSPR in all 17 traces, because the former strives to avoid congested links under the guidance of link criticality and expected link load. In contrast, RSPR relies only on randomization to avoid congestion and is thus less effective. The price FLCR pays for avoiding congestion is that the paths it produces tend to be longer and have a larger hop count than those produced by RSPR. As a result, the total traffic load injected by FLCR tends to be higher than that injected by RSPR. Fortunately, the percentage of additional traffic load due to longer paths is insignificant, around 0.5%.

To explain why the effectiveness difference between FLCR and RSPR varies with the input traces, we measured the concentration percentage of each input trace, which is the percentage of top heavy-traffic host pairs that account for 90% of the total traffic volume in the trace, and correlated this percentage with the routing effectiveness difference, represented by the ratio of the congestion counts (Nc) of FLCR and RSPR, for all 17 input traces. As shown by the solid curve in Figure 5.5, when an input trace has a lower concentration percentage, the congestion count ratio tends to be lower, indicating that the routing-effectiveness gap between FLCR and RSPR is larger. This is because a lower concentration percentage means a higher degree of skewedness in the input workload, and the advantage of FLCR over RSPR is more pronounced when the input load is more skewed.

The complexity of full link criticality-based routing is O(L * P), where L is the number of physical network links and P is the number of PM pairs. From the multi-VDC traces, we found that most of the entries in their traffic matrices are insignificantly small; e.g., the traffic loads of fewer than 5% of the host pairs account for more than 90% of the total traffic volume, as shown in Figure 5.5. The solid curve corresponds to FLCR (Z = 100), whereas the dotted curve corresponds to applying link criticality-based routing only to the top heavy-traffic host pairs responsible for 90% of the total traffic volume, i.e., Z = 90. The difference between the two curves is very small, indicating that the two configurations have similar routing effectiveness, although the Z = 90 case requires much less route computation time than the Z = 100 case. More concretely, the number of host pairs in a 500-server network to which link criticality-based routing is applied is reduced from 250K when Z = 100 to 12.5K when Z = 90. In our current implementation, the route computation for 12.5K host pairs takes about 10 minutes.

5.4 Packet Forwarding Performance

One concern with the Peregrine architecture is the DS throughput required to handle ARP requests from up to a million hosts. One study based on traffic collected from 2,456 hosts [15] showed that on average 89 ARP queries are issued per second. A simple extrapolation suggests that around 36K ARP queries per second are expected in a data center consisting of one million hosts. This requirement is well below the measured throughput of the current DS implementation, 100K ARP queries per second. Assume a physical machine caches ARP entries for one minute and hosts 20 VMs, and each VM communicates continuously with 100 other distinct VMs. Under this assumption, every PM generates about 2,000 ARP requests per minute, which corresponds to roughly 30 ARP queries per second directed to the DS. Since the DS can handle 100K ARP queries per second, we expect the current design to support up to about 3.3K physical machines.
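These extrapolations can be checked directly from the numbers quoted above (the report rounds the per-PM rate down to 30 ARP queries per second):

\[
\frac{89}{2456}\times 10^{6} \approx 3.6\times 10^{4}\ \text{ARP/s},
\qquad
\frac{20\times 100}{60\ \text{s}} \approx 33\ \text{ARP/s per PM},
\qquad
\frac{10^{5}}{30} \approx 3.3\times 10^{3}\ \text{PMs}.
\]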

Encapsulation and Decapsulation Overhead: Another concern about the Peregrine architecture is the overhead of packet encapsulation and decapsulation. We took two physical machines, installed the MIM agent on them, ran four virtual machines on each, and established four TCP connections between four pairs of VMs. We measured the throughputs of these TCP connections with and without MIM encapsulation and decapsulation. The measured throughputs with MIM turned on are only about 0.5% lower than those without MIM. We further measured the latency of packet encapsulation and decapsulation inside the kernel. Under the same configuration, packet decapsulation took less than 1 usec (99th percentile), while packet encapsulation took 4 usec (99th percentile), because it requires looking up the ARP cache. For comparison, we also implemented the MIM decapsulation engine in a commodity Ethernet switch; its packet decapsulation throughput is disappointingly low, around 100 packets/sec, because MIM decapsulation takes place in the control processor. This is why we chose to perform packet decapsulation on the physical machines rather than on the switches.

5.5 Fail-over Latency

To measure the fail-over latency, we measured the service disruption time of a UDP connection running between two physical machines of the evaluation testbed when one of its underlying links fails. The sender of this UDP connection sends one packet every msec to the receiver across a TOR switch. We then counted the number of packets lost due to a link failure and further broke down the total service disruption time into the following four components: (1) A neighboring switch of the failed link detects the link failure by polling its local interfaces and sends out an SNMP trap to the RAS.


Step    1. switch    2. RAS    3. DS    4. PM    Total
Time    20-80 ms     < 20 ms   < 5 ms   < 2 ms   < 77 ms

Table 5.1: The breakdown of the fail-over latency of a single link failure in the evaluation testbed

[Figure 5.6 plot: x-axis, number of affected end hosts (0 to 15,000); y-axis, time in seconds (0 to 5).]

Figure 5.6: ARP cache update latency increases linearly with the number of physical machines whose ARP cache needs to be updated.

(2) The RAS processes the link failure event to identify the affected destination hosts and passes them to the DS. (3) The DS updates its ARP database for these affected hosts and sends out ARP cache updates for them to those physical machines that communicate with these hosts. (4) The MIM agent on a physical machine updates its ARP cache upon receiving such an ARP cache update message. Table 5.1 shows the average time spent in each step over twenty failure runs in which the number of affected host pairs is fewer than 10. The upshot is that the average fail-over latency of the Peregrine prototype is around 77 msec.
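
For concreteness, steps (2) through (4) can be summarized with the following self-contained sketch; every name and data structure in it is an assumption made for illustration, not the RAS/DS implementation.

# Hypothetical outline of the failure-handling flow (illustration only).
routes_over_link = {("sw3", "sw7"): ["vm-a", "vm-b"]}         # link -> affected hosts
backup_mac       = {"vm-a": "02:00:00:00:00:0a",              # host -> outer MAC of
                    "vm-b": "02:00:00:00:00:0b"}              # its backup route
pms_caching_host = {"vm-a": ["pm1", "pm2"], "vm-b": ["pm2"]}  # host -> PMs caching it

def handle_link_failure(failed_link, arp_db, send_update):
    # Step (2): the RAS maps the failed link to the affected destination hosts.
    affected = routes_over_link.get(failed_link, [])
    # Step (3a): the DS switches the affected hosts to their backup routes.
    for host in affected:
        arp_db[host] = backup_mac[host]
    # Step (3b): ARP cache updates are pushed to each caching PM in sequence,
    # which is why this step grows with the number of PMs (Figure 5.6).
    for host in affected:
        for pm in pms_caching_host.get(host, []):
            send_update(pm, host, arp_db[host])    # step (4) runs on the PM

# Example invocation:
arp_db = {}
handle_link_failure(("sw3", "sw7"), arp_db,
                    lambda pm, host, mac: print(f"update {pm}: {host} -> {mac}"))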

The only time-varying step in the fail-over latency is step (3), in which the DS sends out ARP cache updates to all physical machines caching MAC addresses of the hosts affected by the link failure. Figure 5.6 shows that the time taken by the DS to send out these ARP cache updates increases linearly with the number of physical machines caching MAC addresses of affected hosts, because the DS sends them out in sequence. Even when the number of physical machines whose ARP cache needs to be updated is 1000, the total fail-over latency increases only by an additional 300 msec.
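
Putting the fixed and variable parts together, a simple first-order model of the fail-over latency is a constant for steps (1) through (4) plus a per-machine cost for the sequential updates; the constants below are only our reading of Table 5.1 and Figure 5.6, not parameters taken from the prototype.

# First-order fail-over latency model (assumed parameterization).
FIXED_MS  = 77.0     # average of steps (1)-(4) when few hosts are affected
PER_PM_MS = 0.3      # ~300 ms extra for 1000 PMs in Figure 5.6

def failover_latency_ms(num_pms_to_update: int) -> float:
    # Fixed detection/processing cost plus the sequential ARP-cache update
    # cost, which grows linearly with the number of PMs to update.
    return FIXED_MS + PER_PM_MS * num_pms_to_update

# Example: failover_latency_ms(1000) is roughly 377 ms, consistent with the
# reported additional 300 msec for 1000 physical machines.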

5.6 Conclusion

Recognizing that the internal fabric of a container computer does not need to be compatible with other legacy IT infrastructures, the designers of the ITRI container computer devised an innovative data center network architecture called Peregrine, which employs only commodity Ethernet switches as dumb packet forwarding engines but removes most of the control-plane functionalities of the traditional Ethernet architecture, such as spanning tree, source learning, flooding, and broadcast-based ARP query, and centralizes the address look-up, routing, and fast fail-over intelligence on dedicated servers. We have completed a fully operational Peregrine prototype, presented its design and implementation in this report, and demonstrated the effectiveness and efficiency of the Peregrine architecture using simulation and measurement results.

We are currently working on improvements to the Peregrine prototype, including stress-testing the prototype's robustness and scalability on a fully populated container computer and on a multi-container computer set-up, extending the DS to a distributed cluster implementation to enhance its scalability and availability, and porting Peregrine (particularly its packet decapsulation logic) to switches supporting the OpenFlow standard to further increase the number of end points that a single Peregrine network can span.


Bibliography

[1] The Juniper Networks QFabric architecture: A revolution in data center network design: Flattening the data center architecture. 2011.

[2] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM 2008 conference on Data communication, pages 63–74. ACM, 2008.

[3] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In Proceedings of the 7th USENIX conference on Networked systems design and implementation, pages 19–19. USENIX Association, 2010.

[4] P. Bottorff and S. Haddock. IEEE 802.1ah Provider Backbone Bridges, 2007.

[5] C. Clos. A study of non-blocking switching networks. Bell System Technical Journal, 32(2):406–424, 1953.

[6] K. Gopalan, T. Chiueh, and Y. Lin. Network-wide load balancing routing with performance guarantees. In IEEE International Conference on Communications (ICC '06), volume 2, pages 943–948. IEEE, 2006.

[7] A. Greenberg, J. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. ACM SIGCOMM Computer Communication Review, 39(4):51–62, 2009.

[8] A. Greenberg, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta. Towards a next generation data center architecture: Scalability and commoditization. In Proceedings of the ACM workshop on Programmable routers for extensible services of tomorrow, pages 57–62. ACM, 2008.

[9] C. Hopps. Analysis of an equal-cost multi-path algorithm. RFC 2992, 2000.

[10] A. Inc. Compare and Contrast SPB and TRILL. 2010.

[11] Cisco Systems, Inc. Scaling data centers with FabricPath and the Cisco FabricPath switching system.


[12] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of data center traffic: Measurements & analysis. In Proceedings of the 9th ACM SIGCOMM conference on Internet measurement, pages 202–208. ACM, 2009.

[13] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review, 38(2):69–74, 2008.

[14] A. Myers, E. Ng, and H. Zhang. Rethinking the service model: Scaling Ethernet to a million nodes. In Proc. HotNets. Citeseer, 2004.

[15] A. Myers, E. Ng, and H. Zhang. Rethinking the service model: Scaling Ethernet to a million nodes. In Proc. HotNets. Citeseer, 2004.

[16] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: A scalable fault-tolerant layer 2 data center network fabric. ACM SIGCOMM Computer Communication Review, 39(4):39–50, 2009.

[17] P. Oppenheimer. Top-down network design. Cisco Press, 2004.

[18] R. Pang, M. Allman, M. Bennett, J. Lee, V. Paxson, and B. Tierney. A first look at modern enterprise traffic. In Proceedings of the 5th ACM SIGCOMM conference on Internet Measurement, pages 2–2. USENIX Association, 2005.

[19] C. Perkins, S. Alpert, and B. Woolf. Mobile IP: Design Principles and Practices. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997.

[20] R. Perlman. RBridges: Transparent routing. In INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 2, pages 1211–1218. IEEE, 2004.

[21] D. Plummer. An Ethernet Address Resolution Protocol. RFC 826, 1982.

[22] S. Sharma, K. Gopalan, S. Nanda, and T. Chiueh. Viking: A multi-spanning-tree Ethernet architecture for metropolitan area and cluster networks. In INFOCOM 2004, Twenty-third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 4, pages 2283–2294. IEEE, 2004.

[23] C. Sturdevant. Cisco debuts FabricPath. eWeek, 27(14):34–34, 2010.

[24] D. Thaler and C. Hopps. Multipath issues in unicast and multicast next-hop selection. RFC 2991, 2000.


[25] J. Touch and R. Perlman. Transparent Interconnection of Lots of Links (TRILL): Problem and Applicability Statement. RFC 5556, 2009.

[26] A. Vahdat, M. Al-Fares, N. Farrington, R. Mysore, G. Porter, and S. Radhakrishnan. Scale-out networking in the data center. IEEE Micro, 30(4):29–41, 2010.
