

TrueFabric™: A Fundamental Advance to the State of the Art in Data Center Networks

TrueFabric is an open standards-based network technology embedded in the Fungible DPU™. It is designed to improve the performance, economics, reliability and security of data centers built on scale-out principles, and to do so at scales from a few racks to many thousands of racks, all within a single building. It exhibits a set of eight essential properties: scalability across many orders of magnitude, full any-to-any cross-sectional bandwidth, low predictable latency, fairness, congestion avoidance, fault tolerance, end-to-end software defined security, and the use of open standards to deliver excellent economics. No existing technology provides all these properties in a single implementation. TrueFabric therefore opens the door to a single universal data center network that can be used for all networking tasks within a data center, maximizing the value of the network and enabling a truly general and powerful computing infrastructure.

Introduction

Scale-out architectures have been used in data centers for almost twenty years. The advantages of these architectures were described in a seminal paper by Barroso, Clidaras, and Hölzle titled “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines”. Today, there is no doubt that in time virtually all data centers, whether small, medium, or large and whether public or private, will be built in this way. Fundamental to the design of scale-out data centers is the idea that all server nodes in the data center are connected by a reliable, high-performance local area network. This permits the services offered by the data center to be implemented in such a way that the loss of individual servers due to failures or planned events does not compromise the services themselves.

While there are niche network technologies such as InfiniBand, Fibre Channel, and RoCE that lay claim to providing some of the properties of a data center fabric, most scale-out data centers today use TCP/IP over Ethernet as the de facto fabric technology. TCP/IP was invented in the 1970s to solve the problem of connecting computers at worldwide scale over a wide-area network built from diverse network technologies. It was subsequently used to provide connectivity between servers inside a data center, and to do so at unprecedented scale. Given that the physical parameters of a local-area network are some three orders of magnitude or more faster than those of the wide-area Internet, it is remarkable that TCP/IP worked at all, let alone held the fort for over twenty years. A partial explanation is that the TCP software stack has been heavily optimized to keep up with increasing demands on the network. Parallel attempts to “offload” some TCP functions were not very successful because of the difficulty of splitting TCP cleanly between CPUs and offload engines.

Over the last decade, the performance of Ethernet interfaces and SSD devices has improved much faster than that of general-purpose CPUs. This is significant because the vast majority of TCP implementations run in software on such CPUs, and as such they are not keeping up with performance demands. On the application front, the use of a “microservices” architecture that makes pervasive use of remote procedure calls across nodes is now standard practice. Additionally, many new applications require access to very large datasets that must be “sharded”, or spread, across server nodes. Both developments on the application front increase the volume of network traffic inside data centers.

About the Author

Pradeep Sindhu Founder and CEO

Pradeep is the Co-Founder and CEO of Fungible.

Pradeep founded Juniper Networks in February 1996, where he has held several key roles in shaping the company over the years, including being the first CEO and Chairman, then becoming Vice Chairman and CTO, and now Chief Scientist.

Pradeep had a hand in the inception, design and development of virtually every product Juniper shipped from 1996 through 2015. Before founding Juniper, Pradeep worked at the Computer Science Lab at Xerox PARC for 11 years. During this period he invented the first cache coherency algorithms for packet-switched buses and made fundamental contributions to Sun Microsystems’ high-performance multiprocessor servers.

Pradeep holds a Bachelor’s in Electrical Engineering from the Indian Institute of Technology in Kanpur, as well as a Master’s in Electrical Engineering from the University of Hawaii. In addition, Pradeep holds both a Master’s and a Doctorate in Computer Science from Carnegie Mellon University.


Collectively, the developments on the technology and application fronts have conspired to place tremendous stress on software implementations of network stacks in general, and of TCP/IP over Ethernet in particular. As a result, a significant fraction of general-purpose CPU compute power is spent on interactions with the network, leaving less available for applications. We are now at a point where the approach of building everything on top of a software implementation of TCP/IP is increasingly untenable.

These developments set the stage for the introduction of TrueFabric as a single standards-based network technology to power data centers past the scale-out era into the data-centric era.

Terminology

Before going into the details of TrueFabric, it is helpful to define some basic terms. To begin with, we use the term Fabric in this paper in a more precise manner than is standard industry practice. We use it to refer to interconnect technologies that satisfy a set of minimum requirements: scalability; full any-to-any cross-sectional bandwidth; fairness; low, predictable latency; congestion avoidance; and error control. These minimum requirements are fundamental and unavoidable when building scale-out infrastructure under existing technology constraints, namely that the performance of individual basic building blocks like CPUs, memory, and IO interfaces cannot be scaled up indefinitely. Our use of the trademarked term TrueFabric is an acknowledgment that the word “fabric” has become meaningless through overuse. Nevertheless, in this paper we will use the capitalized term Fabric in the precise manner described above and the trademarked term TrueFabric to refer specifically to Fungible’s technology.

An Ideal Fabric is an interconnect technology that has infinite cross-sectional bandwidth and zero node-to-node latency. However desirable, such a Fabric is clearly not realizable. A Realizable Fabric is one that can be implemented at reasonable cost while providing the defining properties of a Fabric. In a Realizable Fabric, each node is connected at a fixed bandwidth that is independent of the number of nodes in the Fabric (a property called scalability). The Fabric’s cross-sectional bandwidth is finite but scales with the number of nodes, and the latency is the smallest permissible by the laws of physics.

It is also helpful to distinguish between interconnect technologies that implement a memory model from those that implement a network model. The former are used to connect processors to memory using read and write primitives and possibly cache-coherency primitives; the bandwidth and latency requirements for these interconnects are extremely demanding but the scale is necessarily limited. Network interconnects are used to connect whole server nodes to each other using network primitives like send and receive; the bandwidth and latency requirements of these network interconnects are more relaxed, but the scale can be extremely large, encompassing perhaps as many as a million nodes.

Using the above terminology, TrueFabric is a Realizable Fabric that implements a network model. Beyond incorporating the defining properties of a Realizable Fabric, it implements two additional properties that are highly desirable: it is built on open standards and it supports strong security. We note in passing that although TrueFabric is a network interconnect, its performance is more than good enough to implement a memory model on top of it using RDMA. All the properties of TrueFabric are implemented using a novel Fabric Control Protocol (FCP) built on top of standard UDP/IP over Ethernet.
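Because FCP rides on standard UDP/IP, fabric control traffic looks like ordinary datagrams to the switches in between. The sketch below illustrates only that layering; the header fields (flow id, sequence number, credit) are hypothetical placeholders, not the actual FCP wire format.

```python
# Illustrative only: FCP's real wire format is not public. This sketch simply
# shows that a fabric control protocol can ride inside ordinary UDP datagrams,
# which any standard IP/Ethernet switch forwards unmodified.
import socket
import struct

# Hypothetical header: flow id, sequence number, and a credit field that a
# receiver-driven protocol might use to pace senders (assumed fields).
FABRIC_HDR = struct.Struct("!IQH")  # 4-byte flow id, 8-byte seq, 2-byte credit

def build_fabric_datagram(flow_id: int, seq: int, credit: int, payload: bytes) -> bytes:
    return FABRIC_HDR.pack(flow_id, seq, credit) + payload

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
datagram = build_fabric_datagram(flow_id=7, seq=42, credit=16, payload=b"hello fabric")
# Destination is a placeholder address/port; UDP sendto succeeds even with no listener.
sock.sendto(datagram, ("127.0.0.1", 40000))
sock.close()
```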

For small to medium scales, TrueFabric uses a single tier of standard IP/Ethernet Spine switches, with server nodes connecting directly to the Spine switches. For large scales, it uses a two-tier topology consisting of standard IP/Ethernet Spine switches and standard IP/Ethernet Leaf switches located at the top of each rack; in this case server nodes connect to the Leaf switches. Although a greenfield TrueFabric deployment does not need more than two tiers of switches even at the largest scales, FCP can run perfectly well over existing networks containing three or more tiers of switches.
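To give a rough sense of the two topologies, the sketch below sizes a single-tier and a two-tier deployment using the same 6.4Tbps (64x100GE) switch building block that appears in the Figure 3 test setup; the specific radix is an assumption made for illustration.

```python
# A rough sizing sketch, assuming a 6.4 Tbps (64 x 100GE) switch building block
# like the one used in the Figure 3 test topology; real deployments vary.
PORTS = 64                                    # 100GE ports per switch

# Single tier: server nodes plug directly into Spine ports, so one Spine
# terminates at most 64 x 100GE of server connectivity.
dpus_single_spine = PORTS // 2                # 32 DPUs at 2 x 100GE each
racks_single_spine = dpus_single_spine // 16  # about 2 racks of 16 DPUs

# Two tiers: each Leaf (TOR) splits its 64 ports into 32 server-facing ports
# (16 DPUs at 2 x 100GE) and 32 uplinks, one to each of 32 Spines; each Spine
# then offers one port per Leaf, i.e. up to 64 Leafs.
dpus_per_rack = 16
leafs = PORTS                                 # 64 racks
dpus_two_tier = leafs * dpus_per_rack         # 1024 DPUs, as in Figure 3

print(f"single tier (one Spine): {dpus_single_spine} DPUs (~{racks_single_spine} racks)")
print(f"two tiers (Leaf/Spine) : {dpus_two_tier} DPUs across {leafs} racks")
```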

In what follows, we will refer to the Leaf, Spine, and any higher switching layers collectively as the network core. The network edge is the endpoint of TrueFabric, implemented inside the Fungible DPU located inside server nodes1. TrueFabric is the combination of the network core and the network edge.

Figure 1 below shows an abstract view of a TrueFabric deployment. There are multiple instances of four server types: CPU servers, AI/Analytics servers, SSD servers, and HDD servers. Each server instance contains a Fungible DPU, which connects to the network at a fixed bandwidth, say 100GE. Even though each DPU has only a single 100GE interface connecting it to the network core, TrueFabric makes it appear as if there is a dedicated 100GE link from each DPU to every other DPU, even at the largest scales of deployment. In fact, there is no experiment a server could conduct that would reveal the network core to be any different from the full mesh shown in the abstract picture.

Figure 1: An abstract view of TrueFabric
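To put a number on the abstraction in Figure 1: a literal full mesh among N servers would require on the order of N² dedicated links, whereas each DPU attaches through a single fixed-bandwidth port. A quick count for an illustrative 1024-DPU deployment:

```python
# Counting what the "virtual full mesh" replaces: dedicated point-to-point links
# for every DPU pair versus one physical 100GE attachment per DPU. The cluster
# size is illustrative (it matches the 1024-DPU test setup used later).
N = 1024
dedicated_links = N * (N - 1) // 2    # 523,776 pairwise links in a literal mesh
physical_ports = N                    # one fixed-bandwidth port per DPU
print(f"{dedicated_links:,} pairwise links emulated over {physical_ports:,} DPU ports")
```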

1 See the companion paper, The Fungible DPU™: A New Category of Microprocessor.


Properties of TrueFabric

TrueFabric exhibits eight properties fundamental to building modern scale-out data centers:

• Scalability: TrueFabric can scale from a small deployment of a few racks of 100GE-connected servers to massive deployments of hundreds of thousands of servers, each using 200GE-400GE interfaces. All deployments use the same interconnect topology, with small to medium deployments using a single layer of Spine switches and larger deployments using a Spine layer and a Leaf layer. A deployment can be incrementally expanded without bringing down the network, for true always-on operation.

• Full Cross-Sectional Bandwidth: TrueFabric supports full cross-sectional bandwidth from any node to any node for standard IP over Ethernet packet sizes, with no constraints on the temporal or spatial characteristics of the traffic being carried. Crucially, TrueFabric enables efficient interchange of short, low-latency messages to support highly interactive communication between server nodes. This type of communication is simply not possible with TCP, which is a byte-stream protocol. On the other hand, TCP can be implemented very efficiently on top of FCP.

• Low Latency and Low Jitter: TrueFabric provides minimal end-to-end latency between nodes and very tight control of tail latency. Minimal latency between nodes means that traffic always uses the shortest path between any two nodes. Very tight control of tail latency means that the P99 latency rarely exceeds 1.5 times the mean latency, even at offered loads exceeding 90% utilization.

• Fairness: Network bandwidth is allocated fairly across contending nodes at microsecond granularity. Furthermore, network bandwidth is also allocated fairly across the flows generated by a given node. The bandwidth allocation respects standard quality-of-service levels for IP packets.

• Congestion Avoidance: TrueFabric has built-in active congestion avoidance, which means that packets are essentially never lost due to congestion, even when operating at very high offered loads (> 90%). Notably, the congestion avoidance technique does not depend on the core network switches providing any features related to congestion control.

• Fault Tolerance: TrueFabric has built-in detection of and recovery from packet loss due to any type of network failure, including but not limited to cable cuts, switch failures due to hardware or software faults, transient errors that cause packet drops inside switches or at endpoints, transient or permanent failures of optoelectronics, and unavoidable random noise on copper or optical cables. FCP's error recovery is five orders of magnitude faster than traditional recovery techniques that depend on routing protocols.

• Software Defined Security and Policy: TrueFabric supports end-to-end encryption based on the AES standard. Additionally, a given deployment can be partitioned through software configuration into separate encrypted domains, each of which provides any-to-any connectivity for its nodes but forbids traffic from the nodes of one domain to the nodes of another.

• Open Standards: TrueFabric's FCP is built on top of standard IP over Ethernet and is fully interoperable with standard TCP/IP over Ethernet. This permits off-the-shelf Spine and TOR switches to be used, and also enables brownfield deployments where some server nodes have DPUs while others do not. Open standards also allow TrueFabric to deliver excellent economics.

Collectively, these eight properties enable TrueFabric to make substantial improvements to the performance, economics, reliability, and security of scale-out data centers. Their implementation in a single technology represents a fundamental advance to the state of the art in data center networks.

The performance and economic improvements come in part from the network itself, which has over 3X better price-performance than existing networks, owing principally to its ability to run at much higher utilization while still providing excellent latency. While this is a huge improvement, the network typically represents only a fraction (~15%) of the spend in a data center, so data center operators tend to discount even this large improvement. They do so to their own detriment: the vast majority of the performance and economic improvements due to TrueFabric come not from the direct economic benefit to the network, but from the indirect benefit of enabling compute and storage resources to be pooled efficiently across the entire data center, and from relieving CPUs of the network burden. TrueFabric enables virtually all resources in the data center to be disaggregated efficiently, as shown in Figure 2 below.

Figure 2: TrueFabric enables efficient disaggregation of virtually all data center resources.


We call the ability to efficiently disaggregate most data center resources hyperdisaggregation, in contrast to hyperconvergence, an approach that locates resources inside a single type of server. In the hyperconverged approach, the CPUs inside a node can use local resources effectively, but these resources cannot be pooled across server nodes. As a result, resource usage is substantially less efficient. We estimate this loss of efficiency to be more than 4X by comparing the average utilization of enterprise data centers, which operate below 8% utilization as a consequence of resource stranding, with that of hyperscale data centers, which use partial disaggregation to reach utilizations over 30%.

The reliability and security improvements to a data center are a direct result of several of the properties listed above. First, TrueFabric makes fundamental improvements to the reliability of the data center network: it removes congestion as a source of packet loss by avoiding congestion altogether rather than reacting to it after the fact, and it recovers from all sources of network hardware and software failure, including multiple failures, ensuring reliable operation without the cost typically associated with this level of reliability. As a consequence, TrueFabric also fundamentally improves the overall reliability of the services offered by a data center2. Second, the performance characteristics of TrueFabric mean that erasure coding can be used pervasively to protect all stored data, especially hot data in high-performance storage. This allows dramatic improvements to the reliability of stored data without incurring the cost associated with making multiple copies.
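The erasure-coding point rests on a standard capacity trade-off: coded fragments protect data with far less raw storage than full copies. The parameters below are illustrative examples, not Fungible's configuration.

```python
# Illustrative durability-overhead comparison (parameters are examples, not
# Fungible's): triple replication stores 3 bytes for every user byte, while a
# k+m erasure code stores (k+m)/k bytes and still tolerates m lost fragments.
def replication_overhead(copies: int) -> float:
    return float(copies)

def erasure_overhead(k: int, m: int) -> float:
    return (k + m) / k

schemes = {
    "3-way replication": replication_overhead(3),   # tolerates 2 losses, 3.0x raw
    "EC 4+2":            erasure_overhead(4, 2),     # tolerates 2 losses, 1.5x raw
    "EC 8+2":            erasure_overhead(8, 2),     # tolerates 2 losses, 1.25x raw
}
for name, overhead in schemes.items():
    print(f"{name:>18}: {overhead:.2f}x raw capacity per user byte")
```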

On the security front, TrueFabric supports end-to-end encryption of all DPU-to-DPU traffic in a data center. Further, DPU-enabled servers can be divided under software control into disjoint subsets, where each subset forms a separate encrypted security domain. This capability provides the strongest possible security short of a physical air gap between sets of servers.
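A minimal sketch of the domain model just described, with hypothetical node names and a toy policy check (the actual configuration interface is not described in this paper): nodes are assigned to disjoint domains, and traffic is permitted only within a domain.

```python
# Hypothetical sketch of disjoint encrypted security domains: any-to-any inside
# a domain, nothing across domains. Names and checks are illustrative only.
from typing import Dict

domain_of: Dict[str, str] = {
    # node id -> security domain, assigned by software configuration
    "dpu-0001": "tenant-a",
    "dpu-0002": "tenant-a",
    "dpu-0101": "tenant-b",
}

def traffic_allowed(src: str, dst: str) -> bool:
    """Allow traffic only when both endpoints belong to the same domain."""
    return domain_of.get(src) is not None and domain_of.get(src) == domain_of.get(dst)

assert traffic_allowed("dpu-0001", "dpu-0002")       # same domain: permitted
assert not traffic_allowed("dpu-0001", "dpu-0101")   # cross-domain: forbidden
```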

Performance Characteristics

In this section we present the performance characteristics of TrueFabric, with a focus on scenarios where there is significant network congestion. These scenarios are not handled well by existing technologies.

Fabric Latency Under High Loads

Figure 3 shows the simulation setup used for testing different traffic patterns under heavily loaded network conditions. The setup contains 64 identical racks, each consisting of 16 DPUs connected to a 6.4Tbps TOR. The TOR is configured with 32x100GE links connecting to the DPUs (2x100GE per DPU) and 32x100GE links connecting to 32 Spines, one 100GE link per Spine; in other words, both the TOR and the Spine layers are non-oversubscribed.

We measured the end-to-end one-way latency, from the input ports of the network unit inside the sending DPUs to the output ports of the receiving DPUs, under three separate traffic scenarios (a short sketch of the corresponding destination patterns follows the list):

• One-to-one: each DPU sends its packets to another unique DPU. There is no congestion in this scenario. Packet sizes are picked from the industry-standard IMIX profile.

• Random destination: each DPU sends each of its packets to a random DPU picked from the set of 1024 DPUs. In this case, all packets are chosen to be the same size (1KB) to equalize the load at destination DPUs.

• Maximum Incast: all 1024 DPUs send to one victim DPU; this is the acid test for congestion. Packet sizes are picked from the industry-standard IMIX profile.
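The destination choices for the three scenarios can be written down directly. The specific one-to-one pairing below is an illustrative choice, since any self-avoiding one-to-one mapping gives the same congestion-free pattern; packet sizing (IMIX or fixed 1KB) is not modeled here.

```python
# Destination generators for the three traffic scenarios over 1024 DPUs.
import random

N = 1024

def one_to_one(src: int) -> int:
    return (src + 1) % N            # a simple fixed permutation, no self-traffic

def random_destination(src: int, rng: random.Random) -> int:
    dst = rng.randrange(N)
    while dst == src:               # avoid sending to oneself
        dst = rng.randrange(N)
    return dst

def maximum_incast(src: int, victim: int = 0) -> int:
    return victim                   # every sender targets the same victim DPU

rng = random.Random(0)
print("one-to-one     :", [one_to_one(s) for s in range(4)])
print("random         :", [random_destination(s, rng) for s in range(4)])
print("maximum incast :", [maximum_incast(s) for s in range(4)])
```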


Figure 3: Physical topology for testing latency under high network loads

The table below shows the fabric utilization, the mean latency, the latency variance, and the P99 latency for each of the three scenarios.

Traffic Pattern            Fabric Utilization   Latency Mean   Latency Variance   Latency P99
1024 * (Node to Node)      90.7%                1.84µs         0.13µs             2.14µs
1024 Node to 1024 Node     93%                  2.10µs         0.32µs             3.3µs
1024 Nodes to 1 Node       90%                  1.71µs         0.12µs             1.75µs

Fabric utilization exceeds 90% in all three scenarios. The mean latency is in the range of 1-2µs, and the absolute value of the P99 latency is in the range of 1-3µs. It is worth noting that the Spine and TOR switches account for around 0.5µs of the latency budget, the remainder being spread between the sending and receiving DPUs. Finally, the ratio of P99 latency to mean latency is 1.16, 1.57, and 1.02, respectively. The latency histograms for the three cases are shown in Figure 4 below:

Figure 4: Latency histograms for three traffic scenarios
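The tail-to-mean ratios quoted above follow directly from the table:

```python
# Recomputing the P99-to-mean latency ratios from the table above.
results = {
    "1024 * (Node to Node)":  {"mean_us": 1.84, "p99_us": 2.14},
    "1024 Node to 1024 Node": {"mean_us": 2.10, "p99_us": 3.30},
    "1024 Nodes to 1 Node":   {"mean_us": 1.71, "p99_us": 1.75},
}
for pattern, r in results.items():
    print(f"{pattern:>24}: P99/mean = {r['p99_us'] / r['mean_us']:.2f}")
# -> 1.16, 1.57, and 1.02 respectively
```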


2 Recall that in a scale-out data center, the reliability of services rests squarely on the reliability of the network used to connect servers to each other.


Comparison with RoCEv2 Under Congestion

We compared the performance of TrueFabric with RoCEv2 under conditions of 10:1 incast congestion. The two configurations were identical except that one used Mellanox ConnectX-5 NICs in the ten servers and the other used Fungible DPUs. Figure 5 shows the setup:

Figure 5: Setup to compare 10:1 incast performance of TrueFabric vs. RoCEv2


The first set of measurements shows the variation over time of the instantaneous bandwidth allocated to each sender under 10:1 incast, for TrueFabric versus RoCEv2; see Figure 6 below. The bandwidths delivered to the senders by TrueFabric are very nearly equal to one another and stable over time. For RoCEv2, the distribution across senders and across time is highly variable.
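One common way to quantify how “nearly equal” the per-sender bandwidths are is Jain's fairness index; the sample shares below are illustrative placeholders rather than the measured data.

```python
# Jain's fairness index: 1.0 means perfectly equal shares; lower means skew.
# The bandwidth samples are illustrative placeholders, not measured results.
def jains_index(shares):
    n = len(shares)
    return sum(shares) ** 2 / (n * sum(x * x for x in shares))

even_shares = [10.0] * 10                               # ~100GE split evenly across 10 senders (Gbps)
skewed_shares = [55, 20, 10, 5, 3, 3, 2, 1, 0.5, 0.5]   # a highly uneven split

print(f"even split  : {jains_index(even_shares):.3f}")    # 1.000
print(f"uneven split: {jains_index(skewed_shares):.3f}")  # well below 1
```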


Figure 6: Bandwidth variation over time for TrueFabric


An alternative way to view the same data is to measure the P99 tail latency for TrueFabric and RoCEv2 under the 10:1 incast scenario. The end-to-end application latency per flow was 987µs for TrueFabric and 16,302µs for RoCEv2, roughly 16.5X higher. Figure 7 shows the data as a histogram.

Figure 7: Comparison of P99 tail latency between TrueFabric and RoCEv2
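For reference, the sketch below shows how a P99 figure is read off a set of per-flow completion times and checks the ratio quoted above; the sample flow times are synthetic, not the measured dataset.

```python
# How a P99 tail value is read off a set of per-flow completion times (FCTs),
# plus a check of the ratio quoted above. The flow times are synthetic.
import math

def p99(samples):
    ordered = sorted(samples)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)   # nearest-rank 99th percentile
    return ordered[rank]

fcts_us = [990.0] * 990 + [1200, 1500, 2000, 3000, 5000,
                           8000, 10000, 12000, 14000, 16302]   # a long tail on a few flows
print(f"P99 flow completion time : {p99(fcts_us)} µs")
print(f"RoCEv2 vs TrueFabric P99 : {16302 / 987:.1f}x")
```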

Conclusion

In scale-out data centers, the network that connects servers to each other is the key to building the most general and powerful computing facilities at a given cost. This network needs a set of properties that are fundamentally important in achieving the goals of high performance, excellent economics, high reliability, and strong security. In this paper we have defined these properties precisely and explained how each contributes to achieving the high-level goals.

Neither the current workhorse, TCP/IP over Ethernet, nor the more niche technologies of InfiniBand, Fibre Channel, and RoCEv2 are capable of providing all of the properties we have identified.

TrueFabric is the industry’s first focused attempt to deliver a single, unified network technology based on open standards that has all of the properties needed to build high performance, economical, reliable and secure data centers across a wide range of scales. As such, it represents a fundamental advance to the state of the art in data center networks.

Fungible, Inc. | 3201 Scott Blvd. | Santa Clara, CA 95054 | 669-292-5522
Copyright © 2020. All Rights Reserved. | www.fungible.com

WP0033.01.02020820
