Topic 15: Datacenter Design and Networking



Cloud Computing Workshop 2013, ITU

Transcript of Topic 15: Datacenter Design and Networking

Zubair Nabi

[email protected]

April 21, 2013

Outline

1 Datacenter Topologies

2 Transport Protocols

3 Network Sharing

4 Wrapping Up


Introduction

Datacenters are traditionally designed in the form of a 2/3-level tree

Switching elements become more specialized and faster as we go up the tree structure

A three-level tree has a core switch at the root, aggregation switches in the middle, and edge switches at the leaves of the tree

Edge switches have a large number of 1Gbps ports and a small number of 10Gbps ports

  - The 1Gbps ports connect end-hosts while the 10Gbps ports connect to aggregation switches

Aggregation and core switches have 10Gbps ports

Failures of switches higher up the tree can partition the network


Oversubscription

Ideal value of 1:1 – all hosts may potentially communicate with any other host at the full bandwidth of their interface

5:1 – only 20% of the host interface bandwidth is available (200Mbps for a 1Gbps interface)

Typical datacenter designs are oversubscribed by a factor of 2.5:1 (400Mbps) to 8:1 (125Mbps); the arithmetic is sketched below
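The per-host figures above are just the interface rate divided by the oversubscription ratio. A minimal sketch of that arithmetic, assuming 1Gbps host interfaces as in the edge-switch discussion above:

```python
def effective_bandwidth_mbps(interface_mbps: float, oversubscription: float) -> float:
    """Worst-case per-host bandwidth when all hosts transmit at once.

    `oversubscription` is the X in X:1 -- the factor by which the
    aggregate host bandwidth exceeds the uplink capacity.
    """
    return interface_mbps / oversubscription

for ratio in (1, 2.5, 5, 8):
    print(f"{ratio}:1 -> {effective_bandwidth_mbps(1000, ratio):.0f} Mbps per host")
# 1:1 -> 1000, 2.5:1 -> 400, 5:1 -> 200, 8:1 -> 125 Mbps per host
```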

Fat-tree Topology

A k-ary fat-tree has k pods

Each pod contains two layers of k/2 switches

Each k-port switch in the lower layer is directly connected to k/2 hosts

Each of the remaining k/2 ports is connected to k/2 of the k ports of the aggregation switches

There are (k/2)² core switches

Each core switch has one port connected to each of the k pods

The ith port of any core switch is connected to pod i

A k-ary fat-tree supports k³/4 hosts (the totals are derived in the sketch below)
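The counts above compose into the totals directly; a small sketch that derives the switch and host counts for a k-ary fat-tree (k must be even; function and field names are mine):

```python
def fat_tree_sizes(k: int) -> dict:
    """Element counts for a k-ary fat-tree built from k-port switches,
    following Al-Fares et al. (reference 1)."""
    assert k % 2 == 0, "the port count k must be even"
    edge = k * (k // 2)         # k pods x k/2 edge switches
    aggregation = k * (k // 2)  # k pods x k/2 aggregation switches
    core = (k // 2) ** 2
    hosts = k ** 3 // 4         # k pods x k/2 edge switches x k/2 hosts
    return {"edge": edge, "aggregation": aggregation, "core": core, "hosts": hosts}

print(fat_tree_sizes(4))   # {'edge': 8, 'aggregation': 8, 'core': 4, 'hosts': 16}
print(fat_tree_sizes(48))  # 48-port switches support 27648 hosts
```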


DCell

Uses a recursively defined structure to interconnect servers

Each server connects to different levels of DCells through multiple links

High-level DCells are built recursively from many low-level ones

Fault tolerant as there is no single point of failure

Structure

Uses servers with multiple network ports and mini-switches to construct its recursive structure

DCell_0 is the building block used to construct larger DCells

  - Consists of n servers and a mini-switch

High-level DCells are built recursively from many low-level ones

A DCell_1 is constructed from n+1 DCell_0s

The same construction applies to DCell_k (server counts are sketched below)
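The construction rule generalizes: a DCell_k is assembled from t_{k-1}+1 copies of DCell_{k-1}, where t_{k-1} is the server count of a DCell_{k-1} (so a DCell_1 uses n+1 DCell_0s, as above). A sketch of the resulting server count, following the recurrence from the DCell paper (reference 2):

```python
def dcell_servers(n: int, k: int) -> int:
    """Number of servers in a DCell_k whose DCell_0 building block
    holds n servers: t_0 = n and t_k = t_{k-1} * (t_{k-1} + 1)."""
    t = n  # a DCell_0 is n servers behind one mini-switch
    for _ in range(k):
        t = t * (t + 1)  # t_{k-1} + 1 DCell_{k-1}s, each with t_{k-1} servers
    return t

print(dcell_servers(4, 1))  # 20: five DCell_0s of 4 servers each
print(dcell_servers(4, 2))  # 420
print(dcell_servers(4, 3))  # 176820 -- the server count grows doubly exponentially
```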



TCP and UDP

TCP: Connection-oriented with reliability, ordering, and congestion control

UDP: Connectionless with no ordering, reliability, or congestion control

TCP and Datacenter Networks

Communication between different nodes is thought of as just opening a TCP connection between them

Common sockets API

But TCP was designed for a wide-area network

Clearly, a datacenter is not a wide-area network

Significantly different bandwidth-delay product, round-trip time (RTT), and retransmission timeout (RTO); compare the back-of-the-envelope sketch below

For example, due to the low RTT, the congestion window for each flow is very small

  - As a result, flow recovery through TCP fast retransmit is impossible (too few packets in flight to generate three duplicate ACKs), leading to poor net throughput
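To see why datacenter windows are so small, compare bandwidth-delay products; a back-of-the-envelope sketch (the link speeds and RTTs are illustrative assumptions, not measurements from the lecture):

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: the bytes in flight needed to fill the path."""
    return bandwidth_bps * rtt_s / 8

MSS = 1460  # bytes; a typical Ethernet TCP segment

wan = bdp_bytes(100e6, 0.080)  # illustrative WAN path: 100 Mbps, 80 ms RTT
dc = bdp_bytes(1e9, 100e-6)    # illustrative datacenter path: 1 Gbps, 100 us RTT

print(f"WAN BDP: {wan:9.0f} B (~{wan / MSS:.0f} segments)")  # ~685 segments
print(f"DC  BDP: {dc:9.0f} B (~{dc / MSS:.0f} segments)")    # ~9 segments
```

With only a handful of segments in flight per flow, a single loss rarely produces the three duplicate ACKs that fast retransmit needs, which is exactly the failure mode described above.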

More problems for TCP

In production datacenters, due to the widely-varying mix of applications, congestion in the network can last from tens to hundreds of seconds

In commodity switches the buffer pool is shared by all interfaces

  - If long flows hog the memory, queues can build up for the short flows

Many-to-one communication patterns can lead to TCP throughput collapse, or incast

  - This can cause overall application throughput to decrease by up to 90%

In virtualized environments, the time sharing of resources increases the latency faced by the VMs

  - This latency can be orders of magnitude higher than the RTT between hosts inside a datacenter, leading to slow progress of TCP connections

Reaction

Some large-scale deployments have abandoned TCP altogether

For instance, Facebook now uses a custom UDP transport

TCP might be a “kitchen-sink” solution, but it is sub-optimal in a datacenter environment

Over the years, a number of alternatives have been proposed

Datacenter TCP (DCTCP)

Uses Explicit Congestion Notification (ECN) from switches to perform active queue management-based congestion control

Switches set the Congestion Experienced flag in packets whenever the buffer occupancy exceeds a small threshold

DCTCP uses this information to reduce the size of the window in proportion to the fraction of marked packets (sketched below)

Enables it to react quickly to queue build-up and avoid buffer pressure
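A sketch of the sender-side arithmetic, following the DCTCP paper (Alizadeh et al., SIGCOMM 2010): the sender keeps a running estimate α of the fraction of ECN-marked packets and cuts its window by α/2, rather than the fixed 1/2 of standard TCP. Class and variable names here are mine:

```python
class DctcpWindow:
    """Sender-side DCTCP window arithmetic (illustrative sketch)."""

    def __init__(self, cwnd: float, g: float = 1 / 16):
        self.cwnd = cwnd  # congestion window, in segments
        self.alpha = 0.0  # estimate of the fraction of marked packets
        self.g = g        # EWMA gain for the alpha estimate

    def on_window_of_acks(self, acked: int, marked: int) -> None:
        """Called once per window of ACKs with the ECN-mark count."""
        frac = marked / acked
        self.alpha = (1 - self.g) * self.alpha + self.g * frac
        if marked:
            # Mild cut when few packets were marked; a full TCP-style
            # halving only when every packet carried a mark (alpha -> 1).
            self.cwnd *= 1 - self.alpha / 2

w = DctcpWindow(cwnd=100)
w.on_window_of_acks(acked=100, marked=10)   # 10% of the window marked
print(round(w.cwnd, 1), round(w.alpha, 4))  # 99.7 0.0063
```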

Multipath TCP (MPTCP)

Establishes multiple subflows over different paths between a pair of end-hosts

These subflows operate under a single TCP connection

The fraction of the total congestion window allocated to each subflow is determined by its speed (a toy illustration follows below)

Moves traffic away from the most congested paths
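A toy illustration of that allocation rule (this is not MPTCP’s actual coupled congestion controller, which adjusts subflows per-ACK with linked increases; it only shows the direction traffic moves):

```python
def split_window(total_cwnd: float, subflow_speeds_mbps: list[float]) -> list[float]:
    """Give each subflow a share of the connection's congestion window
    proportional to its path's observed speed, so slower (more
    congested) paths carry less of the traffic."""
    total = sum(subflow_speeds_mbps)
    return [total_cwnd * s / total for s in subflow_speeds_mbps]

print(split_window(100, [500, 500]))  # [50.0, 50.0] -- two equal paths
print(split_window(100, [500, 100]))  # roughly [83.3, 16.7] -- traffic shifts off the congested path
```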

tcpcrypt

Backwards-compatible enhancement to TCP that aims to efficiently and transparently provide encrypted communication to applications

Uses a custom key exchange protocol that leverages the TCP options field

Like SSL, to reduce the cost of connection setup for short-lived flows, it enables cryptographic state from one TCP connection to bootstrap subsequent ones

Applications can also be made aware of the presence of tcpcrypt to avoid redundant encryption

Deadline-Driven Delivery (D3)

Targets applications with distributed workflows and latency targets

Such applications associate a deadline with each network flow, and the flow is only useful if the deadline is met

Applications expose flow deadline and size information, which end-hosts exploit to request rates from routers along the data path (the rate arithmetic is sketched below)
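The requested rate follows directly from the exposed information: a flow with s bytes remaining and a deadline d seconds away needs at least r = s/d. A minimal sketch (the greedy rate request from the D3 paper; the function name is mine):

```python
def requested_rate_mbps(remaining_bytes: int, deadline_s: float) -> float:
    """Minimum sending rate that still meets the deadline: r = s / d."""
    if deadline_s <= 0:
        raise ValueError("deadline has passed; the flow is no longer useful")
    return remaining_bytes * 8 / deadline_s / 1e6

# A 2 MB partition-aggregate response due in 30 ms:
print(f"{requested_rate_mbps(2_000_000, 0.030):.0f} Mbps")  # ~533 Mbps
```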


Introduction

Network resources are shared amongst the tenants, which can lead to contention and other undesired behaviour

Network performance isolation between tenants can be an important tool for:

  - Minimizing disruption from legitimate tenants that run network-intensive workloads

  - Protecting against malicious tenants that launch DoS attacks

The standard methodology to ensure isolation is to use VLANs

Virtual LAN

Acts like an ordinary LAN, but end-hosts do not necessarily have to be physically connected to the same segment

Nodes are grouped together by the VLAN

Broadcasts can also be sent within the same VLAN

VLAN membership information is inserted into Ethernet frames (the 802.1Q tag layout is sketched below)
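The tagging referred to in the last bullet is IEEE 802.1Q: a 4-byte tag (protocol ID 0x8100, a 3-bit priority, a drop-eligible bit, and a 12-bit VLAN ID) inserted into the Ethernet header. A sketch of the tag layout:

```python
import struct

def dot1q_tag(vlan_id: int, priority: int = 0, dei: int = 0) -> bytes:
    """Build the 4-byte IEEE 802.1Q tag carried in a VLAN-tagged frame."""
    assert 0 <= vlan_id < 4096, "the VLAN ID is a 12-bit field"
    tpid = 0x8100  # tag protocol identifier marking a tagged frame
    tci = (priority << 13) | (dei << 12) | vlan_id  # tag control information
    return struct.pack("!HH", tpid, tci)

print(dot1q_tag(42).hex())  # 8100002a -- the frame itself carries its VLAN membership
```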

Rate-limiting End-hosts

In Xen, the network bandwidth available to each domU can be rate-limited

Can be used to implement basic QoS

The virtual interface is simply rate-limited (a configuration sketch follows)
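In Xen this is a one-line setting on the guest’s virtual interface via the vif rate= option; a sketch of a domU configuration excerpt (the MAC address, bridge name, and rate are illustrative):

```python
# Excerpt from a Xen domU config file (xm-style configs use Python syntax).
# rate= caps the transmit bandwidth of this guest's virtual interface.
vif = ['mac=00:16:3e:00:00:01, bridge=xenbr0, rate=100Mb/s']
```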


The End

In reverse order:

1 Cloud stacks can be used to turn clusters and datacenters into private and public clouds

2 Virtualization of computation, storage, and networking can allow many tenants to co-exist

3 Most data does not fit the relational model and is more suited for NoSQL stores

4 Data-intensive, task-parallel frameworks abstract away the details of distribution, work allocation, synchronization, concurrency, and communication; a perfect match for the cloud

5 The future is Big Data and Cloud Computing!

References

1 Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. 2008. A scalable, commodity data center network architecture. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication (SIGCOMM ’08). ACM, New York, NY, USA, 63-74.

2 Chuanxiong Guo, Haitao Wu, Kun Tan, Lei Shi, Yongguang Zhang, and Songwu Lu. 2008. DCell: a scalable and fault-tolerant network structure for data centers. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication (SIGCOMM ’08). ACM, New York, NY, USA, 75-86.